Most reviews of how the major models handle Arabic miss the mistakes entirely, because the reviewers can't read the output closely enough to see them. A Khaleeji reader catches Egyptian markers in the first paragraph: yashaghal instead of yashtaghal, delwa'ti instead of alheen, the wrong politeness style on a customer reply. None of it is subtle to the right reader. This piece reads the public community discussion (Hugging Face's Arabic-NLP discussion, the LMArena dialect-specific reports, the developer forums each lab maintains) for what each frontier model does on Saudi-market Arabic work.
The five models covered are Claude Opus 4.7, GPT-5, Gemini 3.x, Qwen 3 235B MoE, and Llama 4 Maverick. The workload axes are Modern Standard Arabic (MSA), the three major regional registers (Khaleeji, Egyptian, Levantine), and code-switched Arabic-English customer email. Maghrebi Arabic (Moroccan, Algerian, Tunisian) is a separate problem and is out of scope; coverage there is weaker across all five models.
The qualitative scoreboard
The table below is a qualitative summary, not a numeric leaderboard, because the workloads under it don't have public numeric benchmarks. The grades reflect the consensus of the public community discussion: the labs' own forums, the Arabic-NLP corners of Hugging Face, and developer reports from the Saudi and Gulf tech community.
| Workload | Claude Opus 4.7 | GPT-5 | Gemini 3.x | Qwen 3 235B | Llama 4 Maverick |
|---|---|---|---|---|---|
| EN → Khaleeji marketing | Strong | Egyptian drift | MSA drift | Strong | Weak |
| MSA business reply | Strong | Good, cold tone | Good | Good | OK |
| Style-imitation poetry (e.g. Darwish) | OK | Translated feel | OK | Best attempt | Weak |
| Labor-law summary (MSA) | Strong, verify | OK | Strong | OK | Weak |
| Egyptian → English | Strong | Strong | Strong | Strong | OK |
| Code-switched email | Strong | Over-translates | Under-translates | OK | Weak |
Khaleeji is where the models split
The most diagnostic workload is English-to-Khaleeji marketing copy for a young Saudi audience. The community discussion is consistent that Claude produces output that reads like a Saudi copywriter wrote it: Khaleeji-coded vocabulary, product names left in Latin script (which is how Saudi users write them), sentence rhythm that matches the dialect, with only light editing needed before shipping. Qwen 3 produces something close to as good, with a slight Levantine tone that creeps in during longer outputs (the Qwen training data is weighted more toward Levantine sources than Khaleeji).
Gemini's pattern is MSA drift. When the model is uncertain about a dialect choice, it falls back to a more formal MSA register, so the output comes out technically correct but stylistically off. A Khaleeji reader notices immediately that the text was written by someone trying to sound Khaleeji rather than someone who is. GPT-5 drifts the other way, toward Egyptian: it defaults to Egyptian phrasings and vocabulary even when the prompt names a Gulf audience. The drift is subtle enough that you'd struggle to articulate it without exposure to both dialects, yet it's enough to break immersion for the intended reader. Llama 4 Maverick produces MSA with a few dialect words sprinkled in regardless of the prompt; the dialectal training has not landed yet on that lineage.
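The Egyptian drift described above can be partly screened for before a human reads the copy. The sketch below is illustrative only: the function name and the one-entry lexicon are assumptions made for this example, not anything the models or labs provide, and a usable marker list would have to come from a Khaleeji-fluent reviewer.

```python
import re

# Hypothetical, deliberately tiny lexicon: Egyptian marker -> Khaleeji alternative.
# A real list would be written and maintained by a Khaleeji-fluent reviewer.
EGYPTIAN_MARKERS = {
    "دلوقتي": "الحين",  # "now": Egyptian delwa'ti vs Khaleeji alheen
}

# Arabic diacritics (tashkeel) and tatweel, stripped before matching.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def find_egyptian_markers(text: str) -> list[tuple[str, str]]:
    """Return (marker, suggested Khaleeji alternative) pairs found in text."""
    # Normalise final alef maqsura to yeh so both spellings of a marker match.
    cleaned = _DIACRITICS.sub("", text).replace("ى", "ي")
    # Split into runs of Arabic letters; Latin words and digits are ignored.
    tokens = re.findall(r"[\u0621-\u064A]+", cleaned)
    return [(t, EGYPTIAN_MARKERS[t]) for t in tokens if t in EGYPTIAN_MARKERS]
```

A hit is a reason for a reviewer to look closer, not a verdict. Exact-token matching misses markers that carry a prefix, and it says nothing about rhythm or politeness style.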
MSA is mostly a solved problem (the differences are taste)
The MSA workloads are the closest race. Every model in this range produces competent Modern Standard Arabic. The differences are in tone, sentence rhythm, and the small word choices that mark whether text was written by someone steeped in the language or by a model approximating it.
For a Saudi-business support reply, Claude lands the politeness register and the formulaic opening and close without the over-formality some models default to. Qwen and Gemini produce strong MSA with minor stylistic awkwardness: an old-fashioned phrasing here, an unusual word choice there. Nothing embarrassing. GPT-5's MSA is technically correct but tonally cold, and the voice reads translated rather than native. English structural shapes show through underneath it, and the seams are visible to a careful Arabic reader.
Every one of these models knows the words. What separates them is tone, rhythm, and local idiom, which is the part no leaderboard bothers to measure.
Code-switching is the hardest test
Mixed Arabic-English customer email is the toughest single workload for every model. Saudi internet writing routinely drops English brand names and technical terms, and sometimes whole English sentences, into otherwise-Arabic prose. A reply that matches that style works; switching to formal MSA reads as tone-deaf.
Claude handles this best in the community reports. Its reply code-switches naturally, leaving technical terms in English where translating them would feel artificial and reaching for Arabic on the emotional and relationship parts. The other models tend to go to one extreme or the other: either everything gets pushed into Arabic, technical terms included, or the reply stays mostly English with a few Arabic phrases dropped in as decoration. Neither matches how Saudi users actually write.
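The two failure modes above, over-translating and under-translating, both show up as a shift in how much of the reply is written in Latin script. A rough way to flag them, sketched here with hypothetical function names and an arbitrary tolerance, is to compare the Latin-script share of the customer's email with that of the draft reply.

```python
import re

ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")  # runs of Arabic letters
LATIN_WORD = re.compile(r"[A-Za-z]+")          # runs of basic Latin letters


def latin_share(text: str) -> float:
    """Fraction of word tokens written in Latin script (0.0 if no words)."""
    arabic = len(ARABIC_WORD.findall(text))
    latin = len(LATIN_WORD.findall(text))
    total = arabic + latin
    return latin / total if total else 0.0


def mix_mismatch(customer_email: str, reply: str, tolerance: float = 0.25) -> bool:
    """True when the reply's Latin share is far from the customer's.

    The 0.25 tolerance is an assumption for illustration, not a tuned value.
    """
    return abs(latin_share(customer_email) - latin_share(reply)) > tolerance
```

This only catches the extremes. It cannot tell whether the right terms stayed in English, which is still a judgment for a reader who writes the way Saudi users do.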
| Register | Pick | Note |
|---|---|---|
| MSA | Claude | Closest race overall |
| Khaleeji | Claude | Clearest gap to GPT-5 |
| Egyptian | GPT-5 | Training-data preference |
| Levantine | Qwen | Edge on this register |
| Code-switching | Claude | Best on Saudi mix |
| Poetry | Qwen | Best style-imitation attempt |

MSA, Khaleeji, Egyptian, Levantine. The register decides the model.
Claude for Khaleeji and MSA. Qwen for Levantine. GPT-5 for Egyptian.
In the prompt, name the city, the age range, the tone. Defaults aren't enough.
Human review for customer-facing copy, always. Don't skip this step.
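The advice to name the city, the age range, and the tone can be made routine with a small prompt template. The following is a minimal sketch: the `AudienceBrief` class, its fields, and the instruction wording are assumptions for illustration, not a format any of the labs prescribes.

```python
from dataclasses import dataclass


@dataclass
class AudienceBrief:
    """Hypothetical container for the audience details a prompt should state."""
    city: str
    age_range: str
    tone: str
    dialect: str = "Khaleeji"


def build_instruction(brief: AudienceBrief, task: str) -> str:
    """Compose a plain-text instruction that spells out the audience explicitly."""
    return (
        f"{task}\n"
        f"Audience: readers in {brief.city}, aged {brief.age_range}.\n"
        f"Register: {brief.dialect}. Avoid drifting into other registers.\n"
        f"Tone: {brief.tone}.\n"
        "Keep product and brand names in Latin script."
    )
```

For example, `build_instruction(AudienceBrief("Riyadh", "18-25", "playful"), "Write a launch announcement.")` yields a five-line instruction that leaves none of the audience details to the model's defaults.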
Where no model is yet trustworthy
A few categories give every frontier model trouble, and deploying any of them there without human review is a mistake.
Legal text. MSA legal summaries are good enough to draft from but not to publish. Specific terms carry specific meanings; misremembering an article number or substituting a near-synonym changes the legal implication. Don't deploy any of these models for Arabic legal work without a qualified human reviewer.
Classical Arabic. None of the frontier models is as fluent in pre-modern Classical Arabic as it is in MSA. Quotes from medieval texts, exegesis of religious texts, anything in the classical style: expect significant errors and budget for expert review.
Specific regional dialects. Khaleeji is itself a family of dialects. Najdi differs from Hijazi differs from Qatari differs from Bahraini. No model distinguishes between them at the level a native does. For text that specifically needs Hijazi or Bahraini coloring, the models won't capture it without significant prompting and editing.
The pick for production
For Arabic work in early 2026, the default is Claude Opus 4.7. The model handles MSA, Khaleeji, and code-switching better than the alternatives, and its sensitivity to tone is usually what decides whether the text ships or goes back for a rewrite. The pricing on Opus is covered in the Opus review; the model has the headroom for the careful tonal work this kind of content needs.
Qwen 3 235B is the strong second pick. It's the right call when license clarity matters (Apache 2.0) and your audience speaks one of the dialects in its training mix. For Levantine work, Qwen edges Claude in the public reports; for Khaleeji, Claude keeps the lead. The open-weight tier piece walks through where Qwen fits in the broader open-weight picture, and the small-models piece covers when to drop to a smaller model.
Gemini is fine for general MSA and slips on dialect. GPT-5 produces Egyptian-flavored output even when prompted for Saudi: don't use it for Khaleeji audiences. Llama 4 Maverick is not yet competitive for serious Arabic work, despite improvements over Llama 3. The multimodal piece covers Gemini's separate Arabic strength on document images.
For serious Saudi-market work, the pattern that holds up is to draft the customer-facing copy with Claude, run a Khaleeji-fluent human reviewer over the output, and budget for a heavier editing pass than English would need. The gap from the best model to a native writer is real, but it has closed enough that the workflow now beats translating from scratch. Anthropic and Alibaba have both visibly invested in Arabic, and they're positioning for a half-billion-person market that their less-invested rivals are content to leave on the table.