Most reviews of how the major models handle Arabic miss the mistakes entirely. The reviewers can't read the output closely enough to see them.
A Khaleeji reader catches the Egyptian markers in the first paragraph. Yashaghal instead of yashtaghal. Delwa'ti instead of alheen. The wrong politeness style on a customer reply. These aren't subtle errors. They're immediate. The scoring below is by someone who can read for them. Five models tested in mid-March 2026 across Modern Standard Arabic, Saudi (Khaleeji), Egyptian, and Levantine. Six tasks per model.
The models tested: Claude Opus 4.7, GPT-5, Gemini 3.1 Pro Preview, Qwen 3 235B MoE, and Llama 4 Maverick. Six tasks across three styles: translate a 600-word English landing page into Saudi-dialect Arabic that reads naturally to a young Gulf audience; draft a customer-support reply in MSA appropriate to professional Saudi business correspondence; write a short MSA poem in the style of a specific 20th-century poet; summarize a clause from labor law in MSA; translate an informal Egyptian movie clip transcript into English; respond appropriately to a code-switched Arabic-English customer email.
The scoreboard
| Task | Claude | GPT-5 | Gemini 3 | Qwen 3 | Llama 4 |
|---|---|---|---|---|---|
| EN → Khaleeji marketing | 4 | 3 | 3 | 4 | 2 |
| MSA support reply | 5 | 4 | 4 | 4 | 3 |
| MSA poem (specific style) | 3 | 2 | 3 | 4 | 1 |
| Labor law summary | 4 | 3 | 4 | 3 | 2 |
| Egyptian → English | 4 | 4 | 4 | 4 | 3 |
| Code-switching response | 4 | 3 | 3 | 3 | 2 |
| Total | 24 | 19 | 21 | 22 | 13 |
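The totals row can be re-derived from the per-task scores. A minimal Python sketch, with the scores copied from the table above; the variable names are illustrative and not part of the original test setup:

```python
# Per-task scores copied from the scoreboard, in row order:
# Khaleeji marketing, MSA support reply, MSA poem, labor law summary,
# Egyptian -> English, code-switching response.
scores = {
    "Claude":   [4, 5, 3, 4, 4, 4],
    "GPT-5":    [3, 4, 2, 3, 4, 3],
    "Gemini 3": [3, 4, 3, 4, 4, 3],
    "Qwen 3":   [4, 4, 4, 3, 4, 3],
    "Llama 4":  [2, 3, 1, 2, 3, 2],
}

# Sum each model's six task scores, then rank models by total, highest first.
totals = {model: sum(task_scores) for model, task_scores in scores.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

for model in ranking:
    print(f"{model}: {totals[model]}")
```

Ranked by total, the order is Claude, Qwen 3, Gemini 3, GPT-5, Llama 4.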
Claude Opus 4.7 won, narrowly, with Qwen 3 in close second. The two outliers at the bottom — Llama 4 Maverick and GPT-5 — show how much variance still exists. GPT-5 isn't bad at Arabic. It's just worse than its English performance would lead you to expect, and clearly worse than Claude on the tasks that matter for Saudi business work.
Worth flagging up front: I read MSA fluently and Khaleeji fluently. I can score Egyptian and Levantine output but I'm not a native speaker of either dialect, so my Egyptian and Levantine assessments are weaker signals than my Khaleeji ones. Maghrebi Arabic — Moroccan, Algerian, Tunisian — I can't read closely enough to score, so it's not in this piece at all.
Khaleeji is where the models split
The most diagnostic test was the English-to-Khaleeji marketing translation. The prompt named the dialect, the target audience (a young Saudi gamer in Riyadh), and the tone (natural conversational marketing). The same 600-word English landing page went to all five models.
Claude's output read like something a Saudi copywriter would have produced. The vocabulary was Khaleeji-coded throughout. Product names stayed in Latin script. That's how Saudi users write them. The sentence rhythm matched the dialect. Two small word choices needed editing. The rest was shippable.
Qwen 3 produced something nearly as good, with a slight Levantine tone that crept in during the second half. Likely because the Qwen Arabic training data is weighted more toward Levantine sources than Khaleeji. Still good enough to ship after editing.
Gemini 3.1 Pro Preview produced output that was technically correct as MSA-leaning dialectal Arabic, but the style kept slipping back toward MSA when the model hit uncertainty. A Khaleeji reader would notice immediately that this was written by someone trying to sound Khaleeji rather than someone who naturally is.
GPT-5 produced output with Egyptian dialect markers throughout. Words and phrasings that are technically correct but Egyptian-coded in a way that immediately marks the text as non-local to a Saudi reader. The kind of mistake that doesn't break comprehension but breaks immersion.
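A crude first-pass check for this failure mode is a word-list scan that runs before the human review. The sketch below is only an illustration: the function name and the marker list are hypothetical, and only the delwa'ti/alheen pair comes from the introduction. The second pair is an added assumption. A real list would need a Khaleeji-fluent editor to build it.

```python
import re

# Illustrative marker list: Egyptian-coded word -> Khaleeji alternative.
EGYPTIAN_MARKERS = {
    "دلوقتي": "الحين",  # "now": delwa'ti vs. alheen, the pair named in the article
    "عايز": "أبغى",     # "I want": added here as an assumption, not from the article
}

# Arabic short-vowel marks (U+064B..U+0652) and tatweel (U+0640),
# stripped so that vocalized text still matches the plain spellings above.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

def flag_egyptian_markers(text):
    """Return (marker, suggested Khaleeji word) pairs found in text."""
    cleaned = _DIACRITICS.sub("", text)
    tokens = re.findall(r"\w+", cleaned)
    return [(tok, EGYPTIAN_MARKERS[tok]) for tok in tokens if tok in EGYPTIAN_MARKERS]

print(flag_egyptian_markers("المنتج متوفر دلوقتي"))
```

This matches whole words only, so a marker with an attached prefix slips through. It cannot judge tone or rhythm either. It narrows what the reviewer has to look for, nothing more.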
Llama 4 Maverick produced MSA with a few dialect words sprinkled in, despite the explicit Khaleeji instruction. That's a model that hasn't been trained well on Saudi dialectal data.
I expected Qwen 3 to win Khaleeji translation. It didn't. The two tied at 4 on the scoreboard, but Claude's output needed fewer edits before shipping, and the gap on the support-reply task was wider than the model-size gap should produce. Whatever Anthropic is doing for Arabic training, it's landing on tone in a way the Alibaba data doesn't.
MSA is mostly a solved problem
The Modern Standard Arabic tasks were the closest race. All five models can produce competent MSA. The differences show up in tone, sentence rhythm, and the small word choices that mark whether the text was written by someone steeped in the language or by a model approximating it.
Claude wrote the customer support reply in the style a careful Saudi business correspondent would use. The polite formulas were correct, the opening was appropriate, the close was expected, and the over-formality some models default to was absent.
Qwen 3 and Gemini both produced strong MSA with minor stylistic awkwardness. A slightly old-fashioned phrasing here, an unusual word choice there. Nothing embarrassing.
GPT-5's MSA reply was technically correct but cold. The voice felt translated rather than native. The sentences had English structural shapes underneath, and the seams were visible to a careful reader. This is the kind of issue that's hard to articulate without enough exposure to professional MSA, and immediately obvious to readers who have it.
The gap between models on Arabic isn't about whether they know the words. It's about whether they know the tone, the rhythm, and the local idiom — exactly the part the leaderboards don't measure.
Make of that what you will.
Poetry is still hard
Each model was asked to write a short MSA poem in the style of Mahmoud Darwish. A famously specific voice with strong imagery, slightly mournful tone, and a particular line-rhythm. This test exposed the most differences.
Qwen 3 produced the best attempt. The imagery was roughly right. The line rhythm was credible. The sentiment landed in the right neighborhood. It wasn't Darwish, but it was a recognizable attempt at being Darwish-like. The surprise winner of this test.
Claude and Gemini produced competent attempts that read more like generic MSA poetry than Darwish-flavored MSA poetry. Both knew that the imagery should be Darwish-coded. Neither captured the rhythm. GPT-5 produced something that looked like poetry but felt translated, as if written in English and converted. The imagery was off. The line breaks landed in the wrong places. Llama 4 Maverick produced something that wasn't recognizably in the style of any specific poet and had grammatical errors that shouldn't appear this far into the model's release cycle.
Code-switching is still the hardest test
The mixed Arabic-English test, a customer email that flips between languages mid-sentence the way Saudi customers actually write, was the toughest task for every model. Saudi internet writing routinely uses English brand names, English technical terms, and entire English sentences embedded in otherwise Arabic prose. The right response matches that style. Switching to formal MSA in reply reads as tone-deaf.
Claude handled this best. The reply code-switched naturally, kept technical terms in English where translating them would have felt artificial, and used Arabic for the emotional and relationship-building parts of the reply. The most idiomatically Saudi of the five outputs. The other models either over-translated (everything in Arabic, including the technical terms that should have stayed in English) or under-translated (mostly English with a few Arabic phrases as decoration). Neither matches how Saudi users actually communicate.
| Category | Best pick | Note |
|---|---|---|
| MSA | Claude | Closest race, all decent |
| Khaleeji | Claude | Clearest gap to GPT-5 |
| Egyptian | GPT-5 | Trained-data preference |
| Levantine | Qwen | Edge on this dialect |
| Code-switching | Claude | Best at Saudi mix |
| Poetry | Qwen | Surprise winner, Darwish style |

Email, document, or speech transcription.
Identify MSA, Khaleeji, Egyptian, or Levantine.
Claude for Khaleeji and MSA, Qwen for Levantine.
For customer-facing copy. Don't skip this.
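The four steps above (take the input, identify the dialect, pick a model, review customer-facing copy) can be sketched as a small routing function. This is a minimal sketch under stated assumptions. The function name `route` and the dictionary are hypothetical, the model strings are plain labels rather than real API identifiers, and dialect identification is assumed to happen upstream.

```python
# Dialect -> model picks taken from the summary above. The strings are
# plain labels for this sketch, not real API model identifiers.
BEST_PICK = {
    "msa": "Claude",
    "khaleeji": "Claude",
    "egyptian": "GPT-5",
    "levantine": "Qwen",
}

def route(dialect, customer_facing):
    """Return the model pick and whether a human review pass is required."""
    key = dialect.strip().lower()
    if key not in BEST_PICK:
        # Maghrebi and other varieties were not scored, so there is no pick.
        raise ValueError(f"no tested pick for dialect: {dialect!r}")
    # Customer-facing copy always gets a human review pass.
    return {"model": BEST_PICK[key], "human_review": bool(customer_facing)}

print(route("Khaleeji", customer_facing=True))
```

Raising on an unscored dialect is a deliberate choice: the test gives no basis for a default pick, so a silent fallback would overstate what was measured.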
Where no model is yet trustworthy
Three categories where every model in this test had problems and where deployment needs human review.
Legal text. The labor law summary was good enough to draft but not good enough to publish. Specific terms have specific meanings. Misremembering an article number or substituting a near-synonym can change the legal implication. Don't deploy any of these models for Arabic legal work without a qualified human reviewer.
Classical Arabic. None of the models is fluent in pre-modern Classical Arabic the way they are in MSA. Quotes from medieval texts, exegesis of religious texts, anything in the classical style — expect significant errors and budget for expert review.
Specific regional dialects. Khaleeji is itself a family of dialects. Najdi differs from Hijazi differs from Qatari differs from Bahraini. None of the models distinguishes between them at the level a native would. For text that specifically needs Hijazi or Bahraini coloring, the models won't capture it without significant prompting and editing.
For Arabic work in early 2026, go with Claude Opus 4.7. It handles MSA, Khaleeji, and code-switching better than the alternatives. The tone sensitivity is what makes the difference between text worth shipping and text that needs a rewrite. Qwen 3 235B is the strong second pick. The right call when license clarity matters and your output language is one of the major dialects in its training mix.
Gemini 3.1 Pro Preview is fine for general MSA work but slips on dialect. Skip GPT-5 if your audience is Gulf or Khaleeji — it has a pattern of producing Egyptian-flavored output. Llama 4 Maverick isn't yet competitive for serious Arabic work, despite improvements over Llama 3.
For serious Saudi-market work: use Claude for the customer-facing copy, run a Khaleeji-fluent human reviewer over the output, and budget for a heavier editing pass than English would need. The gap from the model to a native writer is real, but it has closed enough that the workflow is much faster than translating from scratch.
One broader point. This matters beyond cultural relevance. The Arabic-speaking market is half a billion people, and the AI models that serve it well will earn outsized commercial returns over the next five years. The labs that have invested in Arabic — Anthropic visibly, Alibaba through Qwen — are positioning for that future. The labs that haven't are leaving real ground to competitors who won't return it easily.