benchr Issue No. 07

AI for Arabic content: a working report on five models

How Modern Standard, Saudi, Egyptian, and Levantine Arabic come out the other side of Claude, GPT-5, Gemini 3, Qwen 3, and Llama 4 — with specific prompts, outputs, and the places every model still struggles.


Models tested: 5 (Claude, GPT-5, Gemini, Qwen, Llama)
Tasks per model: 6, across registers and dialects
Test date: mid-March 2026
Dialects covered: 4 (MSA, Khaleeji, Egyptian, Levantine)

Most reviews of how the major models handle Arabic miss the mistakes entirely. The reviewers can't read the output closely enough to see them.

A Khaleeji reader catches the Egyptian markers in the first paragraph. Yashaghal instead of yashtaghal. Delwa'ti instead of alheen. The wrong politeness style on a customer reply. These aren't subtle errors. They're immediate. The scoring below is by someone who can read for them. Five models tested in mid-March 2026 across Modern Standard Arabic, Saudi (Khaleeji), Egyptian, and Levantine. Six tasks per model.

The models tested: Claude Opus 4.7, GPT-5, Gemini 3.1 Pro Preview, Qwen 3 235B MoE, and Llama 4 Maverick. Six tasks across three styles: translate a 600-word English landing page into Saudi-dialect Arabic that reads naturally to a young Gulf audience; draft a customer-support reply in MSA appropriate to professional Saudi business correspondence; write a short MSA poem in the style of a specific 20th-century poet; summarize a clause from labor law in MSA; translate an informal Egyptian movie clip transcript into English; respond appropriately to a piece of Arabic-English code-switched customer email.

The scoreboard

Six Arabic tasks scored 1–5, blind human review, mid-March 2026
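Task | Claude | GPT-5 | Gemini 3 | Qwen 3 | Llama 4
EN → Khaleeji marketing | 4 | 3 | 3 | 4 | 2
MSA support reply | 5 | 4 | 4 | 4 | 3
MSA poem (specific style) | 3 | 2 | 3 | 4 | 1
Labor law summary | 4 | 3 | 4 | 3 | 2
Egyptian → English | 4 | 4 | 4 | 4 | 3
Code-switching response | 4 | 3 | 3 | 3 | 2
Total | 24 | 19 | 21 | 22 | 13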
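
The totals row is simple column arithmetic, and it can be re-derived from the per-task scores. The sketch below transcribes the scoreboard and sums each model's column; the names `MODELS`, `SCORES`, `totals`, and `ranking` are illustrative and not part of any tooling used for the review.

```python
# Per-task scores (1-5) transcribed from the scoreboard above.
# Column order: Claude, GPT-5, Gemini 3, Qwen 3, Llama 4.
MODELS = ["Claude", "GPT-5", "Gemini 3", "Qwen 3", "Llama 4"]

SCORES = {
    "EN -> Khaleeji marketing": [4, 3, 3, 4, 2],
    "MSA support reply": [5, 4, 4, 4, 3],
    "MSA poem (specific style)": [3, 2, 3, 4, 1],
    "Labor law summary": [4, 3, 4, 3, 2],
    "Egyptian -> English": [4, 4, 4, 4, 3],
    "Code-switching response": [4, 3, 3, 3, 2],
}


def totals(scores):
    """Sum each model's column across all tasks (maximum 30 for six tasks)."""
    return {
        model: sum(row[i] for row in scores.values())
        for i, model in enumerate(MODELS)
    }


def ranking(scores):
    """Return model names ordered from highest total to lowest."""
    t = totals(scores)
    return sorted(t, key=t.get, reverse=True)
```

Calling `totals(SCORES)` reproduces the 24, 19, 21, 22, 13 row, and `ranking(SCORES)` puts Claude first and Qwen 3 second.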

Claude Opus 4.7 won, narrowly, with Qwen 3 in close second. The two outliers at the bottom — Llama 4 Maverick and GPT-5 — show how much variance still exists. GPT-5 isn't bad at Arabic. It's just worse than its English performance would lead you to expect, and clearly worse than Claude on the tasks that matter for Saudi business work.

MSA support reply, score out of 10 (professional Saudi business correspondence; higher is better):

Claude Opus 4.7: 10/10
Qwen 3 235B: 8/10
GPT-5: 8/10
Gemini 3.1 Pro Preview: 8/10
Llama 4 Maverick: 6/10

Worth flagging up front: I read MSA fluently and Khaleeji fluently. I can score Egyptian and Levantine output but I'm not a native speaker of either dialect, so my Egyptian and Levantine assessments are weaker signals than my Khaleeji ones. Maghrebi Arabic — Moroccan, Algerian, Tunisian — I can't read closely enough to score, so it's not in this piece at all.

Khaleeji is where the models split

The most diagnostic test was the English-to-Khaleeji marketing translation. The prompt named the dialect, the target audience (a young Saudi gamer in Riyadh), and the tone (natural conversational marketing). The same 600-word English landing page went to all five models.
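
The exact prompt wording is not reproduced here, so the following is only a sketch of how a brief carrying those three elements (dialect, audience, tone) might be assembled. The class `TranslationBrief`, the function `build_prompt`, and the template sentences are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationBrief:
    """The three things the test prompt named explicitly."""
    dialect: str
    audience: str
    tone: str


def build_prompt(brief: TranslationBrief, source_text: str) -> str:
    """Assemble a translation prompt. The wording is illustrative only."""
    lines = [
        f"Translate the English landing page below into {brief.dialect} Arabic.",
        f"Target audience: {brief.audience}.",
        f"Tone: {brief.tone}.",
        "",
        source_text.strip(),
    ]
    return "\n".join(lines)


brief = TranslationBrief(
    dialect="Saudi (Khaleeji)",
    audience="a young Saudi gamer in Riyadh",
    tone="natural conversational marketing",
)
```

Keeping the brief as a small frozen object makes it easy to send an identical instruction to every model, which is the point of a like-for-like comparison.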

Claude's output read like something a Saudi copywriter would have produced. The vocabulary was Khaleeji-coded throughout. Product names stayed in Latin script. That's how Saudi users write them. The sentence rhythm matched the dialect. Two small word choices needed editing. The rest was shippable.

Qwen 3 produced something nearly as good, with a slight Levantine tone that crept in during the second half, likely because Qwen's Arabic training data is weighted more toward Levantine sources than Khaleeji. Still good enough to ship after editing.

Gemini 3.1 Pro Preview produced output that was technically correct as MSA-leaning dialect, but the style kept slipping back toward MSA when the model hit uncertainty. A Khaleeji reader would notice immediately that this was written by someone trying to sound Khaleeji rather than someone who naturally is.

GPT-5 produced output with Egyptian dialect markers throughout. Words and phrasings that are technically correct but Egyptian-coded in a way that immediately marks the text as non-local to a Saudi reader. The kind of mistake that doesn't break comprehension but breaks immersion.

Llama 4 Maverick produced MSA with a few dialect words sprinkled in, despite the explicit Khaleeji instruction. That's a model that hasn't been trained well on Saudi dialectal data.

Saudi (Khaleeji) translation, score out of 10 (natural conversational tone for a young Gulf audience):

Claude Opus 4.7: 8.5/10 (top score)
Qwen 3 235B: 7.8/10
Gemini 3.1 Pro Preview: 6.5/10
GPT-5: 5.5/10
Llama 4 Maverick: 4/10

I expected Qwen 3 to win Khaleeji translation. It didn't — Claude beat it by enough to be visible, and the gap on the support-reply task was wider than the model-size gap should produce. Whatever Anthropic is doing for Arabic training, it's landing on tone in a way the Alibaba data doesn't.

MSA is mostly a solved problem

The Modern Standard Arabic tasks were the closest race. All five models can produce competent MSA. The differences show up in tone, sentence rhythm, and the small word choices that mark whether the text was written by someone steeped in the language or by a model approximating it.

Claude wrote the customer support reply in the style a careful Saudi business correspondent would use. The polite formulas were correct, the opening was appropriate, the close was expected, and the over-formality some models default to was absent.

Qwen 3 and Gemini both produced strong MSA with minor stylistic awkwardness. A slightly old-fashioned phrasing here, an unusual word choice there. Nothing embarrassing.

GPT-5's MSA reply was technically correct but cold. The voice felt translated rather than native. The sentences had English structural shapes underneath, and the seams were visible to a careful reader. This is the kind of issue that's hard to articulate without enough exposure to professional MSA, and immediately obvious to readers who have it.

The gap between models on Arabic isn't about whether they know the words. It's about whether they know the tone, the rhythm, and the local idiom — exactly the part the leaderboards don't measure.

Make of that what you will.

Poetry is still hard

Each model was asked to write a short MSA poem in the style of Mahmoud Darwish, a famously specific voice with strong imagery, a slightly mournful tone, and a particular line rhythm. This test exposed the most differences.

Qwen 3 produced the best attempt. The imagery was roughly right. The line rhythm was credible. The sentiment landed in the right neighborhood. It wasn't Darwish, but it was a recognizable attempt at being Darwish-like. The surprise winner of this test.

Claude and Gemini produced competent attempts that read more like generic MSA poetry than Darwish-flavored MSA poetry. Both knew that the imagery should be Darwish-coded. Neither captured the rhythm. GPT-5 produced something that looked like poetry but felt translated, as if written in English and converted. The imagery was off. The line breaks landed in the wrong places. Llama 4 Maverick produced something that wasn't recognizably in the style of any specific poet and had grammatical errors that shouldn't appear this far into the model's release cycle.

Code-switching is still the hardest test

The mixed Arabic-English test — a piece of customer email that flips between languages mid-sentence the way Saudi customers actually write — was the toughest task for every model. Saudi internet writing routinely uses English brand names, English technical terms, and entire English sentences embedded in otherwise Arabic prose. The right response matches that style. Switching to formal MSA in reply reads as tone-deaf.

Claude handled this best. The reply code-switched naturally, kept technical terms in English where translating them would have felt artificial, and used Arabic for the emotional and relationship-building parts of the reply. The most idiomatically Saudi of the five outputs. The other models either over-translated (everything in Arabic, including the technical terms that should have stayed in English) or under-translated (mostly English with a few Arabic phrases as decoration). Neither matches how Saudi users actually communicate.
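
One rough way to screen for over- and under-translation before a human reads the reply is to compare the share of Arabic-script words in the reply with the share in the customer's email. This is a hypothetical heuristic, not how the outputs in this test were scored: the function names, the 0.25 tolerance, and the use of the basic Arabic Unicode block are all assumptions, and script counts say nothing about tone or idiom.

```python
import re

# Basic Arabic block (U+0600 to U+06FF) versus unaccented Latin letters.
ARABIC_WORD = re.compile(r"[\u0600-\u06FF]+")
LATIN_WORD = re.compile(r"[A-Za-z]+")


def arabic_share(text: str) -> float:
    """Fraction of words written in Arabic script (0.0 if no words found)."""
    arabic = len(ARABIC_WORD.findall(text))
    latin = len(LATIN_WORD.findall(text))
    total = arabic + latin
    return arabic / total if total else 0.0


def mix_mismatch(customer: str, reply: str, tolerance: float = 0.25) -> str:
    """Compare the reply's script mix with the customer's.

    Returns "over-translated" when the reply is much more Arabic than the
    email, "under-translated" when it is much more English, else "matched".
    """
    delta = arabic_share(reply) - arabic_share(customer)
    if delta > tolerance:
        return "over-translated"
    if delta < -tolerance:
        return "under-translated"
    return "matched"
```

A check like this only flags the crude failure modes described above. Whether the right terms stayed in English still needs a Saudi reader.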

MSA: Claude. Closest race, all decent.
Khaleeji: Claude. Clearest gap to GPT-5.
Egyptian: GPT-5. Trained-data preference.
Levantine: Qwen. Edge on this dialect.
Code-switching: Claude. Best at Saudi mix.
Poetry: Qwen. Surprise winner, Darwish style.

1. Arabic input. Email, document, or speech transcription.
2. Dialect detection. Identify MSA, Khaleeji, Egyptian, or Levantine.
3. Model selection. Claude for Khaleeji and MSA, Qwen for Levantine.
4. Native-speaker review. For customer-facing copy. Don't skip this.
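
The model-selection and review steps of that workflow can be sketched as a small routing table. This assumes the dialect label has already been produced upstream (the detection step is not implemented here), and `ROUTES`, `Plan`, and `plan_job` are hypothetical names. The workflow names no model for Egyptian, so that entry is taken from the per-dialect summary and is an assumption.

```python
from dataclasses import dataclass

# Dialect label -> model, following the workflow steps above.
ROUTES = {
    "msa": "Claude Opus 4.7",
    "khaleeji": "Claude Opus 4.7",
    "levantine": "Qwen 3 235B",
    "egyptian": "GPT-5",  # assumption: from the summary, not the workflow
}


@dataclass(frozen=True)
class Plan:
    model: str
    needs_native_review: bool


def plan_job(dialect: str, customer_facing: bool) -> Plan:
    """Pick a model for a detected dialect and flag native-speaker review."""
    key = dialect.strip().lower()
    if key not in ROUTES:
        # Dialects outside the four tested (e.g. Maghrebi) have no route.
        raise ValueError(f"no tested model for dialect: {dialect!r}")
    return Plan(model=ROUTES[key], needs_native_review=customer_facing)
```

Raising on an unknown dialect rather than falling back to a default keeps untested dialects from being routed silently to a model nobody has evaluated on them.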

Where no model is yet trustworthy

Three categories where every model in this test had problems and where deployment needs human review.

Legal text. The labor law summary was good enough to draft but not good enough to publish. Specific terms have specific meanings. Misremembering an article number or substituting a near-synonym can change the legal implication. Don't deploy any of these models for Arabic legal work without a qualified human reviewer.

Classical Arabic. None of the models is fluent in pre-modern Classical Arabic the way they are in MSA. Quotes from medieval texts, exegesis of religious texts, anything in the classical style — expect significant errors and budget for expert review.

Specific regional dialects. Khaleeji is itself a family of dialects. Najdi differs from Hijazi differs from Qatari differs from Bahraini. None of the models distinguishes between them at the level a native would. For text that specifically needs Hijazi or Bahraini coloring, the models won't capture it without significant prompting and editing.

For Arabic work in early 2026, go with Claude Opus 4.7. It handles MSA, Khaleeji, and code-switching better than the alternatives. The tone sensitivity is what makes the difference between text worth shipping and text that needs a rewrite. Qwen 3 235B is the strong second pick. The right call when license clarity matters and your output language is one of the major dialects in its training mix.

Gemini 3.1 Pro Preview is fine for general MSA work but slips on dialect. Skip GPT-5 if your audience is Gulf or Khaleeji — it has a pattern of producing Egyptian-flavored output. Llama 4 Maverick isn't yet competitive for serious Arabic work, despite improvements over Llama 3.

For serious Saudi-market work: use Claude for the customer-facing copy, run a Khaleeji-fluent human reviewer over the output, and budget for a heavier editing pass than English would need. The gap from the model to a native writer is real, but it has closed enough that the workflow is much faster than translating from scratch.

One broader point. This matters beyond cultural relevance. The Arabic-speaking market is half a billion people, and the AI models that serve it well will earn outsized commercial returns over the next five years. The labs that have invested in Arabic — Anthropic visibly, Alibaba through Qwen — are positioning for that future. The labs that haven't are leaving real ground to competitors who won't return it easily.

Bottom line

Claude Opus 4.7 is the right model for Saudi-market Arabic work. Qwen 3 235B is the right choice when license clarity matters and your dialect is one of the major ones in its training mix. GPT-5 produces Egyptian-flavored output even when prompted for Saudi. Llama 4 Maverick isn't yet competitive for serious Arabic. Always pair the model with a native-speaker editor for customer-facing copy.

Frequently asked

Which AI model is best for Arabic content?

Claude Opus 4.7. It handles Modern Standard Arabic, Khaleeji, and code-switched Arabic-English better than the alternatives. Tone sensitivity is the difference between text worth shipping and text that needs to be rewritten.

Can AI write in Saudi (Khaleeji) Arabic?

Claude Opus 4.7 produces 85-90% shippable Khaleeji on the first attempt with a Saudi audience prompt. Qwen 3 235B is close behind. GPT-5 tends to drift toward Egyptian phrasing. Gemini 3.1 Pro Preview slips back to MSA when uncertain.

Does Qwen 3 handle Arabic well?

Yes, especially with Apache 2.0 licensing for commercial use. Qwen 3 235B scores 22/30 across our six-task Arabic test set, slightly behind Claude but ahead of GPT-5 and Gemini. The Arabic training data is weighted more toward Levantine than Khaleeji.

How well does AI handle code-switched Arabic-English?

Mixed-language customer emails are the hardest test for every model. Claude handles them best — keeping technical terms in English where appropriate and using Arabic for relational content. Other models either over-translate or under-translate.

Can AI translate medieval Arabic or Classical Arabic?

Not reliably. None of the frontier models is fluent in pre-modern Classical Arabic the way they are in MSA. Expect significant errors and budget for expert review on quotes from medieval texts or religious exegesis.

Changelog

  • May 25, 2026: Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • May 22, 2026: Fixed Arabic transliterations in opening example.
  • May 4, 2026: Originally published.
