Essay · May 2026

AI for Arabic content: a working report on five models

How Claude, GPT-5, Gemini, Qwen, and Llama handle MSA and three regional Arabic dialects, and where every model still struggles.

By the benchr team · Updated May 30, 2026

Models compared: 5 (Claude, GPT-5, Gemini, Qwen, Llama)
Workload axes: 6 (across MSA and three dialects)
Top pick for Saudi work: Opus 4.7 (per public community reports)
Open-weight pick: Qwen 3 (Apache 2.0, strong MSA)

Most reviews of how the major models handle Arabic miss the mistakes entirely, because the reviewers can't read the output closely enough to see them. A Khaleeji reader catches Egyptian markers in the first paragraph: yashaghal instead of yashtaghal, delwa'ti instead of alheen, the wrong politeness style on a customer reply. None of it is subtle to the right reader. This piece reads the public community discussion (Hugging Face's Arabic-NLP discussion, the LMArena dialect-specific reports, the developer forums each lab maintains) for what each frontier model does on Saudi-market Arabic work.

The five models covered are Claude Opus 4.7, GPT-5, Gemini 3.x, Qwen 3 235B MoE, and Llama 4 Maverick. The six workloads span Modern Standard Arabic, the three major regional registers (Khaleeji, Egyptian, Levantine), and code-switched Arabic-English customer email. Maghrebi Arabic (Moroccan, Algerian, Tunisian) is a separate problem and is not in scope; coverage there is weaker across all five models.

The qualitative scoreboard

The table below is a qualitative summary, not a numeric leaderboard, because the workloads in it don't have public numeric benchmarks. The grades reflect the consensus of the public community discussion: the labs' own forums, the Arabic-NLP corners of Hugging Face, and developer reports from the Saudi and Gulf tech community.

How each frontier model handles Saudi-market Arabic workloads, May 2026
Workload	Claude Opus 4.7	GPT-5	Gemini 3.x	Qwen 3 235B	Llama 4 Maverick
EN → Khaleeji marketing	Strong	Egyptian drift	MSA drift	Strong	Weak
MSA business reply	Strong	Good, cold tone	Good	Good	OK
Style-imitation poetry (e.g. Darwish)	OK	Translated feel	OK	Best attempt	Weak
Labor-law summary (MSA)	Strong, verify	OK	Strong	OK	Weak
Egyptian → English	Strong	Strong	Strong	Strong	OK
Code-switched email	Strong	Over-translates	Under-translates	OK	Weak

Khaleeji is where the models split

The most diagnostic workload is English-to-Khaleeji marketing copy for a young Saudi audience. The community discussion is consistent that Claude produces output that reads like a Saudi copywriter wrote it: Khaleeji-coded vocabulary, product names left in Latin script (which is how Saudi users write them), sentence rhythm that matches the dialect, with only light editing needed before shipping. Qwen 3 produces something close to as good, with a slight Levantine tone that creeps in during longer outputs (the Qwen training data is weighted more toward Levantine sources than Khaleeji).

Gemini's pattern is MSA drift. When the model is uncertain about a dialect choice, it falls back to a more formal MSA register, so the output comes out technically correct but stylistically off. A Khaleeji reader notices immediately that the text was written by someone trying to sound Khaleeji rather than someone who is. GPT-5 drifts the other way, toward Egyptian: it defaults to Egyptian phrasings and vocabulary even when the prompt names a Gulf audience. The drift is subtle enough that you'd struggle to articulate it without exposure to both dialects, yet it's enough to break immersion for the intended reader. Llama 4 Maverick produces MSA with a few dialect words sprinkled in regardless of the prompt; the dialectal training has not landed yet on that lineage.
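The kind of drift described above can be partly screened for before a human reviewer sees the draft. The sketch below is a minimal illustration, not a dialect classifier: the function name and the marker lists are hypothetical, and the only markers included are the two words for "now" mentioned earlier in this piece (Egyptian delwa'ti, Khaleeji alheen). A usable list would have to be built by native speakers, and exact-word matching will miss spelling variants and attached prefixes.

```python
import re

# Illustrative marker lists. Only the two words for "now" come from the
# article (Egyptian delwa'ti, Khaleeji alheen); everything else about this
# lexicon is an assumption and would need native-speaker curation.
DIALECT_MARKERS = {
    "egyptian": {"دلوقتي"},  # delwa'ti
    "khaleeji": {"الحين"},   # alheen
}

_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # tashkeel and tatweel
_ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")      # runs of Arabic letters


def off_target_markers(text, target):
    """Return {dialect: [markers found]} for every dialect other than target."""
    # Strip diacritics first so vocalized and unvocalized spellings match.
    words = set(_ARABIC_WORD.findall(_DIACRITICS.sub("", text)))
    hits = {}
    for dialect, markers in DIALECT_MARKERS.items():
        if dialect == target:
            continue
        found = sorted(words & markers)
        if found:
            hits[dialect] = found
    return hits
```

An empty result only means none of the listed markers appeared. It says nothing about rhythm, politeness register, or MSA drift, which still need a Khaleeji-fluent reader.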

MSA is mostly a solved problem (the differences are taste)

The MSA workloads are the closest race. Every model in this range produces competent Modern Standard Arabic. The differences are in tone, sentence rhythm, and the small word choices that mark whether text was written by someone steeped in the language or by a model approximating it.

For a Saudi-business support reply, Claude lands the politeness register and the formulaic opening and close without the over-formality some models default to. Qwen and Gemini produce strong MSA with minor stylistic awkwardness: an old-fashioned phrasing here, an unusual word choice there. Nothing embarrassing. GPT-5's MSA is technically correct but tonally cold, and the voice reads translated rather than native. English sentence structures show through underneath it, and the seams are visible to a careful Arabic reader.

Every one of these models knows the words. What separates them is tone, rhythm, and local idiom, which is the part no leaderboard bothers to measure.

Where each model lands on Saudi-market work

Strength of fit, public-report consensus.

Claude Opus 4.7: Strong
Qwen 3 235B: Strong
Gemini 3.x
GPT-5 (Egyptian drift): Off-register
Llama 4 Maverick: Weak

Code-switching is the hardest test

Mixed Arabic-English customer email is the toughest single workload for every model. Saudi internet writing routinely drops English brand names and technical terms, and sometimes whole English sentences, into otherwise-Arabic prose. A reply that matches that style works; switching to formal MSA reads as tone-deaf.

Claude handles this best in the community reports. Its reply code-switches naturally, leaving technical terms in English where translating them would feel artificial and reaching for Arabic on the emotional and relationship parts. The other models tend to go to one extreme or the other: either everything gets pushed into Arabic, technical terms included, or the reply stays mostly English with a few Arabic phrases dropped in as decoration. Neither matches how Saudi users actually write.
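One rough way to spot the two extremes described above is to compare how much Latin-script text the draft reply contains against the customer's own email. The sketch below is an assumption-laden heuristic, not something taken from the community reports: the function names and the 0.25 tolerance are hypothetical, and a character count cannot tell whether the right terms were left in English.

```python
def latin_share(text):
    """Fraction of letters in text that are ASCII Latin rather than Arabic."""
    arabic = sum(1 for ch in text if "\u0621" <= ch <= "\u064A")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = arabic + latin
    return latin / total if total else 0.0


def classify_reply(customer_email, draft_reply, tolerance=0.25):
    """Compare the script mix of a draft reply with the customer's email.

    tolerance is a hypothetical threshold; tune it on real correspondence.
    """
    gap = latin_share(draft_reply) - latin_share(customer_email)
    if gap > tolerance:
        return "under-translated"  # reply is noticeably more English
    if gap < -tolerance:
        return "over-translated"   # reply pushed English terms into Arabic
    return "matched"
```

A "matched" result is a reason to pass the draft to a reviewer, not a replacement for one.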

MSA: Claude (closest race overall)
Khaleeji: Claude (clearest gap to GPT-5)
Egyptian: GPT-5 (training-data preference)
Levantine: Qwen (edge on this register)
Code-switching: Claude (best on Saudi mix)
Poetry: Qwen (best style-imitation attempt)

1. Identify the register. MSA, Khaleeji, Egyptian, Levantine. The register decides the model.
2. Pick the model. Claude for Khaleeji and MSA. Qwen for Levantine. GPT-5 for Egyptian.
3. Prompt with audience. Name the city, the age range, the tone. Defaults aren't enough.
4. Native-speaker review. For customer-facing copy, always. Don't skip this step.
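The four steps above can be sketched as a small planning helper. This is an illustration under stated assumptions: `Brief`, `plan`, and the prompt wording are hypothetical, the model strings are display names from this piece rather than API identifiers, and the register-to-model mapping simply restates step 2.

```python
from dataclasses import dataclass

# register -> (label used in the prompt, model pick from step 2).
# Model names are display names, not API model identifiers.
REGISTERS = {
    "msa": ("Modern Standard Arabic", "Claude Opus 4.7"),
    "khaleeji": ("Khaleeji Arabic", "Claude Opus 4.7"),
    "levantine": ("Levantine Arabic", "Qwen 3 235B"),
    "egyptian": ("Egyptian Arabic", "GPT-5"),
}


@dataclass
class Brief:
    register: str              # step 1: which register the copy needs
    city: str                  # step 3: audience details
    age_range: str
    tone: str
    customer_facing: bool = True


def plan(brief):
    """Turn a brief into a model pick, a prompt header, and a review flag."""
    key = brief.register.lower()
    if key not in REGISTERS:
        raise ValueError(f"unsupported register: {brief.register!r}")
    label, model = REGISTERS[key]
    prompt_header = (
        f"Write in {label} for readers in {brief.city}, "
        f"aged {brief.age_range}. Tone: {brief.tone}."
    )
    return {
        "model": model,
        "prompt_header": prompt_header,
        # Step 4: customer-facing copy always gets a native-speaker review.
        "needs_native_review": brief.customer_facing,
    }
```

For example, `plan(Brief("Khaleeji", "Riyadh", "18-30", "playful"))` picks Claude Opus 4.7 and flags the draft for native review; the city, age range, and tone here are placeholder values.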

Where no model is yet trustworthy

A few categories give every frontier model trouble, and deploying any of them there without human review is a mistake.

Legal text. MSA legal summaries are good enough to draft from but not to publish. Specific terms carry specific meanings; misremembering an article number or substituting a near-synonym changes the legal implication. Don't deploy any of these models for Arabic legal work without a qualified human reviewer.
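One narrow, mechanical check can support the human reviewer here: flag any article number that appears in a model's summary but not in the source text. The sketch below is a minimal illustration with hypothetical function names. It assumes article references are written as the word for "article" (المادة, or "Article" in English) followed by digits, so numbers spelled out as words are not handled, and it cannot detect a near-synonym substitution at all.

```python
import re

# Matches "المادة 77", "المادة ٧٧", or "Article 77". In Python 3, \d on str
# patterns also matches Arabic-Indic digits, and int() converts them.
_ARTICLE_REF = re.compile(r"(?:المادة|Article)\s+(\d+)")


def article_numbers(text):
    """Set of article numbers referenced with digits in text."""
    return {int(n) for n in _ARTICLE_REF.findall(text)}


def unsupported_articles(source_text, summary_text):
    """Article numbers cited in the summary that never appear in the source."""
    return sorted(article_numbers(summary_text) - article_numbers(source_text))
```

A non-empty result is a strong signal that the summary misremembered a citation. An empty result proves nothing about the legal accuracy of the summary.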

Classical Arabic. None of the frontier models is fluent in pre-modern Classical Arabic the way they are in MSA. Quotes from medieval texts, exegesis of religious texts, anything in the classical style: expect significant errors and budget for expert review.

Specific regional dialects. Khaleeji is itself a family of dialects. Najdi differs from Hijazi differs from Qatari differs from Bahraini. No model distinguishes between them at the level a native does. For text that specifically needs Hijazi or Bahraini coloring, the models won't capture it without significant prompting and editing.

The pick for production

For Arabic work as of May 2026, the default is Claude Opus 4.7. The model handles MSA, Khaleeji, and code-switching better than the alternatives, and that tone sensitivity is usually what decides whether the text ships or goes back for a rewrite. The pricing on Opus is covered in the Opus review; the model has the headroom for the careful tonal work this kind of content needs.

Qwen 3 235B is the strong second pick. It's the right call when license clarity matters (Apache 2.0) and your audience speaks one of the dialects in its training mix. For Levantine work, Qwen edges Claude in the public reports; for Khaleeji, Claude keeps the lead. The open-weight tier piece walks through where Qwen fits in the broader open-weight picture, and the small-models piece covers when to drop to a smaller model.

Gemini is fine for general MSA and slips on dialect. GPT-5 produces Egyptian-flavored output even when prompted for Saudi: don't use it for Khaleeji audiences. Llama 4 Maverick is not yet competitive for serious Arabic work, despite improvements over Llama 3. The multimodal piece covers Gemini's separate Arabic strength on document images.

For serious Saudi-market work, the pattern that holds up is to draft the customer-facing copy with Claude, run a Khaleeji-fluent human reviewer over the output, and budget for a heavier editing pass than English would need. The gap from the best model to a native writer is real, but it has closed enough that the workflow now beats translating from scratch. Anthropic and Alibaba have both visibly invested in Arabic, and they're positioning for a half-billion-person market that their less-invested rivals are content to leave on the table.

Frequently asked

Which AI model is best for Arabic content?

Claude Opus 4.7 for Saudi-market work. The model is consistently the strongest on Modern Standard Arabic and Khaleeji in the public community discussion. Tone sensitivity is the difference between text worth shipping and text that needs to be rewritten.

Can AI write in Saudi (Khaleeji) Arabic?

Claude Opus 4.7 produces output close to shippable on the first attempt with a Saudi audience prompt. Qwen 3 235B is close behind. GPT-5 tends to drift toward Egyptian phrasing. Gemini 3.x slips back to MSA when uncertain.

Does Qwen 3 handle Arabic well?

Yes, and the Apache 2.0 licensing is the clearest of the open-weight options for commercial use. Qwen 3 trails Claude on Khaleeji and edges it on Levantine, per the public reports.

How well does AI handle code-switched Arabic-English?

Mixed-language Saudi customer email is the hardest single workload across all frontier models. Claude handles it best in the public reports, keeping technical terms in English where appropriate and using Arabic for relational content.

Can AI translate medieval or Classical Arabic?

Not reliably. None of the frontier models is fluent in pre-modern Classical Arabic the way they are in MSA. Expect significant errors on quotes from medieval texts or religious exegesis.

Changelog

May 25, 2026 — Rewrote sections that previously narrated a private six-task scoring exercise. The qualitative table now reflects the consensus of the public community discussion across Arabic-NLP forums and lab developer communities.
May 4, 2026 — Originally published.

References

Anthropic, "Claude API Documentation," docs.claude.com, accessed May 2026.
Alibaba, "Qwen," qwen.ai, accessed May 2026.
Google, "Gemini API models," ai.google.dev/gemini-api/docs/models, accessed May 2026.
Meta, "Llama," llama.com, accessed May 2026.
"Chatbot Arena leaderboard," lmarena.ai, May 2026 snapshot.
"Modern Standard Arabic," Wikipedia, en.wikipedia.org/wiki/Modern_Standard_Arabic, accessed May 2026.