Review · May 2026

The best AI for Arabic-English translation

Which models move cleanly between Arabic and English both ways, and the places every one of them still breaks.

By the benchr team · Reviewed May 30, 2026 · Figures verified against official sources, 30 May 2026

Translation has a direction, and the two directions are not the same job. Arabic into English is the easier one. English into Arabic is where models earn or lose their keep, because Arabic forces choices English never asks for: which dialect, how formal, whether to add diacritics, how to render a name. A model can clear one direction and stumble on the other. So this page splits by direction. Read the row your text runs in.

This is the translation-specific companion to our wider look at how five frontier models handle MSA and three Arabic dialects. The findings line up: Claude leads on MSA and Gulf, GPT drifts Egyptian, and no model is fluent in classical Arabic. The picks here are the same family, one rung newer. Opus 4.8 landed on May 28, 2026, with a 1M-token context window and 128K max output, and it carries the tonal strength of the 4.7 line into a cheaper-to-trust release.

Arabic into English: mostly a solved direction

Going from Arabic to English, the frontier models are close, and the floor is high. Feed any of them clean MSA news copy or a standard business email in Arabic, and you'll get accurate, readable English back. The differences show up at the edges, not the center.

Opus 4.8 is the steadiest here, and its 1M-token window is the reason. Drop a 40-page Arabic contract or a full manuscript in, and it keeps terminology consistent from the first clause to the last. Competitors start introducing inconsistencies after 15,000 to 20,000 words, while Claude holds the thread. GPT-5.5 is right behind on accuracy and scores in the high 80s or above on translated Arabic MMLU, so for short and mid-length passages you won't see daylight between them. Gemini 3.1 Pro is strong too, scoring 93 on the Artificial Analysis Arabic benchmark, and it's the cheapest way to pour a large pile of Arabic sources into one window.
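
If you want to spot-check terminology consistency on your own long documents, a small script can help. The sketch below is our illustration, not part of any published test: the function name, the glossary, and the sample segments are all hypothetical, and it uses plain substring matching, which will miss inflected forms.

```python
from collections import defaultdict

def find_inconsistent_terms(segments, glossary):
    """Flag source terms that were rendered more than one way.

    segments: list of (arabic_source, english_translation) pairs.
    glossary: dict mapping an Arabic term to the English renderings
              you consider plausible (hypothetical, supplied by you).
    Returns a dict of term -> set of renderings actually found,
    only for terms that show up with two or more renderings.
    """
    seen = defaultdict(set)
    for source, target in segments:
        target_lower = target.lower()
        for term, renderings in glossary.items():
            if term not in source:
                continue  # term does not occur in this segment
            for rendering in renderings:
                if rendering.lower() in target_lower:
                    seen[term].add(rendering)
    return {term: found for term, found in seen.items() if len(found) > 1}

# Hypothetical contract segments: the same party is named two ways.
segments = [
    ("الطرف الأول يلتزم", "The First Party shall comply"),
    ("يدفع الطرف الأول", "Party A shall pay"),
    ("الطرف الثاني", "The Second Party"),
]
glossary = {
    "الطرف الأول": ["First Party", "Party A"],
    "الطرف الثاني": ["Second Party", "Party B"],
}
inconsistent = find_inconsistent_terms(segments, glossary)
```

Run it over the early and late sections of a translated contract separately; a term that is clean in the first half and flagged in the second is the late drift described above.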

The catch on this direction is dialect input. Hand a model a Tunisian voice note or a Moroccan Darija WhatsApp thread and the floor drops out. Translation error rates run between 6% and 25% across models on underrepresented dialects, and GPT-4-class models have been caught reverting to English output or simply guessing. Gulf, Egyptian, and Levantine input is handled well; the Maghreb is not.
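
For a feel of what an edit-based error rate measures, here is a simplified sketch. It assumes the figures above come from a metric in the Translation Edit Rate family; the standard TER metric also counts block shifts, which this version leaves out, so treat it as a word-level approximation and not a reproduction of the published numbers. The function names are ours.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def simple_error_rate(hypothesis, reference):
    """Edits needed to turn the hypothesis into the reference,
    divided by the reference length in words. No shift operation."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)

# One wrong word out of six reference words.
rate = simple_error_rate("the cat sat on a mat", "the cat sat on the mat")
```

On this scale, a 25% rate means roughly one word in four needs an edit before the output matches a human reference.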

English into Arabic: where the models split

This is the harder direction and the one that decides the page. English has no dialect to pick and no diacritics to place. Arabic makes you choose, and the wrong choice reads as foreign to the intended audience even when every word is technically correct.

Claude is the most reliable here. It produces more natural, fluid MSA than GPT-class models, with better structural consistency, and it handles Egyptian and Gulf Arabic better than the rest, rated excellent on Egyptian dialect translation in published comparisons. Ask for Gulf copy and it stays Gulf instead of sliding back to formal MSA. That tonal control is exactly what our closer look at Saudi and Khaleeji dialect handling digs into, and it's the same reason Opus tops the Opus 4.8 review for multilingual prose.

GPT-5.5 is the runner-up and a solid one for MSA. It clears standardized Arabic and Arabic-to-dialect benchmarks well. But in bidirectional tests it shows Egyptian Arabic drift: aim for Gulf, and Egyptian vocabulary leaks in. Gemini's failure mode is the opposite. Left alone it picks MSA-specific vocabulary instead of dialect-appropriate terms, and it needs explicit region instructions ("for users in Egypt") to hold a local dialect. Add the instruction and it improves; skip it and you get textbook Arabic where you wanted street Arabic.
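
One way to make the region instruction a habit is to build it into the prompt instead of typing it each time. The sketch below is a minimal illustration, assuming a plain-text prompt sent to whichever model you use. Only the phrase "for users in Egypt" comes from the comparisons above; the other audience strings, the function name, and the parameters are hypothetical.

```python
# Audience phrases per dialect. Only the Egyptian entry mirrors the
# instruction quoted in this review; the rest are illustrative guesses.
DIALECT_AUDIENCES = {
    "egyptian": "users in Egypt",
    "gulf": "users in Saudi Arabia and the Gulf",
    "levantine": "users in the Levant",
    "msa": "a formal, pan-Arab readership",
}

def build_translation_prompt(text, dialect="msa", add_diacritics=False):
    """Return an English-to-Arabic prompt that names the target
    audience, the dialect, and the diacritics policy explicitly."""
    if dialect not in DIALECT_AUDIENCES:
        raise ValueError(f"unknown dialect: {dialect}")
    audience = DIALECT_AUDIENCES[dialect]
    lines = [f"Translate the following English text into Arabic for {audience}."]
    if dialect == "msa":
        lines.append("Use Modern Standard Arabic.")
    else:
        lines.append(
            f"Use {dialect.capitalize()} Arabic vocabulary and phrasing, "
            "not Modern Standard Arabic and not another dialect."
        )
    lines.append(
        "Add full diacritics (tashkeel)." if add_diacritics
        else "Do not add diacritics."
    )
    lines.append("")  # blank line before the text to translate
    lines.append(text)
    return "\n".join(lines)

prompt = build_translation_prompt("Your order has shipped.", dialect="egyptian")
```

Stating what the output should not be (MSA, or a neighboring dialect) targets the two drift patterns described above, though how much it helps will vary by model.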

The open-weight side is worth a line. Qwen 3 ships a 256K-vocabulary tokenizer that represents Arabic concepts efficiently, and Qwen-MT, its translation variant, supports 92 languages and beat GPT-4.1-mini and Gemini-2.5-Flash in human evaluation. It's a real option when you want weights you control. Llama 4 is not; it posts a 50.8% book-translation score and isn't competitive for serious Arabic work despite fast inference.

The scoreboard, by direction

This is a qualitative read, not a numeric leaderboard, because most of these workloads have no public per-task score. Grades reflect the published comparisons and benchmark snapshots in the references, mapped to the newest model in each lineage.

Arabic-English translation by workload and direction, May 2026
Workload	Claude Opus 4.8	GPT-5.5	Gemini 3.1 Pro	Qwen 3	Llama 4
MSA Arabic → English	Strong	Strong	Strong	Strong	Weak
English → MSA	Most natural	Strong	Strong, formal	Strong	Weak
English → Gulf dialect	Holds dialect	Egyptian drift	MSA drift	Good	Weak
Long contract, both ways	Consistent to end	Drifts late	Drifts late	OK	Weak
Maghrebi dialect input	Weak	Weak	Weak	Weak	Weak
Classical / Quranic text	Verify	Verify	Verify	Verify	Weak

Where they fail

Classical and Quranic Arabic. Pre-modern and Quranic terminology differs sharply from MSA, and no model tested shows strong capability beyond what MSA can carry. Use MSA for international contracts and formal work; reserve classical handling for texts where you have an expert to check the output.

Legal terminology. Terms like طلاق (talaq, divorce), حضانة (hadanah, custody), and شركة (sharika, company) carry precise legal weight. Generic translation often misses the formal equivalent unless you name it in the prompt. Spell out the domain, or the contract reads wrong in ways that matter.

Names and diacritics. Tashkeel (the diacritical marks) is generated inconsistently, and models slip into dialectal diacritic patterns on proper names and non-standard Arabic. Foreign-script names going into Arabic, and diacritized Arabic names coming out, are both fragile. Check every name by hand.
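
A script can't replace the hand check, but it can tell you where to look. The sketch below groups names by their undiacritized form and flags any name that appears with more than one diacritization. It assumes the common tashkeel marks in the Unicode range U+064B to U+0652 plus the superscript alef U+0670; the function names and the sample list are hypothetical.

```python
import re

# Fathatan through sukun, plus superscript alef.
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")

def strip_tashkeel(text):
    """Remove diacritical marks, leaving the base letters."""
    return TASHKEEL.sub("", text)

def has_tashkeel(text):
    """True if the text carries at least one diacritical mark."""
    return bool(TASHKEEL.search(text))

def names_needing_review(names):
    """Group names by bare spelling and return the groups that
    appear in more than one diacritized form."""
    variants = {}
    for name in names:
        variants.setdefault(strip_tashkeel(name), set()).add(name)
    return {base: forms for base, forms in variants.items() if len(forms) > 1}

# The same name once bare and once fully diacritized, plus a second name.
bare = "\u0645\u062D\u0645\u062F"
marked = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
other = "\u0639\u0644\u064A"
flagged = names_needing_review([bare, marked, other])
```

Anything in the flagged set is a name the model wrote two different ways in one document, which is a good first stop for a human reviewer.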

Underrepresented dialects and few-shot. Tunisian, Moroccan Darija, and Algerian sit far below Gulf and Egyptian across every model, and without an explicit dialect tag GPT drifts toward Egyptian while Gemini falls back to MSA. One counterintuitive note: few-shot prompting can fail or even worsen dialect results, with GPT-4-class models reverting to English despite Arabic-only examples. Adding more examples isn't the fix here.
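
The revert-to-English failure is easy to catch automatically before a bad output ships. The sketch below measures what share of the letters in a response fall in the basic Arabic Unicode block (U+0600 to U+06FF). The function names and the 0.5 threshold are our assumptions, not values from the cited studies, and the check says nothing about whether the Arabic is in the right dialect.

```python
def arabic_ratio(text):
    """Share of alphabetic characters that are in the basic Arabic block.
    Digits, punctuation, spaces, and diacritics are ignored."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)

def reverted_to_english(output, threshold=0.5):
    """True if the output is mostly non-Arabic script.
    An empty output also counts as a failure."""
    return arabic_ratio(output) < threshold

# A response that should have been Arabic but came back in English.
bad = reverted_to_english("The meeting is tomorrow.")
good = reverted_to_english("الاجتماع غدا")
```

Raise the threshold if your source text has no brand names or Latin-script terms that should legitimately survive in the Arabic output.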

Arabic is a low-resource language for these models despite 400 million-plus speakers, because the training data thins out fast once you leave MSA.

That low-resource label is the whole story under the failures. Arabic lacks standardized orthography across dialects, written text usually omits diacritics, and the data behind dialects is a fraction of what backs MSA. The models are good where the data is thick and shaky everywhere else. If you want the longer treatment of voice and register beyond raw translation, our guide to the best AI for long-form writing covers how the same models hold tone across a piece.


Frequently asked

Which model is best for Arabic-English translation in both directions?

Claude Opus 4.8 is the strongest overall, particularly for Modern Standard Arabic and Gulf/Saudi dialect. It excels at maintaining consistency across long documents. GPT-5.5 is the runner-up with good MSA performance but shows more Egyptian dialect drift.

How do these models handle Arabic dialects (Egyptian, Levantine, Gulf)?

Claude handles Gulf and Egyptian Arabic better than others. Gemini requires explicit dialect specification (for example, "for users in Egypt") to avoid drift. GPT-5.5 performs well on dialect translation benchmarks but tends toward Egyptian Arabic. All models struggle with underrepresented dialects like Tunisian and Moroccan Darija.

What are the key failure modes in Arabic-English translation?

Main failures: underrepresented dialects (Tunisian, Moroccan), legal and contract terminology without explicit specification, Quranic and classical Arabic, inconsistent diacritical marks on proper names, and paradoxical degradation with few-shot prompting. Without an explicit dialect instruction, models also drift: GPT toward Egyptian, Gemini toward MSA.

Is Qwen 3 or Llama 4 competitive for Arabic translation?

Qwen 3, with its 256K vocabulary, is strong for Arabic due to efficient token representation. Qwen-MT outperformed GPT-4.1-mini and Gemini-2.5-Flash in human evaluation. Llama 4 is not competitive; it shows poor translation performance despite fast inference, with only a 50.8% score on book translation.

Should I use MSA or classical Arabic for professional translation?

Modern Standard Arabic (MSA) is recommended for international business, legal contracts, and formal content. Classical or Quranic Arabic is needed only for religious or historical texts. All major models handle MSA better than classical variants.

Changelog

May 30, 2026 — Originally published. Picks reflect Claude Opus 4.8 (released May 28), GPT-5.5, Gemini 3.1 Pro, Qwen 3, and Llama 4, mapped from the published Arabic-translation comparisons in the references.

References

Anthropic, "What's new in Claude Opus 4.8," platform.claude.com, accessed May 2026.
Truescho, "Claude vs ChatGPT: Which Is Better for Arabic Content? 2026," truescho.com, accessed May 2026.
Frontiers in AI, "Cross-dialectal Arabic translation: comparative analysis on large language models," frontiersin.org, accessed May 2026.
Localazy, "Can LLMs translate Arabic accurately? We put 8 of them to the test," localazy.com, accessed May 2026.
MarkTechPost, "Alibaba Qwen Introduces Qwen3-MT: Next-Gen Multilingual Machine Translation," marktechpost.com, accessed May 2026.
OpenAI, "Introducing GPT-5.5," openai.com, accessed May 2026.