Translation has a direction, and the two directions are not the same job. Arabic into English is the easier one. English into Arabic is where models earn or lose their keep, because Arabic forces choices English never asks for: which dialect, how formal, whether to add diacritics, how to render a name. A model can clear one direction and stumble on the other. So this page splits by direction. Read the row your text runs in.
This is the translation-specific companion to our wider look at how five frontier models handle MSA and three Arabic dialects. The findings line up: Claude leads on MSA and Gulf, GPT drifts toward Egyptian, and no model is fluent in classical Arabic. The picks here are the same family, one rung newer. Opus 4.8 landed on May 28, 2026, with a 1M-token context window and 128K max output, and it carries the tonal strength of the 4.7 line into a cheaper-to-trust release.
Arabic into English: mostly a solved direction
Going from Arabic to English, the frontier models are close, and the floor is high. Feed any of them clean MSA news copy or a standard business email in Arabic, and you'll get accurate, readable English back. The differences show up at the edges, not the center.
Opus 4.8 is the steadiest here, and its 1M-token window is the reason. Drop a 40-page Arabic contract or a full manuscript in, and it keeps terminology consistent from the first clause to the last. Competitors start introducing inconsistencies after 15,000 to 20,000 words, while Claude holds the thread. GPT-5.5 is right behind on accuracy and scores in the high 80s or above on translated Arabic MMLU, so for short and mid-length passages you won't see daylight between them. Gemini 3.1 Pro is strong too, scoring 93 on the Artificial Analysis Arabic benchmark, and it's the cheapest way to pour a large pile of Arabic sources into one window.
The catch on this direction is dialect input. Hand a model a Tunisian voice note or a Moroccan Darija WhatsApp thread and the floor drops out. Translation Error Rates run between 6 and 25% across models on underrepresented dialects, and GPT-4-class models have been caught reverting to English output or simply guessing. Gulf, Egyptian, and Levantine input is handled well; the Maghreb is not.
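To make a figure like "6 to 25%" concrete, here is a minimal sketch of an edit-based error rate. It assumes the number is a word-level edit rate measured against a human reference; the function names are hypothetical, and this simplified version counts only insertions, deletions, and substitutions. Standard TER also counts block shifts, which this sketch leaves out, so treat it as an approximation of the idea and not as the metric behind the published numbers.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # drop a hypothesis word
                curr[j - 1] + 1,     # add a missing reference word
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def simple_error_rate(hypothesis: str, reference: str) -> float:
    """Edits needed to turn the hypothesis into the reference, per reference word."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    # One wrong word in a five-word reference.
    print(simple_error_rate("the meeting is on Monday",
                            "the meeting is on Tuesday"))
```

On this reading, a 25% rate means roughly one word in four of the English output needs fixing before it matches what a human translator wrote.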
English into Arabic: where the models split
This is the harder direction and the one that decides the page. English has no dialect to pick and no diacritics to place. Arabic makes you choose, and the wrong choice reads as foreign to the intended audience even when every word is technically correct.
Claude is the most reliable here. It produces more natural, fluid MSA than GPT-class models, with better structural consistency, and it handles Egyptian and Gulf Arabic better than the rest; published comparisons rate it excellent on Egyptian dialect translation. Ask for Gulf copy and it stays Gulf instead of sliding back to formal MSA. That tonal control is exactly what our closer look at Saudi and Khaleeji dialect handling digs into, and it's the same reason our Opus 4.8 review ranks it first for multilingual prose.
GPT-5.5 is the runner-up and a solid one for MSA. It clears standardized Arabic and Arabic-to-dialect benchmarks well. But in bidirectional tests it shows Egyptian Arabic drift: aim for Gulf, and Egyptian vocabulary leaks in. Gemini's failure mode is the opposite. Left alone it picks MSA-specific vocabulary instead of dialect-appropriate terms, and it needs explicit region instructions ("for users in Egypt") to hold a local dialect. Add the instruction and it improves; skip it and you get textbook Arabic where you wanted street Arabic.
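The explicit region instruction is easy to bake into a prompt template so nobody forgets it. The sketch below is one way to do that. `build_translation_prompt`, the `REGION_HINTS` table, and the exact wording of the instructions are all illustrative assumptions, not a tested recipe for any particular model.

```python
# Hypothetical mapping from a short dialect key to an explicit target description.
REGION_HINTS = {
    "msa": "Modern Standard Arabic (MSA)",
    "gulf": "Gulf (Khaleeji) Arabic, for users in Saudi Arabia and the Gulf states",
    "egyptian": "Egyptian Arabic, for users in Egypt",
    "levantine": "Levantine Arabic, for users in Lebanon, Syria, Jordan, and Palestine",
}


def build_translation_prompt(text: str, dialect: str, register: str = "neutral") -> str:
    """Build an English-to-Arabic prompt that always names the target variety."""
    key = dialect.lower()
    if key not in REGION_HINTS:
        raise ValueError(f"unknown dialect: {dialect!r}")
    lines = [
        f"Translate the English text below into {REGION_HINTS[key]}.",
        f"Register: {register}.",
    ]
    if key != "msa":
        # Guard against both drift patterns: back to MSA, or into another dialect.
        lines.append(
            "Use vocabulary natural to that region. Do not fall back to MSA "
            "or to another dialect's vocabulary."
        )
    lines += ["", "Text:", text]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_translation_prompt("Your order is on its way.", "gulf", "casual"))
```

Rejecting unknown dialect keys is deliberate: a silent default would bring back exactly the unprompted behavior the instruction is meant to prevent.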
The open-weight side is worth a line. Qwen 3 ships a 256K-vocabulary tokenizer that represents Arabic concepts efficiently, and Qwen-MT, its translation variant, supports 92 languages and beat GPT-4.1-mini and Gemini-2.5-Flash in human evaluation. It's a real option when you want weights you control. Llama 4 is not; it posts a 50.8% book-translation score and isn't competitive for serious Arabic work despite fast inference.
The scoreboard, by direction
This is a qualitative read, not a numeric leaderboard, because most of these workloads have no public per-task score. Grades reflect the published comparisons and benchmark snapshots in the references, mapped to the newest model in each lineage.
| Workload | Claude Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro | Qwen 3 | Llama 4 |
|---|---|---|---|---|---|
| MSA Arabic → English | Strong | Strong | Strong | Strong | Weak |
| English → MSA | Most natural | Strong | Strong, formal | Strong | Weak |
| English → Gulf dialect | Holds dialect | Egyptian drift | MSA drift | Good | Weak |
| Long contract, both ways | Consistent to end | Drifts late | Drifts late | OK | Weak |
| Maghrebi dialect input | Weak | Weak | Weak | Weak | Weak |
| Classical / Quranic text | Verify | Verify | Verify | Verify | Weak |
Where they fail
Classical and Quranic Arabic. Pre-modern and Quranic terminology differs sharply from MSA, and no model tested shows strong capability beyond what MSA can carry. Use MSA for international contracts and formal work; reserve classical handling for texts where you have an expert to check the output.
Legal terminology. Terms like طلاق (talaq, divorce), حضانة (hadanah, custody), and شركة (sharika, company) carry precise legal weight. Generic translation often misses the formal equivalent unless you name it in the prompt. Spell out the domain, or the contract reads wrong in ways that matter.
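One way to act on this is to pin a small glossary in the prompt and then check the output against it. The sketch below covers the checking half for an Arabic-to-English contract. The three glossary entries come from the terms above; the function name and the plain substring matching are assumptions, and substring matching will miss inflected forms (a suffixed form of شركة, for example), so a human pass is still needed.

```python
# Arabic term -> the English rendering you want to see in the output.
LEGAL_GLOSSARY = {
    "طلاق": "divorce",
    "حضانة": "custody",
    "شركة": "company",
}


def missing_glossary_terms(source_ar: str, translation_en: str,
                           glossary: dict = LEGAL_GLOSSARY) -> list:
    """Return Arabic terms that occur in the source but whose pinned
    English rendering does not appear in the translation."""
    out = translation_en.lower()
    return [
        ar for ar, en in glossary.items()
        if ar in source_ar and en.lower() not in out
    ]


if __name__ == "__main__":
    source = "عقد الشركة وأحكام الحضانة"
    draft = "The firm's contract and the custody provisions"
    # Lists any pinned term the draft rendered some other way.
    print(missing_glossary_terms(source, draft))
```

A flagged term is not automatically an error, since a synonym may be acceptable in context, but it tells the reviewer where to look first.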
Names and diacritics. Tashkeel (the diacritical marks) is generated inconsistently, and models slip into dialectal diacritic patterns on proper names and non-standard Arabic. Foreign-script names going into Arabic, and diacritized Arabic names coming out, are both fragile. Check every name by hand.
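Part of the hand check can be narrowed down mechanically: compare each name the model produced with the spelling you expected, once as written and once with tashkeel removed. The sketch below treats the marks in U+064B through U+0652 plus the superscript alef (U+0670) as tashkeel; the function names and the three result labels are hypothetical.

```python
# Fathatan through sukun (U+064B..U+0652), plus superscript alef (U+0670).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_tashkeel(text: str) -> str:
    """Remove Arabic diacritical marks, leaving the base letters untouched."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


def compare_names(expected: str, produced: str) -> str:
    """Classify a produced name against the expected spelling."""
    if expected == produced:
        return "exact"
    if strip_tashkeel(expected) == strip_tashkeel(produced):
        return "diacritics-differ"  # same letters, different or missing marks
    return "letters-differ"         # a different spelling altogether


if __name__ == "__main__":
    plain = "\u0645\u062d\u0645\u062f"  # the name Muhammad without marks
    marked = "\u0645\u064f\u062d\u064e\u0645\u0651\u064e\u062f"  # fully diacritized
    print(compare_names(plain, marked))
```

A "diacritics-differ" result is the case the paragraph above warns about: the letters are right, but the marks need a human eye.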
Underrepresented dialects and few-shot. Tunisian, Moroccan Darija, and Algerian sit far below Gulf and Egyptian across every model, and without an explicit dialect tag GPT drifts toward Egyptian while Gemini falls back to MSA. One counterintuitive note: few-shot prompting can fail or even worsen dialect results, with GPT-4-class models reverting to English despite Arabic-only examples. Adding more examples isn't the fix here.
Arabic is a low-resource language for these models despite 400 million-plus speakers, because the training data thins out fast once you leave MSA.
That low-resource label is the whole story under the failures. Arabic lacks standardized orthography across dialects, written text usually omits diacritics, and the data behind dialects is a fraction of what backs MSA. The models are good where the data is thick and shaky everywhere else. If you want the longer treatment of voice and register beyond raw translation, our guide to the best AI for long-form writing covers how the same models hold tone across a piece.