# benchr

> A reference for AI model selection: pricing, benchmarks, and use-case fit for Claude, GPT, Gemini, Llama, Mistral, and the open-weight tier, sourced from official provider documentation.

benchr is an editorial publication that synthesizes public information (official provider pricing pages, benchmark leaderboards, and model documentation) into reviews, head-to-head comparisons, and buying guidance for frontier and open-weight AI models. Pricing, release dates, deprecation status, and benchmark numbers are verified against primary sources before publication. When a comparison cannot be backed by a citable source, it is stated qualitatively rather than invented.

## Guides
- [Frontier AI models in 2026](https://benchr.org/guides/frontier-models): Claude Opus 4.7, GPT-5, and Gemini 3.1 Pro Preview — the reviews, the head-to-head, and a single buying decision.
- [Open-weight AI models](https://benchr.org/guides/open-weight-models): Llama 4, Mistral, DeepSeek, Qwen, the small-model tier, and what it takes to self-host.
- [AI costs in 2026](https://benchr.org/guides/ai-costs): What AI actually costs by model and workload, and where teams overspend.

## Reviews — Claude
- [Claude Opus 4.8, reviewed](https://benchr.org/articles/claude-opus-4-8-review): A six-week upgrade at the same $5/$25 price, a 69.2% SWE-bench Pro score, and honesty gains that catch flaws in the model's own code.
- [Claude Opus 4.7, reviewed](https://benchr.org/articles/claude-opus-4-7-review): Coding, long-document analysis, and multilingual capability at $5/$25 per million input/output tokens.
- [Claude Sonnet 4.6, reviewed](https://benchr.org/articles/claude-sonnet-4-6-review): The $3/$15 daily-driver tier with a 1M-token window, when it's the right default, and when to pay for Opus.
- [Claude Haiku 4.5, reviewed](https://benchr.org/articles/claude-haiku-4-5-review): The $1/$5 cost-control tier — where the cheapest current Claude is enough and how to use it in a routing strategy.
- [Claude Mythos: the model you can't use](https://benchr.org/articles/claude-mythos): Anthropic's restricted frontier model, gated under Project Glasswing for cybersecurity — what it is and why it's locked away.
- [Claude Fable 5 launch](https://benchr.org/articles/claude-fable-5-launch): The Mythos-class model in general availability since June 9, 2026 — $10/$50 pricing, safety classifiers that fall back to Opus 4.8, and the two-week free window.
- [Claude Cowork: the desktop agent for non-coders](https://benchr.org/articles/claude-cowork): Claude Code's agentic engine applied to desktop knowledge work across local files and apps, plus rollout details.

## Reviews — Other providers
- [GPT-5, reviewed](https://benchr.org/articles/gpt-5-review): Visual design, structured output, and language breadth, plus confident-but-wrong failure modes on niche technical questions.
- [GPT-5.5, reviewed](https://benchr.org/articles/gpt-5-5-review): What it changes over GPT-5 on agentic coding and computer use, at roughly double the API price. Who should upgrade and who should wait.
- [GPT-5.4, reviewed](https://benchr.org/articles/gpt-5-4-review): The seven-week flagship retrospective — $2.50/$15, 1M context, 75% OSWorld computer use, and why losing the crown made it the value pick.
- [Gemini 3 Pro, reviewed](https://benchr.org/articles/gemini-3-pro-evaluation): Best-in-class vision and a 1M-token context window; average elsewhere. Deprecated March 2026 for Gemini 3.1 Pro Preview.
- [Gemini 3.1 Pro, reviewed](https://benchr.org/articles/gemini-3-1-pro-review): A verified ARC-AGI-2 jump to 77.1%, a top GPQA Diamond score, 1M context, and a tiered long-context pricing cliff to watch.
- [Gemini 3.5 Flash, reviewed](https://benchr.org/articles/gemini-3-5-flash-review): $1.50/$9 per 1M tokens, roughly 4x faster output, and Google's claim that it beats the 3.1 Pro tier on coding and agentic work.
- [Grok 4.3, reviewed](https://benchr.org/articles/grok-4-3-review): Native live X and web search built into the API, $1.25/$2.50 pricing, a 1M-token window, and where its real-time edge does and doesn't win.
- [DeepSeek-V4, reviewed](https://benchr.org/articles/deepseek-review): An MIT open-weight coding model with a 1M-token context and a hosted API that massively undercuts closed frontier output.
- [Qwen3.6, reviewed](https://benchr.org/articles/qwen-review): An Apache-2.0 open-weight family in two sizes with up to a 1M context, free to self-host, built for agentic coding.
- [Kimi K2.6, reviewed](https://benchr.org/articles/kimi-review): An open-weight trillion-parameter MoE for agentic and coding work, with an Agent Swarm of up to 300 sub-agents and a 256K context.
- [Llama 4, reviewed](https://benchr.org/articles/llama-4-review): Scout's 10M-token context still stands out, but with Meta's flagship now closed-source, Llama 4 is the last major open-weight Llama.
- [Mistral Large 3, reviewed](https://benchr.org/articles/mistral-review): An Apache-2.0 open-weight MoE flagship with a 256K context and multimodal, multilingual reach, plus Medium 3.5 and Small 4.
- [ChatGPT Images 2.0, reviewed](https://benchr.org/articles/chatgpt-images-review): The image model that finally renders readable text in pictures — where it nails dense layouts and where it slips.

## Comparisons
- [GPT-5 vs Claude Opus 4.7](https://benchr.org/articles/gpt-5-vs-claude-opus): Seven workloads scored. Claude takes five, GPT-5 takes one decisively, and one is a tie.
- [Opus 4.8 vs GPT-5.5: coder's pick vs daily driver](https://benchr.org/articles/opus-4-8-vs-gpt-5-5): Where Opus wins on coding, where it loses on Terminal-Bench, and which fits your stack at the same $5 input price.
- [Gemini 3.1 Pro vs GPT-5.5](https://benchr.org/articles/gemini-3-1-pro-vs-gpt-5-5): Gemini chases hardest-mode reasoning at $2/$12; GPT-5.5 chases all-round knowledge work at $5/$30.
- [Grok 4.3 vs ChatGPT: when live context wins](https://benchr.org/articles/grok-4-3-vs-chatgpt): Grok plugs into the live web and X; ChatGPT is the all-round assistant — a scenario-by-scenario guide.
- [ChatGPT vs Claude vs Gemini: the 2026 pick](https://benchr.org/articles/chatgpt-vs-claude-vs-gemini): Three near-identical $20 subscriptions, scored on four everyday tasks, with a clear pick for each user.
- [Claude vs ChatGPT for long-form writing](https://benchr.org/articles/claude-vs-chatgpt-writing): Output ceilings, voice, and instruction-following for long-form writing: how much each produces and which to trust.
- [AI search engines: Perplexity vs ChatGPT vs Google](https://benchr.org/articles/ai-search-engines-compared): Perplexity, ChatGPT Search, and Google AI Overviews on sourcing, accuracy, and when to use each.
- [The coding assistants shootout](https://benchr.org/articles/coding-assistants-shootout): Cursor, GitHub Copilot, Windsurf, and Cody on architecture, model backend, and bug profile.
- [Multimodal capability ranking](https://benchr.org/articles/multimodal-capability-ranking): Vision tested across Claude, GPT-5, Gemini 3, and Llama 4. Gemini leads on dense UIs, documents, and Arabic script.
- [Voice models compared](https://benchr.org/articles/voice-models-compared): ElevenLabs, OpenAI, and Cartesia on latency, accuracy, and naturalness.
- [Context windows compared](https://benchr.org/articles/context-windows-compared): The advertised window versus the effective retrieval zone where models reliably find information.
- [AI model pricing comparison 2026](https://benchr.org/articles/ai-model-pricing-comparison): Cost per million tokens across OpenAI, Anthropic, Google, DeepSeek, and open-weight models.
- [Cheapest LLM API 2026](https://benchr.org/articles/cheapest-llm-api-2026): Ultra-low-cost API models ranked by price, plus the trade-offs that come with the cheapest tier.
- [Anthropic Claude API pricing guide](https://benchr.org/articles/claude-api-pricing-guide): Opus 4.8, Sonnet 4.6, and Haiku 4.5 pricing, including caching and batch discounts.
- [DeepSeek vs OpenAI pricing](https://benchr.org/articles/deepseek-vs-openai-pricing): Cost comparison across DeepSeek and OpenAI pricing tiers, and the quality trade-offs that come with each.
- [OpenAI API pricing guide](https://benchr.org/articles/openai-api-pricing-guide): GPT-5.5, GPT-5, and GPT-5 Mini pricing, including input, output, caching, and batch costs.

## Roundups — Best AI for X
- [Best free AI with no subscription](https://benchr.org/articles/best-free-ai-no-subscription): The tools genuinely free with no credit card in 2026, the exact point where each free tier taps out, and which to pick for what.
- [Best AI for writing anything long](https://benchr.org/articles/best-ai-for-writing): Opus 4.8 leads on voice; Sonnet 4.6 is the cheap pick — ranked by job: drafting, polishing, and sustained work.
- [Best AI for students](https://benchr.org/articles/best-ai-for-students): NotebookLM for notes, ChatGPT 5.5 for practice problems, plus what counts as fair use versus an academic-integrity risk.
- [Best AI for resumes and cover letters](https://benchr.org/articles/best-ai-for-resumes): Claude Opus 4.8 is the strongest AI for tailoring a resume to a job; Teal is the best dedicated tool, plus the ATS rules that sink an application.
- [Best AI for email](https://benchr.org/articles/best-ai-for-email): When Gmail's free Gemini and Outlook's Copilot cover drafting and triage, and when a standalone like Superhuman is worth it.
- [Best AI for spreadsheets and formulas](https://benchr.org/articles/best-ai-for-spreadsheets): Microsoft 365 Copilot in Excel leads; Claude runs second on big CSVs, plus where Gemini and chat models win.
- [Best AI for research without the fake citations](https://benchr.org/articles/best-ai-for-research): NotebookLM, Perplexity, Elicit, Consensus, Semantic Scholar, and where each one fabricates references.
- [Best AI for Arabic-English translation](https://benchr.org/articles/best-ai-for-arabic-translation): Which models move cleanly between Arabic and English both ways, and where every one of them still breaks.
- [Best AI for Saudi and Gulf Arabic](https://benchr.org/articles/best-ai-for-gulf-arabic): Which AI models hold Khaleeji (Gulf) Arabic and which slide back into MSA or drift toward Egyptian.
- [Best AI for customer service](https://benchr.org/articles/best-ai-for-customer-service): Off-the-shelf bots like Intercom Fin at $0.99 a resolution, platform agents, or building your own on Claude Haiku 4.5.
- [Best free coding model: DeepSeek vs Qwen vs Kimi](https://benchr.org/articles/best-free-coding-model): Three open-weight models on SWE-bench Verified, license, and context, with a clear pick.
- [Best AI for video in 2026](https://benchr.org/articles/best-ai-for-video): Google Veo 3.1 leads while Sora is being discontinued — Veo, Runway, Kling, Luma, and Pika with honest limits and pricing.
- [Best AI tools for social media](https://benchr.org/articles/best-ai-for-social-media): Matched to platform and job: which AI tools earn a spot for captions, hooks, LinkedIn posts, and turning long video into short clips.
- [Best free AI for coding](https://benchr.org/articles/best-free-ai-for-coding): What you actually get at $0 from Copilot, Cursor, Windsurf, Gemini Code Assist, and the BYO-key editors, and when the meter starts.

## Analysis
- [The price-per-use-case table](https://benchr.org/articles/price-per-use-case): The cheapest model for chat, coding, RAG, agents, classification, and summarization.
- [The open-weight tier right now](https://benchr.org/articles/open-weight-tier-right-now): Where Llama 4, Mistral Large 2, DeepSeek-V3.1, and Qwen 3 stand against the closed labs.
- [Cutting your token bill](https://benchr.org/articles/reduce-token-usage): Where AI token spend comes from, and the levers that cut it: routing, prompt caching, the Batch API, shorter output, and lower effort.
- [Why benchmarks stopped telling you anything](https://benchr.org/articles/why-benchmarks-stopped-telling-you): MMLU is saturated above 90%. The benchmarks worth tracking now.
- [Small language models, in working use](https://benchr.org/articles/small-language-models): Phi-4, Gemma 3, and the workloads where sub-10B-parameter models quietly win.
- [Running models on your own machine](https://benchr.org/articles/running-models-on-your-own-machine): Hardware, software, tokens-per-second on three quantizations, and when local inference is worth it.
- [AI agents, eighteen months in](https://benchr.org/articles/ai-agents-eighteen-months-in): LangGraph, OpenAI Assistants v2, Anthropic computer use, and AutoGen, after the hype cycle.
- [RAG vs fine-tuning](https://benchr.org/articles/rag-vs-fine-tuning): When to retrieve, when to fine-tune, and the cases where fine-tuning earns its keep.
- [Prompt engineering did not die](https://benchr.org/articles/prompt-engineering-did-not-die): Three techniques that still improve outputs in 2026, with before-and-after examples.
- [The million-token context marketing](https://benchr.org/articles/million-token-context-marketing): What long context is actually good for, and where retrieval still beats it.
- [AI for Arabic content](https://benchr.org/articles/ai-for-arabic-content): How five frontier models handle Modern Standard, Khaleeji, Egyptian, and Levantine Arabic.
- [The AI agent that checks out for you](https://benchr.org/articles/agentic-shopping): How agentic shopping works end to end, who's building it, and where a hands-off purchase can go wrong.
- [Are AI hallucinations fixed yet?](https://benchr.org/articles/ai-hallucinations-2026): Grounded summarization is now near-perfect, but open-ended factual answers and some reasoning models still miss.
- [Which AI providers train on your chats](https://benchr.org/articles/ai-privacy-who-trains-on-you): A provider-by-provider guide to which chatbots train on conversations by default and how to opt out.
- [Do AI text detectors actually work?](https://benchr.org/articles/do-ai-detectors-work): Why AI detectors flag innocent students, how badly they miss non-native writers, and why a flag isn't proof.
- [Do you actually need a reasoning model?](https://benchr.org/articles/do-you-need-reasoning-models): Thinking models bill hidden reasoning at the output rate and can run minutes slower — a buy-or-skip guide with real numbers.
- [How to get cited inside AI answers](https://benchr.org/articles/get-cited-by-ai-search): GEO and AEO tactics that get pages quoted by ChatGPT, Perplexity, and AI Overviews, backed by the Princeton GEO study.
- [When the model remembers you](https://benchr.org/articles/persistent-memory): How persistent memory works across separate chats in ChatGPT, Gemini, and Claude, and where to control it.
- [What zero-click search did to the web](https://benchr.org/articles/zero-click-search): AI Overviews and chat answers now sit above the links, and the lost-click numbers are real: 58% lower CTR for the top result.

## Tools and reference
- [Compare models directly](https://benchr.org/compare): Interactive comparison of pricing, benchmarks, context windows, and capability ratings for 21 frontier and open-weight models.
- [Recent model releases](https://benchr.org/recent-releases): Major AI model launches since early 2026, in reverse chronological order.
- [AI model deprecations](https://benchr.org/deprecations): Every announced model retirement across Anthropic, OpenAI, and Google — dates, replacements, and migration cost analysis, verified against official deprecation docs.
- [Claude Sonnet 4 retirement](https://benchr.org/deprecations/claude-sonnet-4): Retires June 15, 2026; migration guide to Sonnet 4.6 at the same $3/$15 price.
- [Claude Opus 4 and 4.1 retirement](https://benchr.org/deprecations/claude-opus-4): Retirements on June 15 and August 5, 2026; Opus 4.8 replaces both at a third of the price.
- [GPT-4o shutdown](https://benchr.org/deprecations/gpt-4o): The gpt-4o-2024-05-13 snapshot retires October 23, 2026; replacement options priced.
- [OpenAI October 2026 retirements](https://benchr.org/deprecations/openai-october-2026-retirements): Nine model IDs retire October 23, 2026 — the end of the GPT-4 era, plus the Assistants API sunset.
- [Gemini 2.5 Pro and Flash shutdown](https://benchr.org/deprecations/gemini-2-5-pro): October 16, 2026 retirement, with replacement paths that raise list prices.
- [AI API price history](https://benchr.org/price-history): Append-only log of verified AI API pricing events with official sources; open data under CC BY 4.0.

## API errors
- [AI API Error Database](https://benchr.org/errors): 15 common errors across OpenAI, Anthropic, and Gemini — verified causes, code fixes, and migration alternatives, filterable by provider and category.
- [OpenAI insufficient_quota](https://benchr.org/errors/openai-insufficient-quota): The 429 that backoff can't fix — billing exhausted, with the code guard that separates it from rate limits.
- [OpenAI model_not_found](https://benchr.org/errors/openai-model-not-found): Why model IDs 404 in 2026 — usually retirement, with the October 23 wave checklist.
- [Anthropic overloaded_error 529](https://benchr.org/errors/anthropic-overloaded-error): Platform-wide overload vs your account, and how to retry without making it worse.
- [Anthropic invalid_request_error](https://benchr.org/errors/anthropic-invalid-request-error): The 2026 causes — sampling params on Opus 4.7+, prefill, modified thinking blocks.
- [Gemini RESOURCE_EXHAUSTED](https://benchr.org/errors/gemini-resource-exhausted): Free-tier rate limits and the quota decision tree.

## About
- [About benchr](https://benchr.org/about): What benchr is and how it sources information.
- [Contact](https://benchr.org/contact): General questions, corrections with priority handling, and security reports.
- [Methodology](https://benchr.org/methodology): Where the data comes from and how it is kept current.
- [Editorial standards](https://benchr.org/editorial-standards): Publishing principles and sourcing rules.
- [Corrections](https://benchr.org/corrections): The log of material corrections.