Leaderboard · Coding Capabilities · June 2026

AI coding models leaderboard

Every model in benchr's index, ranked by SWE-bench Verified score, a widely used benchmark for autonomous code repair on real GitHub issues. Higher is better. Scores are official figures where published and editorial estimates otherwise.

Data from models.json · Data-driven and neutral

Rank	Model	Provider	SWE-bench Verified	Input $/1M	Quality per dollar
#1	Claude Opus 4.8	Anthropic	88.6%	$5.00	17.7/$
#2	Claude Opus 4.7	Anthropic	87.6%	$5.00	17.5/$
#3	GPT-5.5	OpenAI	84.0%	$5.00	16.8/$
#4	Gemini 3.5 Flash	Google	80.6%	$1.50	53.7/$
#5	Gemini 3.1 Pro	Google	80.6%	$2.00	40.3/$
#6	DeepSeek V4-Pro	DeepSeek	80.6%	$0.435	185.3/$
#7	Kimi K2.6	Moonshot AI	80.2%	$0.950	84.4/$
#8	Claude Sonnet 4.6	Anthropic	79.6%	$3.00	26.5/$
#9	DeepSeek V4-Flash	DeepSeek	79.0%	$0.140	564.3/$
#10	Mistral Medium 3.5	Mistral	77.6%	$1.50	51.7/$
#11	Qwen3.6-27B	Alibaba	77.2%	Free	∞ (free)
#12	GPT-5	OpenAI	74.9%	$1.25	59.9/$
#13	Claude Haiku 4.5	Anthropic	73.3%	$1.00	73.3/$
#14	Grok 4.3	xAI	68.0%	$1.25	54.4/$
#15	Llama 4 Maverick	Meta	66.0%	Free	∞ (free)
#16	Mistral Large 3	Mistral	62.0%	$0.500	124.0/$
#17	Llama 4 Scout	Meta	56.0%	Free	∞ (free)
#18	GPT-5 Mini	OpenAI	48.0%	$0.250	192.0/$
#19	Phi-4	Microsoft	30.0%	Free	∞ (free)

What SWE-bench Verified actually measures

SWE-bench presents a model with a GitHub issue and the repository's codebase. The model must locate the bug, write a fix, and have that fix pass the repository's test suite automatically, without human guidance. The "Verified" subset is human-validated: it drops underspecified issues and tasks whose tests would reject valid fixes, which makes scores more reliable.

What it doesn't measure: latency, non-coding tasks, or how the model behaves when a developer is actively supervising. A model that scores 75% on SWE-bench might be excellent for pair programming even if an 88% model is better for fully autonomous pipelines. The score is most meaningful when humans are out of the loop.

Reading the leaderboard: the cost dimension

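Claude Opus 4.8 leads on raw SWE-bench Verified at 88.6%, but costs $5.00 per 1M input tokens. DeepSeek V4-Pro scores 80.6% at $0.435 per 1M, and Claude Haiku 4.5 scores 73.3% at $1.00 per 1M. The "quality per dollar" column in the table above (SWE-bench score divided by input price per 1M tokens) surfaces the models where the value proposition is strongest. Among paid models, DeepSeek V4-Flash leads this derived metric by a wide margin at 564.3/$, followed by GPT-5 Mini at 192.0/$ and DeepSeek V4-Pro at 185.3/$. The ratio rewards low prices heavily, which is why GPT-5 Mini places second despite a 48.0% score.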
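The derived column can be reproduced with a few lines of Python. This is a minimal sketch using a subset of rows copied from the table; the names `MODELS`, `quality_per_dollar`, and `rank_by_value` are illustrative and are not part of benchr's tooling. Free models are treated as infinite, as the table does.

```python
# (model, SWE-bench Verified %, input price in USD per 1M tokens).
# None marks a free model. Rows are a subset of the table above.
MODELS = [
    ("Claude Opus 4.8", 88.6, 5.00),
    ("DeepSeek V4-Pro", 80.6, 0.435),
    ("DeepSeek V4-Flash", 79.0, 0.140),
    ("Qwen3.6-27B", 77.2, None),
    ("Claude Haiku 4.5", 73.3, 1.00),
    ("GPT-5 Mini", 48.0, 0.250),
]


def quality_per_dollar(score, price):
    """Score divided by input price; infinite when the model is free."""
    if price is None or price == 0:
        return float("inf")
    return score / price


def rank_by_value(models):
    """Return (name, quality_per_dollar) pairs, best value first."""
    scored = [(name, quality_per_dollar(score, price)) for name, score, price in models]
    return sorted(scored, key=lambda row: row[1], reverse=True)


for name, value in rank_by_value(MODELS):
    label = "free" if value == float("inf") else f"{value:.1f}/$"
    print(f"{name:20s} {label}")
```

Because the metric divides by price, a very cheap model can outrank a much more accurate one, so it is best read next to the raw score rather than instead of it.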

For autonomous code repair at volume, the relevant question isn't just "who scores highest" but "at what failure rate does the quality gap cost more than the price gap." See the GPT-5 vs Opus 4.8 comparison for that analysis.
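One way to frame that question is as a break-even calculation. The sketch below rests on several assumptions that are not benchr figures: each issue gets a single attempt, an attempt consumes a fixed number of input tokens (`TOKENS_PER_ATTEMPT` is hypothetical), output-token cost is ignored, and the SWE-bench Verified score stands in for the success rate on your own issues. Under those assumptions, the stronger model is cheaper overall once a failed repair costs more than the price gap divided by the success-rate gap.

```python
def attempt_cost(input_price_per_1m, input_tokens):
    """Input-token cost in USD of one repair attempt."""
    return input_price_per_1m * input_tokens / 1_000_000


def expected_cost_per_issue(attempt_usd, success_rate, failure_usd):
    """One attempt per issue; each failure adds failure_usd (e.g. a manual fix)."""
    return attempt_usd + (1.0 - success_rate) * failure_usd


def break_even_failure_cost(cheap, strong):
    """Failure cost above which the stronger model is cheaper overall.

    cheap and strong are (attempt_usd, success_rate) pairs.
    """
    cheap_usd, cheap_rate = cheap
    strong_usd, strong_rate = strong
    if strong_rate <= cheap_rate:
        raise ValueError("strong model must have the higher success rate")
    return (strong_usd - cheap_usd) / (strong_rate - cheap_rate)


TOKENS_PER_ATTEMPT = 2_000_000  # hypothetical; measure this on your own runs

# Prices and scores from the table; scores used as assumed success rates.
gpt5 = (attempt_cost(1.25, TOKENS_PER_ATTEMPT), 0.749)
opus = (attempt_cost(5.00, TOKENS_PER_ATTEMPT), 0.886)

threshold = break_even_failure_cost(gpt5, opus)
print(f"Opus 4.8 wins once a failed repair costs more than ${threshold:.2f}")
```

If retries are allowed or failures are caught cheaply by CI, the formula changes; the point of the sketch is only that the decision depends on what a failure costs you, not on the score alone.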

Methodology

SWE-bench Verified scores come from official provider submissions where published. Where a provider hasn't submitted, benchr uses an editorial estimate based on related benchmarks and disclosed internal evaluations. Estimates are marked with an asterisk in the table. Input prices are taken from official API documentation as of June 3, 2026.

How to validate a coding model before switching

Use SWE-bench as a screen, not as the final decision. A model that repairs Python repository issues well may still struggle with your TypeScript monorepo, your test harness, your dependency graph, or your internal style constraints. Before migration, run the candidate model against recent bugs your team already fixed and check whether its patch would have passed review.

The strongest production metric is accepted patch cost: token spend plus failed attempts plus engineer review time. If a cheaper model needs twice as many retries, the higher-scoring model can be less expensive in practice. If every patch is reviewed by a developer anyway, a mid-tier model may be the better default even when Opus leads the benchmark.
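Accepted patch cost can be computed from whatever attempt log your pipeline already keeps. The following is a rough sketch, not a standard definition: the `Attempt` fields, the hourly rate, and the sample numbers are all hypothetical, and failed attempts are charged to the patches that were eventually accepted.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    token_usd: float       # API spend for this attempt
    review_minutes: float  # engineer time spent reviewing the patch
    accepted: bool         # True if the patch passed review


def accepted_patch_cost(attempts, engineer_usd_per_hour):
    """Total spend (tokens plus review time, failures included) per accepted patch."""
    accepted = sum(1 for a in attempts if a.accepted)
    if accepted == 0:
        return float("inf")  # nothing was accepted, so the cost is unbounded
    token_spend = sum(a.token_usd for a in attempts)
    review_hours = sum(a.review_minutes for a in attempts) / 60
    return (token_spend + review_hours * engineer_usd_per_hour) / accepted


# Hypothetical logs: the cheap model needs twice as many attempts.
cheap_model = [
    Attempt(0.50, 15, False),
    Attempt(0.50, 15, True),
    Attempt(0.50, 15, False),
    Attempt(0.50, 15, True),
]
strong_model = [
    Attempt(4.00, 15, True),
    Attempt(4.00, 15, True),
]

print(accepted_patch_cost(cheap_model, engineer_usd_per_hour=120))
print(accepted_patch_cost(strong_model, engineer_usd_per_hour=120))
```

In this made-up log, review time dominates token spend, which is the situation the paragraph above describes: retries on a cheap model can cost more than the price difference saves.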

Why repository fit matters

Coding models vary by language, repository size, test quality, and how much surrounding context they need. A benchmark score can hide those differences. Before moving traffic, sample issues from your own repositories: one simple bug, one dependency problem, one refactor, one failing test with misleading logs, and one issue that requires reading documentation. That small suite reveals failure modes faster than a generic leaderboard.
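A small suite like this is easiest to read when failures are grouped by issue type. The sketch below assumes you already have a pass/fail result per issue from your own harness; the category labels mirror the five issue types listed above, and the issue IDs are placeholders.

```python
from collections import defaultdict

CATEGORIES = (
    "simple bug",
    "dependency problem",
    "refactor",
    "misleading logs",
    "needs documentation",
)


def failures_by_category(results):
    """Group failed issue IDs by category.

    results is an iterable of (issue_id, category, passed) tuples.
    """
    failed = defaultdict(list)
    for issue_id, category, passed in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if not passed:
            failed[category].append(issue_id)
    return dict(failed)


# Placeholder results for one candidate model.
sample_results = [
    ("ISSUE-1", "simple bug", True),
    ("ISSUE-2", "dependency problem", False),
    ("ISSUE-3", "refactor", True),
    ("ISSUE-4", "misleading logs", False),
    ("ISSUE-5", "needs documentation", True),
]

for category, issues in failures_by_category(sample_results).items():
    print(f"{category}: {', '.join(issues)}")
```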

Also check tool behavior. Some models write excellent patches but struggle with shell commands, file navigation, or concise commit messages. If your coding agent depends on those behaviors, evaluate the full workflow instead of the final patch only.

One more practical point: keep a small regression set after you choose a model. Coding models can change behavior when providers update routing, inference settings, or model aliases. A monthly rerun on the same issues will tell you whether the model still deserves its spot in your pipeline.
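The monthly rerun reduces to comparing two result sets keyed by issue. Here is a minimal sketch, assuming each run is stored as a mapping from issue ID to pass/fail; the function names and issue IDs are illustrative only.

```python
def pass_rate(results):
    """Fraction of issues that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def find_regressions(baseline, current):
    """Issue IDs that passed in the baseline but fail or are missing now."""
    return sorted(
        issue
        for issue, passed in baseline.items()
        if passed and not current.get(issue, False)
    )


# Placeholder runs on the same regression set, one month apart.
baseline_run = {"ISSUE-1": True, "ISSUE-2": True, "ISSUE-3": False, "ISSUE-4": True}
current_run = {"ISSUE-1": True, "ISSUE-2": False, "ISSUE-3": True, "ISSUE-4": True}

print(f"pass rate: {pass_rate(baseline_run):.0%} -> {pass_rate(current_run):.0%}")
print("regressions:", find_regressions(baseline_run, current_run))
```

Tracking individual regressions matters because the aggregate pass rate can stay flat while specific issues flip, as in the placeholder data above.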

Frequently asked questions

What is SWE-bench Verified?

SWE-bench Verified is a benchmark that presents models with real GitHub issues from open-source Python repositories. The model must identify the problem, write a fix, and pass the repository's test suite without human guidance. It is a human-validated subset of SWE-bench that filters out ambiguous or underspecified issues for cleaner scoring. It's widely considered one of the best public signals of autonomous coding ability.

Which is the best coding model in 2026?

Claude Opus 4.8 leads at 88.6% on SWE-bench Verified. For teams with cost constraints, DeepSeek V4-Pro scores 80.6% at $0.435 per 1M input tokens, against $5.00 for Opus 4.8. The choice between them is largely a cost-vs-maximum-quality decision.

Why are some SWE-bench scores marked as estimates?

Not all providers submit to the official SWE-bench leaderboard. Where a provider hasn't published an official SWE-bench result, benchr uses an editorial estimate based on available benchmark data and related scores. Estimated figures are marked as such.