Leaderboard · Reasoning · June 2026

AI reasoning models leaderboard

Models ranked by GPQA Diamond: 198 graduate-level science questions designed to be difficult even for experts working outside the relevant subfield. Higher is better. The frontier cluster has converged above 90%.

Data from models.json · Data-driven and neutral

Rank	Model	Provider	GPQA Diamond	SWE-bench	Input $/1M
#1	Gemini 3.1 Pro	Google	94.3%	80.6%	$2.00
#2	Claude Opus 4.7	Anthropic	94.2%	87.6%	$5.00
#3	Claude Opus 4.8	Anthropic	93.6%	88.6%	$5.00
#4	Kimi K2.6	Moonshot AI	90.5%	80.2%	$0.950
#5	DeepSeek V4-Pro	DeepSeek	90.1%	80.6%	$0.435
#6	Claude Sonnet 4.6	Anthropic	89.9%	79.6%	$3.00
#7	DeepSeek V4-Flash	DeepSeek	88.1%	79.0%	$0.140
#8	Qwen3.6-27B	Alibaba	87.8%	77.2%	Free
#9	Llama 4 Maverick	Meta	69.8%	66.0%	Free
#10	Llama 4 Scout	Meta	57.2%	56.0%	Free
#11	Phi-4	Microsoft	56.1%	30.0%	Free
#12	Claude Haiku 4.5	Anthropic	—	73.3%	$1.00
#13	GPT-5.5	OpenAI	—	84.0%	$5.00
#14	GPT-5	OpenAI	—	74.9%	$1.25
#15	GPT-5 Mini	OpenAI	—	48.0%	$0.250
#16	Gemini 3.5 Flash	Google	—	80.6%	$1.50
#17	Grok 4.3	xAI	—	68.0%	$1.25
#18	Mistral Large 3	Mistral	—	62.0%	$0.500
#19	Mistral Medium 3.5	Mistral	—	77.6%	$1.50

The 90% threshold

In early 2025, reaching 85% on GPQA Diamond placed a model at the frontier. By mid-2026, five models clear 90%: Gemini 3.1 Pro, Claude Opus 4.7, Claude Opus 4.8, Kimi K2.6, and DeepSeek V4-Pro. Claude Sonnet 4.6 sits just below the line at 89.9%, and DeepSeek V4-Flash and Qwen3.6-27B trail it by roughly two points. The frontier has moved, and what was exceptional a year ago is now mid-tier performance.

The convergence above 90% means GPQA Diamond is losing its ability to differentiate between top models. The practical question for most teams isn't which model scores 93.6% vs 94.2% — it's whether a model above 89% is sufficient for your use case, and which one is cheapest above that threshold.
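The "cheapest above the threshold" question can be sketched in a few lines of Python. This is a minimal illustration, not a selection tool: the scores and input prices are copied from the table above, the function name `cheapest_above` is hypothetical, and the comparison uses input price only because the table lists no output pricing. Models listed as "Free" are left out on the assumption that self-hosting cost is not captured by the price column.

```python
# GPQA Diamond (%) and input price ($ per 1M tokens), copied from the table above.
# Models with no published GPQA score, or listed as "Free", are omitted here.
MODELS = [
    ("Gemini 3.1 Pro", 94.3, 2.00),
    ("Claude Opus 4.7", 94.2, 5.00),
    ("Claude Opus 4.8", 93.6, 5.00),
    ("Kimi K2.6", 90.5, 0.950),
    ("DeepSeek V4-Pro", 90.1, 0.435),
    ("Claude Sonnet 4.6", 89.9, 3.00),
    ("DeepSeek V4-Flash", 88.1, 0.140),
]


def cheapest_above(models, min_gpqa):
    """Return the lowest-priced (name, gpqa, price) tuple with gpqa >= min_gpqa, or None."""
    eligible = [m for m in models if m[1] >= min_gpqa]
    return min(eligible, key=lambda m: m[2]) if eligible else None


print(cheapest_above(MODELS, 89.0))  # ('DeepSeek V4-Pro', 90.1, 0.435)
```

Moving the threshold changes the answer quickly: at 88% the pick becomes DeepSeek V4-Flash, and at 94% only Gemini 3.1 Pro and Claude Opus 4.7 remain eligible.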

What GPQA actually predicts

GPQA Diamond's questions test whether a model has internalized the principles behind scientific reasoning, not whether it memorized answers. A high score suggests reliable multi-step logical chains, accurate handling of domain-specific edge cases, and resistance to plausible-but-wrong reasoning paths. For products involving scientific or technical analysis, such as medical literature review, financial modeling, or research summarization, GPQA is among the most relevant public benchmarks available.

For purely creative or conversational tasks, the benchmark is less relevant. A customer support bot doesn't need a 90% GPQA score. A clinical decision support tool probably does.

Methodology

GPQA Diamond scores are taken from official provider announcements where published. Where no official figure exists, the table shows —. Benchmarks can be cherry-picked; always evaluate on your actual task distribution before making infrastructure decisions based solely on leaderboard position.

Using the threshold in real products

For most production teams, the useful cutoff is not the absolute first-place score. It is the cheapest model that clears the reasoning threshold your task requires. A research assistant that reviews technical papers may need a model near the frontier cluster; a routing classifier or support assistant usually does not benefit from paying for a 90%+ GPQA score.

Run your own threshold test by grouping failures: wrong final answer, missed constraint, fabricated citation, or incomplete chain of reasoning. If two models fail in the same way on your workload, choose the cheaper or faster one. If the higher-scoring model avoids a failure mode that creates legal, medical, financial, or engineering risk, the premium may be justified.
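One way to make the failure grouping concrete is to tally labeled failures per model and check which failure modes the more expensive model avoids entirely. The sketch below assumes a reviewer has already labeled each failure by hand; the label names, the helper functions, and the sample data are all hypothetical.

```python
from collections import Counter

# The four failure groups named above, as hypothetical label strings.
FAILURE_MODES = (
    "wrong_final_answer",
    "missed_constraint",
    "fabricated_citation",
    "incomplete_reasoning",
)


def failure_profile(labels):
    """Count how often each failure mode appears in a list of labeled failures."""
    counts = Counter(labels)
    unknown = set(counts) - set(FAILURE_MODES)
    if unknown:
        raise ValueError(f"unknown failure labels: {sorted(unknown)}")
    return {mode: counts.get(mode, 0) for mode in FAILURE_MODES}


def modes_avoided(cheaper, pricier):
    """Failure modes the cheaper model shows and the pricier model never does."""
    return [m for m in FAILURE_MODES if cheaper[m] > 0 and pricier[m] == 0]


# Hypothetical labels from reviewing both models on the same task set.
model_a = failure_profile(["wrong_final_answer", "missed_constraint", "fabricated_citation"])
model_b = failure_profile(["wrong_final_answer", "missed_constraint"])

print(modes_avoided(model_a, model_b))  # ['fabricated_citation']
```

An empty result means the two models fail in the same ways on this workload, which is the case where the cheaper or faster one is the reasonable pick.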

Failure cost matters more than rank

Reasoning benchmarks are most useful when the cost of a wrong answer is high. If the task is low-risk and easy to verify, a lower-ranked model may be the right choice. If the task involves expert review, regulatory exposure, or expensive downstream actions, the premium for stronger reasoning can be rational even when the benchmark gap looks small.
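A rough way to reason about this trade-off is expected cost per task: the inference cost plus the error rate multiplied by the cost of a wrong answer. The sketch below is a simplified model under that assumption. The function names and every number are hypothetical, and the error rates should come from your own workload, not from GPQA scores.

```python
def expected_cost(inference_cost, error_rate, failure_cost):
    """Expected cost per task: what you pay to run it plus the average cost of errors."""
    return inference_cost + error_rate * failure_cost


def break_even_failure_cost(cheap_cost, cheap_err, strong_cost, strong_err):
    """Failure cost at which both models have the same expected cost per task.

    Returns None when the stronger model is not actually more accurate,
    because then its premium never pays off on error rate alone.
    """
    if cheap_err <= strong_err:
        return None
    return (strong_cost - cheap_cost) / (cheap_err - strong_err)


# Hypothetical figures: dollars per task and measured error rates on an internal set.
threshold = break_even_failure_cost(
    cheap_cost=0.002, cheap_err=0.06,
    strong_cost=0.020, strong_err=0.04,
)
print(round(threshold, 2))  # 0.9
```

With these made-up numbers, a two-point gap in error rate justifies the premium once a wrong answer costs more than about $0.90, which is easily exceeded when errors trigger expert review or regulatory exposure.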

For evaluation, ask the model to show intermediate assumptions and compare those assumptions to the source material. The failure mode you want to catch is not only a wrong final answer; it is a plausible chain of reasoning built on a false premise.

Keep a separate holdout set for reasoning evaluations. If the same examples are reused in prompt tuning, the score stops measuring general reasoning and starts measuring adaptation to your test. A small unseen set of difficult internal cases is often more useful than another public benchmark.
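One simple way to keep the holdout set separate is to assign each case by a stable hash of its identifier, so a case never moves between the tuning pool and the holdout pool when the dataset grows or is reshuffled. This is a sketch under the assumption that every case has a unique string ID; the function names and the 20% fraction are illustrative choices.

```python
import hashlib


def is_holdout(example_id, holdout_fraction=0.2):
    """Assign an example to the holdout set based on a stable hash of its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < holdout_fraction


def split(example_ids, holdout_fraction=0.2):
    """Partition IDs into (tuning, holdout) lists with no overlap."""
    tuning, holdout = [], []
    for ex_id in example_ids:
        target = holdout if is_holdout(ex_id, holdout_fraction) else tuning
        target.append(ex_id)
    return tuning, holdout


# Hypothetical case IDs; the assignment is the same on every run.
ids = [f"case-{i}" for i in range(1000)]
tuning_ids, holdout_ids = split(ids)
print(len(tuning_ids), len(holdout_ids))
```

Because the assignment depends only on the ID, anyone tuning prompts can filter out holdout cases without needing a shared list of which examples are off limits.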

For teams using reasoning models in sensitive contexts, keep a human-review policy tied to confidence and evidence. A leaderboard can identify candidates, but it cannot decide when an answer must be escalated. The escalation rule is part of the product, not a property of the model.
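An escalation rule of this kind can live in ordinary product code. The sketch below is one possible shape, not a recommended policy: the `Answer` fields, the thresholds, and the idea that a confidence score is available (for example from a verifier step) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    confidence: float   # hypothetical score in [0, 1], e.g. from a verifier step
    cited_sources: int  # number of claims backed by a retrievable source
    high_risk: bool     # task flagged as legal, medical, financial, or engineering


def needs_human_review(answer, min_confidence=0.8, min_sources=1):
    """Escalate when evidence is missing on a high-risk task or confidence is low."""
    if answer.high_risk and answer.cited_sources < min_sources:
        return True
    return answer.confidence < min_confidence


# A confident answer with no supporting source still escalates on a high-risk task.
draft = Answer(text="Dose is within range.", confidence=0.95, cited_sources=0, high_risk=True)
print(needs_human_review(draft))  # True
```

Keeping the thresholds as explicit parameters makes the policy reviewable and lets it change without swapping the model.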

Frequently asked questions

What is GPQA Diamond?

GPQA Diamond is a set of 198 multiple-choice questions in graduate-level biology, chemistry, and physics, written by domain experts and designed to be difficult even for experts outside the subfield. Non-expert PhD-holding humans score around 34%. In early 2025, a score above 85% indicated frontier-level reasoning; the frontier cluster now sits above 90%. The benchmark tests depth of reasoning and domain knowledge, not just surface pattern matching.

Which AI model has the best reasoning in 2026?

Gemini 3.1 Pro leads at 94.3% GPQA Diamond, followed closely by Claude Opus 4.7 at 94.2% and Claude Opus 4.8 at 93.6%. The frontier cluster now sits above 90%, so the difference between top models is small, and the decision between them usually comes down to price and coding ability, not pure reasoning.

Does a high GPQA score mean a model is good for math?

GPQA Diamond correlates with hard reasoning ability, including mathematical reasoning. But it's a science-domain benchmark, not a pure math test. For mathematical problem solving specifically, benchmarks like MATH-500 and AIME are more targeted. High GPQA performance generally indicates strong multi-step reasoning, which tends to transfer to math.