Essay · May 2026

Why the benchmarks stopped telling you anything

MMLU is saturated, HumanEval is gamed. A field guide to what's left worth reading.

Updated May 25, 2026

Saturation line: 92%. Most legacy benchmarks score above this.
Worth reading: 4 benchmarks still informative.
To retire: 2 (MMLU, HumanEval), both gamed.
Frontier gap: under 3 points among the top 5 models on MMLU.

Open the model card for any recent frontier release. The MMLU score will be there. So will GPQA, MATH, HumanEval, GSM8K. All but GPQA sit in the nineties. None of those saturated numbers will tell you anything you can act on, and a tournament where everyone scores in the nineties is a tournament that has stopped sorting the contestants.

The benchmark didn't get worse. The models got too good for it. That happens to every benchmark eventually, and it happened across the second half of the 2024–2025 cycle faster than anyone in 2022 had reason to expect.

Four benchmarks still earn attention in 2026. Two deserve to be retired from daily reading. In my testing, the most useful thing you can do as a working developer is stop relying on the public leaderboards for procurement decisions and build a small evaluation set from real work instead. The argument for each follows.

It's not pretty.

Why MMLU stopped being informative

MMLU is a multiple-choice test of factual knowledge across 57 subjects. When it was published in 2020, GPT-3 scored 43.9%. By mid-2025, the state-of-the-art models were at 92%+. The remaining gap is dominated by ambiguous questions, scoring disagreements, and a small set of items that are genuinely hard for reasons unrelated to model capability. Improvements in this range don't track improvements in real-world capability. They track fine-tuning specifically against the benchmark format.

There's also a contamination problem. The MMLU questions are public. Many are on the open internet. Any frontier model trained on a recent web crawl has seen most of them, possibly with the answers. The benchmark increasingly measures how well the model memorized the test rather than how well it knows the underlying material. No lab admits to this. Most labs are guilty of it in some form.
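One rough way to probe for this kind of leak is to look for long word n-grams that a benchmark item shares with a sample of crawl text. The sketch below is a minimal illustration, assuming you have both as plain strings. The function names, the 8-word window, and the example strings are all assumptions made up for this example, not any lab's actual decontamination pipeline.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase the text, drop punctuation, and return its set of word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in corpus_text."""
    item_grams = word_ngrams(item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, nothing to compare
    shared = item_grams & word_ngrams(corpus_text, n)
    return len(shared) / len(item_grams)


# Hypothetical benchmark item and two hypothetical crawl snippets.
question = (
    "Which of the following best describes the role of the "
    "mitochondria in a eukaryotic cell?"
)
leaked_page = (
    "Quiz 4. Which of the following best describes the role of the "
    "mitochondria in a eukaryotic cell? Answer: C"
)
clean_page = (
    "Mitochondria produce most of the ATP that a eukaryotic cell uses for energy."
)

print(overlap_fraction(question, leaked_page))  # every n-gram is shared
print(overlap_fraction(question, clean_page))   # no n-gram is shared
```

A high overlap only says the wording appears in the sample. It says nothing about whether a given model trained on it, so treat the result as a flag to investigate rather than proof.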

HumanEval, the canonical coding benchmark, has the same problem in a more aggressive form. The test set is small (164 problems), the problems are public, and the models have been optimized against it for years. A model can be excellent at HumanEval and bad at working in any actual codebase. The two signals have decoupled.

The strongest case for keeping legacy benchmarks: they're widely understood, the dataset is stable, and trend lines across years have informational value even if the absolute scores are saturated. That's a real argument. The counter-argument — that the trend line is dominated by contamination — is also real. I lean toward the latter, but it's a judgment call.

What still tells you something

Four benchmarks that stay informative, each measuring something different.

LMSYS Chatbot Arena — real human preference at scale, hard to game
SimpleBench — small problem set, designed to break frontier models
ARC-AGI 2 — abstract reasoning that doesn't leak into training data
SWE-bench Verified — real GitHub issues, real test suites

LMSYS Chatbot Arena. Pairwise human preference voting on real user prompts. Two anonymous models respond to the same prompt. You pick the better response. The votes add up into an Elo-style leaderboard. This benchmark still works because the prompts are user-submitted, the comparison is blind, and contamination is structurally hard. The prompts that show up in voting are the prompts users actually ask. The catch: the voting population skews technical and English-speaking, so the leaderboard is biased toward those use cases. It's the best general-purpose benchmark available. It isn't the only one you need.
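To make the step from votes to an Elo-style leaderboard concrete, here is a minimal sketch of the classic Elo update applied to a list of pairwise votes. The K-factor of 32, the starting rating of 1000, and the model names are assumptions for illustration. The live leaderboard's actual rating method may differ.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rate(votes, k: float = 32.0, start: float = 1000.0) -> dict[str, float]:
    """Apply one Elo update per (winner, loser) vote, in the order given."""
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        exp_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - exp_win)  # upsets move ratings more than expected wins
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical blind votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

ratings = rate(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Each vote moves the same number of points from loser to winner, so the total stays fixed and only the ordering and gaps carry information.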

SimpleBench. A set of intentionally tricky reasoning problems where the surface form suggests a wrong answer that a hasty reader would give. The problems are designed to resist contamination because each one needs the model to override a surface heuristic in favor of careful reasoning. Frontier models score around 70-75% in 2026. Humans score around 90%. The gap is real, and it tells you something the saturated benchmarks don't: whether the model is doing real reasoning or pattern-matching on the prompt.

ARC-AGI 2. The second version of the visual reasoning benchmark that was famously hard for the first generation of frontier models. ARC-AGI 2 (released early 2025) addressed limitations of the original and remains a stress test that exposes capability differences between models that look identical on MMLU. The criticism: ARC-AGI tests a specific kind of abstract reasoning that may or may not predict real-world capability. Treat it as one signal among several, not a definitive measure.

SWE-bench Verified. Real GitHub issues from open-source projects, evaluated by whether the model's proposed patch passes the project's existing test suite. The verified subset (introduced 2024) is hand-checked for correctness of both the underlying issue and the tests. This benchmark is informative for coding work in a way that synthetic coding benchmarks aren't. It's the closest public approximation of what shipping real code requires. For real coding-assistant comparison results, see the coding-assistants shootout.
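The pass/fail criterion can be sketched in a few lines. This is a simplification under stated assumptions: a patch counts as resolving an issue only if the tests the fix is supposed to turn green now pass and the tests that were already green still pass. The data class, field names, and test IDs below are hypothetical and are not the benchmark harness's actual API.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    """Test outcomes after applying a model's patch (hypothetical structure)."""
    should_now_pass: dict[str, bool]    # tests that failed before the fix
    should_still_pass: dict[str, bool]  # tests that passed before the fix


def is_resolved(result: PatchResult) -> bool:
    """True only if the target tests pass and nothing previously green broke."""
    return (
        bool(result.should_now_pass)
        and all(result.should_now_pass.values())
        and all(result.should_still_pass.values())
    )


def resolve_rate(results: list[PatchResult]) -> float:
    """Share of issues resolved, the kind of headline number a leaderboard reports."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Three made-up issues: a clean fix, a fix that breaks another test, a non-fix.
results = [
    PatchResult({"test_bug": True}, {"test_old_a": True, "test_old_b": True}),
    PatchResult({"test_bug": True}, {"test_old_a": False, "test_old_b": True}),
    PatchResult({"test_bug": False}, {"test_old_a": True, "test_old_b": True}),
]

print(resolve_rate(results))
```

The second case is the one synthetic benchmarks tend to miss: the patch fixes the reported bug and still fails because it breaks something else.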

A benchmark is informative until it isn't. The moment to stop reading a benchmark is when the top models all score within two points of each other and within five points of the ceiling.
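That stopping rule is simple enough to write down as a check. The sketch below takes the two thresholds from the sentence above as defaults. The function name, the parameter names, and the score lists are made up for illustration, and the ceiling is a parameter because the practical ceiling of a benchmark with ambiguous items can sit below 100.

```python
def is_saturated(top_scores, ceiling=100.0, max_spread=2.0, max_gap=5.0):
    """Return True when the top models are bunched together near the ceiling."""
    spread = max(top_scores) - min(top_scores)   # gap between best and worst top model
    gap_to_ceiling = ceiling - min(top_scores)   # every model must be this close to the top
    return spread <= max_spread and gap_to_ceiling <= max_gap


# Made-up score lists for illustration only.
bunched_near_top = [96.1, 95.4, 97.0, 95.9, 96.5]
still_spread_out = [22.0, 15.5, 9.8, 8.1, 6.4]

print(is_saturated(bunched_near_top))
print(is_saturated(still_spread_out))
```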

Frontier-model saturation by benchmark

Top-five-models score range, 2026. Benchmarks above 92% are saturated.

MMLU: 92%
HumanEval: 95%
MATH: 93%
GPQA Diamond: 71%
SWE-bench Verified: 87.6%
ARC-AGI 2: 22%

Saturation line: 92%+, the score above which most legacy benchmarks are saturated.

[Figure: MMLU scores, 2020–2026. The curve flattens into the ceiling. The benchmark stops sorting models.]

LMSYS Arena: read it. Real human preferences.
SimpleBench: read it. Breaks frontier models.
ARC-AGI 2: read it. No training-data leak.
SWE-bench Verified: read it. Real GitHub issues.

Worth flagging: I'm not arguing the benchmarks I name as "still useful" are perfect. LMSYS Arena has known biases (technical, English-speaking voter base). ARC-AGI 2 measures a specific kind of abstract reasoning. SimpleBench has a small problem set. They're the least bad option, not a clean solution.

The benchmarks to retire

MMLU itself, for the saturation and contamination reasons named above.

HumanEval and MBPP. Both are saturated coding benchmarks that were never representative of real software engineering work. A model that scores 95% on HumanEval can still produce code that breaks in actual use in subtle ways.

The various "intelligence index" charts that combine half a dozen benchmarks into a single score. Combining them hides the saturation in each component. The resulting number is precise without being meaningful.

Any benchmark released at the same time as a model that shows the model winning, where the benchmark itself is novel. That's a structural conflict of interest, and benchmarks that emerge this way rarely survive independent replication.
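The point about combined "intelligence index" scores is easiest to see with arithmetic. The sketch below uses made-up scores for two hypothetical models on four hypothetical benchmarks. Three components are saturated and differ by half a point. One is not and differs by 20 points. The unweighted average is an assumption for illustration, since real index charts use their own weightings.

```python
from statistics import mean

# Made-up scores for two hypothetical models on four hypothetical benchmarks.
scores = {
    "model_x": {"bench_1": 92.0, "bench_2": 95.0, "bench_3": 93.0, "bench_4": 40.0},
    "model_y": {"bench_1": 92.5, "bench_2": 94.5, "bench_3": 93.5, "bench_4": 20.0},
}

# The composite: one tidy number per model.
composite = {name: mean(s.values()) for name, s in scores.items()}

# The breakdown: where the difference actually comes from.
per_benchmark_gap = {
    bench: scores["model_x"][bench] - scores["model_y"][bench]
    for bench in scores["model_x"]
}

print(composite)
print(per_benchmark_gap)
```

The composite reports roughly a five-point lead. Nothing in that number shows that three of the four components are effectively tied and that the whole difference comes from a single benchmark.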

What to do instead

In my testing, building your own evaluation set is what actually works.

The single most useful thing you can do to make model selection decisions is to assemble fifty to two hundred prompts from your actual work, run the candidate models against them, and judge the outputs yourself. Public benchmarks tell you something about the population of models. A personal evaluation set tells you something about how the models perform on the population of prompts that actually run in your system.
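A personal eval needs very little machinery. The sketch below is one possible shape for it, not a prescribed tool: it runs every candidate on every prompt, shuffles the outputs so they are judged in a mixed order, and averages a score per model. The `Model` type, the stand-in lambdas, the 1 to 5 scale, and the toy judge are all assumptions. In practice each model would wrap a real API call and the judge would be you.

```python
import random
from collections import defaultdict
from statistics import mean
from typing import Callable

Model = Callable[[str], str]  # hypothetical: wraps whatever API call you use


def collect_outputs(prompts: list[str], models: dict[str, Model]) -> list[dict]:
    """Run every model on every prompt and keep the raw outputs."""
    return [
        {"prompt": p, "model": name, "output": fn(p)}
        for p in prompts
        for name, fn in models.items()
    ]


def blind_order(rows: list[dict], seed: int = 0) -> list[dict]:
    """Shuffle rows so outputs are not judged in a fixed model order."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled


def summarize(rows: list[dict], judge: Callable[[str, str], int]) -> dict[str, float]:
    """Score each output (assumed 1 to 5 scale) and average per model."""
    by_model = defaultdict(list)
    for row in blind_order(rows):
        # The judge sees only the prompt and the output, never the model name.
        by_model[row["model"]].append(judge(row["prompt"], row["output"]))
    return {name: mean(vals) for name, vals in by_model.items()}


# Stand-ins so the sketch runs end to end.
prompts = ["Summarize this ticket.", "Write a SQL query for monthly revenue."]
models = {
    "model_a": lambda p: f"Draft answer to: {p}",
    "model_b": lambda p: "",
}


def toy_judge(prompt: str, output: str) -> int:
    """Placeholder judge: replace with your own manual scoring."""
    return 5 if output else 1


report = summarize(collect_outputs(prompts, models), toy_judge)
print(report)
```

Keeping the model name out of the judging step matters more than the scoring scale. It is the same reason blind voting works for the public preference leaderboards.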

The working evaluation set used in the writing of this site contains 84 prompts across coding tasks, technical writing, multilingual content, customer-support drafting, and a handful of edge cases collected over time. New models are run against it within a few days of release. The results often disagree with the public benchmarks in informative ways. A leaderboard star can be a mediocre fit for the work, and a model the leaderboards rank lower can be exactly the right call.

This approach takes effort. The first version of your eval will be bad. The second version will be okay. By the fifth iteration, the instrument is informative enough to drive real procurement decisions. The investment pays back. No serious model-selection decision should happen without running it.

Anyway.

One framing note before the checklist: the recommendations here are based on what I've found useful for procurement decisions in my own work. Researchers studying capability frontiers have different needs. So do labs benchmarking their own models. The advice below is for the practitioner choosing what to deploy, not for the benchmark community itself.

How to read a benchmark claim with skepticism

Five questions worth asking whenever a benchmark score shows up in a release announcement.

Is the benchmark saturated? If the top models all score within three points of each other, the benchmark isn't sorting anymore. The chart looks good. The signal is gone.

When was the benchmark released, and could it be in the training data? Anything published before the model's training cutoff is suspect. Anything on the open internet for more than a year is more suspect.

Did the lab releasing the model also create or heavily influence the benchmark? If yes, treat the result with extra skepticism. The conflict is rarely intentional but worth knowing.

Does the benchmark measure something that maps to your use case? A model that wins on graduate-level physics problems may not be the right model for your customer support inbox.

Is the benchmark a public leaderboard with independent verification, or a number from a model card? Independent verification matters. Marketing numbers have a known optimistic bias.

In 2026, no public benchmark is enough on its own. The era when an MMLU leaderboard glance could drive a model decision is over. The benchmarks that survive (LMSYS Arena, SimpleBench, ARC-AGI 2, SWE-bench Verified) each measure something specific and partial. Read them in combination, weight them against the use case that matters, and treat any single-number ranking as a starting hypothesis, not a conclusion.

The most useful thing you can do as a working developer is build a small held-out evaluation set from your own work and run candidate models against it. The investment in building the set is a few hours initially and a few minutes per release cycle to maintain. The return is the ability to make model decisions that fit your actual workload, instead of the imaginary average workload public benchmarks describe. For the related case of why long-context window numbers are similarly misleading, see the million-token marketing piece.

If you're using benchmark charts as the primary input to procurement decisions, stop. The charts are useful for sorting candidates into rough tiers. They aren't useful for picking between models that are close. For that, run them against the work that matters. There's no shortcut. That's the lesson of the past two years of benchmark inflation, and it's the lesson worth carrying into the next two.

Bottom line

For procurement decisions in 2026, I'd stop reading saturated benchmarks (MMLU, HumanEval, GSM8K). Read LMSYS Arena, SimpleBench, ARC-AGI 2, and SWE-bench Verified instead. For the final pick, build your own 50-200 prompt eval set from real work. The investment pays back. Public benchmarks tell you about the model population. A personal eval tells you about your workload.

Frequently asked

Are AI benchmarks still useful?

Mostly not. MMLU, HumanEval, GSM8K, and MATH are all saturated above 90%, with significant contamination concerns. Four benchmarks still tell you something: LMSYS Arena, SimpleBench, ARC-AGI 2, and SWE-bench Verified.

Why is MMLU saturated?

Frontier models score 92%+ in 2026, and the top five sit within 3 percentage points of each other. Improvements in this range come from fine-tuning against the benchmark, not from capability gains. The benchmark stops sorting the contestants.

Which benchmark is hardest to game?

LMSYS Chatbot Arena. Real user prompts, blind pairwise voting, constantly refreshed test set. Contamination is structurally hard because the prompts that show up are the prompts users actually ask.

How should I evaluate AI models for my own use?

Build a 50-200 prompt eval set from your actual work. Run candidate models against it. Score the outputs yourself. The investment is 2-3 hours initially and a few minutes per release. The payoff is decisions that match your workload.

Is SWE-bench Verified worth tracking?

Yes. It's one of the four benchmarks that are still informative. Real GitHub issues, real test suites, hand-checked for correctness. The verified subset is the closest public approximation of what shipping real code requires.

Changelog

May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
January 22, 2026 — Added the ordered list of four replacement benchmarks worth reading.
May 21, 2026 — Originally published.

References

"LMSYS Chatbot Arena," lmarena.ai, accessed May 2026.
"SWE-bench Verified leaderboard," swebench.com, accessed May 2026.
ARC Prize, "ARC-AGI," arcprize.org, accessed May 2026.
"SimpleBench," simple-bench.com, accessed May 2026.
