Open the model card for any recent frontier release. The MMLU score will be there. So will GPQA, MATH, HumanEval, GSM8K. All in the upper nineties. None of those numbers will tell you anything you can act on. Each of those benchmarks is saturated, and a tournament where everyone scores in the upper nineties is a tournament that has stopped sorting the contestants.
The benchmark didn't get worse. The models got too good for it. That happens to every benchmark eventually, and it happened across the second half of the 2024–2025 cycle faster than anyone in 2022 had reason to expect.
Four benchmarks still earn attention in 2026. Several deserve to be retired from daily reading. In my testing, the most useful thing you can do as a working developer is stop relying on the public leaderboards for procurement decisions and build a small evaluation set from real work instead. The argument for each follows.
It's not pretty.
Why MMLU stopped being informative
MMLU is a multiple-choice test of factual knowledge across 57 subjects. When it was published in 2020, GPT-3 scored 43.9%. By mid-2025, the state-of-the-art models were at 92%+. The remaining gap is dominated by ambiguous questions, scoring disagreements, and a small set of items that are genuinely hard for reasons unrelated to model capability. Improvements in this range don't track improvements in real-world capability. They track tuning against the benchmark format.
There's also a contamination problem. The MMLU questions are public. Many are on the open internet. Any frontier model trained on a recent web crawl has likely seen many of them, possibly with the answers. The benchmark increasingly measures how well the model memorized the test rather than how well it knows the underlying material. Labs rarely acknowledge this, and it is hard to rule out for any model trained on web-scale data.
HumanEval, the canonical coding benchmark, has the same problem in a more aggressive form. The test set is small, the problems are public, and the models have been optimized against it for years. A model can be excellent at HumanEval and bad at any actual codebase. The two signals have decoupled.
The strongest case for keeping legacy benchmarks: they're widely understood, the dataset is stable, and trend lines across years have informational value even if the absolute scores are saturated. That's a real argument. The counter-argument — that the trend line is dominated by contamination — is also real. I lean toward the latter, but it's a judgment call.
What still tells you something
Four benchmarks that stay informative, each measuring something different.
- LMSYS Chatbot Arena — real human preference at scale, hard to game
- SimpleBench — small problem set, designed to break frontier models
- ARC-AGI 2 — abstract reasoning that doesn't leak into training data
- SWE-bench Verified — real GitHub issues, real test suites
LMSYS Chatbot Arena. Pairwise human preference voting on real user prompts. Two anonymous models respond to the same prompt. You pick the better response. The votes are aggregated into an Elo-style leaderboard. This benchmark still works because the prompts are user-submitted, the comparison is blind, and contamination is structurally hard. The prompts that show up in voting are the prompts users actually ask. The catch: the voting population skews technical and English-speaking, so the leaderboard is biased toward those use cases. It's the best general-purpose benchmark available. It isn't the only one you need.
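To make "Elo-style" concrete, here is a minimal sketch of how blind pairwise votes can turn into ratings. It is the textbook Elo update, not the Arena's actual fitting procedure, which may differ. The model names, the starting rating of 1000, and the step size `K` are all assumptions for illustration.

```python
from collections import defaultdict

K = 32  # update step size; an assumed value, not the Arena's setting


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_ratings(votes, start=1000.0, k=K):
    """votes: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}."""
    ratings = defaultdict(lambda: start)
    for a, b, winner in votes:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        delta = k * (s_a - e_a)  # surprise-weighted adjustment
        ratings[a] += delta
        ratings[b] -= delta      # zero-sum: B loses what A gains
    return dict(ratings)


# Hypothetical votes between two made-up models.
votes = [
    ("model-x", "model-y", "a"),
    ("model-x", "model-y", "a"),
    ("model-y", "model-x", "tie"),
]
ratings = update_ratings(votes)
print(ratings)
```

An upset moves ratings more than an expected win does, which is part of why a large volume of blind votes is hard to game.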
SimpleBench. A set of intentionally tricky reasoning problems where the surface form suggests a wrong answer that a hasty reader would give. The problems are designed to resist contamination because each one needs the model to override a surface heuristic in favor of careful reasoning. Frontier models score around 70-75% in 2026. The published human baseline is about 84%. The gap is real, and it tells you something the saturated benchmarks don't: whether the model is doing real reasoning or pattern-matching on the prompt.
ARC-AGI 2. The second version of the visual reasoning benchmark that was famously hard for the first generation of frontier models. ARC-AGI 2 (released early 2025) addressed limitations of the original and remains a stress test that exposes capability differences between models that look identical on MMLU. The criticism: ARC-AGI tests a specific kind of abstract reasoning that may or may not predict real-world capability. Treat it as one signal among several, not a definitive measure.
SWE-bench Verified. Real GitHub issues from open-source projects, evaluated by whether the model's proposed patch passes the project's existing test suite. The verified subset (introduced 2024) is hand-checked for correctness of both the underlying issue and the tests. This benchmark is informative for coding work in a way that synthetic coding benchmarks aren't. It's the closest public approximation of what shipping real code requires. For real coding-assistant comparison results, see the coding-assistants shootout.
A benchmark is informative until it isn't. The moment to stop reading a benchmark is when the top models all score within two points of each other and within five points of the ceiling.
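That rule of thumb is easy to write down as a check. A minimal sketch, assuming scores are percentages with a ceiling of 100 and that "the top models" means the top three. The function name, the `top_n` choice, and the example scores are all hypothetical.

```python
def is_saturated(scores, ceiling=100.0, spread_pts=2.0, ceiling_pts=5.0, top_n=3):
    """True when the top models sit within `spread_pts` of each other
    and all of them are within `ceiling_pts` of the ceiling."""
    top = sorted(scores.values(), reverse=True)[:top_n]
    if len(top) < 2:
        return False  # nothing to compare against
    tight = (top[0] - top[-1]) <= spread_pts
    near_ceiling = all(ceiling - s <= ceiling_pts for s in top)
    return tight and near_ceiling


# Made-up leaderboards for illustration only.
saturated_board = {"model-a": 97.1, "model-b": 96.4, "model-c": 95.8}
open_board = {"model-a": 76.0, "model-b": 62.0, "model-c": 55.0}

print(is_saturated(saturated_board))
print(is_saturated(open_board))
```

Both conditions matter. A tight cluster at 60% may just mean the models are similar. A tight cluster at the ceiling means the test has run out of room.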
Worth flagging: I'm not arguing the benchmarks I name as "still useful" are perfect. LMSYS Arena has known biases (technical, English-speaking voter base). ARC-AGI 2 measures a specific kind of abstract reasoning. SimpleBench has a small problem set. They're the least bad options, not a clean solution.
The benchmarks to retire
MMLU itself, for the saturation and contamination reasons named above.
HumanEval and MBPP. Both are saturated coding benchmarks that were never representative of real software engineering work. A model that scores 95% on HumanEval can still produce code that breaks in actual use in subtle ways.
The various "intelligence index" charts that combine half a dozen benchmarks into a single score. Combining them hides the saturation in each component. The resulting number is precise without being meaningful.
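A small worked example shows the dilution. The numbers below are invented: four saturated components where the two models trade one-point leads, plus one component that still separates them. The unweighted mean is an assumption too, since published indexes use their own weightings.

```python
from statistics import mean

# Hypothetical scores: four saturated benchmarks plus one that still discriminates.
model_a = {"sat_1": 96.0, "sat_2": 97.0, "sat_3": 95.0, "sat_4": 98.0, "hard": 70.0}
model_b = {"sat_1": 97.0, "sat_2": 96.0, "sat_3": 96.0, "sat_4": 97.0, "hard": 45.0}


def composite(scores):
    """Unweighted average across components (an assumed weighting)."""
    return mean(scores.values())


gap_composite = composite(model_a) - composite(model_b)
gap_hard = model_a["hard"] - model_b["hard"]

print(f"composite gap: {gap_composite:.1f}")
print(f"gap on the one informative benchmark: {gap_hard:.1f}")
```

With these made-up numbers, a 25-point difference on the only component that still sorts models shrinks to about 5 points in the composite. Four fifths of the index is noise near the ceiling.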
Any benchmark released at the same time as a model that shows the model winning, where the benchmark itself is novel. That's a structural conflict of interest, and benchmarks that emerge this way rarely survive independent replication.
What to do instead
In my testing, building your own evaluation set is what actually works.
The single most useful thing you can do to make model selection decisions is to assemble fifty to two hundred prompts from your actual work, run the candidate models against them, and judge the outputs yourself. Public benchmarks tell you something about the population of models. A personal evaluation set tells you something about how the models perform on the population of prompts that actually run in your system.
The working evaluation set behind this site contains 84 prompts across coding tasks, technical writing, multilingual content, customer-support drafting, and a handful of edge cases collected over time. New models are run against it within a few days of release. The results often disagree with the public benchmarks in informative ways. A leaderboard star can be a mediocre fit for the work, and a model the leaderboards rank lower can be exactly the right call.
This approach takes effort. The first version of your eval will be bad. The second version will be okay. By the fifth iteration, the instrument is informative enough to drive real procurement decisions. The investment pays back. No serious model-selection decision should happen without running it.
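The mechanics are simpler than they sound. Below is a minimal sketch of a blind side-by-side harness, not the one used for this site. It assumes each candidate model is wrapped as a plain function from prompt to text. The stub models, prompt IDs, and function names are all hypothetical, and in real use each stub would be replaced by a call to whichever provider you are evaluating.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BlindItem:
    prompt_id: str
    prompt: str
    outputs: List[str]  # shown to the judge in shuffled order
    order: List[str]    # model names in that same order; hide this while judging


def build_blind_sheet(prompts: Dict[str, str],
                      models: Dict[str, Callable[[str], str]],
                      seed: int = 0) -> List[BlindItem]:
    """Run every model on every prompt and shuffle outputs per prompt."""
    rng = random.Random(seed)
    sheet = []
    for pid, prompt in prompts.items():
        names = list(models)
        rng.shuffle(names)
        outputs = [models[name](prompt) for name in names]
        sheet.append(BlindItem(pid, prompt, outputs, names))
    return sheet


def tally(sheet: List[BlindItem], picks: Dict[str, int]) -> Dict[str, int]:
    """picks maps prompt_id to the index of the output the judge preferred."""
    wins: Dict[str, int] = {}
    for item in sheet:
        if item.prompt_id in picks:
            winner = item.order[picks[item.prompt_id]]
            wins[winner] = wins.get(winner, 0) + 1
    return wins


# Stub models and prompts for illustration only.
prompts = {"p1": "Summarise this ticket", "p2": "Fix the failing test"}
models = {
    "stub-a": lambda p: f"[a] {p}",
    "stub-b": lambda p: f"[b] {p}",
}
sheet = build_blind_sheet(prompts, models)

# Simulated judging: peek at the hidden order to always pick stub-a.
picks = {item.prompt_id: item.order.index("stub-a") for item in sheet}
print(tally(sheet, picks))
```

Shuffling per prompt keeps the judging blind, which borrows the one property that makes the Arena hard to game.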
Anyway.
One framing note before the checklist: the recommendations here are based on what I've found useful for procurement decisions in my own work. Researchers studying capability frontiers have different needs. So do labs benchmarking their own models. The advice below is for the practitioner choosing what to deploy, not for the benchmark community itself.
How to read a benchmark claim with skepticism
Five questions worth asking whenever a benchmark score shows up in a release announcement.
Is the benchmark saturated? If the top models all score within three points of each other, the benchmark isn't sorting anymore. The chart looks good. The signal is gone.
When was the benchmark released, and could it be in the training data? Anything published before the model's training cutoff is suspect. Anything on the open internet for more than a year is more suspect.
Did the lab releasing the model also create or heavily influence the benchmark? If yes, treat the result with extra skepticism. The conflict is rarely intentional but worth knowing.
Does the benchmark measure something that maps to your use case? A model that wins on graduate-level physics problems may not be the right model for your customer support inbox.
Is the benchmark a public leaderboard with independent verification, or a number from a model card? Independent verification matters. Marketing numbers have a known optimistic bias.
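The five questions can also be kept as a small structured checklist, which helps when comparing several release announcements at once. This is a sketch under stated assumptions: the field names, flag wording, and example values are mine, and the dates in the example are illustrative, not taken from any real model card.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class BenchmarkClaim:
    top_scores: List[float]         # scores of the leading models, in points
    benchmark_released: date
    model_training_cutoff: date
    lab_influenced_benchmark: bool
    matches_use_case: bool
    independently_verified: bool


def red_flags(claim: BenchmarkClaim, spread_pts: float = 3.0) -> List[str]:
    """Return the checklist items this claim fails, in question order."""
    flags = []
    scores = claim.top_scores
    if len(scores) >= 2 and max(scores) - min(scores) <= spread_pts:
        flags.append("saturated")
    if claim.benchmark_released < claim.model_training_cutoff:
        flags.append("possible training-data exposure")
    if claim.lab_influenced_benchmark:
        flags.append("lab influenced the benchmark")
    if not claim.matches_use_case:
        flags.append("does not map to use case")
    if not claim.independently_verified:
        flags.append("not independently verified")
    return flags


# Illustrative claim with made-up values.
claim = BenchmarkClaim(
    top_scores=[97.0, 96.0, 95.5],
    benchmark_released=date(2020, 9, 1),
    model_training_cutoff=date(2025, 1, 1),
    lab_influenced_benchmark=False,
    matches_use_case=True,
    independently_verified=False,
)
print(red_flags(claim))
```

A flag is a reason to look closer, not a verdict. Most public benchmarks will trip the exposure check for any recent model.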
In 2026, no public benchmark is enough on its own. The era when an MMLU leaderboard glance could drive a model decision is over. The benchmarks that survive (LMSYS Arena, SimpleBench, ARC-AGI 2, SWE-bench Verified) each measure something specific and partial. Read them in combination, weight them against the use case that matters, and treat any single-number ranking as a starting hypothesis, not a conclusion.
The most useful thing you can do as a working developer is build a small held-out evaluation set from your own work and run candidate models against it. The investment in building the set is a few hours initially and a few minutes per release cycle to maintain. The return is the ability to make model decisions that fit your actual workload, instead of the imaginary average workload public benchmarks describe. For the related case of why long-context window numbers are similarly misleading, see the million-token marketing piece.
If you're using benchmark charts as the primary input to procurement decisions, stop. The charts are useful for sorting candidates into rough tiers. They aren't useful for picking between models that are close. For that, run them against the work that matters. There's no shortcut. That's the lesson of the past two years of benchmark inflation, and it's the lesson worth carrying into the next two.