Methodology · Updated May 2026

How this site sources information

Where the data on benchr comes from and how it is kept current.

Pricing data

Per-token pricing for closed-source models comes directly from each provider's official pricing page: Anthropic, OpenAI, Google, Mistral, DeepSeek. Pricing for open-weight models hosted by third-party inference providers uses that provider's own published rates where relevant. When pricing changes, the article is updated and a changelog entry is added.

Benchmark scores

Benchmark numbers are sourced from the benchmark maintainers' published leaderboards. SWE-bench Verified scores come from swebench.com. LMSYS Arena scores come from lmarena.ai. ARC-AGI scores come from arcprize.org. When a provider publishes a model's score on a benchmark before it appears on the official leaderboard, the provider's published figure is used with attribution.

Capability ratings

Where this site assigns capability ratings (coding, reasoning, writing, vision, long context, multilingual) on a 0–100 scale, the ratings are synthesized from the model's documented benchmark performance on relevant evaluations, capability claims in the model's release notes, and observed behavior in published third-party comparisons. They are synthesized reference figures, not scores from an original lab evaluation.

Editorial estimates vs sourced figures (the tools)

The interactive tools — the recommender, calculator, charts, and benchmark explorer — all read one file, assets/data/models.json, and that file keeps a hard line between two kinds of number:

Sourced (factual): pricing, context window, max output tokens, and release dates come from each provider's own official docs and are reconciled against assets/data/model-figures.json, the verified single source of truth. Where the two ever disagree, the officially-sourced figure wins.
Sourced benchmarks: in the benchmark explorer and the intelligence-vs-price scatter, the Coding axis is SWE-bench Verified and the Reasoning axis is GPQA Diamond. Each value is the provider's official published figure where one exists. Where a provider has not published a SWE-bench Verified score, the value is a clearly marked benchr estimate (shown with “est” in the explorer); a missing GPQA Diamond score is left blank rather than guessed.
Editorial estimates: the 0–100 ratings for writing, vision, long context, and multilingual, plus all latency (first-token and tokens-per-second) figures, are benchr editorial estimates, not lab measurements. They are labelled as ratings, never as benchmarks.
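
The reconciliation rule and the benchmark fallback described above can be sketched roughly as follows. This is a minimal illustration, not benchr's actual code: the field names (price_in, price_out, context_window, max_output_tokens, release_date), the record shapes, the sample values, and all three function names are assumptions, since the real schemas of models.json and model-figures.json are not shown on this page.

```python
from typing import Optional

# Hypothetical names for the sourced (factual) fields; the real schema may differ.
SOURCED_FIELDS = (
    "price_in",
    "price_out",
    "context_window",
    "max_output_tokens",
    "release_date",
)


def reconcile(model: dict, verified: dict) -> dict:
    """Return a copy of `model` in which the verified figure wins any disagreement."""
    merged = dict(model)
    for field in SOURCED_FIELDS:
        if field in verified and merged.get(field) != verified[field]:
            merged[field] = verified[field]
    return merged


def coding_label(official: Optional[float], estimate: Optional[float]) -> str:
    """SWE-bench Verified: official figure if published, else a marked estimate."""
    if official is not None:
        return f"{official:.1f}"
    if estimate is not None:
        return f"{estimate:.1f} est"
    return ""


def reasoning_label(official: Optional[float]) -> str:
    """GPQA Diamond: official figure if published, else blank (never estimated)."""
    return f"{official:.1f}" if official is not None else ""


# Made-up records, for illustration only.
model = {"id": "example-model", "price_in": 3.0, "context_window": 128_000}
verified = {"price_in": 2.5, "context_window": 128_000}

print(reconcile(model, verified)["price_in"])  # 2.5
print(coding_label(None, 61.0))                # 61.0 est
print(repr(reasoning_label(None)))             # ''
```

Keeping the two label functions separate makes the asymmetry explicit: a missing coding score may fall back to a marked estimate, while a missing reasoning score has no fallback at all.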

So when a tool ranks models, the coding and reasoning weight is grounded in real benchmarks, while the other dimensions are honest editorial judgement. A dimension a model has no data for (for example, vision on a text-only model) counts as zero when you weight it, rather than being quietly dropped — so the ranking always reflects the weights you set.
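
To make the "counts as zero" rule concrete, here is a small sketch of a weighted score. It assumes the score is a weighted average normalized by the sum of all weights the user set; that normalization, the function name weighted_score, and the sample ratings are illustrative assumptions rather than the tools' documented formula.

```python
def weighted_score(ratings: dict, weights: dict) -> float:
    """Weighted average of 0-100 ratings; a missing dimension contributes 0.

    Every weighted dimension stays in the denominator, so a model with no
    data for a dimension is penalized rather than having it dropped.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        return 0.0
    total = 0.0
    for dimension, weight in weights.items():
        value = ratings.get(dimension)
        total += weight * (value if value is not None else 0.0)
    return total / total_weight


# Made-up ratings for a text-only model: there is no "vision" entry.
text_only = {"coding": 80, "writing": 70}
weights = {"coding": 2, "writing": 1, "vision": 1}

print(weighted_score(text_only, weights))  # 57.5
```

Had the vision dimension been dropped instead, the same model would score (2 × 80 + 1 × 70) / 3 ≈ 76.7, which would overstate its fit for a user who asked for vision.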

What this site is and is not

This is an editorial publication that synthesizes public information. It is not a benchmarking lab. Articles do not narrate original lab tests, private API-cost totals, or first-person time-on-tool reports. Where an article takes a position on which model fits a workload, that verdict is grounded in published benchmarks, official pricing, official spec sheets, and the well-known public behavior of the models being compared.

What that means in practice: you will see qualitative judgments (“stronger on long-document analysis,” “weaker on dialectal Arabic”) more often than fresh numbers. When a number appears, the source is cited. When a comparison cannot be backed by a citable source, it is stated qualitatively rather than with invented precision.

Update cadence

Pricing tables are checked against provider documentation when articles are revised. Model release and deprecation events are added to articles within several days of the announcement. All model data is systematically re-verified before major article revisions; there is no fixed weekly or monthly cycle.

Corrections and disputes

If you find a number, date, or attribution that does not match the primary source, send a note to corrections@benchr.org. Material corrections are noted on the corrections page and in the article changelog.