Comparison · Covers February 2026 · Published May 30, 2026

Context windows compared, across four frontier models

When the million-token window pays off, and when it's just expensive retrieval done badly.

By the benchr team · Updated May 30, 2026

Models compared: 4 (frontier and open-weight)

Max advertised: 10M (Llama 4 Scout)

Effective ceiling: 2M (where retrieval still works)

Cost / 200K query: $1.00 (Claude Opus baseline)

Most context-window benchmarks measure the wrong thing. They report how much text fits, and stay silent on what the model can actually locate once it's in there.

Claude advertises 1M tokens, per Anthropic's API documentation. Gemini 3.1 Pro Preview advertises 1M, per Google's Gemini models page. GPT-5 advertises 400K, per OpenAI's platform docs. The marketing pitch behind these numbers has been that more context equals more capability, and that long-window models will simply retire retrieval as a relic of the small-window era. What you find in practice is messier than the pitch. The long window earns its keep in a narrow set of workflows, sits there as expensive overhead in most others, and loses outright to proper retrieval in a third category teams keep trying to force into it.

This piece compares the four serious long-context implementations on the same workload: what they do at the limits of their context windows, where the visible degradation begins, and what the bills come to. Per-token costs throughout are verified against Anthropic's pricing page, OpenAI's API pricing, and Google's published rates. Llama 4 weights and license terms are documented at llama.com. The long context pays for itself on exploratory cross-document reasoning and wastes itself on tasks retrieval would handle better. It also tends to cost the most exactly where it tells you the least. For the case against using long context as a default, see the million-token marketing piece.

Advertised numbers versus working numbers

Effective context window vs. advertised, per benchr multi-fact synthesis tests, January 2026
Model	Advertised context	Reliable retrieval zone	Cost per 1M input tokens
Claude Opus 4.7	1M	To ~600K	$5
Gemini 3.1 Pro Preview	1M	To ~800K	$2
GPT-5	400K	To ~250K	$1.25
Llama 4 Maverick	1M	To ~250K	varies (self-host)

The "reliable retrieval zone" column is the practical observation benchmarks rarely report: the rough token count past which recall on multi-fact synthesis tasks degrades visibly. The numbers reflect the consensus of published needle-in-haystack reports, the research community's multi-fact synthesis evaluations (e.g. long-context studies published on arXiv), and open developer discussion. The advertised number is just the technical maximum the model will accept; the reliable zone is how much of it stays useful. Push past that and the model keeps running while synthesis quality falls off faster than simple needle benchmarks suggest.

Bars show advertised context. Orange fills show how much of that you can use.

Gemini 3.1 Pro Preview is the strongest of the four at extreme scale. The 1M window is real, and retrieval inside it holds up further than in the alternatives. Claude is second. GPT-5 sits behind both: its advertised window is smaller at 400K, and synthesis degrades past roughly 250K tokens. Llama 4 Maverick's million-token window is technically present but degrades much sooner in practice, with recall dropping clearly past 250K tokens.

[Chart: Advertised window vs effective retrieval, by model. Advertised maximum in outlined black, effective retrieval zone in orange.]

The effective-retrieval-zone numbers in the table above come from multi-fact synthesis tests, not needle-in-haystack tests. The needle-in-haystack scores would put every model at near-perfect across the advertised window. The gap between the two test families is the thing this whole piece is about.
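To make the gap between "accepted" and "reliable" concrete, here is a minimal sketch that classifies a prompt size against the figures in the table above. The function name and the three labels are our own illustration, not any provider's API, and the token counts are the approximate ones from the table.

```python
# Approximate figures copied from the table above: (advertised, reliable zone).
WINDOWS = {
    "Claude Opus 4.7": (1_000_000, 600_000),
    "Gemini 3.1 Pro Preview": (1_000_000, 800_000),
    "GPT-5": (400_000, 250_000),
    "Llama 4 Maverick": (1_000_000, 250_000),
}


def classify(model: str, prompt_tokens: int) -> str:
    """Label a prompt size as reliable, degraded, or rejected (hypothetical labels)."""
    advertised, reliable = WINDOWS[model]
    if prompt_tokens > advertised:
        return "rejected"            # larger than the model will accept
    if prompt_tokens > reliable:
        return "accepted, degraded"  # fits, but past the reliable zone
    return "reliable"


for name in WINDOWS:
    print(f"{name}: {classify(name, 280_000)}")
```

A 280,000-token document fits every model in the table, but by these figures it already sits past the reliable zone for GPT-5 and Llama 4 Maverick.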

Three workload shapes, three different verdicts

To make the pattern concrete, picture a 280,000-token government policy report (about 200 pages of dense prose) and three different questions you might ask of it. The three workload shapes that follow turn up in legal review and research synthesis, and in any cross-document analysis you might run.

Workload one: the broad pillar question. What are the three pillars of the document and what does it say about progress on each? All four frontier models will give you a workable answer. The consistent pattern in the community discussion is that Gemini and Claude handle this kind of question best when the document is well-structured; GPT-5 sometimes compresses a section it should've summarized in depth. Llama 4 Maverick is the weakest on this shape past its effective retrieval zone.

Workload two: the precise lookup. What is the specific metric the report uses for private-sector contribution to GDP, and what are the current and target values? All four models will produce the right answer when the relevant section is in scope. None is as efficient at this as a basic retrieval system would be. The token cost of asking the full document the question, even with caching, is roughly an order of magnitude more than retrieval would charge. The RAG vs fine-tuning piece has the math.

Workload three: the cross-section synthesis. Are there internal inconsistencies between the housing-affordability claims in the early chapters and the GDP-mix projections in the later chapters? This is the workload that justifies the long window. Retrieval surfaces chunks independently, so the model has no way to notice that section A and section M are talking past each other. The frontier models that hold coherence at scale (Gemini and Claude particularly) catch tensions a retrieval pipeline would miss.
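A toy retriever shows why chunk-level scoring struggles here. This is a deliberately simple sketch, assuming word-overlap scoring in place of a real embedding model, and the chunk texts are invented for illustration. The point is only that each chunk is ranked against the query on its own.

```python
def overlap_score(query: str, chunk: str) -> int:
    """Toy lexical score: number of distinct query words present in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Each chunk is scored against the query alone. Nothing in the ranking
    # compares one chunk with another, so a tension between two sections
    # only surfaces if both happen to match the query wording.
    return sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)[:k]


# Hypothetical chunk texts, invented for illustration only.
chunks = [
    "chapter 2 housing affordability improves as construction output doubles",
    "chapter 11 gdp mix projections assume construction share stays flat",
    "chapter 5 housing affordability survey methodology",
]

hits = top_k("housing affordability claims", chunks)
for hit in hits:
    print(hit)
```

In this toy case the chapter 11 chunk shares no words with the query, so it never reaches the model, and the conflict over construction stays invisible. A production retriever is far better at matching meaning, but it still ranks chunks one at a time.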

The split is structural. Long context earns its place on the cross-section synthesis you cannot get any other way, and retrieval covers the rest.

Long context earns its bill on the cross-document questions you didn't know to ask. The moment you can write the question down in a sentence, you're overpaying.

Input cost per request on Claude Opus, by query size. Cost scales linearly: 5× more context costs 5× more.
Query	Cost per request
8K query	$0.04
50K query	$0.25
200K query	$1.00
600K query	$3.00
RAG retrieval (same answer, 4K tokens)	$0.06
Cached prefix	10% of standard input price

2022 4K · GPT-3.5
One letter, one email, one short article. That was it.
2023 32K · GPT-4
A short report, a small codebase, a long memo.
Nov 2023 200K · Claude 2.1
A novella, a long technical document, a production codebase.
Feb 2024 1M · Gemini 1.5 Pro
First mainstream million-token context. A textbook in one prompt.
Apr 2025 10M · Llama 4 Scout
The whole codebase, the whole corpus. Effective zone closer to 2M.

One caveat: the effective-retrieval-zone numbers reflect the multi-fact synthesis literature on legal, scientific, and policy documents. They are consistent across those domains in the published reports. They do not necessarily generalize to every document type. Code, structured data, transcripts, and conversational logs have different failure modes. Treat the numbers as a starting point, not a ceiling.

The cost picture

The 280,000-token version of the query, on Claude Opus 4.7, costs about $1.40 per question in input tokens. The same question answered against a proper vector store with the relevant chunks retrieved costs around $0.04. That's a 35× difference. At one question a day, nobody notices. At 500 questions a day, that gap decides your architecture for you.
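The arithmetic behind those figures is short enough to sketch. The constants are the ones quoted in this piece ($5 per 1M input tokens for Claude Opus 4.7, a 280,000-token document, about $0.04 per retrieval-backed question). The function names are ours, and output tokens are left out.

```python
PRICE_PER_MTOK = 5.00   # USD per 1M input tokens (Claude Opus 4.7, per the table)
DOC_TOKENS = 280_000    # size of the example policy report
RETRIEVAL_COST = 0.04   # USD per question, the retrieval figure quoted above


def input_cost(tokens: int, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Input-token cost in USD for a single request."""
    return tokens / 1_000_000 * price_per_mtok


def daily_gap(questions_per_day: int) -> float:
    """Extra daily spend from sending the full document instead of retrieving."""
    return questions_per_day * (input_cost(DOC_TOKENS) - RETRIEVAL_COST)


print(f"per question: ${input_cost(DOC_TOKENS):.2f}")
for n in (1, 500):
    print(f"{n:>3} questions/day: ${daily_gap(n):,.2f} extra per day")
```

At one question a day the gap is pocket change. At 500 it comes to roughly $680 a day on input tokens alone.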

Caching changes this picture a lot. If the same long document gets queried repeatedly, Claude's prompt cache drops the input cost on later queries to roughly 10% of the standard rate. Gemini's caching works on the same mechanism and comes out cheaper in absolute terms. With caching on, the long-context query against a frequently-reused document costs roughly $0.14 per question on Claude in input tokens. That is still about 3.5 times what retrieval would charge, but it lands inside the range where the workflows that genuinely need long context can justify paying it. For the broader cost picture across workloads, see price per use case.
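A rough sketch of the cached case, counting input tokens only. It assumes the whole document is served from cache at about 10% of the standard rate and that only the question itself is billed at full price. The 200-token question size is an assumption, and the sketch ignores cache-write surcharges, cache expiry, and output tokens, so real bills run higher.

```python
def query_input_cost(
    prefix_tokens: int,
    fresh_tokens: int,
    price_per_mtok: float,
    cached_read_fraction: float = 0.10,  # ~10% of standard rate, per the text
) -> float:
    """Input cost in USD for one query with a cached document prefix."""
    cached = prefix_tokens / 1_000_000 * price_per_mtok * cached_read_fraction
    fresh = fresh_tokens / 1_000_000 * price_per_mtok
    return cached + fresh


DOC_TOKENS = 280_000
QUESTION_TOKENS = 200  # assumed size of the question itself

uncached = query_input_cost(DOC_TOKENS, QUESTION_TOKENS, 5.00, cached_read_fraction=1.0)
cached = query_input_cost(DOC_TOKENS, QUESTION_TOKENS, 5.00)
print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
```

The saving applies only to later queries against the same prefix. The first request still pays at least the full input price to populate the cache.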

The decision rule

For exploratory questions on a single document, or for cross-section reasoning where the answer might depend on a relationship between distant parts of the source, long context is the right tool. The token cost is high, and it buys you a capability retrieval simply cannot offer.

For precise lookup where you can write the question in a sentence, retrieval wins on every dimension. Cost drops by an order of magnitude and latency drops with it. Accuracy on the specific lookup holds at least even and often comes out ahead, since the model is working a tight context window rather than a sprawling one.

For high-volume question answering against a fixed corpus, retrieval is the only sensible architecture. Long context at scale gets prohibitively expensive in ways no amount of caching fully fixes.

For corpora that exceed the context window of any available model, retrieval is mandatory; there's no decision left to make.
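The four rules above can be sketched as a small routing function. This is one possible encoding, not a definitive implementation. The names are hypothetical, the 500-per-day threshold is borrowed from the cost example earlier and is an assumption, and letting volume outrank cross-section needs is a design choice you may want to reverse.

```python
from enum import Enum


class Route(Enum):
    RETRIEVAL = "retrieval"
    LONG_CONTEXT = "long_context"


def choose_route(
    corpus_tokens: int,
    window_tokens: int,
    cross_section: bool,
    queries_per_day: int,
    high_volume_threshold: int = 500,  # assumed cutoff, tune for your budget
) -> Route:
    """Pick an architecture for one workload using the rules in the text."""
    if corpus_tokens > window_tokens:
        return Route.RETRIEVAL      # corpus does not fit: no decision left
    if queries_per_day >= high_volume_threshold:
        return Route.RETRIEVAL      # high volume against a fixed corpus
    if cross_section:
        return Route.LONG_CONTEXT   # answer may hinge on distant sections
    return Route.RETRIEVAL          # precise lookup: retrieval wins


print(choose_route(280_000, 600_000, cross_section=True, queries_per_day=5))
```

Passing the model's reliable zone as `window_tokens`, in place of the advertised maximum, gives a more conservative router.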

The million-token context era has produced a demonstrated capability you should use deliberately. It complements retrieval by handling a different kind of question, rather than displacing it. Treating the long window as a universal substitute for retrieval is the most common architectural mistake in early 2026, and it's the one behind the most surprising AI bills.

Among the four models compared here, Gemini 3.1 Pro Preview is the strongest long-context implementation right now per the public multi-fact synthesis reports, with Claude Opus 4.7 a close second. The choice between them comes down to the rest of the workload: Gemini for vision-heavy work, Claude for code and honest hedging, with the long-context capability roughly tied either way. GPT-5's long context is competent but trails the leaders, with a 400K window and synthesis that degrades past roughly 250K tokens. Llama 4 Maverick's long context is genuine but degrades earlier in practice than the closed alternatives, so skip it for serious long-document work today.

In a production system, default to retrieval for most of the workload and reach for long context only when the question is genuinely cross-cutting. The cost dynamics make that the only sensible choice at any meaningful volume, and the capability story keeps it architecturally right even when volume is low. For the deeper RAG-versus-long-context math, see RAG vs fine-tuning, with the math.

Frequently asked

Which AI model has the biggest context window?

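On paper, Llama 4 Scout, which advertises 10 million tokens, though effective retrieval holds only to about 2 million. Of the four models compared here, Gemini 3.1 Pro Preview, Claude Opus 4.7, and Llama 4 Maverick each advertise 1 million. For reliable retrieval among the models tested here, Gemini 3.1 Pro Preview is the field leader.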

What's the effective context window for Claude?

Claude Opus 4.7 advertises 1 million tokens. Retrieval stays reliable to about 600K tokens before degrading. Plan around the 600K number for serious document work.

How much does a long-context query cost?

A 200K-token query on Claude Opus 4.7 runs about $1 per request. The same answer via RAG with 4K retrieved tokens costs around $0.06: a 17× difference. The math forces the architecture at any meaningful volume.

Does prompt caching change the cost story?

Yes. Cached prefixes run at ~10% of the standard input rate. If you're sending the same long context repeatedly, caching brings long-context queries closer to RAG economics, though RAG still wins on per-query cost.

When is long context worth the price?

Exploratory cross-document analysis and code understanding across a medium-sized codebase. Both need synthesis across distant parts of a coherent body of text, which RAG cannot do because retrieval breaks the text into independent chunks.

Changelog

May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
February 11, 2026 — Originally published.

References

Anthropic, "Claude API Documentation," docs.claude.com, accessed May 2026.
Anthropic, "Pricing," anthropic.com/pricing, accessed May 2026.
Google, "Gemini API models," ai.google.dev/gemini-api/docs/models, accessed May 2026.
OpenAI, "Platform documentation," platform.openai.com/docs, accessed May 2026.
OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
Meta, "Llama," llama.com, accessed May 2026.