Most context-window benchmarks measure the wrong thing. They tell you what fits. They don't tell you what the model can actually find inside.
1M tokens at Claude, per Anthropic's API documentation. 2M at Gemini 3.1 Pro Preview, per Google's Gemini models page. 1M at GPT-5, per OpenAI's platform docs. The marketing pitch behind these numbers has been that more context equals more capability, that retrieval is an outdated coping mechanism for the small-window era, and that long-window models will replace it. The reality of using these tools is more interesting and a lot more boring. The long window is useful in a narrow set of workflows, expensive overhead in most others, and a worse pick than proper retrieval in a third category teams keep trying to force into it.
This piece compares the four serious long-context implementations on the same workload: what they do at the limits of their context windows, where the real degradation begins, and what the bills come to. Per-token costs throughout are verified against Anthropic's pricing page, OpenAI's API pricing, and Google's published rates. Llama 4 weights and license terms are documented at llama.com. The long context pays for itself on exploratory cross-document reasoning, wastes itself on tasks retrieval would handle better, and is most expensive precisely where it's least informative. For the case against using long context as a default, see the million-token marketing piece.
Advertised numbers versus working numbers
| Model | Advertised context | Reliable retrieval zone | Cost per 1M input tokens |
|---|---|---|---|
| Claude Opus 4.7 | 1M | To ~600K | $5 |
| Gemini 3.1 Pro Preview | 2M | To ~1.2M | $5 |
| GPT-5 | 1M | To ~400K | $10 |
| Llama 4 Maverick | 1M | To ~250K | varies (self-host) |
The "reliable retrieval zone" column is the practical observation benchmarks rarely report: the rough token count past which recall on multi-fact synthesis tasks starts clearly dropping in actual use. The advertised number is the technical maximum the model will accept. The reliable zone is where the model will actually be useful. Past that, the model still runs, but synthesis quality drops faster than the simple needle-in-haystack benchmarks suggest.
Gemini 3.1 Pro Preview is the strongest of the four at extreme scale. The 2M window is real, and the retrieval inside it holds up further than the alternatives. Claude is second. GPT-5 sits behind both on synthesis past 400K tokens, despite the same nominal capacity. Llama 4 Maverick's million-token window is technically present but practically degrades much sooner. Recall drops clearly past 250K tokens.
Worth flagging: the reliable-retrieval-zone numbers in the table above come from multi-fact synthesis tests, not needle-in-haystack tests. The needle-in-haystack scores would put every model at near-perfect across the advertised window. The gap between the two test families is the thing this whole piece is about.
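As a rough illustration of how the two columns differ in practice, the sketch below classifies a prompt size against the table's figures. The numbers are copied from the table above and are this article's observations, not vendor specifications; the function name and the three labels are hypothetical, chosen only for this example.

```python
# (advertised_tokens, reliable_zone_tokens), copied from the table above.
# These are the article's rough observations, not vendor guarantees.
ZONES = {
    "Claude Opus 4.7": (1_000_000, 600_000),
    "Gemini 3.1 Pro Preview": (2_000_000, 1_200_000),
    "GPT-5": (1_000_000, 400_000),
    "Llama 4 Maverick": (1_000_000, 250_000),
}


def classify_prompt(model: str, prompt_tokens: int) -> str:
    """Return 'reliable', 'degraded', or 'rejected' for a prompt size.

    'rejected' means the prompt exceeds the advertised window.
    'degraded' means it fits, but sits past the reliable retrieval zone.
    """
    advertised, reliable = ZONES[model]
    if prompt_tokens > advertised:
        return "rejected"
    if prompt_tokens > reliable:
        return "degraded"
    return "reliable"


if __name__ == "__main__":
    # A 280,000-token document, the size used in the worked example.
    for model in ZONES:
        print(f"{model}: {classify_prompt(model, 280_000)}")
```

A check like this is only a first filter. It says nothing about document type, and the thresholds should be re-measured on your own material.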
The worked example
To make this concrete, I ran the same test against a 207-page government implementation report: roughly 280,000 tokens of structured English text with scattered numerical claims and a layered argument. I posed three questions to all four models, each at full context.
Question one was broad: What are the three pillars of the document and what does the report say about progress on each? Claude and Gemini both gave strong answers covering all three pillars with relevant detail. GPT-5 hit two of three pillars in depth and compressed the third. Llama 4 Maverick produced a competent summary that missed one pillar almost entirely and conflated two of the others.
Question two was specific: What metric does the report use for private sector contribution to GDP, and what are the current and target values? All four models got the answer right when the relevant section was already loaded. None of them was as efficient at this task as a basic retrieval system would have been. The cost of running the question across the full document, even with caching, was roughly ten times the cost of running it against a retrieved chunk.
Question three was the one that justified the long context. Are there internal inconsistencies between the housing-affordability claims in the early chapters and the GDP-mix projections in the later chapters? Claude flagged a real tension between implied wage growth in the housing section and the labor-mix assumptions in a later chapter. Gemini caught the same tension and identified a second, smaller inconsistency Claude missed. GPT-5 noticed the relationship existed but didn't commit to a clear finding. Llama 4 produced output that didn't engage with the question at this depth.
This is the workflow where long context wins clearly. Cross-section synthesis can't be done well by retrieval, because retrieval surfaces chunks independently and gives the model no way to notice that section A and section M are talking past each other.
Long context is unbeatable for the cross-document questions you didn't know to ask. It's a waste of money for the precise questions you can already write down.

| Query | Cost | Detail |
|---|---|---|
| 8K query | $0.12 | Opus, per request |
| 50K query | $0.75 | Opus, per request |
| 200K query | $3.00 | Opus, per request |
| 600K query | $9.00 | Opus, per request |
| RAG retrieval | $0.06 | Same answer, 4K tokens |
| Cached prefix | 10% | Of standard input price |


| Date | Context window | Model | What fit |
|---|---|---|---|
| 2022 | 4K | GPT-3.5 | One letter, one email, one short article. That was it. |
| 2023 | 32K | GPT-4 Turbo | A short report, a small codebase, a long memo. |
| 2024 | 200K | Claude 2 | A novella, a long technical document, a real codebase. |
| Feb 2024 | 1M | Gemini 1.5 Pro | First mainstream million-token context. A textbook in one prompt. |
| Sep 2025 | 10M | Llama 4 Scout | The whole code-base, the whole corpus. Effective zone closer to 2M. |

One genuine uncertainty: the reliable-retrieval-zone numbers reported here come from multi-fact synthesis tests on legal, scientific, and policy documents. They're consistent across those document types in my testing. But I can't promise they generalize to every kind of input. Code, structured data, transcripts, and conversational logs may fail in different ways. The numbers are a starting point, not a ceiling.
The cost picture
The 280,000-token version of the query, on Claude Opus 4.7, costs about $1.40 per question in input tokens. The same question answered against a proper vector store with the relevant chunks retrieved costs around $0.04. That's a 35× difference. For one question a day, the difference doesn't matter. For 500 questions a day, the difference forces the architecture.
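The arithmetic behind those figures is easy to reproduce. The sketch below is input-token cost only, using the $5 per 1M input tokens rate from the table and the article's $0.04 retrieval figure; output tokens are not modeled, and the function names are made up for this example.

```python
OPUS_INPUT_USD_PER_MTOK = 5.00   # from the table above
DOC_TOKENS = 280_000             # the report in the worked example
RETRIEVAL_COST_USD = 0.04        # the article's per-question retrieval figure


def input_cost_usd(tokens: int, usd_per_mtok: float) -> float:
    """Input-token cost of one request (output tokens not included)."""
    return tokens * usd_per_mtok / 1_000_000


def daily_cost_usd(cost_per_question: float, questions_per_day: int) -> float:
    """Total daily spend at a fixed per-question cost."""
    return cost_per_question * questions_per_day


if __name__ == "__main__":
    long_ctx = input_cost_usd(DOC_TOKENS, OPUS_INPUT_USD_PER_MTOK)
    ratio = long_ctx / RETRIEVAL_COST_USD
    print(f"long context: ${long_ctx:.2f} per question")
    print(f"retrieval:    ${RETRIEVAL_COST_USD:.2f} per question")
    print(f"ratio:        about {ratio:.0f}x")
    for volume in (1, 500):
        print(
            f"{volume} questions/day: "
            f"${daily_cost_usd(long_ctx, volume):.2f} long context vs "
            f"${daily_cost_usd(RETRIEVAL_COST_USD, volume):.2f} retrieval"
        )
```

At one question a day the gap is about a dollar. At 500 a day it is the difference between roughly $700 and $20 in daily input spend, before any caching.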
Caching changes this picture a lot. If the same long document gets queried repeatedly, Claude's prompt cache drops the input cost on later queries to roughly 10% of the standard rate. Gemini's caching is comparable in mechanism but cheaper in absolute terms. With caching on, the long-context query against a frequently reused document costs roughly $0.40 per question on Claude. That's still ten times what retrieval would charge, but it's inside the range where the extra cost is worth paying for the workflows where long context wins. For the broader cost picture across workloads, see price per use case.
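To see how quickly caching pays off, here is a minimal sketch of the amortized input cost of the document prefix over repeated queries. The 10% cached-read rate comes from the text above. The 1.25× cache-write multiplier is an assumption for illustration, and the model ignores cache expiry, the question tokens, and output tokens, so all-in per-question cost will be higher than what it prints.

```python
def amortized_prefix_cost_usd(
    doc_tokens: int,
    usd_per_mtok: float,
    n_queries: int,
    cache_read_multiplier: float = 0.10,   # "roughly 10% of the standard rate"
    cache_write_multiplier: float = 1.25,  # ASSUMPTION: first request pays a write premium
) -> float:
    """Average input cost per query for a cached document prefix.

    The first query writes the cache; every later query reads from it.
    Assumes the cache never expires between queries.
    """
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    base = doc_tokens * usd_per_mtok / 1_000_000
    first = base * cache_write_multiplier
    later = base * cache_read_multiplier * (n_queries - 1)
    return (first + later) / n_queries


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        cost = amortized_prefix_cost_usd(280_000, 5.00, n)
        print(f"{n:>3} queries: ${cost:.3f} average prefix cost per query")
```

The average falls toward the cached-read floor as the query count grows but never reaches it, which is one way to see why caching narrows the gap with retrieval without closing it.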
Anyway. On to the decision rule.
The decision rule
For exploratory questions on a single document, or for cross-section reasoning where the answer might depend on a relationship between distant parts of the source, long context is the right tool. The token cost is high. You're paying for capability retrieval can't provide.
For precise lookup where you can write the question in a sentence, retrieval wins on every dimension. Cost is lower by an order of magnitude. Latency is lower. The accuracy on the specific lookup is at least as good and often better, because the model is working with a tight context window instead of a long one.
For high-volume question answering against a fixed corpus, retrieval is the only sensible architecture. Long context at scale gets prohibitively expensive in ways no amount of caching fully fixes.
For corpora that exceed the context window of any available model, retrieval is mandatory. No choice to make.
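The four cases above can be sketched as a small routing function. Everything here is illustrative: the `Workload` fields, the `choose_route` name, and especially the 500-queries-a-day threshold (borrowed from the cost example earlier) are assumptions, not a tested policy.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    RETRIEVAL = "retrieval"
    LONG_CONTEXT = "long_context"


@dataclass
class Workload:
    corpus_tokens: int                   # size of the source material
    needs_cross_section_reasoning: bool  # exploratory or cross-cutting question?
    queries_per_day: int


def choose_route(
    w: Workload,
    max_context_tokens: int,
    high_volume_threshold: int = 500,  # ASSUMPTION: tune to your own budget
) -> Route:
    """Apply the decision rule in order of how hard each constraint is."""
    # Corpus doesn't fit in any window: retrieval is mandatory.
    if w.corpus_tokens > max_context_tokens:
        return Route.RETRIEVAL
    # High-volume Q&A against a fixed corpus: cost rules out long context.
    if w.queries_per_day >= high_volume_threshold:
        return Route.RETRIEVAL
    # Exploratory or cross-section questions: the case long context wins.
    if w.needs_cross_section_reasoning:
        return Route.LONG_CONTEXT
    # Precise lookup you can write in a sentence: retrieval wins.
    return Route.RETRIEVAL


if __name__ == "__main__":
    audit = Workload(280_000, needs_cross_section_reasoning=True, queries_per_day=3)
    lookup = Workload(280_000, needs_cross_section_reasoning=False, queries_per_day=3)
    print(choose_route(audit, max_context_tokens=1_000_000))
    print(choose_route(lookup, max_context_tokens=1_000_000))
```

The ordering is a design choice: hard limits (does it fit, can you afford it) are checked before the question type, so long context is only ever selected when nothing rules it out.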
The million-token context era has produced a real capability you should use deliberately. The capability isn't a replacement for retrieval. It's a complement that handles a different kind of question. Treating the long window as a universal substitute for retrieval is the most common architectural mistake in early 2026, and it's the mistake that produces the most surprising AI bills.
On the multi-fact synthesis tests I ran, Gemini 3.1 Pro Preview is the strongest long-context implementation of the four models compared here, with Claude Opus 4.7 a close second. The choice between them depends on the rest of the workload: Gemini for vision-heavy work, Claude for code and honest hedging, with long context roughly a tied capability either way. GPT-5's long context is competent but trails the leaders on synthesis past 400K tokens. Llama 4 Maverick's long context is real but practically degrades earlier than the closed alternatives. Skip it for serious long-document work today.
For any production system: default to retrieval for most of the workload, and reach for long context only when the question is truly cross-cutting. The cost dynamics make that the only sensible choice at any real volume, and the capability story makes it the architecturally right choice even when volume is low. For the deeper RAG-versus-long-context math, see RAG vs fine-tuning, with the math.