Most context-window benchmarks measure the wrong thing. They report how much text fits, and stay silent on what the model can actually locate once it's in there.
1M tokens for Claude, per Anthropic's API documentation. 1M for Gemini 3.1 Pro Preview, per Google's Gemini models page. 400K for GPT-5, per OpenAI's platform docs. The marketing pitch behind these numbers has been that more context equals more capability, and that long-window models will simply retire retrieval as a relic of the small-window era. What you find in practice is messier than the pitch. The long window earns its keep in a narrow set of workflows, sits there as expensive overhead in most others, and loses outright to proper retrieval in a third category teams keep trying to force into it.
This piece compares the four serious long-context implementations on the same workload: what they do at the limits of their context windows, where the visible degradation begins, and what the bills come to. Per-token costs throughout are verified against Anthropic's pricing page, OpenAI's API pricing, and Google's published rates. Llama 4 weights and license terms are documented at llama.com. The long context pays for itself on exploratory cross-document reasoning and wastes itself on tasks retrieval would handle better. It also tends to cost the most exactly where it tells you the least. For the case against using long context as a default, see the million-token marketing piece.
Advertised numbers versus working numbers
| Model | Advertised context | Reliable retrieval zone | Cost per 1M input tokens |
|---|---|---|---|
| Claude Opus 4.7 | 1M | To ~600K | $5 |
| Gemini 3.1 Pro Preview | 1M | To ~800K | $2 |
| GPT-5 | 400K | To ~250K | $1.25 |
| Llama 4 Maverick | 1M | To ~250K | varies (self-host) |
The "reliable retrieval zone" column is the practical observation benchmarks rarely report: the rough token count past which recall on multi-fact synthesis tasks degrades visibly. The numbers reflect the research community's multi-fact synthesis evaluations (e.g. arXiv-published long-context studies) and consistent open developer discussion, read alongside the published needle-in-haystack reports rather than derived from them. The advertised number is just the technical maximum the model will accept; the reliable zone is how much of it stays useful. Push past the reliable zone and the model keeps running while synthesis quality falls off faster than simple needle benchmarks suggest.
Gemini 3.1 Pro Preview is the strongest of the four at extreme scale. The 1M window is real, and retrieval inside it holds up further than the alternatives. Claude is second. GPT-5 sits behind both: its nominal window is smaller at 400K, and synthesis degrades past roughly 250K tokens. Llama 4 Maverick's million-token window is technically present, but recall drops clearly past 250K tokens, far short of the advertised limit.
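As a rough illustration of how the two columns differ in practice, the sketch below classifies a prompt size against each model's advertised and reliable limits. The numbers are copied from the table above; the `ContextLimits` type and `classify_prompt` function are hypothetical names introduced here, and the thresholds are approximate observations rather than guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextLimits:
    advertised_tokens: int  # hard maximum the model accepts
    reliable_tokens: int    # rough point past which synthesis recall degrades


# Values taken from the table above; treat them as rough observations.
LIMITS = {
    "Claude Opus 4.7": ContextLimits(1_000_000, 600_000),
    "Gemini 3.1 Pro Preview": ContextLimits(1_000_000, 800_000),
    "GPT-5": ContextLimits(400_000, 250_000),
    "Llama 4 Maverick": ContextLimits(1_000_000, 250_000),
}


def classify_prompt(model: str, prompt_tokens: int) -> str:
    """Return where a prompt of this size lands for the given model."""
    limits = LIMITS[model]
    if prompt_tokens > limits.advertised_tokens:
        return "exceeds window"
    if prompt_tokens > limits.reliable_tokens:
        return "accepted, past reliable zone"
    return "within reliable zone"


if __name__ == "__main__":
    # A 280,000-token document fits every window but not every reliable zone.
    for name in LIMITS:
        print(f"{name}: {classify_prompt(name, 280_000)}")
```

The middle category is the one the advertised number hides: the request succeeds, so nothing in the API response tells you that synthesis quality has started to slip.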
The reliable-retrieval-zone numbers in the table above come from multi-fact synthesis tests, not needle-in-haystack tests. Needle-in-haystack scores would put every model at near-perfect across the advertised window. The gap between the two test families is the thing this whole piece is about.
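One way to see why the two test families diverge is to score the same answer both ways. The sketch below is a toy harness, not any published benchmark's method: `build_haystack`, `needle_score`, and `synthesis_score` are hypothetical helpers, and the all-or-nothing synthesis rule is an assumption chosen to make the contrast visible.

```python
import random


def build_haystack(facts: list[str], filler: str, n_filler: int, seed: int = 0) -> str:
    """Scatter each fact at a random position among n_filler filler sentences."""
    rng = random.Random(seed)
    sentences = [filler] * n_filler
    for fact in facts:
        sentences.insert(rng.randrange(len(sentences) + 1), fact)
    return " ".join(sentences)


def needle_score(answer: str, expected: list[str]) -> float:
    """Needle-style scoring: fraction of planted values found in the answer."""
    lowered = answer.lower()
    hits = sum(1 for value in expected if value.lower() in lowered)
    return hits / len(expected)


def synthesis_score(answer: str, expected: list[str]) -> float:
    """Synthesis-style scoring (assumed rule): credit only if every value is present."""
    return 1.0 if all(v.lower() in answer.lower() for v in expected) else 0.0


if __name__ == "__main__":
    expected = ["41%", "65%", "2030"]
    # An answer that recalls two of the three planted values.
    answer = "The share is 41% today with a target of 65%."
    print(needle_score(answer, expected), synthesis_score(answer, expected))
```

An answer that recovers most of the planted values still looks respectable on the per-fact average, while a question that needs all of them together gets nothing. That is the shape of the gap, whatever the exact scoring rule a given evaluation uses.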
Three workload shapes, three different verdicts
To make the pattern concrete, picture a 280,000-token government policy report (about 200 pages of dense prose) and three different questions you might ask of it. The three workload shapes that follow turn up in legal review and research synthesis, and in any cross-document analysis you might run.
Workload one: the broad pillar question. What are the three pillars of the document and what does it say about progress on each? All four frontier models will give you a workable answer. The consistent pattern in the community discussion is that Gemini and Claude handle this kind of question best when the document is well-structured; GPT-5 sometimes compresses a section it should've summarized in depth. Llama 4 Maverick is the weakest on this shape once the document runs past its reliable retrieval zone.
Workload two: the precise lookup. What is the specific metric the report uses for private-sector contribution to GDP, and what are the current and target values? All four models will produce the right answer when the relevant section is in scope. None is as efficient at this as a basic retrieval system would be. The token cost of asking the full document the question, even with caching, is roughly an order of magnitude more than retrieval would charge. The RAG vs fine-tuning piece has the math.
Workload three: the cross-section synthesis. Are there internal inconsistencies between the housing-affordability claims in the early chapters and the GDP-mix projections in the later chapters? This is the workload that justifies the long window. Retrieval surfaces chunks independently, so it gives the model no way to notice that section A and section M are talking past each other. The frontier models that hold coherence at scale (Gemini and Claude particularly) catch tensions a retrieval pipeline would miss.
The split is structural. Long context earns its place on the cross-section synthesis you cannot get any other way, and retrieval covers the rest.
Long context earns its bill on the cross-document questions you didn't know to ask. The moment you can write the question down in a sentence, you're overpaying.
| Scenario | Figure | Note |
|---|---|---|
| 8K query | $0.12 | Opus, per request |
| 50K query | $0.75 | Opus, per request |
| 200K query | $1.00 | Opus, per request |
| 600K query | $9.00 | Opus, per request |
| RAG retrieval | $0.06 | Same answer, 4K tokens |
| Cached prefix | 10% | Of standard input price |
| Date | Context window · model | What fit in one prompt |
|---|---|---|
| 2022 | 4K · GPT-3.5 | One letter, one email, one short article. That was it. |
| 2023 | 32K · GPT-4 (32K variant) | A short report, a small codebase, a long memo. |
| 2023 | 200K · Claude 2.1 | A novella, a long technical document, a production codebase. |
| Feb 2024 | 1M · Gemini 1.5 Pro | First mainstream million-token context. A textbook in one prompt. |
| Apr 2025 | 10M · Llama 4 Scout | The whole codebase, the whole corpus. Effective zone closer to 2M. |
One caveat: the reliable-retrieval-zone numbers reflect the multi-fact synthesis literature on legal, scientific, and policy documents. They are consistent across those domains in the published reports, but they do not necessarily generalize to every document type. Code, structured data, transcripts, and conversational logs have different failure modes. Treat the numbers as a starting point, not a ceiling.
The cost picture
The 280,000-token version of the query, on Claude Opus 4.7, costs about $1.40 per question in input tokens. The same question answered against a proper vector store with the relevant chunks retrieved costs around $0.04. That's a 35× difference. At one question a day, nobody notices. At 500 questions a day, that gap decides your architecture for you.
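The arithmetic behind those figures is short enough to check directly. This is a minimal sketch using only the numbers quoted in this piece (the $5 per 1M input price from the table, the 280,000-token document, the $0.04 retrieval figure, and 500 questions a day); it counts input tokens only and ignores output tokens, and `input_cost` is a hypothetical helper name.

```python
INPUT_PRICE_PER_MTOK = 5.00  # Claude Opus 4.7 input price from the table, USD
DOCUMENT_TOKENS = 280_000    # the policy report used as the running example
RETRIEVAL_COST = 0.04        # per-question retrieval figure quoted above, USD
QUESTIONS_PER_DAY = 500


def input_cost(tokens: int, price_per_mtok: float) -> float:
    """Input-token cost in USD for a single request."""
    return tokens / 1_000_000 * price_per_mtok


long_context = input_cost(DOCUMENT_TOKENS, INPUT_PRICE_PER_MTOK)
ratio = long_context / RETRIEVAL_COST

print(f"long context: ${long_context:.2f} per question")
print(f"retrieval:    ${RETRIEVAL_COST:.2f} per question ({ratio:.0f}x cheaper)")
print(
    f"daily at {QUESTIONS_PER_DAY} questions: "
    f"${long_context * QUESTIONS_PER_DAY:.2f} vs "
    f"${RETRIEVAL_COST * QUESTIONS_PER_DAY:.2f}"
)
```

Under these inputs the gap is $1.40 against $0.04 per question, which at 500 questions a day becomes roughly $700 against $20.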
Caching changes this picture a lot. If the same long document gets queried repeatedly, Claude's prompt cache drops the input cost on later queries to roughly 10% of the standard rate. Gemini's caching works on the same mechanism and comes out cheaper in absolute terms. With caching on, the long-context query against a frequently-reused document costs roughly $0.40 per question on Claude. That is still ten times what retrieval would charge, but it lands inside the range where the workflows that genuinely need long context can justify paying it. For the broader cost picture across workloads, see price per use case.
The decision rule
For exploratory questions on a single document, or for cross-section reasoning where the answer might depend on a relationship between distant parts of the source, long context is the right tool. The token cost is high, and it buys you a capability retrieval simply cannot offer.
For precise lookup where you can write the question in a sentence, retrieval wins on every dimension. Cost drops by an order of magnitude and latency drops with it. Accuracy on the specific lookup holds at least even and often comes out ahead, since the model is working a tight context window rather than a sprawling one.
For high-volume question answering against a fixed corpus, retrieval is the only sensible architecture. Long context at scale gets prohibitively expensive in ways no amount of caching fully fixes.
For corpora that exceed the context window of any available model, retrieval is mandatory; there's no decision left to make.
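The four rules above can be sketched as a small routing function. Everything here is illustrative: the `Workload` fields and `choose_architecture` are hypothetical names, the flags stand in for judgments a team would make by hand, and the ordering (hard size limit first, then volume, then question shape) is an assumption about how to break ties that the rules themselves do not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    corpus_tokens: int    # total size of the source material
    window_tokens: int    # context window of the model you plan to use
    cross_section: bool   # answer may hinge on distant parts relating to each other
    precise_lookup: bool  # question can be written down in a sentence
    high_volume: bool     # many questions against a fixed corpus


def choose_architecture(w: Workload) -> str:
    """Apply the decision rule, hardest constraint first."""
    if w.corpus_tokens > w.window_tokens:
        return "retrieval"  # corpus does not fit: no decision left to make
    if w.high_volume:
        return "retrieval"  # cost at scale dominates, even with caching
    if w.precise_lookup and not w.cross_section:
        return "retrieval"  # cheaper, faster, at least as accurate
    # What remains is exploratory or cross-section reasoning on a source that fits.
    return "long context"


if __name__ == "__main__":
    report = Workload(
        corpus_tokens=280_000,
        window_tokens=1_000_000,
        cross_section=True,
        precise_lookup=False,
        high_volume=False,
    )
    print(choose_architecture(report))
```

Note how few paths lead to the long window. Three of the four branches return retrieval, which matches the split described earlier: long context for the cross-section question, retrieval for the rest.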
The million-token context era has produced a demonstrated capability you should use deliberately. It complements retrieval by handling a different kind of question, rather than displacing it. Treating the long window as a universal substitute for retrieval is the most common architectural mistake in early 2026, and it's the one behind the most surprising AI bills.
Among the four models compared here, Gemini 3.1 Pro Preview is the strongest long-context implementation right now per the public multi-fact synthesis reports, with Claude Opus 4.7 a close second. The choice between them comes down to the rest of the workload: Gemini for vision-heavy work, Claude for code and honest hedging, with the long-context capability roughly tied either way. GPT-5's long context is competent but trails the leaders on synthesis past roughly 250K tokens. Llama 4 Maverick's long context is genuine but degrades earlier in practice than the closed alternatives, so skip it for serious long-document work today.
In a production system, default to retrieval for most of the workload and reach for long context only when the question is genuinely cross-cutting. The cost dynamics make that the only sensible choice at any meaningful volume, and the capability story keeps it architecturally right even when volume is low. For the deeper RAG-versus-long-context math, see RAG vs fine-tuning, with the math.