Google promised a new Gemini at I/O in May, with a loose "June 2026" timeline attached. Gemini 3.5 Pro landed on June 30, the last day of that window, and it's built around raw scale. The headline is a 2,000,000-token input context window, an industry first at the frontier tier on the day it shipped. That number matters less as a marketing line and more as a positioning statement: after Gemini 3.5 Flash spent its own launch beating the older 3.1 Pro tier on coding benchmarks, Google needed an answer at the top of the family, and this is it.
The case in one line: Gemini 3.5 Pro is the deepest reasoning and the longest context Google currently ships, positioned above both Gemini 3.1 Pro, which stays current as the cheaper reasoning option, and Gemini 3.5 Flash, which stays current as the faster, cheaper pick for coding agents. Whether the upgrade is worth it depends on whether your workload actually needs the ceiling this model reaches for, and on how carefully you manage the pricing line sitting at 200,000 tokens.
The 2,000,000-token context window
Start with the number Google is leading with. Gemini 3.5 Pro's context window is 2,000,000 tokens, double the 1,000,000 offered by both Gemini 3.1 Pro and Gemini 3.5 Flash, and it was an industry first at the frontier tier when it shipped. Max output is 100,000 tokens. The API model ID is gemini-3.5-pro, available through Google's Gemini API and Vertex AI.
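As a rough illustration of the limits just quoted, here is a minimal pre-flight check, assuming the 2,000,000-token figure applies to input and the 100,000-token figure to output. The function name `check_request` is hypothetical, and the token counts are assumed to come from whatever tokenizer or counting endpoint you already use; nothing here calls a real API.

```python
MODEL_ID = "gemini-3.5-pro"     # API model ID quoted in the article
MAX_INPUT_TOKENS = 2_000_000    # input context window quoted in the article
MAX_OUTPUT_TOKENS = 100_000     # max output quoted in the article


def check_request(prompt_tokens: int, max_output_tokens: int) -> list[str]:
    """Return a list of problems; an empty list means the request fits.

    Hypothetical helper: token counts are supplied by the caller.
    """
    problems = []
    if prompt_tokens > MAX_INPUT_TOKENS:
        over = prompt_tokens - MAX_INPUT_TOKENS
        problems.append(f"prompt is {over:,} tokens over the input window")
    if max_output_tokens > MAX_OUTPUT_TOKENS:
        over = max_output_tokens - MAX_OUTPUT_TOKENS
        problems.append(f"requested output is {over:,} tokens over the cap")
    return problems
```

A check like this only says whether a request fits; it says nothing about what the request will cost.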
A window that size is a genuine capability for long-document and long-video work, research synthesis across large corpora, and any job where the alternative is chunking and retrieval. It's also, as with every big-context model, an invitation to overuse. benchr has written before about how million-token claims get marketed versus what they cost to actually fill, and that gap is exactly what the pricing section below is about.
The pricing, tier by tier
Gemini 3.5 Pro is priced in two tiers by prompt size, per Google's Gemini API pricing page. For requests with prompts at or under 200,000 tokens, standard rates are $2.50 input and $15 output per million tokens. Cross that line and the whole request, input and output alike, reprices to $5 input and $22 output per million tokens, the same tiered shape as Gemini 3.1 Pro, just anchored to a higher base rate. Batch jobs run at half of standard: $1.25/$7.50 at or under 200K, $2.50/$11 above it. Cached input reads are $0.25 per million tokens, a 90% discount off the $2.50 standard rate, and Google separately charges $1 per million tokens per hour to keep that cache warm, the same convention it uses for Gemini 3.1 Pro and Gemini 3.5 Flash. Unlike Flash, there's no free API tier here.
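The tier rule is easier to reason about as arithmetic. The sketch below encodes the rates quoted above, assuming the tier is picked by prompt size alone and then applied to the whole request. The names `request_cost`, `RATES`, and `TIER_THRESHOLD` are illustrative, not part of any Google SDK, and caching is left out.

```python
from decimal import Decimal

TIER_THRESHOLD = 200_000  # prompt tokens; at or under this, the lower tier applies
MILLION = Decimal(1_000_000)

# (input rate, output rate) in USD per million tokens, as quoted in the article.
RATES = {
    "standard": {"low": (Decimal("2.50"), Decimal("15")),
                 "high": (Decimal("5"), Decimal("22"))},
    "batch": {"low": (Decimal("1.25"), Decimal("7.50")),
              "high": (Decimal("2.50"), Decimal("11"))},
}


def request_cost(prompt_tokens: int, output_tokens: int,
                 mode: str = "standard") -> Decimal:
    """Estimate the USD cost of one request.

    The tier is chosen by prompt size, and that tier's rates apply to
    both input and output, not just to the tokens past the threshold.
    """
    tier = "low" if prompt_tokens <= TIER_THRESHOLD else "high"
    input_rate, output_rate = RATES[mode][tier]
    return (prompt_tokens * input_rate + output_tokens * output_rate) / MILLION
```

Under these assumptions, a 100,000-token prompt with 10,000 output tokens comes to $0.40 at standard rates, while a 300,000-token prompt with the same output comes to $1.72.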
Working out whether that tiered math beats a flat-priced alternative for your specific job is what benchr's cost calculator and price-per-use-case breakdown exist to settle. And if your prompts are creeping past 200K without a real reason, the tactics in cutting token usage apply here as directly as they do to any tiered model.
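The caching rates quoted in this section also have a break-even point worth working out. The sketch below compares resending a prefix at the $2.50 standard rate against reading it from cache at $0.25 plus $1 per million tokens per hour of storage. It assumes the under-200K tier and ignores whatever the initial cache write costs, since the article does not state it; `caching_net_savings` is a hypothetical helper.

```python
from fractions import Fraction

STANDARD_INPUT = Fraction("2.50")  # USD per million tokens, under-200K tier
CACHED_READ = Fraction("0.25")     # USD per million cached tokens read
STORAGE_PER_HOUR = Fraction(1)     # USD per million cached tokens per hour


def caching_net_savings(cached_tokens: int, reads: int, hours: int) -> Fraction:
    """USD saved by caching a prefix instead of resending it each time.

    Negative results mean storage cost more than the reads saved.
    Assumption: the cost of the initial cache write is not modeled.
    """
    millions = Fraction(cached_tokens, 1_000_000)
    saved_on_reads = reads * millions * (STANDARD_INPUT - CACHED_READ)
    storage_cost = millions * STORAGE_PER_HOUR * hours
    return saved_on_reads - storage_cost


# Reads per hour needed for the read discount to cover storage.
BREAK_EVEN_READS_PER_HOUR = STORAGE_PER_HOUR / (STANDARD_INPUT - CACHED_READ)
```

With these numbers the break-even works out to 4/9 of a read per hour, so under these assumptions a cached prefix pays for itself if it is reused about once every two hours or more often.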
New highs on reasoning
The benchmark case rests on two numbers Google is happy to have compared: ARC-AGI-2 and GPQA Diamond. Gemini 3.5 Pro scores 80.0 on ARC-AGI-2, up from Gemini 3.1 Pro's 77.1 and a new high for the Gemini family. On GPQA Diamond it scores 95.5, up from 3.1 Pro's 94.3 and also a new Gemini-family high — and at 95.5, it's currently the single highest GPQA Diamond score benchr tracks across any model, ahead of Claude Opus 4.8's 93.6.
| Benchmark | Score | Note |
|---|---|---|
| GPQA Diamond | 95.5 | Up from Gemini 3.1 Pro's 94.3; highest GPQA score benchr tracks |
| ARC-AGI-2 | 80.0 | Up from Gemini 3.1 Pro's 77.1; new Gemini-family high |
| SWE-bench Verified | 85.5% | Coding on real GitHub issues |
| LMSYS Arena | 1420 | Human-preference head-to-head Elo |
| MMLU | 93.0% | Broad knowledge benchmark |
| HumanEval | 92.0% | Code generation |
| MATH | 92.5% | Competition mathematics |
Read the "highest GPQA score benchr tracks" line carefully. It's a snapshot claim, not a permanent one: Gemini 3.5 Pro shipped June 30, 2026, one day before OpenAI moved GPT-5.6 to general availability and two days before Anthropic launched Claude Sonnet 5 and restored Claude Fable 5 to all customers. None of that changes what Gemini 3.5 Pro is on its own merits, but the field it's being compared against moved the same week. See benchr's GPT-5.6 launch coverage and Claude Sonnet 5 launch coverage for the other side of it.
The pitch isn't just a bigger number. It's whether your workload can use two million tokens without paying for that headroom every time a request crosses the 200,000-token pricing line, which sits just a tenth of the way into the window.
Where it sits in Google's lineup
Gemini 3.5 Pro is the model to reach for when the task itself needs the ceiling: the deepest reasoning in the Gemini family, the widest context window Google ships, and vision work that spans long documents or long video, where the extra context room does real work instead of sitting unused. If you're running graduate-level reasoning, research synthesis across large document sets, or multimodal analysis that genuinely needs more than a million tokens of context, this is the Google model built for that job.
Skip it, or at least don't reach for it by default, if your workload is a high-volume coding agent loop. Gemini 3.5 Flash stays current specifically because it's faster and cheaper for that job, and Google positions it that way on purpose. Skip it too if you're cost-sensitive and your prompts routinely land in the 150K-to-250K range, because that's exactly where the 200K pricing line bites hardest and least predictably. Gemini 3.1 Pro remains the cheaper reasoning pick for work that fits comfortably under its own 200K tier and doesn't need the extra headroom this model reaches for. To see where all three land against the rest of the field, benchr's model rankings and compare tool put them side by side.
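To make the 150K-to-250K warning concrete, here is a small sketch that prices four prompt sizes around the line at standard rates. The 4,000-token output size is a made-up figure for illustration, and `standard_cost` is a hypothetical helper built only from the rates quoted in this article.

```python
from decimal import Decimal


def standard_cost(prompt_tokens: int, output_tokens: int) -> Decimal:
    """USD cost of one request at standard (non-batch, uncached) rates."""
    if prompt_tokens <= 200_000:
        input_rate, output_rate = Decimal("2.50"), Decimal("15")
    else:
        # Past the line, the whole request reprices, input and output alike.
        input_rate, output_rate = Decimal("5"), Decimal("22")
    total = prompt_tokens * input_rate + output_tokens * output_rate
    return total / Decimal(1_000_000)


OUTPUT_TOKENS = 4_000  # hypothetical response size

for prompt in (150_000, 200_000, 200_001, 250_000):
    cost = standard_cost(prompt, OUTPUT_TOKENS)
    print(f"{prompt:>7,} prompt tokens -> ${cost:.4f}")
```

Under these assumptions, a single extra prompt token moves the request from $0.56 to roughly $1.09, which is why prompts that hover around the line are hard to budget for.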
The verdict
Gemini 3.5 Pro earns a strong score on capability: an industry-first 2,000,000-token context window, and new highs for the Gemini family on both ARC-AGI-2 (80.0) and GPQA Diamond (95.5, the highest GPQA score benchr currently tracks anywhere). The tiered pricing at $2.50/$15 under 200K and $5/$22 above it is legible once you know where the line sits, and it follows the same shape Gemini 3.1 Pro already established, so there are no surprises if you've priced a tiered Gemini model before.
Go with Gemini 3.5 Pro if your work needs the deepest reasoning in the Gemini family, a context window measured in the millions, or both, and you can either keep requests under 200K tokens or accept the over-200K rate as the cost of that headroom. Skip it if you're running a high-volume coding agent loop, where Gemini 3.5 Flash is faster, cheaper, and the model Google itself points you toward. And treat Gemini 3.1 Pro as the value pick if your reasoning work fits under its own 200K line and doesn't need the extra ceiling this model reaches for. For the reasoning and context crown, this is currently the strongest seat in Google's lineup.