Gemini 3.5 Pro review: Google’s 2M-context reasoning flagship

A 2,000,000-token context window, new highs on ARC-AGI-2 and GPQA Diamond, and a 200K pricing cliff worth planning around.

By the benchr team · · View changelog · Figures verified against official sources, 1 July 2026

benchr rating: 4.6 / 5

Google promised a new Gemini at I/O in May, with a loose "June 2026" timeline attached. Gemini 3.5 Pro landed on June 30, one day inside that window, and it's built around raw scale. The headline is a 2,000,000-token input context window, an industry-first at the frontier tier the day it shipped. That number matters less as a marketing line and more as a positioning statement: after Gemini 3.5 Flash spent its own launch beating the older 3.1 Pro tier on coding benchmarks, Google needed an answer at the top of the family, and this is it.

The case in one line: Gemini 3.5 Pro is the deepest reasoning and the longest context Google currently ships, positioned above both Gemini 3.1 Pro, which stays current as the cheaper reasoning option, and Gemini 3.5 Flash, which stays current as the faster, cheaper pick for coding agents. Whether the upgrade is worth it depends on whether your workload actually needs the ceiling this model reaches for, and on how carefully you manage the pricing line sitting at 200,000 tokens.

Standard input / 1M $2.50 Output is $15 / 1M under 200K; jumps to $5 / $22 above it
Input context window 2M 2,000,000 tokens in, 100,000 out — an industry-first at the frontier tier
GPQA Diamond 95.5 Highest GPQA score benchr tracks, ahead of Claude Opus 4.8's 93.6
ARC-AGI-2 80.0 New Gemini-family high, up from 3.1 Pro's 77.1

The 2,000,000-token context window

Start with the number Google is leading with. Gemini 3.5 Pro's context window is 2,000,000 tokens, double Gemini 3.1 Pro's 1,000,000 and double Gemini 3.5 Flash's 1,000,000, and it was an industry-first at the frontier tier when it shipped. Max output is 100,000 tokens. The API model ID is gemini-3.5-pro, reachable wherever Google ships its Gemini API and Vertex AI access.

A window that size is a genuine capability for long-document and long-video work, research synthesis across large corpora, and any job where the alternative is chunking and retrieval. It's also, as with every big-context model, an invitation to overuse. benchr has written before about how million-token claims get marketed versus what they cost to actually fill, and that gap is exactly what the pricing section below is about.

The pricing, tier by tier

Gemini 3.5 Pro is priced in two tiers by prompt size, per Google's Gemini API pricing page. For the first 200,000 tokens in a request, standard rates are $2.50 input and $15 output per million. Cross that line and the whole request, input and output alike, reprices to $5 input and $22 output per million — the same tiered shape as Gemini 3.1 Pro, just anchored to a higher base rate. Batch jobs run at half of standard: $1.25/$7.50 at or under 200K, $2.50/$11 above it. Cached input reads are $0.25 per million, a 90% discount off the $2.50 standard rate, and Google separately charges $1 per million tokens per hour to keep that cache warm, the same convention it uses for Gemini 3.1 Pro and Gemini 3.5 Flash. Unlike Flash, there's no free API tier here.

Working out whether that tiered math beats a flat-priced alternative for your specific job is what benchr's cost calculator and price-per-use-case breakdown exist to settle, and if your prompts are creeping past 200K without a real reason, the tactics in cutting token usage apply here as directly as they do to any tiered model.

New highs on reasoning

The benchmark case rests on two numbers Google is happy to have compared: ARC-AGI-2 and GPQA Diamond. Gemini 3.5 Pro scores 80.0 on ARC-AGI-2, up from Gemini 3.1 Pro's 77.1 and a new high for the Gemini family. On GPQA Diamond it scores 95.5, up from 3.1 Pro's 94.3 and also a new Gemini-family high — and at 95.5, it's currently the single highest GPQA Diamond score benchr tracks across any model, ahead of Claude Opus 4.8's 93.6.

Gemini 3.5 Pro benchmarks, per Google's official pricing page and model card
BenchmarkScoreNote
GPQA Diamond95.5Up from Gemini 3.1 Pro's 94.3; highest GPQA score benchr tracks
ARC-AGI-280.0Up from Gemini 3.1 Pro's 77.1; new Gemini-family high
SWE-bench Verified85.5%Coding on real GitHub issues
LMSYS Arena1420Human-preference head-to-head Elo
MMLU93.0%Broad knowledge benchmark
HumanEval92.0%Code generation
MATH92.5%Competition mathematics

Read the "highest GPQA score benchr tracks" line carefully. It's a snapshot claim, not a permanent one: Gemini 3.5 Pro shipped June 30, 2026, one day before OpenAI moved GPT-5.6 to general availability and two days before Anthropic launched Claude Sonnet 5 alongside restoring Claude Fable 5 to all customers. None of that changes what Gemini 3.5 Pro is on its own merits, but the field it's being compared against moved the same week — see benchr's GPT-5.6 launch coverage and Claude Sonnet 5 launch coverage for the other side of it.

The pitch isn't just a bigger number. It's whether your workload can use two million tokens without paying for that headroom every time a request reaches past the middle of the window.

Where it sits in Google's lineup

Gemini 3.5 Pro is the model to reach for when the task itself needs the ceiling: the deepest reasoning in the Gemini family, the widest context window Google ships, and vision work that spans long documents or long video, where the extra context room does real work instead of sitting unused. If you're running graduate-level reasoning, research synthesis across large document sets, or multimodal analysis that genuinely needs more than a million tokens of context, this is the Google model built for that job.

Skip it, or at least don't reach for it by default, if your workload is a high-volume coding agent loop. Gemini 3.5 Flash stays current specifically because it's faster and cheaper for that job, and Google positions it that way on purpose. Skip it too if you're cost-sensitive and your prompts routinely land in the 150K-to-250K range, because that's exactly where the 200K pricing line bites hardest and least predictably. Gemini 3.1 Pro remains the cheaper reasoning pick for work that fits comfortably under its own 200K tier and doesn't need the extra headroom this model reaches for. To see where all three land against the rest of the field, benchr's model rankings and compare tool put them side by side.

The verdict

Gemini 3.5 Pro earns a strong score on capability: an industry-first 2,000,000-token context window, and new highs for the Gemini family on both ARC-AGI-2 (80.0) and GPQA Diamond (95.5, the highest GPQA score benchr currently tracks anywhere). The tiered pricing at $2.50/$15 under 200K and $5/$22 above it is legible once you know where the line sits, and it follows the same shape Gemini 3.1 Pro already established, so there are no surprises if you've priced a tiered Gemini model before.

Go with Gemini 3.5 Pro if your work needs the deepest reasoning in the Gemini family, a context window measured in the millions, or both, and you can either keep requests under 200K tokens or accept the over-200K rate as the cost of that headroom. Skip it if you're running a high-volume coding agent loop, where Gemini 3.5 Flash is faster, cheaper, and the model Google itself points you toward. And treat Gemini 3.1 Pro as the value pick if your reasoning work fits under its own 200K line and doesn't need the extra ceiling this model reaches for. For the reasoning and context crown, this is currently the strongest seat in Google's lineup.

Frequently asked

How much does Gemini 3.5 Pro cost?

On the standard paid tier, per Google's official Gemini API pricing page, Gemini 3.5 Pro is $2.50 per million input tokens and $15 per million output for the first 200,000 tokens in a request. Cross that line and the whole request, input and output alike, reprices to $5 per million input and $22 per million output. Cached input reads are $0.25 per million, a 90% discount off the standard rate, and cache storage runs $1 per million tokens per hour, the same convention Google uses for Gemini 3.1 Pro and Gemini 3.5 Flash. Batch jobs run half of standard: $1.25/$7.50 at or under 200K, $2.50/$11 above it. There is no free API tier.

What is the context window and max output?

Gemini 3.5 Pro has a 2,000,000-token input context window, an industry-first at the frontier tier when it shipped on June 30, 2026, and a maximum output of 100,000 tokens. The API model ID is gemini-3.5-pro.

Is Gemini 3.5 Pro better than Gemini 3.1 Pro?

On reasoning, yes, and by a wide margin on paper: Gemini 3.5 Pro scores 80.0 on ARC-AGI-2 versus Gemini 3.1 Pro's 77.1, and 95.5 on GPQA Diamond versus 3.1 Pro's 94.3, both new highs for the Gemini family. Gemini 3.1 Pro remains a current, cheaper option for reasoning work that fits comfortably under its own 200K pricing line; Gemini 3.5 Pro is the pick when you need the deepest reasoning, the widest context, or both.

Does Gemini 3.5 Pro replace Gemini 3.5 Flash for coding agents?

No. Gemini 3.5 Flash stays current as Google's pick for high-volume coding and agentic loops, where its lower per-token cost and faster output matter more than peak reasoning depth. Gemini 3.5 Pro sits above both Flash and Gemini 3.1 Pro as the model to reach for when the task itself demands the deepest reasoning or the longest context, not the cheapest loop.

How does Gemini 3.5 Pro's GPQA Diamond score compare to other frontier models?

At 95.5, Gemini 3.5 Pro's GPQA Diamond score is the single highest benchr currently tracks, ahead of Claude Opus 4.8's 93.6. It shipped June 30, 2026, the same week OpenAI moved GPT-5.6 to general availability and Anthropic launched Claude Sonnet 5, making early July 2026 an unusually crowded week for frontier reasoning claims.

Changelog

  • July 1, 2026 — Originally published. Gemini 3.5 Pro launched June 30, 2026; the $2.50/$15 standard tier, $5/$22 over-200K tier, $0.25 cached-input rate, $1-per-million-token-hour cache storage charge, batch rates, 2,000,000-token context window, 100,000-token max output, and gemini-3.5-pro API model ID verified against Google's official Gemini API pricing page and the Gemini 3.5 Pro model card. ARC-AGI-2, GPQA Diamond, SWE-bench Verified, MMLU, HumanEval, MATH, and LMSYS Arena figures are Google's own published benchmark results from the same model card.

References

  1. Google, "Gemini 3.5 Pro," blog.google, June 30, 2026.
  2. Google, "Gemini API pricing," ai.google.dev/gemini-api/docs/pricing, accessed July 2026.
  3. Google DeepMind, "Gemini 3.5 Pro model card," deepmind.google, accessed July 2026.