Pricing breakdown
| Item | Rate per 1M tokens (or limit, where noted) |
|---|---|
| Standard input (≤200K) | $2.50 |
| Standard input (>200K) | $5.00 |
| Standard output (≤200K) | $15.00 |
| Standard output (>200K) | $22.00 |
| Cached input | $0.25 |
| Cache storage | $1.00 per 1M-token-hour |
| Batch input (≤200K) | $1.25 |
| Batch input (>200K) | $2.50 |
| Batch output (≤200K) | $7.50 |
| Batch output (>200K) | $11.00 |
| Free tier | Not offered |
| Context window | 2,000,000 tokens |
| Max output | 100,000 tokens |
Tiered pricing: what changes above 200,000 tokens
Gemini 3.5 Pro uses the same tiered pricing shape Google shipped with Gemini 3.1 Pro, at a higher base rate. Stay at or below 200,000 tokens of context in a call and you pay $2.50/1M input, $15/1M output. Cross that line — even by one token — and the entire call re-prices: $5/1M input (a 100% increase) and $22/1M output (a 46.7% increase, since $22 is $7 more than $15, and $7 ÷ $15 = 0.467). This is a per-call threshold, not a monthly cap, so a workload that mixes short and long calls pays the lower rate on the short ones and the higher rate only on the long ones.
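The per-call re-pricing can be sketched in a few lines of Python. This is a minimal illustration, not billing logic: the function name `call_cost` is hypothetical, and it assumes the tier is selected by the input token count alone, which the pricing table does not spell out.

```python
THRESHOLD_TOKENS = 200_000  # per-call tier boundary from the pricing table

# USD per 1M tokens as (input, output) for each standard tier
SHORT_RATES = (2.50, 15.00)  # context at or below 200K
LONG_RATES = (5.00, 22.00)   # context above 200K


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one standard (non-batch, uncached) call.

    Assumption: the tier is chosen by the input token count alone.
    """
    in_rate, out_rate = (
        LONG_RATES if input_tokens > THRESHOLD_TOKENS else SHORT_RATES
    )
    # The whole call is billed at one tier; there is no marginal pricing.
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


print(call_cost(200_000, 1_000))  # exactly at the threshold: lower tier
print(call_cost(200_001, 1_000))  # one token over: the whole call re-prices
```

Under that assumption, the one extra input token roughly doubles the bill for the call, which is why trimming a prompt that sits just over the line can be worth the effort.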
The 2,000,000-token context window
At 2,000,000 tokens, Gemini 3.5 Pro's context window was an industry first at the frontier tier when it shipped, nearly double GPT-5.6 Sol's 1,100,000 tokens and exactly twice Claude Sonnet 5's 1,000,000 tokens. That headroom is the point of the model: workloads that need more context than any other frontier model tracked in this update offers now have somewhere to go. The trade-offs are the tiered pricing above and the output ceiling. Max output per call is 100,000 tokens, half of Claude Sonnet 5's 200,000-token cap, so extremely long single-response generation still favors Anthropic's new mid-tier model even though Gemini 3.5 Pro can ingest far more.
GPQA Diamond 95.5: the highest score benchr tracks
GPQA Diamond tests PhD-level questions in biology, chemistry, and physics. Gemini 3.5 Pro's 95.5% is the highest score benchr tracks across this update, ahead of Gemini 3.1 Pro (94.3%), Claude Opus 4.8 (93.6%), Claude Sonnet 5 (92.0%), and GPT-5.6 Sol (91.2%). The model also sets a new high for the Gemini family on ARC-AGI-2, at 80.0 versus Gemini 3.1 Pro's 77.1. That lead doesn't carry over to coding: on SWE-bench Verified, Gemini 3.5 Pro's 85.5% trails GPT-5.6 Sol (89.8%) and Claude Sonnet 5 (89.4%). Route reasoning-heavy science and research work here; route coding-heavy agent work elsewhere.
Where Gemini 3.5 Pro fits in the Gemini family
Gemini 3.5 Pro sits above both Gemini 3.1 Pro and Gemini 3.5 Flash — it's the deepest-reasoning, longest-context model Google offers, launched after Flash had already beaten 3.1 Pro on coding benchmarks. If your workload is latency-sensitive or coding-agent-shaped, Flash remains the faster, cheaper pick. If you need more context or reasoning depth than 3.1 Pro provides, 3.5 Pro is the upgrade path. Use benchr's comparison tools or the model rankings to weigh this against non-Gemini options for your specific workload.
Cost scenarios
A single long-context call. One call using 1,000,000 input tokens exceeds the 200K threshold, so the whole call bills at the higher tier: 1,000,000 × $5/1M = $5.00 for input. Add 20,000 output tokens at $22/1M = $0.44. Total: $5.44 for that single call.
A typical month, calls under 200K. At 10M input + 2M output tokens per month, all within the 200K-per-call tier: 10 × $2.50 = $25 input, 2 × $15 = $30 output, total $55/month. The same volume on Claude Opus 4.8 ($5/$25): 10 × $5 = $50 input, 2 × $25 = $50 output, total $100/month — Gemini 3.5 Pro costs 55% of that, a 45% saving. Against GPT-5.6 Sol ($5/$30) at the same volume: $50 + $60 = $110/month — Gemini 3.5 Pro is exactly half.
The same month, calls over 200K. If every call in that 10M/2M month crosses the 200K threshold: 10 × $5 = $50 input, 2 × $22 = $44 output, total $94/month — 70.9% more than the under-200K scenario ($39 more on a $55 base), but still 6% cheaper than Claude Opus 4.8's $100/month and 14.5% cheaper than GPT-5.6 Sol's $110/month at the same volume.
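The two monthly scenarios can be reproduced with a short script. This is a sketch using only the rates quoted in this article; the helper names `monthly_cost` and `percent_change` are illustrative, and it assumes every call in the month lands in a single tier.

```python
def monthly_cost(input_millions: float, output_millions: float,
                 input_rate: float, output_rate: float) -> float:
    """USD per month given token volume in millions and per-1M rates."""
    return input_millions * input_rate + output_millions * output_rate


def percent_change(new: float, base: float) -> float:
    """Signed percentage difference of `new` relative to `base`."""
    return (new - base) / base * 100


VOLUME = (10, 2)  # 10M input and 2M output tokens per month

under_200k = monthly_cost(*VOLUME, 2.50, 15.00)  # Gemini 3.5 Pro, lower tier
over_200k = monthly_cost(*VOLUME, 5.00, 22.00)   # Gemini 3.5 Pro, upper tier
opus = monthly_cost(*VOLUME, 5.00, 25.00)        # Claude Opus 4.8
sol = monthly_cost(*VOLUME, 5.00, 30.00)         # GPT-5.6 Sol

for label, cost in [("under 200K", under_200k), ("over 200K", over_200k)]:
    print(f"{label}: ${cost:.2f}, "
          f"{percent_change(cost, opus):+.1f}% vs Opus 4.8, "
          f"{percent_change(cost, sol):+.1f}% vs GPT-5.6 Sol")
```

A real month usually mixes both tiers, so the true figure would fall somewhere between the two Gemini totals.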
Cached input. At a 90% cache hit rate within the 200K tier: 0.9 × $0.25 + 0.1 × $2.50 = $0.225 + $0.25 = $0.475 effective per million — an 81% reduction from the $2.50 uncached rate. Note the separate $1-per-1M-token-hour cache storage charge applies on top of that discounted read rate, so cache economics depend on how long you hold context in cache, not just the hit rate.
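To see how hold time interacts with the hit rate, here is a rough sketch. The function names are hypothetical, and the storage model is an assumption: it bills linearly per token-hour, treats every read of the cached prefix as a hit, and ignores any cost for the initial cache write, which this article does not list.

```python
UNCACHED_RATE = 2.50  # USD per 1M input tokens, at or below 200K
CACHED_RATE = 0.25    # USD per 1M cached input tokens
STORAGE_RATE = 1.00   # USD per 1M tokens held in cache for one hour


def effective_input_rate(hit_rate: float) -> float:
    """Blended USD per 1M input tokens at a cache hit rate between 0 and 1."""
    return hit_rate * CACHED_RATE + (1 - hit_rate) * UNCACHED_RATE


def cache_net_saving(cached_millions: float, hours: float, reads: int) -> float:
    """Read savings minus storage cost for a prefix held in cache.

    Assumptions: linear token-hour storage billing, every read is a
    cache hit, and the initial cache write is free.
    """
    saved = reads * cached_millions * (UNCACHED_RATE - CACHED_RATE)
    storage = cached_millions * hours * STORAGE_RATE
    return saved - storage


print(f"{effective_input_rate(0.9):.3f}")         # blended rate at 90% hits
print(f"{cache_net_saving(0.1, 1, 5):+.3f}")      # 100K prefix, 1 hour, 5 reads
print(f"{cache_net_saving(0.1, 24, 5):+.3f}")     # same prefix held for 24 hours
```

Under these assumptions, each read saves $2.25 per million cached tokens while each hour of storage costs $1.00, so a cached prefix only pays for itself when it is read more often than about 0.44 times per hour on average.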
Batch processing. Batch is a flat 50% discount in both tiers: $1.25/$7.50 at or below 200K ($1.25 ÷ $2.50 = 50%, $7.50 ÷ $15 = 50%), and $2.50/$11 above it ($2.50 ÷ $5 = 50%, $11 ÷ $22 = 50%). For non-interactive, high-volume jobs where turnaround time isn't the constraint, batch halves the bill regardless of which context tier you're in.
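As a quick check that the discount holds for a job mixing both tiers, the sketch below prices the same calls in standard and batch mode. The `RATES` table mirrors the pricing table; `job_cost` and the sample calls are hypothetical, and the tier is again assumed to follow each call's input token count.

```python
THRESHOLD_TOKENS = 200_000

# USD per 1M tokens as (input, output), keyed by (mode, over_200k)
RATES = {
    ("standard", False): (2.50, 15.00),
    ("standard", True): (5.00, 22.00),
    ("batch", False): (1.25, 7.50),
    ("batch", True): (2.50, 11.00),
}


def job_cost(calls, mode: str) -> float:
    """Total USD for a list of (input_tokens, output_tokens) calls.

    Assumption: each call's tier is chosen by its input token count.
    """
    total = 0.0
    for input_tokens, output_tokens in calls:
        in_rate, out_rate = RATES[(mode, input_tokens > THRESHOLD_TOKENS)]
        total += (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return total


# Hypothetical job: one short call and one long-context call
calls = [(50_000, 2_000), (800_000, 10_000)]
standard = job_cost(calls, "standard")
batch = job_cost(calls, "batch")
print(f"standard ${standard:.4f}, batch ${batch:.4f}, "
      f"ratio {batch / standard:.2f}")
```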
Use-case fit
Best for: Single calls that genuinely need more than the 1,100,000 tokens of context GPT-5.6 Sol offers; PhD-level science and research reasoning where GPQA Diamond depth matters; teams already on Gemini 3.1 Pro who are hitting its context or reasoning ceiling; workloads where the absolute reasoning score matters more than coding throughput.
Skip if: Your calls are mostly under 200,000 tokens and don't need frontier-level GPQA depth — Gemini 3.5 Flash is faster and cheaper for coding-agent work. Skip it too if SWE-bench Verified is your primary metric — Claude Sonnet 5 (89.4%) and GPT-5.6 Sol (89.8%) both outscore Gemini 3.5 Pro's 85.5%. And skip it if you need more than 100,000 tokens of output per call — Claude Sonnet 5's 200,000-token ceiling is double.
Decision checklist
Measure your typical context length before committing: if your p90 call size regularly crosses 200,000 tokens, budget for the $5/$22 tier, not the $2.50/$15 headline rate — the difference compounds quickly at volume, as the cost scenarios above show.
Confirm whether GPQA-style reasoning depth is actually your bottleneck, or whether you're really optimizing for coding throughput. If it's the latter, Gemini 3.5 Flash and Claude Sonnet 5 are both cheaper and score higher on SWE-bench Verified than Gemini 3.5 Pro.
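The context-length measurement in the first checklist item could look something like the sketch below. The sample token counts are made up for illustration, the helper names are hypothetical, and the nearest-rank percentile is one common definition among several; in practice the counts would come from your own request logs.

```python
import math

THRESHOLD_TOKENS = 200_000


def percentile_nearest_rank(values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def share_over_threshold(values, threshold: int = THRESHOLD_TOKENS) -> float:
    """Fraction of calls whose token count exceeds the threshold."""
    return sum(v > threshold for v in values) / len(values)


# Hypothetical per-call token counts pulled from request logs
call_sizes = [12_000, 45_000, 80_000, 150_000, 180_000,
              190_000, 210_000, 240_000, 400_000, 900_000]

p90 = percentile_nearest_rank(call_sizes, 90)
over = share_over_threshold(call_sizes)
print(f"p90 = {p90:,} tokens, {over:.0%} of calls above the threshold")
```

If the p90 lands above 200,000 tokens, as it does for this made-up sample, the $5/$22 tier is the safer number to budget against.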