Claude Opus 4.8 vs Gemini 3.1 Pro

The two strongest frontier models with fully published benchmarks, head to head. Coding accuracy or cheap strong reasoning — that's the choice.

By the benchr team

SWE-bench Verified: Opus 88.6 vs Gemini 80.6
ARC-AGI-2: Gemini 77.1; Opus didn't publish
Input price per 1M tokens: Opus $5 vs Gemini $2
GPQA Diamond: Opus 93.6 / Gemini 94.3, effectively tied

These are the two frontier models with the most complete public benchmark records right now, and they pull in different directions. Claude Opus 4.8 is Anthropic's strongest tier, built around coding and agentic work. Gemini 3.1 Pro is Google's frontier preview, cheaper and tuned to win the hardest abstract-reasoning tests. The decision isn't "which is smarter" — it's whether your budget is going toward coding accuracy or toward strong reasoning at a lower price. For each model on its own terms, see the Claude Opus 4.8 review and the Gemini 3.1 Pro review.

The pricing reference first, since it shapes every other decision. Opus 4.8 lists at $5 per million input tokens and $25 output, per Anthropic's pricing page. Gemini 3.1 Pro lists at $2 input and $12 output for prompts up to 200K tokens, per Google's Gemini API pricing. That's roughly 2.5x cheaper on input and about half the output cost — a real gap if you're running volume. Two caveats keep it honest: Gemini's price climbs to $4 input and $18 output above 200K tokens, and its output is capped at 64K versus Opus's 128K.
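To make the tiering concrete, here is a minimal Python sketch of a per-request cost estimate built only from the list prices quoted above. The function name `request_cost` and the model labels `"opus"` and `"gemini"` are illustrative, not API identifiers. It also assumes that once a Gemini prompt exceeds 200K tokens the higher rates apply to the whole request; check Google's pricing page before relying on that.

```python
# List prices in USD per 1M tokens, as quoted in this article.
OPUS = {"input": 5.00, "output": 25.00}
GEMINI_STANDARD = {"input": 2.00, "output": 12.00}  # prompts up to 200K tokens
GEMINI_LONG = {"input": 4.00, "output": 18.00}      # prompts above 200K tokens
GEMINI_TIER_THRESHOLD = 200_000


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at list price."""
    if model == "opus":
        rates = OPUS
    elif model == "gemini":
        # Assumption: a prompt over the threshold moves the whole request
        # (input and output) onto the higher rates.
        if input_tokens > GEMINI_TIER_THRESHOLD:
            rates = GEMINI_LONG
        else:
            rates = GEMINI_STANDARD
    else:
        raise ValueError(f"unknown model: {model}")
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


if __name__ == "__main__":
    for prompt_size in (100_000, 300_000):
        opus = request_cost("opus", prompt_size, 10_000)
        gemini = request_cost("gemini", prompt_size, 10_000)
        print(f"{prompt_size} in / 10000 out: Opus ${opus:.2f}, Gemini ${gemini:.2f}")
```

With a 100K-token prompt and 10K tokens of output, the sketch gives $0.32 for Gemini against $0.75 for Opus. At a 300K-token prompt the same output costs $1.38 against $1.75, which is the narrowing the second caveat describes.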

One housekeeping note before the comparison. Gemini 3.1 Pro shipped in preview on February 19, 2026, and Google flags that its prices and limits may change. Opus 4.8 is a generally available production model. If you're building something that needs stable pricing and rate limits, weight that difference; if you're experimenting, it matters less.

Claude Opus 4.8 vs Gemini 3.1 Pro at a glance, from each provider's official documentation, June 2026.
| Spec | Claude Opus 4.8 | Gemini 3.1 Pro |
| --- | --- | --- |
| Price (per 1M tokens, in / out) | $5 / $25 | $2 / $12 (≤200K); $4 / $18 above |
| Context window | 1M tokens | 1M tokens |
| Max output | 128K (300K beta) | 64K (includes thinking) |
| SWE-bench Verified | 88.6 | 80.6 |
| GPQA Diamond | 93.6 | 94.3 |
| Availability | Generally available | Preview (no free API tier) |

Coding and agentic work: Opus, clearly

This is the cleanest win on the board. On SWE-bench Verified, the most-cited real-world coding benchmark, Opus 4.8 posts 88.6 against Gemini 3.1 Pro's 80.6. Eight points on that benchmark is the difference between a model that lands most fixes first try and one you'll review more carefully. Opus also leads the harder coding cuts Anthropic publishes: 69.2 on SWE-bench Pro and 84.4 on SWE-bench Multilingual.

The gap widens once you move from writing code to running it. Agentic work — a model operating a computer, a terminal, or a multi-step task on its own — is where Opus 4.8 was tuned, and the numbers show it: 83.4 on OSWorld-Verified, 74.6 on Terminal-Bench 2.1, and a GDPval-AA Elo of 1890 on structured professional tasks. Gemini doesn't publish a comparable agentic suite, so there's no head-to-head number, but Opus's positioning and these scores make it the model to reach for when the work is autonomous rather than conversational.

Winner: Opus, on both code accuracy and agentic execution. If you're paying for a coding assistant or building an agent that touches real files and tools, this is your model, and the price premium is what you're paying for it.

Abstract reasoning and science: Gemini's edge

Flip the workload to hardest-mode reasoning and the lead changes hands. Gemini 3.1 Pro posts 77.1 on ARC-AGI-2, the benchmark built to resist memorization and reward genuine abstraction. The honest framing here matters: Opus 4.8 didn't publish an ARC-AGI-2 number, so this isn't Gemini beating a known Opus score — it's Gemini showing a strength Anthropic chose not to report. Read it as Gemini's clearest reasoning signal, not as a measured gap.

On graduate-level science the two are tied within the noise. GPQA Diamond is 94.3 for Gemini and 93.6 for Opus, a gap of seven tenths of a point that sits well within run-to-run variance. Gemini also leads on multimodal understanding with 80.5 on MMMU-Pro, and posts 92.6 on MMMLU for multilingual knowledge. If your work is research-flavored, science-heavy, or leans on images and charts, Gemini is the stronger fit and the cheaper one at the same time.

Winner: Gemini on abstract reasoning and multimodal, a tie on graduate science. The ARC-AGI-2 result is the standout, with the caveat that there's no Opus figure beside it.

Price: Gemini is much cheaper, with two asterisks

For most workloads Gemini 3.1 Pro is the budget pick by a clear margin. At $2 input and $12 output for prompts up to 200K tokens, it runs at roughly 40% of Opus 4.8's input cost and under half the output cost. Batch pricing keeps the gap: Gemini bills $1 input and $6 output in batch (then $2 / $9 above 200K), while Opus batch is $2.50 / $12.50. At volume, that spread compounds into a real line item.
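As a rough illustration of how the batch spread compounds, the sketch below totals a month of batch traffic at the batch prices quoted above. The workload figures (10,000 requests, 50K input and 2K output tokens each) are hypothetical, as are the function name and model labels. It covers only prompts up to 200K tokens.

```python
# Batch list prices in USD per 1M tokens, as quoted in this article.
BATCH_RATES = {
    "opus": {"input": 2.50, "output": 12.50},
    "gemini": {"input": 1.00, "output": 6.00},  # prompts up to 200K tokens
}


def monthly_batch_cost(model: str, requests: int,
                       input_tokens: int, output_tokens: int) -> float:
    """Total USD cost for `requests` identical batch requests."""
    rates = BATCH_RATES[model]
    total_in = requests * input_tokens
    total_out = requests * output_tokens
    return (total_in * rates["input"] + total_out * rates["output"]) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload, chosen only to illustrate the arithmetic.
    for model in ("gemini", "opus"):
        cost = monthly_batch_cost(model, 10_000, 50_000, 2_000)
        print(f"{model}: ${cost:,.2f}")
```

For that hypothetical workload the totals come to $620 for Gemini and $1,500 for Opus, so the batch discount leaves the ratio between the two roughly where it was at standard prices.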

The two asterisks are where Opus claws value back. First, Gemini's price steps up above 200K input tokens (to $4 input and $18 output), so very long-context jobs narrow the gap considerably. Second, Gemini caps output at 64K tokens and counts thinking tokens against that ceiling, while Opus gives you 128K (300K in beta). If your task produces long structured output or leans on heavy reasoning traces, Opus's larger output budget can matter more than the per-token price. Opus also has its own cost levers: cached input billed at $0.50 per million tokens on a cache hit, batch at half price, and an optional fast mode at $10 / $50 for roughly 2.5x output speed.
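One way to size the caching lever is to ask what share of Opus input would need to be served from cache before its blended input price matches Gemini's $2. The sketch below works that out from the prices quoted above. It is deliberately simplified: it ignores any cache-write charges and any caching discount on the Gemini side, neither of which this article covers, and the function names are illustrative.

```python
# Input prices in USD per 1M tokens, as quoted in this article.
OPUS_INPUT = 5.00       # uncached input
OPUS_CACHE_HIT = 0.50   # input served from cache
GEMINI_INPUT = 2.00     # prompts up to 200K tokens


def blended_input_price(cache_hit_fraction: float) -> float:
    """Average Opus input price when a given fraction of tokens hits the cache."""
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    return (cache_hit_fraction * OPUS_CACHE_HIT
            + (1.0 - cache_hit_fraction) * OPUS_INPUT)


def break_even_hit_fraction(target_price: float) -> float:
    """Cache-hit fraction at which the blended Opus price equals target_price."""
    return (OPUS_INPUT - target_price) / (OPUS_INPUT - OPUS_CACHE_HIT)


if __name__ == "__main__":
    print(f"Blended price at 50% hits: ${blended_input_price(0.5):.2f}")
    print(f"Break-even vs Gemini: {break_even_hit_fraction(GEMINI_INPUT):.1%}")
```

Under these simplifying assumptions, about two thirds of Opus input tokens would have to be cache hits to bring its average input price down to Gemini's standard rate. Output pricing is unaffected either way.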

Winner: Gemini on raw price, decisively, until you hit the 200K cliff or need more than 64K of output.

Context and output: same window, different ceilings

Both models carry a 1M-token context window, so for ingesting large codebases, long documents, or sprawling chat histories they're evenly matched on what they can read. The difference is on the way out. Opus 4.8 can produce up to 128K tokens in a single response, with a 300K beta, while Gemini 3.1 Pro tops out at 64K — and because Gemini's output budget includes its thinking tokens, a heavy reasoning pass eats into the room left for the actual answer.

In practice this rarely bites on chat or short generations, where neither model comes close to the ceiling. It bites on the jobs that produce a lot of text at once: a full document draft, a large refactor returned as one diff, an exhaustive structured extraction. If that's your pattern, Opus's output headroom is a concrete advantage that the price comparison alone won't show you.
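The output-ceiling point can be checked with a small budget calculation. The sketch below treats 64K and 128K as round 64,000 and 128,000 tokens, which is an assumption about the exact limits. It also subtracts thinking tokens from both caps: the article only states that Gemini counts thinking against its ceiling, so applying the same rule to Opus is a deliberately conservative choice. Function names and model labels are illustrative.

```python
# Output caps in tokens, using round numbers for "64K" and "128K".
OUTPUT_CAPS = {"opus": 128_000, "gemini": 64_000}


def answer_room(model: str, thinking_tokens: int) -> int:
    """Tokens left for the visible answer after thinking tokens are spent."""
    return max(OUTPUT_CAPS[model] - thinking_tokens, 0)


def fits(model: str, answer_tokens: int, thinking_tokens: int = 0) -> bool:
    """True if the answer fits in what remains of the output cap."""
    return answer_tokens <= answer_room(model, thinking_tokens)


if __name__ == "__main__":
    # Hypothetical job: a 50K-token diff preceded by 20K tokens of reasoning.
    for model in ("gemini", "opus"):
        print(model, answer_room(model, 20_000), fits(model, 50_000, 20_000))
```

In that hypothetical job, 20K tokens of reasoning leave Gemini 44,000 tokens of room, so a 50K-token answer does not fit. The same job fits within Opus's cap even with thinking counted against it.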

For the broader frame on how these flagships stack against the rest of the field, the Gemini 3.1 Pro vs GPT-5.5 comparison covers the reasoning-versus-knowledge-work axis, and the Opus 4.8 review goes deeper on where Anthropic's tier earns its keep.

Which one for which work

If you're shipping code or building agents, default to Opus 4.8. The SWE-bench lead is wide, the agentic and computer-use scores are strong, and there's no Gemini number that contradicts the picture. Coding accuracy is exactly what the $5/$25 price tag buys.

If you're running high-volume reasoning, research, or multimodal work and price-per-token drives the decision, Gemini 3.1 Pro is the pick. It's much cheaper, ties or wins on the science and reasoning benchmarks, and leads on multimodal — just keep the 200K price cliff and the 64K output cap in view.

If you can run both, the clean split is Opus for the coding and agent layer, Gemini for the high-volume reasoning and analysis layer. And remember that Gemini is still a preview: production work that depends on stable pricing and limits leans toward Opus until Gemini reaches general availability.
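For teams that do run both, the split above can be written down as a simple routing rule. The sketch below is one possible encoding, not a recommended implementation. The job categories, the `Job` fields, and the returned labels `"opus"` and `"gemini"` are all hypothetical, and the 64,000-token figure is a round-number stand-in for Gemini's 64K output cap.

```python
from dataclasses import dataclass

GEMINI_OUTPUT_CAP = 64_000  # round-number stand-in for the 64K cap

OPUS_KINDS = {"coding", "agent"}
GEMINI_KINDS = {"reasoning", "research", "multimodal"}


@dataclass
class Job:
    kind: str                          # one of OPUS_KINDS or GEMINI_KINDS
    expected_output_tokens: int = 0    # rough size of the response
    needs_stable_pricing: bool = False


def pick_model(job: Job) -> str:
    """Return a model label following the article's suggested split."""
    if job.kind not in OPUS_KINDS | GEMINI_KINDS:
        raise ValueError(f"unknown job kind: {job.kind}")
    if job.kind in OPUS_KINDS:
        return "opus"
    # Gemini is a preview, so stability-sensitive work goes to Opus.
    if job.needs_stable_pricing:
        return "opus"
    # Responses larger than Gemini's cap need Opus's output headroom.
    if job.expected_output_tokens > GEMINI_OUTPUT_CAP:
        return "opus"
    return "gemini"


if __name__ == "__main__":
    print(pick_model(Job(kind="coding")))
    print(pick_model(Job(kind="research", expected_output_tokens=5_000)))
    print(pick_model(Job(kind="research", expected_output_tokens=90_000)))
```

The order of the checks encodes a choice: workload type comes first, then the preview caveat, then the output cap. A team with different priorities would reorder them.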

Frequently asked

Which is better, Claude Opus 4.8 or Gemini 3.1 Pro?

It depends on what you're paying for. Opus 4.8 leads coding and agentic work: 88.6 on SWE-bench Verified to Gemini's 80.6, plus strong OSWorld, Terminal-Bench, and GDPval scores for which Gemini publishes no counterpart. Gemini 3.1 Pro is much cheaper ($2 vs $5 input, $12 vs $25 output) and edges ahead on abstract reasoning, posting 77.1 on ARC-AGI-2 where Opus didn't publish a number. GPQA Diamond is basically tied (94.3 vs 93.6). Pick Opus for coding accuracy, Gemini for cheap strong reasoning at scale.

Which model is cheaper to run?

Gemini 3.1 Pro, by a wide margin. It lists at $2 per million input tokens and $12 output for prompts up to 200K tokens, against Opus 4.8's $5 and $25. But watch two things: above 200K input, Gemini jumps to $4 input and $18 output, and its output (which includes thinking tokens) is capped at 64K versus Opus's 128K. Gemini also has no free API tier, only an AI Studio UI trial.

Which is better at coding?

Claude Opus 4.8. It posts 88.6 on SWE-bench Verified to Gemini 3.1 Pro's 80.6, an eight-point gap on the most-cited real-world coding benchmark. Opus also posts strong results on the agentic and computer-use measures that matter for autonomous coding work, where Gemini publishes no comparable suite: 83.4 on OSWorld-Verified, 74.6 on Terminal-Bench 2.1, and a GDPval-AA Elo of 1890. If accuracy on production code is what you're buying, Opus is the pick.

Which is better at reasoning and science questions?

Gemini 3.1 Pro has the edge on abstract reasoning, posting 77.1 on ARC-AGI-2, a benchmark Opus 4.8 didn't publish. On graduate-level science the two are effectively tied: GPQA Diamond is 94.3 for Gemini and 93.6 for Opus. Gemini also leads MMMU-Pro multimodal at 80.5. Treat the ARC-AGI-2 result as Gemini's clearest reasoning advantage, since there's no Opus number to compare it against.

Can I compare the Humanity's Last Exam scores directly?

No. Opus 4.8 reports 49.8 on Humanity's Last Exam with no tools, while Gemini 3.1 Pro reports 51.4 with tools. Those are different test conditions, so the two numbers aren't comparable head-to-head. A model with tool access can look up and compute, which generally lifts the score, so don't read Gemini's higher figure as a clean win on this benchmark.

Is Gemini 3.1 Pro stable enough to build on?

Treat it as a preview. Gemini 3.1 Pro was released in preview on February 19, 2026, and Google notes that prices and limits may change. Opus 4.8 is a generally available production model. If you're shipping something that depends on stable pricing and rate limits, that difference matters; if you're experimenting or can tolerate change, the preview status is less of a concern.

Changelog

  • June 13, 2026 — Originally published. Prices, context and output limits, and all benchmark figures verified against Anthropic's and Google's official documentation.

References

  1. Anthropic, "Claude Pricing," anthropic.com/pricing, accessed June 2026.
  2. Anthropic, "Claude API Documentation," docs.claude.com, accessed June 2026.
  3. Google, "Gemini API Pricing," ai.google.dev/gemini-api/docs/pricing, accessed June 2026.
  4. Google, "Gemini models," ai.google.dev/gemini-api/docs/models, accessed June 2026.
  5. "SWE-bench Verified leaderboard," swebench.com, accessed June 2026.
  6. "ARC Prize," arcprize.org, accessed June 2026.