These are the two frontier models with the most complete public benchmark records right now, and they pull in different directions. Claude Opus 4.8 is Anthropic's strongest tier, built around coding and agentic work. Gemini 3.1 Pro is Google's frontier preview, cheaper and tuned to win the hardest abstract-reasoning tests. The decision isn't "which is smarter" — it's whether your budget is going toward coding accuracy or toward strong reasoning at a lower price. For each model on its own terms, see the Claude Opus 4.8 review and the Gemini 3.1 Pro review.
Pricing comes first, since it shapes every other decision. Opus 4.8 lists at $5 per million input tokens and $25 per million output tokens, per Anthropic's pricing page. Gemini 3.1 Pro lists at $2 input and $12 output for prompts up to 200K tokens, per Google's Gemini API pricing. That puts Gemini at 40% of Opus's input price and just under half its output price, a real gap if you're running volume. Two caveats keep it honest: Gemini's price climbs to $4 input and $18 output above 200K tokens, and its output is capped at 64K versus Opus's 128K.
One housekeeping note before the comparison. Gemini 3.1 Pro shipped in preview on February 19, 2026, and Google flags that its prices and limits may change. Opus 4.8 is a generally available production model. If you're building something that needs stable pricing and rate limits, weight that difference; if you're experimenting, it matters less.
| Spec | Claude Opus 4.8 | Gemini 3.1 Pro |
|---|---|---|
| Price (USD per 1M tokens, input / output) | $5 / $25 | $2 / $12 (≤200K); $4 / $18 above |
| Context window | 1M tokens | 1M tokens |
| Max output | 128K (300K beta) | 64K (includes thinking) |
| SWE-bench Verified (%) | 88.6 | 80.6 |
| GPQA Diamond (%) | 93.6 | 94.3 |
| Availability | Generally available | Preview (no free API tier) |
Coding and agentic work: Opus, clearly
This is the cleanest win on the board. On SWE-bench Verified, the most-cited real-world coding benchmark, Opus 4.8 posts 88.6 against Gemini 3.1 Pro's 80.6. Eight points understates it: flipped into failure rates, Opus leaves 11.4% of tasks unresolved and Gemini 19.4%, so Gemini misses about 1.7 times as often and its patches deserve closer review. Opus also leads the harder coding cuts Anthropic publishes: 69.2 on SWE-bench Pro and 84.4 on SWE-bench Multilingual.
Opus's case strengthens once you move from writing code to running it. Agentic work, meaning a model operating a computer, a terminal, or a multi-step task on its own, is where Opus 4.8 was tuned, and the numbers show it: 83.4 on OSWorld-Verified, 74.6 on Terminal-Bench 2.1, and a GDPval-AA Elo of 1890 on structured professional tasks. Gemini doesn't publish a comparable agentic suite, so there's no head-to-head number, but Opus's positioning and these scores make it the model to reach for when the work is autonomous rather than conversational.
Winner: Opus, on both code accuracy and agentic execution. If you're paying for a coding assistant or building an agent that touches real files and tools, this is your model, and the price premium is what you're paying for it.
Abstract reasoning and science: Gemini's edge
Flip the workload to hardest-mode reasoning and the lead changes hands. Gemini 3.1 Pro posts 77.1 on ARC-AGI-2, the benchmark built to resist memorization and reward genuine abstraction. The honest framing here matters: Anthropic didn't publish an ARC-AGI-2 number for Opus 4.8, so this isn't Gemini beating a known Opus score; it's Gemini showing a strength Anthropic chose not to report. Read it as Gemini's clearest reasoning signal, not as a measured gap.
On graduate-level science the two are tied within the noise. GPQA Diamond is 94.3 for Gemini and 93.6 for Opus: seven tenths of a point, well within run-to-run variance. Gemini also leads on multimodal understanding, posting 80.5 on MMMU-Pro, and scores 92.6 on MMMLU for multilingual knowledge. If your work is research-flavored, science-heavy, or leans on images and charts, Gemini is the stronger fit and the cheaper one at the same time.
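For a rough sense of why a 0.7-point GPQA Diamond gap reads as a tie, here is a minimal sketch that estimates sampling error from the benchmark's size (198 questions). It treats each score as a simple binomial proportion and the two runs as independent, which is a simplification and only a proxy for the run-to-run variance mentioned above.

```python
import math

# 198 is the published size of the GPQA Diamond question set.
N_QUESTIONS = 198


def standard_error_points(score_pct: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


gemini, opus = 94.3, 93.6  # scores quoted in the article
gap = gemini - opus        # about 0.7 points

# Assumption: the two runs are independent samples, so their errors
# combine in quadrature.
se_of_gap = math.hypot(standard_error_points(gemini), standard_error_points(opus))

# The same gap expressed as a number of questions.
questions_apart = gap / 100.0 * N_QUESTIONS

print(f"gap = {gap:.1f} pts (~{questions_apart:.1f} questions), "
      f"SE of gap = {se_of_gap:.1f} pts")
```

Under these assumptions the gap is between one and two questions and sits well inside one standard error, which is consistent with calling it a tie.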
Winner: Gemini on abstract reasoning and multimodal, a tie on graduate science. The ARC-AGI-2 result is the standout, with the caveat that there's no Opus figure beside it.
Price: Gemini is much cheaper, with two asterisks
For most workloads Gemini 3.1 Pro is the budget pick by a clear margin. At $2 input and $12 output for prompts up to 200K tokens, it runs at roughly 40% of Opus 4.8's input cost and under half the output cost. Batch pricing keeps the gap: Gemini bills $1 input and $6 output in batch (then $2 / $9 above 200K), while Opus batch is $2.50 / $12.50. At volume, that spread compounds into a real line item.
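To see how those list prices translate into a per-request bill, here is a small sketch using only the rates quoted in this article. The model labels (`opus-4.8`, `gemini-3.1-pro`) are illustrative keys rather than real API identifiers, and the sketch assumes Opus has no separate long-context tier because none is listed here.

```python
from dataclasses import dataclass
from typing import Optional

TIER_THRESHOLD = 200_000  # prompt size above which Gemini's higher rates apply


@dataclass(frozen=True)
class Rates:
    input_usd: float   # USD per 1M input tokens
    output_usd: float  # USD per 1M output tokens


# (base rates, long-context rates or None); figures are the article's.
PRICES: dict[tuple[str, str], tuple[Rates, Optional[Rates]]] = {
    ("opus-4.8", "standard"): (Rates(5.00, 25.00), None),
    ("opus-4.8", "batch"): (Rates(2.50, 12.50), None),
    ("gemini-3.1-pro", "standard"): (Rates(2.00, 12.00), Rates(4.00, 18.00)),
    ("gemini-3.1-pro", "batch"): (Rates(1.00, 6.00), Rates(2.00, 9.00)),
}


def request_cost(model: str, mode: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the article's list prices."""
    base, long_ctx = PRICES[(model, mode)]
    use_long = long_ctx is not None and input_tokens > TIER_THRESHOLD
    rates = long_ctx if use_long else base
    return (input_tokens * rates.input_usd + output_tokens * rates.output_usd) / 1_000_000


# Hypothetical request: 50K tokens in, 4K tokens out.
for model in ("opus-4.8", "gemini-3.1-pro"):
    for mode in ("standard", "batch"):
        print(f"{model:15s} {mode:8s} ${request_cost(model, mode, 50_000, 4_000):.3f}")
```

Raising `input_tokens` past 200,000 switches Gemini to its higher tier, which is a quick way to check how much of the gap survives on long-context jobs.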
The two asterisks are where Opus claws value back. First, Gemini's price steps up above 200K input tokens, to $4 input and $18 output, so very long-context jobs narrow the gap considerably. Second, Gemini caps output at 64K tokens and counts thinking tokens against that ceiling, while Opus gives you 128K (300K in beta). If your task produces long structured output or leans on heavy reasoning traces, Opus's larger output budget can matter more than the per-token price. Opus also has its own cost levers: cache hits billed at $0.50 per million input tokens, batch at half price, and an optional fast mode at $10 / $50 for roughly 2.5x output speed.
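The caching lever is easier to judge with a number attached. The sketch below blends Opus's $5 list input price with the $0.50 cached-input figure and solves for the share of input tokens that would need to come from cache before Opus's effective input price matches Gemini's $2. It is input-side only and ignores cache-write charges and any caching discount on the Gemini side, neither of which is covered here.

```python
OPUS_INPUT = 5.00      # USD per 1M uncached input tokens (article figure)
OPUS_CACHE_HIT = 0.50  # USD per 1M cached input tokens (article figure)
GEMINI_INPUT = 2.00    # USD per 1M input tokens at or below 200K (article figure)


def blended_input_price(hit_fraction: float) -> float:
    """Effective Opus input price when hit_fraction of input tokens are cache hits."""
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be between 0 and 1")
    return hit_fraction * OPUS_CACHE_HIT + (1.0 - hit_fraction) * OPUS_INPUT


def break_even_hit_fraction(target_price: float) -> float:
    """Cache-hit share at which the blended Opus input price equals target_price."""
    return (OPUS_INPUT - target_price) / (OPUS_INPUT - OPUS_CACHE_HIT)


needed = break_even_hit_fraction(GEMINI_INPUT)
print(f"cache-hit share needed to match Gemini on input: {needed:.0%}")
print(f"blended Opus input price at 80% hits: ${blended_input_price(0.8):.2f} per 1M")
```

On these assumptions about two thirds of input tokens would have to be cache hits, so the lever mostly helps workloads that resend a large shared prefix. Output tokens are unaffected and stay at $25 versus $12.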
Winner: Gemini on raw price, decisively, until you hit the 200K cliff or need more than 64K of output.
Context and output: same window, different ceilings
Both models carry a 1M-token context window, so for ingesting large codebases, long documents, or sprawling chat histories they're evenly matched on what they can read. The difference is on the way out. Opus 4.8 can produce up to 128K tokens in a single response, with a 300K beta, while Gemini 3.1 Pro tops out at 64K — and because Gemini's output budget includes its thinking tokens, a heavy reasoning pass eats into the room left for the actual answer.
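A quick way to reason about that ceiling is to subtract the expected thinking tokens from the cap and see what is left for the visible answer. The sketch below does that with the caps quoted here; the function names and the example token counts are hypothetical, and since this article doesn't say how Opus accounts for thinking tokens, the Opus check conservatively charges them to its cap as well.

```python
GEMINI_OUTPUT_CAP = 64_000  # includes thinking tokens, per the article
OPUS_OUTPUT_CAP = 128_000   # standard cap; the 300K beta is not modelled


def answer_room(output_cap: int, thinking_tokens: int) -> int:
    """Tokens left for the visible answer once thinking is charged to the cap."""
    if thinking_tokens < 0:
        raise ValueError("thinking_tokens must be non-negative")
    return max(output_cap - thinking_tokens, 0)


def fits(answer_tokens: int, output_cap: int, thinking_tokens: int = 0) -> bool:
    """True if an answer of the given size fits after thinking is accounted for."""
    return answer_tokens <= answer_room(output_cap, thinking_tokens)


# Hypothetical job: a 50K-token refactor diff after a 20K-token reasoning pass.
thinking, answer = 20_000, 50_000
print("Gemini room:", answer_room(GEMINI_OUTPUT_CAP, thinking))
print("fits Gemini:", fits(answer, GEMINI_OUTPUT_CAP, thinking))
print("fits Opus (worst case):", fits(answer, OPUS_OUTPUT_CAP, thinking))
```

In this made-up case the diff fits under Opus's cap even in the worst case, but not in the 44K tokens Gemini has left after thinking.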
In practice this rarely bites on chat or short generations, where neither model comes close to the ceiling. It bites on the jobs that produce a lot of text at once: a full document draft, a large refactor returned as one diff, an exhaustive structured extraction. If that's your pattern, Opus's output headroom is a concrete advantage that the price comparison alone won't show you.
For the broader frame on how these flagships stack against the rest of the field, the Gemini 3.1 Pro vs GPT-5.5 comparison covers the reasoning-versus-knowledge-work axis, and the Opus 4.8 review goes deeper on where Anthropic's tier earns its keep.
Which one for which work
If you're shipping code or building agents, default to Opus 4.8. The SWE-bench lead is wide, the agentic and computer-use scores are strong, and there's no Gemini number that contradicts the picture. Coding accuracy is exactly what the $5/$25 premium buys.
If you're running high-volume reasoning, research, or multimodal work and price-per-token drives the decision, Gemini 3.1 Pro is the pick. It's much cheaper, ties or wins on the science and reasoning benchmarks, and leads on multimodal — just keep the 200K price cliff and the 64K output cap in view.
If you can run both, the clean split is Opus for the coding and agent layer, Gemini for the high-volume reasoning and analysis layer. And remember that Gemini is still a preview: production work that depends on stable pricing and rate limits leans toward Opus until Gemini reaches general availability.
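If you do run both, the split can be written down as a small routing rule. This is a sketch of the recommendations in this section, not a tested policy: the `Task` categories, the `pick_model` function, and the model labels are all hypothetical names introduced for illustration.

```python
from enum import Enum

OPUS = "opus-4.8"              # illustrative label, not an API model ID
GEMINI = "gemini-3.1-pro"      # illustrative label, not an API model ID
GEMINI_OUTPUT_CAP = 64_000     # output cap quoted in the article


class Task(Enum):
    CODING = "coding"
    AGENT = "agent"
    REASONING = "reasoning"
    RESEARCH = "research"
    MULTIMODAL = "multimodal"


def pick_model(task: Task, expected_output_tokens: int = 0,
               needs_stable_pricing: bool = False) -> str:
    """Route a job following the article's rules of thumb."""
    # Coding and agentic work default to Opus.
    if task in (Task.CODING, Task.AGENT):
        return OPUS
    # Outputs larger than Gemini's cap need Opus's headroom.
    if expected_output_tokens > GEMINI_OUTPUT_CAP:
        return OPUS
    # Gemini is a preview, so pricing-sensitive production work leans to Opus.
    if needs_stable_pricing:
        return OPUS
    # Everything else goes to the cheaper model.
    return GEMINI


print(pick_model(Task.CODING))
print(pick_model(Task.RESEARCH))
print(pick_model(Task.RESEARCH, expected_output_tokens=90_000))
```

The ordering of the checks is a design choice: workload type is tested first, then the hard output limit, then the softer stability preference, so price only decides once nothing else rules Gemini out.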