Put these two spec sheets side by side and the first thing you notice is that they don't line up. Google's Gemini 3.1 Pro leads with abstract reasoning and science scores. OpenAI's GPT-5.5 leads with agentic coding and knowledge-work scores. They have almost no benchmarks in common, which leaves the usual "who's higher" table mostly empty. So this comparison is less a race and more a question of which scoreboard matches your job.
One benchmark does overlap, and it's worth stating up front: Terminal-Bench 2.0, the test for multi-step command-line agents. There, GPT-5.5 wins clearly, 82.7% to 68.5%. Everywhere else, you're comparing each lab's chosen strengths, not the same test run twice. If that frustrates you, it should. Why benchmarks stopped telling you much is the longer argument for reading these numbers with suspicion.
| Benchmark | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|
| ARC-AGI-2 (abstract reasoning) | 77.1% | — |
| GPQA Diamond (science) | 94.3% | — |
| Humanity's Last Exam (no tools) | 44.4% | — |
| SWE-bench Verified (agentic coding) | 80.6% | — |
| Terminal-Bench 2.0 (CLI agents) | 68.5% | 82.7% |
| SWE-Bench Pro (real-world issues) | — | 58.6% |
| GDPval (knowledge work) | — | 84.9% |
Where Gemini's reasoning case is strongest
Gemini 3.1 Pro is built to push the reasoning frontier, and the numbers Google reports back that up. The 77.1% on ARC-AGI-2 is the headline, more than double Gemini 3 Pro's 31.1% on the same test, and ARC-AGI-2 is the benchmark designed specifically to resist memorization. Add 94.3% on GPQA Diamond for graduate-level science and a competitive-coding rating of 2887 Elo on LiveCodeBench Pro, and you've got a model tuned for questions where it has to think, not retrieve.
If your work is hard reasoning (novel math, research-grade science, problems with no answer to look up), Gemini 3.1 Pro is reporting the strongest public scores, and OpenAI simply isn't putting GPT-5.5 on those same boards. The deeper read on Google's flagship line is in the Gemini evaluation.
Where GPT-5.5 owns the work
GPT-5.5 aims somewhere else: getting professional knowledge work done. OpenAI reports 84.9% on GDPval, a test of producing well-specified work across dozens of occupations, and built the release around agentic coding and computer use. Its 82.7% on Terminal-Bench 2.0, the one test both labs report, is state of the art and beats Gemini's 68.5% outright. OpenAI tuned it to be concise and to hold context across big systems, which is the texture of real analyst and engineering work.
So GPT-5.5 is the one you reach for when the deliverable is a finished thing (a report, a working change, a synthesized analysis) rather than a hard puzzle solved. The GPT-5 review covers the lineage that GPT-5.5 extends.
The price gap nobody mentions
Here's the lever that quietly decides a lot of real deployments: Gemini 3.1 Pro is much cheaper. At standard prompt lengths it's $2 input and $12 output per million tokens, against GPT-5.5's $5 and $30. That's 2.5 times the cost, on input and output alike, to run the same volume on GPT-5.5.
At low volume that gap is noise. At scale, it's the budget. A pipeline running millions of tokens a day will feel a 2.5x multiplier in a way no benchmark captures, and Gemini 3.1 Pro posts a strong 80.6% on SWE-bench Verified while it's at it. The one caveat: Gemini 3.1 Pro is still in preview as of May 2026, so treat its pricing and limits as not yet final.
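To put the multiplier in dollar terms, here is a minimal sketch of the arithmetic. The per-million-token prices are the standard-prompt-length figures quoted above; the daily workload (50M input and 10M output tokens) is a hypothetical chosen only for illustration, and the Gemini price may change once the model leaves preview.

```python
# Prices in USD per million tokens, as quoted above for standard prompt lengths.
# Gemini 3.1 Pro pricing is preview pricing and may not be final.
PRICES = {
    "Gemini 3.1 Pro": {"input": 2.00, "output": 12.00},
    "GPT-5.5": {"input": 5.00, "output": 30.00},
}


def daily_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one day's traffic for the given model."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload, not taken from either vendor: adjust to your own traffic.
INPUT_TOKENS = 50_000_000
OUTPUT_TOKENS = 10_000_000

gemini = daily_cost("Gemini 3.1 Pro", INPUT_TOKENS, OUTPUT_TOKENS)
gpt = daily_cost("GPT-5.5", INPUT_TOKENS, OUTPUT_TOKENS)

print(f"Gemini 3.1 Pro: ${gemini:,.2f}/day")
print(f"GPT-5.5:        ${gpt:,.2f}/day")
print(f"Difference:     ${gpt - gemini:,.2f}/day")
print(f"Ratio:          {gpt / gemini:.1f}x")
```

Under those assumed volumes the sketch gives $220 a day for Gemini 3.1 Pro and $550 a day for GPT-5.5, a $330 daily difference. Because the input and output prices differ by the same factor, the ratio stays at 2.5x whatever your input-to-output mix is; only the absolute gap moves with volume.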
Which one for you
Go with Gemini 3.1 Pro if your work is hard reasoning, science, or research, or if you're running enough volume that a 2.5x price gap matters. It posts the strongest public reasoning scores and costs far less per token, and its SWE-bench Verified result means it holds up on coding too.
Go with GPT-5.5 if your job is finished knowledge work and terminal-driven agents. It wins the one benchmark both labs ran, it's tuned for concise professional output, and it's the safer pick when the model has to grind through tools and long systems without losing the thread.
And if you're deciding between the two frontier coders rather than this reasoning-versus-breadth split, the closer fight is Opus 4.8 vs GPT-5.5, where the benchmarks line up enough to score a winner.