benchr Issue No. 07

GPT-5, reviewed

Where GPT-5 differs from Claude in ways that matter: speed and breadth, plus the cracks that show on niche technical questions.

Launched: Aug '25 (five months in production)
Input cost per 1M tokens: $10 ($50 output, $1.25 cached)
SWE-bench Verified: 68.8%
Context: 400K (~250K effective zone)

Is GPT-5 still worth subscribing to once Claude Opus 4.7 and Gemini 3.1 Pro Preview are on the table? That's the question the past five months were going to answer, and the answer is closer to "yes, but only for specific work" than the launch coverage suggested. GPT-5 shipped August 2025, per OpenAI's launch post. Five months later, the dust has settled. This piece looks at GPT-5 on its own terms — what it does that nothing else does, and the failure modes you'll learn to expect.

Short version: GPT-5 was the fastest of the three serious frontier models in this round of testing, the most natural at conversational English, the most opinionated on visual design, and the most likely to produce confidently wrong output on technical questions outside its strongest zones. None of that is damning. All of it is worth knowing before you pay for an API key.

The speed advantage is real

GPT-5 streams output noticeably faster than Claude Opus 4.7 in most testing: roughly 90 to 130 tokens per second on average, compared with 60 to 80 for Opus on a decent connection. First-token latency is also lower. For interactive use, the experience is clearly quicker, and the lower latency adds up in any workflow where you're waiting on the model before deciding what to do next.

The speed comes with a small but real cost. GPT-5 tends to produce slightly longer outputs than Claude for the same prompts, which partly offsets the throughput advantage. In a chat window you won't notice. In an automated pipeline that pays for output tokens, it shows up in the bill.
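A rough way to see how verbosity eats into the throughput lead: streaming time is output tokens divided by tokens per second. The sketch below uses the midpoints of the ranges quoted above (110 and 70 tokens per second). The 1,000-token baseline and the 15% verbosity overhead are illustrative assumptions, not measured values, and first-token latency is ignored.

```python
def stream_seconds(output_tokens: float, tokens_per_second: float) -> float:
    """Time to stream a response, ignoring first-token latency."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


GPT5_TPS = 110.0            # midpoint of the 90 to 130 range quoted above
OPUS_TPS = 70.0             # midpoint of the 60 to 80 range quoted above
BASE_TOKENS = 1_000         # hypothetical response length
VERBOSITY_OVERHEAD = 0.15   # hypothetical: GPT-5 output 15% longer

gpt5_seconds = stream_seconds(BASE_TOKENS * (1 + VERBOSITY_OVERHEAD), GPT5_TPS)
opus_seconds = stream_seconds(BASE_TOKENS, OPUS_TPS)

# Overhead at which the two models would finish at the same time.
break_even_overhead = GPT5_TPS / OPUS_TPS - 1

print(f"GPT-5: {gpt5_seconds:.1f}s, Opus: {opus_seconds:.1f}s")
print(f"Break-even verbosity overhead: {break_even_overhead:.0%}")
```

Under these assumptions the speed lead survives on wall-clock time, but every extra output token is still billed, which is where the pipeline cost shows up.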

GPT-5 vs Claude Opus 4.7, by task category

Figure: bar chart of scores out of 100 by task category. GPT-5 scores shown:

  • Visual design: 93
  • Coding: 90
  • Reasoning: 93
  • Writing: 87
  • Math (MATH bench): 92.8

The breadth is real, especially in writing

GPT-5's strongest single category is conversational and creative writing. Ask it to write in a specific voice — a 1950s detective story, a 19th-century historian, a contemporary tech essayist — and you'll get output that captures the tone more reliably than Claude does on the first try. The drafts aren't always better than what Claude produces with a second pass. But they need less iteration to land somewhere usable.

That translates into a real advantage for any task where stylistic flexibility matters more than correctness or consistency. Drafting copy in a brand voice nobody has documented. Writing narrative prose with a deliberate atmosphere. Producing pitch decks where the words matter at least as much as the data. For those, GPT-5 is the model that produces the most useful first attempt.

Multilingual range is broader than the benchmarks suggest

GPT-5 handles a wider range of languages comfortably than the benchmarks indicate. Spanish, French, Portuguese, German, Italian, Japanese, Korean, and Mandarin all produce output that reads like a competent native speaker wrote it — not a translated thought. It also handles low-resource languages with more grace than its predecessors. Hindi, Tagalog, Swahili, Bengali. None of the obvious tells of machine translation.

The gap shows up in two places. First: tone sensitivity, especially in languages with strong formal/informal distinctions. Korean honorifics, Japanese politeness levels, Arabic regional dialects. GPT-5 picks a tone and sticks with it, but the picked tone isn't always the one you asked for. Second: dialectal Arabic, where Egyptian vocabulary leaks into Gulf-targeted prompts in a way that breaks immersion for the intended reader. The Arabic-content piece in this issue covers that in more detail.

One honest admission before this section: I can't tell yet whether the latency improvement in OpenAI's January API update will hold once GPT-5 hits production scale on the new tokenizer. It might. It might not. Six weeks of testing isn't enough to know.

Code that compiles but doesn't work

The most consistent failure mode of GPT-5 in technical work: code that the compiler accepts but the runtime rejects, or that runs but produces a subtly wrong result. This is different from the failure mode of weaker models, which tend to spit out code that obviously fails to compile.

The pattern repeats across languages and frameworks. The code looks right. The structure is right. The function signatures use real APIs in reasonable ways. The bug is a misunderstanding of an edge case, an off-by-one in a slicing operation, a misuse of a concurrency primitive that doesn't surface on the happy path. None of these are catastrophic. All of them need a careful human reader to catch. The aggregate result tracks: GPT-5 sits at 68.8% on the SWE-bench Verified leaderboard, several points behind Claude Opus 4.7.

In the seven-task comparison benchr ran in late November, GPT-5 fixed three of four bugs in a deliberately-broken Python script and missed the most subtle one: an asyncio.gather versus asyncio.wait swap that affected result ordering only under specific call patterns. Claude flagged the same bug on the first pass. GPT-5 missing the subtle case while handling the obvious ones is the consistent pattern of its coding output.
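For readers who haven't hit this class of bug, here is a minimal sketch of the general difference between the two calls. It is not the script from the benchr test, and the `fetch` helper and its delays are hypothetical. `asyncio.gather` returns results in the order the awaitables were passed in, while `asyncio.wait` returns sets of tasks with no ordering guarantee.

```python
import asyncio


async def fetch(label: str, delay: float) -> str:
    """Hypothetical stand-in for an I/O call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return label


async def with_gather() -> list[str]:
    # gather preserves input order, even though "fast" finishes first.
    return await asyncio.gather(fetch("slow", 0.05), fetch("fast", 0.01))


async def with_wait() -> list[str]:
    tasks = [
        asyncio.create_task(fetch("slow", 0.05)),
        asyncio.create_task(fetch("fast", 0.01)),
    ]
    # wait returns (done, pending) as sets, so iteration order
    # is not tied to the order the tasks were created in.
    done, _pending = await asyncio.wait(tasks)
    return [task.result() for task in done]


gather_results = asyncio.run(with_gather())
wait_results = asyncio.run(with_wait())

print("gather:", gather_results)
print("wait:  ", wait_results)
```

Code that indexes into the results by position works with `gather` and may silently misalign with `wait`, which is why a swap like this can pass on the happy path.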

GPT-5 produces code the compiler accepts and you have to second-guess. Claude produces code with the bug-flagging instincts that catch what the compiler won't.

Niche technical hallucination

Outside the busy corners of the major languages and libraries, GPT-5 hallucinates with confidence. Ask for the signature of a method on a less-popular library and you'll often get a believable-looking signature that doesn't exist. Ask about the behavior of a deprecated API and it'll sometimes describe the behavior of the modern replacement, or vice versa, without flagging the swap.

The defense is the same as for any model. Never trust an API claim you can't verify against the docs. But the missing hedge on these claims is a real difference from Claude, which flags uncertainty more often when it's operating outside its strongest zones. GPT-5's default mode on a technical question is confident. The confidence isn't always calibrated to how reliable the answer actually is.
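One cheap first check, before reading the docs, is to compare the claimed call against the library actually installed. The sketch below assumes a Python target and uses only the standard library's `importlib` and `inspect`. The `check_call_claim` helper is a hypothetical name, and the `json` claims are made-up examples of what a model might assert.

```python
import importlib
import inspect


def check_call_claim(module_name: str, attr_path: str, claimed_params: list[str]) -> dict:
    """Check a claimed callable and its parameter names against the installed library."""
    obj = importlib.import_module(module_name)
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            # The claimed function or method does not exist at all.
            return {"exists": False, "missing_params": list(claimed_params)}
        obj = getattr(obj, part)
    try:
        real_params = inspect.signature(obj).parameters
    except (TypeError, ValueError):
        # Some builtins and C extensions expose no introspectable signature.
        return {"exists": True, "missing_params": None}
    # Parameters the model named that the real signature does not name explicitly.
    missing = [name for name in claimed_params if name not in real_params]
    return {"exists": True, "missing_params": missing}


# Made-up claims: "json.dumps takes indent and pretty" and "json.dump_pretty exists".
print(check_call_claim("json", "dumps", ["indent", "pretty"]))
print(check_call_claim("json", "dump_pretty", ["indent"]))
```

A flagged parameter is a prompt to read the docs, not proof of a hallucination: a function that accepts `**kwargs` may legitimately take names its signature doesn't list. The check also says nothing about behavior, only about whether the name exists.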

MATH benchmark: 92.8%, best in class on math-heavy work.

Anyway.

Worth flagging: the latency and throughput numbers in this review are from my testing locations and may not match what teams in other regions see. OpenAI runs different points of presence (POPs) and has different routing characteristics by region. The relative comparison (GPT-5 vs Opus vs Gemini) holds. The absolute milliseconds may not.

What it costs

GPT-5 lists at $10 per million input tokens and $50 per million output tokens through the OpenAI API, per OpenAI's API pricing. The pricing sits between Claude Sonnet 4.7 and Claude Opus 4.7. For most working sessions, GPT-5 will cost a little less than Opus and a little more than Sonnet. For high-volume workloads, GPT-5 Mini at $0.50 / $4.00 per million tokens is the fairer comparison. It's in a different price class entirely and competes against Claude Sonnet for the budget-conscious workload.

OpenAI GPT-5 tier pricing, January 2026, per OpenAI API pricing

Tier                 Input ($/M tokens)   Output ($/M tokens)   Notes
GPT-5                $10                  $50                   Standard frontier tier
GPT-5 Mini           $0.50                $4.00                 Distilled model, good for volume
GPT-5 (Batch API)    $5                   $25                   50% discount, 24h turnaround

The batch tier is consistently underused in production. For any workload that doesn't need a synchronous response (overnight document processing, bulk classification, content backfills), the 50% discount costs nothing except a turnaround of up to 24 hours. Most teams forget to use it. At high volume, the savings can be large enough to cover an extra developer seat.
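To put a number on the batch discount, here is a small cost sketch using the list prices from the table above. The monthly workload (200M input tokens, 20M output tokens) is a hypothetical figure chosen for illustration, and the `Rate` and `job_cost` names are not part of any OpenAI SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """Price in dollars per million tokens."""
    input_per_million: float
    output_per_million: float


# List prices as quoted in the pricing table in this review.
STANDARD = Rate(input_per_million=10.0, output_per_million=50.0)
BATCH = Rate(input_per_million=5.0, output_per_million=25.0)


def job_cost(rate: Rate, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a workload at the given rate."""
    total = (
        input_tokens * rate.input_per_million
        + output_tokens * rate.output_per_million
    )
    return total / 1_000_000


# Hypothetical monthly backfill: 200M input tokens, 20M output tokens.
INPUT_TOKENS = 200_000_000
OUTPUT_TOKENS = 20_000_000

standard_cost = job_cost(STANDARD, INPUT_TOKENS, OUTPUT_TOKENS)
batch_cost = job_cost(BATCH, INPUT_TOKENS, OUTPUT_TOKENS)

print(f"Standard: ${standard_cost:,.2f}")
print(f"Batch:    ${batch_cost:,.2f}")
print(f"Saved:    ${standard_cost - batch_cost:,.2f}")
```

The saving scales linearly with volume, so the question for any given team is simply whether the workload can tolerate the 24-hour turnaround.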

  • Coding: 90/100, strong on first draft
  • Reasoning: 93/100, math-heavy edge
  • Writing: 87/100, most natural English
  • Vision: 88/100, good, not best
  • Long context: 84/100, drops past 250K
  • Multilingual: 90/100, broader than benchmarks

  • Nov 2022: GPT-3.5
  • Mar 2023: GPT-4
  • Nov 2023: GPT-4 Turbo
  • May 2024: GPT-4o
  • Aug 2025: GPT-5

OpenAI's flagship release cadence, 2022–2025. Roughly one major model per year.

When GPT-5 is the call

Three categories, named specifically.

First: visual or design-heavy work. Landing pages, presentation decks, marketing layouts. Anything where the model's aesthetic judgment matters. GPT-5's defaults are more contemporary and more confident than Claude's. The output needs less reshaping to land somewhere shippable.

Second: conversational and creative writing where stylistic flexibility is the central requirement. Voice-matching, atmospheric prose, narrative work that needs a specific tone. GPT-5 captures the tone more reliably on the first attempt.

Third: multilingual work in any language where Claude hasn't been specifically tuned. Languages outside the top tier of training data come out with more polish in GPT-5 than in Claude. The exception is Arabic, where Claude has a clear edge on tone and dialect. For the broader set of world languages, GPT-5 is the safer first pick.

When to skip it

Skip GPT-5 for technical work where correctness matters more than smoothness. Code in production codebases, architectural reasoning, debugging that needs a subtle bug caught on the first pass. The confident-wrong failure mode is the wrong fit for these tasks.

Skip it for reasoning under uncertainty where honesty is part of the deliverable. Legal questions, medical questions, financial analysis. Anything where a confident answer that turns out to be wrong is worse than an honest "I'm not sure." Claude's hedging instinct is safer here.

Skip it for long-context work where the model needs to hold coherence across hundreds of thousands of tokens. GPT-5 is competent in long context. But Claude is better at the synthesis-across-sections work that separates a useful long-context response from a summary of one section only.

GPT-5 is the second model worth paying for in 2026, after Claude. It earns its spot by being faster, more stylistically flexible, more multilingually broad, and more visually opinionated than the alternatives. It doesn't earn the top slot because its confidence isn't calibrated to its accuracy on technical questions, and its failure mode on code — compiling but wrong — is the worst kind of failure for working software.

The recommendation hasn't changed since the comparison piece. Claude Opus 4.7 as your default, GPT-5 as the supplementary subscription for the categories where it wins. The two together run about $40 a month in normal API use. That combo is what anyone serious about working with these tools should be running.

If you're forced to pick one, the choice depends on the work. If you write more than you code, pick GPT-5. If you code more than you write, pick Claude. Most users sit in the middle and benefit from having both. That's why the both-models answer is what benchr keeps recommending.

Bottom line

Subscribe to GPT-5 as the second model in a multi-model setup, with Claude as the primary. Pick GPT-5 for visual design, structured output, math-heavy reasoning, and conversational warmth. Skip it for production code review (the confident-wrong failure mode is too risky) and long-document analysis (Claude wins on synthesis past 250K tokens).

Frequently asked

Is GPT-5 better than Claude Opus 4.7?

Not overall. GPT-5 wins on visual design, math benchmarks, and conversational warmth. Claude Opus wins on coding, reasoning under uncertainty, and long-document analysis. For mixed workloads, the both-models approach beats either alone.

How much does GPT-5 cost?

$10 per million input tokens and $50 per million output. GPT-5 Mini drops to $0.50 / $4.00 for high-volume cheap workloads. The batch API offers a 50% discount with 24-hour turnaround.

What's the context window on GPT-5?

400K tokens advertised. Effective retrieval stays solid to about 250K. Past that, recall drops faster than on Claude or Gemini.

When should you choose GPT-5 over the alternatives?

For visual design tasks (landing pages, layouts), structured output (JSON, XML), math-heavy reasoning, and conversational English where warmth matters. Pick something else for production code review or for work where honest hedging matters.

Does GPT-5 hallucinate?

On niche technical questions, yes — confidently. GPT-5's default tone is certain, and the certainty isn't always calibrated to accuracy. Always verify API signatures and citations against primary sources.

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • January 22, 2026 — Re-verified pricing against the OpenAI pricing page.
  • January 8, 2026 — Updated throughput numbers after OpenAI's January API update improved vision throughput by ~30%. Text-only tokens-per-second unchanged.
  • January 4, 2026 — Originally published.

References

  1. OpenAI, "API Documentation," platform.openai.com/docs, accessed May 2026.
  2. OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
  3. "Chatbot Arena leaderboard," lmarena.ai, May 2026 snapshot.
  4. OpenAI, "Introducing GPT-5," openai.com/index/introducing-gpt-5, August 2025.
  5. "SWE-bench Verified leaderboard," swebench.com, May 2026.