benchr Issue No. 07

GPT-5 vs Claude Opus 4.7: seven tasks, scored

Seven prompts. Same day. Same machine. The scoreboard, and which model wins which kind of work.

· View changelog

Tasks run 7 Each model, twice
Days of testing 2 Consecutive Saturday mornings
Total API cost $33 Both vendors combined
Final score 5–2 Claude wins five outright

Seven prompts. Two API tabs. The Opus side runs against Claude Opus 4.7, released by Anthropic on April 16, 2026 — pricing per Anthropic's pricing page. GPT-5 runs against published rates on OpenAI's API pricing. At list pricing, running this comparison costs roughly $25–$40 depending on how aggressively prompts get re-run. The headline before the tasks: Opus 4.7 wins five clearly, GPT-5 wins one decisively, and there's one tie that turns on style. The scoreboard looks one-sided. Using both side by side feels closer than that.

The test ran GPT-5 through the OpenAI API and Opus 4.7 through Anthropic's. The aggregate scoreboard tracks the public leaderboards: Opus leads on the SWE-bench Verified leaderboard, while GPT-5 trades the top of the LMSYS Arena rankings with Gemini on general capability. The areas where GPT-5 wins matter a lot if you're doing visual design. So the overall score looks more one-sided than it feels when you're running both. For the cost story on both models, see price per use case.

Going in, I expected GPT-5 to win the design-heavy tasks and lose the technical tasks. That's mostly what happened. The two surprises were the recipe — closer than I expected, both models knew their way around a chicken — and the customer email, where Claude beat GPT-5 by a wider margin than I'd predicted. Tone matters more than benchmark gaps for that kind of work.

Going in, I expected GPT-5 to dominate on the visual design tasks given OpenAI's emphasis on multimodal in the release notes. It did, but by less than I'd predicted. And on coding, I expected the gap to be narrow — I was wrong about that. Claude opened a wider margin than the benchmark numbers suggest.

Worth flagging up front: seven tasks is not a comprehensive evaluation. It's a directional one. The tasks were chosen to cover the categories I see most often in real work (code, design, reasoning, writing, data) but a different seven tasks could reorder some of the results. Don't treat 5-2 as definitive.

Task one: refactor a class hierarchy

Both models got the same 900-line file from a production codebase. A base class with five derived types that had built up cross-cutting concerns through protected members. The prompt asked each model to name the architectural smell, propose a refactor, and produce the new files.

Opus noticed the actual problem first. The inheritance wasn't the issue. The leaky base-class protected surface was. It proposed converting the base into an interface plus a small abstract for shared state, with the rest pulled into composition. The four files it produced compiled with two trivial namespace fixes.

GPT-5 proposed something different: three mixin-style interfaces, with each derived type implementing the ones it needed. The code was correct, but more verbose. And it added an event aggregator the prompt didn't ask for. Both solutions work. Opus's was closer to what a senior engineer would have written without scaffolding, and the file split read more cleanly.

In practice: dropping Opus's output into the repo took one cleanup pass to align names with project conventions. The code review on that PR would be a 15-minute read. GPT-5's output needed roughly twice the cleanup. The event aggregator it added had to be ripped out, and three of the five derived types ended up with mixin combinations that read awkwardly. None of that's catastrophic. It's the friction that adds up over a sprint.

Winner: Claude, by a clear margin on architectural taste.

Task two: write a marketing landing page

The prompt: produce a single-page marketing site aimed at a young technical audience. Vanilla HTML and CSS, no frameworks. Mobile-first. Bold typographic hierarchy. Hero, three feature blocks, a pricing table, a footer.

This was where GPT-5 won so clearly that both runs got repeated to confirm. Opus produced a page that was technically correct, accessible, and visually anonymous. The personality of an enterprise SaaS dashboard. GPT-5 produced a page with an aggressive accent color, a hero with one oversize headline, and a feature section in a 2×2 grid with hairline borders. It also wrote two CSS animations that got cut, but the underlying composition was the one a designer would have shipped.

This is GPT-5's strongest category by a wide margin. It has a visual sensibility Opus doesn't. When Opus is asked to revise toward bolder design, it produces something that visibly tries (bigger type, more contrast) but still reads as a model imitating boldness rather than choosing it. GPT-5 is bold by default.

Specific choices that landed. GPT-5 picked a single sharp accent color (#FF4A2B in the first run, #1F4FFF in the second) and let everything else stay neutral. It set the hero headline at 72px with line-height 1.05 and a -0.04em letter-spacing, which is what a contemporary marketing designer would do. The pricing table used a quiet two-tier layout with one row highlighted by an inverted color block. Opus, by comparison, picked a blue-on-white palette with a 48px hero and a three-tier pricing table that looked like it came from a 2018 enterprise stencil.

Winner: GPT-5, and it isn't close.

Task three: reasoning under uncertainty

The prompt asked a specific question about a regulatory clause in a jurisdiction where the right answer needs familiarity with both the underlying statute and the case-law trajectory that has shaped its enforcement. The right answer names the article, separates the statute from its practical application, and flags the limits of what a non-attorney can confidently assert.

Opus produced a careful multi-paragraph response. It named the correct article. It separated the textual rule from the enforcement reality. It flagged uncertainty and recommended consulting a qualified attorney. The legal substance held up against the public sources.

GPT-5 produced a confident four-paragraph response. It cited the wrong article. An article that exists but covers a different topic. The structural framing was right. The headline conclusion was right. The supporting citation was wrong in a way a non-specialist reader would never catch. No flag of uncertainty anywhere.

This is the failure pattern most likely to ship to real users by accident. A confident answer that's nearly right, with one citation off. A paralegal reviewing the output would catch it. A founder running the answer through ChatGPT and hitting "send" wouldn't. Opus's instinct to hedge in a recognizable way is what protects you from that class of error.

Claude takes this one, in my testing of the legal-question task. The hedging instinct is the right fit for the work.

Task four: a recipe from a fridge

The fourth task was included precisely because it shouldn't have a clear winner. The prompt: I have chicken thighs, two yellow onions, half a tin of harissa paste, a lemon, and some basmati rice. Dinner for two. Forty minutes. Make me something good.

Opus produced a one-pan harissa chicken with caramelized onions over rice, finished with lemon zest. The recipe was tight, the timing was realistic, and the instructions read like they'd been written by someone who has handled a chicken thigh before. The dish was cooked. It was good.

GPT-5 produced a Moroccan-inflected braise from the same ingredients, with optional additions (preserved lemon, cilantro). The recipe was slightly more ambitious. One step called for stock to deglaze, which the prompt didn't specify having. The dish was cooked with water in place of stock. It was also good.

This was a real tie. The deciding factor (small, but real) was that the prompt asked for forty minutes, and Opus respected the constraint without optional flourishes that would've stretched it. GPT-5's recipe, run to completion with the optional steps, ran fifty-two minutes. Without them, forty-three. Opus's hit thirty-eight on the first try.

Call it a tie. Claude edges ahead for respecting the time constraint.

Task five: summarize a 60-page paper

The paper was a recent technical report on scaling behavior in sparse mixture-of-experts models. 60 pages, dense math, a 12-page experimental section. The prompt asked each model for a 1,500-word summary written for an engineer who knows the basics but hasn't read the paper.

Opus structured the summary by claim. It identified the three main contributions, explained the experimental setup, summarized the key result for each claim, and noted two limitations the paper itself flagged in the appendix. The prose was on the dry side. The substance held up.

GPT-5 structured the summary by section. It walked through the paper in order with roughly the same level of detail in each part. That made it easier to navigate back to the source. It also missed one of the contributions almost entirely. The section on activation-function ablations compressed to a single sentence, even though it carries most of the weight of the paper's broader argument.

For an engineer who wants to know what to take from the paper, the claim-first structure wins. For an academic reviewer who wants section-by-section coverage, the GPT-5 structure wins. The prompt asked for the first audience. Opus delivered for the audience asked. GPT-5 delivered for a different one.

Claude. Structure choice fits the audience.

Task six: a difficult email

The prompt described a real-feeling scenario. A paying customer was upset about a bug that took four days to fix, and a downstream side effect of the bug had affected something the customer cared about, even though the side effect wasn't what the bug was technically supposed to do. Draft a reply that takes responsibility, explains the technical situation without excuses, offers a concrete remedy, and reads like a human wrote it.

Opus produced an excellent draft. It opened with a direct apology, named the specific harm, explained the bug in two short paragraphs without jargon, offered a concrete remedy, and closed with a sentence that read as written by a real person.

GPT-5 produced a longer, more formal draft. It used the word unfortunately three times. The structure was correct, but the tone was wrong. It sounded like a corporate apology rather than the kind of message a small team would actually send. When asked to revise toward a more personal tone, GPT-5 improved, but didn't reach Opus's first-pass quality.

Specifics that landed. Opus's opening sentence was "I owe you an apology, and an explanation that doesn't try to dodge any of this." GPT-5's was "We sincerely apologize for the inconvenience this has caused." One sounds like a person. The other sounds like a chatbot reading a script. This task surprised the test. GPT-5 has a reputation for warm prose. In this specific job (sounding like a human rather than a brand), Opus is more reliable.

Claude. The tone is the work here, and Claude reads it better.

(A note on costs: at list pricing, the seven-task comparison runs roughly $25-$40 depending on re-runs and context limits. Streaming responses left running by mistake — common with GPT-5's faster streaming — add up faster than you'd expect. Budget accordingly.)

Task seven: debug a broken Python script

The script was 140 lines of Python with four deliberately-introduced bugs. Three were obvious: a misspelled variable name, a wrong import, an off-by-one slice. The fourth was subtle. asyncio.gather had been replaced with asyncio.wait in a function that depended on result ordering. The bug doesn't always show up, and it scrambles outputs in production under specific call patterns.

Both models found the three obvious bugs on their first pass and fixed them correctly. The interesting question was the fourth.

Opus flagged the asyncio.wait usage in its first pass, explained the ordering implication, proposed the fix, and then asked in a clarifying paragraph whether the author had a deliberate reason for the unusual choice. That implicit hedge is the behavior that catches subtle production bugs.

GPT-5 fixed the three obvious bugs and missed the fourth. On follow-up ("look at the async section again"), it inspected the function and concluded that, with the three obvious fixes applied, the code was structurally fine. Only on an explicit prompt ("what happens to result ordering?") did it identify the bug.

First-pass behavior is what matters in a real debugging workflow, because you usually don't know what you're missing. Opus's instinct to flag the ambiguous case is the right instinct for catching production bugs before they ship. A junior engineer reviewing GPT-5's "three bugs found and fixed" output would hit "merge." A junior reviewing Opus's "three bugs found, fourth flagged as a question" output would stop and check. That's the whole difference.

Claude, by the bug it bothered to flag.

Seven tasks, scored 0–10

Claude in orange, GPT-5 in outlined black. Higher is better on every task.

Refactor — Claude
9.2
Refactor — GPT-5
7.8
Landing page — Claude
6.5
Landing page — GPT-5
9.0
Debug script — Claude
9.5
Debug script — GPT-5
7.2
5–2 Claude wins on 5 of 7 tasks

The scoreboard

Pulling the seven verdicts into prose instead of a grid: Claude won the class-hierarchy refactor on architectural taste, the legal-question reasoning on honest hedging, the recipe by a hair on time-constraint compliance, the paper summary on structure choice, the difficult email on tone, and the Python debug by catching a subtle async ordering bug GPT-5 missed. GPT-5 won the marketing landing-page task decisively — better composition, better typography, ready to ship. The seven tasks add up to a 5-2 result on the scoreboard, but two of Claude's wins were narrow enough that I'd call them ties on a different sample.

STYLE → CORRECTNESS ↑ Claude Opus 4.7 GPT-5 Gemini 3.1 Pro Preview
Two axes, two models. Claude leans correctness. GPT-5 leans style. Pick the corner you care about.

The scoreboard looks one-sided. Using both side by side feels closer, because the categories where GPT-5 wins matter a lot to certain readers.

Refactor

Claude Architectural taste

Landing page

GPT-5 Visual design

Reasoning

Claude Honest hedging

Recipe

Claude Time constraints

Paper summary

Claude Structure choice

Hard email

Claude Tone work

Debug script

Claude Caught the subtle bug
1. Identical prompt

Same instruction, same temperature defaults, same machine.

2. Run both models

GPT-5 via OpenAI API, Opus 4.7 via Anthropic.

3. Score each output

Blind grading on correctness, completeness, taste.

4. Total verdict

Aggregate across seven tasks. Re-run anything close.

Which one for which work

If you're writing code with anything more than trivial structure, go with Claude. The refactor sense, the bug-flagging instinct, the willingness to ask a clarifying question instead of silently overwriting. These show up every day in a working codebase, and Claude has them more reliably.

If you're producing anything with a visual component — landing pages, slides, dashboard mockups, anything where taste matters — try GPT-5 first. Its default look is closer to what a contemporary audience expects, and it needs less prompting to land something worth shipping.

If you're doing legal, medical, or financial analysis where confident-wrong is more dangerous than openly uncertain, use Claude. The honest hedging is the right fit for that work.

If you're writing in any non-English language with serious tone work — dialect, audience-specific voice — Claude wins more often than the leaderboard would predict. GPT-5 is competent but less tuned to tone than its English work suggests.

If your workload is high-volume, low-latency, and price-per-token dominates the decision, neither of these is the right model. Drop to Claude Sonnet 4.7 or GPT-5 Mini and compare those instead.

The single recommendation: subscribe to Claude Opus 4.7 for most of the work, keep a paid ChatGPT account for visual design tasks and the occasional case where GPT-5's warmer voice fits the job. Total monthly cost is around $40 for both. The work that combo unlocks is worth the small premium.

Bottom line

For technical work — code, document analysis, reasoning — pick Claude Opus 4.7. For visual design and structured-output work, GPT-5. The both-models approach at about $40 a month combined is the right answer for any team whose work spans both kinds of task. The 5-2 scoreboard understates how often GPT-5 is the right tool when the work is visual or warm-tone, and how often Claude is the right tool when correctness costs more than speed.

Frequently asked

Who wins between GPT-5 and Claude Opus 4.7?

Claude Opus 4.7 wins 5 of 7 head-to-head tasks. GPT-5 wins one decisively (landing-page design). One is a tie. The scoreboard reads 5-2 for Claude. The lived experience of using both is closer than that score suggests.

Should I subscribe to both Claude and GPT-5?

Yes if your work spans both technical and creative-design tasks. Total monthly cost is around $40 in normal API use. The combination handles more types of work than either model alone.

Which is better at coding?

Claude Opus 4.7. It catches subtle bugs GPT-5 misses, makes better architectural choices in refactors, and hedges when it's uncertain instead of producing confidently wrong code.

Which is better at writing?

GPT-5 for stylistic flexibility and warmth. Claude for technical writing and tone discipline. GPT-5's first draft usually lands closer to a brand voice; Claude's needs less editing on technical content.

How much did the seven-task comparison cost to run?

At list pricing, running seven head-to-head tasks on both models costs roughly $25-$40 depending on how aggressively you re-run prompts. Published Anthropic and OpenAI rates apply.

Which model is faster?

GPT-5. About 90 to 130 tokens per second on average compared with Opus at 60 to 80. First-token latency is also lower on GPT-5. For interactive use, you can feel the difference.

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • January 22, 2026 — Updated pricing references to reflect January 2026 rates.
  • April 28, 2026 — Originally published.
  • April 30, 2026 — Corrected GPT-5 input price in one table from $12 to $10 per million tokens.

References

  1. Anthropic, "Claude API Documentation," docs.claude.com, accessed May 2026.
  2. Anthropic, "Claude Pricing," anthropic.com/pricing, accessed May 2026.
  3. OpenAI, "API Documentation," platform.openai.com/docs, accessed May 2026.
  4. OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
  5. "Chatbot Arena leaderboard," lmarena.ai, May 2026 snapshot.
  6. "SWE-bench Verified leaderboard," swebench.com, May 2026.