This piece compares the two frontier models you're most likely to be deciding between in 2026 across seven workload categories: a code refactor, a marketing landing page, reasoning under uncertainty, an instruction-bound recipe task, a paper summary, a difficult customer email, and a Python debug. The verdict for each category is grounded in the public benchmark record (SWE-bench Verified, LMArena), each lab's own positioning of its model, and the consistent public discussion of how each tier handles each category. The headline: Opus 4.7 takes five, GPT-5 takes one decisively, and there is one tie.
The pricing reference, since both come up in every workload decision: Claude Opus 4.7 lists at $5 per million input tokens and $25 output, per Anthropic's pricing page. GPT-5 lists at $1.25 and $10 per OpenAI's API pricing. For the full cost picture across workloads, see price per use case. For each model on its own terms, see the Opus review and the GPT-5 review.
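To make the list prices concrete, here is a minimal sketch of the per-request arithmetic. The request size (20,000 input tokens, 4,000 output tokens) is a hypothetical chosen for illustration, and the calculation ignores caching, batch discounts, and the fact that the two tokenizers will not produce identical token counts for the same text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# List prices as quoted in the paragraph above.
OPUS_4_7 = Pricing(input_per_mtok=5.00, output_per_mtok=25.00)
GPT_5 = Pricing(input_per_mtok=1.25, output_per_mtok=10.00)


def request_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at list price."""
    total = input_tokens * p.input_per_mtok + output_tokens * p.output_per_mtok
    return total / 1_000_000


# Hypothetical request size, for illustration only.
opus = request_cost(OPUS_4_7, input_tokens=20_000, output_tokens=4_000)
gpt5 = request_cost(GPT_5, input_tokens=20_000, output_tokens=4_000)
print(f"Opus 4.7: ${opus:.3f}  GPT-5: ${gpt5:.3f}  ratio: {opus / gpt5:.2f}x")
```

At that request shape Opus costs roughly three times as much per call. The ratio moves with the input/output mix, since the output prices are closer together (2.5x) than the input prices (4x).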
One framing note before the seven categories: treat this as a directional comparison rather than an exhaustive evaluation. A different seven categories would reorder some of the verdicts. The five-one-one tally is a useful summary of where the two labs have aimed their models, and it's worth holding loosely.
Category one: refactor a class hierarchy
The workload: take a production class with five derived types and crosscutting concerns, name the architectural smell, propose a refactor, and produce the new files. The closest public proxy for this kind of work is SWE-bench Verified, which scores fixes to real GitHub issues. Anthropic reports Opus 4.7 at 87.6% on the verified subset, against OpenAI's reported figure for GPT-5 around 74.9%. A gap that wide lines up with the consistent community discussion of where each model lands on production refactoring work.
What you should expect from each model on this kind of task, based on the public record: Opus tends to name the actual architectural smell (the leaky base-class surface, not the inheritance) and propose a split that reads like the call a senior engineer would've made. GPT-5 produces correct code that is more verbose and tends to add scaffolding the prompt did not ask for. Both ship working solutions. Opus's needs less cleanup before it lands in your codebase.
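For readers who want a picture of what "the leaky base-class surface, not the inheritance" can mean, here is a minimal sketch. Every name in it (Exporter, CsvExporter, JsonLineExporter, Audited) is hypothetical and not taken from either model's output. It shows one common shape of the fix: shrink what callers depend on to a narrow interface, and move a crosscutting concern out of the base class into a wrapper.

```python
import json
from typing import Callable, Protocol


class Exporter(Protocol):
    """Narrow surface: the only method callers are allowed to depend on."""

    def export(self, record: dict) -> str: ...


class CsvExporter:
    def export(self, record: dict) -> str:
        return ",".join(str(v) for v in record.values())


class JsonLineExporter:
    def export(self, record: dict) -> str:
        return json.dumps(record, sort_keys=True)


class Audited:
    """Crosscutting concern as a wrapper rather than a base-class helper.

    Concrete exporters no longer inherit (or reach into) audit plumbing;
    any Exporter can be wrapped without knowing it is being audited.
    """

    def __init__(self, inner: Exporter, sink: Callable[[str], None]) -> None:
        self._inner = inner
        self._sink = sink

    def export(self, record: dict) -> str:
        out = self._inner.export(record)
        self._sink(f"exported {len(out)} chars")
        return out


audit_log: list = []
exporter = Audited(CsvExporter(), audit_log.append)
result = exporter.export({"id": 7, "name": "ada"})
print(result, audit_log)
```

The derived types stay, which is the point of the distinction: the smell in this sketch is what the base exposes, not the fact that there are several concrete types.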
Winner: Claude, with the gap consistent with the SWE-bench rank order.
Category two: write a marketing landing page
The workload: produce a single-page marketing site for a young technical audience. Vanilla HTML and CSS, mobile-first, bold typographic hierarchy. OpenAI's positioning for GPT-5 explicitly emphasizes visual design and structured output as model strengths. The public community discussion of both models has converged on the same pattern: GPT-5's defaults look more contemporary, Opus's defaults look like an enterprise SaaS dashboard. Give both the same prompt and GPT-5 will usually hand back the more shippable layout.
The specifics behind the pattern: GPT-5 leans on a single confident accent color, oversize hero typography with tight letter-spacing, and quiet asymmetric grids. Opus produces a more cautious palette and a more conventional grid. Ask Opus to revise toward something bolder and you can feel the model trying, where GPT-5 arrives bold without being asked. That visual sensibility is what you go to OpenAI for, and since GPT-5 is also the cheaper model at list price, this category is an easy call.
Winner: GPT-5, and the margin is wide.
Category three: reasoning under uncertainty
The workload: a specific regulatory or policy question where the right answer needs familiarity with both the underlying statute and the trajectory of its enforcement. A good answer names the relevant article, separates the statute from how it actually gets enforced, and is honest about the limits of what a non-specialist can assert.
This is the workload Anthropic positions Opus on most heavily. The lab's release material talks about "hedging in the right places" — flagging uncertainty when the model's at the edge of its competence. The public discussion across Anthropic's forum and the broader research community is consistent: Opus produces a careful answer that separates what it is sure of from what you should still run past a human expert. GPT-5's default tone on the same question is confident, whether or not that confidence is earned.
The failure pattern you should plan around with GPT-5 here: a nearly-correct answer with one citation that is wrong in a way a non-specialist reader will not catch. That is the worst kind of failure for this workload, because it survives a casual review. Opus's instinct to hedge is what guards against that class of error.
Winner: Claude. The hedging is part of the value.
Category four: a constrained instruction-following task
The workload: an everyday task with a quantitative constraint, such as a time budget or a recipe built from a fixed set of ingredients. The interesting question is which model respects the constraint without adding scope.
Both models handle the substance fine. The pattern that recurs across the public community discussion is that GPT-5 tends to add optional flourishes — an extra step, a note on substitutions — that push it over the constraint. Opus respects the constraint by default and asks before adding scope. When you want a quick task done inside the budget you specified, that is the right behavior. When you want a richer answer that explores the space, GPT-5's tendency to elaborate is the feature.
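If you want to check constraint compliance mechanically rather than by eye, a minimal sketch follows. It assumes you have already parsed a model's answer into steps of (minutes, ingredients used); the parsing, the function name check_constraints, and the pantry contents are all hypothetical.

```python
def check_constraints(steps, allowed, budget_minutes):
    """Report whether a recipe stays inside the time budget and the pantry.

    steps: list of (minutes, set_of_ingredient_names) tuples.
    allowed: set of ingredient names the prompt permitted.
    """
    total = sum(minutes for minutes, _ in steps)
    used = {name for _, ingredients in steps for name in ingredients}
    return {
        "total_minutes": total,
        "over_budget": total > budget_minutes,
        "unlisted": sorted(used - set(allowed)),
    }


# Hypothetical prompt constraints: three ingredients, fifteen minutes.
PANTRY = {"eggs", "butter", "bread"}

BASE_STEPS = [(3, {"butter"}), (5, {"eggs", "butter"}), (4, {"bread"})]
# The same answer with one optional flourish appended.
WITH_FLOURISH = BASE_STEPS + [(5, {"chives"})]

print(check_constraints(BASE_STEPS, PANTRY, budget_minutes=15))
print(check_constraints(WITH_FLOURISH, PANTRY, budget_minutes=15))
```

The second call shows how a single added step can break both constraints at once: the total goes from 12 to 17 minutes and an unlisted ingredient appears.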
Edge to Claude on constraint compliance. Call it a tie if you value the elaboration.
Category five: summarize a long technical paper
The workload: take a 60-page technical paper and produce a 1500-word summary for an engineer who knows the basics but has not read it. There are two reasonable ways to structure this: by section (walk the paper in order) or by claim (identify the contributions and pull the supporting experiments into each one).
The pattern across the public community discussion: Opus tends to structure by claim, GPT-5 by section. The right answer is whichever structure fits the audience the prompt named. An engineer who mostly wants the takeaways is better served claim-first; an academic reviewer who wants section-by-section coverage is better served the other way. So the category goes to whichever model's default matches your reader, and for the prompt above — engineer, what to take from it — that is Opus.
The secondary risk on long technical summaries, across both models, is that the experimental section gets compressed to one sentence when its substance carries the paper's broader argument. GPT-5 is somewhat more prone to that on complex experimental sections, while Opus is more likely to err toward dry prose. Pick your trade.
Winner: Claude when the audience is "engineer, what to take from it."
Category six: a difficult customer email
The workload: a real-feeling customer-service scenario. A paying customer is upset. A bug took longer than it should have to fix. A downstream side effect hit something the customer cared about. Draft a reply that takes responsibility, explains the situation without excuses, offers a concrete remedy, and reads like a human wrote it.
GPT-5 has a reputation for warm prose. On this specific workload, the pattern across the public community discussion is that Opus produces a draft that reads like a person wrote it rather than a brand. GPT-5's defaults lean toward corporate apology language ("we sincerely apologize for the inconvenience"). Opus's defaults lean toward direct, plain-language apology ("I owe you an apology, and an explanation that doesn't try to dodge any of this"). The latter is what a small team should send.
GPT-5 will get there on a second pass when you spell out the tone constraint, where Opus tends to land it first try. On a workload where tone is the deliverable, that first-attempt difference is what you are paying for.
Winner: Claude. Tone discipline is the work here.
Category seven: debug a broken Python script
The workload: a 140-line Python script with four deliberately-introduced bugs. Three obvious. One subtle (an async ordering bug that only surfaces under specific call patterns). The interesting question is which model flags the subtle one without being asked.
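To show what an async ordering bug that only surfaces under specific call patterns can look like, here is a minimal sketch. It is not the script from this test; the TokenCache class and its methods are hypothetical. The bug is a check-then-act across an await: sequential callers never see it, concurrent callers do.

```python
import asyncio


class TokenCache:
    def __init__(self) -> None:
        self.token = None
        self.fetches = 0
        self._lock = asyncio.Lock()

    async def _fetch(self) -> str:
        self.fetches += 1
        n = self.fetches
        await asyncio.sleep(0)  # stand-in for network I/O; yields to other tasks
        return f"token-{n}"

    async def get_buggy(self) -> str:
        # Check and assignment are separated by an await, so a second caller
        # can pass the check before the first caller has stored its result.
        if self.token is None:
            self.token = await self._fetch()
        return self.token

    async def get_fixed(self) -> str:
        # Holding a lock across the check and the fetch closes the window.
        async with self._lock:
            if self.token is None:
                self.token = await self._fetch()
        return self.token


async def demo():
    sequential = TokenCache()
    await sequential.get_buggy()
    await sequential.get_buggy()

    concurrent = TokenCache()
    await asyncio.gather(concurrent.get_buggy(), concurrent.get_buggy())

    fixed = TokenCache()
    await asyncio.gather(fixed.get_fixed(), fixed.get_fixed())

    return sequential.fetches, concurrent.fetches, fixed.fetches


print(asyncio.run(demo()))  # (1, 2, 1)
```

Called one after another, the buggy path fetches once and looks correct. Called concurrently, it fetches twice and hands the two callers different tokens, which is why this class of bug tends to pass a casual read and a simple test.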
This is the category most directly tied to SWE-bench Verified, and the rank order tracks the benchmark. Opus's documented behavior is to flag ambiguous code as a question ("is this the intended choice?") rather than silently overwriting. That instinct is what catches subtle production bugs on the first pass. GPT-5 is more likely to fix the three obvious bugs cleanly and miss the fourth, then identify it correctly only when prompted to look at the async section again.
First-pass behavior is what matters here, because in a production debugging workflow you usually do not know what you are missing. Opus's instinct to ask about ambiguous code is what stops the bug that would otherwise ship quietly.
Winner: Claude, by the bug it bothered to flag.
The scoreboard, in prose
Pulling the seven verdicts together: Claude takes the class-hierarchy refactor on architectural taste, the regulatory question on honest hedging, the paper summary on structure choice for the named audience, the difficult email on tone work, and the Python debug by flagging the subtle async bug. The constrained recipe leans Claude on instruction compliance and is a tie if you value the elaboration. GPT-5 takes the marketing landing page decisively on visual sensibility. Seven categories, five-one-one on the scoreboard, with two of Claude's wins narrow enough that a different sample could move them.
The scoreboard looks one-sided. The lived experience is closer, because the one category where GPT-5 wins matters a lot to some readers.
Refactor: Claude (architectural taste)
Landing page: GPT-5 (visual sensibility)
Reasoning: Claude (honest hedging)
Recipe: Claude, by an edge (constraint compliance)
Paper summary: Claude (audience fit)
Hard email: Claude (tone work)
Debug script: Claude (flagged the subtle bug)
Technical correctness, visual design, tone work, structured output, long context, or breadth.
Anthropic and OpenAI both publish detailed positioning. The lab tells you where the model fits.
SWE-bench Verified for coding. LMArena for general capability. Community discussion for the rest.
The benchmark gives you a rank order. Your workload gives you the truth.
Which one for which work
For code with non-trivial structure, default to Claude. The refactor sense, the bug-flagging instinct, and the willingness to ask a clarifying question instead of silently overwriting are documented strengths, and they show up in production every day.
Anything visual — landing pages, slide decks, dashboard mockups, work where taste matters — is worth trying on GPT-5 first. Its defaults are closer to what a contemporary audience expects, and you will spend less time reshaping the output.
Legal, medical, or financial analysis, where a confident wrong answer does more damage than an openly uncertain one, belongs with Claude. The honest hedging is the right fit.
For non-English writing where tone matters (dialect, audience-specific voice), the pick depends on the language: Claude wins on Arabic more often than the leaderboard would predict, while GPT-5 keeps an edge on most other world languages. The Arabic content piece walks through that specific axis.
And if your workload is high-volume and low-latency, with price-per-token dominating the decision, neither of these is your model. Drop to Claude Sonnet 4.6 or GPT-5 Mini and compare those instead. The small-models piece covers when to drop a tier.
The single recommendation that holds across most teams: subscribe to Claude Opus 4.7 as the default and keep a paid OpenAI key for visual design and the occasional case where GPT-5's voice fits the job. Running both costs a small premium over committing to one, and for most teams the range of work it covers more than pays for itself.