Review·May 2026

Claude Opus 4.8, reviewed

Same price as 4.7, a small leaderboard bump, and one benchmark it loses. The upgrade worth caring about is honesty.

By the benchr team · Reviewed May 30, 2026 · Figures verified against official sources, 30 May 2026

benchr rating: 4.7 / 5

SWE-bench Pro: 69.2% (Anthropic-reported, up from 64.3% on 4.7)

Standard price per 1M tokens: $5 / $25 ($5 in, $25 out, unchanged from 4.7; optional fast mode is $10 / $50)

GDPval-AA (Elo): 1890 (vs 1769 for GPT-5.5 on knowledge work)

Code-flaw honesty: 4× (less likely than 4.7 to let a flaw pass)

Anthropic shipped Claude Opus 4.8 on May 28, 2026, per its launch announcement. What's unusual is the timing: Opus 4.7 landed in mid-April, so this is about a six-week turnaround, faster than the company's usual cadence. The price didn't move. Standard rates are still $5 per million input tokens and $25 per million output, the same numbers Opus 4.7 carried.

So this is not a generational leap, and Anthropic isn't pitching it as one. It's an in-place refresh: the leaderboard ticks up a few points, the model gets steadier at long agentic work, and the rough edges get filed down. The interesting part is buried under the benchmark chart, and it's the thing that's hardest to measure: the model got more honest.

Below is what the public evidence says, sourced to Anthropic's announcement and the Claude model documentation, plus where the numbers come with a footnote you should read before quoting them.

The leaderboard moved a little

Opus 4.8 wins six of the seven benchmarks Anthropic put in its launch table. The biggest gain is on SWE-bench Pro, the harder, less-saturated cousin of the coding benchmark everyone cites: 69.2% against 64.3% for Opus 4.7. That's a benchmark where GPT-5.5 is reported at 58.6% and the strongest Gemini lands lower still, so Opus 4.8 isn't just beating its own past self there, it's well clear of the field.

Opus 4.8 vs Opus 4.7 on Anthropic's launch benchmarks, May 2026
Benchmark	Opus 4.8	Opus 4.7
SWE-bench Pro (agentic coding)	69.2%	64.3%
SWE-bench Verified	88.6%	87.6%
Humanity's Last Exam (no tools)	49.8%	46.9%
GPQA Diamond	93.6%	94.2%

Read that GPQA Diamond row honestly: Opus 4.8 dips six-tenths of a point against 4.7. That benchmark is near the ceiling, where a swing that size is inside the noise and tells you nothing about which model is smarter. Same story on SWE-bench Verified, where the one-point bump to 88.6% is a saturation artifact more than a capability story. benchr has argued that the saturated benchmarks have stopped discriminating between frontier models, and this launch is a clean example.
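To put a rough number on "inside the noise", here is a minimal sketch. It assumes each benchmark item is an independent pass/fail trial scored in a single run, which is a simplification and not how Anthropic reports uncertainty. The item counts (198 questions in GPQA Diamond, 500 tasks in SWE-bench Verified) come from the benchmarks' public descriptions, not from the launch table.

```python
import math


def binomial_standard_error(score_pct: float, n_items: int) -> float:
    """Standard error, in percentage points, of a pass rate over n_items trials."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_items)


# (benchmark, Opus 4.8 score, Opus 4.7 score, assumed item count)
rows = [
    ("GPQA Diamond", 93.6, 94.2, 198),
    ("SWE-bench Verified", 88.6, 87.6, 500),
]

results = {}
for name, new_score, old_score, n_items in rows:
    delta = round(new_score - old_score, 1)
    std_err = binomial_standard_error(new_score, n_items)
    results[name] = (delta, std_err)
    print(f"{name}: delta {delta:+.1f} pts, one standard error ~{std_err:.1f} pts")
```

Under those assumptions, one standard error is roughly 1.7 points on GPQA Diamond and 1.4 on SWE-bench Verified, so both gaps in the table sit inside it.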

On the agent-style numbers, Opus 4.8 posts 83.4% on OSWorld-Verified, the highest score on that computer-use benchmark. One caveat you should know: Anthropic updated the test harness for this round and restated Opus 4.7's score upward into the low 80s, so a chunk of the apparent jump is the new scoring setup, not the model. The company says as much in the announcement, which is the right way to report it. On knowledge work, the GDPval-AA Elo of 1890 is a clearer signal, sitting 121 points ahead of GPT-5.5.
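For a sense of what a 121-point gap means, here is a sketch that assumes GDPval-AA uses the conventional logistic Elo scale with a 400-point divisor. Nothing quoted in this review confirms that scale, so treat the output as illustrative rather than as a reported result.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    The 400-point scale is an assumption about how GDPval-AA is computed.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


opus_rating = 1890  # Opus 4.8, from the launch figures
gpt_rating = 1769   # GPT-5.5, from the launch figures

gap = opus_rating - gpt_rating
expected = elo_expected_score(opus_rating, gpt_rating)
print(f"gap = {gap} points, expected head-to-head score = {expected:.3f}")
```

If the scale assumption holds, that gap corresponds to Opus 4.8 being preferred in roughly two of every three head-to-head comparisons.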

Where it loses

There's exactly one benchmark in the table where Opus 4.8 comes second, and it's worth naming. On Terminal-Bench 2.1, the test for command-line agent loops, Opus 4.8 scores 74.6% and GPT-5.5 leads at 78.2%. If your daily workload is a model driving a shell, running commands, reading output, and iterating, that's the one place a competitor still has the edge.

It's a narrow gap and a narrow workload, but it's real, and it's the cleanest reason to not assume Opus 4.8 wins everything. For most coding, the agentic-coding benchmarks that Opus dominates are the better proxy. For pure terminal-driving, test both on your own setup before you commit.

Honesty is the upgrade that matters

The number that should change how you work isn't on the leaderboard. Anthropic reports that Opus 4.8 is around four times less likely than 4.7 to let a flaw in code pass without flagging it. Misaligned behavior, the kind where a model goes along with a bad plan or papers over a problem to look helpful, drops to rates the company puts near those of its restricted Mythos preview model.

In practice it shows up as a model that asks the right question before it writes, catches the off-by-one you missed, and pushes back when the plan you handed it doesn't hold together. That's a different relationship than "fast autocomplete that agrees with you." It also lines up with where the field is heading, away from raw capability and toward whether you can hand the thing a task and walk away. benchr's field report on AI agents kept running into the same wall: capability was never the blocker, trust was.

What shipped alongside it

The model came with a few platform changes worth knowing. The headline one is Dynamic Workflows, a research-preview feature in Claude Code where Opus 4.8 plans a big task, fans out many parallel subagents that attack it from independent angles, and has them check each other before it reports back. It's pitched at jobs the size of a full-codebase migration, the kind of work that doesn't fit in one context window.
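The plan, fan-out, and cross-check shape is easier to see in code. What follows is a toy sketch of that pattern only, not Claude Code's implementation: `run_subagent` is a hypothetical stub that returns canned answers instead of calling a model, and the majority vote is one assumed way subagents could "check each other".

```python
import asyncio
from collections import Counter


async def run_subagent(angle: str, task: str) -> str:
    """Hypothetical stand-in for one subagent; a real one would call a model."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    # Canned answers so the sketch is deterministic: one angle disagrees.
    return "rewrite-module" if angle == "rewrite" else "rename-in-place"


async def fan_out_and_check(task: str, angles: list[str]) -> tuple[str, float]:
    """Run one subagent per angle concurrently, then keep the majority proposal."""
    proposals = await asyncio.gather(*(run_subagent(a, task) for a in angles))
    winner, votes = Counter(proposals).most_common(1)[0]
    return winner, votes / len(proposals)


answer, agreement = asyncio.run(
    fan_out_and_check("migrate logging calls", ["grep", "ast", "rewrite"])
)
print(f"chosen plan: {answer} (agreement {agreement:.0%})")
```

The design point is that the subagents run independently and the orchestrator only reports once their outputs have been compared; a real system would presumably use something richer than a vote.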

Two smaller changes matter for anyone building on the API. Effort control is now exposed in claude.ai and Cowork, not just the API, so you can dial reasoning depth up or down by hand. And the Messages API will now accept an updated system instruction part-way through a long task without restating the whole prompt, which keeps your prompt cache hits intact and cuts the input cost on long agent loops.

What it costs, and the fast-mode math

Nothing changed at the standard tier. You pay $5 per million input and $25 per million output, with cached input at a tenth of that and a 50% batch discount on async jobs. The new lever is fast mode: for double the per-token rate, $10 input and $50 output, the model runs at roughly 2.5× the output speed. The headline there is that fast mode is three times cheaper than it was on the last Opus generation, where the same speed-up cost $30 and $150.
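The arithmetic is simple enough to check by hand, but a small calculator makes the trade-offs concrete. This is a sketch using only the rates quoted above; the function name, the `cached_fraction` parameter, and the choice to apply the batch discount to the whole bill (including cached input) are assumptions for illustration, not Anthropic's billing rules.

```python
# Per-million-token rates in USD, as quoted in this review.
STANDARD = {"input": 5.00, "output": 25.00}
FAST = {"input": 10.00, "output": 50.00}
PREVIOUS_FAST = {"input": 30.00, "output": 150.00}

CACHED_INPUT_MULTIPLIER = 0.10  # cached input billed at a tenth of the input rate
BATCH_MULTIPLIER = 0.50         # 50% discount on async batch jobs


def job_cost(input_tokens, output_tokens, rates, cached_fraction=0.0, batch=False):
    """Estimated cost in USD for one job (hypothetical helper)."""
    millions_in = input_tokens / 1_000_000
    millions_out = output_tokens / 1_000_000
    cached = millions_in * cached_fraction
    fresh = millions_in - cached
    cost = (
        fresh * rates["input"]
        + cached * rates["input"] * CACHED_INPUT_MULTIPLIER
        + millions_out * rates["output"]
    )
    if batch:
        cost *= BATCH_MULTIPLIER  # assumption: discount applies to the whole bill
    return round(cost, 2)


# Example job: 2M input tokens, 0.5M output tokens.
standard = job_cost(2_000_000, 500_000, STANDARD)
fast = job_cost(2_000_000, 500_000, FAST)
batch = job_cost(2_000_000, 500_000, STANDARD, batch=True)
old_fast = job_cost(2_000_000, 500_000, PREVIOUS_FAST)
cached = job_cost(2_000_000, 500_000, STANDARD, cached_fraction=0.8)
print(standard, fast, batch, old_fast, cached)
```

For that example job the same work costs $22.50 at standard rates, $45.00 in fast mode, and $11.25 as a batch, against $135.00 at the previous generation's fast rates, which is the 3× gap.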

3×: how much cheaper Opus 4.8's fast mode is than the previous generation's, at the same speed-up.

Fast mode buys you latency, not a smarter model, so reach for it on interactive work where a user is waiting, not on batch jobs where you'd rather take the 50% discount. For the broader question of which Anthropic tier to default to, the Sonnet 4.6 review works through where the cheaper model is the right call. Opus is the tier you escalate to, not the one you run everything on.

The verdict

Claude Opus 4.8 is the easiest upgrade Anthropic has shipped in a while, because it asks nothing of you. The price is identical, the API is drop-in compatible with 4.7, and the numbers go up. For coding and knowledge work, move to it and don't think twice. The honesty gain alone justifies the switch on any codebase where a missed bug is expensive.

Stick with 4.7 in two cases. If your workload is terminal command-line loops, the gain from 4.7 to 4.8 is small and a rival still leads that benchmark, so there's no rush. And if you're already pinned near the top of saturated benchmarks where 4.7 and 4.8 are within a point, you won't feel the difference. Everyone else: the upgrade is free in every sense that counts. For the cross-vendor picture, the GPT-5 head-to-head still frames where each lab leads, and the terminal result here is the one line that's shifted.

Frequently asked

Is Claude Opus 4.8 worth upgrading to from Opus 4.7?

If your work is coding or knowledge work, yes. The price is identical at $5/$25 and the benchmark numbers move up almost everywhere, the exception being a within-noise dip on GPQA Diamond, with no migration cost. If your work is terminal-heavy command-line loops, the gain is smaller and GPT-5.5 still leads that one benchmark. For most users the upgrade is a free improvement.

What does Claude Opus 4.8 cost?

Standard pricing is $5 per million input tokens and $25 per million output, unchanged from Opus 4.7. The optional fast mode runs at about 2.5× the output speed for $10 input and $50 output per million, three times cheaper than fast mode was on the previous Opus generation.

How much better is Opus 4.8 at coding than 4.7?

On SWE-bench Pro it scores 69.2% against 64.3% for Opus 4.7, a real jump on the harder agentic coding benchmark. On the more saturated SWE-bench Verified it's 88.6% against 87.6%, close to the ceiling where the gap stops meaning much.

What is the honesty improvement in Claude Opus 4.8?

Anthropic reports the model is around four times less likely than Opus 4.7 to let a flaw in code pass without flagging it, alongside lower rates of misaligned behavior. In practice it's better at catching its own mistakes and pushing back when a plan is wrong, which matters more for production work than a one-point benchmark bump.

What are Dynamic Workflows in Claude Opus 4.8?

A research-preview feature in Claude Code where the model plans a large task, spawns many parallel subagents that approach the problem from independent angles, and checks their outputs against each other before reporting back. It's aimed at jobs the size of a whole-codebase migration.

Changelog

May 30, 2026 — Originally published, two days after the model's release. Pricing, benchmark scores, and feature claims verified against Anthropic's launch announcement and the Claude model documentation.

References

Anthropic, "Introducing Claude Opus 4.8," anthropic.com/news/claude-opus-4-8, May 28, 2026.
Anthropic, "What's new in Claude Opus 4.8," platform.claude.com, accessed May 2026.
Anthropic, "Models overview," platform.claude.com, accessed May 2026.
Anthropic, "Pricing," platform.claude.com, accessed May 2026.
"SWE-bench leaderboards," swebench.com, May 2026.