Review · June 2026

GPT-5.4, reviewed: the value pick OpenAI doesn't advertise

It held the flagship crown for seven weeks. Reviewing it after the crown moved is when the price-performance story gets honest.

By the benchr team · Published June 10, 2026 · Updated July 29, 2026 · Figures re-verified against OpenAI's official release and API model page, July 29, 2026


Per 1M tokens: cached input $0.25 · output $15
OSWorld-Verified: 75.0% (OpenAI release evaluation)
Context: 1,050,000 tokens (current API model page)
Maximum output: 128,000 tokens (for long deliverables)

GPT-5.4 is easiest to misunderstand when it is treated as a generic chat model. OpenAI's release positions it around professional work, coding, tool use, and native computer interaction. That combination is the product: inspect a working set, decide which tool or interface matters, act, and return a reviewable artifact. A team evaluating only short question answering will miss the reason to pay for it.

The decision is about workflow shape

The strongest case is not “our prompts are difficult.” It is “our work crosses boundaries.” A procurement analyst may read contracts, update a workbook, check a portal, and write a recommendation. A software agent may inspect a repository, use a shell, browse an internal dashboard, and prepare a patch summary. GPT-5.4 is a credible candidate when those steps need one stateful model. If the task is a single classification, extraction, or templated rewrite, route it to a cheaper lane and reserve GPT-5.4 for the cases that need reasoning plus action.
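The routing rule above can be sketched in a few lines. This is a minimal illustration, not a recommended production router: the `Task` fields, the lane names `"flagship"` and `"small"`, and the more-than-one-step threshold are all assumptions introduced here, and a real router would use whatever signals your own request metadata provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of one unit of work to be routed."""
    needs_tools: bool          # shell, spreadsheet, or other tool calls
    needs_computer_use: bool   # desktop or browser interaction
    steps: int                 # rough count of dependent steps


def choose_lane(task: Task) -> str:
    """Return 'flagship' (standing in for GPT-5.4) or 'small' (a cheaper model).

    Assumption: work "crosses boundaries" when it needs tools or computer
    use AND spans more than one dependent step.
    """
    crosses_boundaries = task.needs_tools or task.needs_computer_use
    if crosses_boundaries and task.steps > 1:
        return "flagship"
    return "small"


if __name__ == "__main__":
    # A procurement-style workflow: contracts, workbook, portal, write-up.
    print(choose_lane(Task(needs_tools=True, needs_computer_use=True, steps=4)))
    # A single templated rewrite.
    print(choose_lane(Task(needs_tools=False, needs_computer_use=False, steps=1)))
```

Keeping the rule this explicit makes it easy to audit which requests were sent to the expensive lane and why.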

Decision matrix for a GPT-5.4 pilot
Workload	Fit	Evidence to collect	Primary risk
Desktop or browser operation	Strong candidate	Completed state, correct target, recovery after a changed screen	A benchmark pass does not guarantee reliability in your interface
Documents, slides, and spreadsheets	Strong when tools are involved	Reviewer edits, formula integrity, citation trace, formatting survival	A polished artifact can still contain a factual or structural error
Repository-scale coding	Test before routing broadly	Accepted patches, tests passed, regressions, tool-call loops	OpenAI reports SWE-Bench Pro, not SWE-bench Verified
Short, repetitive API work	Usually avoid	Cost and latency against a smaller control model	Paying for capabilities the request never uses

What the published record does—and does not—prove

OpenAI reports 83.0% on GDPval (wins or ties), a score of 57.7 on SWE-Bench Pro (Public), 75.0% on OSWorld-Verified, 82.7% on BrowseComp, and 54.6% on Toolathlon. Those figures point in the same direction: professional deliverables, browsing, computer use, and multi-tool execution. They remain provider-published evaluations, with task definitions and settings that may not match your environment.

The naming matters. SWE-Bench Pro (Public) is not SWE-bench Verified. A purchasing sheet that moves the published 57.7 result into a “Verified” column would create false precision. Keep the official label, then add your own repository test below it.
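One way to keep labels from drifting in a purchasing sheet is to store each published figure under its exact official name and refuse fuzzy lookups. The sketch below is illustrative only: the `PublishedResult` record and `lookup` helper are hypothetical names, and the two entries are the figures quoted in this review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PublishedResult:
    benchmark: str   # official label, copied verbatim from the provider
    score: float
    source: str


# Figures as quoted in this review; labels are kept exactly as published.
RESULTS = [
    PublishedResult("SWE-Bench Pro (Public)", 57.7, "OpenAI release"),
    PublishedResult("OSWorld-Verified", 75.0, "OpenAI release"),
]


def lookup(benchmark: str) -> Optional[PublishedResult]:
    """Exact-match lookup; returns None instead of a similarly named benchmark."""
    for result in RESULTS:
        if result.benchmark == benchmark:
            return result
    return None


if __name__ == "__main__":
    print(lookup("SWE-Bench Pro (Public)"))
    # No fallback to a near-match: the Verified column stays empty.
    print(lookup("SWE-bench Verified"))
```

An empty cell for SWE-bench Verified is the accurate record; your own repository test fills the row beneath it.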

Run a three-part workload gate

Use the same permissions, system prompt, tools, and retry policy that production will use. Keep every prompt, tool trace, final artifact, reviewer correction, and billed token total. The goal is not to manufacture a single score; it is to expose where the model creates or removes operational work.

Interface task: ask the agent to complete a reversible desktop or browser workflow in a staging account. Change one label or screen position between runs. Record whether it notices the change, chooses the correct target, and stops at the confirmation boundary.
Artifact task: supply a real redacted source packet and request a spreadsheet, document, or presentation that follows an existing house style. Review formulas, references, omissions, and the amount of manual cleanup—not just visual polish.
Repository task: give it a bounded bug that requires reading several files, running tests, and explaining the patch. Compare accepted-result cost and reviewer time with your current coding model. Do not substitute OpenAI's SWE-Bench Pro figure for this test.
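The three tasks above only pay off if each run is recorded in a comparable shape. The following is a minimal sketch of such a record and a per-task summary. The `PilotRun` fields and the `summarize` helper are assumptions made for illustration; a real pilot would also keep the prompts, tool traces, and artifacts alongside these numbers.

```python
from dataclasses import dataclass


@dataclass
class PilotRun:
    task_kind: str           # "interface", "artifact", or "repository"
    accepted: bool           # did a reviewer accept the final result?
    billed_usd: float        # billed cost for the whole run, retries included
    reviewer_minutes: float  # time spent checking and correcting the output


def summarize(runs):
    """Group runs by task kind and report acceptance, cost, and review time."""
    summary = {}
    for kind in sorted({r.task_kind for r in runs}):
        group = [r for r in runs if r.task_kind == kind]
        accepted = sum(1 for r in group if r.accepted)
        total_cost = sum(r.billed_usd for r in group)
        summary[kind] = {
            "runs": len(group),
            "accepted": accepted,
            # Failed runs still cost money, so divide total cost by accepted results.
            "cost_per_accepted": total_cost / accepted if accepted else None,
            "avg_reviewer_minutes": sum(r.reviewer_minutes for r in group) / len(group),
        }
    return summary


if __name__ == "__main__":
    runs = [
        PilotRun("repository", True, 2.0, 10),
        PilotRun("repository", False, 1.0, 20),
        PilotRun("interface", True, 0.5, 5),
    ]
    for kind, stats in summarize(runs).items():
        print(kind, stats)
```

Dividing total spend by accepted results, not by all runs, keeps rejected attempts from flattering the cost figure.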

Long context is a capacity ceiling, not a quality promise. The model page lists 1,050,000 tokens, but OpenAI describes different usage economics above the standard 272K window. Split the pilot into ordinary and long-context buckets so an attractive average does not hide the expensive tail. For rate-card examples, use the GPT-5.4 pricing breakdown.
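Splitting the pilot into buckets is simple to automate. The sketch below assumes each run is logged as an `(input_tokens, billed_usd)` pair and that bucketing on input size is a good enough proxy for crossing the standard window; both are simplifications. The two constants come from the figures cited in this review, and the helper names are hypothetical.

```python
STANDARD_WINDOW_TOKENS = 272_000   # standard window cited in this review
MAX_CONTEXT_TOKENS = 1_050_000     # context limit listed on the model page


def bucket(input_tokens: int) -> str:
    """Classify a request by input size."""
    if input_tokens > MAX_CONTEXT_TOKENS:
        return "over_limit"
    if input_tokens > STANDARD_WINDOW_TOKENS:
        return "long_context"
    return "standard"


def cost_by_bucket(runs):
    """Aggregate billed cost per bucket from (input_tokens, billed_usd) pairs."""
    totals = {}
    for input_tokens, billed_usd in runs:
        entry = totals.setdefault(bucket(input_tokens), {"runs": 0, "total_usd": 0.0})
        entry["runs"] += 1
        entry["total_usd"] += billed_usd
    for entry in totals.values():
        entry["avg_usd"] = entry["total_usd"] / entry["runs"]
    return totals


if __name__ == "__main__":
    sample = [(10_000, 0.5), (300_000, 4.0), (20_000, 1.5)]
    for name, stats in cost_by_bucket(sample).items():
        print(name, stats)
```

Reporting the two buckets side by side shows whether a handful of long-context requests account for most of the bill.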

Choose it / avoid it

Choose GPT-5.4 when…	Avoid or route elsewhere when…
One agent must combine analysis, tool calls, computer use, and a reviewable deliverable.	The task is short, deterministic, and already reliable on a smaller model.
Your documents or codebase can exceed an ordinary context window and retrieval alone is not enough.	Your compliance process requires an independently reproduced benchmark or an official SWE-bench Verified number.
You can sandbox actions, retain traces, and require confirmation for consequential steps.	You cannot inspect tool calls, contain permissions, or recover from a wrong interface action.

Frequently asked

What is the strongest reason to choose GPT-5.4?

Choose it when one workflow combines professional documents, tool calls, and direct computer interaction. OpenAI reports a 75.0% OSWorld-Verified result, while the current API page lists a 1,050,000-token context window and 128,000-token maximum output.

Does GPT-5.4 have an official SWE-bench Verified score?

No official SWE-bench Verified score is recorded for GPT-5.4. OpenAI publishes a 57.7 score on SWE-Bench Pro (Public), which is a different evaluation and should not be relabeled.

What should a GPT-5.4 pilot measure?

Measure accepted deliverables per dollar, tool-call recovery, computer-use completion, review time, and input size. Test long-context cases separately because requests above the standard 272K window have different usage economics.

Changelog

July 29, 2026 — Rebuilt the review around a production decision matrix and three-part workload gate; corrected the context record to the current 1,050,000-token API specification; added OpenAI's published GDPval, SWE-Bench Pro, BrowseComp, Toolathlon, and OSWorld evidence without relabeling provider evaluations.
June 10, 2026 — Published as a deliberate retrospective, three months after the March 5 launch. Pricing and context verified on OpenAI's pricing page; OSWorld, accuracy, and finance figures attributed to OpenAI's launch material; the SWE-bench gap flagged as an honest hole in the record.

References

OpenAI, Introducing GPT-5.4, March 5, 2026; evaluation table re-verified July 29, 2026.
OpenAI API, GPT-5.4 model documentation, token limits and standard token prices verified July 29, 2026.