OpenAI context_length_exceeded: meaning, cause, and fix

A 400 that's pure arithmetic: the window has to hold everything you send plus everything you've asked the model to write back, and OpenAI checks that sum before any work starts.

By the benchr team · · Verified against OpenAI's error documentation, June 12, 2026

OpenAIHTTP 400severity: mediumcontext

The arithmetic nobody does

Every request spends the context window from both ends. The input side is everything you send: system prompt, conversation history, retrieved chunks, the user's latest message. The output side is max_tokens, the completion budget the API sets aside before generating a single word. Input plus reservation has to fit inside the window, and OpenAI runs that check first.

The total rarely breaks in one jump. It creeps.

A chat app appends every exchange to history and never prunes. A retrieval pipeline stuffs twelve chunks into the prompt where five would answer the question. Somebody sets max_tokens generously because tokens you don't generate are free. Each decision is reasonable alone; the sum clears the window by a few thousand tokens — and every request after that point fails the same way.

What the API sends back

{
  "error": {
    "message": "This model's maximum context length is 400000 tokens. However, your messages resulted in 412031 tokens. Please reduce the length of the messages.",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}

The two numbers vary with the model and the request; the shape doesn't. Type stays invalid_request_error, code stays context_length_exceeded, and the message names the window beside your total, so the gap you need to close is printed right in the failure. Read it as a measurement, not a refusal.

Count before you send

tiktoken, OpenAI's Python tokenizer library, turns the overflow into something you catch in development instead of production. Encode the messages, add the reservation, and assert the sum:

# pre-flight: fail in dev, not in prod
import tiktoken

WINDOW = 400_000      # GPT-5's context window
RESERVED = 8_000      # your max_tokens setting

enc = tiktoken.encoding_for_model("gpt-5")
total = sum(len(enc.encode(m["content"])) for m in messages)

assert total + RESERVED <= WINDOW, (
    f"over budget: {total} prompt + {RESERVED} reserved > {WINDOW}"
)

Leave yourself headroom under the assert, since message formatting carries small per-message overhead that a raw encode misses. Precision isn't the point. The point is that an oversized request fails your tests, not your users.

When trimming isn't the answer

Pruning history and shrinking max_tokens solve the occasional overflow. They don't solve a workload that's bigger than the model.

GPT-5 gives you a 400K window with up to 128K of output for $1.25 in and $10 out per million tokens. GPT-5.5 stretches the window to 1,050,000 tokens at $5 and $30. GPT-5.4 offers a 1M window at $2.50 and $15 — most of the room for half the money. Claude Sonnet 4.6 matches the 1M window at $3 and $15 if you're open to leaving OpenAI for the long-context jobs. The context-window comparison puts these options against real document sizes so the choice takes minutes, not a sprint.

Frequently asked

Does max_tokens count toward the context limit?

Yes. The API reserves your full completion budget up front, so prompt tokens plus max_tokens have to fit the window together. A prompt that fits on its own can still fail once the reservation lands on top.

Why does it fail when my prompt looks short?

Tokens aren't words, and the prompt you see isn't the request you send. History rides along on every call, system prompts and retrieval chunks stack underneath, and code or non-English text can tokenize heavier than it reads.

Should I chunk the input or switch models?

Chunk when the task is retrieval-shaped and only slices of the document matter at a time. Switch when the model needs the whole document in view at once; a 1M-token window costs less than engineering around a 400K one you've outgrown.

Changelog

  • — Published. Status code, response shape, and the input-plus-reservation accounting verified against OpenAI's error-codes guide.

Sources

  • OpenAI error codes guide: developers.openai.com/api/docs/guides/error-codes (verified June 12, 2026)
  • benchr api-errors.json: the structured entry behind this page