The arithmetic nobody does
Every request spends the context window from both ends. The input side is everything you send: system prompt, conversation history, retrieved chunks, the user's latest message. The output side is max_tokens, the completion budget the API sets aside before generating a single word. Input plus reservation has to fit inside the window, and OpenAI runs that check first.
The total rarely breaks in one jump. It creeps.
A chat app appends every exchange to history and never prunes. A retrieval pipeline stuffs twelve chunks into the prompt where five would answer the question. Somebody sets max_tokens generously because tokens you don't generate are free, forgetting that the full reservation still counts against the window whether or not it gets used. Each decision is reasonable alone; the sum clears the window by a few thousand tokens, and every request after that point fails the same way.
What the API sends back
{
  "error": {
    "message": "This model's maximum context length is 400000 tokens. However, your messages resulted in 412031 tokens. Please reduce the length of the messages.",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}
The two numbers vary with the model and the request; the shape doesn't. Type stays invalid_request_error, code stays context_length_exceeded, and the message names the window beside your total, so the gap you need to close is printed right in the failure. Read it as a measurement, not a refusal.
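Reading the failure as a measurement can be automated. The sketch below pulls both numbers out of the message and returns the gap. It assumes the wording matches the example above; that wording is not a documented contract and other request shapes may phrase it differently, so the helper (a hypothetical name, `overflow_from_message`) returns None rather than guessing when the pattern doesn't match.

```python
import re
from typing import Optional

# Matches the two token counts in the message wording shown above.
# Assumption: the wording is stable enough to parse; it may not be.
_PATTERN = re.compile(
    r"maximum context length is (\d+) tokens.*?resulted in (\d+) tokens",
    re.DOTALL,
)


def overflow_from_message(message: str) -> Optional[int]:
    """Return how many tokens over the window the request was, or None."""
    match = _PATTERN.search(message)
    if match is None:
        return None  # unfamiliar wording; log the raw message instead
    limit, used = (int(group) for group in match.groups())
    return used - limit


example = (
    "This model's maximum context length is 400000 tokens. However, your "
    "messages resulted in 412031 tokens. Please reduce the length of the messages."
)
print(overflow_from_message(example))  # 12031
```

Logging that number next to the request ID tells you whether you're a few tokens over or a whole document over, which are different problems.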
Count before you send
tiktoken, OpenAI's Python tokenizer library, turns the overflow into something you catch in development instead of production. Encode the messages, add the reservation, and assert the sum:
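# pre-flight: fail in dev, not in prod
import tiktoken

WINDOW = 400_000    # GPT-5's context window
RESERVED = 8_000    # your max_tokens setting

try:
    enc = tiktoken.encoding_for_model("gpt-5")
except KeyError:
    # this tiktoken release doesn't know the model name;
    # o200k_base is assumed to be the matching encoding
    enc = tiktoken.get_encoding("o200k_base")

# messages: the list of {"role", "content"} dicts you are about to send;
# assumes every "content" is a plain string
total = sum(len(enc.encode(m["content"])) for m in messages)
assert total + RESERVED <= WINDOW, (
    f"over budget: {total} prompt + {RESERVED} reserved > {WINDOW}"
)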
Leave yourself headroom under the assert, since message formatting carries a small per-message overhead that a raw encode misses. Precision isn't the point. The point is that an oversized request fails your tests, not your users.
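One way to build that headroom in, sketched under loud assumptions: the per-message overhead and the safety margin below are placeholders, not published figures, and `fits_with_headroom` is a hypothetical helper. The token counter is passed in as a function so the check can wrap tiktoken in real use and a trivial stand-in in tests.

```python
from typing import Callable, Dict, List

# Assumed values: the real per-message overhead depends on the model.
# Treat both numbers as placeholders to tune, not as facts.
TOKENS_PER_MESSAGE = 4   # hypothetical formatting overhead per message
MARGIN_PERCENT = 2       # keep 2% of the window unused as headroom


def fits_with_headroom(
    messages: List[Dict[str, str]],
    count_tokens: Callable[[str], int],
    window: int,
    reserved: int,
) -> bool:
    """True when content, overhead, and the reservation fit under the margin."""
    content = sum(count_tokens(m["content"]) for m in messages)
    overhead = TOKENS_PER_MESSAGE * len(messages)
    usable = window - (window * MARGIN_PERCENT) // 100  # integer math, no rounding drift
    return content + overhead + reserved <= usable
```

With the tokenizer from the snippet above, the call would look like `fits_with_headroom(messages, lambda s: len(enc.encode(s)), 400_000, 8_000)`. Erring toward a false "doesn't fit" is the cheaper mistake here.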
When trimming isn't the answer
Pruning history and shrinking max_tokens solve the occasional overflow. They don't solve a workload that's bigger than the model.
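For the occasional overflow, pruning can be as simple as dropping the oldest turns until the budget fits. The sketch below is one way to do it, not the only one: `prune_history` is a hypothetical helper, it assumes system messages come first and plain-string content, and it takes the token counter as a parameter so any tokenizer can be plugged in. The budget is the window minus your max_tokens reservation.

```python
from typing import Callable, Dict, List


def prune_history(
    messages: List[Dict[str, str]],
    count_tokens: Callable[[str], int],
    budget: int,
) -> List[Dict[str, str]]:
    """Drop the oldest non-system messages until the total fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs: List[Dict[str, str]]) -> int:
        return sum(count_tokens(m["content"]) for m in msgs)

    # Always keep the newest message; drop from the front (oldest first).
    while len(rest) > 1 and total(system) + total(rest) > budget:
        rest.pop(0)
    return system + rest
```

Note what the loop cannot do: if the system prompt plus the newest message alone exceed the budget, the result is still over. That is the workload pruning doesn't fix.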
GPT-5 gives you a 400K window with up to 128K of output for $1.25 in and $10 out per million tokens. GPT-5.5 stretches the window to 1,050,000 tokens at $5 and $30. GPT-5.4 offers a 1M window at $2.50 and $15: most of GPT-5.5's room for half the money. Claude Sonnet 4.6 matches the 1M window at $3 and $15 if you're open to leaving OpenAI for the long-context jobs. The context-window comparison puts these options against real document sizes so the choice takes minutes, not a sprint.
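The choice can also be roughed out in code. This sketch copies the windows and prices from the paragraph above, reads "1M" as exactly 1,000,000 tokens, assumes flat per-token pricing, and ignores per-model output caps; all of that is assumption, so check current pricing pages before trusting the result. `request_cost` and `cheapest_fit` are hypothetical helpers.

```python
from typing import Optional

# name: (context window in tokens, $ per 1M input tokens, $ per 1M output tokens)
# Figures as quoted in this article; "1M" is assumed to mean 1,000,000.
MODELS = {
    "GPT-5": (400_000, 1.25, 10.00),
    "GPT-5.4": (1_000_000, 2.50, 15.00),
    "Claude Sonnet 4.6": (1_000_000, 3.00, 15.00),
    "GPT-5.5": (1_050_000, 5.00, 30.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, assuming flat per-token pricing."""
    _, price_in, price_out = MODELS[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def cheapest_fit(input_tokens: int, output_tokens: int) -> Optional[str]:
    """Cheapest listed model whose window holds input plus output, or None."""
    candidates = [
        name
        for name, (window, _, _) in MODELS.items()
        if input_tokens + output_tokens <= window
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda n: request_cost(n, input_tokens, output_tokens))


# The 412,031-token request from the error above, with an 8,000-token reservation:
print(cheapest_fit(412_031, 8_000))  # GPT-5.4
```

A None result means the workload is bigger than every option listed, and the fix has to come from the input side: chunking, summarizing, or retrieving less.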