Gemini DEADLINE_EXCEEDED: meaning, cause, and fix

The model didn't fail. The clock ran out before it finished, and the clock is usually yours to change.

By the benchr team · Verified against Google's Gemini API troubleshooting docs, June 12, 2026

Google Gemini · HTTP 504 · severity: low · timeout

Why the deadline dies

Two ingredients produce nearly every DEADLINE_EXCEEDED. First, input size: a huge prompt or a stuffed context needs more processing time than your client is willing to wait, so the connection gives up while the model is still working. Second, generation length: when you ask for a long answer and hold the line open for the complete response, the wait scales with the output, and an unstreamed call has no way to show progress before the cutoff. Neither one means Gemini is down. It means the time budget and the workload disagree.

What comes back

A representative 504 body, in Google's standard shape:

{
  "error": {
    "code": 504,
    "message": "The service is unable to finish processing within the deadline.",
    "status": "DEADLINE_EXCEEDED"
  }
}

Key on status being DEADLINE_EXCEEDED rather than on the prose, and don't lump it in with 429s: this is a per-request time failure, not a traffic ceiling.
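One way to key on the status field is sketched here, assuming the error body has already been read as a JSON string in the shape shown above. The function name classify_gemini_error and the labels it returns are illustrative choices, not part of any Google SDK.

```python
import json


def classify_gemini_error(body: str) -> str:
    """Return a coarse label for an error body in Google's standard shape."""
    try:
        error = json.loads(body).get("error", {})
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or JSON that is not an object at the top level.
        return "unknown"
    if not isinstance(error, dict):
        return "unknown"

    # Branch on the machine-readable status, never on the message prose.
    status = error.get("status")
    if status == "DEADLINE_EXCEEDED":
        return "timeout"      # per-request time failure
    if status == "RESOURCE_EXHAUSTED":
        return "rate_limit"   # traffic or quota ceiling
    return "other"
```

Keeping the two labels separate lets the caller route a timeout toward streaming or trimming, and a rate limit toward backoff.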

Three fixes, in order of dignity

Start with streaming, because it removes the wait instead of extending it. Move to a bigger timeout when you've decided a slow call is acceptable. Trim context when the input was bloated to begin with.

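# Python (google-genai)
from google import genai

# Placeholder input; substitute your real prompt.
long_prompt = "your long prompt here"

# Fix 2: a timeout you chose deliberately (milliseconds)
# The client picks up the API key from the environment.
client = genai.Client(http_options={"timeout": 120000})

# Fix 1: stream, so nothing waits on the full answer
for chunk in client.models.generate_content_stream(
    model="gemini-3.5-flash",
    contents=long_prompt,
):
    # chunk.text can be None for chunks that carry no text
    print(chunk.text or "", end="")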

The third fix happens before the request leaves your machine: summarize stale history, send the relevant slice of a document instead of the whole thing, and keep an eye on how much context each request really carries.
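A rough sketch of trimming stale history, assuming the conversation is a list of plain strings and that four characters per token is a tolerable estimate. Both are assumptions for illustration: trim_history is a hypothetical helper, and a real token count from the API would be more accurate than the character heuristic.

```python
def trim_history(
    messages: list[str],
    budget_tokens: int,
    chars_per_token: int = 4,  # crude estimate, not a real tokenizer
) -> list[str]:
    """Keep the most recent messages that fit within a rough token budget."""
    kept = []
    used = 0
    # Walk from newest to oldest so recent turns win the budget.
    for message in reversed(messages):
        cost = max(1, len(message) // chars_per_token)
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Anything that falls outside the budget is a candidate for summarizing rather than sending verbatim.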

When it's a size problem in disguise

Oversized input doesn't always announce itself as a 504. Google's docs note that a context too large for processing can surface as a 500 INTERNAL instead, with the same prescription: reduce the context, switch to a different model, or retry. If you're seeing both codes from the same pipeline, stop treating them as separate incidents and put the input on a diet. Give each request a context budget, chunk anything book-sized, and check what the major models can hold in the context-window comparison.

And if giant inputs are your daily reality rather than an edge case, price the workload against Gemini 3.5 Flash and its 1M-token window before you re-architect around the clock.
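The chunking advice above can be sketched as follows, assuming the document is plain text with blank lines between paragraphs and that the budget is expressed in characters. chunk_text is a hypothetical helper, and a character budget is only a stand-in for a token budget.

```python
def chunk_text(text: str, max_chars: int) -> list[str]:
    """Group paragraphs into chunks of at most max_chars where possible."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            # Still within budget: keep growing the current chunk.
            current = candidate
        else:
            # Over budget: close the current chunk and start a new one.
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than max_chars still comes back as its own oversized chunk; this sketch never splits inside a paragraph, so a stricter budget would need a second pass.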

Frequently asked

Is a 504 a rate limit?

No. Rate and quota problems answer as 429 RESOURCE_EXHAUSTED. A 504 is about time, not volume: the service couldn't finish this one request before the deadline, no matter how little traffic you're sending.

Will retrying help?

Only after you've changed something. Trim the context or raise the client timeout first; resending the identical oversized request tends to time out the identical way.
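A minimal sketch of retrying only after a change. Everything here is hypothetical scaffolding: send stands in for whatever function makes your request, shrink for whatever trims your input, and the built-in TimeoutError for whichever exception your client raises on a deadline.

```python
from typing import Callable


def retry_with_changes(
    send: Callable[[str, float], str],   # hypothetical: (prompt, timeout_s) -> reply
    prompt: str,
    timeout_s: float,
    shrink: Callable[[str], str],        # hypothetical: returns a smaller prompt
    max_attempts: int = 3,
) -> str:
    """Retry a timed-out call, but never resend the identical request."""
    for attempt in range(max_attempts):
        try:
            return send(prompt, timeout_s)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Change something before resending: smaller input, longer budget.
            prompt = shrink(prompt)
            timeout_s *= 2
    raise ValueError("max_attempts must be at least 1")
```

Doubling the timeout and trimming on every attempt is one arbitrary policy; doing only one of the two is equally valid as long as the retried request differs from the one that failed.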

Should every call stream?

Long outputs, yes: streaming is the standard way to avoid sitting on one long response until the clock kills it. Short calls finish well inside any sane timeout and don't need the extra plumbing.

Changelog

  • Published. Status shape, the large-input cause, and the timeout guidance verified against Google's Gemini API troubleshooting page.

Sources

  • Gemini API troubleshooting — ai.google.dev/gemini-api/docs/troubleshooting (verified June 12, 2026)
  • benchr api-errors.json — structured entry for this error