Why the deadline dies
Two ingredients produce nearly every DEADLINE_EXCEEDED. First, input size: a huge prompt or a stuffed context needs more processing time than your client is willing to wait, so the connection gives up while the model is still working. Second, generation length: when you ask for a long answer and hold the line open for the complete response, the wait scales with the output, and an unstreamed call has no way to show progress before the cutoff. Neither one means Gemini is down. It means the time budget and the workload disagree.
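To make "the time budget and the workload disagree" concrete, here is a back-of-the-envelope sketch. It assumes a simplified model in which input is processed at one rate and output is generated at another; the function names and the rates in the usage note are hypothetical placeholders, not measured Gemini figures.

```python
def estimated_seconds(input_tokens: int, output_tokens: int,
                      prefill_tokens_per_s: float,
                      decode_tokens_per_s: float) -> float:
    """Rough wall-clock estimate: time to read the input plus time to write the output."""
    return input_tokens / prefill_tokens_per_s + output_tokens / decode_tokens_per_s


def fits_deadline(input_tokens: int, output_tokens: int,
                  prefill_tokens_per_s: float,
                  decode_tokens_per_s: float,
                  timeout_s: float) -> bool:
    """True when the estimated work finishes inside the client's time budget."""
    estimate = estimated_seconds(input_tokens, output_tokens,
                                 prefill_tokens_per_s, decode_tokens_per_s)
    return estimate <= timeout_s
```

With made-up rates of 10,000 input tokens per second and 100 output tokens per second, a 100,000-token prompt with a 4,000-token answer comes to an estimated 50 seconds, so it would miss a 30-second timeout and fit a 60-second one. Swap in rates you have measured yourself before trusting the numbers.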
What comes back
A representative 504 body, in Google's standard shape:
{
  "error": {
    "code": 504,
    "message": "The service is unable to finish processing within the deadline.",
    "status": "DEADLINE_EXCEEDED"
  }
}
Key on status being DEADLINE_EXCEEDED rather than on the prose, and don't lump it in with 429s: this is a per-request time failure, not a traffic ceiling.
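A minimal sketch of that routing, assuming you have the raw error body as a string. The function name and the category labels are illustrative, not part of any SDK; RESOURCE_EXHAUSTED is the status Google pairs with a 429.

```python
import json


def classify_error(body: str) -> str:
    """Return a coarse handling category for a Google-style error body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return "unknown"

    status = error.get("status")
    code = error.get("code")
    # Match on the machine-readable status, never on the message text.
    if status == "DEADLINE_EXCEEDED":
        return "deadline"    # per-request time failure: stream, extend, or trim
    if status == "RESOURCE_EXHAUSTED" or code == 429:
        return "rate_limit"  # traffic ceiling: back off and retry later
    return "other"
```

Keeping the two categories separate matters because the remedies differ: backing off helps a rate limit, but retrying the same oversized request against the same timeout is likely to hit the deadline again.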
Three fixes, in order of dignity
Start with streaming, because it removes the wait instead of extending it. Move to a bigger timeout when you've decided a slow call is acceptable. Trim context when the input was bloated to begin with.
# Python (google-genai)
from google import genai

# Placeholder input; substitute your real prompt
long_prompt = "..."

# Fix 2: a timeout you chose deliberately (milliseconds)
client = genai.Client(http_options={"timeout": 120000})

# Fix 1: stream, so nothing waits on the full answer
for chunk in client.models.generate_content_stream(
    model="gemini-3.5-flash",
    contents=long_prompt,
):
    # A chunk can arrive without text; print an empty string rather than "None"
    print(chunk.text or "", end="")
The third fix happens before the request leaves your machine: summarize stale history, send the relevant slice of a document instead of the whole thing, and keep an eye on how much context each request really carries.
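One way to keep that eye on context is a hard budget applied before sending. The sketch below drops the oldest messages rather than summarizing them, which is the cruder of the two options, and it assumes a rough four-characters-per-token estimate; both helper names are hypothetical. An exact count from the API's token-counting call would be more accurate.

```python
def estimate_tokens(text: str) -> int:
    """Crude size estimate. Assumption: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept: list[str] = []
    used = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

For example, with three 40-character messages (about 10 estimated tokens each) and a budget of 25, only the last two survive.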
When it's a size problem in disguise
Oversized input doesn't always announce itself as a 504. Google's docs note that a context too large for processing can surface as a 500 INTERNAL instead, with the same prescription: reduce the context, switch to a different model, or retry. If you're seeing both codes from the same pipeline, stop treating them as separate incidents and put the input on a diet. Give each request a context budget, chunk anything book-sized, and check what the major models can hold in the context-window comparison.
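Chunking a book-sized input can be as plain as slicing by character count. This is a sketch under that assumption; the function name is hypothetical, and splitting on paragraph or section boundaries would usually preserve meaning better than fixed-width slices.

```python
def chunk_text(text: str, max_chars: int, overlap: int = 0) -> list[str]:
    """Split text into pieces of at most max_chars, optionally overlapping."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")

    step = max_chars - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Stop once this chunk reaches the end, so no redundant tail is emitted.
        if start + max_chars >= len(text):
            break
        start += step
    return chunks
```

Each chunk then becomes its own request with its own, much smaller, context. The overlap parameter repeats a little text across the seam so a sentence cut at a boundary still appears whole in one of the two pieces.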
And if giant inputs are your daily reality rather than an edge case, price the workload against Gemini 3.5 Flash and its 1M-token window before you re-architect around the clock.