RPM, TPM, and which one got you
OpenAI enforces two meters at once: requests per minute (RPM) and tokens per minute (TPM). Teams watch the first and get ambushed by the second. A pipeline pushing long documents can trip TPM with just a handful of calls, while a chatbot with tiny prompts trips RPM long before TPM moves. Look at what failed: bursts of small calls point to RPM, and a few heavy calls point to TPM. The fixes diverge accordingly. For RPM, spread the calls out; for TPM, trim prompts and cap max_tokens.
One more cause hides in plain sight: a single API key shared across services. Each service is sized sanely; the sum isn't.
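To tell which meter is closer to tripping, one option is to compare the rate-limit headers that come back with each response. The sketch below assumes the `x-ratelimit-limit-*` and `x-ratelimit-remaining-*` header names from OpenAI's rate-limit documentation and that their values are plain integers; `likely_limit` is a hypothetical helper, not part of any SDK, so verify the header names against the responses you actually receive.

```python
def likely_limit(headers):
    """Guess which meter (RPM or TPM) is closest to exhaustion.

    `headers` is a dict of response headers. The header names used here
    are an assumption based on OpenAI's documented rate-limit headers.
    """
    def fraction_left(kind):
        limit = int(headers.get(f"x-ratelimit-limit-{kind}", 0))
        remaining = int(headers.get(f"x-ratelimit-remaining-{kind}", 0))
        # With no limit header we cannot judge, so treat the meter as full.
        return remaining / limit if limit else 1.0

    requests_left = fraction_left("requests")
    tokens_left = fraction_left("tokens")
    return "RPM" if requests_left <= tokens_left else "TPM"
```

Logging this value per service also helps expose the shared-key case: every service sees the same remaining budget shrinking faster than its own traffic explains.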
The error you'll see
{
  "error": {
    "message": "Rate limit reached for requests",
    "type": "rate_limit_exceeded",
    "code": "rate_limit_exceeded"
  }
}
Don't confuse it with its evil twin: insufficient_quota also returns 429, but no amount of waiting fixes billing. Branch on the error body, always.
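Branching on the body can be as small as one function. This is a minimal sketch that assumes the error payload has the shape shown above, with the distinguishing value in `error.code`; `should_retry` is a hypothetical name.

```python
def should_retry(status, body):
    """Return True only for 429s that waiting can actually fix.

    `body` is the parsed JSON error payload (a dict), assumed to look
    like {"error": {"code": ...}}.
    """
    if status != 429:
        return False
    code = (body or {}).get("error", {}).get("code")
    # insufficient_quota is a billing problem, so retrying is pointless.
    return code != "insufficient_quota"
```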
Backoff that behaves
# Python: exponential backoff with full jitter
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def create_with_backoff(max_retries=6, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError as e:
            if "insufficient_quota" in str(e):
                raise  # billing problem, waiting won't help
            # Full jitter: a random wait between 0 and the capped exponential delay.
            wait = min(60, 2 ** attempt) * random.random()
            time.sleep(wait)  # jitter prevents retry stampedes
    raise RuntimeError("still rate-limited after retries")
The jitter matters more than the exponent. When every worker retries on the same schedule, your own fleet synchronizes into waves that re-trigger the limit. Randomizing the wait breaks the formation.
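A quick way to see the difference is to compute the wait five workers would pick on the same retry attempt, with and without jitter. The helper names here are illustrative, and the seeded generator exists only to make the run repeatable.

```python
import random

def fixed_wait(attempt, cap=60):
    """Plain exponential backoff: every worker computes the same delay."""
    return min(cap, 2 ** attempt)

def jittered_wait(attempt, cap=60, rng=random):
    """Full jitter: a random delay between 0 and the capped exponential."""
    return min(cap, 2 ** attempt) * rng.random()

rng = random.Random(0)  # seeded so the example is reproducible
fixed = [fixed_wait(3) for _ in range(5)]               # five workers, attempt 3
jittered = [jittered_wait(3, rng=rng) for _ in range(5)]
```

Every entry in `fixed` is identical, so all five workers would retry at the same instant; the entries in `jittered` are spread across the 0 to 8 second window.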
The 503 cousin: "Slow Down"
Ramp traffic too aggressively and OpenAI answers with a 503 throttle instead of a 429. The documented recovery is specific: drop back to your previous request rate and hold it stable for at least 15 minutes before climbing again, gradually. Treat launches like a warm-up, not a starting gun.
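One way to encode that recovery rule is a function that tells a scheduler what request rate it may use after a throttle. The 15-minute hold comes from the guidance above; the 10% step per minute is purely an assumption for illustration, as are the function and parameter names.

```python
HOLD_SECONDS = 15 * 60  # hold the previous stable rate for at least 15 minutes

def allowed_rate(stable_rate, target_rate, seconds_since_throttle,
                 step_fraction=0.10, step_seconds=60):
    """Request rate permitted after a 503 'Slow Down'.

    stable_rate: the last rate that worked before the ramp.
    target_rate: the rate you ultimately want to reach.
    step_fraction / step_seconds: hypothetical ramp settings, tune to taste.
    """
    if seconds_since_throttle < HOLD_SECONDS:
        return stable_rate
    # After the hold, climb by a fixed fraction of the stable rate per step.
    steps = int((seconds_since_throttle - HOLD_SECONDS) // step_seconds)
    ramped = stable_rate * (1 + step_fraction * steps)
    return min(target_rate, ramped)
```

With these defaults, a service that was stable at 100 requests per minute stays at 100 for the first 15 minutes, then needs another 10 minutes to reach a target of 200.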
When 429s are chronic, not occasional
Backoff is for spikes. If you're rate-limited every hour, you have a sizing problem, and there are three honest exits: smooth the load (queue requests and cache repeated prompts), split it (separate projects for separate workloads), or move bulk traffic to a model or tier with headroom. GPT-5 Mini costs an eighth of GPT-5 per input token, and Gemini 3.5 Flash was built for exactly the parallel-agent traffic that eats rate limits. The rankings and calculator turn the reroute into a ten-minute decision.
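For the first exit, caching repeated prompts can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `PromptCache` is a hypothetical class, `call_fn` stands in for whatever function actually calls the API, and the request arguments must be JSON-serializable. It only makes sense for requests where a repeated answer is acceptable, for example deterministic settings.

```python
import hashlib
import json

class PromptCache:
    """Memoize API calls by a hash of their arguments (in-memory sketch)."""

    def __init__(self, call_fn):
        self._call_fn = call_fn  # hypothetical: the function that hits the API
        self._store = {}
        self.hits = 0

    def get(self, **kwargs):
        # Sort keys so equivalent requests produce the same cache key.
        payload = json.dumps(kwargs, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = self._call_fn(**kwargs)
        self._store[key] = result
        return result
```

Every cache hit is a request and a batch of tokens that never count against either meter.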