Reference · June 2026

OpenAI rate_limit_exceeded: meaning, cause, and fix

This 429 means the workload exceeded a request or token rate limit. The response headers tell you which limit was reached and when to retry.

By the benchr team · Published June 12, 2026 · Verified against OpenAI's error documentation, June 12, 2026

OpenAI · HTTP 429 · severity: medium · rate limit

Identify the RPM or TPM limit

OpenAI enforces two meters at once: requests per minute (RPM) and tokens per minute (TPM). Teams watch the first and get ambushed by the second. A pipeline pushing long documents can trip TPM with just a handful of calls, while a chatbot with tiny prompts trips RPM long before TPM moves. Look at what failed: bursts of small calls point to RPM, and a few heavy calls point to TPM. The fixes diverge accordingly. For RPM, spread calls out; for TPM, trim prompts and cap max_tokens.

One more cause hides in plain sight: a single API key shared across services. Each service is sized sanely; the sum isn't.
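To tell the two meters apart programmatically, one option is to read the remaining-budget headers on the failed response. The sketch below assumes the response carries x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens, as described in OpenAI's rate limits guide. The function name exhausted_meter and the request_tokens estimate are hypothetical, introduced only for illustration.

```python
def exhausted_meter(headers, request_tokens=1):
    """Guess which meter tripped from x-ratelimit-* response headers.

    `request_tokens` is the caller's own estimate of the failed request's
    token cost (an assumption; the API does not report it in these headers).
    Returns "requests", "tokens", "both", or "unknown".
    """
    # HTTP header names are case-insensitive, so normalise the keys first.
    h = {k.lower(): v for k, v in headers.items()}

    def remaining(kind):
        raw = h.get(f"x-ratelimit-remaining-{kind}")
        try:
            return int(raw)
        except (TypeError, ValueError):
            return None  # header missing or not an integer

    req, tok = remaining("requests"), remaining("tokens")
    req_out = req is not None and req < 1               # no request slots left
    tok_out = tok is not None and tok < request_tokens  # not enough tokens left
    if req_out and tok_out:
        return "both"
    if req_out:
        return "requests"
    if tok_out:
        return "tokens"
    return "unknown"
```

With the official Python SDK, the headers of a failed call should be reachable as e.response.headers on the caught RateLimitError, which can be passed straight to this function.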

The error you'll see

{
  "error": {
    "message": "Rate limit reached for requests",
    "type": "rate_limit_exceeded",
    "code": "rate_limit_exceeded"
  }
}

insufficient_quota also returns HTTP 429, but it reflects billing or quota rather than a transient rate limit. Branch on the error body's type or code so the application retries only errors that can clear with time.
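A minimal sketch of that branch, for code that handles raw HTTP responses rather than SDK exceptions. The helper name should_retry_429 is hypothetical, and treating an unparseable 429 body as retryable is a design choice of this sketch, not documented behavior.

```python
import json

# Codes/types that waiting cannot fix. Only insufficient_quota is named in
# the text above; extend this set only with values you have confirmed.
NON_RETRYABLE = {"insufficient_quota"}

def should_retry_429(status, body_text):
    """Return True when a response is a 429 that may clear with time."""
    if status != 429:
        return False
    try:
        payload = json.loads(body_text)
    except ValueError:
        return True  # unreadable body: fall back to bounded backoff
    err = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(err, dict):
        return True
    # Check both fields, since either may carry the discriminating value.
    return err.get("code") not in NON_RETRYABLE and err.get("type") not in NON_RETRYABLE
```

Retrying unknown 429 bodies is safe only because the retry loop is bounded. If your retries are unbounded, flip that default to False.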

Use bounded, jittered backoff

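# Python: exponential backoff with full jitter
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def create_with_backoff(max_retries=6, **kwargs):
    """Call chat.completions.create, retrying transient 429s."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError as e:
            # Prefer the structured error code; fall back to the message text.
            if getattr(e, "code", None) == "insufficient_quota" or "insufficient_quota" in str(e):
                raise                       # billing: waiting won't help
            if attempt == max_retries - 1:
                break                       # out of retries, so skip the final sleep
            # Full jitter: uniform wait in [0, min(60, 2**attempt)) seconds
            wait = min(60, 2 ** attempt) * random.random()
            time.sleep(wait)                # jitter prevents retry stampedes
    raise RuntimeError("still rate-limited after retries")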

The jitter matters more than the exponent. When every worker retries on the same schedule, your own fleet synchronizes into waves that re-trigger the limit. Randomizing the wait breaks the formation.
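The synchronization effect is easy to see in a toy simulation. The sketch below is not a model of OpenAI's limiter; it only counts how many distinct retry instants a hypothetical fleet of 50 workers produces with and without full jitter, assuming every worker fails at time zero.

```python
import random

def retry_times(workers, attempts, jitter, seed=0):
    """Return the instant (seconds since the first failure) of every retry."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    times = []
    for _ in range(workers):
        t = 0.0
        for attempt in range(attempts):
            ceiling = min(60.0, 2 ** attempt)
            # Without jitter every worker waits exactly `ceiling` seconds.
            t += ceiling * rng.random() if jitter else ceiling
            times.append(round(t, 3))
    return times

fixed = retry_times(workers=50, attempts=4, jitter=False)
spread = retry_times(workers=50, attempts=4, jitter=True)
print(len(set(fixed)), "distinct retry instants without jitter")
print(len(set(spread)), "distinct retry instants with jitter")
```

Without jitter, all 200 retries land on just four instants (1 s, 3 s, 7 s, and 15 s), which is the wave pattern described above. With jitter, they scatter across the same interval.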

The related 503 "Slow Down" response

Ramp traffic too aggressively and OpenAI answers with a 503 throttle instead of a 429. The documented recovery is specific: drop back to your previous request rate and hold it stable for at least 15 minutes before climbing again, gradually. Treat launches like a warm-up, not a starting gun.

When 429s are chronic, not occasional

Backoff is for spikes. If you're rate-limited every hour, you have a sizing problem, and there are three honest exits: smooth the load (queue + cache repeated prompts), split it (separate projects for separate workloads), or move bulk traffic to a tier with headroom. GPT-5 Mini costs an eighth of GPT-5 per input token, and Gemini 3.5 Flash was built for exactly the parallel-agent traffic that eats rate limits. The rankings and calculator turn the reroute into a ten-minute decision.
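Smoothing the load can start with two small pieces: a client-side budget that tells a queue how long to hold a call, and a stable cache key for repeated prompts. The following is a sketch under stated assumptions. MinuteBudget and cache_key are hypothetical names, token counts are the caller's own estimates, and the sliding 60-second window is a simplification that may not match how OpenAI meters usage.

```python
import hashlib
import json
from collections import deque

class MinuteBudget:
    """Sliding 60-second window over request count and token count."""

    def __init__(self, rpm, tpm):
        self.rpm, self.tpm = rpm, tpm
        self.events = deque()  # (timestamp, tokens) for calls inside the window

    def delay_for(self, tokens, now):
        """Seconds to wait before a call costing `tokens` fits (0.0 if it fits now)."""
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()  # drop calls that left the window
        requests_over = len(self.events) + 1 - self.rpm
        tokens_over = sum(t for _, t in self.events) + tokens - self.tpm
        if requests_over <= 0 and tokens_over <= 0:
            return 0.0
        # Walk from the oldest call: wait until enough of them have expired.
        freed_requests = freed_tokens = 0
        for ts, used in self.events:
            freed_requests += 1
            freed_tokens += used
            if freed_requests >= requests_over and freed_tokens >= tokens_over:
                return ts + 60 - now
        return float("inf")  # the call alone exceeds the budget

    def record(self, tokens, now):
        """Register a call that was actually sent at time `now`."""
        self.events.append((now, tokens))

def cache_key(model, messages):
    """Stable key for a prompt, so identical requests can share one response."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A queue worker would check the cache first, then call delay_for with time.monotonic(), sleep for the returned duration, send the request, and call record. Passing the clock in as an argument keeps the class easy to test.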

Frequently asked

How long until the limit resets?

Every minute, per OpenAI's docs. If your volume drops, a failed request usually passes within 60 seconds — which is why backoff starting at ~1s works.
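If you would rather wait for the reported reset than guess, the x-ratelimit-reset-requests and x-ratelimit-reset-tokens headers can be converted to seconds. This sketch assumes the values are duration strings such as 1s, 6m0s, or 20ms, which is the shape shown in OpenAI's rate limits guide. Treat the exact format as an assumption and keep the error path. The name reset_seconds is hypothetical.

```python
import re

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed before "m" so that "20ms" is not read as 20 minutes.
_PART = r"(\d+(?:\.\d+)?)(ms|h|m|s)"

def reset_seconds(value):
    """Convert a duration string like '6m0s' to seconds as a float."""
    value = value.strip()
    if not re.fullmatch(f"(?:{_PART})+", value):
        raise ValueError(f"unrecognised reset duration: {value!r}")
    return sum(float(n) * _UNIT_SECONDS[unit] for n, unit in re.findall(_PART, value))
```

A retry loop could sleep for the larger of this value and its jittered backoff, still capped by its own maximum wait.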

RPM vs TPM — which did I hit?

Bursts of small calls = RPM. A few heavy, long-prompt calls = TPM. The error code is the same for both, so rely on the response headers and your traffic shape to tell which meter tripped.

What's the 503 "Slow Down" error?

OpenAI's anti-ramp throttle. Reduce to your previous rate, hold stable 15 minutes, then increase gradually. It's the documented recovery, not folklore.

Changelog

June 12, 2026 — Published. Reset behavior, error shape, and the 503 recovery rule verified against OpenAI's error-codes guide.

Sources

OpenAI error codes guide — developers.openai.com/api/docs/guides/error-codes (verified June 12, 2026)
benchr api-errors.json — structured entry for this error