Anthropic rate_limit_error: meaning, cause, and fix

Volume isn't the only trigger. Anthropic also watches velocity, so a growth curve that looks like success on your dashboard can read as a runaway to the API.

By the benchr team · Verified against Anthropic's API error documentation, June 12, 2026

Anthropic · HTTP 429 · severity: medium · rate limit

Two ways to hit it

The first road is the ceiling. Every tier carries request and token budgets, and traffic past them draws 429s until the window clears. Note which budget went first, because the cures differ: blowing the request budget calls for fewer, better-bundled calls, while blowing the token budget calls for shorter prompts and a tighter max_tokens, even at modest call volume.
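One way to note which budget went first is to read the rate-limit headers that come back with the response. The sketch below assumes the header names as we recall them from Anthropic's rate-limit documentation, so confirm them against the current docs before relying on them. The helper name exhausted_budget is hypothetical.

```python
from typing import Mapping, Optional

# Assumed header names; check them against Anthropic's current rate-limit docs.
BUDGET_HEADERS = {
    "requests": "anthropic-ratelimit-requests-remaining",
    "input tokens": "anthropic-ratelimit-input-tokens-remaining",
    "output tokens": "anthropic-ratelimit-output-tokens-remaining",
}

def exhausted_budget(headers: Mapping[str, str]) -> Optional[str]:
    """Return the name of the first budget whose remaining count is zero, or None."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {key.lower(): value for key, value in headers.items()}
    for name, header in BUDGET_HEADERS.items():
        value = lowered.get(header)
        if value is not None and value.strip().isdigit() and int(value) == 0:
            return name
    return None
```

With the Python SDK, the headers would typically come from err.response.headers on a caught anthropic.RateLimitError. A result of "requests" points at bundling calls, while either token result points at shorter prompts or a tighter max_tokens.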

The second road is the personality quirk. Anthropic warns that sharp increases in usage can trigger 429s from acceleration limits even while you sit under your published numbers. The platform treats a vertical ramp as a hazard regardless of tier, and sudden fame counts: a launch, a viral moment, a backfill script someone started at full parallelism. The documented advice is to ramp gradually and keep usage patterns consistent, which in practice means warming traffic up over hours rather than seconds. And if errors arrive while your traffic is flat and well under your limits, suspect the platform instead: that's the 529 overloaded_error, a different animal.
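A gradual ramp can be as simple as a cap that climbs with elapsed time. The sketch below is illustrative only: the function name ramp_cap, the two-hour window, and the start and target values are all assumptions, not thresholds Anthropic publishes.

```python
def ramp_cap(elapsed_s: float, start: int = 1, target: int = 8,
             ramp_s: float = 2 * 3600) -> int:
    """Concurrency cap that climbs linearly from start to target over ramp_s seconds."""
    if ramp_s <= 0 or elapsed_s >= ramp_s:
        return target      # ramp finished (or disabled): run at the full cap
    if elapsed_s <= 0:
        return start       # just launched: begin at the floor
    fraction = elapsed_s / ramp_s
    # Round down so the cap never runs ahead of the schedule.
    return start + int((target - start) * fraction)
```

A job that consults this cap before admitting new work starts with one request in flight, reaches four after an hour, and only hits eight once the full two hours have passed.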

What comes back

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Your account has hit a rate limit."
  },
  "request_id": "req_011CSHoEeqs5C35K2UUqR7Fy"
}

Branch on the nested error.type field, not the top-level type, which reads "error" on every failure. The Python SDK surfaces this as anthropic.RateLimitError, so catch the exception class; message strings are wording, not contract. Every response also carries a req_-prefixed ID in the request-id header, which the SDKs expose, and when a limits question turns into a support ticket, that ID is the difference between a fast answer and a slow one.
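For clients that handle the raw response rather than SDK exceptions, the branch might look like the sketch below. It assumes the body shape shown above; the function name classify_error and the action labels are hypothetical, and overloaded_error is the error type we understand the 529 to carry.

```python
import json
from typing import Optional, Tuple

def classify_error(raw_body: str) -> Tuple[str, Optional[str]]:
    """Return (action, request_id) for an error body, branching on error.type."""
    try:
        body = json.loads(raw_body)
    except json.JSONDecodeError:
        return ("unknown", None)
    if not isinstance(body, dict) or body.get("type") != "error":
        return ("unknown", None)

    request_id = body.get("request_id")   # keep this for support tickets
    error = body.get("error")
    error_type = error.get("type") if isinstance(error, dict) else None

    if error_type == "rate_limit_error":
        return ("slow_down", request_id)      # 429: shape or back off traffic
    if error_type == "overloaded_error":
        return ("retry_later", request_id)    # 529: platform-side load
    return ("unknown", request_id)
```

The message string never enters the decision, so a wording change on Anthropic's side would not break the handler.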

Smooth, don't sprint

Backoff reacts to a 429 after it lands. Concurrency shaping prevents most of them from existing, and it doubles as ramp control, since a fixed pool can't sprint no matter how much work piles up behind it. One semaphore does the job:

# Python: a worker pool that turns bursts into a steady stream
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()
gate = asyncio.Semaphore(8)        # in-flight cap; tune to your tier

async def ask(prompt: str):
    async with gate:
        return await client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )

async def run(prompts):
    return await asyncio.gather(*(ask(p) for p in prompts))

Eight workers grinding through a thousand prompts produce a bounded, even stream that reads as a steady customer. Raise the pool size in small steps over days, and the same gate that stops bursts becomes your gradual ramp.

Cut tokens, not corners

Chronic 429s mean the workload outgrew the plan, and the cheap fixes shrink the workload rather than masking the error. Prompt caching takes 90% off the input price of context you resend on every call. The Batch API handles anything that can wait at a 50% discount, off the live path. And routing is the biggest lever: simple, high-volume calls don't need a frontier model when Claude Haiku 4.5 runs $1 in and $5 out per million tokens. Price your split with the calculator; the math usually ends the meeting early.
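The arithmetic is easy to sketch with the figures quoted above. This is a rough model, not a pricing tool: it assumes the caching and batch discounts stack multiplicatively, it ignores any surcharge for writing to the cache, and the function name cost_usd is hypothetical.

```python
# Prices and discounts as quoted in this article (USD per million tokens).
HAIKU_INPUT_PER_MTOK = 1.00
HAIKU_OUTPUT_PER_MTOK = 5.00
CACHE_READ_DISCOUNT = 0.90   # share taken off the input price of cached context
BATCH_DISCOUNT = 0.50        # share taken off work sent through the Batch API

def cost_usd(input_tokens: int, output_tokens: int,
             cached_fraction: float = 0.0, batch: bool = False,
             input_price: float = HAIKU_INPUT_PER_MTOK,
             output_price: float = HAIKU_OUTPUT_PER_MTOK) -> float:
    """Estimate spend for a workload, given the share of input served from cache."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    # Cached input is billed at the discounted rate; fresh input at full price.
    billable_input = fresh + cached * (1 - CACHE_READ_DISCOUNT)
    total = (billable_input * input_price + output_tokens * output_price) / 1_000_000
    if batch:
        total *= 1 - BATCH_DISCOUNT   # assumption: the batch discount stacks on top
    return total
```

Under these assumptions, one million input tokens and 200,000 output tokens on Claude Haiku 4.5 cost about $2.00 uncached, about $1.28 with 80% of the input served from cache, and about $1.00 if the uncached job runs as a batch.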

Frequently asked

Why am I getting 429s below my published limits?

Acceleration limits. Anthropic's docs warn that sharp usage increases can draw 429s even under your tier's numbers. Ramp gradually and keep traffic patterns consistent; vertical growth is the trigger.

Does exponential backoff fix this?

For spikes, yes. Jittered backoff absorbs bursts and the occasional ramp penalty cleanly. For 429s arriving every hour, no: that's chronic undersizing, and the fix is shaping traffic, caching repeated context, and moving bulk work to the Batch API.
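A minimal sketch of jittered backoff, using the common "full jitter" variant: each delay is a random value between zero and an exponentially growing ceiling. The names backoff_delay and call_with_backoff, and the base, cap, and attempt counts, are assumptions for illustration. The Python SDK also has its own max_retries setting, so check what it already retries before layering this on top.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter delay: random share of min(cap, base * 2**attempt) seconds."""
    return rng() * min(cap, base * (2 ** attempt))

def call_with_backoff(fn: Callable[[], T],
                      is_retryable: Callable[[Exception], bool],
                      max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying retryable failures with jittered exponential delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retryable errors or once attempts are exhausted.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

With the SDK, is_retryable could be `lambda e: isinstance(e, anthropic.RateLimitError)`. The jitter matters because many workers retrying on the same schedule would arrive together and recreate the burst.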

What's the cheapest way to cut token throughput?

Prompt caching plus routing. Caching discounts repeated input context by 90%, and sending simple calls to Claude Haiku 4.5 at $1/$5 per million tokens clears the heaviest traffic off your priciest model.

Changelog

  • Published. Rate-limit behavior, the acceleration-limit caveat, and the response shape verified against Anthropic's API error docs.

Sources

  • Anthropic API errors — platform.claude.com/docs/en/api/errors (verified June 12, 2026)
  • benchr api-errors.json, the structured entry for this error