Two ways to hit it
The first road is the ceiling. Every tier carries request and token budgets, and traffic past them draws 429s until the window clears. Note which budget went first, because the cures differ: blowing the request budget calls for fewer, better-bundled calls, while blowing the token budget calls for shorter prompts and a tighter max_tokens, even at modest call volume.
The second road is the personality quirk. Anthropic warns that sharp increases in usage can trigger 429s from acceleration limits even while you sit under your published numbers. The platform treats a vertical ramp as a hazard regardless of tier, and sudden fame counts — a launch, a viral moment, a backfill script someone started at full parallelism. The documented advice is to ramp gradually and keep usage patterns consistent, which in practice means warming traffic up over hours rather than seconds. And if errors arrive while your traffic is flat, suspect the platform instead: that's the 529, a different animal.
What comes back
{
"type": "error",
"error": {
"type": "rate_limit_error",
"message": "Your account has hit a rate limit."
},
"request_id": "req_011CSHoEeqs5C35K2UUqR7Fy"
}
Branch on the type field. The Python SDK surfaces this as anthropic.RateLimitError, so catch the exception class; message strings are wording, not contract. Every response also carries a req_-prefixed request-id header that the SDKs expose, and when a limits question turns into a support ticket, that ID is the difference between a fast answer and a slow one.
Smooth, don't sprint
Backoff reacts to a 429 after it lands. Concurrency shaping prevents most of them from existing, and it doubles as ramp control, since a fixed pool can't sprint no matter how much work piles up behind it. One semaphore does the job:
# Python: a worker pool that turns bursts into a steady stream
import asyncio
import anthropic
client = anthropic.AsyncAnthropic()
gate = asyncio.Semaphore(8) # in-flight cap; tune to your tier
async def ask(prompt: str):
async with gate:
return await client.messages.create(
model="claude-haiku-4-5",
max_tokens=512,
messages=[{"role": "user", "content": prompt}],
)
async def run(prompts):
return await asyncio.gather(*(ask(p) for p in prompts))
Eight workers grinding through a thousand prompts produce a bounded, even stream that reads as a steady customer. Raise the pool size in small steps over days, and the same gate that stops bursts becomes your gradual ramp.
Cut tokens, not corners
Chronic 429s mean the workload outgrew the plan, and the cheap fixes shrink the workload rather than masking the error. Prompt caching takes 90% off the input price of context you resend on every call. The Batch API handles anything that can wait at a 50% discount, off the live path. And routing is the biggest lever: simple, high-volume calls don't need a frontier model when Claude Haiku 4.5 runs $1 in and $5 out per million tokens. Price your split with the calculator; the math usually ends the meeting early.