What tripped
Free-tier ceilings sit low on purpose. Google built that tier for kicking the tires, and a real interface in front of it will spend a minute's request budget in seconds. One user click that fans out into four model calls, a retry loop with no delay, a cron job that wakes up hungry: each looks innocent in code review and burns quota at runtime.
Shared projects are the other classic. Quota is counted per project, not per app, so the demo a coworker left running eats from the same plate as production. And ceilings don't vanish when you pay; paid tiers publish higher numbers, and heavy parallel workloads can still find them.
The response
Google wraps every failure in the same envelope: a numeric code that matches the HTTP status, a human-readable message, and a gRPC status string. A representative 429 body looks like this:
{
  "error": {
    "code": 429,
    "message": "Resource has been exhausted (e.g. check quota).",
    "status": "RESOURCE_EXHAUSTED"
  }
}
The message text shifts depending on which quota tripped, so branch on the status field rather than the prose.
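A minimal sketch of that branch, assuming you already have the parsed JSON body in hand. The helper name isQuotaExhausted is made up for illustration, and it relies only on the envelope fields shown above:

```javascript
// Hypothetical helper: decide from a parsed error body whether a quota tripped.
// Uses only the envelope fields shown above (code, message, status).
function isQuotaExhausted(body) {
  const err = body && body.error;
  if (!err) return false;
  // Branch on the stable status string, never on the message prose.
  return err.status === "RESOURCE_EXHAUSTED";
}

const sample = {
  error: {
    code: 429,
    message: "Resource has been exhausted (e.g. check quota).",
    status: "RESOURCE_EXHAUSTED",
  },
};

const other = {
  error: { code: 400, message: "Bad request", status: "INVALID_ARGUMENT" },
};

console.log(isQuotaExhausted(sample)); // true
console.log(isQuotaExhausted(other)); // false
```

Matching on the status string means a reworded message does not break your error handling.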
Throttle at the source
Retrying is triage. Shaping traffic before it leaves your process is the cure, and you don't need a library for it:
// JavaScript: cap concurrency and space out request starts
function geminiLimiter(maxInFlight = 2, startGapMs = 4000) {
  const queue = [];
  let inFlight = 0;
  let lastStart = 0;
  let timer = null; // at most one pending wake-up at a time

  function pump() {
    if (inFlight >= maxInFlight) return;
    if (queue.length === 0) return;
    const wait = lastStart + startGapMs - Date.now();
    if (wait > 0) {
      // too soon since the last start: schedule one wake-up, not one per caller
      if (timer === null) {
        timer = setTimeout(() => { timer = null; pump(); }, wait);
      }
      return;
    }
    lastStart = Date.now();
    inFlight += 1;
    const job = queue.shift();
    // wrapping the call turns a synchronous throw in the thunk into a
    // rejection, so the slot is always released
    new Promise((settle) => settle(job.thunk()))
      .then(job.resolve, job.reject)
      .finally(() => { inFlight -= 1; pump(); });
    pump(); // another slot may be open
  }

  return (thunk) => new Promise((resolve, reject) => {
    queue.push({ thunk, resolve, reject });
    pump();
  });
}

// 4000ms between starts caps you near 15 calls a minute;
// tune both knobs to the published limits for your tier
const limited = geminiLimiter(2, 4000);
const reply = await limited(() => model.generateContent(prompt));
Every call goes in as a thunk; the gate decides when it runs. Two knobs, both tuned to your tier's published numbers: how many requests fly at once, and how far apart they launch.
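To turn a per-minute request limit into the start gap, divide a minute by the budget. A small sketch, assuming the limit you are working against is expressed in requests per minute; the helper name startGapForRpm and the sample numbers are illustrative, not published limits:

```javascript
// Hypothetical helper: convert a requests-per-minute budget into the
// minimum gap between request starts, in milliseconds.
function startGapForRpm(requestsPerMinute) {
  if (!(requestsPerMinute > 0)) {
    throw new RangeError("requestsPerMinute must be a positive number");
  }
  // Round up so the spacing never lets you exceed the budget.
  return Math.ceil(60000 / requestsPerMinute);
}

console.log(startGapForRpm(15)); // 4000
console.log(startGapForRpm(10)); // 6000
```

Passing a budget somewhat below the published number leaves headroom for anything else drawing on the same project.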
Free tier or real tier
If the limiter is doing its job and your app is still starving, stop tuning and decide. Three doors. A quota increase request, when your usage pattern is sound and the ceiling is the only problem. Billing, when this is production; free tiers exist for testing, and traffic that matters deserves a tier with a contract behind it. Or rerouting, when the bulk work belongs on a different model entirely. Paid Gemini 3.5 Flash runs $1.50 in and $9.00 out per million tokens with a 1M-token context, which prices most chat workloads in single-digit dollars a day. The calculator turns your traffic into an exact monthly figure, and the rankings settle whether another model earns the volume.
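For a rough sanity check before reaching for the calculator, the arithmetic fits in a few lines. This sketch uses the per-million-token prices quoted above; the traffic figures (1,000 requests a day, 1,500 input tokens and 400 output tokens each) are invented for illustration, so substitute your own:

```javascript
// Prices quoted above, in dollars per million tokens.
const PRICE_IN_PER_M = 1.5;
const PRICE_OUT_PER_M = 9.0;

// Hypothetical helper: estimate daily spend from average traffic.
function dailyCost(requestsPerDay, inputTokensPerReq, outputTokensPerReq) {
  const inputCost = (requestsPerDay * inputTokensPerReq * PRICE_IN_PER_M) / 1e6;
  const outputCost = (requestsPerDay * outputTokensPerReq * PRICE_OUT_PER_M) / 1e6;
  return inputCost + outputCost;
}

// Assumed traffic: 1,000 requests/day, 1,500 tokens in, 400 tokens out.
const perDay = dailyCost(1000, 1500, 400);
console.log(perDay.toFixed(2)); // "5.85"
```

Output tokens dominate the bill at these prices, so trimming response length moves the total more than trimming prompts.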