Handling 429 and 402 Errors From an LLM Gateway
A 429 and a 402 mean different things and need different client logic. Here is how to handle rate limits, budget caps, and out-of-credits responses gracefully — with backoff, not blind retries.
The difference between a resilient LLM client and a fragile one is almost entirely in how it handles the unhappy responses. Most clients retry blindly — same request, immediately, in a loop — which turns a transient throttle into a self-inflicted outage and a budget cap into a tight, expensive retry storm. Handling 429 and 402 correctly means reading which limit you hit and reacting to that. Here's the playbook.
The three responses you must handle
A gateway can reject a request for three distinct reasons that look superficially similar but call for different reactions:
| Status | error.type | Means | Right reaction |
|---|---|---|---|
| 429 | rate_limit_exceeded | Too fast | Back off, honor retry_after, retry |
| 429 | budget_exceeded | Scope hit its cap | Don't retry — raise budget or wait for reset |
| 402 | (insufficient credits) | Account out of money | Top up / enable auto-topup; don't retry |
The trap: both throttles are 429. If you treat all 429s as "wait and retry," you'll cheerfully retry into a budget cap that won't lift until the window resets — burning latency and getting nowhere. Branch on error.type, not just the status code.
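One way to make that branching explicit is a small classifier that maps a response to an action. This is a minimal sketch that assumes the status code and error.type have already been parsed out of the response; the `Action` enum and `classify` function are illustrative names, not part of any gateway SDK.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    OK = "ok"
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    STOP_AND_SURFACE = "stop_and_surface"
    TOP_UP = "top_up"
    OTHER = "other"  # everything this sketch does not cover


def classify(status: int, error_type: Optional[str] = None) -> Action:
    """Map a gateway response to a client action (hypothetical helper)."""
    if 200 <= status < 300:
        return Action.OK
    if status == 429 and error_type == "rate_limit_exceeded":
        return Action.RETRY_WITH_BACKOFF
    if status == 429 and error_type == "budget_exceeded":
        return Action.STOP_AND_SURFACE
    if status == 402:
        return Action.TOP_UP
    # An unrecognized 429 subtype lands here too: better to hand it to the
    # caller than to assume it is safe to retry.
    return Action.OTHER
```

Keeping the decision in one pure function means the retry loop, the alerting path, and the tests all agree on what each response means.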
429 rate-limit: exponential backoff with jitter
A rate limit is transient — it clears as the window rolls. The correct response is to retry, but politely: honor the server's retry_after_seconds if present, otherwise back off exponentially with jitter so a fleet of clients doesn't synchronize into a thundering herd.
    import random
    import time

    # gateway and RateLimited are assumed to be defined by the application.
    def call_with_backoff(req, max_retries=5):
        for attempt in range(max_retries):
            resp = gateway.call(req)
            # Return anything that is not a rate-limit 429 (success, budget cap, 402, ...).
            if resp.status != 429 or resp.error.type != "rate_limit_exceeded":
                return resp
            # Prefer the server's hint; otherwise back off exponentially.
            delay = getattr(resp.error, "retry_after_seconds", None) or (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.25))  # jitter
        raise RateLimited()

The jitter matters: without it, every client that got throttled at the same instant retries at the same instant, recreating the burst. (This is the client side of the RPM/TPM limits the gateway enforces.)
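To see the effect, the sketch below computes the retry delay for ten hypothetical clients that were all throttled at the same moment. The `backoff_delay` helper is an illustrative name; it uses the same formula as the retry loop above (2 ** attempt plus up to 25% jitter), and the seed is there only to make the run repeatable.

```python
import random


def backoff_delay(attempt, jitter=True, rng=random):
    """Delay in seconds for a retry attempt: 2**attempt plus up to 25% jitter."""
    base = 2 ** attempt
    if not jitter:
        return float(base)
    return base + rng.uniform(0, base * 0.25)


# Ten hypothetical clients, all throttled at the same moment on attempt 2.
rng = random.Random(7)  # seeded only so the sketch is repeatable
without_jitter = [backoff_delay(2, jitter=False) for _ in range(10)]
with_jitter = [backoff_delay(2, rng=rng) for _ in range(10)]

# Without jitter every client picks the same delay, so the burst re-forms.
print(len(set(without_jitter)))
# With jitter the delays fall somewhere between 4 and 5 seconds.
print(min(with_jitter), max(with_jitter))
```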
Retrying a budget cap is a bug, not resilience
A budget_exceeded 429 will return 429 again on the next call, and the next, until the budget window resets or someone raises the cap. Retrying it in a tight loop wastes your latency budget and your patience. Detect error.type == "budget_exceeded" and surface it to a human or fail the job cleanly — don't spin.
429 budget-cap: stop and surface
A budget cap is a policy outcome, not a transient failure. The right move is to stop, report which scope hit its cap, and let a human decide: raise the budget, wait for the window to reset, or shed load. For batch jobs, checkpoint and resume after reset; for interactive flows, show the user an honest "spending limit reached" rather than a spinner that never resolves.
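For a batch job, "checkpoint and resume" can be as simple as recording the index of the first unprocessed item. The sketch below is one possible shape, not a prescribed one: `call` is an assumed callable that returns a dict, and the "error_type", "scope", and "result" keys are hypothetical field names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class BatchOutcome:
    results: List[Any] = field(default_factory=list)
    resume_from: Optional[int] = None   # index of the first unprocessed item, if stopped
    capped_scope: Optional[str] = None  # scope that hit its cap, if the response named one


def run_batch(items, call: Callable[[Any], dict], start: int = 0) -> BatchOutcome:
    """Process items[start:], stopping cleanly on a budget-cap 429."""
    outcome = BatchOutcome()
    for i in range(start, len(items)):
        resp = call(items[i])
        if resp["status"] == 429 and resp.get("error_type") == "budget_exceeded":
            # No retry: record where to resume and which scope to report.
            outcome.resume_from = i
            outcome.capped_scope = resp.get("scope")
            return outcome
        outcome.results.append(resp.get("result"))
    return outcome
```

After the window resets or the cap is raised, the job calls `run_batch(items, call, start=outcome.resume_from)` and picks up where it stopped, without redoing completed work.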
402 out-of-credits: top up, don't retry
A 402 means the account has no credits — retrying changes nothing until money is added. Handle it by triggering your top-up path (or relying on auto-topup if enabled) and only then resuming. Treat 402 as a billing event that pages the account owner, not a transient error to retry through.
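A rough sketch of that flow, assuming the application supplies three callables: `send` (issues the request and returns a dict with a "status" key), `top_up` (runs the billing path and returns True on success), and `notify_owner` (pages or emails the account owner). All three names are hypothetical.

```python
class OutOfCredits(Exception):
    """Raised when the account has no credits and the top-up did not succeed."""


def call_with_topup(send, top_up, notify_owner):
    """Send a request; on 402 run the billing path once, then resume."""
    resp = send()
    if resp["status"] != 402:
        return resp
    # A 402 is a billing event: tell the account owner before anything else.
    notify_owner("LLM gateway returned 402: account is out of credits")
    if not top_up():
        raise OutOfCredits("top-up did not complete; not retrying")
    # Credits were added, so resuming is meaningful. Send exactly once more.
    return send()
```

The request is re-sent only after `top_up` reports success, so there is no loop that can spin against an empty account.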
Idempotency and partial work
When you do retry, make sure a retried request is safe to repeat. For chat completions this is usually fine (you want a fresh answer), but for anything with side effects (tool calls that write, agent steps that mutate state) a naive retry can double-apply. Tag retried requests and make downstream effects idempotent so a backoff-retry doesn't run the side effect twice.
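One common way to get there is an idempotency key: generate it once per logical operation, reuse it on every retry, and have the side-effecting code skip work it has already done for that key. The sketch below keeps the record in memory purely for illustration; `IdempotentExecutor` is a hypothetical name, and a real deployment would need a durable store shared by all workers.

```python
import uuid


class IdempotentExecutor:
    """Runs each side effect at most once per idempotency key (in-memory sketch)."""

    def __init__(self):
        self._done = {}  # key -> result of the first completed run

    def run(self, key, effect):
        if key in self._done:
            # Retry of an operation that already completed: skip the effect.
            return self._done[key]
        result = effect()
        self._done[key] = result
        return result


executor = IdempotentExecutor()
ledger = []


def write_row():
    ledger.append("row")
    return len(ledger)


key = str(uuid.uuid4())                 # generated once per logical operation
first = executor.run(key, write_row)
retried = executor.run(key, write_row)  # same key on retry, so no second write
```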
A decision tree
response not ok?
├─ 429 + rate_limit_exceeded → backoff (honor retry_after) + retry
├─ 429 + budget_exceeded → STOP, surface scope, no retry
├─ 402 → top-up / auto-topup, then resume
├─ 4xx (400/401) → fail fast, fix the request/key
└─ 5xx / timeout → backoff + retry (gateway may fall back for you)
The takeaway
Resilience lives in the error path. Read error.type, not just the status: back off and retry rate limits with jitter, stop and surface budget caps instead of spinning, treat 402 as a billing event, and keep retries idempotent. Do that and a throttle becomes a brief pause instead of an outage — which is exactly what the gateway's limits were designed to give you room for. The API docs list every error type and the headers that come with them.