The difference between a resilient LLM client and a fragile one is almost entirely in how it handles the unhappy responses. Most clients retry blindly — same request, immediately, in a loop — which turns a transient throttle into a self-inflicted outage and a budget cap into a tight, expensive retry storm. Handling 429 and 402 correctly means reading which limit you hit and reacting to that. Here's the playbook.

The three responses you must handle

A gateway can reject a request for three distinct reasons that look superficially similar but call for different reactions:

Status	`error.type`	Means	Right reaction
`429`	`rate_limit_exceeded`	Too fast	Back off, honor `retry_after`, retry
`429`	`budget_exceeded`	Scope hit its cap	Don't retry — raise budget or wait for reset
`402`	(insufficient credits)	Account out of money	Top up / enable auto-topup; don't retry

The trap: both throttles are 429. If you treat all 429s as "wait and retry," you'll cheerfully retry into a budget cap that won't lift until the window resets — burning latency and getting nowhere. Branch on error.type, not just the status code.

429 rate-limit: exponential backoff with jitter

A rate limit is transient — it clears as the window rolls. The correct response is to retry, but politely: honor the server's retry_after_seconds if present, otherwise back off exponentially with jitter so a fleet of clients doesn't synchronize into a thundering herd.

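import random
import time

class RateLimited(Exception):
    pass

def call_with_backoff(req, max_retries=5):
    for attempt in range(max_retries):
        resp = gateway.call(req)  # `gateway` is your client object
        error = resp.error or {}  # error body treated as a dict (assumption)
        if resp.status != 429 or error.get("type") != "rate_limit_exceeded":
            return resp  # success, or an error that needs a different reaction
        if attempt < max_retries - 1:  # don't sleep after the final attempt
            delay = error.get("retry_after_seconds") or (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.25))  # jitter
    raise RateLimited()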

The jitter matters: without it, every client that got throttled at the same instant retries at the same instant, recreating the burst. (This is the client side of the RPM/TPM limits, requests and tokens per minute, that the gateway enforces.)
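To make the effect of jitter concrete, here is a minimal sketch of the delay calculation on its own, using the same 25% additive jitter as the function above. The helper name `backoff_delay` and its `cap` parameter are illustrative assumptions, not part of any gateway SDK.

```python
import random

def backoff_delay(attempt, retry_after=None, cap=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: prefers the server's hint, otherwise 2**attempt
    capped at `cap`, then adds up to 25% random jitter.
    """
    base = retry_after if retry_after else min(cap, 2 ** attempt)
    return base + rng.uniform(0, base * 0.25)

# Ten clients throttled at the same instant, all on their third attempt:
fleet = [backoff_delay(2) for _ in range(10)]
# Every value falls between 4.0 and 5.0 seconds, so the retries spread
# across a one-second window instead of landing together.
```

The cap is a design choice rather than something the article prescribes: without one, a long retry chain can sleep for minutes on a single attempt.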

Retrying a budget cap is a bug, not resilience

A budget_exceeded 429 will return 429 again on the next call, and the next, until the budget window resets or someone raises the cap. Retrying it in a tight loop wastes your latency budget and your patience. Detect error.type == "budget_exceeded" and surface it to a human or fail the job cleanly — don't spin.

429 budget-cap: stop and surface

A budget cap is a policy outcome, not a transient failure. The right move is to stop, report which scope hit its cap, and let a human decide: raise the budget, wait for the window to reset, or shed load. For batch jobs, checkpoint and resume after reset; for interactive flows, show the user an honest "spending limit reached" rather than a spinner that never resolves.
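For the batch case, the checkpoint-and-resume idea might look like the sketch below. It assumes a `call(item)` callable that returns a `(status, error_dict)` pair and an error body that names the affected scope in a `scope` field; both are assumptions for illustration, as are the `BudgetExceeded` and `run_batch` names.

```python
import json

class BudgetExceeded(Exception):
    """Hypothetical exception carrying the scope that hit its cap."""

    def __init__(self, scope):
        super().__init__(f"spending limit reached for scope: {scope}")
        self.scope = scope

def run_batch(items, call, checkpoint_path, start=0):
    """Process items[start:]; on a budget cap, save progress and stop.

    `call(item)` is assumed to return (status, error_dict_or_None).
    Returns the number of items processed when the batch completes.
    """
    for index in range(start, len(items)):
        status, error = call(items[index])
        error = error or {}
        if status == 429 and error.get("type") == "budget_exceeded":
            with open(checkpoint_path, "w") as fh:
                json.dump({"next_index": index}, fh)  # resume from here later
            # No retry: hand the decision to a human or a scheduler.
            raise BudgetExceeded(error.get("scope", "unknown"))
    return len(items) - start
```

After the window resets, the job can be restarted with `start` set to the saved `next_index`, so no completed work is repeated.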

402 out-of-credits: top up, don't retry

A 402 means the account has no credits — retrying changes nothing until money is added. Handle it by triggering your top-up path (or relying on auto-topup if enabled) and only then resuming. Treat 402 as a billing event that pages the account owner, not a transient error to retry through.
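One way to express "top up, then resume" in code is sketched below. The three callables (`send`, `top_up`, `notify_owner`) are placeholders that your application would supply; none of them is a real gateway API, and `OutOfCredits` is a hypothetical exception name.

```python
class OutOfCredits(Exception):
    """Hypothetical exception for a 402 that could not be resolved."""

def call_with_topup(send, top_up, notify_owner):
    """Call once; on 402, alert billing and attempt a single top-up.

    `send()` returns an HTTP status code, `top_up()` returns True if
    credits were added, and `notify_owner(msg)` pages the account owner.
    All three are assumed callables, not part of any SDK.
    """
    status = send()
    if status != 402:
        return status
    notify_owner("LLM gateway account is out of credits")
    if not top_up():
        raise OutOfCredits("top-up failed; not retrying")
    return send()  # resume exactly once, only after money was added
```

The request is re-sent at most once, and only after the top-up succeeds, so a failed payment cannot turn into a retry loop.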

Idempotency and partial work

When you do retry, make sure a retried request is safe to repeat. For chat completions this is usually fine (you want a fresh answer), but for anything with side effects (tool calls that write, agent steps that mutate state) a naive retry can double-apply. Give each logical request a stable identifier that is reused on every retry, and make downstream effects idempotent, so a backoff-retry doesn't run the side effect twice.
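A minimal sketch of that idea follows, assuming an in-memory dictionary as the deduplication store; a real system would need something durable and shared. `ToolExecutor` and `new_idempotency_key` are hypothetical names used only for illustration.

```python
import uuid

class ToolExecutor:
    """Runs side-effecting tool calls at most once per idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored result

    def execute(self, key, action):
        if key in self._results:
            return self._results[key]  # a retry: replay the first result
        result = action()
        self._results[key] = result
        return result

def new_idempotency_key():
    # Generate once per logical request and reuse it on every retry.
    return str(uuid.uuid4())

executor = ToolExecutor()
ledger = []
key = new_idempotency_key()
for _ in range(3):  # one original attempt plus two retries
    executor.execute(key, lambda: ledger.append("charge") or len(ledger))
# ledger holds a single "charge" entry despite three attempts
```

The key is created by the caller rather than per attempt; generating a fresh key inside the retry loop would defeat the deduplication.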

A decision tree

response not ok?
 ├─ 429 + rate_limit_exceeded → backoff (honor retry_after) + retry
 ├─ 429 + budget_exceeded     → STOP, surface scope, no retry
 ├─ 402                       → top-up / auto-topup, then resume
 ├─ other 4xx (400/401)       → fail fast, fix the request/key
 └─ 5xx / timeout             → backoff + retry (gateway may fall back for you)
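
The tree above translates fairly directly into a small dispatch function. This is a sketch, not a reference implementation: the `Action` enum and `decide` function are hypothetical names, and treating a 429 with an unrecognized `error.type` as transient is an assumption you may want to reverse.

```python
from enum import Enum

class Action(Enum):
    OK = "ok"
    BACKOFF_RETRY = "backoff_retry"
    STOP_AND_SURFACE = "stop_and_surface"
    TOP_UP = "top_up"
    FAIL_FAST = "fail_fast"

def decide(status, error_type=None, timed_out=False):
    """Map a response to a reaction from the decision tree.

    `status` is the HTTP status code (ignored when `timed_out` is True);
    `error_type` is the value of error.type, if the body carried one.
    """
    if timed_out or status >= 500:
        return Action.BACKOFF_RETRY
    if status == 429:
        if error_type == "budget_exceeded":
            return Action.STOP_AND_SURFACE
        # rate_limit_exceeded; unknown 429s are assumed transient here
        return Action.BACKOFF_RETRY
    if status == 402:
        return Action.TOP_UP
    if 400 <= status < 500:
        return Action.FAIL_FAST
    return Action.OK
```

Keeping the decision separate from the retry loop makes each branch easy to unit test without a live gateway.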

The takeaway

Resilience lives in the error path. Read error.type, not just the status: back off and retry rate limits with jitter, stop and surface budget caps instead of spinning, treat 402 as a billing event, and keep retries idempotent. Do that and a throttle becomes a brief pause instead of an outage — which is exactly what the gateway's limits were designed to give you room for. The API docs list every error type and the headers that come with them.

Handling 429 and 402 Errors From an LLM Gateway
