
Handling 429 and 402 Errors From an LLM Gateway

A 429 and a 402 mean different things and need different client logic. Here is how to handle rate limits, budget caps, and out-of-credits responses gracefully — with backoff, not blind retries.

Nemo Team · 8 min read

The difference between a resilient LLM client and a fragile one is almost entirely in how it handles the unhappy responses. Many clients retry blindly (same request, immediately, in a loop), which turns a transient throttle into a self-inflicted outage and a budget cap into a tight retry storm that can never succeed. Handling 429 and 402 correctly means reading which limit you hit and reacting to that. Here's the playbook.

The three responses you must handle

A gateway can reject a request for three distinct reasons that look superficially similar but need opposite reactions:

| Status | error.type | Means | Right reaction |
| --- | --- | --- | --- |
| 429 | rate_limit_exceeded | Too fast | Back off, honor retry_after, retry |
| 429 | budget_exceeded | Scope hit its cap | Don't retry; raise budget or wait for reset |
| 402 | (insufficient credits) | Account out of money | Top up / enable auto-topup; don't retry |

The trap: both throttles are 429. If you treat all 429s as "wait and retry," you'll cheerfully retry into a budget cap that won't lift until the window resets — burning latency and getting nowhere. Branch on error.type, not just the status code.
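As a minimal sketch of that branch: the helper below assumes the error arrives as a JSON body shaped like {"error": {"type": "..."}}. That body shape and both function names are assumptions for illustration, so check the API docs for the real format.

```python
import json
from typing import Optional

RETRYABLE_429 = "rate_limit_exceeded"   # the only 429 worth retrying


def error_type(body: str) -> Optional[str]:
    """Pull error.type out of a JSON error body; None if absent or malformed.

    Assumes a body like {"error": {"type": "..."}} (hypothetical shape).
    """
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(payload, dict):
        return None
    error = payload.get("error")
    if not isinstance(error, dict):
        return None
    return error.get("type")


def should_retry_429(status: int, body: str) -> bool:
    """True only for a 429 that is a rate limit, never for a budget cap."""
    return status == 429 and error_type(body) == RETRYABLE_429
```

A malformed or missing body falls through to "don't retry", which is the safer default when you can't tell which limit you hit.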

429 rate-limit: exponential backoff with jitter

A rate limit is transient — it clears as the window rolls. The correct response is to retry, but politely: honor the server's retry_after_seconds if present, otherwise back off exponentially with jitter so a fleet of clients doesn't synchronize into a thundering herd.

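import random
import time

class RateLimited(Exception):
    """Still rate-limited after max_retries attempts."""

def call_with_backoff(req, max_retries=5):
    for attempt in range(max_retries):
        resp = gateway.call(req)   # gateway: your client; resp.error assumed dict-like
        err = resp.error or {}
        if resp.status != 429 or err.get("type") != "rate_limit_exceeded":
            return resp   # success, or an error that backoff won't fix
        delay = err.get("retry_after_seconds") or (2 ** attempt)
        if attempt < max_retries - 1:   # no point sleeping before giving up
            time.sleep(delay + random.uniform(0, delay * 0.25))   # jitter
    raise RateLimited()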

The jitter matters: without it, every client that got throttled at the same instant retries at the same instant, recreating the burst. (This is the client side of the RPM/TPM limits the gateway enforces.)
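To see the effect, here is a small sketch that computes the delay on its own and compares a simulated fleet with and without jitter. The function name, the `jitter` parameter, and the seeded generators are illustrative assumptions; the 25% jitter fraction mirrors the snippet above.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  jitter: float = 0.25,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Uses the server hint when given, else 2 ** attempt, then adds a random
    extra delay of up to `jitter` times that base.
    """
    rng = rng or random
    base = retry_after if retry_after else float(2 ** attempt)
    return base + rng.uniform(0, base * jitter)


# Ten clients throttled at the same instant, all on their third attempt.
fleet = [random.Random(seed) for seed in range(10)]
no_jitter = {backoff_delay(2, jitter=0.0, rng=r) for r in fleet}
with_jitter = {backoff_delay(2, rng=r) for r in fleet}
```

Without jitter, every client lands on the same 4-second delay, so `no_jitter` collapses to a single value. With jitter, the delays spread across the 4 to 5 second range and the retries no longer arrive as one burst.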

Retrying a budget cap is a bug, not resilience

A request rejected with budget_exceeded will get a 429 again on the next call, and the next, until the budget window resets or someone raises the cap. Retrying it in a tight loop wastes your latency budget and your patience. Detect error.type == "budget_exceeded" and surface it to a human or fail the job cleanly. Don't spin.

429 budget-cap: stop and surface

A budget cap is a policy outcome, not a transient failure. The right move is to stop, report which scope hit its cap, and let a human decide: raise the budget, wait for the window to reset, or shed load. For batch jobs, checkpoint and resume after reset; for interactive flows, show the user an honest "spending limit reached" rather than a spinner that never resolves.
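For the batch case, a checkpoint-and-resume loop might look roughly like this. `BudgetExceeded`, its `scope` attribute, and the JSON checkpoint file are all hypothetical; the sketch assumes your client raises an exception of this kind when it sees a 429 with budget_exceeded.

```python
import json
import os


class BudgetExceeded(Exception):
    """Hypothetical error raised by your client on a 429 budget_exceeded."""

    def __init__(self, scope: str):
        super().__init__(f"budget cap reached for scope {scope!r}")
        self.scope = scope


def run_batch(items, process, checkpoint_path):
    """Process items in order; on a budget cap, save progress and stop.

    Returns the index of the next unprocessed item (len(items) when done).
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(items)):
        try:
            process(items[i])
        except BudgetExceeded as exc:
            # Record where we stopped and which scope was capped, then exit.
            with open(checkpoint_path, "w") as f:
                json.dump({"next_index": i, "scope": exc.scope}, f)
            return i
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)   # finished: clear the checkpoint
    return len(items)
```

Rerunning the same call after the window resets picks up at the saved index, so nothing already processed is paid for twice.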

402 out-of-credits: top up, don't retry

A 402 means the account has no credits — retrying changes nothing until money is added. Handle it by triggering your top-up path (or relying on auto-topup if enabled) and only then resuming. Treat 402 as a billing event that pages the account owner, not a transient error to retry through.

Idempotency and partial work

When you do retry, make sure a retried request is safe to repeat. For chat completions this is usually fine (you want a fresh answer), but for anything with side effects (tool calls that write, agent steps that mutate state) a naive retry can double-apply. Tag retried requests and make downstream effects idempotent so a backoff-retry doesn't run the side effect twice.
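One common way to get there is an idempotency key: generate one key per logical operation, reuse it on every retry, and have the side-effecting layer skip work it has already done for that key. The class and function below are an in-memory sketch with hypothetical names, not a gateway feature; a real system would keep the keys in a shared, persistent store.

```python
import uuid


class IdempotentExecutor:
    """Runs each side effect at most once per idempotency key."""

    def __init__(self):
        self._results = {}

    def run(self, key: str, effect):
        if key in self._results:
            return self._results[key]   # retry: replay the stored result
        result = effect()
        self._results[key] = result
        return result


def new_idempotency_key() -> str:
    """One key per logical operation, reused across all of its retries."""
    return str(uuid.uuid4())


executor = IdempotentExecutor()
ledger = []
key = new_idempotency_key()
for _ in range(3):   # the original attempt plus two retries
    executor.run(key, lambda: ledger.append("charge") or "ok")
# ledger holds a single "charge" despite three attempts
```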

A decision tree

response not ok?
 ├─ 429 + rate_limit_exceeded → backoff (honor retry_after) + retry
 ├─ 429 + budget_exceeded     → STOP, surface scope, no retry
 ├─ 402                       → top-up / auto-topup, then resume
 ├─ 4xx (400/401)             → fail fast, fix the request/key
 └─ 5xx / timeout             → backoff + retry (gateway may fall back for you)
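The tree translates almost line for line into a small dispatcher. This is a sketch under stated assumptions: the `Action` names are made up for illustration, `status=None` stands in for a timeout, and treating an unrecognized 429 as "stop" is this sketch's conservative choice rather than documented gateway behavior.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    OK = "ok"
    BACKOFF_RETRY = "backoff_retry"
    STOP_AND_SURFACE = "stop_and_surface"
    TOP_UP = "top_up"
    FAIL_FAST = "fail_fast"


def decide(status: Optional[int], error_type: Optional[str] = None) -> Action:
    """Map a response to an action. status=None stands for a timeout."""
    if status is None or status >= 500:
        return Action.BACKOFF_RETRY
    if status < 400:
        return Action.OK
    if status == 429:
        if error_type == "rate_limit_exceeded":
            return Action.BACKOFF_RETRY
        # budget_exceeded, or a 429 we don't recognize: don't spin on it.
        return Action.STOP_AND_SURFACE
    if status == 402:
        return Action.TOP_UP
    return Action.FAIL_FAST   # 400, 401, and other client errors
```

Keeping the decision separate from the retry loop makes it easy to unit-test every branch without touching the network.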

The takeaway

Resilience lives in the error path. Read error.type, not just the status: back off and retry rate limits with jitter, stop and surface budget caps instead of spinning, treat 402 as a billing event, and keep retries idempotent. Do that and a throttle becomes a brief pause instead of an outage — which is exactly what the gateway's limits were designed to give you room for. The API docs list every error type and the headers that come with them.

Written by Nemo Team. Engineering, product, and company posts from the Nemo Router team: code-first, cost-honest, no vendor-marketing fluff.