
Provider Fallback Chains: Surviving an OpenAI Outage

When a provider 5xxs or rate-limits you, your app shouldn't go down with it. Here is how fallback chains on an LLM gateway reroute to a healthy provider mid-request — without changing your code.

Nemo Team · 9 min read

Every model provider has bad days: a regional outage, a capacity crunch that turns into 429s, a slow degradation that times out your requests. If your app calls one provider directly, its bad day is your bad day. A fallback chain turns a provider outage from an incident into a non-event — the gateway reroutes to a healthy provider mid-request, and your users never notice.

What is a fallback chain?

A fallback chain is an ordered list of routes the gateway tries for a given request. If the first fails in a retryable way, it tries the second, and so on, until one succeeds or the chain is exhausted:

request for "best-chat"
  ├─ try Anthropic claude-sonnet-4-6   → 529 overloaded
  ├─ try OpenAI gpt-class equivalent   → 200 OK   ✅ returned to caller
  └─ (would try a third if needed)

The caller asked for a capability ("best-chat"), not a specific provider endpoint. That indirection is what makes fallback possible: the gateway is free to satisfy the request from whichever route is healthy, because the client never hardcoded one.
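
To make the indirection concrete, here is a minimal Python sketch. It assumes a hypothetical in-process config that maps each capability name to an ordered list of routes. `Route`, `CHAINS`, `resolve`, and every model name except claude-sonnet-4-6 are illustrative placeholders, not Nemo Router's actual config format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Route:
    """One hop in a chain: which provider to call and which model to ask for."""
    provider: str
    model: str


# Hypothetical config: clients name a capability, never a provider endpoint.
# Only claude-sonnet-4-6 comes from the example above; the rest are placeholders.
CHAINS: Dict[str, List[Route]] = {
    "best-chat": [
        Route("anthropic", "claude-sonnet-4-6"),
        Route("openai", "gpt-class-equivalent"),
        Route("third-provider", "placeholder-model"),
    ],
}


def resolve(capability: str) -> List[Route]:
    """Return a copy of the ordered routes for a capability."""
    try:
        return list(CHAINS[capability])  # copy so callers cannot reorder the config
    except KeyError:
        raise ValueError(f"unknown capability: {capability!r}") from None
```

Reordering or swapping routes in `CHAINS` changes which provider serves "best-chat" without any change on the client, which is the whole point of naming a capability.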

What counts as a retryable failure?

Not every error should trigger a fallback. Falling back on the wrong errors is worse than not falling back at all: you'd retry a bad request several times and multiply the cost. Here is where to draw the line:

Failure                            | Fall back? | Why
429 / 529 (rate-limit, overloaded) | Yes        | Transient; another provider has capacity
500 / 503 / timeout                | Yes        | Provider-side, likely transient
400 (malformed request)            | No         | The request is bad everywhere; retrying wastes money
401 (auth)                         | No         | A key problem won't fix itself on provider #2
Guardrail block                    | No         | A policy decision, not a failure

The chain only advances on errors that a different healthy provider could plausibly succeed at. Client errors fail fast.
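
The table translates into a small classifier plus a loop that advances only on retryable failures. This is a sketch, assuming a hypothetical `ProviderError` that carries the upstream HTTP status (with `None` standing in for a timeout) and a hypothetical `GuardrailBlock` exception. A real gateway's error types will differ.

```python
from typing import Callable, Optional, Sequence

# Statuses a different, healthy provider could plausibly succeed at.
# 529 is a non-standard "overloaded" status some providers return.
RETRYABLE_STATUSES = frozenset({429, 500, 503, 529})


class ProviderError(Exception):
    """Hypothetical upstream error; status=None stands in for a timeout."""

    def __init__(self, status: Optional[int]) -> None:
        super().__init__(f"provider error: {status if status is not None else 'timeout'}")
        self.status = status


class GuardrailBlock(Exception):
    """Hypothetical policy rejection: a decision, not a failure."""


def is_retryable(err: BaseException) -> bool:
    """Advance the chain only on errors another provider could fix."""
    if isinstance(err, ProviderError):
        return err.status is None or err.status in RETRYABLE_STATUSES
    return False  # 400, 401, guardrail blocks, and anything unknown fail fast


def run_chain(routes: Sequence[Callable[[], str]]) -> str:
    """Try each route in order until one succeeds or the chain is exhausted."""
    last_err: Optional[Exception] = None
    for call in routes:
        try:
            return call()
        except Exception as err:
            if not is_retryable(err):
                raise  # client errors and policy blocks surface immediately
            last_err = err  # transient provider failure: try the next route
    raise RuntimeError("fallback chain exhausted") from last_err
```

Unknown exception types default to "not retryable". That errs toward failing fast rather than multiplying cost on an error nobody has classified yet.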

Fallback is not free latency

Each hop in the chain adds a round trip. A request that falls back twice can spend up to three providers' worth of wall-clock time, which is exactly the kind of thing that widens your p99 tail. Order the chain so the fastest, most reliable route is first, and keep chains short. Fallback buys availability; pay for it deliberately.
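
The latency cost is simple addition: a request pays for every failed hop before the one that serves it. Here is a minimal sketch with made-up example numbers (not measurements), showing why a hop that fails slowly by timing out hurts far more than one that fails fast with a 529:

```python
from typing import Sequence


def chain_wall_clock_ms(failed_hops_ms: Sequence[float], served_ms: float) -> float:
    """Wall-clock for one request: every failed hop, then the hop that served it."""
    return sum(failed_hops_ms) + served_ms


# Made-up illustrative latencies, not measurements.
fast_fail = chain_wall_clock_ms([300], 1_200)          # hop 1 returns 529 quickly
slow_fail = chain_wall_clock_ms([300, 10_000], 1_200)  # hop 2 also runs to a 10 s timeout
print(fast_fail, slow_fail)  # 1500 11500
```

In the worst case every hop runs to its timeout, so the upper bound is the sum of the per-hop timeouts. Keeping chains short keeps that bound small.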

Cost and credits across a fallback

A subtle point that naive retry implementations often get wrong: only the successful call should be billed. When the gateway reserves credits for a request (see reserve-and-settle), a failed first hop must not settle a charge. The reservation carries through to the route that actually serves the request, and only that route's authoritative cost is settled. A fallback that left a phantom charge for the failed attempt would make outages cost customers money, which is exactly backwards.

So the accounting is: reserve once, try the chain, settle the winner's real cost, release the rest. The customer pays for one successful call regardless of how many routes were attempted.
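
The reserve-once, settle-the-winner rule fits in a few lines. This sketch assumes a hypothetical in-memory `CreditAccount` and a hypothetical `RetryableError`. It illustrates the accounting invariant, not the gateway's actual ledger or reserve-and-settle API. Each route is assumed to return its response together with its authoritative cost.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


class RetryableError(Exception):
    """Hypothetical marker for failures that should advance the chain."""


@dataclass
class CreditAccount:
    """Hypothetical in-memory account; amounts are integer credit units."""
    balance: int
    reserved: int = 0
    charges: List[int] = field(default_factory=list)  # one entry per settled call

    def reserve(self, amount: int) -> None:
        if amount > self.balance - self.reserved:
            raise ValueError("insufficient credits")
        self.reserved += amount

    def settle(self, reserved_amount: int, actual_cost: int) -> None:
        """Drop the hold and charge the serving route's real cost."""
        self.reserved -= reserved_amount
        self.balance -= actual_cost
        self.charges.append(actual_cost)

    def release(self, reserved_amount: int) -> None:
        """Drop the hold without charging anything."""
        self.reserved -= reserved_amount


Route = Callable[[], Tuple[str, int]]  # returns (response, authoritative cost)


def billed_chain(account: CreditAccount, estimate: int, routes: Sequence[Route]) -> str:
    """Reserve once, try the chain, settle the winner's cost, release on any failure."""
    account.reserve(estimate)
    try:
        for call in routes:
            try:
                response, cost = call()
            except RetryableError:
                continue  # failed hop: nothing settled, the reservation carries forward
            account.settle(estimate, cost)
            return response
        raise RuntimeError("fallback chain exhausted")
    except BaseException:
        account.release(estimate)  # exhausted or non-retryable: no charge at all
        raise
```

The `except BaseException` path releases the reservation on exhaustion and on non-retryable errors alike, so no outcome leaves credits stuck in `reserved`.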

Ordering the chain

A good chain is ordered by a blend of three factors:

  1. Reliability first: your most consistently available route leads, so the common case is one hop.
  2. Latency second — among reliable routes, the faster one goes earlier to protect your tail.
  3. Cost as a tiebreaker — when two routes are equally good, the cheaper one leads.

Resist making the cheapest route first if it's also the flakiest — you'll fall back constantly, and the latency and complexity cost outweighs the per-call savings. The point of the chain is availability; optimize it for that, and let dedicated cost-vs-quality routing handle the spend trade-off separately.
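
One way to encode reliability, then latency, then cost is a lexicographic sort key. Here is a sketch, assuming hypothetical per-route stats (`availability`, `p95_latency_ms`, `cost_per_1k_tokens`). Bucketing availability so that near-identical reliabilities count as ties is an assumption of this sketch, and the bucket width is something you would tune.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RouteStats:
    """Hypothetical per-route health numbers, e.g. from a recent window."""
    name: str
    availability: float  # fraction of recent requests that succeeded, 0..1
    p95_latency_ms: float
    cost_per_1k_tokens: float


def order_chain(
    routes: Sequence[RouteStats], availability_bucket: float = 0.005
) -> List[RouteStats]:
    """Order by reliability first, latency second, cost as the tiebreaker.

    Availability is rounded to buckets (0.5 percentage points by default) so that
    near-identical reliabilities tie and latency decides; the width is an assumption.
    """
    def key(r: RouteStats) -> Tuple[int, float, float]:
        bucket = round(r.availability / availability_bucket)
        # Negate the bucket so higher availability sorts first.
        return (-bucket, r.p95_latency_ms, r.cost_per_1k_tokens)

    return sorted(routes, key=key)
```

With a cheap route at 95% availability and two routes at roughly 99.9%, the cheap one lands last, and the faster of the two reliable routes leads. Cost only breaks ties among routes that are already equally reliable and fast.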

Testing the unhappy path

You can't trust a fallback you've never seen fire. The way to test it is to force the first route to fail (point it at a deliberately broken target or simulate a 529) and assert that (a) the request still returns 200 from the next route, and (b) exactly one successful call was billed. An untested fallback chain is a comforting config line that may or may not work when the outage you bought it for actually arrives.
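
Here is a minimal version of that test using Python's standard `unittest`. It is written against a hypothetical in-process `Gateway` stub and a `FakeRoute` test double that returns a fixed status. Both are stand-ins for illustration, not Nemo Router APIs; in practice you would point the same two assertions at your real gateway. Save it as a module such as `test_fallback.py` and run `python -m unittest`.

```python
import unittest


class FakeRoute:
    """Test double: returns a fixed (status, cost) and counts its calls."""

    def __init__(self, status, cost=0):
        self.status = status
        self.cost = cost
        self.calls = 0

    def __call__(self):
        self.calls += 1
        return self.status, self.cost


class Gateway:
    """Hypothetical minimal gateway: walks the chain and bills only a 200."""

    RETRYABLE = frozenset({429, 500, 503, 529})

    def __init__(self, routes):
        self.routes = routes
        self.billed = []  # cost of every billed call

    def handle(self):
        status = None
        for route in self.routes:
            status, cost = route()
            if status == 200:
                self.billed.append(cost)
                return status
            if status not in self.RETRYABLE:
                return status  # client error: fail fast, nothing billed
        return status  # chain exhausted: surface the last failure


class FallbackChainTest(unittest.TestCase):
    def test_overloaded_primary_falls_back_and_bills_once(self):
        primary = FakeRoute(529)  # simulate an overloaded first route
        secondary = FakeRoute(200, cost=42)
        gw = Gateway([primary, secondary])

        self.assertEqual(gw.handle(), 200)  # (a) served by the next route
        self.assertEqual(gw.billed, [42])   # (b) exactly one call billed
        self.assertEqual(primary.calls, 1)

    def test_bad_request_does_not_fall_back(self):
        primary = FakeRoute(400)
        secondary = FakeRoute(200, cost=42)
        gw = Gateway([primary, secondary])

        self.assertEqual(gw.handle(), 400)
        self.assertEqual(secondary.calls, 0)
        self.assertEqual(gw.billed, [])
```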

The takeaway

A fallback chain is the difference between "OpenAI is down" being an incident and being a log line. Define the chain by capability, not provider; advance only on retryable failures; bill only the successful hop; and order by reliability, then latency, then cost. Then the next provider outage is something your gateway handles while you sleep. Configure chains in Router Settings.

Written by Nemo Team. Engineering, product, and company posts from the Nemo Router team: code-first, cost-honest, no vendor-marketing fluff.
