$5 free credits when you sign up
Routing

One API. 20+ models.

Call Google (Gemini, Imagen, Veo) — Anthropic, OpenAI, and AWS Bedrock shipping next through a single endpoint. Five routing strategies, configurable fallback chains, tag-based filters — every decision logged for replay and audit.

routing-decision · request_id 7f2a

Live routing trace

Strategyleast-busy
Tag filtervision
Primaryvertex/gemini-2.5-pro
Fallback chainanthropic → bedrock
Retries used0
Routing overhead< 1 ms
loggedtag-filteredcooldown-aware
Routing strategies
5

Shuffle, least-busy, usage, latency, cost

Retries + cooldown
Configurable

Per-org retry count + cooldown window

Tag filtering
Per-request

vision, code, long-context, multilingual

Decisions logged
100%

Strategy, primary, fallback, latency

Provider catalog

  • OpenAI
  • Anthropic
  • Gemini
  • AWS Bedrock
  • Vertex AI
  • Mistral
  • Meta
  • Cohere
  • DeepSeek
  • + 67 more
Capabilities

Smart routing, zero effort

Provider-agnostic load balancing, automatic fallbacks, and an OpenAI-compatible API. One endpoint routes requests across 20+ models with per-org strategies and tag-based filters.

Five routing strategies

Pick the strategy that fits your workload. Simple-shuffle randomizes uniformly, least-busy steers to the lowest concurrent-request endpoint, and usage/latency/cost-based weights pick from live signal.

  • simple-shuffle — uniform random across healthy endpoints
  • least-busy — fewest in-flight requests wins
  • usage-based — distribute by RPM/TPM headroom
  • latency-based + cost-based — optimize for the metric you care about

Fallback chains in practice

Define an ordered list of backup providers per model group. If the primary returns a 5xx, times out, or hits a circuit break, the gateway retries the next link transparently — your users never see the error.

  • Ordered chain per model_group, not a global setting
  • Cross-provider failover (Vertex → Anthropic → Bedrock)
  • Timeout + 5xx + circuit-break all trigger the next link
  • Final failure returns a clean 502 with the full chain in headers

Retries, timeouts, cooldowns

Configure per-org retry count, request timeout, and cooldown between retry attempts. Cooldown prevents hammering an unhealthy provider; the gateway honors provider-side rate limit hints in the response.

  • Per-org retry count and per-request override
  • Cooldown window respected across worker fleet
  • Honors `retry-after` and provider rate-limit hints
  • Exponential backoff with jitter, capped

Tag-based model filtering

Filter the candidate pool by capability tags — vision, code, long-context, multilingual, function-calling. Route a vision request to vision-capable models only; never accidentally fall back to a text-only one.

  • Tags resolved from the model catalog at request time
  • Multi-tag intersection (e.g. `vision` AND `long-context`)
  • Per-key default tags with per-request override
  • Mismatched tag set returns 400, not a wrong model

Per-org cooldowns + circuit breakers

Provider-level health is tracked per organization so a noisy tenant cannot poison the pool for others. A failing endpoint trips its own circuit; healthy tenants keep routing without any blast radius.

  • Per-org failure counters — one tenant cannot DoS the pool
  • Circuit-break triggers cooldown automatically
  • Healthy endpoints recover after a configurable cooldown window
  • Tenant isolation matches the same RLS contract used elsewhere

OpenAI-compatible API

Drop-in compatible with the unmodified OpenAI SDK. Change two lines — base URL and API key — and you’re routing across every model in the catalog. Works with Python, Node.js, Go, Ruby, Java, C#, PHP, and Rust.

  • Same chat.completions, embeddings, and images endpoints
  • Base URL: https://api.nemorouter.ai/v1
  • Auth: Bearer sk-nemo-...
  • Streaming, tool use, and JSON mode all proxied transparently
How it works

A routing decision, end to end

Every request flows through the same path: strategy match, primary provider, fallback chain, settle. The decision is logged with strategy, primary, fallback, retries, and latency for replay and audit.

Routing decision flow

  1. Request

    POST /v1/chat/completions

    Bearer sk-nemo-..., model="gemini-2.5-flash"

  2. Strategy match

    least-busy · vision tag

    Pool filtered by tags; strategy picks the candidate.

  3. Primary provider

    Vertex AI · gemini-2.5-flash

    5xx / timeout / circuit-break triggers the chain.

  4. Fallback chain

    Anthropic → Bedrock → OpenAI

    Ordered list per model_group; each link retried in turn.

  5. Settled

    cost + latency logged

    Strategy, primary, fallback, retries — all in the request log.

Routing decisions happen in-memory and add < 1 ms. The dominant latency factor is always LLM inference, not our proxy overhead.

Fallback Chains

Fallback chains in practice

Configure an ordered chain per model group. Each link is tried in turn on failure — same SDK call, no client-side retry logic, no shadow-traffic glue code.

Cross-provider failover

Vertex → Anthropic → Bedrock — same request, three providers

When the primary returns a 5xx, times out, or trips its circuit breaker, the gateway tries the next link in the chain. Each retry honors the cooldown window and provider-side rate-limit hints.

  • Per model_group, not a global setting
  • Cross-provider chains (e.g. Vertex Gemini → Anthropic Claude → Bedrock)
  • Final failure returns 502 with the full chain in response headers
  • Every link logged with latency + result for post-hoc replay
model_group · gemini-2.5-flash

Fallback chain config

Primaryvertex/gemini-2.5-flash
Fallback 1anthropic/claude-3.5-sonnet
Fallback 2bedrock/claude-3-haiku
Fallback 3openai/gpt-4o-mini
Cooldown5 s
Max retries3
orderedcross-providercooldown-aware
Drop-in

Two-line migration from OpenAI

OpenAI-compatible

Change the base URL and the key — that’s it

The unmodified OpenAI SDK works against Nemo Router on every supported language. Streaming, tool use, JSON mode, embeddings, image generation — all proxied transparently. We never modify the request body except to strip Nemo-specific extras.

  • Python, Node.js, Go, Ruby, Java, C#, PHP, Rust
  • Base URL: https://api.nemorouter.ai/v1
  • Auth: Bearer sk-nemo-...
  • Drop-in for chat.completions, embeddings, and images
diff · client.py

Two-line change

- base_urlhttps://api.openai.com/v1
+ base_urlhttps://api.nemorouter.ai/v1
- api_keysk-proj-...
+ api_keysk-nemo-...
Code lines changed2
OpenAI SDK8 languagesstreaming-safe
We had a homegrown failover layer with three providers behind it. Replaced ~600 lines with a base-URL change and a chain config. Same reliability, none of the maintenance.

Staff Engineer

AI-native scale-up, ~40 engineers

Catalog

A live model catalog, kept in sync

20+ models live now on Google Vertex AI. New models added within hours of provider launch — your code never has to change.

Model catalog

The catalog drives routing — not your code

Tags, capabilities, and fallback chains live in the catalog, not in your repo. When a new model launches, we add it to the catalog and tag it; your existing requests start picking it up if the tags match. No SDK upgrade, no redeploy.

  • Catalog updated within hours of provider launch
  • Capability tags: vision, code, long-context, function-calling, multilingual
  • Per-model RPM/TPM advertised — strategies use it as input
  • Public read-only catalog at /api/public/models
GET /api/public/models

Catalog snapshot

gemini-2.5-provision · long-context
gemini-2.5-flashvision · fast
claude-3.5-sonnetcode · long-context
gpt-4ovision · multimodal
Total models live18
Provider count1
live catalogtaggedcapability-aware
FAQ

Common routing questions

One key. One bill. Every model.

Replace your homegrown failover layer in an afternoon

Sign up, paste your virtual key, change the base URL. Five routing strategies, fallback chains, and tag filtering are unlocked on every plan.