Use Case · Voice & Realtime

Voice assistants that answer fast — every turn.

A voice app lives or dies on latency. Nemo Router steers each conversational turn to the quickest healthy endpoint, keeps the line alive with transparent failover, and logs every turn for observability.

voice-turn · conversation 4c1d

One turn of a live conversation

Route strategy: latency-based
Endpoint chosen: quickest healthy
Streaming: token-by-token
Gateway overhead: ~95 ms p50
Mid-call failover: transparent
Turn logged: p50 / p99
low-latency · failover-safe · observable
Gateway overhead
95 ms

p50 added — LLM inference dominates

Each turn
Latency-routed

Steered to the quickest healthy endpoint

Mid-call failure
Failed over

The caller never hears an error

Turn-level
Observable

p50 / p99 latency per model

Why Nemo for voice

Speed, resilience, and a turn-level trail

Voice is the least forgiving LLM workload — every turn is on a clock and a caller is listening. Nemo Router gives that turn fast routing, transparent failover, and observability.

Latency-based routing

A voice app lives or dies on response time: silence on the line is the failure mode. Latency-based routing steers each conversational turn to the quickest healthy endpoint using live latency measurements.

  • Routing decisions add ~95 ms p50 — LLM time dominates
  • Latency-based and least-busy strategies for the hot path
  • Streaming proxied transparently to your speech layer
  • No hot-path buffering — tokens flow as they are generated
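
As a rough illustration of the idea, and not Nemo Router's actual implementation, latency-based selection can be sketched as keeping a moving average of observed latency per endpoint and picking the lowest healthy one. The `Endpoint` class, the smoothing factor `alpha`, and the endpoint names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoint:
    """Hypothetical record of one model deployment."""
    name: str
    healthy: bool = True
    ewma_ms: Optional[float] = None  # None until the first sample arrives

    def observe(self, latency_ms: float, alpha: float = 0.3) -> None:
        """Fold a new latency sample into an exponentially weighted average."""
        if self.ewma_ms is None:
            self.ewma_ms = latency_ms
        else:
            self.ewma_ms = alpha * latency_ms + (1 - alpha) * self.ewma_ms


def pick_quickest(endpoints: list) -> Endpoint:
    """Return the healthy endpoint with the lowest average latency."""
    candidates = [e for e in endpoints if e.healthy and e.ewma_ms is not None]
    if not candidates:
        raise RuntimeError("no healthy endpoint with latency data")
    return min(candidates, key=lambda e: e.ewma_ms)


a, b, c = Endpoint("A"), Endpoint("B"), Endpoint("C")
a.observe(420.0)
b.observe(310.0)
c.observe(380.0)
first = pick_quickest([a, b, c])   # B has the lowest average so far

b.healthy = False                  # B starts failing health checks
second = pick_quickest([a, b, c])  # the choice moves to C, the next quickest
```

Because the average is updated on every sample, the choice can change from one turn to the next as deployments speed up or slow down.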

Failover mid-conversation

A provider blip during a live call must not turn into dead air. The fallback chain retries on the next link transparently, so the turn still completes and the caller hears a response, not an error.

  • Ordered fallback chain per model group
  • Timeouts, 5xx, and circuit-breaks trigger the next link
  • Cross-provider failover keeps a conversation alive
  • Each fallback logged without interrupting the call
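
The ordered fallback behavior described above can be sketched roughly as follows. This is a simplified illustration under assumed names, not the gateway's code: `ProviderError` stands in for a timeout, 5xx, or open circuit breaker, and the two provider functions are stubs.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Stand-in for a timeout, 5xx response, or open circuit breaker."""


def complete_with_fallback(
    chain: Sequence[Tuple[str, Callable[[str], str]]],
    prompt: str,
) -> Tuple[str, List[str]]:
    """Try each link in order; return the reply and a log of failed links."""
    failures: List[str] = []
    for name, call in chain:
        try:
            return call(prompt), failures
        except (ProviderError, TimeoutError) as exc:
            # Record the failure and move on to the next link in the chain.
            failures.append(f"{name}: {exc}")
    raise ProviderError("all links failed: " + "; ".join(failures))


def flaky_primary(prompt: str) -> str:
    # Hypothetical provider that is currently erroring.
    raise ProviderError("503 from upstream")


def healthy_backup(prompt: str) -> str:
    # Hypothetical second provider that answers normally.
    return f"reply to: {prompt}"


reply, failed = complete_with_fallback(
    [("primary", flaky_primary), ("backup", healthy_backup)],
    "What are your opening hours?",
)
```

The caller only sees `reply`; the `failed` list is what would be written to the log without interrupting the call.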

Turn-level observability

When a call feels slow, you need to find the turn that caused it. Every turn lands in the request log with the model, latency, token counts, and cost, and latency metrics surface p50 and p99 per model.

  • Request log records model, latency, tokens, and cost per turn
  • p50 / p99 latency metrics per model to catch a slow tail
  • Export to Langfuse, Datadog, or S3 via a logging callback
  • Alerts fire on latency or error-rate thresholds
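
To make the p50 / p99 idea concrete, here is a minimal sketch that computes both per model from exported log rows. The row shape (`model`, `latency_ms`) and the sample values are assumptions for illustration; the real log schema may differ, and the nearest-rank method used here is one of several common percentile definitions.

```python
from collections import defaultdict


def percentile(values, p: int):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceiling of p * n / 100
    return ordered[rank - 1]


def latency_summary(rows):
    """Group latencies by model and report p50 and p99 for each."""
    by_model = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row["latency_ms"])
    return {
        model: {"p50": percentile(vals, 50), "p99": percentile(vals, 99)}
        for model, vals in by_model.items()
    }


# Hypothetical log rows; model names and latencies are made up.
rows = [
    {"model": "model-a", "latency_ms": 180},
    {"model": "model-a", "latency_ms": 200},
    {"model": "model-a", "latency_ms": 220},
    {"model": "model-a", "latency_ms": 950},  # one slow turn
    {"model": "model-b", "latency_ms": 300},
    {"model": "model-b", "latency_ms": 310},
]
summary = latency_summary(rows)
```

In this sample the median for `model-a` looks healthy while its p99 exposes the single slow turn, which is why the tail metric matters for voice.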

Model catalog for realtime

Realtime assistants want a fast, capable model — and the freedom to switch as faster models ship. The catalog exposes every model behind one key; pick the one with the latency profile you need.

  • Choose any catalog model per turn
  • Tag-filtered routing keeps function-calling turns on capable models
  • Swap models as the catalog grows — no SDK change
  • 20+ models live, more shipping

How it works

A conversational turn, end to end

Nemo Router is the LLM hop between transcription and synthesis. Each turn is routed for latency, streamed back token by token, and recorded in the log, where it feeds the p50 / p99 latency metrics.

Voice turn flow

  1. Caller speaks

    speech-to-text upstream

    Your speech layer transcribes the turn to text.

  2. Turn request

    POST /v1/chat/completions

    The transcript becomes a streaming chat request.

  3. Latency route

    quickest healthy endpoint

    Live latency signal picks the fastest model deployment.

  4. Stream the reply

    token-by-token

    Tokens flow to your text-to-speech with no buffering.

  5. Turn logged

    latency metrics

    Model, latency, cost per turn — p50 / p99 tracked.

Your speech-to-text and text-to-speech layers stay yours. Nemo Router handles the reasoning turn in between, adding low-latency routing, failover, and a logged record.
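
The five steps can be sketched from the application's side as a single turn handler. Everything here is an assumption for illustration: `transcribe`, `stream_reply`, and `speak` are placeholders for your own speech-to-text layer, the streaming LLM call, and your text-to-speech layer, and the returned record only mimics the kind of data a turn log might hold.

```python
import time
from typing import Callable, Iterable


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    stream_reply: Callable[[str], Iterable[str]],
    speak: Callable[[str], None],
) -> dict:
    """Run one turn: transcribe, stream the reply into TTS, return a record."""
    started = time.perf_counter()
    transcript = transcribe(audio)
    first_token_ms = None
    pieces = []
    for piece in stream_reply(transcript):
        if first_token_ms is None:
            # Time to first token is what the caller perceives as silence.
            first_token_ms = (time.perf_counter() - started) * 1000
        speak(piece)  # hand each piece to TTS as soon as it arrives
        pieces.append(piece)
    return {
        "transcript": transcript,
        "reply": "".join(pieces),
        "first_token_ms": first_token_ms,
        "total_ms": (time.perf_counter() - started) * 1000,
    }


# Stub layers so the sketch runs without any audio or network dependencies.
spoken = []
record = run_turn(
    b"\x00\x01",
    transcribe=lambda audio: "what time do you open",
    stream_reply=lambda text: iter(["We open ", "at nine."]),
    speak=spoken.append,
)
```

Measuring time to first token separately from total time is a design choice worth keeping: in a voice app the first number is the one the caller hears as a pause.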

Latency

The quickest healthy endpoint, every turn

Latency-based routing

Live signal picks the fastest deployment — turn by turn

A model deployment that was fastest a minute ago may not be now. Latency-based routing uses live latency measurements to pick the quickest healthy endpoint for each turn. The gateway itself adds about 95 ms at p50; the dominant factor in turn latency is LLM inference, not the proxy.

  • Latency-based and least-busy strategies for the hot path
  • Live signal — the choice updates as deployments speed up or slow
  • Streaming proxied with no hot-path buffering
  • Latency metrics surface p50 / p99 so a slow tail is visible

latency · per-turn routing

Turn-by-turn latency

Turn 1: endpoint A · fastest
Turn 2: endpoint C · fastest
Gateway p50: ~95 ms
LLM inference: dominant factor
Slow-tail alert: on p99 threshold
live signal · per-turn · p50 / p99 tracked
The code

Stream the reasoning turn

A voice turn is a streaming chat request: transcript in, tokens out to your speech layer. These snippets come from the same SDK examples the playground uses; enable streaming and tokens flow as they are generated.

Install: pip install openai

# Cache: enabled (org default). Pass nemo_cache: false to skip.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["NEMOROUTER_API_KEY"],
    base_url="https://api.nemorouter.ai/v1",
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    temperature=1,
    max_tokens=1024,
    top_p=1,
    messages=[
        {"role": "user", "content": "Hello! What models do you support?"},
    ],
    extra_body={
        # "nemo_cache": False,  # Uncomment to skip cache
    },
)

print(response.choices[0].message.content)

Set stream=True in the Python SDK (stream: true in the JSON request body). Nemo Router proxies the token stream with no hot-path buffering.
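
The example above returns the whole reply at once. A streaming variant might look like the sketch below, which uses the OpenAI Python SDK's standard `stream=True` option and reads `delta.content` from each chunk. The `stream_turn` helper is a hypothetical name introduced here, and the client is passed in so the function can be exercised without network access.

```python
from typing import Iterator


def stream_turn(
    client,
    transcript: str,
    model: str = "gemini-2.5-flash",
) -> Iterator[str]:
    """Yield text deltas for one voice turn as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": transcript}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# Usage with the real SDK (needs network access and an API key):
#   import os
#   from openai import OpenAI
#   client = OpenAI(
#       api_key=os.environ["NEMOROUTER_API_KEY"],
#       base_url="https://api.nemorouter.ai/v1",
#   )
#   for text in stream_turn(client, "Hello! What models do you support?"):
#       print(text, end="", flush=True)  # or feed your text-to-speech layer
```

Yielding deltas from a generator keeps the speech layer decoupled: it can start synthesizing on the first piece instead of waiting for the full reply.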

Fast turns, no dead air

Build a voice assistant that never leaves the caller waiting

Latency-based routing, transparent failover, and turn-level observability — all unlocked on every plan.

Nemo Router is the LLM hop — pair it with any speech-to-text and text-to-speech stack.