Does Nemo Router support embedding models for RAG?

Yes. The same OpenAI-compatible endpoint serves embeddings and chat completions. Your indexing job and your query path both call https://api.nemorouter.ai/v1 with one NemoRouter key — no separate embedding provider to configure.

How does caching reduce RAG costs?

RAG traffic is repetitive — popular questions, re-asked queries, identical context windows. Nemo caches responses by default; an exact-match repeat is served from cache and skips the provider call entirely. Pass nemo_cache: false per request to force a fresh generation.

Can I track cost for each stage of a RAG pipeline?

Yes. LiteLLM reports the real cost of every call via the response-cost header, and the request log records the model and metadata for each. Tag your embedding calls and chat calls separately to attribute spend per pipeline stage.

What happens if my embedding provider goes down mid-index?

The fallback chain retries the request on the next provider in the configured order. A 5xx, a timeout, or a circuit-break on the primary triggers the next link transparently, so your index build keeps going and the failover is logged.

Use Case · RAG

Retrieval-augmented generation, one key for embed and chat.

A RAG pipeline calls two model families — embeddings to index and query, a chat model to synthesize the answer. Route both through one Nemo Router endpoint, cache the repeats, and track cost per stage.


rag-pipeline · request trace

One pipeline, two model families

Index embed: gemini-embedding
Query embed: gemini-embedding
Synthesis: gemini-2.5-flash
Cache: hit · 0 ms
Stages tagged: index · query
Provider config: none

embed + chat · cached · cost-tracked

Model families: Embed + chat
Repeat queries: Cached
Cost attribution: Per stage
Catalog: 20+

Why Nemo for RAG

The four things a RAG pipeline needs

Two model families, repetitive traffic, real cost pressure, and providers that occasionally fail. Nemo Router handles all four behind one key.

Embeddings and chat, one key

A RAG pipeline touches two model families. The same endpoint that answers chat.completions also serves the embeddings call that builds and queries your index — one key, one bill, zero provider config.

embeddings + chat.completions on the same base URL
Swap embedding or chat models without re-keying
Catalog tags surface long-context models for big retrievals
No separate embedding-provider account to manage
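
One way to picture "swap models without re-keying" is to keep the endpoint and key in one place and treat model names as plain configuration. This is a minimal sketch: the base URL and NEMOROUTER_API_KEY variable come from the SDK example on this page, while the RagConfig class, the gemini-embedding model id (borrowed from the trace label above) and the placeholder chat model are assumptions for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RagConfig:
    """Hypothetical settings holder; not part of any Nemo Router SDK."""
    base_url: str = "https://api.nemorouter.ai/v1"
    api_key_env: str = "NEMOROUTER_API_KEY"   # one key for both model families
    embed_model: str = "gemini-embedding"     # assumed model id
    chat_model: str = "gemini-2.5-flash"


default = RagConfig()
# Swapping the chat model changes one field; endpoint and key stay put.
# "another-chat-model" is a placeholder, not a real catalog entry.
swapped = replace(default, chat_model="another-chat-model")
```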

Caching for repetitive traffic

RAG traffic repeats — FAQs re-asked, identical context windows, popular queries. Caching is on by default, so an exact-match request is served without ever hitting the provider.

Response cache enabled by default per org
Exact-match repeats skip the provider call entirely
Override per request with nemo_cache: false for fresh output
Cache decision recorded in the request log
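
The per-request override can live in a small request builder, so a caller opts out of the cache with one flag. A sketch only: chat_request is a hypothetical helper, and the system-prompt wording is an assumption; nemo_cache is the field named above, passed through the OpenAI SDK's extra_body parameter.

```python
def chat_request(question, context, fresh=False, model="gemini-2.5-flash"):
    """Build keyword arguments for client.chat.completions.create()."""
    request = {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer using only this context:\n" + context},
            {"role": "user", "content": question},
        ],
    }
    if fresh:
        # Per-request override named in the text; omitting it keeps the org default.
        request["extra_body"] = {"nemo_cache": False}
    return request


cached = chat_request("How do I reset my password?", "Use Settings > Security.")
fresh = chat_request("How do I reset my password?", "Use Settings > Security.", fresh=True)
```

With an OpenAI client pointed at the gateway, client.chat.completions.create(**fresh) would send the override; the default call sends nothing extra.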

Cost tracking per pipeline stage

LiteLLM reports the real cost of every call. Tag your embedding calls and your synthesis calls separately and the dashboard attributes spend to each stage of the pipeline.

Real per-call cost from the response-cost header
Tag index vs. query traffic via request metadata
Per-org, per-team, and per-key budgets cap runaway spend
Spend analytics break down cost by model and tag
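
The dashboard does the attribution for you, but a client-side tally is a handy cross-check. The sketch below rests on two assumptions that should be verified against a real response: that the cost header carries LiteLLM's usual name, x-litellm-response-cost, and that tags travel in a metadata.tags field of the request body. StageLedger and the helper names are hypothetical.

```python
from collections import defaultdict

# Assumption: LiteLLM's proxy names this header x-litellm-response-cost;
# check a real Nemo Router response before relying on it.
COST_HEADER = "x-litellm-response-cost"


def stage_tag(stage):
    """extra_body payload tagging a call; the metadata shape is an assumption."""
    return {"metadata": {"tags": ["stage:" + stage]}}


def cost_from_headers(headers):
    """Return the reported cost in USD, or 0.0 when the header is absent."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return float(lowered.get(COST_HEADER, 0.0))


class StageLedger:
    """Accumulates spend per pipeline stage on the client side."""

    def __init__(self):
        self.totals = defaultdict(float)

    def record(self, stage, headers):
        self.totals[stage] += cost_from_headers(headers)


ledger = StageLedger()
ledger.record("index", {"X-LiteLLM-Response-Cost": "0.00012"})
ledger.record("query", {"x-litellm-response-cost": "0.00003"})
ledger.record("query", {})  # a response without the header adds nothing
```

With the OpenAI Python SDK, client.chat.completions.with_raw_response.create(..., extra_body=stage_tag("query")) returns an object whose .headers can be handed to ledger.record and whose .parse() yields the usual completion.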

Failover keeps retrieval answering

When an embedding or chat provider degrades, the fallback chain retries the request on the next link transparently. Your index build finishes and your query path keeps returning answers.

Ordered fallback chain per model group
Timeouts, 5xx, and circuit-breaks all trigger the next link
Retries honor cooldown and provider rate-limit hints
Every fallback logged for replay
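
Failover happens inside the gateway, so there is nothing to write on the client. To make the ordering concrete, here is a conceptual sketch of an ordered fallback chain. It illustrates the idea only and is not Nemo Router's implementation: ProviderError, call_with_fallback and the two stub providers are hypothetical, and cooldowns and rate-limit hints are left out.

```python
class ProviderError(Exception):
    """Stands in for a timeout, a 5xx response, or an open circuit breaker."""


def call_with_fallback(chain, request):
    """Try each (name, call) pair in order; return the first success."""
    attempts = []
    for name, call in chain:
        try:
            result = call(request)
        except ProviderError as exc:
            attempts.append((name, str(exc)))  # log the failover, move on
            continue
        return result, attempts
    raise ProviderError(f"all providers failed: {attempts}")


def primary(request):
    raise ProviderError("503 from primary")


def secondary(request):
    return f"embedded {len(request)} chunks"


result, log = call_with_fallback(
    [("primary", primary), ("secondary", secondary)],
    ["chunk a", "chunk b"],
)
```

Here the caller receives the secondary's result, and log keeps one entry for the failed primary attempt.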

How it works

A RAG request, end to end

Index once, then query: embed the question, retrieve context from your own vector store, and synthesize with a chat model. Nemo sits on the two LLM hops — embeddings and synthesis — and logs the cost of each.

RAG pipeline flow

Index documents · POST /v1/embeddings
Chunk and embed your corpus once; store vectors in your DB.

Query embedding · POST /v1/embeddings
Embed the user question with the same model.

Retrieve context · your vector store
Nearest-neighbour search runs in your own database.

Synthesize answer · POST /v1/chat/completions
Chat model answers from retrieved context, served from cache if repeated.

Settled + logged · cost per stage
Embed cost, chat cost, and cache hit are all in the request log.

Nearest-neighbour search stays in your database. Nemo Router handles the two LLM hops — embeddings and synthesis — with caching, failover, and per-stage cost tracking.
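
The query path can be sketched in a few functions. This is a minimal illustration, not a reference implementation: client is any OpenAI-compatible client pointed at the gateway, the in-memory list stands in for your vector store, and the gemini-embedding model id (taken from the trace label) and the system-prompt wording are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """index is a list of (text, vector); return the k most similar texts."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def answer(client, question, index,
           embed_model="gemini-embedding",   # assumed model id
           chat_model="gemini-2.5-flash"):
    """Query embedding -> retrieval -> synthesis, one client for both hops."""
    # Hop 1: POST /v1/embeddings
    embedded = client.embeddings.create(model=embed_model, input=[question])
    query_vec = embedded.data[0].embedding
    # Retrieval stays local; a real pipeline would query its vector store here.
    context = "\n".join(retrieve(query_vec, index))
    # Hop 2: POST /v1/chat/completions
    response = client.chat.completions.create(
        model=chat_model,
        messages=[
            {"role": "system", "content": "Answer using only this context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

Only the two client calls leave your infrastructure; everything between them runs against your own data.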

Caching

Repeated questions never hit the provider twice

Response caching

Exact-match repeats are served from cache

Knowledge-base RAG answers the same questions over and over. With caching on by default, an identical request — same model, same context, same prompt — returns from cache instead of paying for another generation. The cache decision lands in the request log so you can see the hit rate.

Caching enabled by default per org
Exact-match repeats skip the provider call and the cost
nemo_cache: false forces a fresh generation when freshness matters
Cache hit / miss recorded per request for observability
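
Because hits are exact-match, two requests that differ only in whitespace or in the order of retrieved chunks would presumably count as different requests. A hedged sketch of building the prompt deterministically follows, assuming the cache compares the request as sent; canonical_messages and the (chunk_id, text) shape are hypothetical.

```python
def canonical_messages(question, chunks):
    """Build a deterministic prompt from a question and (chunk_id, text) pairs."""
    clean_question = " ".join(question.split())            # collapse stray whitespace
    ordered = sorted(chunks, key=lambda chunk: chunk[0])   # stable order by chunk id
    context = "\n\n".join(" ".join(text.split()) for _, text in ordered)
    return [
        {"role": "system", "content": "Answer using only this context:\n" + context},
        {"role": "user", "content": clean_question},
    ]


first = canonical_messages(
    "reset my  password?", [(2, "Open Settings."), (1, "Sign in first.")]
)
again = canonical_messages(
    " reset my password? ", [(1, "Sign in first."), (2, "Open Settings.")]
)
```

Both calls produce the same message list. Sorting by chunk id gives up relevance ordering in the prompt, so treat it as a trade-off rather than a default.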

cache · knowledge-base RAG

Cache behaviour

Question: "reset my password?"
First ask: miss · generated
Re-ask: hit · 0 ms
Provider call: skipped
Cost on hit: $0.00

default-on · per-request override · logged

The code

Same client for embeddings and chat

A RAG pipeline is just two endpoint calls against one key. These snippets come straight from the SDK examples the playground and dashboard use — set NEMOROUTER_API_KEY and the chat call runs as-is; the embeddings call uses the same client and base URL.

Install: pip install openai

    # Cache: enabled (org default). Pass nemo_cache: false to skip.
    from openai import OpenAI
    import os

    client = OpenAI(
        api_key=os.environ["NEMOROUTER_API_KEY"],
        base_url="https://api.nemorouter.ai/v1",
    )

    response = client.chat.completions.create(
        model="gemini-2.5-flash",
        temperature=1,
        max_tokens=1024,
        top_p=1,
        messages=[
            {"role": "user", "content": "Hello! What models do you support?"},
        ],
        extra_body={
            # "nemo_cache": False,  # Uncomment to skip cache
        },
    )

    print(response.choices[0].message.content)

The same client object also calls client.embeddings.create() — one key covers the whole pipeline.
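
For the indexing side, a sketch of embedding a corpus in batches with that same client might look like this. The embed_corpus and batches helpers are hypothetical, and the gemini-embedding model id and the batch size of 64 are assumptions; the response shape (one item per input, each with index and embedding) is the standard OpenAI embeddings response.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_corpus(client, chunks, model="gemini-embedding", batch_size=64):
    """Return (chunk, vector) pairs; model id and batch size are assumptions."""
    index = []
    for batch in batches(chunks, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # One item per input; sort by the index field to match input order.
        ordered = sorted(response.data, key=lambda item: item.index)
        index.extend(zip(batch, (item.embedding for item in ordered)))
    return index
```

Calling embed_corpus(client, chunks) with the client defined above returns pairs ready to write to your vector store.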

FAQ

Common RAG questions

One key for the whole pipeline

Ship a RAG pipeline without juggling providers

Embeddings, chat, caching, and per-stage cost tracking — all behind one NemoRouter key. Every feature is unlocked on every plan.


Building autonomous workflows on top of retrieval? See the AI agents use case.