Use Case · RAG

Retrieval-augmented generation, one key for embed and chat.

A RAG pipeline calls two model families — embeddings to index and query, a chat model to synthesize the answer. Route both through one Nemo Router endpoint, cache the repeats, and track cost per stage.

rag-pipeline · request trace

One pipeline, two model families

Index embed: gemini-embedding
Query embed: gemini-embedding
Synthesis: gemini-2.5-flash
Cache: hit · 0 ms
Stages tagged: index · query
Provider config: none
embed + chat · cached · cost-tracked

Model families: Embed + chat
Both on one OpenAI-compatible endpoint

Repeat queries: Cached
Exact-match hits skip the provider call

Cost attribution: Per stage
Index vs. query, split by metadata tag

Catalog: 20+ models on Google Vertex AI

Why Nemo for RAG

The four things a RAG pipeline needs

Two model families, repetitive traffic, real cost pressure, and providers that occasionally fail. Nemo Router handles all four behind one key.

Embeddings and chat, one key

A RAG pipeline touches two model families. The same endpoint that answers chat.completions also serves the embeddings call that builds and queries your index — one key, one bill, zero provider config.

  • embeddings + chat.completions on the same base URL
  • Swap embedding or chat models without re-keying
  • Catalog tags surface long-context models for big retrievals
  • No separate embedding-provider account to manage

Caching for repetitive traffic

RAG traffic repeats — FAQs re-asked, identical context windows, popular queries. Caching is on by default, so an exact-match request is served without ever hitting the provider.

  • Response cache enabled by default per org
  • Exact-match repeats skip the provider call entirely
  • Override per request with nemo_cache: false for fresh output
  • Cache decision recorded in the request log

Cost tracking per pipeline stage

LiteLLM reports the real cost of every call. Tag your embedding calls and your synthesis calls separately and the dashboard attributes spend to each stage of the pipeline.

  • Real per-call cost from the response-cost header
  • Tag index vs. query traffic via request metadata
  • Per-org, per-team, and per-key budgets cap runaway spend
  • Spend analytics break down cost by model and tag
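Stage tagging and cost readout might look like the sketch below. Two names here are assumptions that this page does not confirm: the metadata.tags request field (the convention LiteLLM's proxy uses) and the x-litellm-response-cost header name. Check the Nemo Router docs for the exact field and header before relying on either. The tag values themselves are illustrative.

```python
COST_HEADER = "x-litellm-response-cost"  # assumed header name, not confirmed by this page

def stage_tag(stage):
    """Return an extra_body payload that tags a request with its pipeline stage."""
    if stage not in {"index", "query", "synthesis"}:
        raise ValueError(f"unknown stage: {stage!r}")
    # Assumed field layout: {"metadata": {"tags": [...]}}.
    return {"metadata": {"tags": [stage]}}

def cost_from_headers(headers):
    """Read the per-call cost in USD from response headers, or None if absent."""
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get(COST_HEADER)
    return float(raw) if raw is not None else None
```

With the OpenAI Python SDK, response headers are available through the raw-response wrapper, for example client.embeddings.with_raw_response.create(..., extra_body=stage_tag("index")), whose result exposes .headers and .parse().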

Failover keeps retrieval answering

When an embedding or chat provider degrades, the fallback chain retries the next link transparently. Your index build finishes and your query path keeps returning answers.

  • Ordered fallback chain per model group
  • Timeouts, 5xx, and circuit-breaks all trigger the next link
  • Retries honor cooldown and provider rate-limit hints
  • Every fallback logged for replay
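Failover happens inside the router, so client code needs none of this. Purely as an illustration of what an ordered fallback chain does, here is a simplified sketch with hypothetical provider callables; it leaves out cooldowns, rate-limit hints, and circuit breaking, and is not Nemo Router's implementation.

```python
def call_with_fallback(chain, request):
    """Try each (name, callable) link in order; return (name, result) from the first success."""
    errors = []
    for name, call in chain:
        try:
            return name, call(request)
        except (TimeoutError, ConnectionError) as exc:
            errors.append((name, exc))  # remember the failed link, then move on
    raise RuntimeError(f"all providers failed: {errors}")

# Hypothetical providers: the first always times out, the second answers.
def flaky(_request):
    raise TimeoutError("primary timed out")

def healthy(request):
    return f"answer to {request}"

chain = [("primary", flaky), ("secondary", healthy)]
```

Calling call_with_fallback(chain, "q") skips the failing primary link and returns the secondary provider's answer.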
How it works

A RAG request, end to end

Index once, then query: embed the question, retrieve context from your own vector store, and synthesize with a chat model. Nemo sits on the two LLM hops — embeddings and synthesis — and logs the cost of each.

RAG pipeline flow

  1. Index documents

    POST /v1/embeddings

    Chunk + embed your corpus once; store vectors in your DB.

  2. Query embedding

    POST /v1/embeddings

    Embed the user question with the same model.

  3. Retrieve context

    your vector store

    Nearest-neighbour search runs in your own database.

  4. Synthesize answer

    POST /v1/chat/completions

    Chat model answers from retrieved context — cached if repeated.

  5. Settled + logged

    cost per stage

    Embed cost, chat cost, cache hit — all in the request log.

Nearest-neighbour search stays in your database. Nemo Router handles the two LLM hops — embeddings and synthesis — with caching, failover, and per-stage cost tracking.
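Steps 3 and 4 run partly in your own code. The sketch below shows one way to do the retrieval and prompt assembly, assuming a tiny in-memory index of (chunk text, vector) pairs in place of a real vector store. The function names and the system-prompt wording are illustrative, not taken from Nemo Router.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the k chunk texts whose vectors are most similar to query_vec.

    `index` is a list of (chunk_text, vector) pairs built during indexing.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_messages(question, chunks):
    """Assemble the chat messages for the synthesis call from retrieved chunks."""
    context = "\n\n".join(chunks)
    return [
        {"role": "system", "content": "Answer using only the context below.\n\n" + context},
        {"role": "user", "content": question},
    ]
```

The returned messages list goes straight into the chat.completions call. A production pipeline would replace top_k with a query against its vector database.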

Caching

Repeated questions never hit the provider twice

Response caching

Exact-match repeats are served from cache

Knowledge-base RAG answers the same questions over and over. With caching on by default, an identical request — same model, same context, same prompt — returns from cache instead of paying for another generation. The cache decision lands in the request log so you can see the hit rate.

  • Caching enabled by default per org
  • Exact-match repeats skip the provider call and the cost
  • nemo_cache: false forces a fresh generation when freshness matters
  • Cache hit / miss recorded per request for observability
cache · knowledge-base RAG

Cache behaviour

Question: "reset my password?"
First ask: miss · generated
Re-ask: hit · 0 ms
Provider call: skipped
Cost on hit: $0.00
default-on · per-request override · logged
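To make "exact match" concrete, here is a conceptual model of the miss-then-hit behaviour shown above. It is a sketch under assumptions, not Nemo Router's implementation: the keying scheme (a hash over model, messages, and parameters) and the class name are illustrative.

```python
import hashlib
import json

def cache_key(model, messages, **params):
    """Hash the full request; any change to model, context, or prompt changes the key."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ExactMatchCache:
    """In-memory response cache keyed on the exact request."""

    def __init__(self):
        self._store = {}

    def get_or_generate(self, generate, model, messages, **params):
        """Return (answer, "hit" or "miss"); generate() is called only on a miss."""
        key = cache_key(model, messages, **params)
        if key in self._store:
            return self._store[key], "hit"
        answer = generate(model, messages, **params)
        self._store[key] = answer
        return answer, "miss"
```

Because the key covers the retrieved context as well as the question, a re-ask hits only when retrieval returns the same chunks in the same order.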
The code

Same client for embeddings and chat

A RAG pipeline is just two endpoint calls against one key. These snippets come straight from the SDK examples the playground and dashboard use — set NEMOROUTER_API_KEY and the chat call runs as-is; the embeddings call uses the same client and base URL.

Install: pip install openai

    # Cache: enabled (org default). Pass nemo_cache: false to skip.
    import os

    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["NEMOROUTER_API_KEY"],
        base_url="https://api.nemorouter.ai/v1",
    )

    response = client.chat.completions.create(
        model="gemini-2.5-flash",
        temperature=1,
        max_tokens=1024,
        top_p=1,
        messages=[
            {"role": "user", "content": "Hello! What models do you support?"},
        ],
        extra_body={
            # "nemo_cache": False,  # Uncomment to skip cache
        },
    )

    print(response.choices[0].message.content)

The same client object also calls client.embeddings.create() — one key covers the whole pipeline.
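A sketch of that embeddings call, written as a helper that takes the client built in the snippet above. The model id "gemini-embedding" is copied from the request trace on this page and may not match the exact catalog id; embed_texts is a hypothetical helper name.

```python
def embed_texts(client, texts, model="gemini-embedding"):
    """Embed a list of strings and return one vector per input, in input order."""
    response = client.embeddings.create(model=model, input=texts)
    # Sort by index so the vectors line up with `texts` whatever the response order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

Use the same helper for both stages: embed_texts(client, chunks) when indexing and embed_texts(client, [question])[0] at query time, so both sides share one embedding model.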

One key for the whole pipeline

Ship a RAG pipeline without juggling providers

Embeddings, chat, caching, and per-stage cost tracking, all behind one Nemo Router key. Every feature is unlocked on every plan.

Building autonomous workflows on top of retrieval? See the AI agents use case.