LlamaIndex's OpenAI LLM and OpenAIEmbedding classes work directly with NemoRouter. Set api_base to https://api.nemorouter.ai/v1, and guardrails, caching, and rate limits are applied automatically to every request.

Installation

pip install llama-index-llms-openai llama-index-embeddings-openai nemoroutersdk

Setup

import os
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

NEMO_KEY = os.environ["NEMOROUTER_API_KEY"]
NEMO_BASE = "https://api.nemorouter.ai/v1"

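Settings.llm = OpenAI(
    # LlamaIndex's OpenAI class looks up context-window sizes by model name.
    # If it rejects a non-OpenAI name like this one, OpenAILike (from the
    # llama-index-llms-openai-like package) takes the same api_key and api_base.
    model="claude-sonnet-4-20250514",
    api_key=NEMO_KEY,
    api_base=NEMO_BASE,
)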
Settings.embed_model = OpenAIEmbedding(
    model_name="text-embedding-3-small",
    api_key=NEMO_KEY,
    api_base=NEMO_BASE,
)
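
The setup above raises a bare KeyError when NEMOROUTER_API_KEY is unset. As a minimal sketch, a small loader can fail with a clearer message and return the two arguments both constructors share. Note that load_nemo_config is a hypothetical helper name, and the NEMOROUTER_API_BASE override is an assumption made for illustration, not a documented NemoRouter variable.

```python
import os

DEFAULT_NEMO_BASE = "https://api.nemorouter.ai/v1"


def load_nemo_config(env=None):
    """Return the api_key/api_base kwargs shared by the LLM and the embedding model."""
    env = os.environ if env is None else env
    api_key = env.get("NEMOROUTER_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "NEMOROUTER_API_KEY is not set; export it before creating the LLM."
        )
    # NEMOROUTER_API_BASE is a hypothetical override, not a documented variable.
    api_base = env.get("NEMOROUTER_API_BASE", DEFAULT_NEMO_BASE).rstrip("/")
    return {"api_key": api_key, "api_base": api_base}
```

The result can be unpacked into either constructor, for example OpenAI(model="gpt-4o", **load_nemo_config()).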

Chat

from llama_index.core.llms import ChatMessage

response = Settings.llm.chat([
    ChatMessage(role="user", content="What is quantum computing?"),
])
print(response.message.content)

Per-Request Overrides

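Pass nemo_* fields via additional_kwargs. The OpenAI Python client rejects unknown top-level keyword arguments, so nest the fields under extra_body, which the client merges into the JSON request body. Values set on the constructor are sent with every request made through that LLM instance:

llm_with_prompt = OpenAI(
    model="gpt-4o",
    api_key=NEMO_KEY,
    api_base=NEMO_BASE,
    additional_kwargs={
        # extra_body is merged into the request payload by the OpenAI client.
        "extra_body": {
            "nemo_prompt_template_id": "your-summarizer-id",
            "nemo_prompt_variables": {"language": "Spanish", "max_length": "100"},
            "nemo_guardrail_ids": ["guardrail-uuid-1"],
            "nemo_cache": False,
        },
    },
)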
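
When different parts of an application need different overrides, building the nemo_* dict by hand gets repetitive. The sketch below collects only the fields that were explicitly set. The field names come from the example above, while nemo_overrides itself is a hypothetical helper, not part of LlamaIndex or any NemoRouter SDK.

```python
def nemo_overrides(template_id=None, variables=None, guardrail_ids=None, cache=None):
    """Collect the nemo_* fields that were explicitly set, skipping the rest."""
    candidates = {
        "nemo_prompt_template_id": template_id,
        "nemo_prompt_variables": variables,
        "nemo_guardrail_ids": guardrail_ids,
        "nemo_cache": cache,
    }
    # Compare against None (not truthiness) so that cache=False is kept.
    return {name: value for name, value in candidates.items() if value is not None}


overrides = nemo_overrides(guardrail_ids=["guardrail-uuid-1"], cache=False)
```

The returned dict can stand in for the literal dict of nemo_* fields in the constructor call above.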

RAG with Query Engine

from llama_index.core import VectorStoreIndex, Document

documents = [
    Document(text="NemoRouter is an LLM gateway with 78+ models."),
    Document(text="Guardrails protect every request: PII, injection, keywords."),
]
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("How does NemoRouter handle safety?")
print(response)

The user query is checked by guardrails before it reaches the model. Prompt-injection attempts on RAG queries are blocked before retrieval starts.
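
This page does not say how a blocked request is reported to the client. Assuming the gateway rejects it with an error that surfaces as an exception from query_engine.query, a wrapper like the following sketch can return a fallback message instead of crashing. Both query_or_fallback and the logger name are hypothetical, and the broad except clause is a placeholder: in real code it is better to catch the specific error type your client raises (for the OpenAI Python client, likely a subclass of openai.APIStatusError).

```python
import logging

logger = logging.getLogger("nemo_rag")


def query_or_fallback(query_engine, question,
                      fallback="Sorry, this request could not be processed."):
    """Run a RAG query and return fallback text if the request is rejected."""
    try:
        return str(query_engine.query(question))
    except Exception as exc:  # assumption: a blocked request raises an exception
        logger.warning("Query rejected or failed: %s", exc)
        return fallback
```

Any object with a query method works here, so the wrapper can be applied to the query_engine built above without further changes.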

Next Steps

LangChain — Same pattern for chain-style RAG
Python SDK — Without LlamaIndex
