LLM Routing for AI Agent Pipelines — A Production Guide
How to attribute LLM costs across multi-agent systems: virtual keys per agent role, per-run cost accumulation, budget enforcement, and the observability patterns that prevent surprise bills.
The hardest cost problem in AI is not knowing what a single API call costs — providers publish that. The hard problem is knowing what a business operation costs when it involves dozens of LLM calls across multiple agents, models, and providers.
"Our AI feature cost $0.03 last month per user" is useful. "We're not sure — somewhere between $0.001 and $0.50 depending on what the agent does" is a billing time bomb.
This guide covers the infrastructure and code patterns for tracking, attributing, and bounding LLM costs in multi-agent systems.
Single-turn applications have predictable cost: one request, one response, one line item. Agent costs are non-deterministic by design: the number of steps, retries, tool calls, and model choices varies from run to run.
Without explicit cost attribution, you discover the problem when the bill arrives.
Effective attribution works at three granularities:
| Layer | Question | Mechanism |
|---|---|---|
| Role | Which agent type is expensive? | Virtual key per agent role |
| Run | What did this specific job cost? | `user` field tagged with run ID |
| Step | Which pipeline step drives cost? | Per-call cost header accumulation |
You want all three. Role-level tells you where to optimize. Run-level tells you when a specific job went wrong. Step-level tells you exactly which operation to fix.
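As a sketch of how the three layers relate, here is a toy aggregation over hypothetical call records (the record shape and the dollar amounts are illustrative, not NemoRouter's log format):

```python
from collections import defaultdict

# Hypothetical per-call records, as a gateway might log them
calls = [
    {"role": "researcher", "run_id": "a3f2", "step": "search", "cost_usd": 0.004},
    {"role": "researcher", "run_id": "a3f2", "step": "summarize", "cost_usd": 0.002},
    {"role": "writer", "run_id": "a3f2", "step": "draft", "cost_usd": 0.010},
    {"role": "researcher", "run_id": "b9c1", "step": "search", "cost_usd": 0.003},
]

def aggregate(calls: list[dict], key: str) -> dict[str, float]:
    """Sum cost at one attribution granularity (role, run_id, or step)."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call[key]] += call["cost_usd"]
    return dict(totals)

by_role = aggregate(calls, "role")    # which agent type is expensive?
by_run = aggregate(calls, "run_id")   # what did this specific job cost?
by_step = aggregate(calls, "step")    # which pipeline step drives cost?
```

Same records, three views; the rest of this guide is about getting each view without writing the aggregation yourself.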
Create a separate NemoRouter API key for each logical agent role in your system. Each key has its own spend dashboard, budget limit, and rate limit.
Dashboard view after setup:

```text
sk-nemo-orchestrator   $12.40 / 30 days
sk-nemo-researcher     $89.20 / 30 days   ← this is the expensive one
sk-nemo-writer          $8.60 / 30 days
sk-nemo-critic          $3.10 / 30 days
sk-nemo-embeddings      $1.80 / 30 days
```

In code, route each agent type to its key:
```python
import os

from openai import AsyncOpenAI

# Keys from environment — never hardcode
ROLE_CLIENTS = {
    "orchestrator": AsyncOpenAI(
        api_key=os.environ["NEMO_KEY_ORCHESTRATOR"],
        base_url="https://api.nemorouter.ai/v1",
    ),
    "researcher": AsyncOpenAI(
        api_key=os.environ["NEMO_KEY_RESEARCHER"],
        base_url="https://api.nemorouter.ai/v1",
    ),
    "writer": AsyncOpenAI(
        api_key=os.environ["NEMO_KEY_WRITER"],
        base_url="https://api.nemorouter.ai/v1",
    ),
    "critic": AsyncOpenAI(
        api_key=os.environ["NEMO_KEY_CRITIC"],
        base_url="https://api.nemorouter.ai/v1",
    ),
}


def get_client(role: str) -> AsyncOpenAI:
    if role not in ROLE_CLIENTS:
        raise ValueError(f"Unknown agent role: {role}. Configure a key first.")
    return ROLE_CLIENTS[role]
```

This gives you immediate spend visibility per role without changing how agents call LLMs. The gateway tracks it automatically.
Each unique agent invocation should carry a run ID. Attach it to every LLM call via the `user` parameter:
```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentContext:
    """Carries run-level metadata through the entire pipeline."""

    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    job_id: Optional[str] = None   # External job/task ID from your system
    user_id: Optional[str] = None  # End user (if applicable)
    total_cost_usd: float = 0.0

    @property
    def user_tag(self) -> str:
        """Formatted user field passed to every LLM call."""
        parts = [f"run:{self.run_id[:8]}"]
        if self.job_id:
            parts.append(f"job:{self.job_id}")
        if self.user_id:
            parts.append(f"user:{self.user_id}")
        return "|".join(parts)


async def llm_call(
    ctx: AgentContext,
    role: str,
    model: str,
    messages: list,
    **kwargs,
) -> tuple[str, float]:
    """
    Make an LLM call with full cost attribution.
    Returns (content, cost_usd).
    """
    client = get_client(role)
    # with_raw_response exposes the HTTP response headers that the
    # parsed completion object hides
    raw = await client.chat.completions.with_raw_response.create(
        model=model,
        messages=messages,
        user=ctx.user_tag,
        **kwargs,
    )
    response = raw.parse()
    cost = _extract_cost(raw)
    ctx.total_cost_usd += cost
    return response.choices[0].message.content, cost


def _extract_cost(raw_response) -> float:
    """Read the actual cost from NemoRouter response headers."""
    try:
        return float(raw_response.headers.get("x-litellm-response-cost", 0))
    except (AttributeError, TypeError, ValueError):
        return 0.0
```

The cleaner way to access response headers is via the httpx response object:
```python
import httpx
from openai import AsyncOpenAI


class CostCapturingTransport(httpx.AsyncHTTPTransport):
    """Intercepts responses to capture cost headers."""

    def __init__(self, cost_callback, **kwargs):
        super().__init__(**kwargs)
        self.cost_callback = cost_callback

    async def handle_async_request(self, request: httpx.Request) -> httpx.Response:
        response = await super().handle_async_request(request)
        cost_str = response.headers.get("x-litellm-response-cost", "0")
        try:
            self.cost_callback(float(cost_str))
        except ValueError:
            pass
        return response


class TrackedAgentClient:
    """OpenAI-compatible client that accumulates LLM costs."""

    def __init__(self, api_key: str):
        self._total_cost = 0.0
        transport = CostCapturingTransport(
            cost_callback=self._record_cost,
        )
        self.client = AsyncOpenAI(
            api_key=api_key,
            base_url="https://api.nemorouter.ai/v1",
            http_client=httpx.AsyncClient(transport=transport),
        )

    def _record_cost(self, cost: float) -> None:
        self._total_cost += cost

    @property
    def total_cost_usd(self) -> float:
        return round(self._total_cost, 8)

    def reset_cost(self) -> float:
        """Returns total and resets the counter."""
        total = self._total_cost
        self._total_cost = 0.0
        return total
```

For pipeline debugging, track cost at each step:
```python
from typing import TypedDict


class StepCost(TypedDict):
    step: str
    model: str
    role: str
    cost_usd: float
    tokens_in: int
    tokens_out: int


class PipelineCostLedger:
    """Accumulates step costs for a single agent run."""

    def __init__(self, run_id: str):
        self.run_id = run_id
        self._steps: list[StepCost] = []

    def record(
        self,
        step: str,
        role: str,
        raw_response,  # a raw response from `with_raw_response.create`
    ) -> None:
        try:
            cost = float(raw_response.headers.get("x-litellm-response-cost", 0))
        except (TypeError, ValueError):
            cost = 0.0
        completion = raw_response.parse()
        usage = completion.usage
        self._steps.append({
            "step": step,
            "model": completion.model,
            "role": role,
            "cost_usd": cost,
            "tokens_in": usage.prompt_tokens if usage else 0,
            "tokens_out": usage.completion_tokens if usage else 0,
        })

    @property
    def total_cost(self) -> float:
        return round(sum(s["cost_usd"] for s in self._steps), 8)

    def most_expensive_step(self) -> StepCost | None:
        if not self._steps:
            return None
        return max(self._steps, key=lambda s: s["cost_usd"])

    def to_dict(self) -> dict:
        return {
            "run_id": self.run_id,
            "total_usd": self.total_cost,
            "steps": self._steps,
        }
```

A ledger dump for one run looks like:

```json
{
  "run_id": "a3f2c1b4",
  "total_usd": 0.004712,
  "steps": [
    {"step": "plan", "model": "o3-mini", "role": "orchestrator",
     "cost_usd": 0.001200, "tokens_in": 450, "tokens_out": 380},
    {"step": "research_query_1", "model": "gpt-4o-mini", "role": "researcher",
     "cost_usd": 0.000180, "tokens_in": 320, "tokens_out": 150},
    {"step": "research_query_2", "model": "gpt-4o-mini", "role": "researcher",
     "cost_usd": 0.000240, "tokens_in": 420, "tokens_out": 200},
    {"step": "synthesis", "model": "claude-3-5-sonnet-20241022", "role": "writer",
     "cost_usd": 0.002800, "tokens_in": 2100, "tokens_out": 620},
    {"step": "critique", "model": "gpt-4o-mini", "role": "critic",
     "cost_usd": 0.000292, "tokens_in": 680, "tokens_out": 140}
  ]
}
```

The synthesis step costs 59% of the total run. That tells you where to experiment with cheaper models.
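That 59% figure is easy to compute from the ledger's step list. A small helper (hypothetical, operating on the `StepCost` dicts shown above) that ranks steps by cost share:

```python
def cost_shares(steps: list[dict]) -> list[tuple[str, float]]:
    """Return (step, share-of-total-cost) pairs, most expensive first."""
    total = sum(s["cost_usd"] for s in steps)
    if total == 0:
        return [(s["step"], 0.0) for s in steps]
    ranked = sorted(steps, key=lambda s: s["cost_usd"], reverse=True)
    return [(s["step"], s["cost_usd"] / total) for s in ranked]

# The run from the ledger dump above, costs only
steps = [
    {"step": "plan", "cost_usd": 0.001200},
    {"step": "research_query_1", "cost_usd": 0.000180},
    {"step": "research_query_2", "cost_usd": 0.000240},
    {"step": "synthesis", "cost_usd": 0.002800},
    {"step": "critique", "cost_usd": 0.000292},
]
top_step, top_share = cost_shares(steps)[0]
# → ("synthesis", ~0.59): try a cheaper model on that step first
```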
Budgets belong in the gateway, not application code. Application code has bugs; a limit enforced at the gateway still holds when it does.
Set a `max_budget` on each agent key via the NemoRouter dashboard or API:
| Role | Budget | Reset |
|---|---|---|
| orchestrator | $50/month | monthly |
| researcher | $200/month | monthly |
| writer | $50/month | monthly |
| critic | $20/month | monthly |

When a key hits its budget, further calls return a `402 Payment Required` error. Handle it in agent code:
```python
from openai import OpenAIError


class AgentBudgetExhausted(RuntimeError):
    """Raised when an agent role hits its configured budget limit."""


async def safe_llm_call(ctx: AgentContext, role: str, model: str, messages: list):
    try:
        content, cost = await llm_call(ctx, role, model, messages)
        return content
    except OpenAIError as e:
        status = getattr(e, "status_code", None)
        if status == 402:
            raise AgentBudgetExhausted(
                f"Agent role '{role}' has exhausted its budget. "
                f"Current run cost: ${ctx.total_cost_usd:.4f}"
            )
        raise
```

For long-running autonomous agents, add a cost guardrail at the run level:
```python
@dataclass
class BudgetedAgentContext(AgentContext):
    max_run_cost_usd: float = 0.50  # Default $0.50 per run

    def check_budget(self) -> None:
        if self.total_cost_usd >= self.max_run_cost_usd:
            raise AgentBudgetExhausted(
                f"Run budget of ${self.max_run_cost_usd:.2f} exceeded. "
                f"Spent: ${self.total_cost_usd:.4f}"
            )


# Check before each expensive step
async def guarded_llm_call(ctx: BudgetedAgentContext, role: str, model: str, messages: list):
    ctx.check_budget()
    return await safe_llm_call(ctx, role, model, messages)
```

When agents run in parallel, concurrent access to shared cost state requires thread/async safety:
```python
import asyncio
from decimal import Decimal


class ConcurrentCostTracker:
    """Thread-safe cost accumulator for parallel agent runs."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._cost = Decimal("0")
        self._call_count = 0

    async def record(self, cost_usd: float) -> None:
        async with self._lock:
            self._cost += Decimal(str(cost_usd))
            self._call_count += 1

    @property
    def total_usd(self) -> float:
        return float(self._cost)

    @property
    def call_count(self) -> int:
        return self._call_count


# Running parallel researcher agents with shared cost tracking
async def run_parallel_researchers(queries: list[str], ctx: AgentContext) -> list[str]:
    tracker = ConcurrentCostTracker()

    async def research_one(query: str) -> str:
        content, cost = await llm_call(ctx, "researcher", "gpt-4o-mini", [
            {"role": "user", "content": query}
        ])
        await tracker.record(cost)
        return content

    results = await asyncio.gather(*[research_one(q) for q in queries])
    print(f"Parallel research: {len(queries)} queries, "
          f"{tracker.call_count} calls, "
          f"${tracker.total_usd:.4f} total")
    return list(results)
```

After running these patterns in production, here are representative cost ranges for common agent architectures (April 2026 pricing):
| Agent Type | Calls per Run | Typical Cost | Expensive Outlier |
|---|---|---|---|
| Simple Q&A with retrieval | 2-3 | $0.001-0.003 | $0.02 |
| ReAct 3-5 step pipeline | 5-8 | $0.005-0.020 | $0.15 |
| Multi-agent research + synthesis | 10-20 | $0.020-0.080 | $0.50 |
| Recursive document analyzer | variable | $0.010-0.200 | $2.00+ |
The outliers are why budget guardrails matter. An edge case that triggers 10x the normal calls turns a $0.020 operation into $0.200 or worse.
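The arithmetic is worth running against your own traffic. A back-of-envelope projection (the run volume, costs, and outlier rate below are placeholder assumptions, not benchmarks):

```python
def monthly_projection(
    runs_per_day: int,
    typical_cost: float,
    outlier_cost: float,
    outlier_rate: float,
) -> float:
    """Expected monthly spend when some fraction of runs hit the outlier path."""
    per_run = (1 - outlier_rate) * typical_cost + outlier_rate * outlier_cost
    return runs_per_day * per_run * 30

# Multi-agent research pipeline: $0.05 typical run, $0.50 outlier run
normal = monthly_projection(1000, 0.05, 0.50, 0.0)         # → $1500/month
with_outliers = monthly_projection(1000, 0.05, 0.50, 0.05)  # 5% runaway runs → $2175/month
```

Five percent of runs going rogue adds 45% to the bill, which is why the run-level guardrail caps `outlier_cost` rather than hoping the rate stays low.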
Three metrics tell you if agent costs are under control, one per attribution layer:

- Spend per agent role, trended over time (the virtual-key dashboards)
- Cost per run: median and p99, since outliers rather than averages break budgets
- Cost share per pipeline step, so you know which call to optimize first

All three are visible in the NemoRouter observability dashboard without additional instrumentation — the key-per-role setup does the work for you.
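If you also want run-level alerting outside the dashboard, percentiles over per-run totals are enough. A minimal sketch with made-up run costs, using the stdlib `statistics` module:

```python
import statistics

def run_cost_percentiles(run_costs: list[float]) -> dict[str, float]:
    """p50 and p99 of per-run cost; a wide gap flags an outlier-heavy agent."""
    cuts = statistics.quantiles(run_costs, n=100)  # 99 cut points
    return {"p50": statistics.median(run_costs), "p99": cuts[98]}

# 99 cheap runs plus one runaway run
costs = [0.02] * 99 + [0.50]
pcts = run_cost_percentiles(costs)
# p50 stays near $0.02 while p99 exposes the runaway
```

Alerting on p99 rather than the mean catches the "one run did 10x the calls" failure mode before it dominates the bill.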
The full stack for multi-agent cost attribution:

- Virtual key per agent role → role-level spend dashboards and budget limits
- `user` field → per-run cost reconstruction in logs
- `x-litellm-response-cost` header accumulation → step-level cost breakdown

This gives you cost observability without building custom accounting infrastructure.