# A Guide to LLM Cost Tracking Across Every Major Provider
From per-token pricing to provisioned throughput — how NemoRouter tracks costs across every major LLM provider with a credit-based system.
## The Cost Complexity Problem
LLM pricing is not simple. OpenAI charges per token with different rates for input and output. Anthropic has tiered pricing based on model size. Google Vertex AI offers both per-token and provisioned throughput pricing. AWS Bedrock has on-demand, provisioned, and batch pricing modes. Some providers charge per request, others per minute of audio, others per image generated.
When your application routes to models across a dozen providers, building a unified cost tracking system from scratch is a significant engineering effort. Pricing changes frequently, new models launch weekly, and edge cases abound (cached tokens, batched requests, fine-tuned model multipliers).
NemoRouter solves this by delegating cost calculation to LiteLLM and building a credit system on top.
## Let LiteLLM Handle Cost Calculation
LiteLLM maintains a comprehensive pricing database that covers every supported model. After each request, it computes the exact cost based on the model, token counts, and current pricing, then returns it in the `x-litellm-response-cost` response header.
This is a deliberate architectural decision: NemoRouter never computes cost independently. We read the cost from LiteLLM's header and use that as the authoritative value.
```python
# Nemo Backend reads cost from LiteLLM response
litellm_response = await forward_to_litellm(request)
cost = float(litellm_response.headers.get("x-litellm-response-cost", "0"))

# Settle the credit reservation with the actual cost
await settle_credits(org_id, reservation_id, cost)
```

Why not compute cost ourselves? Because maintaining a parallel pricing database creates reconciliation nightmares. When LiteLLM updates pricing for a new model, we would need to update ours simultaneously. Any drift between the two means disagreements between what LiteLLM reports as spend and what we charge — a billing dispute waiting to happen.
## The Credit System
Users purchase credits in dollar amounts. Credits map 1:1 to USD for simplicity. When you buy $100 in credits, you get $100 of LLM usage. The platform fee is charged on top at purchase time, not deducted from the credit amount.
### Tier-Based Platform Fees
NemoRouter offers three pricing tiers, each with a different platform fee charged at credit purchase:
| Tier | Platform Fee | Minimum |
|---|---|---|
| Tier 1 (Pay As You Go) | 4% | $0/mo |
| Tier 2 | 2% | $100/mo |
| Tier 3 | 0% | $1,200/yr |
When a Tier 1 user buys $100 in credits, they pay $104 and receive $100 in their balance. A Tier 3 user pays exactly $100 for the same $100 in credits. This keeps per-request cost deduction simple — every request deducts the exact cost reported by LiteLLM, with no additional fee calculation at request time.
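The purchase-time arithmetic is simple enough to sketch. The fee schedule below mirrors the table above; the helper itself is illustrative, not NemoRouter's actual billing code:

```python
# Platform fee by tier, in basis points, charged on top of the credit amount
PLATFORM_FEE_BPS = {1: 400, 2: 200, 3: 0}

def purchase_total_cents(tier: int, credits_cents: int) -> int:
    """Total charged at purchase: the credit amount plus the tier's platform fee."""
    fee_cents = credits_cents * PLATFORM_FEE_BPS[tier] // 10_000
    return credits_cents + fee_cents

assert purchase_total_cents(1, 10_000) == 10_400  # Tier 1: pay $104 for $100 in credits
assert purchase_total_cents(3, 10_000) == 10_000  # Tier 3: pay exactly $100
```

Integer cents keep the math exact; floating-point dollars would accumulate rounding errors across many small purchases.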
Platform fee tiers are set automatically by Stripe webhooks when a subscription is created or changed. There is no manual configuration.
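A sketch of that webhook flow, assuming the org id is carried in the subscription's metadata and that `set_org_tier` is a hypothetical helper (the Stripe event types are real; the price IDs are placeholders):

```python
# Illustrative price-ID-to-tier mapping; the real IDs live in Stripe
PRICE_TO_TIER = {"price_tier2_monthly": 2, "price_tier3_yearly": 3}

async def handle_stripe_event(event: dict) -> None:
    """Update an org's platform-fee tier when its subscription changes."""
    if event["type"] in ("customer.subscription.created",
                         "customer.subscription.updated"):
        sub = event["data"]["object"]
        price_id = sub["items"]["data"][0]["price"]["id"]
        tier = PRICE_TO_TIER.get(price_id, 1)  # unknown price -> Tier 1 default
        # Assumes the org id is stored in the subscription's metadata
        await set_org_tier(sub["metadata"]["org_id"], tier)  # hypothetical helper
```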
### Reserve and Settle
The critical challenge in a credit system is preventing overspend when multiple requests execute concurrently. If a user has $1.00 remaining and sends ten requests simultaneously, naive balance checking would approve all ten before any deduction occurs.
NemoRouter uses a reserve-and-settle pattern:
- Reserve — Before forwarding to LiteLLM, estimate the maximum cost and reserve that amount from the credit balance using an atomic database operation with `FOR UPDATE` locks (sketched below).
- Forward — Send the request to LiteLLM. The reservation ensures the balance cannot be spent by concurrent requests.
- Settle — When LiteLLM returns the response with the actual cost in `x-litellm-response-cost`, settle the reservation: deduct the actual cost and release the unused portion.
- Release on failure — If the LiteLLM request fails, release the full reservation. Every error path must call `release_reservation` — a leaked reservation is frozen credits.
```python
# Reserve credits before forwarding
reservation = await reserve_credits(org_id, estimated_cost)
try:
    response = await forward_to_litellm(request)
    actual_cost = float(response.headers["x-litellm-response-cost"])
    await settle_credits(org_id, reservation.id, actual_cost)
except Exception:
    # Every failure path releases the reservation
    await release_reservation(org_id, reservation.id)
    raise
```

This pattern guarantees that credit balances never go negative, even under high concurrency.
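To make the Reserve step concrete, here is a minimal sketch using asyncpg; the `credit_balances` and `reservations` tables, their columns, and the error type are illustrative assumptions, not NemoRouter's actual schema:

```python
import uuid
import asyncpg

class InsufficientCreditsError(Exception):
    pass

async def reserve_credits(pool: asyncpg.Pool, org_id: str,
                          estimated_cost: float) -> str:
    """Atomically hold estimated_cost against the org's balance;
    return a reservation id."""
    async with pool.acquire() as conn:
        async with conn.transaction():
            # FOR UPDATE serializes concurrent reservations for the same org;
            # balance is assumed DOUBLE PRECISION here (real billing code
            # would use NUMERIC or integer cents)
            row = await conn.fetchrow(
                "SELECT balance FROM credit_balances WHERE org_id = $1 FOR UPDATE",
                org_id,
            )
            if row is None or row["balance"] < estimated_cost:
                raise InsufficientCreditsError(org_id)
            reservation_id = str(uuid.uuid4())
            await conn.execute(
                "UPDATE credit_balances SET balance = balance - $1 WHERE org_id = $2",
                estimated_cost, org_id,
            )
            await conn.execute(
                "INSERT INTO reservations (id, org_id, amount) VALUES ($1, $2, $3)",
                reservation_id, org_id, estimated_cost,
            )
            return reservation_id
```

Settlement would be the mirror image: under the same row lock, credit back the difference between the held amount and the actual cost, then delete the reservation row.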
## Cost Attribution and Observability
Every LLM response from NemoRouter includes headers for cost attribution:
- `x-nemo-org-id` — The organization that owns the request
- `x-nemo-key-alias` — The API key alias used (e.g., `sk-...last4`)
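A client can read these directly from any response; for instance, with httpx (the gateway URL, key, and model are placeholders):

```python
import httpx

resp = httpx.post(
    "https://nemorouter.example.com/v1/chat/completions",  # placeholder URL
    headers={"Authorization": "Bearer sk-..."},
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]},
)
print(resp.headers.get("x-nemo-org-id"))     # organization that owns the request
print(resp.headers.get("x-nemo-key-alias"))  # alias of the API key used
```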
On the provider side, we set `body.user` to `nemo:{org_id}` so that provider dashboards (OpenAI Usage, Google Cloud Console) show spend broken down by NemoRouter organization. This enables reconciliation between what NemoRouter reports and what providers charge.
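The tagging itself is a one-liner on the outgoing body; a sketch (the helper name is illustrative):

```python
def tag_for_attribution(body: dict, org_id: str) -> dict:
    """Set the OpenAI-compatible `user` field so provider dashboards
    break down spend by NemoRouter organization."""
    tagged = dict(body)  # copy so the caller's payload is untouched
    tagged["user"] = f"nemo:{org_id}"
    return tagged

body = tag_for_attribution({"model": "gpt-4o", "messages": []}, "org_123")
assert body["user"] == "nemo:org_123"
```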
The dashboard provides multiple views into cost data: per-model breakdowns, per-key spend, daily/weekly/monthly trends, and exportable CSV reports. All cost data flows from LiteLLM's spend tracking — we query it and display it, but never recompute it.
## Handling Edge Cases
Several scenarios require special attention:
Streaming responses — Cost is only known after the full response completes. The reservation holds an estimated amount until the stream finishes and LiteLLM reports the final cost.
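One way to realize that, as a sketch: wrap the upstream stream so settlement runs only after the final chunk, reusing the reserve/settle helpers from above (`final_cost_from` is a hypothetical accessor for LiteLLM's post-stream cost):

```python
async def stream_with_settlement(org_id: str, reservation_id: str, upstream):
    """Yield chunks to the client; settle only after the stream completes."""
    try:
        async for chunk in upstream:
            yield chunk
        # The stream is fully consumed, so the final cost is now known
        actual_cost = await final_cost_from(upstream)  # hypothetical accessor
        await settle_credits(org_id, reservation_id, actual_cost)
    except Exception:
        # Client disconnects and upstream errors free the held credits
        await release_reservation(org_id, reservation_id)
        raise
```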
Cached responses — LiteLLM can cache responses to reduce cost. Cached hits have zero or reduced cost in the `x-litellm-response-cost` header, and the settlement reflects that.
Failed requests — Provider errors (rate limits, content filters, timeouts) must release the reservation immediately. We treat any non-2xx from LiteLLM as a trigger to release.
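Exceptions are caught by the `try/except` shown earlier, but a non-2xx reply is a normal return value in most HTTP clients, so the forwarding path needs an explicit check; a sketch (`error_response` is a hypothetical translation helper):

```python
response = await forward_to_litellm(request)
if response.status_code >= 300:
    # Rate limit, content filter, or timeout surfaced by LiteLLM: no cost incurred
    await release_reservation(org_id, reservation.id)
    return error_response(response)  # hypothetical: translate the upstream error
await settle_credits(org_id, reservation.id,
                     float(response.headers["x-litellm-response-cost"]))
```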
Provisioned throughput — Models running on Azure PTU or GCP GSU have different cost characteristics. LiteLLM handles this in its pricing calculation; NemoRouter treats the reported cost identically regardless of the underlying pricing model.
The principle is consistent: LiteLLM computes cost, NemoRouter manages credits. This separation of concerns keeps both systems reliable and reconcilable.