Reading x-litellm-response-cost While Streaming
When a response streams token by token, the final cost arrives last. Here is how NemoRouter captures the authoritative per-call cost from the response header and settles credits exactly — without computing cost itself.
There is one rule about cost that we never break: the gateway reads the provider's authoritative cost; it never computes cost itself. Token-count math drifts the moment a provider changes pricing, adds a cached-input tier, or bills reasoning tokens differently. The only number we trust is the one the inference layer reports for the call it actually served: x-litellm-response-cost. This post explains how we capture it, including the awkward case of streaming, where the cost arrives after the content.
Why not compute cost from tokens?
It's tempting: you know the model, you know the token counts, multiply by a price table, done. It's also wrong often enough to matter:
- Prices change and a hardcoded table goes stale silently — under-charging is a revenue leak, over-charging is a customer-trust leak.
- Cached input is billed at a fraction of normal input on several providers; a naive multiply misses the discount.
- Reasoning / thinking tokens are priced separately on some models.
- Multimodal inputs (image, audio) don't map cleanly to a single per-token rate.
LiteLLM already encodes all of this and emits the settled cost as a response header. Reading that header is both simpler and correct. Computing it ourselves is more code and more wrong.
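To make the cached-input point concrete, here is a small worked example. Every price in it is made up for illustration (including the assumption that cached input costs a tenth of normal input), so treat it as a sketch of the failure mode rather than any provider's real rate card.

```python
from decimal import Decimal

# Hypothetical USD-per-token prices, chosen only for illustration.
INPUT_PRICE = Decimal("0.000003")
CACHED_INPUT_PRICE = Decimal("0.0000003")  # assumed: 10% of normal input
OUTPUT_PRICE = Decimal("0.000015")


def naive_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Price-table multiply that treats every input token as uncached."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


def cache_aware_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> Decimal:
    """Same call, but cached input tokens are billed at the discounted rate."""
    uncached = input_tokens - cached_tokens
    return (
        uncached * INPUT_PRICE
        + cached_tokens * CACHED_INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    )


# A call with 10,000 input tokens (8,000 of them cached) and 500 output tokens.
naive = naive_cost(10_000, 500)
actual = cache_aware_cost(10_000, 8_000, 500)
overcharge = naive - actual
```

Under these assumed prices the naive multiply bills 0.0375 USD for a call that actually cost 0.0159 USD, more than double. The same gap opens in the other direction whenever the table is older than the provider's latest price increase.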
Where does the cost live?
For a normal (non-streaming) completion, the cost is a header on the response:
x-litellm-response-cost: 0.0024319
We read it at settle time, convert to credits, and true-up the reservation (see reserve-and-settle). Easy. Streaming is where it gets interesting.
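As a minimal sketch of that read-and-convert step, assuming the headers are available as a plain mapping: the `CREDITS_PER_USD` rate, the rounding rule, and both function names are hypothetical and only illustrate the shape of the logic, not NemoRouter's actual code.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP
from typing import Mapping, Optional

COST_HEADER = "x-litellm-response-cost"
CREDITS_PER_USD = Decimal("1000000")  # hypothetical conversion rate


def read_cost_usd(headers: Mapping[str, str]) -> Optional[Decimal]:
    """Return the reported cost in USD, or None if the header is absent or unusable."""
    for name, value in headers.items():
        if name.lower() == COST_HEADER:  # header names are case-insensitive
            try:
                cost = Decimal(value.strip())
            except InvalidOperation:
                return None
            # Reject NaN, infinities, and negative values.
            return cost if cost.is_finite() and cost >= 0 else None
    return None


def usd_to_credits(cost_usd: Decimal) -> int:
    """Convert USD to whole credits; round-half-up is an assumed policy."""
    return int((cost_usd * CREDITS_PER_USD).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


cost = read_cost_usd({"X-LiteLLM-Response-Cost": "0.0024319"})
credits = usd_to_credits(cost) if cost is not None else None
```

Parsing into `Decimal` rather than `float` keeps the provider's number exact all the way to the ledger. Returning `None` for a missing or malformed header lets the caller decide what to do instead of quietly billing zero.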
The streaming problem
In a streaming (SSE) response, the body is a sequence of chunks delivered as the model generates them. The total token usage — and therefore the cost — isn't known until the last chunk, because you can't price output you haven't generated yet.
data: {"choices":[{"delta":{"content":"Hel"}}]}
data: {"choices":[{"delta":{"content":"lo"}}]}
...
data: {"choices":[{"delta":{},"finish_reason":"stop"}], "usage":{...}} ← cost knowable HERE
data: [DONE]
So the gateway can't settle on the first byte. It has to: forward the stream to the client as it arrives (latency matters: you don't buffer a stream just to bill it), and also observe the terminal usage to settle credits once the stream completes.
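One way to do both at once is a generator that forwards every line untouched and peeks at the payload on the way through. This is a sketch under assumptions: the stream is already split into text lines, and `pass_through` and `on_usage` are hypothetical names, not part of LiteLLM or NemoRouter.

```python
import json
from typing import Callable, Iterable, Iterator


def pass_through(lines: Iterable[str], on_usage: Callable[[dict], None]) -> Iterator[str]:
    """Yield each SSE line unchanged; report any chunk that carries a usage object."""
    for line in lines:
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload and payload != "[DONE]":
                try:
                    chunk = json.loads(payload)
                except json.JSONDecodeError:
                    chunk = None  # not our job to validate; just forward it
                if isinstance(chunk, dict) and chunk.get("usage"):
                    # Observe before yielding, so the usage is captured even if
                    # the consumer stops reading right after this line.
                    on_usage(chunk["usage"])
        yield line  # forwarded immediately, never buffered
```

The observer holds no more than one line at a time, so the client sees each chunk as soon as the gateway does. Only the terminal chunk triggers the callback, because it is the only one with a populated usage object.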
How NemoRouter handles it
The pattern is "pass through, observe the tail":
1. reserve the estimated max cost up front (as always)
2. open the upstream stream, pipe each chunk straight to the client
3. as the final chunk passes, capture the settled cost
4. settle credits to the real cost; release the over-reservation
5. if the stream errors or the client disconnects mid-stream,
   release the full reservation; no partial silent charge
The client sees a normal, low-latency stream. Behind it, the gateway has held a reservation since before the first byte and settles it the instant the cost becomes known. The customer is billed exactly the provider's number, captured from the cost header / terminal usage, never an estimate and never our own arithmetic.
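The five steps can be sketched as a generator wrapped in try/finally. Everything here is illustrative: `Ledger` is a hypothetical in-memory stand-in for the credit store, and `get_cost` is an assumed callable that returns the settled cost once the stream has finished (or `None` if it never arrived).

```python
from decimal import Decimal
from typing import Callable, Iterable, Iterator, Optional


class Ledger:
    """Hypothetical in-memory credit ledger, for illustration only."""

    def __init__(self, balance: Decimal) -> None:
        self.balance = balance
        self.reserved = Decimal("0")

    def reserve(self, amount: Decimal) -> None:
        if amount > self.balance - self.reserved:
            raise ValueError("insufficient credits")
        self.reserved += amount

    def settle(self, reserved_amount: Decimal, actual: Decimal) -> None:
        self.reserved -= reserved_amount  # drop the hold...
        self.balance -= actual            # ...and charge the real cost

    def release(self, amount: Decimal) -> None:
        self.reserved -= amount


def stream_with_settlement(
    ledger: Ledger,
    estimate: Decimal,
    upstream: Iterable[str],
    get_cost: Callable[[], Optional[Decimal]],
) -> Iterator[str]:
    ledger.reserve(estimate)               # step 1: hold the estimated max
    settled = False
    try:
        for chunk in upstream:             # step 2: pipe chunks straight through
            yield chunk
        cost = get_cost()                  # step 3: cost known once the stream ends
        if cost is not None:
            ledger.settle(estimate, cost)  # step 4: charge real cost, free the rest
            settled = True
    finally:
        if not settled:                    # step 5: error, disconnect, or no cost
            ledger.release(estimate)
```

Closing the generator (which is what a client disconnect amounts to here) raises `GeneratorExit` at the `yield`, so the `finally` block runs and the hold is released rather than leaked. Releasing when `get_cost` returns `None` is a simplification of this sketch; a real gateway would more likely flag that case for reconciliation.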
Disconnects are a billing event, not a no-op
A client that hangs up mid-stream still consumed whatever the provider generated up to that point. The reservation must be settled to the real partial cost or released — never silently dropped. A leaked reservation on disconnect quietly shrinks the customer's headroom for a call they didn't finish.
Surfacing cost to the caller
Because the cost is authoritative, we can pass it back to clients on their own responses as an x-nemo-* header, so applications can do per-request accounting without a second API call. The numbers a customer sees in their usage analytics, in their budgets, and in any header we return all trace to the same source: the cost the inference layer reported for that exact call. One number, everywhere — no reconciliation drift.
The takeaway
Cost tracking is only trustworthy if there's a single source of truth, and for LLM calls that source is the provider's settled cost, not a token-count estimate. Streaming complicates when you learn the number, not which number you trust: pass the stream through for latency, observe the terminal usage, and settle exactly. The result is spend tracking that stays correct through every pricing change a provider ships — because we never hardcoded the prices in the first place.
More on the model behind this: why LiteLLM owns cost and the reserve-and-settle accounting that consumes it.