Cost vs Usage: Finding the Quietly Expensive Model

Request count and dollar cost tell different stories. Here is how cost-vs-usage analytics surface the low-volume model that dominates your bill — and the cheap one you can route more traffic to.

Nemo Team · 8 min read

A model that serves 5% of your requests can be 60% of your bill. If you only watch request volume, that model is invisible — small bar, no attention — while it quietly dominates spend. Cost-vs-usage analytics exist to surface exactly this divergence: the place where "how often" and "how much" disagree, which is almost always where the savings are.

Why do cost and usage diverge?

Because per-request cost varies by orders of magnitude across models and prompt shapes. Every request counts as one in a volume chart; in dollars, they are nothing alike.

            requests        cost
small model   78%           14%      ← high volume, low cost
flagship      12%           71%      ← low volume, dominates the bill
embeddings    10%            5%

Plotting usage and cost on the same timeline immediately shows the flagship line towering over its modest request share. That gap is the actionable signal: a small slice of traffic you might be able to route, cache, or down-size for a large slice of the bill.
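As a minimal sketch of how that gap could be computed, the snippet below turns a list of request records into request share and cost share per model. The record fields (`model`, `cost_usd`) and the sample prices are assumptions for illustration, not the schema of any particular gateway.

```python
from collections import defaultdict

def shares(records):
    """Return {model: (request_share, cost_share)} as fractions of the totals."""
    counts = defaultdict(int)
    costs = defaultdict(float)
    for rec in records:
        counts[rec["model"]] += 1
        costs[rec["model"]] += rec["cost_usd"]
    total_requests = sum(counts.values())
    total_cost = sum(costs.values())
    return {
        model: (
            counts[model] / total_requests,
            costs[model] / total_cost if total_cost else 0.0,
        )
        for model in counts
    }

# Hypothetical sample: per-request prices are made up for illustration.
records = (
    [{"model": "small", "cost_usd": 0.001}] * 78
    + [{"model": "flagship", "cost_usd": 0.03}] * 12
    + [{"model": "embeddings", "cost_usd": 0.0005}] * 10
)

for model, (req_share, cost_share) in shares(records).items():
    print(f"{model:<12}{req_share:>8.0%}{cost_share:>8.0%}")
```

Any model whose second number is far above its first is a candidate for the kind of attention its request bar would never earn.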

The four shapes worth recognizing

When you overlay cost and usage, the patterns repeat:

Pattern      Cost vs usage            What it means                          Move
The whale    Cost ≫ usage             A pricey model on a minority of calls  Can it be a cheaper model? Route by need.
The bargain  Usage ≫ cost             Cheap model doing real work            Safe to send more here
The leak     Cost rising, usage flat  Per-call cost crept up                 Pricing change? Prompt bloat? Lost cache?
The spike    Both jump together       Genuine traffic increase               Capacity/budget question, not efficiency

The first two are routing opportunities; the third is an investigation; the fourth is a planning input. Same chart, four different decisions.
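The four shapes can be roughed out as a small classifier. This is a sketch under stated assumptions: the thresholds (`ratio`, `flat`, `jump`) are arbitrary starting points rather than values from any product, and the function name and inputs are hypothetical.

```python
def classify(request_share, cost_share, usage_growth, cost_growth,
             ratio=2.0, flat=0.05, jump=0.25):
    """Name the shape for one model over one period.

    Shares are fractions of the total (0..1). Growth values are
    period-over-period relative changes (0.25 means +25%).
    All thresholds are assumed defaults; tune them to your traffic.
    """
    # Trend shapes first: they describe a change, not a steady state.
    if cost_growth >= jump and abs(usage_growth) <= flat:
        return "leak"    # paying more for the same volume
    if cost_growth >= jump and usage_growth >= jump:
        return "spike"   # cost and traffic rose together
    # Steady-state shapes: compare the two shares.
    if cost_share >= ratio * request_share:
        return "whale"
    if request_share >= ratio * cost_share:
        return "bargain"
    return "unremarkable"

print(classify(0.12, 0.71, 0.0, 0.0))    # whale
print(classify(0.78, 0.14, 0.0, 0.0))    # bargain
print(classify(0.30, 0.30, 0.01, 0.40))  # leak
print(classify(0.30, 0.30, 0.50, 0.50))  # spike
```

Checking the trend shapes before the share shapes is a design choice: a whale that is also leaking should be investigated as a leak first.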

"Cost rising, usage flat" is the one to chase

Of the four, the leak is the most valuable to catch early because it's pure waste — you're paying more for the same work. Common causes, in order of likelihood:

  1. Prompt bloat — context grew (more history, bigger RAG chunks) so input tokens — and cost — crept up without more requests.
  2. Lost caching — a cache that used to hit now misses (a templating change, a cache-busting field), so you re-pay for repeated input.
  3. A pricing change — the provider raised a rate; because the gateway reads the authoritative cost header, this shows up immediately and correctly rather than hiding behind a stale price table.
  4. A model swap — a default quietly moved to a pricier model.

Because the cost figure is the provider's settled value, the chart is trustworthy about whether cost rose; why it rose is what your investigation has to establish.
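One way to start that investigation is to compare two periods and see which of the four causes the numbers support. The sketch below is a heuristic, not a definitive diagnosis: `PeriodStats`, its fields, and the 10% tolerance are all assumptions, and "pricing change" is only reported as the residual suspect when nothing else moved.

```python
from dataclasses import dataclass, field

@dataclass
class PeriodStats:
    # Hypothetical per-period aggregates; field names are assumptions.
    requests: int
    cost_usd: float
    input_tokens: int
    cached_input_tokens: int
    requests_by_model: dict = field(default_factory=dict)

def safe_div(a, b):
    """Divide, returning 0.0 instead of raising when the denominator is 0."""
    return a / b if b else 0.0

def top_model(stats):
    """Model with the most requests in the period, or None if unknown."""
    mix = stats.requests_by_model
    return max(mix, key=mix.get, default=None)

def leak_suspects(prev, curr, tolerance=0.10):
    """List likely causes if per-call cost rose by more than `tolerance`."""
    per_call_prev = safe_div(prev.cost_usd, prev.requests)
    per_call_curr = safe_div(curr.cost_usd, curr.requests)
    if safe_div(per_call_curr - per_call_prev, per_call_prev) <= tolerance:
        return []  # per-call cost did not rise meaningfully

    suspects = []
    # Prompt bloat: average input tokens per request grew.
    tokens_prev = safe_div(prev.input_tokens, prev.requests)
    tokens_curr = safe_div(curr.input_tokens, curr.requests)
    if safe_div(tokens_curr - tokens_prev, tokens_prev) > tolerance:
        suspects.append("prompt bloat")
    # Lost caching: the cached fraction of input tokens dropped.
    hit_prev = safe_div(prev.cached_input_tokens, prev.input_tokens)
    hit_curr = safe_div(curr.cached_input_tokens, curr.input_tokens)
    if hit_prev - hit_curr > tolerance:
        suspects.append("lost caching")
    # Model swap: the most-used model changed between periods.
    if top_model(prev) != top_model(curr):
        suspects.append("model swap")
    # Nothing else moved: a rate change is the remaining suspect.
    if not suspects:
        suspects.append("pricing change")
    return suspects

mix = {"small": 900, "flagship": 100}
last_week = PeriodStats(1000, 10.0, 500_000, 200_000, mix)
this_week = PeriodStats(1000, 15.0, 800_000, 320_000, mix)
print(leak_suspects(last_week, this_week))  # ['prompt bloat']
```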

Attribution makes this precise

Overlay cost-vs-usage with tags and the whale gets a name: it's the summarizer feature on the enterprise tier, not just "the flagship." A breakdown you can act on beats an aggregate you can only worry about.
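A minimal sketch of that breakdown, assuming each request record carries hypothetical `feature` and `tier` tags alongside `model` and `cost_usd`: group by the tag tuple and sort by cost so the whale surfaces at the top.

```python
from collections import defaultdict

def breakdown(records, keys=("model", "feature", "tier")):
    """Sum requests and cost per tag combination, most expensive first."""
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0})
    for rec in records:
        # Missing tags are grouped under "untagged" rather than dropped.
        group = tuple(rec.get(key, "untagged") for key in keys)
        totals[group]["requests"] += 1
        totals[group]["cost_usd"] += rec["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1]["cost_usd"], reverse=True)

# Hypothetical records; tag names and prices are made up for illustration.
records = (
    [{"model": "flagship", "feature": "summarizer", "tier": "enterprise", "cost_usd": 0.05}] * 3
    + [{"model": "small", "feature": "chat", "tier": "free", "cost_usd": 0.001}] * 10
    + [{"model": "small", "cost_usd": 0.001}] * 2
)

for group, total in breakdown(records):
    print(group, total["requests"], round(total["cost_usd"], 4))
```

Keeping untagged traffic as its own row is deliberate: a large "untagged" line is itself a finding, because it is spend nobody can attribute.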

From insight to routing

The payoff of spotting a whale or a bargain is a routing decision. If a flagship is dominating cost on a task a smaller model handles well, route that task down. If a cheap model is quietly doing great work, you can lean on it harder. This is where analytics meets A/B testing: measure cost-vs-usage, hypothesize a cheaper route, and test it deterministically before committing — then watch the next period's chart confirm the savings.
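One common reading of "deterministically" is hash-based bucketing: the same request key always lands in the same arm, so results are reproducible and a user never flips between routes mid-test. The sketch below assumes that reading; `assign_route`, the experiment name, and the 10% candidate share are hypothetical.

```python
import hashlib

def assign_route(request_key, experiment, candidate_share=0.10):
    """Deterministically assign a key to the 'candidate' or 'control' arm.

    The experiment name is mixed into the hash so that different
    experiments split the same keys independently.
    """
    digest = hashlib.sha256(f"{experiment}:{request_key}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a fraction in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < candidate_share else "control"

# The same key always gets the same arm for a given experiment.
arm = assign_route("user-1234", "summarizer-to-small")
assert arm == assign_route("user-1234", "summarizer-to-small")
print(arm)
```

With assignment fixed per key, the next period's cost-vs-usage chart can be split by arm and compared directly.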

The takeaway

Counting requests tells you what's busy; counting dollars tells you what's expensive; only overlaying them tells you where they disagree — and the disagreement is the opportunity. Learn the four shapes (whale, bargain, leak, spike), chase "cost up, usage flat" first because it's pure waste, and turn each finding into a routing move. Your Cost vs Usage report is where the quietly expensive model finally becomes visible.

Written by Nemo Team
Engineering, product, and company posts from the Nemo Router team: code-first, cost-honest, no vendor-marketing fluff.