Deterministic A/B Testing Across Model Variants

Randomly splitting LLM traffic gives you flaky, unrepeatable experiments. Here is how hash-based deterministic A/B testing splits traffic consistently per user, so your model comparison is actually measurable.

Nemo Team · 9 min read

You want to know whether a cheaper model is good enough, or whether a new prompt actually improves quality. So you split traffic 50/50 and compare. If that split is random per request, your experiment is quietly broken: the same user bounces between variants mid-session, your metrics are noisy, and you can't reproduce a result. Deterministic A/B testing fixes this by making the split a stable function of who is calling, not a coin flip on each call.

Why random per-request splitting fails

Imagine an experiment comparing Model A and Model B at 50/50, assigned randomly on every request. A single user in one session might hit A, then B, then A again. Three problems follow:

  • Inconsistent UX — the user experiences two different models in one conversation; response style and quality visibly flip.
  • Confounded metrics — you can't attribute a satisfaction signal to a variant when the user saw both.
  • Unrepeatable — re-run the experiment and a different random draw gives a different result. There's nothing to reproduce.

Randomizing on every request maximizes noise. What you want is assignment that is random across users but stable for each user thereafter.
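To get a feel for how quickly per-request randomization contaminates a cohort, here is a small simulation sketch. The function names are illustrative and the 50/50 coin flip is the only thing taken from the scenario above; the closed-form check simply uses the fact that a user sees a single variant with probability 2 × 0.5^n over n requests.

```python
import random


def variants_seen_random(num_requests: int, rng: random.Random) -> set:
    """Variants one user sees when every request flips a fair coin."""
    return {"A" if rng.random() < 0.5 else "B" for _ in range(num_requests)}


def fraction_exposed_to_both(num_users: int, requests_per_user: int, seed: int = 0) -> float:
    """Simulated share of users who end up seeing both variants."""
    rng = random.Random(seed)
    both = sum(
        len(variants_seen_random(requests_per_user, rng)) == 2
        for _ in range(num_users)
    )
    return both / num_users


def expected_fraction_both(requests_per_user: int) -> float:
    """Closed form: 1 minus the chance that all requests land on one variant."""
    return 1 - 2 * 0.5 ** requests_per_user
```

With only three requests per user, `expected_fraction_both(3)` is 0.75, so three quarters of users would have seen both models and cannot be attributed cleanly to either.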

Deterministic assignment via hashing

The fix is to derive the variant from a hash of a stable key (user id, session, or org) plus the experiment id — not from a random number:

bucket = hash(experiment_id + user_id) % 100
variant = bucket < 50 ? "A" : "B"

Because hash(experiment_id + user_id) is deterministic, a given user always lands in the same bucket for the life of the experiment. The assignment is still uniformly distributed across users (a good hash spreads them evenly), but stable for each one. Same user, same variant, every request, until you change the experiment id or the split threshold.
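A minimal runnable sketch of that pseudocode, under a few assumptions that are not from the original: SHA-256 as the hash, a ":" delimiter between the two ids, and the names `bucket_for` and `assign_variant`. A cryptographic digest is used here because Python's built-in `hash()` is salted per process for strings, so it would not give the same bucket on every server or after a restart.

```python
import hashlib


def bucket_for(experiment_id: str, user_id: str, buckets: int = 100) -> int:
    """Map (experiment, user) to a stable bucket in [0, buckets)."""
    # Assumption: the ":" delimiter keeps ("ab", "c") and ("a", "bc")
    # from producing the same key.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce.
    return int.from_bytes(digest[:8], "big") % buckets


def assign_variant(experiment_id: str, user_id: str, percent_a: int = 50) -> str:
    """Return "A" for buckets below percent_a, otherwise "B"."""
    return "A" if bucket_for(experiment_id, user_id) < percent_a else "B"
```

Calling `assign_variant("exp-1", "user-42")` twice, on any machine, returns the same variant. Which variant a particular user gets is arbitrary; only the stability and the overall proportions matter.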

Why include the experiment id in the hash

Hashing the user id alone would put the same users in "group A" for every experiment forever — your variant-A cohort becomes a fixed set of people, and their quirks bias every test. Mixing the experiment id in re-shuffles the cohort per experiment, so each test gets an independent, unbiased split.
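The re-shuffling effect can be checked directly. The sketch below, with illustrative names and SHA-256 assumed as the hash, compares how much of one experiment's variant-A cohort reappears in another experiment's variant-A cohort, with and without the experiment id in the key.

```python
import hashlib


def bucket(key: str) -> int:
    """Stable bucket in [0, 100) for an arbitrary string key."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big") % 100


def cohort_a(users, experiment_id=None) -> set:
    """Users assigned to variant A; experiment_id=None hashes the user id alone."""
    def key(user_id: str) -> str:
        return user_id if experiment_id is None else f"{experiment_id}:{user_id}"
    return {u for u in users if bucket(key(u)) < 50}


def overlap(first: set, second: set) -> float:
    """Share of the first cohort that also appears in the second."""
    return len(first & second) / len(first)


users = [f"user-{i}" for i in range(10_000)]

# User id alone: the variant-A cohort is the same people in every experiment.
unsalted = overlap(cohort_a(users), cohort_a(users))

# Experiment id mixed in: about half of one cohort lands in the other.
salted = overlap(cohort_a(users, "exp-1"), cohort_a(users, "exp-2"))
```

An overlap near 0.5 is what two independent 50/50 splits should produce; an overlap of 1.0 means the same people are the test group every time.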

What you can A/B test

The same deterministic seam handles several kinds of comparison:

Test          | Variant A      | Variant B      | You learn
Model swap    | flagship       | cheaper model  | Is cheaper good enough?
Prompt change | current prompt | revised prompt | Does the rewrite help?
Parameter     | temp 0.7       | temp 0.3       | Does determinism improve quality?
Provider      | provider X     | provider Y     | Same model, who serves it better?

Each runs through the gateway, so the split happens centrally and your application code doesn't branch — it asks for the experiment's logical model and gets the variant the user is assigned to.
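As a rough illustration of that seam, the gateway-side lookup might resemble the sketch below. Everything named here is hypothetical: the `EXPERIMENTS` registry, the logical model name "support-chat", and the concrete model names are placeholders, not the product's actual configuration or API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    experiment_id: str
    variant_models: dict  # variant label -> concrete model name
    percent_a: int = 50


# Hypothetical registry, keyed by the logical model name the application requests.
EXPERIMENTS = {
    "support-chat": Experiment(
        "exp-model-swap",
        {"A": "flagship-model", "B": "cheaper-model"},
    ),
}


def resolve_model(logical_model: str, user_id: str) -> str:
    """Resolve a logical model name to the concrete model for this user."""
    experiment = EXPERIMENTS.get(logical_model)
    if experiment is None:
        # No experiment on this name: pass it through unchanged.
        return logical_model
    key = f"{experiment.experiment_id}:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    variant = "A" if bucket < experiment.percent_a else "B"
    return experiment.variant_models[variant]
```

The application only ever asks for "support-chat"; whether that resolves to the flagship or the cheaper model is decided in one place.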

A/B tests are not request-overridable

A caller cannot ask to skip the experiment or force a variant. If they could, your sample would be self-selected and your results meaningless — and a client could dodge the cheaper variant, defeating the test. Assignment is the gateway's decision, derived deterministically, period. (This is the same principle as rate limits being non-overridable: some controls are boundaries, not knobs.)
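One way to enforce this, sketched under assumptions: the field names in `IGNORED_FIELDS` are invented for illustration, and the user id is assumed to come from the authenticated context rather than the request body, since a caller who could choose the id could also choose the variant.

```python
import hashlib

# Hypothetical client-supplied fields that must never influence assignment.
IGNORED_FIELDS = ("variant", "force_variant", "skip_experiment")


def assign_variant(experiment_id: str, user_id: str, percent_a: int = 50) -> str:
    """Deterministic assignment from a hash of experiment id and user id."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return "A" if bucket < percent_a else "B"


def handle_request(experiment_id: str, authenticated_user_id: str, body: dict) -> dict:
    """Drop any override fields, then attach the gateway's own assignment."""
    sanitized = {k: v for k, v in body.items() if k not in IGNORED_FIELDS}
    sanitized["variant"] = assign_variant(experiment_id, authenticated_user_id)
    return sanitized
```

A request that tries to force "A" and one that tries to force "B" come out with the same variant, because the body never reaches the assignment function.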

Lifecycle: draft → running → paused → completed

An experiment has a state machine, because "is this test live" needs to be unambiguous:

  • Draft — configured but not splitting traffic yet.
  • Running — actively assigning variants and recording results.
  • Paused — assignment frozen; existing users keep their variant, no new splits.
  • Completed — the winner is chosen; traffic consolidates to it.

Crucially, moving to completed doesn't strand the cohort — everyone routes to the winning variant. And because assignment was deterministic, the results you're deciding on are the results you can reproduce.
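The lifecycle can be sketched as a small state machine. Several details below are assumptions rather than anything the states above specify: the allowed transitions (including resuming from paused), tracking enrolled users in a set, and sending unenrolled users to a hypothetical `default_variant` while the experiment is in draft or paused.

```python
import hashlib
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"


# Assumed transition table; completed is terminal.
ALLOWED = {
    State.DRAFT: {State.RUNNING},
    State.RUNNING: {State.PAUSED, State.COMPLETED},
    State.PAUSED: {State.RUNNING, State.COMPLETED},
    State.COMPLETED: set(),
}


def hashed_variant(experiment_id: str, user_id: str) -> str:
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return "A" if bucket < 50 else "B"


class Experiment:
    def __init__(self, experiment_id: str, default_variant: str = "A"):
        self.experiment_id = experiment_id
        self.default_variant = default_variant  # assumption: used when not splitting
        self.state = State.DRAFT
        self.winner = None
        self.enrolled = set()  # users assigned while the experiment was running

    def transition(self, new_state: State, winner: str = None) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        if new_state is State.COMPLETED:
            if winner is None:
                raise ValueError("completing an experiment requires a winner")
            self.winner = winner
        self.state = new_state

    def route(self, user_id: str) -> str:
        if self.state is State.COMPLETED:
            return self.winner  # everyone consolidates to the winner
        if self.state is State.DRAFT:
            return self.default_variant  # configured, but not splitting yet
        variant = hashed_variant(self.experiment_id, user_id)
        if self.state is State.RUNNING:
            self.enrolled.add(user_id)
            return variant
        # PAUSED: enrolled users keep their variant, nobody new is split.
        return variant if user_id in self.enrolled else self.default_variant
```

Because the variant is recomputed from the hash rather than stored, the enrolled set is the only per-user state this sketch needs, and it is only consulted while paused.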

Measuring the result honestly

With stable assignment, your metrics finally mean something: compare cost-per-request, latency, and whatever quality signal you capture (thumbs, task success, downstream conversion) between cohorts that each saw exactly one variant. Pair this with cost-vs-usage analytics to confirm that "cheaper variant B" actually moved the bill, not just the request count.
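A per-cohort comparison could be computed along these lines. The record shape (`user_id`, `variant`, `cost_usd`, `latency_ms`, `success`) is an assumed log format for illustration, not a documented schema, and the sketch reports plain means without any significance testing.

```python
from collections import defaultdict
from statistics import mean


def summarize(records: list) -> dict:
    """Aggregate request logs into one row of metrics per variant."""
    by_variant = defaultdict(list)
    for record in records:
        by_variant[record["variant"]].append(record)
    return {
        variant: {
            "requests": len(rows),
            "users": len({r["user_id"] for r in rows}),
            "mean_cost_usd": mean(r["cost_usd"] for r in rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
            "success_rate": mean(1.0 if r["success"] else 0.0 for r in rows),
        }
        for variant, rows in by_variant.items()
    }


def contaminated_users(records: list) -> set:
    """Users who appear under more than one variant; should be empty."""
    seen = defaultdict(set)
    for record in records:
        seen[record["user_id"]].add(record["variant"])
    return {user for user, variants in seen.items() if len(variants) > 1}
```

Checking `contaminated_users` before reading the summary is a cheap guard: if it is non-empty, the split was not stable and the comparison is confounded again.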

The takeaway

A model comparison is only as trustworthy as its split. Random-per-request splitting manufactures noise; deterministic hash-based assignment gives each user a stable variant, keeps the split uniform and unbiased across users, and makes the experiment reproducible. Define the test, let the gateway assign, and read a result you can actually stand behind. Start in Router Settings → A/B Tests.

Written by Nemo Team. Engineering, product, and company posts from the Nemo Router team: code-first, cost-honest, no vendor-marketing fluff.