A/B Testing

Split live traffic between model variants to compare quality, latency, and cost on real production calls

A/B tests let you compare model variants on real production traffic — without changing a single line of application code. You declare two or more variants for a given model alias, assign weights, start the test, and NemoRouter rewrites the model your application asked for to one of the variants on the way through. Per-variant request counts, latency, cost, and success rate land on the dashboard so you can decide which variant wins.

Where A/B testing sits in the request pipeline

Every NemoRouter inference call goes through a fixed pre-flight pipeline before it ever reaches a model. A/B testing is one of those steps:

Your request: POST /v1/chat/completions   body.model = "gemini-2.5-flash"

   ├─ 1. Authenticate           → resolve your virtual key → org + role
   ├─ 2. Pre-flight             → RPM / TPM gate, atomic credit reservation
   ├─ 3. A/B test selection      → if a test matches, rewrite body.model
   ├─ 4. Guardrails              → input PII / blocklist / injection checks
   ├─ 5. Prompt template         → inject a managed system prompt (if any)
   ├─ 6. Forward                 → call the (possibly rewritten) model
   └─ 7. Settle credits          → release the reservation, charge the actual cost

Step 3 is where A/B testing lives. The selector looks for a running test in your org whose variants list the model you sent (gemini-2.5-flash here). If it finds one, it hashes (salt, org_id, test_id, request_id), maps the hash into the cumulative-weight range, picks a variant, and rewrites body.model to that variant's model. Everything after step 3 — guardrails, prompts, settlement — sees the selected variant, not what your application originally asked for. Settlement uses the actually-served model's cost, so billing stays accurate. The credit reservation at step 2, however, was sized for the originally-requested model — so for now keep variants in the same pricing class to avoid under-reserving on bursty traffic.

Two consequences of doing this in pre-flight:

  1. Your client code is unchanged. The only thing the SDK knows about is the model alias your app already uses. The split is server-side.
  2. The split is integrity-protected. The hash salt is server-side; clients cannot bias the bucketing by crafting request_id values, and there is no per-request opt-out field. This is deliberate — once you let a client pin themselves to their favorite variant, experiment integrity is gone.
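
The bucketing in step 3 can be sketched locally to build intuition. This is a minimal illustration, not NemoRouter's implementation: the hash function (SHA-256), the ":" field separator, the 10,000-bucket resolution, and the pick_variant helper name are all assumptions, and the real salt never leaves the server.

```shell
# Sketch of hash-based weighted bucketing. ASSUMPTIONS: SHA-256 as the hash,
# ":" as the field separator, and 10,000 buckets. The real salt and hash are
# server-side. sha256sum is GNU coreutils (on macOS, shasum -a 256 is the
# usual substitute).

# pick_variant SALT ORG_ID TEST_ID REQUEST_ID NAME:WEIGHT [NAME:WEIGHT ...]
pick_variant() (
  key="$1:$2:$3:$4"
  shift 4
  # First 7 hex digits of the digest, as an integer, reduced to [0, 9999].
  hex=$(printf '%s' "$key" | sha256sum | cut -c1-7)
  bucket=$(( 0x$hex % 10000 ))
  # Accumulate weights in order; the first range containing the bucket wins.
  printf '%s\n' "$@" | awk -F: -v b="$bucket" '
    { cum += $2 * 10000; last = $1 }
    !done && b < cum { print $1; done = 1 }
    END { if (!done) print last }  # fallback if weights sum to slightly under 1
  '
)

# The same four inputs always land on the same variant.
pick_variant demo-salt org_1 test_1 req_42 control:0.7 challenger:0.3
```

Because the salt is part of the hashed key, a caller who controls only request_id cannot predict which bucket a given value maps to. That is the property the integrity guarantee above relies on.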

Lifecycle

   draft ── start ──► running ── pause ──► paused ── start ──► running
                         │                    │
                         └─ complete ──┬──────┘
                                       ▼
                                   completed (terminal)

  • draft — created but not routing. Edit freely.
  • running — actively splitting traffic. Variants resolve at pre-flight step 3.
  • paused — temporarily off. Requests flow straight to whichever model they originally asked for. Resumable.
  • completed — terminal. Configuration is frozen for the audit trail. Clone the test if you need a follow-up; you cannot reopen one.

Status changes take effect within ~30 seconds across all backend replicas (we cache the active-test lookup to keep the hot path fast).
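
If you script against the lifecycle, it can help to encode the allowed transitions client-side so a bad call fails before it reaches the API. The sketch below covers only the edges drawn in the diagram above; next_status is a hypothetical helper name, and the server may accept or reject transitions the diagram does not show.

```shell
# next_status CURRENT ACTION
# Prints the resulting status, or returns 1 when the lifecycle diagram has
# no such edge. Local illustration only; the API remains the source of truth.
next_status() {
  case "$1:$2" in
    draft:start|paused:start)         echo running ;;
    running:pause)                    echo paused ;;
    running:complete|paused:complete) echo completed ;;
    *) echo "invalid transition: $1 -> $2" >&2; return 1 ;;
  esac
}

next_status draft start        # running
next_status running complete   # completed
```

Note that completed has no outgoing edge, which matches the rule that a completed test cannot be reopened.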

A worked example

The scenario: you currently use gemini-2.5-flash for summarising customer support tickets and want to know whether gemini-2.5-flash-lite (cheaper, faster) holds up at acceptable quality. Start at 70/30, run it for a day, then decide.

All requests below assume your organization's admin-role NemoRouter API key is in $NEMOROUTER_API_KEY.

1. Create the test

curl -X POST https://api.nemorouter.ai/nemo/ab-test/new \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Flash vs Flash-Lite for support summaries",
    "description": "Does Flash-Lite hold up at acceptable quality for one-line ticket summaries?",
    "variants": [
      { "name": "control",    "model": "gemini-2.5-flash",      "weight": 0.7 },
      { "name": "challenger", "model": "gemini-2.5-flash-lite", "weight": 0.3 }
    ]
  }'

Response (the new test starts in draft):

{
  "data": {
    "id": "7f3c1a8e-2b4d-4a91-bf7c-1d2e3a4b5c6d",
    "name": "Flash vs Flash-Lite for support summaries",
    "status": "draft",
    "variants": [
      { "name": "control",    "model": "gemini-2.5-flash",      "weight": 0.7 },
      { "name": "challenger", "model": "gemini-2.5-flash-lite", "weight": 0.3 }
    ],
    "created_at": "2026-05-28T16:00:00Z"
  }
}

Save the id — every subsequent call refers to it.

2. Start the test

TEST_ID="7f3c1a8e-2b4d-4a91-bf7c-1d2e3a4b5c6d"

curl -X POST "https://api.nemorouter.ai/nemo/ab-test/start?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY"

The response shows "status": "running". From now on, every request with model: gemini-2.5-flash or model: gemini-2.5-flash-lite enters the test at pre-flight step 3 and gets bucketed.

3. Your application keeps calling the same model alias

No code change. The application still sends model: gemini-2.5-flash. Roughly 70% of those calls actually run on gemini-2.5-flash; the other 30% are rewritten to gemini-2.5-flash-lite. Cost and latency settle against whichever model actually served the response — billing stays accurate. The model that actually served each call is what appears in the observability log at /[organization]/observability; the original-alias → served-variant breakdown lives in the A/B test rollup (the /info and /results endpoints below).

4. Read the per-variant rollup

After a few hundred calls per variant have flowed:

curl "https://api.nemorouter.ai/nemo/ab-test/info?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY"

Response:

{
  "data": {
    "id": "7f3c1a8e-2b4d-4a91-bf7c-1d2e3a4b5c6d",
    "status": "running",
    "variants": [
      { "name": "control",    "model": "gemini-2.5-flash",      "weight": 0.7 },
      { "name": "challenger", "model": "gemini-2.5-flash-lite", "weight": 0.3 }
    ],
    "metrics": [
      { "variant_name": "challenger", "request_count": 294, "avg_latency_ms": 287, "avg_cost": 0.000094, "total_cost": 0.027636, "success_rate": 0.993 },
      { "variant_name": "control",    "request_count": 706, "avg_latency_ms": 412, "avg_cost": 0.000183, "total_cost": 0.129198, "success_rate": 0.997 }
    ]
  }
}

metrics is a list with one entry per variant, ordered alphabetically by variant_name. Fields: variant_name, request_count, avg_latency_ms, avg_cost, total_cost, success_rate. Cost is the same value your billing uses, taken from the upstream cost header for known models. For brand-new models not yet in our pricing table, the cost falls back to inline pricing and then to the reservation floor, so keep traffic to freshly launched models light and be cautious about drawing cost conclusions from them.

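In this run, the challenger is about 30% faster and 49% cheaper per call, with a near-identical success rate (99.3% vs 99.7%). The rollup does not score output quality, so review a sample of summaries as well; on latency and cost there is enough signal to ramp up.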
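
The relative differences between two variants follow directly from the metrics rows. A small sketch, using awk so it needs nothing beyond a POSIX shell; pct_change is a hypothetical helper name, and the inputs are copied from the sample /info response above.

```shell
# pct_change BASELINE VALUE
# Prints how far VALUE sits from BASELINE, as a signed percentage.
pct_change() {
  awk -v base="$1" -v val="$2" 'BEGIN { printf "%+.1f%%\n", (val / base - 1) * 100 }'
}

# Control first, then challenger.
pct_change 412 287            # avg_latency_ms: -30.3%
pct_change 0.000183 0.000094  # avg_cost:       -48.6%
```

Negative values mean the challenger is lower than the control on that metric.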

5. Decide: roll out, keep running, or re-split

Once a test leaves draft it is immutable — variants and weights are frozen for the integrity of the experiment, so a single test can never silently mix two configurations (which would make its metrics meaningless). update accepts only draft tests; calling it on a running or paused test returns 400 Only draft tests can be edited. That gives you three moves once you've read the rollup:

Keep gathering signal — do nothing; let it run.

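Roll the winner out: change the model alias your application sends to the winning variant's model, and complete the test. A running test intercepts requests for any model listed in its variants (see step 2), so the split continues until the test is completed. Once it is (allow ~30 seconds for the status change to propagate), no test matches and traffic flows straight to the chosen model. Completing also freezes the test for the audit trail: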

curl -X POST "https://api.nemorouter.ai/nemo/ab-test/complete?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY"

Try a different split — because weights are frozen on a started test, you re-split by completing the current test and creating a fresh one at the new weights (a new test starts in draft and returns a new test_id):

# Freeze the current experiment (complete is allowed directly from running)
curl -X POST "https://api.nemorouter.ai/nemo/ab-test/complete?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY"

# Create + start a follow-up at 30/70 — this returns a NEW test_id
curl -X POST https://api.nemorouter.ai/nemo/ab-test/new \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Flash vs Flash-Lite — 30/70 follow-up",
    "variants": [
      { "name": "control",    "model": "gemini-2.5-flash",      "weight": 0.3 },
      { "name": "challenger", "model": "gemini-2.5-flash-lite", "weight": 0.7 }
    ]
  }'

Variant weights in a new (or update) request must sum to 1.0 (±0.01) — the API rejects splits that don't. The test_id on every mutating call is a query parameter, not a body field; the body carries only the new field values.
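
You can check a proposed split locally before sending it. The sketch below mirrors the documented rule (sum within ±0.01 of 1.0); weights_ok is a hypothetical helper, and whether the API treats the exact boundary as inclusive is an assumption here.

```shell
# weights_ok WEIGHT [WEIGHT ...]
# Succeeds (exit 0) when the weights sum to 1.0 within the documented ±0.01
# tolerance. The extra 1e-9 only absorbs floating-point rounding.
weights_ok() {
  printf '%s\n' "$@" | awk '
    { sum += $1 }
    END { d = sum - 1.0; if (d < 0) d = -d; exit !(d <= 0.01 + 1e-9) }
  '
}

weights_ok 0.7 0.3 && echo "70/30 is valid"
weights_ok 0.6 0.3 || echo "60/30 would be rejected"
```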

Test cases you can run yourself

Four runnable checks that demonstrate the four invariants you should trust before scaling traffic up. Each test is self-contained.

Prerequisites:

  • The worked example above is created and running. TEST_ID is set in your shell.
  • NEMOROUTER_API_KEY is set to an admin-role key (reads accept member, writes require admin).
  • A few dollars of credits available.
  • Tests use xargs -P to parallelize calls (10-way in Test 1, 5-way in Tests 2 and 3), so sending the traffic takes ~10-20 seconds per test instead of 1-3 minutes.
  • Each test waits ~12 seconds after sending traffic, because results are batched to the database every 10 seconds. Without the wait, your read can land before the flush and undercount.

Test 1 — The split actually splits

What it proves: Weighted bucketing produces roughly the configured distribution under load. We compute deltas, so prior traffic on the same test doesn't pollute the expected numbers.

# Snapshot per-variant counts before the test
BEFORE_CONTROL=$(curl -s "https://api.nemorouter.ai/nemo/ab-test/info?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" | jq '([.data.metrics[] | select(.variant_name=="control")][0].request_count) // 0')
BEFORE_CHALLENGER=$(curl -s "https://api.nemorouter.ai/nemo/ab-test/info?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" | jq '([.data.metrics[] | select(.variant_name=="challenger")][0].request_count) // 0')

# Send 200 requests in parallel (10-way, ~15s wall clock)
seq 1 200 | xargs -P 10 -I {} curl -s -X POST https://api.nemorouter.ai/v1/chat/completions \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemini-2.5-flash","messages":[{"role":"user","content":"Reply with the single word: ok"}],"max_tokens":4}' \
  -o /dev/null

# Wait for the result buffer to flush (10s flush interval) then compute deltas
sleep 12
AFTER_CONTROL=$(curl -s "https://api.nemorouter.ai/nemo/ab-test/info?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" | jq '([.data.metrics[] | select(.variant_name=="control")][0].request_count) // 0')
AFTER_CHALLENGER=$(curl -s "https://api.nemorouter.ai/nemo/ab-test/info?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" | jq '([.data.metrics[] | select(.variant_name=="challenger")][0].request_count) // 0')

echo "control delta:    $((AFTER_CONTROL - BEFORE_CONTROL))"
echo "challenger delta: $((AFTER_CHALLENGER - BEFORE_CHALLENGER))"

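Expected: With the test at 70/30 (the worked example's configuration), control delta should be ~140 and challenger delta ~60. When every call is recorded the two deltas sum to 200, so they deviate by the same amount in opposite directions. One standard deviation is sqrt(200 × 0.7 × 0.3) ≈ 6.5 requests, so a swing of up to about ±20 is normal. Exact numbers vary because 200 calls is small for a 70/30 split. Bump to 1000 for tighter confidence (still finishes in ~1 minute under -P 10).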
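
For a different sample size or split, the expected count and its spread can be estimated with a binomial model. This is a sketch under the assumption that hash bucketing of distinct requests behaves like independent draws; split_stats is a hypothetical helper name.

```shell
# split_stats N WEIGHT
# Prints the expected request count for a variant and one standard deviation,
# treating each request as an independent draw (binomial model).
split_stats() {
  awk -v n="$1" -v p="$2" 'BEGIN {
    printf "expected=%.0f sd=%.1f\n", n * p, sqrt(n * p * (1 - p))
  }'
}

split_stats 200 0.7    # expected=140 sd=6.5
split_stats 1000 0.7   # expected=700 sd=14.5
```

Going from 200 to 1000 calls roughly doubles the absolute spread but cuts it from about 4.6% to about 2.1% of the control's expected count, which is why the larger run gives a tighter read on the split.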

Test 2 — Models not in any variant pass through untouched

What it proves: A test only intercepts requests whose model matches one of its variants. Everything else is unaffected.

# Snapshot. The `// 0` coerces an empty-metrics result to 0 (safe under set -euo pipefail).
BEFORE_TOTAL=$(curl -s "https://api.nemorouter.ai/nemo/ab-test/info?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" | jq '([.data.metrics[].request_count] | add) // 0')

# Send 10 requests with a model that is NOT in any variant (parallel, fast)
seq 1 10 | xargs -P 5 -I {} curl -s -X POST https://api.nemorouter.ai/v1/chat/completions \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemini-2.5-pro","messages":[{"role":"user","content":"hi"}],"max_tokens":4}' \
  -o /dev/null

# Wait through the buffer-flush window
sleep 12
AFTER_TOTAL=$(curl -s "https://api.nemorouter.ai/nemo/ab-test/info?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" | jq '([.data.metrics[].request_count] | add) // 0')

echo "Delta: $((AFTER_TOTAL - BEFORE_TOTAL))"

Expected: Delta is 0. The 10 gemini-2.5-pro calls bypassed the test entirely. They appear in /[organization]/observability served by gemini-2.5-pro, never by either variant.

Test 3 — Pausing stops the routing within ~30s

What it proves: A paused test does not re-route. Requests flow straight to the originally-requested model.

# Pause
curl -s -X POST "https://api.nemorouter.ai/nemo/ab-test/pause?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" > /dev/null

# Wait for the active-test cache to expire across all replicas
sleep 35

# Snapshot
BEFORE=$(curl -s "https://api.nemorouter.ai/nemo/ab-test/info?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" | jq '([.data.metrics[].request_count] | add) // 0')

# Send 30 requests in parallel
seq 1 30 | xargs -P 5 -I {} curl -s -X POST https://api.nemorouter.ai/v1/chat/completions \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemini-2.5-flash","messages":[{"role":"user","content":"hi"}],"max_tokens":4}' \
  -o /dev/null

# Wait through the buffer-flush window
sleep 12
AFTER=$(curl -s "https://api.nemorouter.ai/nemo/ab-test/info?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" | jq '([.data.metrics[].request_count] | add) // 0')

echo "Metrics delta while paused: $((AFTER - BEFORE))"

Expected: Delta is 0. While paused, no new rows are written to ab_test_results. All 30 calls show up in observability served by gemini-2.5-flash directly.

Resume for the next test:

curl -s -X POST "https://api.nemorouter.ai/nemo/ab-test/start?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" > /dev/null
sleep 35

Test 4 — A started test is immutable

What it proves: Once a test leaves draft, its configuration is frozen. update is rejected on a running (or paused) test, so an experiment can never silently change its split mid-flight — the guarantee that makes the metrics trustworthy.

# The worked-example test is running. Try to re-weight it to 20/80.
HTTP_CODE=$(curl -s -o /tmp/ab-update-resp.json -w "%{http_code}" \
  -X PATCH "https://api.nemorouter.ai/nemo/ab-test/update?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "variants": [
      { "name": "control",    "model": "gemini-2.5-flash",      "weight": 0.2 },
      { "name": "challenger", "model": "gemini-2.5-flash-lite", "weight": 0.8 }
    ]
  }')

echo "HTTP status: $HTTP_CODE"
cat /tmp/ab-update-resp.json

Expected: HTTP status: 400, with a body like {"detail":"Only draft tests can be edited; this test is in 'running' status"}. The split stays exactly as configured. To actually run a different split, complete this test and create a new one at the new weights (see step 5 — "Try a different split").

Cleanup

When you're done with the test cases:

curl -X POST "https://api.nemorouter.ai/nemo/ab-test/complete?test_id=$TEST_ID" \
  -H "Authorization: Bearer $NEMOROUTER_API_KEY"

Things to watch for

  • You can't edit variants on a started test — by design. update returns 400 on any non-draft test. This is deliberate: mixing two configurations within one test makes the metrics meaningless. To change the split, complete the test and start a fresh one (step 5).
  • Run for long enough. A few hundred calls per variant is a reasonable floor before you draw conclusions on latency or cost. Quality differences usually need more.
  • Rate limits are per virtual key, not per variant. If one variant points at a model with stricter provider limits, you'll see more 429s on that variant — real signal, not a test artifact.
  • Cost reflects the actually-served model. If a variant's upstream provider triggers a NemoRouter fallback chain, the cost recorded is for whatever model answered. Right behavior, but an unstable upstream can muddy comparisons.
  • One running test per (org, model). If two running tests both list gemini-2.5-flash as a variant, only one will intercept that traffic. Complete or pause the first before starting the second.
  • All teams in an org are eligible. Every team's traffic that matches a variant model enters the test. Per-team experiment scoping is not supported today.

Endpoint reference

Operation                               Method + path
List org tests                          GET /nemo/ab-test/list
Inspect one (with metrics)              GET /nemo/ab-test/info?test_id=...
Create                                  POST /nemo/ab-test/new
Update (draft only, body = {variants})  PATCH /nemo/ab-test/update?test_id=...
Start / Pause / Complete                POST /nemo/ab-test/start | /pause | /complete?test_id=...
Raw results                             GET /nemo/ab-test/results?test_id=...
  (paginated per-request rows, each tagged with variant_name, plus a per-variant summary)
Delete                                  DELETE /nemo/ab-test/delete?test_id=...

Roles: read endpoints (list, info, results) accept any member-role NemoRouter API key. Write endpoints (new, update, start, pause, complete, delete) require an admin-role key (owner, admin, proxy_admin, or org_admin). Most virtual keys created from the dashboard land in member role; an org owner needs to promote a key explicitly. See Authentication for how to inspect and promote key roles.

Next Steps

  • Playground — Send a few calls through your model alias to confirm the test is splitting before you scale traffic up.
  • Budget Controls — Cap A/B-test spend with a per-key budget on the key used for the experiment.
  • Authentication — Virtual keys, admin roles, and how A/B routing respects both.
  • Chat Completions — The endpoint that A/B tests sit in front of.