Canary Deploys and Auto-Rollback by SLO

A deploy shouldn't need a human watching a dashboard. Here is how a 5% canary, a fixed observation window, and SLO-gated auto-rollback let changes ship and self-heal without a 3 a.m. page.

Nemo Team · 9 min read

The traditional deploy ends with a human staring at a dashboard for twenty minutes, ready to roll back. That doesn't scale and it doesn't sleep. A better shape ships the change to a small slice of traffic, watches a fixed set of service-level objectives (SLOs) for a bounded window, and automatically promotes or rolls back based on what the numbers say. This post explains how a canary combined with SLO gating turns a deploy from a vigil into a self-healing process.

The shape: 5% for 20 minutes

new revision deployed
  → route 5% of traffic to it (canary), 95% to stable
  → observe SLOs for a fixed window (e.g. 20 min)
       ├─ SLOs healthy  → promote canary to 100%
       └─ SLOs breached → roll back: 100% to stable, alert

Two numbers do the work: a small traffic share (5%) bounds the blast radius — a bad revision can only hurt a twentieth of requests — and a fixed observation window (20 min) gives the SLOs time to show a problem before promotion. Neither requires a human in the loop; the decision is a function of the metrics.
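The flow above can be sketched as a small decision function. This is a minimal illustration, not the actual pipeline: `CanaryPlan`, `run_canary`, and the `slos_healthy` callback are hypothetical names, time is simulated rather than slept through, and rolling back at the first breach (instead of waiting for the window to end) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class CanaryPlan:
    canary_percent: int = 5          # share of traffic sent to the new revision
    window_minutes: int = 20         # fixed observation window
    check_interval_minutes: int = 1  # how often the SLOs are evaluated (assumed)


def run_canary(
    plan: CanaryPlan,
    slos_healthy: Callable[[int], bool],
) -> Dict[str, Union[str, int, bool]]:
    """Walk the observation window and return the final traffic split.

    slos_healthy(minute) is a hypothetical callback that reports whether
    the SLOs hold at the given minute of the window.
    """
    for minute in range(0, plan.window_minutes, plan.check_interval_minutes):
        if not slos_healthy(minute):
            # Breach: send all traffic back to stable and raise an alert.
            return {"outcome": "rolled_back", "stable": 100, "canary": 0,
                    "alert": True, "at_minute": minute}
    # The whole window stayed healthy: promote the canary to 100%.
    return {"outcome": "promoted", "stable": 0, "canary": 100,
            "alert": False, "at_minute": plan.window_minutes}


healthy_run = run_canary(CanaryPlan(), lambda minute: True)
bad_run = run_canary(CanaryPlan(), lambda minute: minute < 7)
```

Here `healthy_run` ends promoted after the full window, while `bad_run` rolls back at minute 7. In both cases the outcome is a pure function of the observed metrics, with no human input.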

What SLOs gate the decision

The canary is judged against objectives that a bad deploy would visibly violate:

| SLO | Healthy | Breach signals |
| --- | --- | --- |
| Error rate | within baseline | a new 5xx source |
| Latency p95/p99 | within baseline | a regression in the tail |
| Readiness | DB + mount + queue healthy | a broken dependency |
| Log errors | quiet | DataError, tracebacks, migration drift |

A canary that holds error rate and latency within baseline for the window earns promotion. One that introduces a 5xx source or a p99 spike trips the gate and rolls back automatically. Crucially, log errors count: a revision can return 200s while a background poller spams "column does not exist" errors, so "200 OK" alone is not the bar.
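One way to express the gate is a function that compares the canary's metrics for the window against the stable baseline and lists every breached SLO. The sketch below is illustrative only: the `WindowMetrics` fields, the slack values, and the zero-tolerance default for log errors are assumptions, not thresholds taken from the real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WindowMetrics:
    error_rate: float  # fraction of requests answered with a 5xx
    p95_ms: float      # 95th percentile latency over the window
    p99_ms: float      # 99th percentile latency over the window
    ready: bool        # DB + mount + queue all healthy
    log_errors: int    # error-level log lines (DataError, tracebacks, ...)


def breached_slos(
    canary: WindowMetrics,
    baseline: WindowMetrics,
    error_rate_slack: float = 0.001,  # assumed absolute tolerance
    latency_slack: float = 1.10,      # assumed: up to 10% above baseline
    max_log_errors: int = 0,          # assumed: logs must stay quiet
) -> List[str]:
    """Return the names of the SLOs the canary violated (empty means healthy)."""
    breaches = []
    if canary.error_rate > baseline.error_rate + error_rate_slack:
        breaches.append("error_rate")
    if canary.p95_ms > baseline.p95_ms * latency_slack:
        breaches.append("latency_p95")
    if canary.p99_ms > baseline.p99_ms * latency_slack:
        breaches.append("latency_p99")
    if not canary.ready:
        breaches.append("readiness")
    if canary.log_errors > max_log_errors:
        breaches.append("log_errors")
    return breaches


baseline = WindowMetrics(error_rate=0.002, p95_ms=120, p99_ms=300, ready=True, log_errors=0)
# Serves 200s at normal latency, but a background job is logging errors.
noisy = WindowMetrics(error_rate=0.002, p95_ms=121, p99_ms=305, ready=True, log_errors=14)
noisy_breaches = breached_slos(noisy, baseline)
```

`noisy_breaches` contains only `"log_errors"`: the request path looks fine, yet the gate still trips, which is the case the paragraph above describes.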

A 200 is necessary, not sufficient

Readiness results are cached for seconds at a time and can stay green while the request path degrades. The auto-rollback gate therefore watches error rate, latency tail, and log errors over the whole window, not a single health ping. The failure mode we're guarding against is exactly the deploy that looks healthy for the first thirty seconds and isn't.
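The difference between a single early check and a whole-window judgment is easy to see with numbers. The sketch below uses made-up data: forty 30-second intervals (a 20-minute window), an assumed 1% error-rate ceiling, and hypothetical helper names.

```python
from typing import List, Tuple

Sample = Tuple[int, int]  # (requests, errors) for one 30-second interval


def error_rate(samples: List[Sample]) -> float:
    """Error rate over all given intervals, weighted by request count."""
    requests = sum(r for r, _ in samples)
    errors = sum(e for _, e in samples)
    return errors / requests if requests else 0.0


def single_ping_ok(samples: List[Sample], max_error_rate: float) -> bool:
    # Looks only at the first interval, like a one-off health check.
    return error_rate(samples[:1]) <= max_error_rate


def whole_window_ok(samples: List[Sample], max_error_rate: float) -> bool:
    # Judges every interval in the observation window together.
    return error_rate(samples) <= max_error_rate


# Made-up window: clean for the first 30 seconds, then 4% errors.
window = [(500, 0)] + [(500, 20)] * 39
MAX_ERROR_RATE = 0.01  # assumed ceiling for illustration

early_verdict = single_ping_ok(window, MAX_ERROR_RATE)
window_verdict = whole_window_ok(window, MAX_ERROR_RATE)
```

The early check passes (`early_verdict` is `True`) while the whole-window check fails (`window_verdict` is `False`, at 780 errors in 20,000 requests), so only the window-based gate would roll this revision back.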

Rollback is a traffic flip, never a data operation

The most important safety property: rollback reverts traffic, not data. Auto-rollback points 100% of traffic back at the known-good stable revision — it does not touch the database, run a down-migration, or wipe anything. This is only safe because migrations are additive: the stable revision still works against the new schema, so flipping back is instant and harmless. A rollback that had to undo schema changes wouldn't be a rollback; it'd be a second risky deploy under pressure.
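A sketch of what "traffic flip" means in code: the rollback routine holds a reference to the router and nothing else, so it has no way to touch data. `TrafficRouter`, `rollback`, and the `alert` callback are hypothetical names for illustration, not a real load balancer API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TrafficRouter:
    """Hypothetical in-memory stand-in for the real traffic splitter."""
    weights: Dict[str, int] = field(
        default_factory=lambda: {"stable": 95, "canary": 5}
    )

    def set_weights(self, stable: int, canary: int) -> None:
        if stable < 0 or canary < 0 or stable + canary != 100:
            raise ValueError("weights must be non-negative and sum to 100")
        self.weights = {"stable": stable, "canary": canary}


def rollback(router: TrafficRouter, alert: Callable[[str], None]) -> None:
    """Revert traffic only: no database handle, no down-migration, no deletes."""
    router.set_weights(stable=100, canary=0)
    alert("canary rolled back: 100% of traffic on stable")


router = TrafficRouter()
alerts = []
rollback(router, alerts.append)
```

After the call, `router.weights` is `{"stable": 100, "canary": 0}` and one alert has been recorded. Keeping the function signature free of any database dependency is a design choice that makes the "never a data operation" property hard to violate by accident.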

Tiered gating: not every change auto-promotes

Auto-canary is appropriate for low-risk changes and gated for risky ones. A risk classifier tags each change, and only the safe tier auto-promotes within the window; higher-risk changes stay human-gated. The point isn't to remove humans from every decision — it's to remove them from the routine ones, so attention is spent on the changes that actually warrant it. The content-level invariants (credit safety, RLS, virtual-key-only) are enforced by tests and reviewers regardless of tier; the canary gates operational health, not correctness.
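The tiering can be pictured as a classifier feeding the promotion decision. The post does not describe how the real classifier works, so everything here is an assumption for illustration: the path-prefix rule, the `RISKY_PREFIXES` list, the two-tier enum, and the choice to roll back on an SLO breach regardless of tier.

```python
from enum import Enum
from typing import List


class Tier(Enum):
    SAFE = "safe"
    RISKY = "risky"


# Hypothetical rule: changes under these paths are treated as higher risk.
RISKY_PREFIXES = ("migrations/", "billing/", "auth/")


def classify(changed_paths: List[str]) -> Tier:
    """Tag a change as RISKY if any touched file sits under a risky prefix."""
    if any(path.startswith(RISKY_PREFIXES) for path in changed_paths):
        return Tier.RISKY
    return Tier.SAFE


def next_step(tier: Tier, slos_healthy: bool) -> str:
    """Decide what happens at the end of the observation window."""
    if not slos_healthy:
        return "rollback"  # assumed: a breach rolls back in every tier
    if tier is Tier.SAFE:
        return "auto_promote"
    return "await_human_approval"


docs_change = classify(["docs/readme.md"])
schema_change = classify(["migrations/0042_add_column.py", "api/routes.py"])
```

With these rules, `next_step(docs_change, True)` yields `"auto_promote"` and `next_step(schema_change, True)` yields `"await_human_approval"`, so human attention is reserved for the changes the classifier flags.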

Why bounded windows beat indefinite watching

A human "keeping an eye on it" has no defined end and no defined criteria — attention drifts, and "looks fine" is not a measurement. A fixed window with explicit SLOs is both bounded (it ends) and decidable (promote or roll back, by the numbers). It also composes: every deploy gets the same 20-minute, same-SLO treatment, so deploy safety is a property of the pipeline, not of who happened to be watching.

The takeaway

Deploys shouldn't depend on a human's vigilance. Ship to a small canary slice to bound blast radius, judge it against explicit SLOs — error rate, latency tail, and log errors — over a fixed window, and auto-promote or auto-rollback by the numbers. Keep rollback a traffic flip (safe because migrations are additive), gate the risky tier for humans, and deploy safety becomes a pipeline property instead of a 3 a.m. page.

Written by Nemo Team. Engineering, product, and company posts from the Nemo Router team: code-first, cost-honest, no vendor-marketing fluff.
