The traditional deploy ends with a human staring at a dashboard for twenty minutes, ready to roll back. That doesn't scale, and the human has to sleep sometime. A better shape ships the change to a small slice of traffic, watches a fixed set of service-level objectives (SLOs) for a bounded window, and automatically promotes or rolls back based on what the numbers say. This post explains how canary deploys with SLO gating turn a deploy from a vigil into a self-healing process.

The shape: 5% for 20 minutes

new revision deployed
  → route 5% of traffic to it (canary), 95% to stable
  → observe SLOs for a fixed window (e.g. 20 min)
       ├─ SLOs healthy  → promote canary to 100%
       └─ SLOs breached → roll back: 100% to stable, alert

Two numbers do the work. A small traffic share (5%) bounds the blast radius: a bad revision can hurt at most one request in twenty. A fixed observation window (20 min) gives the SLOs time to show a problem before promotion. Neither requires a human in the loop; the decision is a function of the metrics.
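The flow above can be sketched as a pure function of the SLO verdict. This is a minimal illustration, not the pipeline's actual code: `CanaryPlan`, `Decision`, and `decide` are hypothetical names, and the 5% and 20-minute defaults are the figures quoted in this post.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    ROLL_BACK = "roll_back"


@dataclass(frozen=True)
class CanaryPlan:
    canary_percent: int = 5    # share of traffic sent to the new revision
    window_minutes: int = 20   # fixed observation window

    def traffic_split(self) -> dict[str, int]:
        """Traffic weights while the canary is under observation."""
        return {"canary": self.canary_percent, "stable": 100 - self.canary_percent}


def decide(slos_healthy: bool) -> tuple[Decision, dict[str, int]]:
    """Map the end-of-window SLO verdict to a decision and final weights."""
    if slos_healthy:
        return Decision.PROMOTE, {"canary": 100, "stable": 0}
    return Decision.ROLL_BACK, {"canary": 0, "stable": 100}


plan = CanaryPlan()
print(plan.traffic_split())   # weights during the observation window
print(decide(slos_healthy=False))
```

Because `decide` takes only the verdict as input, the same inputs always yield the same outcome, which is what makes the step safe to run without a human.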

What SLOs gate the decision

The canary is judged against objectives that a bad deploy would visibly violate:

SLO	Healthy	Breach signals
Error rate	within baseline	a new 5xx source
Latency p95/p99	within baseline	a regression in the tail
Readiness	DB + mount + queue healthy	a broken dependency
Log errors	quiet	`DataError`, tracebacks, migration drift

A canary that holds error rate and latency within baseline for the window earns promotion. One that introduces a 5xx source or a p99 spike trips the gate and rolls back automatically. Crucially, log errors count: a revision can return 200s while a background poller spams `column does not exist` into the logs, so "200 OK" alone is not the bar.
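One way to express the table as a check is to compare the canary's window metrics against the stable baseline and list every objective that is breached. The sketch below is illustrative only: `WindowMetrics`, `breaches`, and the tolerance values (`error_rate_slack`, `latency_slack`, `allowed_log_errors`) are assumptions, since the post does not say how "within baseline" is quantified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    error_rate: float      # fraction of requests answered with a 5xx
    p95_ms: float          # 95th percentile latency in milliseconds
    p99_ms: float          # 99th percentile latency in milliseconds
    readiness_ok: bool     # DB, mount, and queue all healthy
    log_error_count: int   # DataError, tracebacks, migration drift, ...


def breaches(
    canary: WindowMetrics,
    baseline: WindowMetrics,
    error_rate_slack: float = 0.001,   # assumed absolute tolerance
    latency_slack: float = 1.10,       # assumed 10% headroom on the tail
    allowed_log_errors: int = 0,       # assumed: "quiet" means zero
) -> list[str]:
    """Return the names of breached SLOs; an empty list means healthy."""
    found = []
    if canary.error_rate > baseline.error_rate + error_rate_slack:
        found.append("error_rate")
    if canary.p95_ms > baseline.p95_ms * latency_slack:
        found.append("latency_p95")
    if canary.p99_ms > baseline.p99_ms * latency_slack:
        found.append("latency_p99")
    if not canary.readiness_ok:
        found.append("readiness")
    if canary.log_error_count > allowed_log_errors:
        found.append("log_errors")
    return found


baseline = WindowMetrics(0.002, 120.0, 300.0, True, 0)
# Every request returns 200, but a background poller is logging errors.
noisy = WindowMetrics(0.002, 120.0, 300.0, True, 14)
print(breaches(noisy, baseline))
```

The `noisy` canary matches the baseline on error rate and latency and still fails the gate on log errors alone, which is the case the paragraph above calls out.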

A 200 is necessary, not sufficient

The readiness check caches its result for seconds, so it can stay green while the request path degrades. The auto-rollback gate therefore watches error rate, latency tail, and log errors over the whole window, not a single health ping. The failure mode we're guarding against is exactly the deploy that looks healthy for the first thirty seconds and isn't.
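To make "over the whole window" concrete, here is a small sketch that assumes the gate records one healthy-or-not sample per minute. `window_verdict` and the per-minute sampling cadence are hypothetical; the point is only that early green samples never earn promotion on their own.

```python
from typing import Sequence


def window_verdict(healthy_by_minute: Sequence[bool], window_minutes: int = 20) -> str:
    """Judge a canary from per-minute health samples.

    Any unhealthy sample inside the window trips the gate. Promotion needs
    a full window of healthy samples, not just a healthy start.
    """
    for minute, healthy in enumerate(healthy_by_minute[:window_minutes], start=1):
        if not healthy:
            return f"roll_back (breach at minute {minute})"
    if len(healthy_by_minute) < window_minutes:
        return "keep_watching"
    return "promote"


print(window_verdict([True, True, True]))                   # healthy so far, window not over
print(window_verdict([True, True, False] + [True] * 17))    # one bad minute is enough
print(window_verdict([True] * 20))                          # full healthy window
```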

Rollback is a traffic flip, never a data operation

The most important safety property: rollback reverts traffic, not data. Auto-rollback points 100% of traffic back at the known-good stable revision — it does not touch the database, run a down-migration, or wipe anything. This is only safe because migrations are additive: the stable revision still works against the new schema, so flipping back is instant and harmless. A rollback that had to undo schema changes wouldn't be a rollback; it'd be a second risky deploy under pressure.
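A sketch of what "traffic flip" means in code, under the assumption that routing is controlled through some weights API. `TrafficRouter` is a hypothetical in-memory stand-in for a load balancer or service mesh, not a real library. The detail worth noticing is the signature of `roll_back`: it receives a router and nothing else, so it has no way to reach the database.

```python
from dataclasses import dataclass, field


@dataclass
class TrafficRouter:
    """Hypothetical in-memory stand-in for a load balancer or mesh API."""
    weights: dict[str, int] = field(
        default_factory=lambda: {"stable": 95, "canary": 5}
    )

    def set_weights(self, stable: int, canary: int) -> None:
        if stable + canary != 100:
            raise ValueError("weights must sum to 100")
        self.weights = {"stable": stable, "canary": canary}


def roll_back(router: TrafficRouter) -> dict[str, int]:
    """Send all traffic to the stable revision.

    Takes no database handle by design: no down-migration, no data changes.
    """
    router.set_weights(stable=100, canary=0)
    return router.weights


router = TrafficRouter()
print(roll_back(router))
```

Keeping the rollback function this narrow is one way to encode the safety property structurally, rather than relying on convention.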

Tiered gating: not every change auto-promotes

Auto-promotion is appropriate for low-risk changes; risky ones stay behind a human gate. A risk classifier tags each change, and only the safe tier auto-promotes within the window; higher-risk changes stay human-gated. The point isn't to remove humans from every decision. It's to remove them from the routine ones, so attention is spent on the changes that actually warrant it. The content-level invariants (credit safety, RLS, virtual-key-only) are enforced by tests and reviewers regardless of tier; the canary gates operational health, not correctness.
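As an illustration of tiered gating, the sketch below classifies a change by the paths it touches and then picks the next step. Everything here is an assumption made for the example: the post does not describe how its risk classifier works, `RISKY_PREFIXES` is an invented rule, and treating an SLO breach as an automatic rollback in both tiers is a guess rather than a documented behavior.

```python
from enum import Enum


class Tier(Enum):
    SAFE = "safe"
    RISKY = "risky"


# Hypothetical rule: changes under these paths are treated as risky.
RISKY_PREFIXES = ("migrations/", "billing/", "auth/")


def classify(changed_paths: list[str]) -> Tier:
    """Tag a change as risky if any touched path matches a risky prefix."""
    if any(path.startswith(RISKY_PREFIXES) for path in changed_paths):
        return Tier.RISKY
    return Tier.SAFE


def next_step(tier: Tier, slos_healthy: bool) -> str:
    """Decide what happens at the end of the observation window."""
    if not slos_healthy:
        return "roll_back"              # assumed: a breach rolls back in any tier
    if tier is Tier.SAFE:
        return "auto_promote"
    return "await_human_approval"       # risky tier stays human-gated


tier = classify(["billing/ledger.py", "docs/changelog.md"])
print(tier, next_step(tier, slos_healthy=True))
```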

Why bounded windows beat indefinite watching

A human "keeping an eye on it" has no defined end and no defined criteria — attention drifts, and "looks fine" is not a measurement. A fixed window with explicit SLOs is both bounded (it ends) and decidable (promote or roll back, by the numbers). It also composes: every deploy gets the same 20-minute, same-SLO treatment, so deploy safety is a property of the pipeline, not of who happened to be watching.

The takeaway

Deploys shouldn't depend on a human's vigilance. Ship to a small canary slice to bound blast radius, judge it against explicit SLOs (error rate, latency tail, and log errors) over a fixed window, and promote or roll back automatically by the numbers. Keep rollback a traffic flip (safe because migrations are additive), gate the risky tier for humans, and deploy safety becomes a pipeline property instead of a 3 a.m. page.

Canary Deploys and Auto-Rollback by SLO
