Canary Deploys and Auto-Rollback by SLO
A deploy shouldn't need a human watching a dashboard. Here is how a 5% canary, a fixed observation window, and SLO-gated auto-rollback let changes ship and self-heal without a 3 a.m. page.
The traditional deploy ends with a human staring at a dashboard for twenty minutes, ready to roll back. That doesn't scale and it doesn't sleep. A better shape ships the change to a small slice of traffic, watches a fixed set of service-level objectives (SLOs) for a bounded window, and automatically promotes or rolls back based on what the numbers say. This post explains how canary deploys plus SLO gating turn a deploy from a vigil into a self-healing process.
The shape: 5% for 20 minutes
new revision deployed
→ route 5% of traffic to it (canary), 95% to stable
→ observe SLOs for a fixed window (e.g. 20 min)
├─ SLOs healthy → promote canary to 100%
└─ SLOs breached → roll back: 100% to stable, alert
Two numbers do the work. A small traffic share (5%) bounds the blast radius: a bad revision can reach only about one request in twenty. A fixed observation window (20 min) gives the SLOs time to show a problem before promotion. Neither requires a human in the loop; the decision is a function of the metrics.
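The flow above can be sketched as a small pure function. This is a minimal illustration, not a real deploy API: `CanaryPlan`, `Decision`, and `decide` are hypothetical names, and the traffic figure in the usage note is made up for the arithmetic.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class CanaryPlan:
    canary_percent: int = 5    # share of traffic sent to the new revision
    window_minutes: int = 20   # fixed observation window

    def weights(self) -> dict[str, int]:
        """Traffic split while the canary is under observation."""
        return {"canary": self.canary_percent, "stable": 100 - self.canary_percent}

    def exposed_requests(self, requests_per_minute: int) -> int:
        """Upper bound on requests the canary serves before the gate decides."""
        return requests_per_minute * self.window_minutes * self.canary_percent // 100


def decide(slos_healthy: bool) -> tuple[Decision, dict[str, int]]:
    """Map the gate result to a decision and the final traffic split."""
    if slos_healthy:
        return Decision.PROMOTE, {"canary": 100, "stable": 0}
    return Decision.ROLLBACK, {"canary": 0, "stable": 100}
```

At an assumed 1,000 requests per minute, `CanaryPlan().exposed_requests(1000)` is 1,000: the canary sees at most 1,000 of the 20,000 requests served during the window.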
What SLOs gate the decision
The canary is judged against objectives that a bad deploy would visibly violate:
| SLO | Healthy | Breach signals |
|---|---|---|
| Error rate | within baseline | a new 5xx source |
| Latency p95/p99 | within baseline | a regression in the tail |
| Readiness | DB + mount + queue healthy | a broken dependency |
| Log errors | quiet | DataError, tracebacks, migration drift |
A canary that holds error rate and latency within baseline for the window earns promotion. One that introduces a 5xx source or a p99 spike trips the gate and rolls back automatically. Crucially, log errors count: a revision can return 200s while a background poller floods the logs with "column does not exist" errors, so "200 OK" alone is not the bar.
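One way to make "within baseline" concrete is to compare the canary's window statistics against the stable revision's, with explicit tolerances. The sketch below is an assumption-laden illustration: `WindowStats`, `Tolerances`, `breaches`, and every threshold value are hypothetical, not taken from a real pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    error_rate: float       # fraction of requests answered with a 5xx
    p99_latency_ms: float   # tail latency over the window
    log_error_count: int    # e.g. DataError lines, tracebacks


@dataclass(frozen=True)
class Tolerances:
    # Hypothetical values; tune them to your own baseline noise.
    error_rate_abs: float = 0.001   # allowed absolute increase in error rate
    latency_ratio: float = 1.2      # allowed p99 growth factor
    max_log_errors: int = 0         # "quiet" means none


def breaches(canary: WindowStats, baseline: WindowStats,
             tol: Tolerances = Tolerances()) -> list[str]:
    """Return the names of the SLOs the canary violates (empty means healthy)."""
    found = []
    if canary.error_rate > baseline.error_rate + tol.error_rate_abs:
        found.append("error_rate")
    if canary.p99_latency_ms > baseline.p99_latency_ms * tol.latency_ratio:
        found.append("latency_p99")
    if canary.log_error_count > tol.max_log_errors:
        found.append("log_errors")
    return found


baseline = WindowStats(error_rate=0.002, p99_latency_ms=400.0, log_error_count=0)
# Every request returns 200, but a background poller is logging errors.
quiet_but_broken = WindowStats(error_rate=0.002, p99_latency_ms=390.0, log_error_count=37)
```

Here `breaches(quiet_but_broken, baseline)` returns `["log_errors"]`, so the gate trips even though the HTTP-level numbers look fine.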
A 200 is necessary, not sufficient
Readiness results are cached for seconds at a time and can stay green while the request path degrades. The auto-rollback gate watches error rate, latency tail, and log errors over the whole window, not a single health ping. The failure mode we're guarding against is exactly the deploy that looks healthy for the first thirty seconds and isn't.
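The difference between a single ping and a whole-window check can be shown with a toy series. This sketch assumes the window is sampled in 30-second buckets. The sample values, the threshold, and the `first_breach` helper are all hypothetical.

```python
from typing import Optional, Sequence


def first_breach(samples: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first sample above the threshold, or None if the window is clean."""
    for index, value in enumerate(samples):
        if value > threshold:
            return index
    return None


# A 20-minute window as 40 buckets of 30 seconds (made-up error rates):
# healthy for the first five buckets, degraded afterwards.
error_rates = [0.001] * 5 + [0.04] * 35
THRESHOLD = 0.005

single_ping_ok = error_rates[0] <= THRESHOLD          # the first check looks fine
breach_bucket = first_breach(error_rates, THRESHOLD)  # the window check finds bucket 5
```

A gate that only looked at the first bucket would promote this revision; scanning the whole window catches the regression that starts two and a half minutes in.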
Rollback is a traffic flip, never a data operation
The most important safety property: rollback reverts traffic, not data. Auto-rollback points 100% of traffic back at the known-good stable revision — it does not touch the database, run a down-migration, or wipe anything. This is only safe because migrations are additive: the stable revision still works against the new schema, so flipping back is instant and harmless. A rollback that had to undo schema changes wouldn't be a rollback; it'd be a second risky deploy under pressure.
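A sketch of both halves of that property follows: a rollback that only rewrites traffic weights, and a coarse pre-deploy check that a migration looks additive. The function names and the pattern list are hypothetical, and the regex check is a heuristic assumption rather than a real SQL parser.

```python
from __future__ import annotations

import re

# Statements that would break a stable revision still running the old code.
# Heuristic only: this does not parse SQL.
DESTRUCTIVE_PATTERNS = [
    r"\bDROP\s+TABLE\b",
    r"\bDROP\s+COLUMN\b",
    r"\bRENAME\b",
    r"\bALTER\s+COLUMN\b.*\bTYPE\b",
    r"\bTRUNCATE\b",
]


def is_additive(sql: str) -> bool:
    """True if the statement matches none of the destructive patterns."""
    return not any(re.search(p, sql, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS)


def rollback(weights: dict[str, int]) -> dict[str, int]:
    """Send all traffic to 'stable'. Takes no database handle, by design."""
    return {name: (100 if name == "stable" else 0) for name in weights}
```

For example, `is_additive("ALTER TABLE orders ADD COLUMN note text NULL")` is true, while a `DROP COLUMN` statement is rejected. The check is deliberately coarse: it would miss, for instance, a new NOT NULL column without a default, which also breaks inserts from the old revision.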
Tiered gating: not every change auto-promotes
Automatic promotion is appropriate for low-risk changes; risky ones stay behind a human gate. A risk classifier tags each change, and only the safe tier auto-promotes within the window; higher-risk changes wait for a person to approve them. The point isn't to remove humans from every decision. It's to remove them from the routine ones, so attention is spent on the changes that actually warrant it. The content-level invariants (credit safety, RLS, virtual-key-only) are enforced by tests and reviewers regardless of tier; the canary gates operational health, not correctness.
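The tiering could look roughly like the sketch below. The classification rule (risky path prefixes) is invented for illustration, since the criteria depend on the codebase. The sketch also assumes that a breached canary rolls back automatically in every tier, with the human gate applying only to promotion.

```python
from __future__ import annotations

from enum import Enum


class Tier(Enum):
    AUTO = "auto-promote"
    HUMAN = "human-gated"


# Hypothetical rule: changes under these paths are treated as risky.
RISKY_PREFIXES = ("migrations/", "auth/", "billing/")


def classify(changed_paths: list[str]) -> Tier:
    """Tag a change by the riskiest path it touches."""
    if any(path.startswith(RISKY_PREFIXES) for path in changed_paths):
        return Tier.HUMAN
    return Tier.AUTO


def after_window(tier: Tier, slos_healthy: bool) -> str:
    """What the pipeline does when the observation window ends."""
    if not slos_healthy:
        return "rollback"  # assumed to be automatic for every tier
    return "promote" if tier is Tier.AUTO else "await-approval"
```

With these rules, a change touching only `docs/readme.md` auto-promotes after a healthy window, while one touching `migrations/0042_add_note.py` waits for approval even when every SLO is green.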
Why bounded windows beat indefinite watching
A human "keeping an eye on it" has no defined end and no defined criteria — attention drifts, and "looks fine" is not a measurement. A fixed window with explicit SLOs is both bounded (it ends) and decidable (promote or roll back, by the numbers). It also composes: every deploy gets the same 20-minute, same-SLO treatment, so deploy safety is a property of the pipeline, not of who happened to be watching.
The takeaway
Deploys shouldn't depend on a human's vigilance. Ship to a small canary slice to bound blast radius, judge it against explicit SLOs (error rate, latency tail, and log errors) over a fixed window, and auto-promote or auto-rollback by the numbers. Keep rollback a traffic flip (safe because migrations are additive), keep the risky tier human-gated, and deploy safety becomes a pipeline property instead of a 3 a.m. page.