# Zero-Downtime Deployments: Blue-Green, Canary, and Rolling Strategies Compared

> "We deploy during maintenance windows on Saturday nights."
If that's your deployment strategy, you're shipping slower than you need to and burning out your team. Modern deployment strategies let you ship to production during business hours, multiple times a day, with zero customer impact.
But choosing the right strategy matters. Each one has tradeoffs that rarely get discussed.
## The Problem with Simple Deployments
The naive approach — stop old version, start new version — creates downtime:

```
v1 running → v1 stopped → v2 starting → v2 ready
                ↑                          ↑
         downtime starts             downtime ends
```
Even if it's only 30 seconds, at scale that means thousands of failed requests, broken WebSocket connections, and angry customers.
## Strategy 1: Rolling Deployment
The simplest zero-downtime strategy. Replace instances one at a time.
```
Time 0: [v1] [v1] [v1] [v1]   ← all v1
Time 1: [v2] [v1] [v1] [v1]   ← first pod updated
Time 2: [v2] [v2] [v1] [v1]   ← second pod updated
Time 3: [v2] [v2] [v2] [v1]   ← third pod updated
Time 4: [v2] [v2] [v2] [v2]   ← all v2, done
```
In Kubernetes, this is the default:
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # one extra pod during the update
      maxUnavailable: 0   # never reduce below the desired count
```
Pros:
- Simple to implement (Kubernetes default)
- Resource efficient — only one extra pod at a time
- Works for stateless services out of the box
Cons:
- v1 and v2 run simultaneously — Your app must handle this. Database migrations, API changes, and shared state can break.
- Slow rollback — Rolling back is another rolling update. If v2 is crashing, you're waiting minutes to fully revert.
- Hard to test in production — You can't route specific users to v2 for testing.
Best for: Stateless services with backward-compatible changes. Most web applications.
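For a rolling update to be genuinely zero-downtime, each pod also needs a readiness probe (so traffic only reaches pods that can serve) and a graceful-shutdown window for in-flight requests. A minimal sketch of the relevant pod-template fields — the endpoint path, port, and timings are assumptions to adjust for your service:

```yaml
# Pod template fragment: gate traffic on readiness, drain on shutdown.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30   # time allowed for in-flight requests
      containers:
        - name: api
          image: api:v2
          readinessProbe:                 # pod receives traffic only once ready
            httpGet:
              path: /healthz              # assumed health endpoint
              port: 8080
            periodSeconds: 5
          lifecycle:
            preStop:                      # brief sleep lets the load balancer
              exec:                       # deregister the pod before shutdown
                command: ["sleep", "5"]
```

Without the readiness probe, Kubernetes routes traffic to new pods the moment the container starts, which reintroduces failed requests during every rollout.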
## Strategy 2: Blue-Green Deployment
Run two identical environments. Switch traffic atomically.
```
Before the switch:

Blue  (v1): [v1] [v1] [v1] [v1]  ← serving traffic
Green (v2): [v2] [v2] [v2] [v2]  ← ready, not serving

        Load Balancer
              │
    ┌─────────┴──────────┐
    │ Switch: Blue→Green │
    └────────────────────┘

After the switch:

Blue  (v1): [v1] [v1] [v1] [v1]  ← idle (rollback target)
Green (v2): [v2] [v2] [v2] [v2]  ← serving traffic
```
```yaml
# Kubernetes: switch the Service selector
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: green   # flip this: blue ↔ green
  ports:
    - port: 80
      targetPort: 8080
```
Pros:
- Instant rollback — Switch the selector back to blue. Done in seconds.
- No version mixing — All traffic hits v1 OR v2, never both.
- Full testing — Green environment can be tested completely before switching.
Cons:
- Double the resources — You need two full environments running simultaneously.
- Database migrations are tricky — Both versions need to work with the same database schema during the switch.
- Connection draining — Long-running requests on blue will be killed unless you drain properly.
Best for: Critical services where instant rollback is essential. Services with strict version compatibility requirements.
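The selector flip above implies two parallel Deployments, one labeled `blue` and one labeled `green`. A sketch of the green side — names, labels, and image tag are assumptions consistent with the Service example:

```yaml
# Green Deployment: identical to blue except for the version label and image.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
      version: green
  template:
    metadata:
      labels:
        app: api
        version: green   # the Service selector flips between blue and green
    spec:
      containers:
        - name: api
          image: api:v2
          ports:
            - containerPort: 8080
```

Deploy and verify `api-green` first, then flip the Service selector; keep the blue Deployment running until you are confident you won't need the instant rollback.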
## Strategy 3: Canary Deployment
Route a small percentage of traffic to the new version. Gradually increase if healthy.
```
Time 0: v1: 100% | v2:   0%   ← deploy v2 canary
Time 1: v1:  95% | v2:   5%   ← watch error rates
Time 2: v1:  80% | v2:  20%   ← still healthy, increase
Time 3: v1:  50% | v2:  50%   ← looking good
Time 4: v1:   0% | v2: 100%   ← full promotion
```
With an Istio VirtualService:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api
  http:
    - route:
        - destination:
            host: api
            subset: stable
          weight: 95
        - destination:
            host: api
            subset: canary
          weight: 5
```
Pros:
- Minimum blast radius — If v2 is broken, only 5% of users are affected.
- Real production testing — You see how v2 behaves with real traffic, real data, real load.
- Data-driven promotion — Promote based on metrics (error rate, latency), not gut feeling.
Cons:
- Requires traffic splitting — You need a service mesh (Istio, Linkerd) or smart load balancer.
- Slow — Full promotion takes time (intentionally). Not great for urgent fixes.
- Metric granularity — You need per-version metrics to compare v1 vs. v2. Your observability must be solid.
Best for: High-traffic services where you want maximum safety. Services with complex behavior that's hard to test in staging.
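The VirtualService above routes by subset name, and Istio resolves those names through a DestinationRule that maps each subset to pod labels. A sketch of the companion resource — the label values are assumptions:

```yaml
# DestinationRule: defines the subsets the VirtualService's weights refer to.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api
spec:
  host: api
  subsets:
    - name: stable
      labels:
        version: v1   # pods labeled version=v1 receive the 95% weight
    - name: canary
      labels:
        version: v2   # pods labeled version=v2 receive the 5% weight
```

Without a matching DestinationRule, the subset-based routes in the VirtualService have nothing to resolve to and traffic splitting fails.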
## Our Recommendation: Argo Rollouts
We use Argo Rollouts for most deployments. It's a Kubernetes controller that adds canary and blue-green strategies as native CRDs:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 20
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2   # start analysis at the 20% step
        args:
          - name: service-name
            value: api
```
The analysis block automatically checks metrics during promotion. If the error rate exceeds the threshold, the rollout automatically aborts and rolls back. No human intervention needed at 3 AM.
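The `success-rate` template referenced above is defined separately as an AnalysisTemplate. A sketch of what one might look like using the Prometheus metrics provider — the Prometheus address, metric names, and threshold are assumptions to adapt to your observability stack:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m                       # measure once per minute
      successCondition: result[0] >= 0.95
      failureLimit: 3                    # abort after 3 failed measurements
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # assumed in-cluster address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

The query computes the fraction of non-5xx responses; if it drops below 0.95 three times, Argo Rollouts aborts the canary and shifts traffic back to stable.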
## Database Migrations: The Hard Part
Every strategy has the same Achilles heel: database migrations. If v2 requires a new column, you need it before v2 starts serving. But v1 is still running.
The solution: expand-contract migrations.
```
Step 1: Add new column (nullable)    ← v1 still works, ignores new column
Step 2: Deploy v2 (writes to both)   ← v2 uses new column, v1 still works
Step 3: Backfill old data            ← new column fully populated
Step 4: Deploy v3 (reads from new)   ← fully migrated
Step 5: Drop old column              ← cleanup
```
Never do breaking schema changes in a single deployment. Always expand first, migrate data, then contract.
## Quick Reference
| | Rolling | Blue-Green | Canary |
|---|---|---|---|
| Resource cost | Low | 2x | Low-Medium |
| Rollback speed | Minutes | Seconds | Seconds |
| Version mixing | Yes | No | Yes (controlled) |
| Blast radius | Gradual | All or nothing | Controllable |
| Complexity | Low | Medium | High |
| Infra required | Kubernetes | Kubernetes + extra env | Service mesh or Argo |
## Start Shipping Faster
If you're still doing maintenance-window deployments, start with rolling updates — they're the Kubernetes default and require zero extra tooling. Once you're comfortable, graduate to canary with Argo Rollouts.
Need help setting up zero-downtime deployments? We've done it hundreds of times.