# Zero-Downtime Deployments: Blue-Green, Canary, and Rolling Strategies Compared

> "We deploy during maintenance windows on Saturday nights."
If that's your deployment strategy, you're shipping slower than you need to and burning out your team. Modern deployment strategies let you ship to production during business hours, multiple times a day, with zero customer impact.
But choosing the right strategy matters. Each one has tradeoffs that rarely get discussed.
## The Problem with Simple Deployments
The naive approach — stop old version, start new version — creates downtime:

```
v1 running → v1 stopped → v2 starting → v2 ready
                ↑                          ↑
         downtime starts             downtime ends
```
Even if it's only 30 seconds, at scale that means thousands of failed requests, broken WebSocket connections, and angry customers.
## Strategy 1: Rolling Deployment
The simplest zero-downtime strategy. Replace instances one at a time.
```
Time 0: [v1] [v1] [v1] [v1]   ← all v1
Time 1: [v2] [v1] [v1] [v1]   ← first pod updated
Time 2: [v2] [v2] [v1] [v1]   ← second pod updated
Time 3: [v2] [v2] [v2] [v1]   ← third pod updated
Time 4: [v2] [v2] [v2] [v2]   ← all v2, done
```
In Kubernetes, this is the default:
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # one extra pod during the update
      maxUnavailable: 0   # never reduce below the desired count
```
Pros:
- Simple to implement (Kubernetes default)
- Resource efficient — only one extra pod at a time
- Works for stateless services out of the box
Cons:
- v1 and v2 run simultaneously — Your app must handle this. Database migrations, API changes, and shared state can break.
- Slow rollback — Rolling back is another rolling update. If v2 is crashing, you're waiting minutes to fully revert.
- Hard to test in production — You can't route specific users to v2 for testing.
Best for: Stateless services with backward-compatible changes. Most web applications.
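For a rolling update to be genuinely zero-downtime, each pod also needs a readiness probe (so traffic only reaches pods that can serve) and a graceful-shutdown window for in-flight requests. A minimal sketch of the relevant pod-template fields — the endpoint path, port, and timings are assumptions to adjust for your service:

```yaml
# Pod template fragment: gate traffic on readiness, drain on shutdown.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30   # time allowed for in-flight requests
      containers:
        - name: api
          image: api:v2
          readinessProbe:                 # pod receives traffic only once ready
            httpGet:
              path: /healthz              # assumed health endpoint
              port: 8080
            periodSeconds: 5
          lifecycle:
            preStop:                      # brief sleep lets the load balancer
              exec:                       # deregister the pod before shutdown
                command: ["sleep", "5"]
```

Without the readiness probe, Kubernetes routes traffic to new pods the moment the container starts, which reintroduces failed requests during every rollout.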
## Strategy 2: Blue-Green Deployment
Run two identical environments. Switch traffic atomically.
```
Before the switch:

Blue  (v1): [v1] [v1] [v1] [v1]  ← serving traffic
Green (v2): [v2] [v2] [v2] [v2]  ← ready, not serving

        Load Balancer
              │
    ┌─────────┴──────────┐
    │ Switch: Blue→Green │
    └────────────────────┘

After the switch:

Blue  (v1): [v1] [v1] [v1] [v1]  ← idle (rollback target)
Green (v2): [v2] [v2] [v2] [v2]  ← serving traffic
```
```yaml
# Kubernetes: switch the Service selector
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: green   # flip this: blue ↔ green
  ports:
    - port: 80
      targetPort: 8080
```
Pros:
- Instant rollback — Switch the selector back to blue. Done in seconds.
- No version mixing — All traffic hits v1 OR v2, never both.
- Full testing — Green environment can be tested completely before switching.
Cons:
- Double the resources — You need two full environments running simultaneously.
- Database migrations are tricky — Both versions need to work with the same database schema during the switch.
- Connection draining — Long-running requests on blue will be killed unless you drain properly.
Best for: Critical services where instant rollback is essential. Services with strict version compatibility requirements.
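The selector flip above implies two parallel Deployments, one labeled `blue` and one labeled `green`. A sketch of the green side — names, labels, and image tag are assumptions consistent with the Service example:

```yaml
# Green Deployment: identical to blue except for the version label and image.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
      version: green
  template:
    metadata:
      labels:
        app: api
        version: green   # the Service selector flips between blue and green
    spec:
      containers:
        - name: api
          image: api:v2
          ports:
            - containerPort: 8080
```

Deploy and verify `api-green` first, then flip the Service selector; keep the blue Deployment running until you are confident you won't need the instant rollback.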
## Strategy 3: Canary Deployment
Route a small percentage of traffic to the new version. Gradually increase if healthy.
```
Time 0: v1: 100% | v2:   0%   ← deploy v2 canary
Time 1: v1:  95% | v2:   5%   ← watch error rates
Time 2: v1:  80% | v2:  20%   ← still healthy, increase
Time 3: v1:  50% | v2:  50%   ← looking good
Time 4: v1:   0% | v2: 100%   ← full promotion
```
With an Istio VirtualService:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api
  http:
    - route:
        - destination:
            host: api
            subset: stable
          weight: 95
        - destination:
            host: api
            subset: canary
          weight: 5
```
Pros:
- Minimum blast radius — If v2 is broken, only 5% of users are affected.
- Real production testing — You see how v2 behaves with real traffic, real data, real load.
- Data-driven promotion — Promote based on metrics (error rate, latency), not gut feeling.
Cons:
- Requires traffic splitting — You need a service mesh (Istio, Linkerd) or smart load balancer.
- Slow — Full promotion takes time (intentionally). Not great for urgent fixes.
- Metric granularity — You need per-version metrics to compare v1 vs. v2. Your observability must be solid.
Best for: High-traffic services where you want maximum safety. Services with complex behavior that's hard to test in staging.
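The VirtualService above routes by subset name, and Istio resolves those names through a DestinationRule that maps each subset to pod labels. A sketch of the companion resource — the label values are assumptions:

```yaml
# DestinationRule: defines the subsets the VirtualService's weights refer to.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api
spec:
  host: api
  subsets:
    - name: stable
      labels:
        version: v1   # pods labeled version=v1 receive the 95% weight
    - name: canary
      labels:
        version: v2   # pods labeled version=v2 receive the 5% weight
```

Without a matching DestinationRule, the subset-based routes in the VirtualService have nothing to resolve to and traffic splitting fails.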
## Our Recommendation: Argo Rollouts
We use Argo Rollouts for most deployments. It's a Kubernetes controller that adds canary and blue-green strategies as native CRDs:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 20
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2   # start analysis at the 20% step
        args:
          - name: service-name
            value: api
```
The analysis block automatically checks metrics during promotion. If the error rate exceeds the threshold, the rollout automatically aborts and rolls back. No human intervention needed at 3 AM.
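The `success-rate` template referenced above is defined separately as an AnalysisTemplate. A sketch of what one might look like using the Prometheus metrics provider — the Prometheus address, metric names, and threshold are assumptions to adapt to your observability stack:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m                       # measure once per minute
      successCondition: result[0] >= 0.95
      failureLimit: 3                    # abort after 3 failed measurements
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # assumed in-cluster address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

The query computes the fraction of non-5xx responses; if it drops below 0.95 three times, Argo Rollouts aborts the canary and shifts traffic back to stable.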
## Database Migrations: The Hard Part
Every strategy has the same Achilles heel: database migrations. If v2 requires a new column, you need it before v2 starts serving. But v1 is still running.
The solution: expand-contract migrations.
```
Step 1: Add new column (nullable)    ← v1 still works, ignores new column
Step 2: Deploy v2 (writes to both)   ← v2 uses new column, v1 still works
Step 3: Backfill old data            ← new column fully populated
Step 4: Deploy v3 (reads from new)   ← fully migrated
Step 5: Drop old column              ← cleanup
```
Never do breaking schema changes in a single deployment. Always expand first, migrate data, then contract.
## Quick Reference
| | Rolling | Blue-Green | Canary |
|---|---|---|---|
| Resource cost | Low | 2x | Low-Medium |
| Rollback speed | Minutes | Seconds | Seconds |
| Version mixing | Yes | No | Yes (controlled) |
| Blast radius | Gradual | All or nothing | Controllable |
| Complexity | Low | Medium | High |
| Infra required | Kubernetes | Kubernetes + extra env | Service mesh or Argo |
## Start Shipping Faster
If you're still doing maintenance-window deployments, start with rolling updates — they're the Kubernetes default and require zero extra tooling. Once you're comfortable, graduate to canary with Argo Rollouts.
Need help setting up zero-downtime deployments? We've done it hundreds of times.