Disaster Recovery Planning: RTO, RPO, and the Tests Nobody Runs
Here's a question that keeps CTOs up at night: if your primary datacenter went offline right now — fire, power failure, fiber cut — how long until your customers can use your product again?
If the answer is "I don't know" or "it depends on who's awake," you don't have a disaster recovery plan. You have a disaster recovery document.
The difference between a plan and a document is testing.
RTO and RPO: The Two Numbers That Matter
Before designing anything, define your targets:
RTO (Recovery Time Objective): How long can you be down?
- 4 hours? That's a weekend migration.
- 1 hour? That's automated failover with manual verification.
- 5 minutes? That's active-active multi-region.
RPO (Recovery Point Objective): How much data can you lose?
- 24 hours? Daily backups are fine.
- 1 hour? Hourly snapshots or streaming replication.
- 0? Synchronous replication across sites.
These targets drive every architectural decision. And they should come from the business, not engineering:
| Service Tier | RTO | RPO | Strategy |
|---|---|---|---|
| Tier 1 (revenue-critical) | 15 min | 0 | Active-active, sync replication |
| Tier 2 (important) | 1 hour | 15 min | Warm standby, async replication |
| Tier 3 (internal tools) | 4 hours | 1 hour | Cold standby, hourly snapshots |
| Tier 4 (non-critical) | 24 hours | 24 hours | Backups only |
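The tier table above is most useful when it's machine-readable — runbooks and monitoring can then check against the same targets the business signed off on. A minimal sketch (the `dr_targets` helper and minute-based encoding are our own convention, not a standard):

```bash
#!/bin/bash
# Map a service tier to its RTO/RPO targets in minutes, per the tier table.
dr_targets() {
  case "$1" in
    1) echo "rto=15 rpo=0" ;;        # active-active, sync replication
    2) echo "rto=60 rpo=15" ;;       # warm standby, async replication
    3) echo "rto=240 rpo=60" ;;      # cold standby, hourly snapshots
    4) echo "rto=1440 rpo=1440" ;;   # backups only
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

dr_targets 2   # → rto=60 rpo=15
```

A drill script can compare its measured recovery time against `dr_targets` and fail the test automatically when the target is missed.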
The DR Strategies
Backup and Restore (Cold DR)
The simplest strategy. Take backups. Store them offsite. Restore when needed.
```
Primary DC ──── Backups ────→ Object Storage (different region)
                (nightly)
```
Disaster: Provision new infra → Restore from backup → DNS switch
Time: 4-24 hours
Good for: Tier 3-4 services, development environments, cost-sensitive workloads.
The catch: Your RTO depends on how fast you can provision new infrastructure AND restore data. A 2TB database restore takes hours, not minutes.
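You can sanity-check a cold-DR RTO with back-of-the-envelope arithmetic before a disaster forces you to. A sketch — the 150 MB/s sustained restore throughput is an assumption; measure your own on real hardware:

```bash
#!/bin/bash
# Rough restore-time estimate: backup size / sustained restore throughput.
size_mb=$((2 * 1024 * 1024))   # 2 TB backup, in MB
throughput_mbps=150            # MB/s sustained restore rate (assumed — measure yours)
seconds=$((size_mb / throughput_mbps))
printf 'estimated restore: ~%dh %dm\n' $((seconds / 3600)) $((seconds % 3600 / 60))
```

That's nearly four hours for the restore alone — before infrastructure provisioning, DNS propagation, or verification. If the arithmetic already exceeds your RTO, cold DR is the wrong tier.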
Warm Standby (Pilot Light)
Keep a minimal replica environment running. Core infrastructure is pre-provisioned but scaled down.
```
 Primary DC                          DR Site
┌───────────┐                     ┌───────────┐
│ 8 app     │       async        │ 1 app     │  (scaled down)
│ servers   │    replication     │ server    │
│           │  ──────────────→   │           │
│ DB primary│                    │ DB replica│
│ 3 nodes   │                    │ 1 node    │
└───────────┘                    └───────────┘
```
Disaster: Scale up DR site → Promote DB replica → DNS switch
Time: 15-60 minutes
Good for: Tier 2 services. The DR site costs ~20% of production (it's running but minimal). Scale-up is fast because the base infrastructure exists.
Active-Passive (Hot Standby)
Full replica environment, fully scaled, receiving real-time data replication. Ready to serve traffic immediately.
```
 Primary DC                          DR Site
┌───────────┐                     ┌───────────┐
│ 8 app     │     sync/async     │ 8 app     │  (full scale)
│ servers   │    replication     │ servers   │
│           │  ──────────────→   │           │
│ DB primary│                    │ DB replica│
│ 3 nodes   │                    │ 3 nodes   │
└───────────┘                    └───────────┘
```
Disaster: Promote DB replica → DNS switch
Time: 5-15 minutes
Good for: Tier 1 services. Cost is ~100% of production (you're running two of everything). But failover is fast.
Active-Active (Multi-Region)
Both sites serve traffic simultaneously. No failover needed — if one site goes down, the other absorbs the load.
```
         ┌──── Load Balancer (Global) ────┐
         │                                │
     Region A                         Region B
┌──────────────┐                  ┌──────────────┐
│ App servers  │                  │ App servers  │
│ DB node      │ ←──── sync ────→ │ DB node      │
└──────────────┘                  └──────────────┘
```
Good for: Tier 1 services that absolutely cannot go down. Cost is 2x+ (and the complexity is 5x). You need to solve data consistency, conflict resolution, and write routing.
The Backup Strategy
Regardless of DR tier, backups are the foundation. Our backup strategy:
The 3-2-1 Rule
- 3 copies of your data
- 2 different storage types
- 1 offsite (different geographic region)
```
Production DB
├── Streaming replica (same DC, different rack)
├── Daily snapshot → local NVMe backup server
└── Daily snapshot → S3-compatible storage (different region)
    └── Retained: 7 daily, 4 weekly, 12 monthly
```
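The 7-daily / 4-weekly / 12-monthly retention schedule reduces to a keep-or-delete decision per snapshot. A minimal sketch of that decision as pure date logic — the `keep_snapshot` helper and its Sunday/first-of-month conventions are our own; wire it to your actual snapshot tooling:

```bash
#!/bin/bash
# keep_snapshot AGE_DAYS DAY_OF_WEEK DAY_OF_MONTH  →  "keep" or "delete"
# Policy: keep 7 dailies, Sundays (DOW=7) up to 4 weeks, 1st-of-month up to a year.
keep_snapshot() {
  local age=$1 dow=$2 dom=$3
  if   [ "$age" -le 7 ]; then echo keep                          # daily window
  elif [ "$age" -le 28 ]  && [ "$dow" -eq 7 ]; then echo keep    # weekly window
  elif [ "$age" -le 365 ] && [ "$dom" -eq 1 ]; then echo keep    # monthly window
  else echo delete
  fi
}

keep_snapshot 3 2 14    # → keep   (inside the daily window)
keep_snapshot 20 7 12   # → keep   (a Sunday inside the weekly window)
keep_snapshot 45 3 1    # → keep   (1st of the month, inside the monthly window)
keep_snapshot 45 3 14   # → delete
```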
Testing Backups
A backup that hasn't been restored is not a backup. It's a hope.
We test restores every month:
```bash
#!/bin/bash
set -euo pipefail
# Monthly backup verification

# 1. Download latest backup from offsite storage
mkdir -p /tmp/restore-test
aws s3 cp s3://backups/db/latest.dump /tmp/restore-test/

# 2. Restore to an isolated test instance
createdb -h test-db restore_test
pg_restore -h test-db -d restore_test /tmp/restore-test/latest.dump

# 3. Run validation queries
psql -h test-db -d restore_test -c "
  SELECT count(*) FROM users;
  SELECT count(*) FROM orders WHERE created_at > now() - interval '24 hours';
  SELECT max(created_at) FROM events;
"

# 4. Compare row counts with production
# 5. Alert if deviation > 0.1%

# 6. Cleanup
dropdb -h test-db restore_test
```
If step 2 fails, you find out now — not during an actual disaster.
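Step 5's deviation check is worth spelling out, since "close enough" needs a number. A sketch of the comparison using the 0.1% threshold (the `check_deviation` helper is ours; feed it counts from production and the restored copy):

```bash
#!/bin/bash
# Fail (exit 1) if the restored row count deviates from production by > 0.1%.
check_deviation() {
  local prod=$1 restored=$2
  awk -v p="$prod" -v r="$restored" 'BEGIN {
    dev = (p - r < 0 ? r - p : p - r) / p * 100
    printf "deviation: %.3f%%\n", dev
    exit (dev > 0.1) ? 1 : 0
  }'
}

check_deviation 1000000 999500 \
  || echo "ALERT: restore deviates from production"   # → deviation: 0.050% (within threshold)
```

A non-zero exit here should page someone, not just log a line.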
DNS and Traffic Switching
When disaster strikes, you need to redirect traffic to the DR site. Options:
DNS Failover
```
api.kicked.ro → Primary IP   (health check: pass)
      ↓ health check fails
api.kicked.ro → DR IP        (automatic switch)
```
TTL matters. If your DNS TTL is 3600 (1 hour), clients will keep hitting the dead primary for up to an hour after you switch. We set critical service TTLs to 60 seconds.
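The worst-case client-visible outage is roughly detection time plus switch time plus one full TTL, since cached answers live until the TTL expires. A quick sketch of that arithmetic (the detection and switch figures are illustrative assumptions):

```bash
#!/bin/bash
# Worst-case DNS failover window: detect + switch + one full TTL (seconds).
detect=120   # time to detect the failure (assumed)
switch=60    # time to update the record (assumed)

ttl=60
window=$((detect + switch + ttl))
echo "worst case with TTL=${ttl}s: ${window}s"                    # 4 minutes

ttl=3600
echo "worst case with TTL=${ttl}s: $((detect + switch + ttl))s"   # over an hour
```

This is why lowering the TTL is the cheapest DR improvement most teams can make — the other two terms require engineering work.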
Anycast
With anycast, the same IP is announced from multiple locations via BGP. Traffic automatically routes to the nearest healthy site. This is how we handle failover on AS210622 — no DNS changes needed.
Global Load Balancer
Cloudflare, AWS Global Accelerator, or similar. Health checks + automatic failover at the network edge. Fastest option for HTTP/S traffic.
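Edge health checks typically require several consecutive failures before failing over, so a single dropped probe doesn't flap traffic between sites. A minimal sketch of that debounce logic — the probe itself is stubbed; real checkers probe from multiple vantage points:

```bash
#!/bin/bash
# Fail over only after 3 consecutive failed probes; any success resets the count.
THRESHOLD=3
fails=0
process_probe() {   # $1 = probe result: "ok" or "fail"
  if [ "$1" = ok ]; then fails=0; else fails=$((fails + 1)); fi
  [ "$fails" -ge "$THRESHOLD" ] && echo FAILOVER || echo "primary ($fails/$THRESHOLD fails)"
}

# One blip, then a sustained outage:
for r in ok fail fail ok fail fail fail; do process_probe "$r"; done
```

Tuning the threshold trades detection speed against stability: 3 probes at 10-second intervals adds up to 30 seconds to your RTO.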
The DR Test: Chaos Day
The only way to know your DR works is to test it. We run quarterly DR tests:
Test Procedure
- Announce the test — Stakeholders know it's coming (at first; graduate to unannounced tests once the process is solid)
- Simulate failure — Block traffic to primary, stop database, etc.
- Execute runbook — Follow the documented procedure step by step
- Measure — Record actual RTO and RPO
- Restore — Fail back to primary
- Post-mortem — What broke? What was slower than expected? Update the runbook.
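The measure step is easiest when the runbook itself is executable: the script that performs the failover is also the stopwatch. A sketch of a drill driver — every step here is a stub to be replaced with your real commands:

```bash
#!/bin/bash
set -euo pipefail
# DR drill driver: execute the runbook steps in order, timing the whole drill.
# Each function is a placeholder — substitute your actual failover commands.
promote_replica() { echo "step: promote DB replica"; }
switch_traffic()  { echo "step: switch DNS/LB to DR site"; }
verify_service()  { echo "step: verify health endpoints"; }

start=$SECONDS
promote_replica
switch_traffic
verify_service
echo "measured RTO: $((SECONDS - start))s"
```

An executable runbook can't silently drift the way a wiki page does: when a step's command changes, the next drill fails loudly instead of succeeding on paper.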
What We Track
| Metric | Target | Last Test |
|---|---|---|
| Time to detect failure | < 2 min | 1 min 23s |
| Time to initiate DR | < 5 min | 3 min 45s |
| Time to full service (RTO) | < 15 min | 11 min 12s |
| Data loss (RPO) | < 1 min | 0 (sync replication) |
| Runbook accuracy | 100% | 94% (2 steps outdated) |
That last metric — runbook accuracy — is why you test. Every test reveals steps that have drifted from reality.
Common Failures We've Seen
- DNS TTL too high — 1-hour TTL means 1-hour failover. Set it to 60s for critical services.
- Backup restore never tested — The backup was corrupted for 3 months. Nobody knew.
- DR site undersized — "We'll scale up when we need it." Scaling takes 20 minutes. Your RTO is 15.
- Credentials not synced — DR site can't connect to payment provider because API keys weren't replicated.
- No runbook — The one person who knew the procedure was on vacation.
- Single-region backups — Backups in the same DC as production. DC goes down, backups go with it.
Start Today
- Define your tiers — Not everything needs active-active. Classify your services.
- Implement 3-2-1 backups — Offsite, tested monthly.
- Write the runbook — Step by step, assuming the reader has never done it before.
- Test once — Even a single test will reveal critical gaps.
- Schedule quarterly tests — Put it on the calendar. It won't happen otherwise.
Need help building or testing your DR plan? We've designed disaster recovery for everything from single-server setups to multi-region platforms. Let's make sure your infrastructure survives the worst day.