Disaster Recovery Planning: RTO, RPO, and the Tests Nobody Runs
Here's a question that keeps CTOs up at night: if your primary datacenter went offline right now — fire, power failure, fiber cut — how long until your customers can use your product again?
If the answer is "I don't know" or "it depends on who's awake," you don't have a disaster recovery plan. You have a disaster recovery document.
The difference between a plan and a document is testing.
RTO and RPO: The Two Numbers That Matter
Before designing anything, define your targets:
RTO (Recovery Time Objective): How long can you be down?
- 4 hours? That's a weekend migration.
- 1 hour? That's automated failover with manual verification.
- 5 minutes? That's active-active multi-region.
RPO (Recovery Point Objective): How much data can you lose?
- 24 hours? Daily backups are fine.
- 1 hour? Hourly snapshots or streaming replication.
- 0? Synchronous replication across sites.
These targets drive every architectural decision. And they should come from the business, not engineering:
| Service Tier | RTO | RPO | Strategy |
|---|---|---|---|
| Tier 1 (revenue-critical) | 15 min | 0 | Active-active, sync replication |
| Tier 2 (important) | 1 hour | 15 min | Warm standby, async replication |
| Tier 3 (internal tools) | 4 hours | 1 hour | Cold standby, hourly snapshots |
| Tier 4 (non-critical) | 24 hours | 24 hours | Backups only |
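The tier table above is most useful when it's machine-readable — runbooks and monitoring can then check against the same targets the business signed off on. A minimal sketch (the `dr_targets` helper and minute-based encoding are our own convention, not a standard):

```bash
#!/bin/bash
# Map a service tier to its RTO/RPO targets in minutes, per the tier table.
dr_targets() {
  case "$1" in
    1) echo "rto=15 rpo=0" ;;        # active-active, sync replication
    2) echo "rto=60 rpo=15" ;;       # warm standby, async replication
    3) echo "rto=240 rpo=60" ;;      # cold standby, hourly snapshots
    4) echo "rto=1440 rpo=1440" ;;   # backups only
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

dr_targets 2   # → rto=60 rpo=15
```

A drill script can compare its measured recovery time against `dr_targets` and fail the test automatically when the target is missed.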
The DR Strategies
Backup and Restore (Cold DR)
The simplest strategy. Take backups. Store them offsite. Restore when needed.
```
Primary DC ──── Backups ────→ Object Storage (different region)
                (nightly)
```
Disaster: Provision new infra → Restore from backup → DNS switch
Time: 4-24 hours
Good for: Tier 3-4 services, development environments, cost-sensitive workloads.
The catch: Your RTO depends on how fast you can provision new infrastructure AND restore data. A 2TB database restore takes hours, not minutes.
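You can sanity-check a cold-DR RTO with back-of-the-envelope arithmetic before a disaster forces you to. A sketch — the 150 MB/s sustained restore throughput is an assumption; measure your own on real hardware:

```bash
#!/bin/bash
# Rough restore-time estimate: backup size / sustained restore throughput.
size_mb=$((2 * 1024 * 1024))   # 2 TB backup, in MB
throughput_mbps=150            # MB/s sustained restore rate (assumed — measure yours)
seconds=$((size_mb / throughput_mbps))
printf 'estimated restore: ~%dh %dm\n' $((seconds / 3600)) $((seconds % 3600 / 60))
```

That's nearly four hours for the restore alone — before infrastructure provisioning, DNS propagation, or verification. If the arithmetic already exceeds your RTO, cold DR is the wrong tier.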
Warm Standby (Pilot Light)
Keep a minimal replica environment running. Core infrastructure is pre-provisioned but scaled down.
```
 Primary DC                          DR Site
┌───────────┐                     ┌───────────┐
│ 8 app     │       async        │ 1 app     │  (scaled down)
│ servers   │    replication     │ server    │
│           │  ──────────────→   │           │
│ DB primary│                    │ DB replica│
│ 3 nodes   │                    │ 1 node    │
└───────────┘                    └───────────┘
```
Disaster: Scale up DR site → Promote DB replica → DNS switch
Time: 15-60 minutes
Good for: Tier 2 services. The DR site costs ~20% of production (it's running but minimal). Scale-up is fast because the base infrastructure exists.
Active-Passive (Hot Standby)
Full replica environment, fully scaled, receiving real-time data replication. Ready to serve traffic immediately.
```
 Primary DC                          DR Site
┌───────────┐                     ┌───────────┐
│ 8 app     │     sync/async     │ 8 app     │  (full scale)
│ servers   │    replication     │ servers   │
│           │  ──────────────→   │           │
│ DB primary│                    │ DB replica│
│ 3 nodes   │                    │ 3 nodes   │
└───────────┘                    └───────────┘
```
Disaster: Promote DB replica → DNS switch
Time: 5-15 minutes
Good for: Tier 1 services. Cost is ~100% of production (you're running two of everything). But failover is fast.
Active-Active (Multi-Region)
Both sites serve traffic simultaneously. No failover needed — if one site goes down, the other absorbs the load.
```
         ┌──── Load Balancer (Global) ────┐
         │                                │
     Region A                         Region B
┌──────────────┐                  ┌──────────────┐
│ App servers  │                  │ App servers  │
│ DB node      │ ←──── sync ────→ │ DB node      │
└──────────────┘                  └──────────────┘
```
Good for: Tier 1 services that absolutely cannot go down. Cost is 2x+ (and the complexity is 5x). You need to solve data consistency, conflict resolution, and write routing.
The Backup Strategy
Regardless of DR tier, backups are the foundation. Our backup strategy:
The 3-2-1 Rule
- 3 copies of your data
- 2 different storage types
- 1 offsite (different geographic region)
```
Production DB
├── Streaming replica (same DC, different rack)
├── Daily snapshot → local NVMe backup server
└── Daily snapshot → S3-compatible storage (different region)
    └── Retained: 7 daily, 4 weekly, 12 monthly
```
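The 7-daily / 4-weekly / 12-monthly retention schedule reduces to a keep-or-delete decision per snapshot. A minimal sketch of that decision as pure date logic — the `keep_snapshot` helper and its Sunday/first-of-month conventions are our own; wire it to your actual snapshot tooling:

```bash
#!/bin/bash
# keep_snapshot AGE_DAYS DAY_OF_WEEK DAY_OF_MONTH  →  "keep" or "delete"
# Policy: keep 7 dailies, Sundays (DOW=7) up to 4 weeks, 1st-of-month up to a year.
keep_snapshot() {
  local age=$1 dow=$2 dom=$3
  if   [ "$age" -le 7 ]; then echo keep                          # daily window
  elif [ "$age" -le 28 ]  && [ "$dow" -eq 7 ]; then echo keep    # weekly window
  elif [ "$age" -le 365 ] && [ "$dom" -eq 1 ]; then echo keep    # monthly window
  else echo delete
  fi
}

keep_snapshot 3 2 14    # → keep   (inside the daily window)
keep_snapshot 20 7 12   # → keep   (a Sunday inside the weekly window)
keep_snapshot 45 3 1    # → keep   (1st of the month, inside the monthly window)
keep_snapshot 45 3 14   # → delete
```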
Testing Backups
A backup that hasn't been restored is not a backup. It's a hope.
We test restores every month:
```bash
#!/bin/bash
set -euo pipefail
# Monthly backup verification

# 1. Download latest backup from offsite storage
mkdir -p /tmp/restore-test
aws s3 cp s3://backups/db/latest.dump /tmp/restore-test/

# 2. Restore to an isolated test instance
createdb -h test-db restore_test
pg_restore -h test-db -d restore_test /tmp/restore-test/latest.dump

# 3. Run validation queries
psql -h test-db -d restore_test -c "
  SELECT count(*) FROM users;
  SELECT count(*) FROM orders WHERE created_at > now() - interval '24 hours';
  SELECT max(created_at) FROM events;
"

# 4. Compare row counts with production
# 5. Alert if deviation > 0.1%

# 6. Cleanup
dropdb -h test-db restore_test
```
If step 2 fails, you find out now — not during an actual disaster.
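Step 5's deviation check is worth spelling out, since "close enough" needs a number. A sketch of the comparison using the 0.1% threshold (the `check_deviation` helper is ours; feed it counts from production and the restored copy):

```bash
#!/bin/bash
# Fail (exit 1) if the restored row count deviates from production by > 0.1%.
check_deviation() {
  local prod=$1 restored=$2
  awk -v p="$prod" -v r="$restored" 'BEGIN {
    dev = (p - r < 0 ? r - p : p - r) / p * 100
    printf "deviation: %.3f%%\n", dev
    exit (dev > 0.1) ? 1 : 0
  }'
}

check_deviation 1000000 999500 \
  || echo "ALERT: restore deviates from production"   # → deviation: 0.050% (within threshold)
```

A non-zero exit here should page someone, not just log a line.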
DNS and Traffic Switching
When disaster strikes, you need to redirect traffic to the DR site. Options:
DNS Failover
```
api.kicked.ro → Primary IP   (health check: pass)
      ↓ health check fails
api.kicked.ro → DR IP        (automatic switch)
```
TTL matters. If your DNS TTL is 3600 (1 hour), clients will keep hitting the dead primary for up to an hour after you switch. We set critical service TTLs to 60 seconds.
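The worst-case client-visible outage is roughly detection time plus switch time plus one full TTL, since cached answers live until the TTL expires. A quick sketch of that arithmetic (the detection and switch figures are illustrative assumptions):

```bash
#!/bin/bash
# Worst-case DNS failover window: detect + switch + one full TTL (seconds).
detect=120   # time to detect the failure (assumed)
switch=60    # time to update the record (assumed)

ttl=60
window=$((detect + switch + ttl))
echo "worst case with TTL=${ttl}s: ${window}s"                    # 4 minutes

ttl=3600
echo "worst case with TTL=${ttl}s: $((detect + switch + ttl))s"   # over an hour
```

This is why lowering the TTL is the cheapest DR improvement most teams can make — the other two terms require engineering work.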
Anycast
With anycast, the same IP is announced from multiple locations via BGP. Traffic automatically routes to the nearest healthy site. This is how we handle failover on AS210622 — no DNS changes needed.
Global Load Balancer
Cloudflare, AWS Global Accelerator, or similar. Health checks + automatic failover at the network edge. Fastest option for HTTP/S traffic.
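Edge health checks typically require several consecutive failures before failing over, so a single dropped probe doesn't flap traffic between sites. A minimal sketch of that debounce logic — the probe itself is stubbed; real checkers probe from multiple vantage points:

```bash
#!/bin/bash
# Fail over only after 3 consecutive failed probes; any success resets the count.
THRESHOLD=3
fails=0
process_probe() {   # $1 = probe result: "ok" or "fail"
  if [ "$1" = ok ]; then fails=0; else fails=$((fails + 1)); fi
  [ "$fails" -ge "$THRESHOLD" ] && echo FAILOVER || echo "primary ($fails/$THRESHOLD fails)"
}

# One blip, then a sustained outage:
for r in ok fail fail ok fail fail fail; do process_probe "$r"; done
```

Tuning the threshold trades detection speed against stability: 3 probes at 10-second intervals adds up to 30 seconds to your RTO.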
The DR Test: Chaos Day
The only way to know your DR works is to test it. We run quarterly DR tests:
Test Procedure
- Announce the test — Stakeholders know it's coming (at first; graduate to unannounced tests once the process is solid)
- Simulate failure — Block traffic to primary, stop database, etc.
- Execute runbook — Follow the documented procedure step by step
- Measure — Record actual RTO and RPO
- Restore — Fail back to primary
- Post-mortem — What broke? What was slower than expected? Update the runbook.
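The measure step is easiest when the runbook itself is executable: the script that performs the failover is also the stopwatch. A sketch of a drill driver — every step here is a stub to be replaced with your real commands:

```bash
#!/bin/bash
set -euo pipefail
# DR drill driver: execute the runbook steps in order, timing the whole drill.
# Each function is a placeholder — substitute your actual failover commands.
promote_replica() { echo "step: promote DB replica"; }
switch_traffic()  { echo "step: switch DNS/LB to DR site"; }
verify_service()  { echo "step: verify health endpoints"; }

start=$SECONDS
promote_replica
switch_traffic
verify_service
echo "measured RTO: $((SECONDS - start))s"
```

An executable runbook can't silently drift the way a wiki page does: when a step's command changes, the next drill fails loudly instead of succeeding on paper.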
What We Track
| Metric | Target | Last Test |
|---|---|---|
| Time to detect failure | < 2 min | 1 min 23s |
| Time to initiate DR | < 5 min | 3 min 45s |
| Time to full service (RTO) | < 15 min | 11 min 12s |
| Data loss (RPO) | < 1 min | 0 (sync replication) |
| Runbook accuracy | 100% | 94% (2 steps outdated) |
That last metric — runbook accuracy — is why you test. Every test reveals steps that have drifted from reality.
Common Failures We've Seen
- DNS TTL too high — 1-hour TTL means 1-hour failover. Set it to 60s for critical services.
- Backup restore never tested — The backup was corrupted for 3 months. Nobody knew.
- DR site undersized — "We'll scale up when we need it." Scaling takes 20 minutes. Your RTO is 15.
- Credentials not synced — DR site can't connect to payment provider because API keys weren't replicated.
- No runbook — The one person who knew the procedure was on vacation.
- Single-region backups — Backups in the same DC as production. DC goes down, backups go with it.
Start Today
- Define your tiers — Not everything needs active-active. Classify your services.
- Implement 3-2-1 backups — Offsite, tested monthly.
- Write the runbook — Step by step, assuming the reader has never done it before.
- Test once — Even a single test will reveal critical gaps.
- Schedule quarterly tests — Put it on the calendar. It won't happen otherwise.
Need help building or testing your DR plan? We've designed disaster recovery for everything from single-server setups to multi-region platforms. Let's make sure your infrastructure survives the worst day.