Building an Incident Response Playbook That Works at 3 AM
Your monitoring goes red. PagerDuty fires. It's 3:14 AM and you're the on-call engineer. What do you do first?
If the answer is "it depends" or "let me think about it" — you don't have a playbook, you have a hope-based incident response strategy. Let's fix that.
Why Playbooks Matter
During an incident, cognitive function is impaired. Stress, sleep deprivation, and the pressure of customer impact make it hard to think clearly. A playbook removes the need to think about process so you can focus on the problem.
The best playbooks share three qualities:
- Actionable — Every step is a concrete action, not a vague suggestion
- Accessible — One click from the alert to the playbook
- Current — Updated after every incident (or they rot)
Incident Severity Levels
We use four severity levels. Every alert maps to one:
| Level | Criteria | Response Time | Notification |
|---|---|---|---|
| SEV1 | Customer-facing outage, data loss risk | Immediate | Wake everyone |
| SEV2 | Degraded service, partial outage | 15 min | On-call + backup |
| SEV3 | Non-critical system down, no customer impact | 1 hour | On-call only |
| SEV4 | Minor issue, informational | Next business day | Ticket |
The key decision: who gets woken up? Over-escalation causes alert fatigue. Under-escalation causes extended outages. We err on the side of over-escalation for the first occurrence, then tune.
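The mapping from impact to severity can be made mechanical so the on-call engineer never has to improvise it at 3 AM. Here is a minimal sketch of the table above as a function; the parameter names are illustrative, not from any real tool:

```python
def classify_severity(customer_facing_outage: bool,
                      data_loss_risk: bool,
                      degraded_service: bool,
                      customer_impact: bool) -> str:
    """Map an impact assessment to a severity level, per the table above."""
    if customer_facing_outage or data_loss_risk:
        return "SEV1"  # immediate — wake everyone
    if degraded_service:
        return "SEV2"  # 15 min — on-call + backup
    if not customer_impact:
        return "SEV3"  # 1 hour — on-call only; system down but customers unaffected
    return "SEV4"      # next business day — ticket
```

Encoding the decision this way also makes the "tune after the first occurrence" step concrete: you adjust the predicates, not tribal knowledge.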
The Incident Timeline
Every incident follows the same lifecycle:
Detection → Triage → Mitigation → Resolution → Post-mortem
- Detection — Alert fires or customer reports
- Triage — What's broken? Who's affected? What severity?
- Mitigation — Stop the bleeding (even if ugly)
- Resolution — Root cause fix
- Post-mortem — Within 48 hours
Phase 1: Detection (0-2 minutes)
Automated monitoring should catch 90%+ of incidents before customers notice. Our detection stack:
- Prometheus — Metrics + alerting rules
- Alertmanager — Deduplication, routing, silencing
- PagerDuty — Escalation + on-call scheduling
- Uptime checks — External synthetic monitoring from multiple regions
Every alert includes:
- What's broken (service name, component)
- Link to the relevant Grafana dashboard
- Link to the playbook for that alert
- Recent changes (last 3 deployments)
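The annotation bundle above can be assembled at alert-firing time. A minimal sketch, assuming hypothetical URLs and field names (your Grafana host, wiki path, and deploy log will differ):

```python
def build_alert_annotations(service: str, component: str,
                            dashboard_uid: str, playbook_slug: str,
                            deploy_history: list[str]) -> dict:
    """Assemble the context we attach to every alert: what's broken,
    where to look, what to do, and what changed recently."""
    return {
        "summary": f"{service}/{component} is unhealthy",
        # Illustrative hosts — substitute your own Grafana and wiki URLs.
        "dashboard": f"https://grafana.example.com/d/{dashboard_uid}",
        "playbook": f"https://wiki.example.com/playbooks/{playbook_slug}",
        "recent_changes": deploy_history[-3:],  # last 3 deployments
    }
```

The point is that the on-call engineer clicks, rather than searches: every link the triage phase needs is already on the page the pager opens.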
Phase 2: Triage (2-10 minutes)
The on-call engineer's first job is not to fix the problem. It's to assess:
- What is the customer impact? Check error rates, latency percentiles, support tickets
- What is the blast radius? One customer, one region, or everyone?
- Is this a known issue? Check the runbook, recent incidents, and the #incidents channel
- Assign severity — This determines who else gets pulled in
Phase 3: Mitigation (10-60 minutes)
Mitigation is not the same as resolution. Mitigation is stopping the bleeding:
- Roll back the last deployment
- Scale up if it's a capacity issue
- Failover to a backup region or instance
- Toggle a feature flag to disable the broken feature
- Block traffic if it's a DDoS or abuse
The goal: reduce customer impact as fast as possible. A hacky fix that restores service in 5 minutes beats an elegant fix that takes 2 hours.
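The mitigation options above form a ladder: try the fastest lever first, stop as soon as service is restored. A sketch of that ordering, where each action is a callable standing in for a real deploy, flag, or infrastructure API call:

```python
def mitigate(actions):
    """Run mitigation actions in order of speed; stop at the first that
    restores service. Each entry is (name, callable-returning-bool)."""
    for name, action in actions:
        if action():
            return name  # service restored by this action
    return None  # nothing worked — escalate


# Hypothetical ladder; the lambdas stand in for real API calls.
ladder = [
    ("rollback_last_deploy", lambda: False),   # didn't help this time
    ("toggle_feature_flag", lambda: True),     # restored service
    ("failover_region", lambda: True),         # never reached
]
```

Ordering the ladder explicitly is what makes the "hacky fix in 5 minutes" possible: nobody debates which lever to pull first while customers are down.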
Phase 4: Resolution
Once the incident is mitigated, the proper fix can happen during business hours with fresh eyes. This might be:
- A code fix for the bug that caused the incident
- A configuration change to prevent recurrence
- An infrastructure change (more capacity, better redundancy)
Phase 5: Post-mortem
This is where most teams fail. They either skip post-mortems or turn them into blame sessions.
Our post-mortem template:
## Incident: [Title]
**Date:** [Date] **Duration:** [X minutes] **Severity:** [SEV1-4]
## Summary
[2-3 sentences: what happened, who was affected, what was the impact]
## Timeline
[Minute-by-minute account of detection, triage, mitigation, resolution]
## Root Cause
[Technical root cause — not "human error"]
## Contributing Factors
[What made this worse? Missing monitoring? No runbook? Slow rollback?]
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific, measurable action] | [Name] | [Date] | Open |
## Lessons Learned
[What did we learn? What will we do differently?]
Blameless is non-negotiable. If someone is afraid to say "I ran the wrong command," they'll hide information that could prevent the next incident.
Automating Response
Manual runbooks are a starting point. The next level is automating common responses:
- Auto-scaling on traffic spikes
- Auto-rollback when error rates exceed thresholds post-deploy
- Auto-failover when health checks fail
- Self-healing with Kubernetes liveness/readiness probes
We've reduced our mean time to mitigation (MTTM) from 23 minutes to 4 minutes by automating the most common incident types.
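Auto-rollback is the highest-leverage of these. The core check is simple: did the post-deploy error rate stay well above baseline for a full observation window? A minimal sketch, with an assumed threshold and window (tune both to your traffic):

```python
def should_rollback(error_rates: list[float], baseline: float,
                    threshold_multiplier: float = 3.0, window: int = 5) -> bool:
    """Trigger rollback only when the error rate exceeds N x baseline for
    every sample in the window — a single spike should not roll back.
    The 3x multiplier and 5-sample window are illustrative defaults."""
    recent = error_rates[-window:]
    return (len(recent) == window
            and all(rate > baseline * threshold_multiplier for rate in recent))
```

Requiring the whole window to breach the threshold is the anti-flap guard: it trades a few minutes of detection latency for not rolling back on every transient blip.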
Tooling We Use
- PagerDuty — Alerting and escalation
- Grafana — Dashboards and investigation
- Loki — Log exploration during incidents
- Teleport — Secure access to production (with audit trail)
- GitHub Issues — Post-mortem tracking and action items
Start Simple
You don't need all of this on day one. Start with:
- Define your severity levels
- Write a playbook for your top 3 alerts
- Do a post-mortem after every SEV1/SEV2
The playbook improves with every incident. The worst playbook is the one you never wrote. Talk to us if you want help building your incident response practice.