Building an Incident Response Playbook That Works at 3 AM

Kicked Team · February 5, 2026 · 5 min read

Your monitoring goes red. PagerDuty fires. It's 3:14 AM and you're the on-call engineer. What do you do first?

If the answer is "it depends" or "let me think about it" — you don't have a playbook, you have a hope-based incident response strategy. Let's fix that.

Why Playbooks Matter

During an incident, cognitive function is impaired. Stress, sleep deprivation, and the pressure of customer impact make it hard to think clearly. A playbook removes the need to think about process so you can focus on the problem.

The best playbooks share three qualities:

  1. Actionable — Every step is a concrete action, not a vague suggestion
  2. Accessible — One click from the alert to the playbook
  3. Current — Updated after every incident (or they rot)

Incident Severity Levels

We use four severity levels. Every alert maps to one:

| Level | Criteria | Response Time | Notification |
|-------|----------|---------------|--------------|
| SEV1 | Customer-facing outage, data loss risk | Immediate | Wake everyone |
| SEV2 | Degraded service, partial outage | 15 min | On-call + backup |
| SEV3 | Non-critical system down, no customer impact | 1 hour | On-call only |
| SEV4 | Minor issue, informational | Next business day | Ticket |

The key decision: who gets woken up? Over-escalation causes alert fatigue. Under-escalation causes extended outages. We err on the side of over-escalation for the first occurrence, then tune.
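That severity-to-notification mapping can live directly in your alert routing. Here's a minimal Alertmanager routing sketch; the `severity` label values, receiver names, and webhook URL are illustrative, so adapt them to your own setup:

```yaml
# Route alerts by severity label. SEV1 wakes everyone; SEV4 files a ticket.
route:
  receiver: ticket-queue              # default: SEV4-style, next business day
  routes:
    - matchers: [severity="sev1"]
      receiver: pagerduty-all-hands   # wake everyone
    - matchers: [severity="sev2"]
      receiver: pagerduty-oncall-plus-backup
    - matchers: [severity="sev3"]
      receiver: pagerduty-oncall

receivers:
  - name: pagerduty-all-hands
    pagerduty_configs:
      - routing_key: REPLACE_WITH_SEV1_SERVICE_KEY
  - name: pagerduty-oncall-plus-backup
    pagerduty_configs:
      - routing_key: REPLACE_WITH_SEV2_SERVICE_KEY
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_WITH_SEV3_SERVICE_KEY
  - name: ticket-queue
    webhook_configs:
      - url: https://tickets.example.internal/hook   # placeholder
```

Tuning then becomes a one-line change: demote a noisy alert from `sev2` to `sev3` and it stops waking the backup.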

The Incident Timeline

Every incident follows the same lifecycle:

Detection → Triage → Mitigation → Resolution → Post-mortem
   │          │          │             │            │
   │          │          │             │            └── Within 48 hours
   │          │          │             └── Root cause fix
   │          │          └── Stop the bleeding (even if ugly)
   │          └── What's broken? Who's affected? What severity?
   └── Alert fires or customer reports

Phase 1: Detection (0-2 minutes)

Automated monitoring should catch 90%+ of incidents before customers notice. Our detection stack:

  • Prometheus — Metrics + alerting rules
  • Alertmanager — Deduplication, routing, silencing
  • PagerDuty — Escalation + on-call scheduling
  • Uptime checks — External synthetic monitoring from multiple regions

Every alert includes:

  • What's broken (service name, component)
  • Link to the relevant Grafana dashboard
  • Link to the playbook for that alert
  • Recent changes (last 3 deployments)

Phase 2: Triage (2-10 minutes)

The on-call engineer's first job is not to fix the problem. It's to assess:

  1. What is the customer impact? Check error rates, latency percentiles, support tickets
  2. What is the blast radius? One customer, one region, or everyone?
  3. Is this a known issue? Check the runbook, recent incidents, #incidents channel
  4. What severity is this? Assigning it determines who else gets pulled in
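Two PromQL queries cover the first two questions, assuming a standard `http_requests_total` counter with a `region` label (adjust metric and label names to your stack):

```promql
# 1. Customer impact: overall 5xx error ratio over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# 2. Blast radius: the same ratio broken down by region
sum by (region) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (region) (rate(http_requests_total[5m]))
```

If one region dominates the second query, you're likely looking at an infrastructure or deployment-rollout issue rather than a code bug.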

Phase 3: Mitigation (10-60 minutes)

Mitigation is not the same as resolution. Mitigation is stopping the bleeding:

  • Roll back the last deployment
  • Scale up if it's a capacity issue
  • Failover to a backup region or instance
  • Toggle a feature flag to disable the broken feature
  • Block traffic if it's a DDoS or abuse

The goal: reduce customer impact as fast as possible. A hacky fix that restores service in 5 minutes beats an elegant fix that takes 2 hours.
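The feature-flag option deserves special mention because it mitigates without touching deployments at all. A minimal kill-switch sketch in Python; the in-memory `FlagStore` and the flag name are hypothetical stand-ins for whatever flag system you use (LaunchDarkly, Unleash, a config file):

```python
class FlagStore:
    """In-memory flag store; production would back this with Redis or a DB."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name, default=True):
        # Unknown flags default to enabled so a missing entry isn't an outage.
        return self._flags.get(name, default)


flags = FlagStore()


def render_search_page(query):
    # During an incident, on-call flips the flag off and the broken feature
    # degrades to the old code path instead of erroring for everyone.
    if not flags.is_enabled("new-search-ranking"):
        return f"results for {query!r} (legacy ranking)"
    return f"results for {query!r} (new ranking)"
```

One `flags.set("new-search-ranking", False)` restores service in seconds, with no rollback or deploy required.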

Phase 4: Resolution

Once the incident is mitigated, the proper fix can happen during business hours with fresh eyes. This might be:

  • A code fix for the bug that caused the incident
  • A configuration change to prevent recurrence
  • An infrastructure change (more capacity, better redundancy)

Phase 5: Post-mortem

This is where most teams fail. They either skip post-mortems or turn them into blame sessions.

Our post-mortem template:

## Incident: [Title]
**Date:** [Date]  **Duration:** [X minutes]  **Severity:** [SEV1-4]

## Summary
[2-3 sentences: what happened, who was affected, what was the impact]

## Timeline
[Minute-by-minute account of detection, triage, mitigation, resolution]

## Root Cause
[Technical root cause — not "human error"]

## Contributing Factors
[What made this worse? Missing monitoring? No runbook? Slow rollback?]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific, measurable action] | [Name] | [Date] | Open |

## Lessons Learned
[What did we learn? What will we do differently?]

Blameless is non-negotiable. If someone is afraid to say "I ran the wrong command," they'll hide information that could prevent the next incident.

Automating Response

Manual runbooks are a starting point. The next level is automating common responses:

  • Auto-scaling on traffic spikes
  • Auto-rollback when error rates exceed thresholds post-deploy
  • Auto-failover when health checks fail
  • Self-healing with Kubernetes liveness/readiness probes
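The auto-rollback case reduces to a small decision function your deploy pipeline can call during a canary stage. A sketch under assumed thresholds (the 5% ceiling and 3x-baseline ratio are illustrative, not our production values):

```python
def should_rollback(baseline_error_rate, current_error_rate,
                    max_ratio=3.0, hard_ceiling=0.05):
    """Decide whether a fresh deploy should be rolled back automatically.

    Roll back if errors cross a hard ceiling (absolute badness) or spike
    relative to the pre-deploy baseline (relative regression).
    """
    if current_error_rate >= hard_ceiling:
        return True  # e.g. more than 5% of requests failing
    if baseline_error_rate > 0 and current_error_rate / baseline_error_rate >= max_ratio:
        return True  # errors tripled compared to before the deploy
    return False
```

The pipeline polls the error rate for a few minutes post-deploy and triggers a rollback the moment this returns `True`, which is how the common "bad deploy" incident type gets mitigated in minutes instead of waiting for a human.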

We've reduced our mean time to mitigation (MTTM) from 23 minutes to 4 minutes by automating the most common incident types.

Tooling We Use

  • PagerDuty — Alerting and escalation
  • Grafana — Dashboards and investigation
  • Loki — Log exploration during incidents
  • Teleport — Secure access to production (with audit trail)
  • GitHub Issues — Post-mortem tracking and action items

Start Simple

You don't need all of this on day one. Start with:

  1. Define your severity levels
  2. Write a playbook for your top 3 alerts
  3. Do a post-mortem after every SEV1/SEV2

The playbook improves with every incident. The worst playbook is the one you never wrote. Talk to us if you want help building your incident response practice.