Building an Incident Response Playbook That Works at 3 AM
Your monitoring goes red. PagerDuty fires. It's 3:14 AM and you're the on-call engineer. What do you do first?
If the answer is "it depends" or "let me think about it" — you don't have a playbook, you have a hope-based incident response strategy. Let's fix that.
Why Playbooks Matter
During an incident, cognitive function is impaired. Stress, sleep deprivation, and the pressure of customer impact make it hard to think clearly. A playbook removes the need to think about process so you can focus on the problem.
The best playbooks share three qualities:
- Actionable — Every step is a concrete action, not a vague suggestion
- Accessible — One click from the alert to the playbook
- Current — Updated after every incident (or they rot)
Incident Severity Levels
We use four severity levels. Every alert maps to one:
| Level | Criteria | Response Time | Notification |
|---|---|---|---|
| SEV1 | Customer-facing outage, data loss risk | Immediate | Wake everyone |
| SEV2 | Degraded service, partial outage | 15 min | On-call + backup |
| SEV3 | Non-critical system down, no customer impact | 1 hour | On-call only |
| SEV4 | Minor issue, informational | Next business day | Ticket |
The key decision: who gets woken up? Over-escalation causes alert fatigue. Under-escalation causes extended outages. We err on the side of over-escalation for the first occurrence, then tune.
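The mapping from impact to severity can be made mechanical so the on-call engineer never has to improvise it at 3 AM. Here is a minimal sketch of the table above as a function; the parameter names are illustrative, not from any real tool:

```python
def classify_severity(customer_facing_outage: bool,
                      data_loss_risk: bool,
                      degraded_service: bool,
                      customer_impact: bool) -> str:
    """Map an impact assessment to a severity level, per the table above."""
    if customer_facing_outage or data_loss_risk:
        return "SEV1"  # immediate — wake everyone
    if degraded_service:
        return "SEV2"  # 15 min — on-call + backup
    if not customer_impact:
        return "SEV3"  # 1 hour — on-call only; system down but customers unaffected
    return "SEV4"      # next business day — ticket
```

Encoding the decision this way also makes the "tune after the first occurrence" step concrete: you adjust the predicates, not tribal knowledge.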
The Incident Timeline
Every incident follows the same lifecycle:
Detection → Triage → Mitigation → Resolution → Post-mortem
- Detection — Alert fires or customer reports
- Triage — What's broken? Who's affected? What severity?
- Mitigation — Stop the bleeding (even if ugly)
- Resolution — Root cause fix
- Post-mortem — Within 48 hours
Phase 1: Detection (0-2 minutes)
Automated monitoring should catch 90%+ of incidents before customers notice. Our detection stack:
- Prometheus — Metrics + alerting rules
- Alertmanager — Deduplication, routing, silencing
- PagerDuty — Escalation + on-call scheduling
- Uptime checks — External synthetic monitoring from multiple regions
Every alert includes:
- What's broken (service name, component)
- Link to the relevant Grafana dashboard
- Link to the playbook for that alert
- Recent changes (last 3 deployments)
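The annotation bundle above can be assembled at alert-firing time. A minimal sketch, assuming hypothetical URLs and field names (your Grafana host, wiki path, and deploy log will differ):

```python
def build_alert_annotations(service: str, component: str,
                            dashboard_uid: str, playbook_slug: str,
                            deploy_history: list[str]) -> dict:
    """Assemble the context we attach to every alert: what's broken,
    where to look, what to do, and what changed recently."""
    return {
        "summary": f"{service}/{component} is unhealthy",
        # Illustrative hosts — substitute your own Grafana and wiki URLs.
        "dashboard": f"https://grafana.example.com/d/{dashboard_uid}",
        "playbook": f"https://wiki.example.com/playbooks/{playbook_slug}",
        "recent_changes": deploy_history[-3:],  # last 3 deployments
    }
```

The point is that the on-call engineer clicks, rather than searches: every link the triage phase needs is already on the page the pager opens.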
Phase 2: Triage (2-10 minutes)
The on-call engineer's first job is not to fix the problem. It's to assess:
- What is the customer impact? Check error rates, latency percentiles, support tickets
- What is the blast radius? One customer, one region, or everyone?
- Is this a known issue? Check the runbook, recent incidents, and the #incidents channel
- Assign severity — This determines who else gets pulled in
Phase 3: Mitigation (10-60 minutes)
Mitigation is not the same as resolution. Mitigation is stopping the bleeding:
- Roll back the last deployment
- Scale up if it's a capacity issue
- Failover to a backup region or instance
- Toggle a feature flag to disable the broken feature
- Block traffic if it's a DDoS or abuse
The goal: reduce customer impact as fast as possible. A hacky fix that restores service in 5 minutes beats an elegant fix that takes 2 hours.
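The mitigation options above form a ladder: try the fastest lever first, stop as soon as service is restored. A sketch of that ordering, where each action is a callable standing in for a real deploy, flag, or infrastructure API call:

```python
def mitigate(actions):
    """Run mitigation actions in order of speed; stop at the first that
    restores service. Each entry is (name, callable-returning-bool)."""
    for name, action in actions:
        if action():
            return name  # service restored by this action
    return None  # nothing worked — escalate


# Hypothetical ladder; the lambdas stand in for real API calls.
ladder = [
    ("rollback_last_deploy", lambda: False),   # didn't help this time
    ("toggle_feature_flag", lambda: True),     # restored service
    ("failover_region", lambda: True),         # never reached
]
```

Ordering the ladder explicitly is what makes the "hacky fix in 5 minutes" possible: nobody debates which lever to pull first while customers are down.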
Phase 4: Resolution
Once the incident is mitigated, the proper fix can happen during business hours with fresh eyes. This might be:
- A code fix for the bug that caused the incident
- A configuration change to prevent recurrence
- An infrastructure change (more capacity, better redundancy)
Phase 5: Post-mortem
This is where most teams fail. They either skip post-mortems or turn them into blame sessions.
Our post-mortem template:
## Incident: [Title]
**Date:** [Date] **Duration:** [X minutes] **Severity:** [SEV1-4]
## Summary
[2-3 sentences: what happened, who was affected, what was the impact]
## Timeline
[Minute-by-minute account of detection, triage, mitigation, resolution]
## Root Cause
[Technical root cause — not "human error"]
## Contributing Factors
[What made this worse? Missing monitoring? No runbook? Slow rollback?]
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific, measurable action] | [Name] | [Date] | Open |
## Lessons Learned
[What did we learn? What will we do differently?]
Blameless is non-negotiable. If someone is afraid to say "I ran the wrong command," they'll hide information that could prevent the next incident.
Automating Response
Manual runbooks are a starting point. The next level is automating common responses:
- Auto-scaling on traffic spikes
- Auto-rollback when error rates exceed thresholds post-deploy
- Auto-failover when health checks fail
- Self-healing with Kubernetes liveness/readiness probes
We've reduced our mean time to mitigation (MTTM) from 23 minutes to 4 minutes by automating the most common incident types.
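Auto-rollback is the highest-leverage of these. The core check is simple: did the post-deploy error rate stay well above baseline for a full observation window? A minimal sketch, with an assumed threshold and window (tune both to your traffic):

```python
def should_rollback(error_rates: list[float], baseline: float,
                    threshold_multiplier: float = 3.0, window: int = 5) -> bool:
    """Trigger rollback only when the error rate exceeds N x baseline for
    every sample in the window — a single spike should not roll back.
    The 3x multiplier and 5-sample window are illustrative defaults."""
    recent = error_rates[-window:]
    return (len(recent) == window
            and all(rate > baseline * threshold_multiplier for rate in recent))
```

Requiring the whole window to breach the threshold is the anti-flap guard: it trades a few minutes of detection latency for not rolling back on every transient blip.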
Tooling We Use
- PagerDuty — Alerting and escalation
- Grafana — Dashboards and investigation
- Loki — Log exploration during incidents
- Teleport — Secure access to production (with audit trail)
- GitHub Issues — Post-mortem tracking and action items
Start Simple
You don't need all of this on day one. Start with:
- Define your severity levels
- Write a playbook for your top 3 alerts
- Do a post-mortem after every SEV1/SEV2
The playbook improves with every incident. The worst playbook is the one you never wrote. Talk to us if you want help building your incident response practice.