SRE Practices: Implementing SLOs That Actually Work
Service Level Objectives (SLOs) are at the heart of Site Reliability Engineering. But too many teams get them wrong: the targets are either too strict (which kills delivery velocity) or too loose (which makes them meaningless). Here's how to get them right.
What Are SLOs?
Let's get the terminology straight:
- SLI (Service Level Indicator) — A quantitative measure of service quality (e.g., latency, error rate)
- SLO (Service Level Objective) — A target value for an SLI (e.g., 99.9% of requests < 200ms)
- SLA (Service Level Agreement) — A contract with consequences if SLOs aren't met
- Error Budget — The acceptable amount of unreliability (1 - SLO)
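To make the last definition concrete, here is a minimal sketch of the 1 - SLO arithmetic. The function names `error_budget` and `allowed_failures` are illustrative, and the sketch assumes a request-based SLO measured over some fixed window.

```python
import math


def error_budget(slo: float) -> float:
    """Return the error budget as a fraction of requests: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def allowed_failures(slo: float, total_requests: int) -> int:
    """Whole number of requests that may fail without breaching the SLO."""
    raw = error_budget(slo) * total_requests
    # Round away floating-point noise before flooring to a whole request count.
    return math.floor(round(raw, 6))


if __name__ == "__main__":
    # A 99.9% SLO over one million requests leaves room for 1000 failures.
    print(allowed_failures(0.999, 1_000_000))
```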
Choosing the Right SLIs
The most important step is choosing what to measure. Focus on what users actually experience:
For API Services
Availability: successful requests / total requests
Latency: (requests served faster than the threshold) / total requests
Throughput: requests per second within normal range
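As a rough illustration, the availability and latency ratios above could be computed from raw request records like this. The `Request` type is hypothetical, and treating every non-5xx response as "successful" is an assumption; your service may define success differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request."""
    status: int        # HTTP status code
    duration_s: float  # time to serve the request, in seconds


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Successful requests / total requests (assumption: non-5xx is success)."""
    if not requests:
        return None  # no traffic, so the SLI is undefined
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_s: float = 0.2) -> Optional[float]:
    """Requests served faster than the threshold / total requests."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.duration_s < threshold_s)
    return fast / len(requests)


if __name__ == "__main__":
    sample = [
        Request(200, 0.05),
        Request(200, 0.30),
        Request(500, 0.10),
        Request(404, 0.02),
    ]
    print(availability_sli(sample), latency_sli(sample))
```

Returning `None` for an empty window is a design choice in this sketch: reporting 0% or 100% when there was no traffic would be misleading either way.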
For Data Processing Pipelines
Freshness: time since last successful processing < threshold
Correctness: valid outputs / total outputs
Coverage: records processed / records expected
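The pipeline indicators can be sketched the same way. The function names and the 15-minute freshness threshold below are illustrative assumptions, not values from a specific system.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(last_success: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: time since the last successful run is below the threshold."""
    return now - last_success < max_age


def correctness_sli(valid_outputs: int, total_outputs: int) -> Optional[float]:
    """Correctness: valid outputs / total outputs."""
    return valid_outputs / total_outputs if total_outputs else None


def coverage_sli(records_processed: int, records_expected: int) -> Optional[float]:
    """Coverage: records processed / records expected."""
    return records_processed / records_expected if records_expected else None


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_run = now - timedelta(minutes=10)
    print(is_fresh(last_run, now, max_age=timedelta(minutes=15)))
    print(correctness_sli(995, 1000), coverage_sli(90, 100))
```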
Setting Realistic SLO Targets
Here's a reality check on "nines":
| SLO | Downtime/month | Downtime/year |
|---|---|---|
| 99% | 7.3 hours | 3.65 days |
| 99.9% | 43.8 minutes | 8.76 hours |
| 99.99% | 4.38 minutes | 52.6 minutes |
Pro tip: Start with a lower SLO than you think you need. You can always make it stricter, but loosening an SLO feels like failure.
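The downtime figures in the table follow directly from the SLO percentage. A small sketch, assuming a 365-day year and an average month of 365/12 days (the helper name `allowed_downtime` is illustrative), lets you work out the same numbers for any target:

```python
from datetime import timedelta

YEAR = timedelta(days=365)
MONTH = YEAR / 12  # average month, about 30.4 days


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted in `period` for an availability SLO given in percent."""
    if not 0.0 < slo_percent < 100.0:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    return period * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        per_month = allowed_downtime(slo, MONTH).total_seconds() / 60
        per_year = allowed_downtime(slo, YEAR).total_seconds() / 3600
        print(f"{slo}%: {per_month:.2f} min/month, {per_year:.2f} h/year")
```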
Error Budgets in Practice
The error budget is what makes SLOs actionable. Here's how we use them:
- Budget remaining > 50% — Full speed ahead on features
- Budget remaining 20-50% — Increase review rigor, slow down risky changes
- Budget remaining < 20% — Focus on reliability improvements
- Budget exhausted — Feature freeze until budget regenerates
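The tiers above can be encoded as a simple lookup so that tooling (a deploy gate, a chat bot, a dashboard annotation) applies the policy consistently. This is a sketch under assumptions: the function names are hypothetical, and the boundary values of exactly 20% and 50% are assigned to the middle tier, which the list above leaves open.

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = (1.0 - slo) * total_requests  # failures the SLO allows in the window
    if budget <= 0:
        raise ValueError("SLO and request volume leave no error budget")
    return 1.0 - failed_requests / budget


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the policy tier described above."""
    if remaining <= 0.0:
        return "Feature freeze until budget regenerates"
    if remaining < 0.20:
        return "Focus on reliability improvements"
    if remaining <= 0.50:
        return "Increase review rigor, slow down risky changes"
    return "Full speed ahead on features"


if __name__ == "__main__":
    # 99.9% SLO, one million requests, 250 failures so far in the window.
    remaining = budget_remaining(0.999, 1_000_000, 250)
    print(f"{remaining:.0%} of budget left: {policy_action(remaining)}")
```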
Monitoring SLOs
We use a combination of Prometheus and Grafana to track SLO compliance:
# Availability SLI (proportion of requests that did not return a 5xx)
1 - (
sum(rate(http_requests_total{code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)
# Latency SLI (proportion of fast requests)
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
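If you want to check these ratios from a script rather than a Grafana panel, Prometheus exposes an instant-query endpoint at `/api/v1/query`. The sketch below builds the request URL and reads the first sample out of the JSON response. The base URL, the helper names, and the 99.9% target are assumptions for illustration, and error handling is kept minimal.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Same availability expression as the PromQL above, on one line.
AVAILABILITY_QUERY = (
    '1 - (sum(rate(http_requests_total{code=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m])))"
)


def build_query_url(base_url: str, promql: str) -> str:
    """URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_first_value(body: str) -> Optional[float]:
    """Extract the first sample value from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload["data"]["result"]
    if not results:
        return None  # empty vector, e.g. no matching series in the window
    # Each vector sample is [unix_timestamp, "value as a string"].
    return float(results[0]["value"][1])


def fetch_sli(base_url: str, promql: str, timeout_s: float = 5.0) -> Optional[float]:
    """Run the query against a live Prometheus server (needs network access)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout_s) as resp:
        return parse_first_value(resp.read().decode("utf-8"))


def meets_slo(sli: float, target: float = 0.999) -> bool:
    """True when the measured SLI is at or above the (assumed) target."""
    return sli >= target
```

An empty result is returned as `None` rather than treated as 100%: if no 5xx series exists in the window, the availability expression can evaluate to an empty vector, and the caller should decide how to interpret that.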
Common Mistakes
- Too many SLOs — Start with 1-3 per service
- SLOs not tied to user experience — CPU usage is not an SLO
- No error budget policy — SLOs without consequences are just dashboards
- Measuring the wrong thing — Synthetic checks ≠ real user experience
- Set and forget — SLOs should be reviewed quarterly
Building an SLO Culture
The hardest part isn't the technical implementation — it's the cultural change. Here's what we recommend:
- Make SLOs visible — Dashboard on a big screen in the office
- Celebrate spending error budget — Using budget means you're innovating
- Blameless postmortems — Learn from incidents, don't assign blame
- Executive buy-in — Leadership needs to understand the trade-offs
Need Help?
Implementing SRE practices is a journey, not a destination. At Kicked, we help teams adopt SRE practices that are pragmatic and sustainable. Let's talk about your reliability goals.