SRE Practices: Implementing SLOs That Actually Work

Kicked Team | January 28, 2026 | 3 min read

Service Level Objectives (SLOs) are at the heart of Site Reliability Engineering. But too many teams implement them wrong — either making them too strict (killing velocity) or too loose (meaningless). Here's how to get them right.

What Are SLOs?

Let's get the terminology straight:

  • SLI (Service Level Indicator) — A quantitative measure of service quality (e.g., latency, error rate)
  • SLO (Service Level Objective) — A target value for an SLI (e.g., 99.9% of requests < 200ms)
  • SLA (Service Level Agreement) — A contract with consequences if SLOs aren't met
  • Error Budget — The acceptable amount of unreliability (1 - SLO)
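As a rough illustration of the last definition, the sketch below turns an SLO into an error budget and into a count of requests that may fail. The function names and the one-million-request volume are made up for the example.

```python
def error_budget(slo: float) -> float:
    """Fraction of events allowed to fail under the given SLO (1 - SLO)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return 1.0 - slo


def allowed_failures(slo: float, total_requests: int) -> int:
    """Whole number of requests that may fail before the SLO is missed."""
    # round() absorbs floating-point noise such as 1000.0000000000009
    return int(round(error_budget(slo) * total_requests, 6))


# Hypothetical volume: one million requests in the SLO window.
print(allowed_failures(0.999, 1_000_000))
```

With a 99.9% SLO, one request in a thousand may fail, so a million requests leave room for 1,000 failures.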

Choosing the Right SLIs

The most important step is choosing what to measure. Focus on what users actually experience:

For API Services

Availability:  successful requests / total requests
Latency:       requests served faster than threshold / total requests
Throughput:    requests per second within normal range
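The two ratios above can be sketched over a batch of request records. This is a minimal example, not a production collector: the `Request` type, the 200 ms threshold, and the choice to count only 5xx responses as failures are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def availability_sli(requests: list[Request]) -> float:
    """Successful requests / total requests (assumes only 5xx is a failure)."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_s: float = 0.2) -> float:
    """Requests served faster than the threshold / total requests."""
    if not requests:
        raise ValueError("need at least one request")
    fast = sum(1 for r in requests if r.duration_s < threshold_s)
    return fast / len(requests)


# Hypothetical sample of four requests.
sample = [
    Request(200, 0.05),
    Request(200, 0.30),
    Request(500, 0.10),
    Request(404, 0.15),
]
availability = availability_sli(sample)
latency = latency_sli(sample)
```

Whether a 4xx response counts as "successful" is a judgment call for each service. Here it does, on the assumption that client errors are not the service's fault.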

For Data Processing Pipelines

Freshness:     time since last successful processing < threshold
Correctness:   valid outputs / total outputs
Coverage:      records processed / records expected
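The pipeline indicators above reduce to one time comparison and two ratios. The sketch below assumes hypothetical helper names and made-up numbers for a single pipeline run.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_success: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the time since the last successful run is under the threshold."""
    return (now - last_success) < max_age


def ratio(good: int, total: int) -> float:
    """Generic good/total ratio, used here for correctness and coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


# Hypothetical numbers for one pipeline run.
now = datetime(2026, 1, 28, 12, 0, tzinfo=timezone.utc)
last_success = now - timedelta(minutes=10)

fresh = is_fresh(last_success, now, max_age=timedelta(minutes=15))
correctness = ratio(good=9_950, total=10_000)  # valid outputs / total outputs
coverage = ratio(good=9_800, total=10_000)     # records processed / records expected
```

To turn freshness into a ratio-style SLI, one option is to sample `is_fresh` at a fixed interval and report the fraction of samples that were fresh.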

Setting Realistic SLO Targets

Here's a reality check on "nines":

SLO       Downtime/month    Downtime/year
99%       7.3 hours         3.65 days
99.9%     43.8 minutes      8.76 hours
99.99%    4.38 minutes      52.6 minutes
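The figures in the table follow from multiplying the error budget by the length of the period. Here is a small sketch of that arithmetic, assuming a 365-day year and a month of one twelfth of a year (730 hours).

```python
HOURS_PER_YEAR = 365 * 24              # 8760 hours
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730 hours (average month, an assumption)


def allowed_downtime_hours(slo_percent: float, period_hours: float) -> float:
    """Downtime permitted in a period: (1 - SLO) * period length."""
    return (1 - slo_percent / 100) * period_hours


for slo in (99.0, 99.9, 99.99):
    per_month_min = allowed_downtime_hours(slo, HOURS_PER_MONTH) * 60
    per_year_h = allowed_downtime_hours(slo, HOURS_PER_YEAR)
    print(f"{slo}%: {per_month_min:.2f} min/month, {per_year_h:.3f} h/year")
```

Each extra nine cuts the allowance by a factor of ten, which is why moving from 99.9% to 99.99% tends to cost far more than the previous step did.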

Pro tip: Start with a lower SLO than you think you need. You can always make it stricter, but loosening an SLO feels like failure.

Error Budgets in Practice

The error budget is what makes SLOs actionable. Here's how we use them:

  1. Budget remaining > 50% — Full speed ahead on features
  2. Budget remaining 20-50% — Increase review rigor, slow down risky changes
  3. Budget remaining < 20% — Focus on reliability improvements
  4. Budget exhausted — Feature freeze until budget regenerates
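The four tiers above can be encoded as a small lookup, which makes the policy easy to wire into a dashboard or a deploy gate. This is a sketch only: the function names and action strings are illustrative, and treating exactly 20% and exactly 50% as part of the middle tier is an assumption, since the list does not say which side the boundaries fall on.

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative when overspent."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Map remaining budget (as a fraction) to the tiers listed above."""
    if remaining <= 0:
        return "feature freeze"
    if remaining < 0.20:
        return "focus on reliability"
    if remaining <= 0.50:
        return "increase review rigor"
    return "full speed ahead"


# Hypothetical window: 99.9% SLO, 1,000,000 requests, 300 of them failed.
remaining = budget_remaining(0.999, good=999_700, total=1_000_000)
action = policy_action(remaining)
```

In this example 300 of the 1,000 allowed failures are used, so about 70% of the budget remains and the policy says to keep shipping.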

Monitoring SLOs

We use a combination of Prometheus and Grafana to track SLO compliance:

# Availability SLI (1 - ratio of 5xx responses)
1 - (
  sum(rate(http_requests_total{code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
)

# Latency SLI (proportion of fast requests)
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
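To feed these numbers into scripts or reports, the queries can be sent to Prometheus's HTTP API at `/api/v1/query`. The sketch below uses only the Python standard library. The helper names are hypothetical, the server address is an assumption, and the parser assumes an instant query that returns a single series, as the aggregated first query above does.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# The first query above, written as a single string.
AVAILABILITY_QUERY = (
    '1 - (sum(rate(http_requests_total{code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total[5m])))'
)


def build_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_scalar(payload: dict) -> float:
    """Extract the single sample value from an instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected exactly one series, got {len(result)}")
    _timestamp, value = result[0]["value"]  # value arrives as a string
    return float(value)


def fetch_sli(base_url: str, promql: str, timeout_s: float = 5.0) -> float:
    """Run the query and return the SLI as a float (needs a reachable server)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout_s) as resp:
        return parse_scalar(json.load(resp))
```

A call such as `fetch_sli("http://localhost:9090", AVAILABILITY_QUERY)` would return the current availability, assuming Prometheus is listening at that address. An empty result (no traffic in the window) raises `ValueError` instead of silently reporting a number.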

Common Mistakes

  1. Too many SLOs — Start with 1-3 per service
  2. SLOs not tied to user experience — CPU usage is not an SLO
  3. No error budget policy — SLOs without consequences are just dashboards
  4. Measuring the wrong thing — Synthetic checks ≠ real user experience
  5. Set and forget — SLOs should be reviewed quarterly

Building an SLO Culture

The hardest part isn't the technical implementation — it's the cultural change. Here's what we recommend:

  • Make SLOs visible — Dashboard on a big screen in the office
  • Celebrate error budget spend — Using budget means you're innovating
  • Blameless postmortems — Learn from incidents, don't assign blame
  • Executive buy-in — Leadership needs to understand the trade-offs

Need Help?

Implementing SRE practices is a journey, not a destination. At Kicked, we help teams adopt SRE practices that are pragmatic and sustainable. Let's talk about your reliability goals.