
SRE Practices: Implementing SLOs That Actually Work

Kicked Team · January 28, 2026 · 3 min read

Service Level Objectives (SLOs) are at the heart of Site Reliability Engineering. But too many teams get them wrong: they set targets that are either too strict (killing velocity) or too loose (making the SLO meaningless). Here's how to get them right.

What Are SLOs?

Let's get the terminology straight:

SLI (Service Level Indicator) — A quantitative measure of service quality (e.g., latency, error rate)
SLO (Service Level Objective) — A target value for an SLI (e.g., 99.9% of requests < 200ms)
SLA (Service Level Agreement) — A contract with consequences if SLOs aren't met
Error Budget — The acceptable amount of unreliability (1 - SLO)
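
To make the error budget definition concrete, here is a minimal sketch that converts an SLO target into the number of requests that may fail in a window. The function name error_budget_requests and the traffic figure are illustrative assumptions, not part of any particular tool.

```python
def error_budget_requests(slo_target: float, total_requests: int) -> float:
    """Return how many requests may fail in the window without breaching the SLO.

    slo_target is a fraction, e.g. 0.999 for a 99.9% SLO.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    # Error budget = 1 - SLO, scaled by the traffic seen in the window.
    return (1.0 - slo_target) * total_requests


# Hypothetical traffic: 10 million requests in the SLO window.
budget = error_budget_requests(0.999, 10_000_000)
print(f"Allowed failed requests: {budget:.0f}")
```

With a 99.9% target, roughly one request in a thousand may fail before the objective is missed.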

Choosing the Right SLIs

The most important step is choosing what to measure. Focus on what users actually experience:

For API Services

Availability:  successful requests / total requests
Latency:       requests served < threshold / total requests
Throughput:    requests per second within normal range
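
As a rough illustration of the availability and latency ratios above, the sketch below computes both from a list of request records. The Request type, the choice to count only 5xx responses as failures, and the 200 ms threshold are assumptions made for the example; your own definition of a "successful" request may differ.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    duration_s: float  # time taken to serve the request, in seconds


def availability_sli(requests: list[Request]) -> float:
    """Successful requests / total requests (assumes only 5xx counts as failure)."""
    if not requests:
        raise ValueError("no requests in window")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_s: float = 0.2) -> float:
    """Requests served faster than the threshold / total requests."""
    if not requests:
        raise ValueError("no requests in window")
    fast = sum(1 for r in requests if r.duration_s < threshold_s)
    return fast / len(requests)


# Hypothetical sample of four requests.
sample = [
    Request(200, 0.05),
    Request(200, 0.35),  # succeeded, but slower than 200 ms
    Request(500, 0.10),  # server error
    Request(404, 0.02),  # client error, not counted against availability here
]
print(availability_sli(sample), latency_sli(sample))
```

Both SLIs are expressed as "good events / total events", which keeps them directly comparable to an SLO target such as 99.9%.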

For Data Processing Pipelines

Freshness:     time since last successful processing < threshold
Correctness:   valid outputs / total outputs
Coverage:      records processed / records expected
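
The pipeline SLIs can be sketched the same way. The helper names, the 30-minute freshness threshold, and the record counts below are hypothetical values chosen only to show the arithmetic.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_success: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: time since the last successful run is below the threshold."""
    return now - last_success < max_age


def ratio(good: int, total: int) -> float:
    """Generic good/total ratio used for correctness and coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


# Hypothetical pipeline state.
now = datetime(2026, 1, 28, 12, 0, tzinfo=timezone.utc)
last_success = datetime(2026, 1, 28, 11, 45, tzinfo=timezone.utc)

fresh = is_fresh(last_success, now, max_age=timedelta(minutes=30))
correctness = ratio(good=9_950, total=10_000)   # valid outputs / total outputs
coverage = ratio(good=10_000, total=10_200)     # records processed / records expected
print(fresh, correctness, round(coverage, 4))
```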

Setting Realistic SLO Targets

Here's a reality check on "nines":

SLO	Downtime/month	Downtime/year
99%	7.3 hours	3.65 days
99.9%	43.8 minutes	8.76 hours
99.99%	4.38 minutes	52.6 minutes
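
The figures in the table follow directly from (1 - SLO) multiplied by the length of the period. A small sketch of that arithmetic, assuming a 365-day year and an average month of one twelfth of a year (730 hours):

```python
HOURS_PER_YEAR = 365 * 24              # 8760 hours, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730 hours in an average month


def allowed_downtime_hours(slo_target: float, period_hours: float) -> float:
    """Downtime permitted by an availability SLO over the given period."""
    return (1.0 - slo_target) * period_hours


for slo in (0.99, 0.999, 0.9999):
    monthly_minutes = allowed_downtime_hours(slo, HOURS_PER_MONTH) * 60
    yearly_hours = allowed_downtime_hours(slo, HOURS_PER_YEAR)
    print(f"{slo:.2%}: {monthly_minutes:.2f} min/month, {yearly_hours:.2f} h/year")
```

Each extra nine cuts the allowance by a factor of ten, which is why the jump from 99.9% to 99.99% is usually far more expensive than it looks.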

Pro tip: Start with a lower SLO than you think you need. You can always make it stricter, but loosening an SLO feels like failure.

Error Budgets in Practice

The error budget is what makes SLOs actionable. Here's how we use them:

Budget remaining > 50% — Full speed ahead on features
Budget remaining 20-50% — Increase review rigor, slow down risky changes
Budget remaining < 20% — Focus on reliability improvements
Budget exhausted — Feature freeze until budget regenerates
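
One way to encode the thresholds above is a small function that maps the remaining budget to an action. This is a sketch only: the function names are hypothetical, and treating exactly 20% and exactly 50% as part of the middle tier is an assumption about how the boundaries are meant to be read.

```python
def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * total
    if allowed_bad <= 0:
        raise ValueError("SLO of 100% or empty window leaves no error budget")
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def policy_for(remaining: float) -> str:
    """Map remaining error budget (as a fraction) to the policy tiers above."""
    if remaining <= 0.0:
        return "feature freeze"
    if remaining < 0.2:
        return "focus on reliability"
    if remaining <= 0.5:
        return "increase review rigor"
    return "full speed ahead"


# Hypothetical window: 1,000,000 requests, 600 failed, against a 99.9% SLO.
remaining = budget_remaining(0.999, good=999_400, total=1_000_000)
print(f"{remaining:.0%} left -> {policy_for(remaining)}")
```

Writing the policy down as code (or as an equally explicit document) removes the debate about what to do when the budget runs low.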

Monitoring SLOs

We use a combination of Prometheus and Grafana to track SLO compliance:

# Availability SLI (1 - error rate): share of requests that did not return a 5xx
1 - (
  sum(rate(http_requests_total{code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
)

# Latency SLI (proportion of requests served within 200ms)
# Assumes the histogram defines a bucket boundary at le="0.2"
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
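
For readers less familiar with PromQL, the following sketch shows the arithmetic behind a ratio of two rates over the same window. It is deliberately simplified: it assumes no counter resets and ignores the extrapolation that Prometheus applies inside rate(). The sample values are made up.

```python
from typing import Optional


def increase(start: float, end: float) -> float:
    """Growth of a monotonically increasing counter (assumes no reset)."""
    return end - start


def ratio_sli(good_start: float, good_end: float,
              total_start: float, total_end: float) -> Optional[float]:
    """Good events / total events over one window.

    Both counters share the same window, so the window length cancels and
    dividing the increases gives the same result as dividing the rates.
    """
    total = increase(total_start, total_end)
    if total == 0:
        return None  # no traffic in the window, SLI is undefined
    return increase(good_start, good_end) / total


# Hypothetical counter samples taken 5 minutes apart:
# the le="0.2" bucket grew from 9000 to 9970, the total count from 10000 to 11000.
sli = ratio_sli(9000, 9970, 10000, 11000)
print(sli)
```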

Common Mistakes

Too many SLOs — Start with 1-3 per service
SLOs not tied to user experience — CPU usage is not a user-facing SLI, so it should not back an SLO
No error budget policy — SLOs without consequences are just dashboards
Measuring the wrong thing — Synthetic checks ≠ real user experience
Set and forget — SLOs should be reviewed quarterly

Building an SLO Culture

The hardest part isn't the technical implementation — it's the cultural change. Here's what we recommend:

Make SLOs visible — Dashboard on a big screen in the office
Celebrate error budget spend — Using budget means you're innovating
Blameless postmortems — Learn from incidents, don't assign blame
Executive buy-in — Leadership needs to understand the trade-offs

Need Help?

Implementing SRE practices is a journey, not a destination. At Kicked, we help teams adopt SRE practices that are pragmatic and sustainable. Let's talk about your reliability goals.
