Observability Beyond Monitoring: Logs, Metrics, and Traces in Practice
"Our monitoring says everything is green, but customers are complaining." We hear this at least once a month.
That's because monitoring and observability are not the same thing. Monitoring answers known questions: is the server up? Is CPU above 80%? Observability answers questions you haven't thought of yet: why is this specific user getting 500 errors only on Tuesday mornings from the Frankfurt region?
The Three Pillars
You've heard this before. But most teams implement them in isolation, which defeats the purpose.
Metrics — The Overview
Metrics are numerical measurements over time. They're cheap to store, fast to query, and perfect for dashboards and alerts.
# Request rate
rate(http_requests_total{service="api"}[5m])
# Error rate (as a percentage)
sum(rate(http_requests_total{service="api", status=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="api"}[5m])) * 100
# 95th percentile latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m])))
We follow the RED method for services (Rate, Errors, Duration) and the USE method for infrastructure (Utilization, Saturation, Errors).
Tooling: Prometheus for collection and alerting, Grafana for visualization, Thanos for long-term storage and multi-cluster aggregation.
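These queries plug straight into alerting. As a sketch, a Prometheus alerting rule on the error rate might look like this (the rule name, threshold, and labels are illustrative, not a recommendation):

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate          # illustrative name
        expr: |
          sum(rate(http_requests_total{service="api", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="api"}[5m])) > 0.05
        for: 5m                       # must be sustained, not a blip
        labels:
          severity: page
        annotations:
          summary: "API error rate above 5% for 5 minutes"
```

The `for: 5m` clause is what keeps a single bad scrape from paging anyone.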
Logs — The Detail
Logs tell you what happened. But unstructured logs are nearly useless at scale.
# Bad — unstructured, unparseable at scale
[2026-01-12 14:23:01] ERROR: Failed to process request from user 12345
# Good — structured JSON, every field is queryable
{"timestamp":"2026-01-12T14:23:01Z","level":"error","service":"api","trace_id":"abc123","user_id":"12345","method":"POST","path":"/api/orders","status":500,"duration_ms":234,"error":"connection refused: postgres:5432"}
Structured logging is non-negotiable. If you can't filter by user_id, trace_id, and status simultaneously, you'll spend hours grepping through files.
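Emitting JSON like the above doesn't require a logging framework. Here's a minimal stdlib-only sketch (the `service` name and the `fields` convention are our assumptions; many teams reach for structlog or python-json-logger instead):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, so every field is queryable."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "api",  # assumed service name
            "message": record.getMessage(),
        }
        # Merge structured fields passed via the `extra=` kwarg
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("request failed",
             extra={"fields": {"user_id": "12345", "status": 500}})
```

The `extra={"fields": ...}` pattern keeps arbitrary context out of the message string and in queryable keys.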
Tooling: Loki for log aggregation (pairs perfectly with Grafana — same query interface as metrics), Promtail or Alloy for collection.
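With JSON logs like the example above, a LogQL query can slice on any field. A sketch (this assumes `service` is a Loki stream label and the field names match our JSON example):

```logql
# All 5xx responses for one user, parsed from JSON log lines
{service="api"} | json | status >= 500 | user_id = "12345"
```

This is the query you'd reach for when "a specific user is getting 500s" and the dashboards all look green.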
Traces — The Journey
Traces show the path of a request through your system. When a request hits your API, calls a microservice, queries a database, and writes to a cache — a trace connects all of those spans:
[Trace: abc123] Total: 450ms
├── [api-gateway] 48ms
│   └── [auth-service] 34ms
│       └── [redis-cache] 2ms
├── [order-service] 380ms ← bottleneck
│   ├── [postgres-query] 310ms ← root cause
│   └── [inventory-check] 45ms
└── [notification-service] 24ms (async)
Without traces, you'd see "the API is slow" in your metrics and spend an hour guessing which service is the bottleneck. With traces, you see the answer instantly: the Postgres query in the order service is taking 310ms.
Tooling: OpenTelemetry for instrumentation (vendor-neutral), Tempo for trace storage, Grafana for visualization.
The Killer Feature: Correlation
The three pillars become powerful when they're connected:
- Alert fires on high error rate (metrics)
- Click through to see error logs for that time window (logs)
- Click a trace_id from the log to see the full request journey (traces)
- Find the root cause — a slow database query, a failing downstream service, a misconfigured timeout
In Grafana, this workflow is seamless. Metrics dashboards link to Loki log queries. Log entries link to Tempo traces. You can go from "something is broken" to "here's why" in under a minute.
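The log-to-trace hop is configured on the Loki data source via derived fields. A provisioning sketch (the UIDs, URL, and regex here are assumptions matching the JSON log format shown earlier):

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki                         # assumed UID
    url: http://loki:3100             # assumed address
    jsonData:
      derivedFields:
        - name: trace_id
          matcherRegex: '"trace_id":"(\w+)"'
          datasourceUid: tempo        # assumed Tempo datasource UID
          url: '$${__value.raw}'      # $$ escapes Grafana's env-var expansion
```

Once this is in place, every log line containing a trace_id renders a clickable link into Tempo.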
Instrumentation Strategy
Application Level
Use OpenTelemetry SDKs. They support every major language and framework:
# Python example with OpenTelemetry
from opentelemetry import trace, metrics

tracer = trace.get_tracer("order-service")
meter = metrics.get_meter("order-service")

order_counter = meter.create_counter(
    "orders.created",
    description="Number of orders created",
)

@tracer.start_as_current_span("create_order")
def create_order(user_id: str, items: list):
    span = trace.get_current_span()
    span.set_attribute("user.id", user_id)
    span.set_attribute("order.items_count", len(items))
    order_counter.add(1, {"region": "eu-west"})
    # ... business logic
Infrastructure Level
For infrastructure metrics, we use:
- Node Exporter — Linux host metrics (CPU, memory, disk, network)
- cAdvisor — Container metrics
- kube-state-metrics — Kubernetes object metrics
- SNMP Exporter — Network equipment metrics
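Wiring exporters into Prometheus is just a scrape config. A sketch (hostnames are placeholders; 9100 is Node Exporter's default port):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]   # placeholder hostname
  - job_name: kube-state-metrics
    static_configs:
      - targets: ["kube-state-metrics:8080"]
```

In Kubernetes you'd typically replace `static_configs` with service discovery (`kubernetes_sd_configs`) rather than listing targets by hand.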
The Grafana Stack
Our full observability stack:
| Component | Role | Data |
|---|---|---|
| Prometheus | Time-series DB | Metrics |
| Loki | Log aggregation | Logs |
| Tempo | Trace backend | Traces |
| Grafana | Visualization | All three |
| Alloy | Collection agent | Ships all telemetry |
| Alertmanager | Alert routing | Notifications |
All open-source. All self-hosted on our infrastructure. No per-seat pricing, no data egress fees, no vendor lock-in.
Common Mistakes
- Alerting on symptoms, not causes — Alert on error rate, not on CPU usage. High CPU is a symptom.
- Too many dashboards — If you have 200 dashboards, nobody looks at any of them. Have 5 great ones.
- Missing service ownership — Every metric, log, and alert needs an owner. Otherwise, alerts get ignored.
- No SLOs — Without SLOs, you don't know when something is "bad enough" to page someone. Read our SLO guide.
- Sampling too aggressively — Head-based sampling at 1% means you'll miss rare but critical errors. Use tail-based sampling.
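Tail-based sampling decides after the whole trace has been seen, so it can always keep the errors and the slow outliers. In the OpenTelemetry Collector this is the `tail_sampling` processor; a sketch (policy names and thresholds are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```

Policies are OR-ed: a trace survives if any policy keeps it, so rare errors make it through even at a 1% baseline.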
Getting Started
If you're starting from zero:
- Week 1 — Deploy Prometheus + Grafana, instrument RED metrics for your top 3 services
- Week 2 — Add Loki for structured logging, create a log-based dashboard
- Week 3 — Add OpenTelemetry tracing to one service, deploy Tempo
- Week 4 — Connect everything in Grafana, set up exemplars linking metrics → traces
Or skip the setup and let us build it for you. We've deployed this stack for teams of 5 and teams of 500.