Observability Beyond Monitoring: Logs, Metrics, and Traces in Practice
"Our monitoring says everything is green, but customers are complaining." We hear this at least once a month.
That's because monitoring and observability are not the same thing. Monitoring answers known questions: is the server up? Is CPU above 80%? Observability answers questions you haven't thought of yet: why is this specific user getting 500 errors only on Tuesday mornings from the Frankfurt region?
The Three Pillars
You've heard this before. But most teams implement them in isolation, which defeats the purpose.
Metrics — The Overview
Metrics are numerical measurements over time. They're cheap to store, fast to query, and perfect for dashboards and alerts.
# Request rate
rate(http_requests_total{service="api"}[5m])
# Error rate (as a percentage)
sum(rate(http_requests_total{service="api", status=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="api"}[5m])) * 100
# 95th percentile latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m])))
We follow the RED method for services (Rate, Errors, Duration) and the USE method for infrastructure (Utilization, Saturation, Errors).
Tooling: Prometheus for collection and alerting, Grafana for visualization, Thanos for long-term storage and multi-cluster aggregation.
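These queries plug straight into alerting. As a sketch, a Prometheus alerting rule on the error rate might look like this (the rule name, threshold, and labels are illustrative, not a recommendation):

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate          # illustrative name
        expr: |
          sum(rate(http_requests_total{service="api", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="api"}[5m])) > 0.05
        for: 5m                       # must be sustained, not a blip
        labels:
          severity: page
        annotations:
          summary: "API error rate above 5% for 5 minutes"
```

The `for: 5m` clause is what keeps a single bad scrape from paging anyone.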
Logs — The Detail
Logs tell you what happened. But unstructured logs are nearly useless at scale.
# Bad — unstructured, unparseable at scale
[2026-01-12 14:23:01] ERROR: Failed to process request from user 12345
# Good — structured JSON, every field is queryable
{"timestamp":"2026-01-12T14:23:01Z","level":"error","service":"api","trace_id":"abc123","user_id":"12345","method":"POST","path":"/api/orders","status":500,"duration_ms":234,"error":"connection refused: postgres:5432"}
Structured logging is non-negotiable. If you can't filter by user_id, trace_id, and status simultaneously, you'll spend hours grepping through files.
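Emitting JSON like the above doesn't require a logging framework. Here's a minimal stdlib-only sketch (the `service` name and the `fields` convention are our assumptions; many teams reach for structlog or python-json-logger instead):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, so every field is queryable."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "api",  # assumed service name
            "message": record.getMessage(),
        }
        # Merge structured fields passed via the `extra=` kwarg
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("request failed",
             extra={"fields": {"user_id": "12345", "status": 500}})
```

The `extra={"fields": ...}` pattern keeps arbitrary context out of the message string and in queryable keys.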
Tooling: Loki for log aggregation (pairs perfectly with Grafana — same query interface as metrics), Promtail or Alloy for collection.
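With JSON logs like the example above, a LogQL query can slice on any field. A sketch (this assumes `service` is a Loki stream label and the field names match our JSON example):

```logql
# All 5xx responses for one user, parsed from JSON log lines
{service="api"} | json | status >= 500 | user_id = "12345"
```

This is the query you'd reach for when "a specific user is getting 500s" and the dashboards all look green.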
Traces — The Journey
Traces show the path of a request through your system. When a request hits your API, calls a microservice, queries a database, and writes to a cache — a trace connects all of those spans:
[Trace: abc123] Total: 450ms
├── [api-gateway] 48ms
│   └── [auth-service] 34ms
│       └── [redis-cache] 2ms
├── [order-service] 380ms ← bottleneck
│   ├── [postgres-query] 310ms ← root cause
│   └── [inventory-check] 45ms
└── [notification-service] 24ms (async)
Without traces, you'd see "the API is slow" in your metrics and spend an hour guessing which service is the bottleneck. With traces, you see the answer instantly: the Postgres query in the order service is taking 310ms.
Tooling: OpenTelemetry for instrumentation (vendor-neutral), Tempo for trace storage, Grafana for visualization.
The Killer Feature: Correlation
The three pillars become powerful when they're connected:
- Alert fires on high error rate (metrics)
- Click through to see error logs for that time window (logs)
- Click a trace_id from the log to see the full request journey (traces)
- Find the root cause — a slow database query, a failing downstream service, a misconfigured timeout
In Grafana, this workflow is seamless. Metrics dashboards link to Loki log queries. Log entries link to Tempo traces. You can go from "something is broken" to "here's why" in under a minute.
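The log-to-trace hop is configured on the Loki data source via derived fields. A provisioning sketch (the UIDs, URL, and regex here are assumptions matching the JSON log format shown earlier):

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki                         # assumed UID
    url: http://loki:3100             # assumed address
    jsonData:
      derivedFields:
        - name: trace_id
          matcherRegex: '"trace_id":"(\w+)"'
          datasourceUid: tempo        # assumed Tempo datasource UID
          url: '$${__value.raw}'      # $$ escapes Grafana's env-var expansion
```

Once this is in place, every log line containing a trace_id renders a clickable link into Tempo.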
Instrumentation Strategy
Application Level
Use OpenTelemetry SDKs. They support every major language and framework:
# Python example with OpenTelemetry
from opentelemetry import trace, metrics

tracer = trace.get_tracer("order-service")
meter = metrics.get_meter("order-service")

order_counter = meter.create_counter(
    "orders.created",
    description="Number of orders created",
)

@tracer.start_as_current_span("create_order")
def create_order(user_id: str, items: list):
    span = trace.get_current_span()
    span.set_attribute("user.id", user_id)
    span.set_attribute("order.items_count", len(items))
    order_counter.add(1, {"region": "eu-west"})
    # ... business logic
Infrastructure Level
For infrastructure metrics, we use:
- Node Exporter — Linux host metrics (CPU, memory, disk, network)
- cAdvisor — Container metrics
- kube-state-metrics — Kubernetes object metrics
- SNMP Exporter — Network equipment metrics
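Wiring exporters into Prometheus is just a scrape config. A sketch (hostnames are placeholders; 9100 is Node Exporter's default port):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]   # placeholder hostname
  - job_name: kube-state-metrics
    static_configs:
      - targets: ["kube-state-metrics:8080"]
```

In Kubernetes you'd typically replace `static_configs` with service discovery (`kubernetes_sd_configs`) rather than listing targets by hand.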
The Grafana Stack
Our full observability stack:
| Component | Role | Data |
|---|---|---|
| Prometheus | Time-series DB | Metrics |
| Loki | Log aggregation | Logs |
| Tempo | Trace backend | Traces |
| Grafana | Visualization | All three |
| Alloy | Collection agent | Ships all telemetry |
| Alertmanager | Alert routing | Notifications |
All open-source. All self-hosted on our infrastructure. No per-seat pricing, no data egress fees, no vendor lock-in.
Common Mistakes
- Alerting on symptoms, not causes — Alert on error rate, not on CPU usage. High CPU is a symptom.
- Too many dashboards — If you have 200 dashboards, nobody looks at any of them. Have 5 great ones.
- Missing service ownership — Every metric, log, and alert needs an owner. Otherwise, alerts get ignored.
- No SLOs — Without SLOs, you don't know when something is "bad enough" to page someone. Read our SLO guide.
- Sampling too aggressively — Head-based sampling at 1% means you'll miss rare but critical errors. Use tail-based sampling.
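Tail-based sampling decides after the whole trace has been seen, so it can always keep the errors and the slow outliers. In the OpenTelemetry Collector this is the `tail_sampling` processor; a sketch (policy names and thresholds are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```

Policies are OR-ed: a trace survives if any policy keeps it, so rare errors make it through even at a 1% baseline.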
Getting Started
If you're starting from zero:
- Week 1 — Deploy Prometheus + Grafana, instrument RED metrics for your top 3 services
- Week 2 — Add Loki for structured logging, create a log-based dashboard
- Week 3 — Add OpenTelemetry tracing to one service, deploy Tempo
- Week 4 — Connect everything in Grafana, set up exemplars linking metrics → traces
Or skip the setup and let us build it for you. We've deployed this stack for teams of 5 and teams of 500.