
Observability: Metrics, Logs & Traces

The three pillars: metrics (Prometheus, Grafana), structured logging (ELK stack), distributed tracing (Jaeger, OpenTelemetry), and correlating signals.

18 min read · High interview weight

The Three Pillars of Observability

Observability is the ability to understand the internal state of a system by examining its external outputs. The three pillars are metrics (aggregated numeric time-series), logs (discrete structured events), and traces (end-to-end request timelines). Each pillar answers different questions: metrics tell you *something is wrong*, logs tell you *what happened*, and traces tell you *where in the system it happened and why it was slow*.

Diagram: observability data flows (metrics scraped by Prometheus, logs shipped to Elasticsearch, traces exported to Jaeger).

Pillar 1: Metrics with Prometheus & Grafana

Prometheus uses a pull model: it scrapes an HTTP `/metrics` endpoint on each service at a configured interval (typically 15 seconds). Metrics are stored as time-series data identified by a metric name and key-value labels (`{service="api", region="us-east-1"}`). PromQL (Prometheus Query Language) enables powerful aggregations.
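Scraping is configured per job in `prometheus.yml`. A minimal sketch of such a config; the job name, target address, and labels here are illustrative, not from the lesson:

```yaml
# prometheus.yml (fragment): scrape the api service every 15s
scrape_configs:
  - job_name: "api"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["api:8080"]
        labels:
          region: us-east-1
```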

| Metric Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing value | `http_requests_total{status="200"}` |
| Gauge | Value that can go up or down | `active_connections`, `memory_bytes` |
| Histogram | Distribution of values in buckets | `http_request_duration_seconds_bucket` |
| Summary | Quantiles calculated client-side | `rpc_duration_seconds{quantile="0.99"}` |
```promql
# Rate of 5xx errors per second over a 5-minute window
rate(http_requests_total{status=~"5.."}[5m])

# 99th percentile latency by service (sum bucket rates, keeping le)
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
```

Alerting rules are defined in YAML (the pre-2.0 `ALERT` statement syntax shown in older tutorials was removed in Prometheus 2.0):

```yaml
# Alert: error rate > 1% for 5 minutes
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 1%"
```
💡

The RED Method

For microservices, instrument every service with three core metrics: Rate (requests per second), Errors (failed requests per second), Duration (distribution of response latencies). These three signal almost every user-impacting problem.
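To make the three signals concrete, here is a toy sketch that computes them from raw request records. In production Prometheus derives these server-side with `rate()` and `histogram_quantile()`; this only illustrates what the numbers mean. The window length and request data are made up:

```python
WINDOW_SECONDS = 60  # assumed observation window

def red_signals(requests, window=WINDOW_SECONDS):
    """requests: list of (status_code, duration_seconds) tuples."""
    total = len(requests)
    # Errors: count 5xx responses
    errors = sum(1 for status, _ in requests if status >= 500)
    # Duration: 99th percentile of the sorted latencies
    durations = sorted(d for _, d in requests)
    p99 = durations[min(len(durations) - 1, int(0.99 * len(durations)))]
    return {
        "rate_rps": total / window,    # Rate
        "error_rps": errors / window,  # Errors
        "p99_duration_s": p99,         # Duration
    }

# 100 requests: 98 fast successes, one 5xx, one slow outlier
reqs = [(200, 0.05)] * 98 + [(500, 0.2), (200, 1.5)]
print(red_signals(reqs))
```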

Pillar 2: Structured Logging with ELK

Log in structured JSON format — not free-text strings — so logs are machine-parseable and queryable. Include a `trace_id` field in every log line so you can correlate logs with distributed traces. Common backends are the ELK Stack (Elasticsearch + Logstash/Filebeat + Kibana) and Grafana Loki (cheaper, label-based indexing).

```jsonc
// Bad: free-text log line
"INFO 2024-01-15 Processing order 12345 for user abc failed after 3 retries"

// Good: structured JSON log
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "error",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "7890abcd",
  "event": "order_processing_failed",
  "order_id": "12345",
  "user_id": "abc",
  "retry_count": 3,
  "error": "payment_gateway_timeout",
  "duration_ms": 5023
}
```
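Emitting logs in this shape takes little code. A minimal Python sketch using only the stdlib `logging` module; the field names mirror the example above but are conventions, not a standard, and the service name is illustrative:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "service": "order-service",  # illustrative service name
            "message": record.getMessage(),
        }
        # Merge structured fields passed via logger(..., extra={"fields": ...})
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("order processing failed",
             extra={"fields": {"trace_id": "abc123def456",
                               "order_id": "12345",
                               "retry_count": 3}})
```

Real services would typically use a logging library with native JSON support, but the idea is the same: every line is a queryable object carrying `trace_id`.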

Pillar 3: Distributed Tracing with OpenTelemetry & Jaeger

A trace represents a single end-to-end request. It is composed of spans — each span is a unit of work within a service (a database query, an HTTP call, a cache lookup). Spans are linked by a shared trace ID and parent-child relationships via span IDs. The trace ID is propagated in HTTP headers (W3C `traceparent` standard).
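The W3C `traceparent` header has a fixed shape: `version-traceid-parentid-flags` (2 + 32 + 16 + 2 hex characters, dash-separated). A small sketch of parsing it, with an example value taken from the W3C spec's format rather than from this lesson:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its trace-context fields."""
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,          # shared by every span in the trace
        "parent_span_id": parent_id,   # the caller's span
        "sampled": bool(int(flags, 16) & 0x01),  # sampling decision bit
    }

hdr = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
ctx = parse_traceparent(hdr)
print(ctx["trace_id"], ctx["sampled"])
```

In practice the OpenTelemetry SDK injects and extracts this header for you; you rarely parse it by hand.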

OpenTelemetry (OTel) is the CNCF standard for instrumentation — it provides language SDKs and a wire protocol (OTLP) that is vendor-neutral. You instrument once and route to any backend (Jaeger, Zipkin, Honeycomb, Datadog). The OTel Collector receives, batches, and exports telemetry to multiple destinations.
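A minimal Collector pipeline sketch: receive OTLP, batch, and export to Jaeger (modern Jaeger versions accept OTLP natively). The `jaeger:4317` endpoint is an assumed hostname, not from the lesson:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # assumed Jaeger OTLP/gRPC endpoint
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```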

Correlating the Three Pillars

The real power of observability comes when you correlate signals. A Grafana dashboard shows a spike in p99 latency at 14:32. You drill into a trace from that window in Jaeger and see the `payment-service` span took 4 seconds. You filter logs by `trace_id` and `service=payment-service` in Kibana and find an error log: `upstream_connect_timeout`. Now you know the root cause — and you found it in under five minutes, without a single SSH session.
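The correlation step boils down to a filter on structured log fields, which is exactly the query Kibana or Loki runs. A toy sketch with hypothetical log lines:

```python
# Hypothetical structured log lines, as they would come out of Elasticsearch
logs = [
    {"trace_id": "abc123", "service": "payment-service",
     "level": "error", "error": "upstream_connect_timeout"},
    {"trace_id": "def456", "service": "order-service",
     "level": "info", "event": "order_created"},
]

def logs_for_trace(logs, trace_id):
    """Return only the log lines belonging to one distributed trace."""
    return [line for line in logs if line["trace_id"] == trace_id]

print(logs_for_trace(logs, "abc123"))
```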

| Signal | Best For | Primary Tool |
|---|---|---|
| Metrics | Alerting, dashboards, trending | Prometheus + Grafana |
| Logs | Debugging specific events and errors | ELK, Loki + Grafana |
| Traces | Latency analysis, service dependency mapping | Jaeger, Zipkin, Tempo |
💡

Interview Tip

When asked 'how would you debug a latency spike in production?' — say: (1) check metrics dashboards (RED method) to confirm which service's latency increased, (2) sample traces from the degraded time window to find the slow span, (3) correlate the trace ID in logs to get detailed error context. Mention that you'd add `trace_id` to all log entries to enable this correlation. This structured debugging approach impresses interviewers.
