Metrics & Monitoring (Prometheus / Grafana)
Problem statement
Collect time-series metrics (counters, gauges, histograms) from thousands of pods, store efficiently, power SLO dashboards and alerting.
How it works
- Apps expose `/metrics` in the Prometheus text exposition format, or emit OTLP and let a collector convert.
- Prometheus scrapes each target on an interval; the TSDB compresses samples into blocks; PromQL queries them.
- Recording rules pre-aggregate expensive queries; Alertmanager routes alerts.
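As a concrete illustration of the scrape path, a `/metrics` response might look like this (metric and label names are illustrative, not from the source):

```text
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{route="/api/v1/items",code="200"} 1027
# HELP http_request_duration_seconds Request latency.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{route="/api/v1/items",le="0.1"} 940
http_request_duration_seconds_bucket{route="/api/v1/items",le="+Inf"} 1027
http_request_duration_seconds_sum{route="/api/v1/items"} 53.4
http_request_duration_seconds_count{route="/api/v1/items"} 1027
```

Counters and histogram buckets only ever increase between restarts, which is what lets PromQL compute rates and quantiles from periodic scrapes.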
Analogy: A hospital vitals monitor: heart rate (gauge), total patients admitted today (counter), response time histogram = blood pressure distribution over time windows.
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Pod exporters /metrics | Exposes Prometheus text format counters/gauges/histograms. | Standard instrumentation libraries across languages. |
| Prometheus HA + Thanos | Scrapes + long-term storage + global query. | HA avoids single scraper failure; Thanos extends retention cheaply via object storage. |
| Alertmanager | Routes alerts by severity/team/on-call schedule. | Dedupes noisy alerts; supports PagerDuty/Slack receivers. |
| Thanos Query / Store | Global read path over historical blocks. | Multi-cluster SLO views without federated Prometheus complexity. |
| Grafana | Dashboards + alert UI. | De facto visualization for PromQL. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Cardinality bomb
- Problem: `http_requests_total{user_id="..."}` explodes the series count.
- Fix: aggregate in the app to `http_requests_total{route}` only; use tracing for per-user detail.
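To spot a cardinality bomb in a live Prometheus, a standard diagnostic is to rank metric names by active series count (a sketch to run in the PromQL console):

```promql
# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))
```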
Histograms vs summaries
- Histogram (client-chosen buckets) is better for SLI math in PromQL (`histogram_quantile`).
- Summary (client-computed quantiles) cannot be meaningfully aggregated across instances; avoid it for global SLOs.
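The reason histograms aggregate and summaries do not: bucket counts are plain sums, so you can merge buckets from many instances and only then compute a quantile. A minimal Python sketch of the bucket-interpolation idea behind `histogram_quantile` (simplified; real PromQL handles `+Inf` and edge cases more carefully, and the bucket values are invented for illustration):

```python
import math

def histogram_quantile(q, buckets):
    """Simplified PromQL histogram_quantile(): linearly interpolate
    inside the first bucket whose cumulative count reaches rank q*total.
    buckets: sorted [(upper_bound, cumulative_count)], last bound = +inf."""
    total = buckets[-1][1]
    rank = q * total
    lower, prev = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                return lower  # quantile lands in the +Inf bucket
            return lower + (upper - lower) * (rank - prev) / (count - prev)
        lower, prev = upper, count

# Bucket counts are additive, so merging two instances is a per-`le` sum;
# that is why histograms support global SLOs and summaries do not.
inst_a = {0.1: 50, 0.5: 90, math.inf: 100}
inst_b = {0.1: 10, 0.5: 40, math.inf: 60}
merged = sorted((le, inst_a[le] + inst_b[le]) for le in inst_a)
p50 = histogram_quantile(0.5, merged)  # ~0.214s across both instances
```

Averaging two per-instance p50 values would not give this answer; only the merged buckets do.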
HA Prometheus
- Two replicas scraping the same targets doubles scrape and storage I/O; deduplicate replicas at query time (Thanos Query), shard targets with hashmod relabeling, or use Thanos Receive for an HA remote-write path.
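Hashmod sharding splits targets deterministically between replicas via relabeling. A sketch (the job name and modulus are illustrative; each replica keeps a different `regex` value):

```yaml
scrape_configs:
  - job_name: pods            # illustrative job name
    relabel_configs:
      - source_labels: [__address__]
        modulus: 2            # number of Prometheus shards
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"            # this replica keeps shard 0; the other keeps "1"
        action: keep
```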
Remote write
- Grafana Mimir, Cortex, AWS AMP for managed long retention + multi-tenant.
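Shipping samples to such a backend is a `remote_write` stanza in `prometheus.yml` (the URL is a placeholder; authentication depends on the backend):

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push  # placeholder endpoint
    queue_config:
      max_samples_per_send: 2000  # batch size; tune for throughput vs latency
```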
E2E: alert on burn rate
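A multiwindow burn-rate alert for a 99.9% availability SLO can be sketched as a Prometheus rule (the metric names and the 14.4x fast-burn factor follow the common SRE-workbook pattern; both are assumptions, not from the source):

```yaml
groups:
  - name: slo-burn
    rules:
      - alert: HighErrorBudgetBurn
        # Fast burn: 14.4x the allowed error rate over both 1h and 5m,
        # which exhausts ~2% of a 30-day budget in an hour.
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          and
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
        labels:
          severity: page
```

Requiring both windows keeps the alert fast to fire yet quick to resolve once the error rate drops.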
Tricky parts
| Problem | Solution |
|---|---|
| Missed scrapes | Alert on `up == 0`; counter restarts reset values to zero, so compute deltas with `rate()`/`increase()` over windows (both handle resets) rather than raw subtraction |
| Clock skew | Set external labels per replica; use `honor_timestamps` with caution when exporters stamp their own samples |
| High churn pods | Use relabeling (`metric_relabel_configs`, `label_replace`) to drop ephemeral pod-name labels that drive cardinality |
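The pod-churn fix can be made concrete with a scrape-time relabel that drops the offending label after ingestion (a sketch; `pod` is the label name assumed here):

```yaml
metric_relabel_configs:
  - regex: pod            # drop the ephemeral pod-name label after scraping
    action: labeldrop
```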
Caveats
- Logs vs metrics: metrics for aggregates; never log histogram buckets per request as log lines.
- Cost: downsample after 30d in Mimir compactor.
Azure
- Azure Monitor managed Prometheus; Grafana Azure plugin.