Metrics & Monitoring (Prometheus / Grafana)
Problem statement
Collect time-series metrics (counters, gauges, histograms) from thousands of pods, store efficiently, power SLO dashboards and alerting.
How it works
- Apps expose `/metrics` in the Prometheus text exposition format, or emit OTLP and let a collector convert.
- Prometheus scrapes each target on an interval; the TSDB compresses samples into blocks; PromQL queries them.
- Recording rules pre-aggregate expensive queries; Alertmanager routes alerts.
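As a concrete illustration of the scrape path, a `/metrics` response might look like this (metric and label names are illustrative, not from the source):

```text
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{route="/api/v1/items",code="200"} 1027
# HELP http_request_duration_seconds Request latency.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{route="/api/v1/items",le="0.1"} 940
http_request_duration_seconds_bucket{route="/api/v1/items",le="+Inf"} 1027
http_request_duration_seconds_sum{route="/api/v1/items"} 53.4
http_request_duration_seconds_count{route="/api/v1/items"} 1027
```

Counters and histogram buckets only ever increase between restarts, which is what lets PromQL compute rates and quantiles from periodic scrapes.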
Analogy: A hospital vitals monitor: heart rate (gauge), total patients admitted today (counter), response time histogram = blood pressure distribution over time windows.
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Pod exporters /metrics | Exposes Prometheus text format counters/gauges/histograms. | Standard instrumentation libraries across languages. |
| Prometheus HA + Thanos | Scrapes + long-term storage + global query. | HA avoids single scraper failure; Thanos extends retention cheaply via object storage. |
| Alertmanager | Routes alerts by severity/team/on-call schedule. | Dedupes noisy alerts; supports PagerDuty/Slack receivers. |
| Thanos Query / Store | Global read path over historical blocks. | Multi-cluster SLO views without federated Prometheus complexity. |
| Grafana | Dashboards + alert UI. | De facto visualization for PromQL. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Cardinality bomb
- Problem: `http_requests_total{user_id="..."}` explodes the series count.
- Fix: aggregate in the app to `http_requests_total{route}` only; use tracing for per-user detail.
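To spot a cardinality bomb in a live Prometheus, a standard diagnostic is to rank metric names by active series count (a sketch to run in the PromQL console):

```promql
# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))
```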
Histograms vs summaries
- Histogram (client-chosen buckets) is better for SLI math in PromQL (`histogram_quantile`).
- Summary (client-computed quantiles) cannot be meaningfully aggregated across instances; avoid it for global SLOs.
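The reason histograms aggregate and summaries do not: bucket counts are plain sums, so you can merge buckets from many instances and only then compute a quantile. A minimal Python sketch of the bucket-interpolation idea behind `histogram_quantile` (simplified; real PromQL handles `+Inf` and edge cases more carefully, and the bucket values are invented for illustration):

```python
import math

def histogram_quantile(q, buckets):
    """Simplified PromQL histogram_quantile(): linearly interpolate
    inside the first bucket whose cumulative count reaches rank q*total.
    buckets: sorted [(upper_bound, cumulative_count)], last bound = +inf."""
    total = buckets[-1][1]
    rank = q * total
    lower, prev = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                return lower  # quantile lands in the +Inf bucket
            return lower + (upper - lower) * (rank - prev) / (count - prev)
        lower, prev = upper, count

# Bucket counts are additive, so merging two instances is a per-`le` sum;
# that is why histograms support global SLOs and summaries do not.
inst_a = {0.1: 50, 0.5: 90, math.inf: 100}
inst_b = {0.1: 10, 0.5: 40, math.inf: 60}
merged = sorted((le, inst_a[le] + inst_b[le]) for le in inst_a)
p50 = histogram_quantile(0.5, merged)  # ~0.214s across both instances
```

Averaging two per-instance p50 values would not give this answer; only the merged buckets do.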
HA Prometheus
- Two replicas scraping the same targets doubles scrape and storage I/O; deduplicate replicas at query time (Thanos Query), shard targets with hashmod relabeling, or use Thanos Receive for an HA remote-write path.
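Hashmod sharding splits targets deterministically between replicas via relabeling. A sketch (the job name and modulus are illustrative; each replica keeps a different `regex` value):

```yaml
scrape_configs:
  - job_name: pods            # illustrative job name
    relabel_configs:
      - source_labels: [__address__]
        modulus: 2            # number of Prometheus shards
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"            # this replica keeps shard 0; the other keeps "1"
        action: keep
```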
Remote write
- Grafana Mimir, Cortex, AWS AMP for managed long retention + multi-tenant.
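Shipping samples to such a backend is a `remote_write` stanza in `prometheus.yml` (the URL is a placeholder; authentication depends on the backend):

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push  # placeholder endpoint
    queue_config:
      max_samples_per_send: 2000  # batch size; tune for throughput vs latency
```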
E2E: alert on burn rate
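A multiwindow burn-rate alert for a 99.9% availability SLO can be sketched as a Prometheus rule (the metric names and the 14.4x fast-burn factor follow the common SRE-workbook pattern; both are assumptions, not from the source):

```yaml
groups:
  - name: slo-burn
    rules:
      - alert: HighErrorBudgetBurn
        # Fast burn: 14.4x the allowed error rate over both 1h and 5m,
        # which exhausts ~2% of a 30-day budget in an hour.
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          and
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
        labels:
          severity: page
```

Requiring both windows keeps the alert fast to fire yet quick to resolve once the error rate drops.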
Tricky parts
| Problem | Solution |
|---|---|
| Missed scrapes | Alert on `up == 0`; counter restarts reset values to zero, so compute deltas with `rate()`/`increase()` over windows (both handle resets) rather than raw subtraction |
| Clock skew | Set external labels per replica; use `honor_timestamps` with caution when exporters stamp their own samples |
| High churn pods | Use relabeling (`metric_relabel_configs`, `label_replace`) to drop ephemeral pod-name labels that drive cardinality |
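The pod-churn fix can be made concrete with a scrape-time relabel that drops the offending label after ingestion (a sketch; `pod` is the label name assumed here):

```yaml
metric_relabel_configs:
  - regex: pod            # drop the ephemeral pod-name label after scraping
    action: labeldrop
```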
Caveats
- Logs vs metrics: metrics for aggregates; never log histogram buckets per request as log lines.
- Cost: downsample after 30d in Mimir compactor.
Azure
- Azure Monitor managed Prometheus; Grafana Azure plugin.