
Metrics & Monitoring (Prometheus / Grafana)

Problem statement

Collect time-series metrics (counters, gauges, histograms) from thousands of pods, store efficiently, power SLO dashboards and alerting.

How it works

  • Apps expose a /metrics endpoint in the Prometheus text exposition format, or emit OTLP and let a collector convert it.
  • Prometheus scrapes on interval; TSDB compresses blocks; PromQL queries.
  • Recording rules pre-aggregate expensive queries; Alertmanager routes alerts.
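The exposition step above can be sketched with stdlib Python; the metric names and HELP/TYPE text below are illustrative, not from a real service:

```python
# Minimal sketch of the Prometheus text exposition format an app serves at /metrics.
# Metric names and values are illustrative; real apps use a client library.

def render_metrics(requests_total: int, inflight: int) -> str:
    lines = [
        "# HELP http_requests_total Total HTTP requests served.",
        "# TYPE http_requests_total counter",
        f'http_requests_total{{route="/api"}} {requests_total}',
        "# HELP http_inflight_requests Requests currently being handled.",
        "# TYPE http_inflight_requests gauge",
        f"http_inflight_requests {inflight}",
    ]
    return "\n".join(lines) + "\n"

print(render_metrics(1042, 3))
```

Prometheus scrapes exactly this text on its interval; counters only go up (until restart), gauges move freely.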

Analogy: A hospital vitals monitor: heart rate (gauge), total patients admitted today (counter), response time histogram = blood pressure distribution over time windows.

High-level design

(diagram omitted)

Components explained — this design

Component — what it is — why we use it here:

  • Pod exporters (/metrics) — expose Prometheus text-format counters/gauges/histograms — standard instrumentation libraries exist across languages.
  • Prometheus HA + Thanos — scraping, long-term storage, and global query — HA avoids a single-scraper failure; Thanos extends retention cheaply via object storage.
  • Alertmanager — routes alerts by severity/team/on-call schedule — dedupes noisy alerts; supports PagerDuty/Slack receivers.
  • Thanos Query / Store — global read path over historical blocks — multi-cluster SLO views without federated-Prometheus complexity.
  • Grafana — dashboards + alert UI — the de facto visualization layer for PromQL.

Shared definitions: 00-glossary-common-services.md

Low-level design

Cardinality bomb

  • Problem: http_requests_total{user_id="..."} explodes series count.
  • Fix: aggregate at app to http_requests_total{route} only; use tracing for per-user.
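The series explosion is just combinatorics over label values; a stdlib sketch (metric and label names hypothetical) makes the difference concrete:

```python
# Illustration of the cardinality bomb: each unique (metric, label-set) pair
# is a separate time series in the TSDB. user_id multiplies series without
# bound; route alone stays constant no matter how many users arrive.

def series_count(metric: str, label_sets: list[dict]) -> int:
    return len({(metric, tuple(sorted(ls.items()))) for ls in label_sets})

bad = [{"route": "/api", "user_id": str(u)} for u in range(10_000)]
good = [{"route": "/api"} for _ in range(10_000)]

print(series_count("http_requests_total", bad))   # 10000 series
print(series_count("http_requests_total", good))  # 1 series
```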

Histograms vs summaries

  • Histogram (client-chosen buckets) better for SLI math in PromQL (histogram_quantile).
  • Summary (client quantiles) harder to aggregate across instances — avoid for global SLO.
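The math behind PromQL's histogram_quantile() is worth seeing once; this is a simplified stdlib sketch (the bucket data is made up, and real PromQL operates on rates of _bucket series):

```python
# Sketch of histogram_quantile(): find the first cumulative bucket whose count
# reaches the target rank, then linearly interpolate inside that bucket.
# Buckets are (upper_bound, cumulative_count) pairs, as in _bucket{le="..."}.

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    total = buckets[-1][1]           # count in the +Inf bucket
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound    # quantile falls in the open last bucket
            span = count - prev_count
            frac = (rank - prev_count) / span if span else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

buckets = [(0.1, 40.0), (0.5, 90.0), (1.0, 99.0), (float("inf"), 100.0)]
print(histogram_quantile(0.9, buckets))  # 0.5
```

Because buckets are just cumulative counters, they sum cleanly across instances before the quantile is taken — exactly what client-side summary quantiles cannot do.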

HA Prometheus

  • Two identical replicas scraping the same targets doubles ingest: either deduplicate by replica label at query time (Thanos Querier), shard scrape targets with hashmod relabeling, or use Thanos Receive for an HA remote-write path.

Remote write

  • Grafana Mimir, Cortex, AWS AMP for managed long retention + multi-tenant.

E2E: alert on burn rate

(diagram omitted)
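The arithmetic behind burn-rate alerting, as a stdlib sketch (the 99.9% SLO and the 14.4x / 1h+5m fast-burn thresholds are the common multiwindow convention, not values from this design):

```python
# Burn rate = observed error ratio / error budget. A multiwindow rule fires
# only when both a long (1h) and short (5m) window exceed the threshold, so
# it pages while budget is actively burning and clears quickly afterward.

SLO = 0.999                  # 99.9% availability target (illustrative)
ERROR_BUDGET = 1 - SLO       # 0.1% of requests may fail

def burn_rate(errors: float, requests: float) -> float:
    return (errors / requests) / ERROR_BUDGET

def fast_burn_alert(err_1h, req_1h, err_5m, req_5m, threshold=14.4) -> bool:
    return (burn_rate(err_1h, req_1h) > threshold
            and burn_rate(err_5m, req_5m) > threshold)

# 2% errors in both windows = 20x burn; a 30-day budget would last ~1.5 days.
print(fast_burn_alert(err_1h=200, req_1h=10_000, err_5m=20, req_5m=1_000))  # True
```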

Tricky parts

Problem — solution:

  • Missed scrapes — alert on up == 0; counters reset on process restart, so compute deltas with rate()/increase() over windows rather than raw differences.
  • Clock skew — set external labels; treat honor_timestamps with caution.
  • High-churn pods — use label_replace / relabel rules to drop ephemeral pod names from high-cardinality series.
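Why rate() is safe across counter restarts can be shown in a few lines; this simplified sketch captures the reset handling (real PromQL also extrapolates to the window boundaries):

```python
# When a counter restarts, its value drops. rate() detects the drop, treats it
# as a reset, and counts the post-reset value as fresh increase instead of a
# huge negative delta. The sample data below is made up.

def reset_aware_rate(samples: list[tuple[float, float]]) -> float:
    """samples: (timestamp_seconds, counter_value), oldest first."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev  # drop => counter restarted
    return total / (samples[-1][0] - samples[0][0])

# Counter restarts between t=60 and t=90 (60 -> 30), yet the rate stays sane.
samples = [(0, 0.0), (30, 30.0), (60, 60.0), (90, 30.0), (120, 60.0)]
print(reset_aware_rate(samples))  # 1.0 per second
```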

Caveats

  • Logs vs metrics: metrics for aggregates; never log histogram buckets per request as log lines.
  • Cost: downsample after 30d in Mimir compactor.

Azure

  • Azure Monitor managed Prometheus; Grafana Azure plugin.