Centralized Log Aggregation (ELK / OpenSearch)

Problem statement

Ingest terabytes per day of structured logs from containers, VMs, and Lambda functions, and support search, dashboards, alerts, and tiered retention.

How it works

  • Agents (Fluent Bit, Filebeat) tail log files or receive stdout, then ship records to the indexer, optionally through a buffer (Kafka).
  • The indexer parses JSON, applies the schema, and writes to the search cluster.
  • Hot/warm/cold tiering moves older indices to cheaper storage (S3 with searchable snapshots).
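
The tiering step can be pictured as a pure function of index age. The sketch below is illustrative only: the 14-day hot boundary comes from the cost-control notes in this document, while the 30-day warm boundary and the function name are assumptions.

```python
from datetime import date

# Assumed retention boundaries; tune per cluster.
HOT_DAYS = 14   # hot window used elsewhere in this note
WARM_DAYS = 30  # hypothetical: warm until day 30, then cold (snapshot-backed)


def tier_for_index(index_day: date, today: date) -> str:
    """Return the storage tier for a daily index created on index_day."""
    age = (today - index_day).days
    if age < HOT_DAYS:
        return "hot"
    if age < WARM_DAYS:
        return "warm"
    return "cold"


if __name__ == "__main__":
    today = date(2025, 4, 25)
    for day in (date(2025, 4, 20), date(2025, 4, 1), date(2025, 3, 1)):
        print(day.isoformat(), tier_for_index(day, today))
```

In practice the cluster's own lifecycle policies would perform the move; the function only shows the decision being made.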

Analogy: Every store in a chain drops receipts into a central warehouse where auditors can search by SKU across all branches.

High-level design

Components explained — this design

Component | What it is | Why we use it here
Fluent Bit DaemonSet | Lightweight log forwarder on each K8s node. | Avoids per-pod sidecar cost; tails container stdout files.
Kafka buffer (optional) | Shock absorber between agents and indexers. | Protects OpenSearch from ingest spikes during deploys or incidents.
Logstash / Data Prepper | Parse, enrich, route logs. | Converts unstructured lines to JSON schema; applies PII scrubbing rules.
OpenSearch cluster | Searchable log storage + dashboards. | Ops teams query by trace_id, service, level interactively.
S3 snapshots / Glacier | Cheap long-term retention. | Compliance retention without keeping everything in hot SSD indices.
Alerting | Rules on log patterns or counts. | Error spike pages without predefining every metric series.
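
As a rough illustration of the PII scrubbing done in the parse/enrich stage, the sketch below masks email addresses and IPv4 addresses with regular expressions. The two rules, the placeholder tokens, and the function name are assumptions made for the example; a real rule set would be reviewed against the organization's own data.

```python
import re

# Hypothetical scrubbing rules; not an exhaustive PII definition.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(message: str) -> str:
    """Replace email addresses and IPv4 addresses with fixed placeholders."""
    message = EMAIL.sub("<email>", message)
    return IPV4.sub("<ip>", message)


if __name__ == "__main__":
    print(scrub("login failed for bob@example.com from 10.1.2.3"))
```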

Shared definitions: 00-glossary-common-services.md

Low-level design

Ingestion

  • Kubernetes: one agent per node (DaemonSet) avoids an explosion of per-pod sidecars.
  • AWS: Firehose → OpenSearch Serverless for minimal ops.

Schema

  • JSON lines with mandatory fields: timestamp, service, level, trace_id.
  • Index template maps timestamp as date; map high-cardinality IDs such as trace_id as keyword rather than text.
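
A minimal sketch of the schema rules above, under stated assumptions: the template body follows the composable index template shape (the body sent to `_index_template/<name>`), the `logs-*` pattern and the `message` field are illustrative, and `parse_log_line` is a hypothetical helper that enforces the mandatory fields and normalizes the timestamp to UTC.

```python
import json
from datetime import datetime, timezone

REQUIRED_FIELDS = ("timestamp", "service", "level", "trace_id")

# Index template body; pattern and the extra `message` field are assumptions.
LOGS_TEMPLATE = {
    "index_patterns": ["logs-*"],
    "template": {
        "mappings": {
            "properties": {
                "timestamp": {"type": "date"},
                "service": {"type": "keyword"},
                "level": {"type": "keyword"},
                "trace_id": {"type": "keyword"},  # exact match only, no text analysis
                "message": {"type": "text"},
            }
        }
    },
}


def parse_log_line(line: str) -> dict:
    """Parse one JSON line, enforce mandatory fields, store the timestamp in UTC."""
    record = json.loads(line)
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Note: a trailing "Z" is only accepted by fromisoformat on Python 3.11+.
    ts = datetime.fromisoformat(record["timestamp"])
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    record["timestamp"] = ts.astimezone(timezone.utc).isoformat()
    return record


if __name__ == "__main__":
    line = ('{"timestamp": "2025-04-25T10:00:00+05:30", "service": "api", '
            '"level": "ERROR", "trace_id": "abc"}')
    print(parse_log_line(line))
```

Rejecting records at parse time keeps malformed documents from widening the mapping by accident.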

Cost control

  • Index-per-day pattern (logs-2025.04.25): a whole day's index is easy to delete after its 14 days on the hot tier.
  • Ingest pipelines drop health-check noise (for example, a filter on path:/healthz).
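
The two cost controls above can be sketched in a few lines. The helper names and the `path` field name are assumptions; the index naming and the 14-day window follow the bullets.

```python
from datetime import date, timedelta
from typing import Iterable, List

HOT_RETENTION_DAYS = 14


def index_name(day: date) -> str:
    """Daily index name, e.g. logs-2025.04.25."""
    return f"logs-{day:%Y.%m.%d}"


def expired_indices(existing: Iterable[str], today: date) -> List[str]:
    """Daily indices that are 14 or more days old (past the hot window)."""
    cutoff = index_name(today - timedelta(days=HOT_RETENTION_DAYS))
    # Zero-padded yyyy.MM.dd names sort chronologically, so string comparison works.
    return sorted(name for name in existing if name <= cutoff)


def keep_event(event: dict) -> bool:
    """Drop health-check noise before it is indexed."""
    return event.get("path") != "/healthz"


if __name__ == "__main__":
    names = ["logs-2025.04.25", "logs-2025.04.11", "logs-2025.04.10", "logs-2025.03.30"]
    print(expired_indices(names, date(2025, 4, 25)))
    print(keep_event({"path": "/healthz"}), keep_event({"path": "/orders"}))
```

Because each day lives in its own index, expiry is a whole-index delete rather than a delete-by-query over documents.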

Security

  • Fine-grained access control in OpenSearch; KMS encryption at rest; VPC private endpoints.

E2E: error spike alert
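
The core of an error spike rule is a count over a sliding time window. The sketch below shows that logic in-process; the class name, threshold, and window size are hypothetical. A deployed alert would more likely be a scheduled count query against the cluster, so treat this as an illustration of the rule rather than of any alerting product.

```python
from collections import deque


class ErrorSpikeDetector:
    """Fire when more than `threshold` ERROR logs arrive within `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._times = deque()  # timestamps of recent ERROR events

    def observe(self, level: str, ts: float) -> bool:
        """Feed one event (epoch seconds, non-decreasing); True means page someone."""
        if level != "ERROR":
            return False
        self._times.append(ts)
        # Evict errors that have fallen out of the window.
        while self._times and self._times[0] <= ts - self.window_s:
            self._times.popleft()
        return len(self._times) > self.threshold


if __name__ == "__main__":
    detector = ErrorSpikeDetector(threshold=3, window_s=60)
    for ts in (0, 10, 20, 30, 100):
        print(ts, detector.observe("ERROR", ts))
```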

Tricky parts

Problem | Solution
Bursty logs overload your own cluster (self-inflicted DoS) | Rate limit per pod; sampling at the agent
Multiline stack traces | Multiline parser in Fluent Bit (joins continuation lines into one record)
Timezone mess | UTC only in storage; localize in the UI
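
To make the multiline problem concrete, here is a small sketch of what a multiline parser does: lines that do not look like the start of a new record are appended to the previous one. The start-of-record pattern (a leading ISO-style date) is an assumption for the example, and this is plain Python, not Fluent Bit configuration.

```python
import re
from typing import Iterable, Iterator

# Assumption: every new record starts with a date; anything else is a continuation.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]")


def join_multiline(lines: Iterable[str]) -> Iterator[str]:
    """Group continuation lines (such as stack trace frames) with their first line."""
    buffer = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) and buffer:
            yield "\n".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "\n".join(buffer)


if __name__ == "__main__":
    raw = [
        "2025-04-25T10:00:00 ERROR boom",
        "Traceback (most recent call last):",
        '  File "app.py", line 1, in <module>',
        "2025-04-25T10:00:01 INFO recovered",
    ]
    for record in join_multiline(raw):
        print(repr(record))
```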

Caveats

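  • Search is not OLAP: send heavy numeric analytics to a warehouse (e.g. Snowflake) via batch export.
  • Reindexing is painful: plan for mapping changes by writing through an alias (logs_write → physical index), so a new index can be swapped in without changing producers.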
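
The alias indirection can be sketched as the request body for the `_aliases` endpoint, which applies its actions atomically. The physical index names `logs-v1` and `logs-v2` and the helper name are hypothetical; only the `logs_write` alias comes from the text above.

```python
import json


def swap_write_alias(alias: str, old_index: str, new_index: str) -> dict:
    """Build a POST /_aliases body that moves the write alias to a new index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias, "is_write_index": True}},
        ]
    }


if __name__ == "__main__":
    body = swap_write_alias("logs_write", "logs-v1", "logs-v2")
    print(json.dumps(body, indent=2))
```

Producers keep writing to logs_write throughout; only the alias target changes once the new index with the corrected mapping is ready.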

Azure mapping

  • Azure Monitor Logs (Log Analytics) for search; Event Hubs for ingestion; Azure Data Explorer (Kusto) for high-scale analytics.