Centralized Log Aggregation (ELK / OpenSearch)

Problem statement

Ingest terabytes/day of structured logs from containers, VMs, and lambdas; enable search, dashboards, alerts, and retention tiers.

How it works

Agents (Fluent Bit, Filebeat) tail log files or receive stdout, then ship records to the indexer, optionally through a buffer (Kafka).
Indexer parses JSON, applies schema, writes to search cluster.
Hot/warm/cold tiers move old indices to cheaper storage (S3 + searchable snapshots).
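
The indexer step above (parse JSON, apply schema) can be sketched roughly as follows. This is a minimal illustration, not the behavior of Logstash or Data Prepper: the function name `normalize_record` is hypothetical, and treating timestamps without an offset as UTC is an assumption of the sketch.

```python
import json
from datetime import datetime, timezone

# Mandatory fields, as listed under "Schema" in this document.
REQUIRED_FIELDS = ("timestamp", "service", "level", "trace_id")


def normalize_record(raw_line: str) -> dict:
    """Parse one JSON log line and normalize it before indexing."""
    record = json.loads(raw_line)

    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    # Parse an ISO 8601 timestamp. A trailing "Z" is only accepted by
    # datetime.fromisoformat on Python 3.11+.
    ts = datetime.fromisoformat(record["timestamp"])
    if ts.tzinfo is None:
        # Assumption of this sketch: naive timestamps are already UTC.
        ts = ts.replace(tzinfo=timezone.utc)

    # Store UTC only; localization is left to the UI.
    record["timestamp"] = ts.astimezone(timezone.utc).isoformat()
    record["level"] = str(record["level"]).upper()
    return record
```

Records that fail validation would typically be routed to a dead-letter index or topic rather than dropped silently; that routing is left out here.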

Analogy: Every store in a chain drops receipts into a central warehouse where auditors can search by SKU across all branches.

High-level design


Components explained — this design

Component	What it is	Why we use it here
Fluent Bit DaemonSet	Lightweight log forwarder on each K8s node.	Avoids per-pod sidecar cost; tails container stdout files.
Kafka buffer (optional)	Shock absorber between agents and indexers.	Protects OpenSearch from ingest spikes during deploys or incidents.
Logstash / Data Prepper	Parse, enrich, route logs.	Converts unstructured lines into schema-conformant JSON; applies PII scrubbing rules.
OpenSearch cluster	Searchable log storage + dashboards.	Ops teams query by `trace_id`, `service`, `level` interactively.
S3 snapshots / Glacier	Cheap long-term retention.	Compliance retention without keeping everything in hot SSD indices.
Alerting	Rules on log patterns or counts.	Pages on error spikes without predefining every metric series.

Shared definitions: 00-glossary-common-services.md

Low-level design

Ingestion

Kubernetes: one agent pod per node (DaemonSet) beats running a sidecar in every pod.
AWS: Firehose → OpenSearch Serverless for minimal ops.

Schema

JSON lines with mandatory fields: timestamp, service, level, trace_id.
Index template maps timestamp as date; map high-cardinality IDs (e.g. trace_id) as keyword, not text.
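
A sketch of what such an index template body could look like, written as a Python dict that would be sent to `PUT _index_template/<name>`. The `logs-*` pattern follows the daily index naming used in this document; the `message` field and the helper `text_mapped_fields` are assumptions added for illustration.

```python
# Composable index template body for log indices (illustrative values).
LOG_INDEX_TEMPLATE = {
    "index_patterns": ["logs-*"],
    "template": {
        "mappings": {
            "properties": {
                "timestamp": {"type": "date"},
                # keyword = exact-match, not analyzed; suited to IDs and enums.
                "service": {"type": "keyword"},
                "level": {"type": "keyword"},
                "trace_id": {"type": "keyword"},
                # Hypothetical free-text field; the only one worth analyzing.
                "message": {"type": "text"},
            }
        }
    },
}


def text_mapped_fields(template: dict) -> list:
    """Return the names of fields mapped as analyzed text, sorted."""
    props = template["template"]["mappings"]["properties"]
    return sorted(name for name, spec in props.items() if spec.get("type") == "text")
```

A check like `text_mapped_fields(LOG_INDEX_TEMPLATE)` can run in CI to catch an ID field that was accidentally mapped as text.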

Cost control

Index-per-day pattern (logs-2025.04.25): whole indices are easy to delete after 14 days in the hot tier.
Ingest pipelines drop healthcheck noise (path:/healthz filter).
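
The two cost controls above can be sketched as plain functions. This is an illustration under stated assumptions: the function names are hypothetical, the `path` field is assumed to be present on request logs, and in a real cluster the retention step would usually be handled by an index lifecycle policy rather than custom code.

```python
from datetime import date, datetime, timedelta

HOT_RETENTION_DAYS = 14  # the 14d hot window mentioned above


def index_name(day: date, prefix: str = "logs") -> str:
    """Build the daily index name, e.g. logs-2025.04.25."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indices(existing: list, today: date, prefix: str = "logs",
                    retention_days: int = HOT_RETENTION_DAYS) -> list:
    """Return the daily indices dated before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in existing:
        try:
            day = datetime.strptime(name, f"{prefix}-%Y.%m.%d").date()
        except ValueError:
            continue  # not one of our daily log indices; leave it alone
        if day < cutoff:
            expired.append(name)
    return expired


def is_healthcheck(record: dict) -> bool:
    """True for records that should be dropped as healthcheck noise."""
    return record.get("path") == "/healthz"
```

Because each day lives in its own index, expiry is a cheap index deletion instead of a delete-by-query over a large shared index.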

Security

Fine-grained access control in OpenSearch; KMS encryption at rest; VPC private endpoints.

E2E: error spike alert

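
As a rough illustration of the alerting step in this flow, the sketch below counts ERROR records in a sliding time window and fires once the count reaches a threshold. It only shows the threshold logic; in an OpenSearch deployment this would normally be an alerting monitor running a query on a schedule. The class name, window size, and threshold are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class ErrorSpikeDetector:
    """Sliding-window counter over ERROR-level log records."""

    def __init__(self, window_seconds: float = 60.0, threshold: int = 100):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._error_times = deque()  # timestamps of recent ERROR records

    def observe(self, level: str, ts: float) -> bool:
        """Record one log event; return True while the spike condition holds."""
        if level.upper() == "ERROR":
            self._error_times.append(ts)

        # Evict errors that have fallen out of the window ending at ts.
        horizon = ts - self.window_seconds
        while self._error_times and self._error_times[0] <= horizon:
            self._error_times.popleft()

        return len(self._error_times) >= self.threshold
```

A caller would page on the transition from False to True rather than on every True result, to avoid repeated pages during one incident.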

Tricky parts

Problem	Solution
Bursty logs DDoS your own pipeline	Rate limit per pod; sampling at agent
Multiline stack traces	Multiline parser in Fluent Bit (multiline.parser)
Timezone mess	UTC only in storage; localize in UI
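
The per-pod rate limit in the table above can be sketched as a token bucket keyed by pod name. This is a generic illustration, not Fluent Bit's implementation; the class names and the rate and burst parameters are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allows `rate_per_sec` records per second, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst  # start full so a short burst is accepted
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over budget: caller drops or samples this record


class PerPodLimiter:
    """One independent token bucket per pod."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.burst = burst
        self._buckets = {}

    def allow(self, pod: str, now: float) -> bool:
        bucket = self._buckets.get(pod)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, now)
            self._buckets[pod] = bucket
        return bucket.allow(now)
```

Keeping the buckets independent means one noisy pod exhausts only its own budget and cannot starve logs from its neighbors.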

Caveats

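Search is not OLAP: send heavy numeric analytics to a warehouse (Snowflake) via batch export.
Reindexing is painful: plan for mapping changes by writing through an alias (logs_write) that points at the physical index, so the index can be swapped without touching producers.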
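
A sketch of the alias swap: the function below builds a request body for `POST /_aliases` that moves the write alias from the old physical index to a new one in a single call. The index names `logs-v1` and `logs-v2` in the usage note are hypothetical, as is the function name.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build an _aliases request body that repoints `alias` to `new_index`."""
    return {
        "actions": [
            # Both actions are submitted together, so writers never see
            # a moment where the alias points at no index.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias, "is_write_index": True}},
        ]
    }
```

For example, `alias_swap_actions("logs_write", "logs-v1", "logs-v2")` yields the body to send after `logs-v2` has been created with the new mapping; producers keep writing to `logs_write` throughout.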

Azure mapping

Azure Monitor Logs (Log Analytics) for search and dashboards; Event Hubs for ingestion; Azure Data Explorer (Kusto) for high-scale analytics.