Centralized Log Aggregation (ELK / OpenSearch)
Problem statement
Ingest terabytes/day of structured logs from containers, VMs, and lambdas; enable search, dashboards, alerts, and retention tiers.
How it works
- Agents (Fluent Bit, Filebeat) tail files / receive stdout → ship to an optional buffer (Kafka).
- Indexer parses JSON, applies schema, writes to search cluster.
- Hot/warm/cold tiers move old indices to cheaper storage (S3 + searchable snapshots).
Analogy: Every store in a chain drops receipts into a central warehouse where auditors can search by SKU across all branches.
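The parse step of the indexer can be sketched as follows. This is a minimal illustration, not Logstash or Data Prepper code; the function name `parse_line` and the `parse_error` marker field are assumptions made for the example.

```python
import json
from typing import Any


def parse_line(raw: str) -> dict[str, Any]:
    """Parse one log line into a document.

    Lines that are not a JSON object are wrapped instead of dropped, so
    unstructured output still reaches the search cluster.
    """
    text = raw.rstrip("\n")
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        # Hypothetical marker field so bad lines can be found later.
        return {"message": text, "parse_error": True}
    if not isinstance(doc, dict):
        # Valid JSON but not an object (for example a bare number).
        return {"message": text, "parse_error": True}
    return doc
```

Wrapping rather than discarding keeps the failure visible: a query on `parse_error` would show which services still emit unstructured lines.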
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Fluent Bit DaemonSet | Lightweight log forwarder on each K8s node. | Avoids per-pod sidecar cost; tails container stdout files. |
| Kafka buffer (optional) | Shock absorber between agents and indexers. | Protects OpenSearch from ingest spikes during deploys or incidents. |
| Logstash / Data Prepper | Parse, enrich, route logs. | Converts unstructured lines to JSON schema; applies PII scrubbing rules. |
| OpenSearch cluster | Searchable log storage + dashboards. | Ops teams query by trace_id, service, level interactively. |
| S3 snapshots / Glacier | Cheap long-term retention. | Compliance retention without keeping everything in hot SSD indices. |
| Alerting | Rules on log patterns or counts. | Pages on error spikes without predefining every metric series. |
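As an illustration of the PII scrubbing mentioned for the Logstash / Data Prepper stage, here is a small Python sketch that redacts email addresses from string fields. The regular expression, the replacement token, and the function name `scrub` are assumptions for the example; a real deployment would express such rules in the pipeline tool's own configuration and cover more PII types.

```python
import re

# Assumption: a deliberately simple email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub(doc: dict) -> dict:
    """Return a copy of doc with email addresses redacted in string values."""
    return {
        key: EMAIL_RE.sub("[REDACTED_EMAIL]", value) if isinstance(value, str) else value
        for key, value in doc.items()
    }
```

Only top-level string values are touched here; nested objects would need a recursive walk.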
Shared definitions: 00-glossary-common-services.md
Low-level design
Ingestion
- Kubernetes: DaemonSet per node beats sidecar explosion.
- AWS: Firehose → OpenSearch Serverless for minimal ops.
Schema
- JSON lines with mandatory fields: `timestamp`, `service`, `level`, `trace_id`.
- Index template mapping: `timestamp` as `date`; avoid `text` on high-cardinality IDs (map them as `keyword`).
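The schema rules above might be expressed roughly like this. The template body follows the composable index template shape used by OpenSearch; the `logs-*` pattern and the helper name `missing_fields` are assumptions for the sketch.

```python
MANDATORY_FIELDS = ("timestamp", "service", "level", "trace_id")

# Index template body: timestamp is a date, identifiers are keyword (not text)
# so they are stored as exact values instead of being analyzed.
LOGS_TEMPLATE = {
    "index_patterns": ["logs-*"],  # assumed pattern
    "template": {
        "mappings": {
            "properties": {
                "timestamp": {"type": "date"},
                "service": {"type": "keyword"},
                "level": {"type": "keyword"},
                "trace_id": {"type": "keyword"},
            }
        }
    },
}


def missing_fields(doc: dict) -> list[str]:
    """List mandatory fields that are absent or empty in a log document."""
    return [f for f in MANDATORY_FIELDS if f not in doc or doc[f] in (None, "")]
```

An indexer could reject or quarantine documents for which `missing_fields` returns a non-empty list.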
Cost control
- Index-per-day pattern `logs-2025.04.25`: easy delete after 14d hot.
- Ingest pipelines drop healthcheck noise (`path:/healthz` filter).
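A rough sketch of both cost controls follows: daily index naming with a 14-day hot window, and a healthcheck filter. The helper names are hypothetical, and in practice retention is more likely handled by the cluster's index lifecycle policies than by hand-written code.

```python
from datetime import date, datetime, timedelta

HOT_DAYS = 14  # hot retention from the text above


def index_name(day: date) -> str:
    """Daily index name, e.g. logs-2025.04.25."""
    return f"logs-{day:%Y.%m.%d}"


def index_day(name: str) -> date:
    """Recover the date from a daily index name."""
    return datetime.strptime(name, "logs-%Y.%m.%d").date()


def past_hot_window(names: list[str], today: date) -> list[str]:
    """Indices older than HOT_DAYS, i.e. candidates to delete or move off hot."""
    cutoff = today - timedelta(days=HOT_DAYS)
    return [n for n in names if index_day(n) < cutoff]


def is_healthcheck(doc: dict) -> bool:
    """True for healthcheck noise that the ingest pipeline should drop."""
    return doc.get("path") == "/healthz"
```

Because each day lives in its own index, expiring data is a whole-index delete rather than a costly delete-by-query.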
Security
- Fine-grained access control in OpenSearch; KMS encryption at rest; VPC private endpoints.
E2E: error spike alert
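The core of an error-spike rule can be sketched as a sliding-window count. This is an in-process illustration only, not the OpenSearch alerting API; the class name, window, and threshold values are assumptions.

```python
from collections import deque


class ErrorSpikeRule:
    """Fire when at least `threshold` error logs arrive within `window_seconds`."""

    def __init__(self, window_seconds: float = 60, threshold: int = 100):
        self.window = window_seconds
        self.threshold = threshold
        self._errors: deque[float] = deque()  # arrival times of error logs

    def observe(self, doc: dict, now: float) -> bool:
        """Record one log document; return True if the alert should fire."""
        if doc.get("level") == "error":
            self._errors.append(now)
        # Evict errors that have fallen out of the window.
        while self._errors and self._errors[0] <= now - self.window:
            self._errors.popleft()
        return len(self._errors) >= self.threshold
```

Passing `now` explicitly keeps the rule deterministic and easy to test; a production monitor would instead run a count query over the recent time range on a schedule.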
Tricky parts
| Problem | Solution |
|---|---|
| Bursty logs DDoS your own cluster | Rate limit per pod; sampling at agent |
| Multiline stack traces | Multiline parser in Fluent Bit concatenates continuation lines |
| Timezone mess | UTC only in storage; localize in UI |
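To show what multiline concatenation does, here is a small Python sketch of the idea (Fluent Bit implements this through its own parser configuration, not this code). It assumes a new record always starts with an ISO-style date and that any other line continues the previous record.

```python
import re

# Assumption: records begin with a date such as 2025-04-25; stack-trace
# continuation lines do not.
START_RE = re.compile(r"^\d{4}-\d{2}-\d{2}")


def join_multiline(lines: list[str]) -> list[str]:
    """Merge continuation lines into the record that precedes them."""
    records: list[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if START_RE.match(line) or not records:
            records.append(line)  # start of a new record
        else:
            records[-1] += "\n" + line  # continuation of the previous record
    return records
```

Without this step each stack-trace line would be indexed as a separate document, which breaks searching by the original error.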
Caveats
- Search is not OLAP: heavy numeric analytics → warehouse (Snowflake) via batch export.
- Reindex pain: plan mapping changes with aliases (`logs_write` → physical index).
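The alias approach can be sketched as building the request body for the `_aliases` endpoint, which applies its actions atomically. The physical index names used below (`logs-v1`, `logs-v2`) are hypothetical, and the sketch only constructs the body rather than sending it.

```python
def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that repoints a write alias in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias, "is_write_index": True}},
        ]
    }


# Example: writers keep using "logs_write" while the physical index changes.
body = alias_swap_body("logs_write", "logs-v1", "logs-v2")
```

Since producers only know the alias, a mapping change becomes: create the new index with the new mapping, swap the alias, then reindex or age out the old data.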
Azure mapping
- Azure Monitor Logs (Log Analytics); Event Hubs ingestion; Azure Data Explorer (Kusto) for high-scale analytics.