Distributed Message Queue (Kafka / Pulsar–class concepts)
Problem statement
Provide durable, ordered, replayable streams for event-driven architectures with consumer groups, backpressure, and exactly-once-ish processing.
How it works
- Producers append to partitioned log; consumers read with offset cursor.
- Consumer group coordinates partition assignment (rebalance protocol).
Analogy: Netflix watch history as an append-only tape; each family member (consumer) has their own bookmark (offset) but shares account rules (consumer group).
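The append-only log, per-group offsets, and key-based partition routing can be sketched in-memory (Python; class and method names are illustrative, not a real Kafka client API):

```python
# Minimal in-memory sketch of a partitioned log with per-consumer-group
# offset bookmarks (illustrative; real brokers persist and replicate this).
from collections import defaultdict

class Topic:
    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]
        # offsets[group][partition] -> next offset that group will read
        self.offsets = defaultdict(lambda: defaultdict(int))

    def produce(self, key: str, value: str) -> int:
        # Same key always routes to the same partition: per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p

    def poll(self, group: str, partition: int, max_records: int = 10):
        start = self.offsets[group][partition]
        return self.partitions[partition][start:start + max_records], start

    def commit(self, group: str, partition: int, new_offset: int):
        self.offsets[group][partition] = new_offset

topic = Topic(num_partitions=4)
p = topic.produce("order-42", "created")
topic.produce("order-42", "paid")

# Two groups share the log but keep independent bookmarks.
recs_a, start_a = topic.poll("group-A", p)
topic.commit("group-A", p, start_a + len(recs_a))
recs_b, _ = topic.poll("group-B", p)
```

Note that group-B sees the full history even after group-A commits: consuming does not delete, which is what makes the stream replayable.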
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Producers | Services appending events/commands. | Kafka sustains far higher write throughput than an RDBMS outbox table alone. |
| Broker cluster | Partitioned log storage + replication. | Durability + ordering within partition + horizontal read scaling via consumer groups. |
| Consumer groups A/B | Independent read positions on same topic. | Multiple downstream systems process same events without interfering offsets. |
| S3 tiered storage | Older log segments archived cheaply. | Retention compliance vs cost tradeoff (Kafka tiered storage pattern). |
| Controller (KRaft/ZK) | Cluster metadata management. | Leader election for partitions; operational complexity driver. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Partitioning key
- Choose a key with business locality (e.g. order_id) to preserve order per entity.
- Bad key: random → no ordering guarantee across messages about the same entity.
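A key-based partitioner reduces to "hash the key, mod partition count". Kafka's default partitioner uses murmur2 over the key bytes; the sketch below uses md5 only to stay stdlib-stable across runs:

```python
# Illustrative key-based partitioner. Kafka's default applies murmur2 to
# the serialized key; md5 here is just a stable stand-in for the sketch.
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one order_id land on one partition -> ordered per order.
assert partition_for("order-42", 6) == partition_for("order-42", 6)
```

Caveat implied by the mod: changing the partition count remaps keys, so plan partition counts up front if per-key ordering matters.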
Delivery semantics
- At-most-once: commit offset before processing — may lose on crash.
- At-least-once: process then commit — may duplicate; handlers idempotent.
- Exactly-once: Kafka transactions + idempotent producer + read-process-write same txn — complex, latency cost.
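The at-least-once pattern (process, then commit) plus an idempotent handler is the common middle ground; a sketch with illustrative names, where a redelivered message id is deduplicated:

```python
# Sketch: at-least-once consumption with an idempotent handler.
# Commit AFTER processing, so a crash causes replay (duplicates), never loss;
# the handler's dedup makes the replay harmless.
processed = {}

def idempotent_handle(msg_id: str, payload: str) -> bool:
    if msg_id in processed:      # duplicate from redelivery: skip
        return False
    processed[msg_id] = payload
    return True

def consume_at_least_once(records, commit):
    for offset, (msg_id, payload) in enumerate(records):
        idempotent_handle(msg_id, payload)  # 1. process first
        commit(offset + 1)                  # 2. commit only afterwards

committed = []
records = [("m1", "a"), ("m2", "b"), ("m1", "a")]  # "m1" redelivered
consume_at_least_once(records, committed.append)
```

Flipping the two steps (commit, then process) gives at-most-once: a crash between them drops the message instead of replaying it.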
Schema evolution
- Confluent Schema Registry with Avro/Protobuf; BACKWARD compatibility default.
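The BACKWARD rule means a consumer on the new schema must still decode records written with the old one, which is why added fields need defaults. A toy check, assuming schemas modeled as field→default dicts (Schema Registry enforces this for real Avro/Protobuf schemas):

```python
# Sketch of BACKWARD compatibility: new reader, old data. Fields absent
# from the record fall back to the schema default; no default => error.
OLD = {"order_id": None, "amount": None}                      # required only
NEW = {"order_id": None, "amount": None, "currency": "USD"}   # added w/ default

def decode_with(schema: dict, record: dict) -> dict:
    out = {}
    for field, default in schema.items():
        if field in record:
            out[field] = record[field]
        elif default is not None:
            out[field] = default          # backfill from schema default
        else:
            raise ValueError(f"missing required field {field}")
    return out

old_record = {"order_id": "42", "amount": 10}
decoded = decode_with(NEW, old_record)    # succeeds: BACKWARD compatible
```

Upgrade consumers before producers under BACKWARD; the reverse ordering is what FORWARD compatibility is for.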
Monitoring
- Consumer lag per partition → alert; Burrow / Kafka Lag Exporter.
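Lag per partition is just log-end offset minus the group's committed offset; this is what Burrow / Kafka Lag Exporter surface (the numbers below are made up for illustration):

```python
# Consumer lag per partition = log-end offset - committed offset.
log_end   = {0: 1500, 1: 900, 2: 4200}   # latest produced offset
committed = {0: 1480, 1: 900, 2: 1000}   # group's committed offsets

lag = {p: log_end[p] - committed[p] for p in log_end}
alerting = [p for p, l in lag.items() if l > 1000]  # threshold is per-SLO
```

Alert on sustained or growing lag rather than a single spike; a rebalance or batch catch-up briefly inflates it.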
E2E: happy path consume
Tricky parts
| Problem | Solution |
|---|---|
| Rebalance storm | Sticky assignor; cooperative rebalance |
| Poison pill message | DLQ topic + skip with alert |
| Hot partition | Split domain or accept ordering loss across sub-keys |
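The poison-pill row above can be sketched as retry-then-park: a handler that keeps failing gets routed to a DLQ topic and the offset advances so the partition keeps flowing (names are illustrative):

```python
# Sketch of poison-pill handling: bounded retries, then dead-letter and
# move on, so one bad record cannot stall the whole partition.
MAX_RETRIES = 3

def consume(records, handler, dlq):
    for record in records:
        for _ in range(MAX_RETRIES):
            try:
                handler(record)
                break                      # processed; next record
            except Exception as exc:
                last = exc
        else:                              # all retries failed: park + alert
            dlq.append((record, str(last)))

dlq = []
def handler(record):
    if record == "bad":
        raise ValueError("unparseable payload")

consume(["ok1", "bad", "ok2"], handler, dlq)
```

The DLQ entries carry the failure reason so they can be inspected and replayed after a fix, instead of being silently skipped.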
Caveats
- Kafka is not a great fit for task-queue semantics (per-message TTL, delayed delivery, per-message acks); use SQS / RabbitMQ for job queues.
- ZooKeeper mode is deprecated; plan the KRaft migration.
Azure / GCP
- Event Hubs Kafka endpoint; Pub/Sub ordering with ordering keys; Pulsar geo-replication.