SuryanandHome

Distributed Message Queue (Kafka / Pulsar–class concepts)

Problem statement

Provide durable, ordered, replayable streams for event-driven architectures with consumer groups, backpressure, and exactly-once-ish processing.

How it works

  • Producers append to partitioned log; consumers read with offset cursor.
  • Consumer group coordinates partition assignment (rebalance protocol).

Analogy: Netflix watch history as an append-only tape; each family member (consumer) has their own bookmark (offset) but shares account rules (consumer group).

High-level design

Rendering diagram…

Components explained — this design

ComponentWhat it isWhy we use it here
ProducersServices appending events/commands.Kafka handles higher throughput than many RDBMS outboxes alone.
Broker clusterPartitioned log storage + replication.Durability + ordering within partition + horizontal read scaling via consumer groups.
Consumer groups A/BIndependent read positions on same topic.Multiple downstream systems process same events without interfering offsets.
S3 tiered storageOlder log segments archived cheaply.Retention compliance vs cost tradeoff (Kafka tiered storage pattern).
Controller (KRaft/ZK)Cluster metadata management.Leader election for partitions; operational complexity driver.

Shared definitions: 00-glossary-common-services.md

Low-level design

Partitioning key

  • Choose key with business locality (order_id) to preserve order per entity.
  • Bad key: random → no ordering guarantee across messages about same entity.

Delivery semantics

  • At-most-once: commit offset before processing — may lose on crash.
  • At-least-once: process then commit — may duplicate; handlers idempotent.
  • Exactly-once: Kafka transactions + idempotent producer + read-process-write same txn — complex, latency cost.

Schema evolution

  • Confluent Schema Registry with Avro/Protobuf; BACKWARD compatibility default.

Monitoring

  • Consumer lag per partition → alert; burrow / Kafka Lag Exporter.

E2E: happy path consume

Rendering diagram…

Tricky parts

ProblemSolution
Rebalance stormSticky assignor; cooperative rebalance
Poison pill messageDLQ topic + skip with alert
Hot partitionSplit domain or accept ordering loss across sub-keys

Caveats

  • Kafka not great as task queue with per-message TTL — use SQS / RabbitMQ for job queue semantics.
  • Zookeeper mode deprecated — plan KRaft.

Azure / GCP

  • Event Hubs Kafka endpoint; Pub/Sub ordering with ordering keys; Pulsar geo-replication.