IoT Telemetry & Device Management
Problem statement
Ingest high-frequency telemetry from millions of devices, and support downlink commands, OTA firmware updates, and rules (for example, alert if temp > X).
How it works
- MQTT broker (AWS IoT Core / Azure IoT Hub) for pub/sub over a per-device topic hierarchy.
- Hot path: Kinesis / Event Hubs → Flink aggregations → storage (TSDB / data lake).
Analogy: smart electric meters phone home every minute, so the utility needs a call center (broker) that never drops calls and a billing warehouse (data lake) for history.
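The aggregation step in the hot path can be illustrated with a tumbling-window threshold check. This is a minimal plain-Python sketch, not Flink code; the `Reading` type, its field names, the 60-second window, and the 80 °C threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    device_id: str
    ingest_ts: float  # server ingest time in seconds (assumed field name)
    temp_c: float


def window_alerts(readings, window_s=60, threshold_c=80.0):
    """Return sorted (device_id, window_start, mean_temp) tuples for every
    tumbling window whose mean temperature exceeds threshold_c."""
    sums = defaultdict(lambda: [0.0, 0])  # (device, window_start) -> [sum, count]
    for r in readings:
        # Align each reading to the start of its fixed-size window.
        start = int(r.ingest_ts // window_s) * window_s
        acc = sums[(r.device_id, start)]
        acc[0] += r.temp_c
        acc[1] += 1
    alerts = []
    for (device_id, start), (total, count) in sums.items():
        mean = total / count
        if mean > threshold_c:
            alerts.append((device_id, start, mean))
    return sorted(alerts)
```

A real stream processor would emit each window as it closes and keep state per key; this batch version only shows the windowing and threshold logic.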
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| AWS IoT Core / IoT Hub | MQTT broker + device registry + rules. | Protocol translation, per-device auth, rules without custom broker ops. |
| Rules engine SQL | Routes messages to Kinesis/S3/Lambda based on topic attrs. | Low-code routing for ops teams; still needs code review for safety. |
| Kinesis / Event Hubs | Durable high-throughput stream. | Ingests millions of messages per second, with replay for new analytics jobs. |
| Flink / Kinesis Analytics | Stateful stream processing (windows, joins). | Real-time alerts on sensor thresholds across time windows. |
| Timestream / ADX | Time-series optimized query store. | Operational dashboards faster than scanning raw Parquet. |
| S3 archive | Cheap long-term raw retention. | Regulatory retention + ML training exports. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Topic naming
tenant/{tid}/device/{did}/telemetry, with the ACL scoped per device certificate (CN = deviceId).
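The topic scheme and the per-certificate ACL can be sketched as two small helpers. The function names `telemetry_topic` and `may_publish` are hypothetical; a managed broker would normally express this check as a policy document rather than application code.

```python
import re

# Matches tenant/{tid}/device/{did}/telemetry; segments may not contain
# the MQTT separators or wildcards (/, +, #).
TOPIC_RE = re.compile(r"tenant/([^/+#]+)/device/([^/+#]+)/telemetry")


def telemetry_topic(tenant_id: str, device_id: str) -> str:
    """Build the telemetry topic, rejecting segments that would break the hierarchy."""
    for part in (tenant_id, device_id):
        if not part or any(c in part for c in "/+#"):
            raise ValueError(f"invalid topic segment: {part!r}")
    return f"tenant/{tenant_id}/device/{device_id}/telemetry"


def may_publish(cert_cn: str, topic: str) -> bool:
    """Allow a publish only when the device segment equals the certificate CN."""
    m = TOPIC_RE.fullmatch(topic)
    return m is not None and m.group(2) == cert_cn
```

For example, a device whose certificate CN is `dev42` may publish to `tenant/t1/device/dev42/telemetry` but not to another device's topic or to a wildcard topic.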
Security
- X.509 certs per device; just-in-time registration (JITR); rotate certs OTA.
- Private keys in TPM/secure element on device when possible.
Backpressure
- MQTT QoS 1 publishes can back up when the link is slow or down, so keep the device-side queue bounded and document the drop policy.
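A bounded device-side queue might look like the sketch below. It assumes a drop-oldest policy, which is only one possible choice (drop-newest or priority-based dropping are equally valid), and the `BoundedOutbox` class is hypothetical rather than part of any MQTT client library.

```python
from collections import deque


class BoundedOutbox:
    """Fixed-capacity outbound queue that drops the oldest message when full."""

    def __init__(self, max_len: int):
        if max_len <= 0:
            raise ValueError("max_len must be positive")
        self._q = deque()
        self._max_len = max_len
        self.dropped = 0  # counter to export as a device health metric

    def enqueue(self, msg) -> None:
        if len(self._q) >= self._max_len:
            self._q.popleft()  # drop-oldest policy (assumed)
            self.dropped += 1
        self._q.append(msg)

    def peek(self):
        """Return the next message to send, or None if the queue is empty."""
        return self._q[0] if self._q else None

    def ack(self) -> None:
        """Remove the head once the broker has acknowledged it (PUBACK for QoS 1)."""
        if self._q:
            self._q.popleft()

    def __len__(self) -> int:
        return len(self._q)
```

Tracking `dropped` makes the drop policy observable, so fleet dashboards can show which devices are losing data during outages.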
OTA
- Signed firmware packages in S3; the Jobs API tracks the rollout in percentage waves.
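One way to reason about percentage waves is stable hash bucketing: each device maps to a fixed bucket from 0 to 99, and a wave at N percent includes every bucket below N. This is an illustrative sketch only, not how the Jobs API selects devices; `rollout_bucket` and `in_wave` are hypothetical names.

```python
import hashlib


def rollout_bucket(device_id: str, job_id: str) -> int:
    """Map a device to a stable bucket in [0, 100) for a given rollout job."""
    digest = hashlib.sha256(f"{job_id}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_wave(device_id: str, job_id: str, percent: int) -> bool:
    """True if the device belongs to the wave covering the first `percent` buckets."""
    return rollout_bucket(device_id, job_id) < percent
```

Because the bucket is deterministic, widening a wave from 10 to 50 percent keeps every device from the earlier wave and only adds new ones. Including the job ID in the hash avoids always picking the same devices as early adopters.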
E2E: telemetry to alert
Tricky parts
| Problem | Solution |
|---|---|
| Clock skew on devices | Server ingest time is authoritative; device time is kept as metadata |
| Replay attacks | TLS + nonce in signed payloads |
| Firmware brick | A/B partitions + hardware rollback pin |
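The replay defence in the table can be sketched as a server-side check that verifies the signature and rejects any nonce already seen for that device. The sketch assumes an HMAC over a shared key as a stand-in for the payload signature (a real fleet would more likely verify against the device certificate), and the `device_id` and `nonce` field names are hypothetical.

```python
import hashlib
import hmac
import json


def sign(key: bytes, body: dict) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the payload."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


class ReplayGuard:
    """Accept each (device_id, nonce) pair at most once."""

    def __init__(self, key: bytes):
        self._key = key
        # Unbounded here for brevity; a real service would expire old nonces.
        self._seen = set()

    def accept(self, body: dict, signature: str) -> bool:
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(sign(self._key, body), signature):
            return False
        token = (body.get("device_id"), body.get("nonce"))
        if None in token or token in self._seen:
            return False
        self._seen.add(token)
        return True
```

Because the nonce is inside the signed body, an attacker who captures a message cannot change the nonce without invalidating the signature, and resending it unchanged is rejected as a duplicate.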
Caveats
- Cost at scale: use rules to downsample before the lake; pay-per-message pricing models hurt naive designs.
- Privacy: device clusters can reveal home addresses, so aggregate geospatially.
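Both caveats can be made concrete with a short sketch. The first helper keeps one reading per device per time bucket before data reaches the lake. The second snaps positions to a coarse grid and suppresses sparsely populated cells. The 60-second interval, the 0.1-degree cell size, and the minimum of 5 devices per cell are illustrative assumptions, not recommended values.

```python
import math
from collections import Counter


def downsample(readings, interval_s=60):
    """Keep the first (device_id, ts, value) per device per interval, in arrival order."""
    seen = set()
    kept = []
    for device_id, ts, value in readings:
        bucket = (device_id, int(ts // interval_s))
        if bucket not in seen:
            seen.add(bucket)
            kept.append((device_id, ts, value))
    return kept


def grid_cell(lat: float, lon: float, cell_deg: float = 0.1):
    """Snap a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def aggregate_by_cell(positions, cell_deg=0.1, min_devices=5):
    """Count devices per cell and drop cells with fewer than min_devices."""
    counts = Counter(grid_cell(lat, lon, cell_deg) for lat, lon in positions)
    return {cell: n for cell, n in counts.items() if n >= min_devices}
```

Suppressing small cells matters as much as coarsening the grid, since a cell that contains a single device still points at one household.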
Azure
- Azure IoT Hub device twins; Azure Digital Twins for modeling relationships.