File Storage & Sync (e.g. Dropbox)
Problem statement
Users sync folders across devices with conflict resolution, version history, sharing links, and offline edits that merge later.
How it works
- Chunking: split files into content-defined or fixed-size chunks; dedupe by hash (same chunk stored once).
- Metadata tree: paths → chunk list + versions per device.
- Sync: clients poll or use long polling / WebSocket for change notifications.
Analogy: Lego bricks — many builds (files) reuse the same red 2×2 brick (a deduped chunk hash); the instruction booklet is the metadata.
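The chunk-and-dedupe idea above can be sketched in a few lines. This is a minimal in-memory model, assuming fixed-size chunks and an illustrative `store` dict standing in for blob storage (content-defined chunking differs only in how boundaries are chosen):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB fixed-size chunks (illustrative value)

def chunk_file(data: bytes, size: int = CHUNK_SIZE):
    """Split bytes into fixed-size chunks, yielding (sha256_hex, chunk)."""
    for offset in range(0, len(data), size):
        chunk = data[offset:offset + size]
        yield hashlib.sha256(chunk).hexdigest(), chunk

# Dedupe: a chunk store keyed by content hash holds each unique chunk once.
store: dict[str, bytes] = {}

def put_file(data: bytes) -> list[str]:
    """Store a file; return its ordered chunk-hash list (the metadata entry)."""
    hashes = []
    for h, chunk in chunk_file(data):
        store.setdefault(h, chunk)  # no-op if the chunk already exists
        hashes.append(h)
    return hashes

def get_file(hashes: list[str]) -> bytes:
    """Reassemble a file from its chunk-hash list."""
    return b"".join(store[h] for h in hashes)
```

Two files that share a 4 MiB prefix store that prefix chunk only once; each file's metadata is just its ordered hash list.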
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| API Gateway | Authenticated edge for metadata and presign. | Validates OAuth before returning S3 presigned URLs; prevents anonymous unbounded uploads. |
| Upload / Read services | Coordinate chunks, permissions, conflict policy. | Separation lets you tune read-heavy CDN path differently from write-heavy hashing/dedupe. |
| S3 / Blob | Durable object storage for file chunks. | Industry standard for large binary payloads; versioning helps accidental overwrite recovery. |
| PostgreSQL / DynamoDB | Metadata: paths, versions, ACLs, chunk maps. | Postgres for folder queries and sharing rules; Dynamo if metadata access is strictly key-value per file id at extreme scale. |
| CDN cache | Edge cache for public immutable chunk GETs. | Reduces download latency and origin egress costs for popular files. |
| Pub/Sub or SQS | Async jobs: virus scan, thumbnail, search index. | Keeps upload ACK fast; workers can retry independently with DLQ. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Chunk upload
- Multipart upload to S3 with presigned URLs (short TTL).
- Content hash (SHA-256) as idempotency key → skip re-upload if chunk exists.
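The hash-as-idempotency-key flow can be sketched as below. The server-side pieces (`head_chunk`, `presign_and_upload`) are hypothetical stand-ins mocked in-process; in a real system they would be a HEAD request against the chunk store and a presigned S3 PUT:

```python
import hashlib

# Hypothetical server state, mocked in-process for the sketch.
existing_chunks: set[str] = set()

def head_chunk(sha256_hex: str) -> bool:
    """Stand-in for HEAD /chunks/{hash}: does the store already have it?"""
    return sha256_hex in existing_chunks

def presign_and_upload(sha256_hex: str, chunk: bytes) -> None:
    """Stand-in for: fetch a short-TTL presigned URL, then PUT the chunk."""
    existing_chunks.add(sha256_hex)

def upload_chunk(chunk: bytes) -> tuple[str, bool]:
    """Idempotent upload: content hash is the key, so retries and
    duplicate chunks send no bytes the second time."""
    h = hashlib.sha256(chunk).hexdigest()
    if head_chunk(h):
        return h, False  # already stored; skip the upload entirely
    presign_and_upload(h, chunk)
    return h, True
```

Because the key is derived from the content, a client that crashes mid-sync can safely re-run the whole upload.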
Metadata
- PostgreSQL for folder hierarchy, ACLs, share tokens — rich transactional updates.
- Global scale: CockroachDB / Spanner for multi-region strong semantics (expensive).
Conflicts
- Last-write-wins (LWW) — simple but loses data.
- Better: vector clocks or CRDTs for text; for binary files, a "conflict copy" such as Report (conflicted).docx.
- Server: store a version vector per file; the client merges or prompts the user.
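The version-vector comparison behind the bullets above fits in one function: if each side has a counter the other lacks, the edits were concurrent and the server falls back to a conflict copy. A minimal sketch (device ids and the `(conflicted)` naming are illustrative):

```python
def compare(vv_a: dict[str, int], vv_b: dict[str, int]) -> str:
    """Compare two version vectors: 'equal', 'a_newer', 'b_newer', or 'concurrent'."""
    keys = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(k, 0) > vv_b.get(k, 0) for k in keys)
    b_ahead = any(vv_b.get(k, 0) > vv_a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "concurrent"  # offline edits diverged -> conflict copy
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

def resolve(path: str, vv_local: dict[str, int], vv_remote: dict[str, int]) -> str:
    """Keep the path if one side dominates; otherwise name a conflict copy."""
    if compare(vv_local, vv_remote) != "concurrent":
        return path
    stem, dot, ext = path.rpartition(".")
    return f"{stem} (conflicted).{ext}" if dot else f"{path} (conflicted)"
```

For example, `{"d1": 2, "d2": 1}` vs `{"d1": 1, "d2": 2}` is concurrent: each device saw an edit the other has not.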
Sharing
- Signed URLs for read-only share; OAuth scopes for “edit” collaborators.
Security
- KMS envelope encryption per tenant or per file.
- ClamAV / Macie on upload path for malware and sensitive data discovery.
E2E: two devices edit offline
Tricky parts
| Problem | Solution |
|---|---|
| Large folder listings | Pagination + ETags; virtual folders in DB |
| Small file overhead | Pack small files into bundle objects |
| Latency worldwide | Regional buckets + CRR; read-your-writes tradeoffs |
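The small-file bundling row can be sketched as follows: concatenate small files into one blob object and keep an offset index in metadata, so a single packed file is served with a ranged GET. Names and the in-memory index format are illustrative:

```python
def pack(files: dict[str, bytes]) -> tuple[bytes, dict[str, tuple[int, int]]]:
    """Concatenate small files into one bundle; index maps name -> (offset, length)."""
    blob = bytearray()
    index: dict[str, tuple[int, int]] = {}
    offset = 0
    for name, data in files.items():
        index[name] = (offset, len(data))
        blob += data
        offset += len(data)
    return bytes(blob), index

def read_one(bundle: bytes, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Stand-in for a ranged GET (bytes=offset..offset+length-1) on the bundle."""
    off, length = index[name]
    return bundle[off:off + length]
```

One bundle object amortizes the per-object overhead (request cost, metadata row, minimum billable size) across many tiny files.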
Caveats
- True E2E encryption breaks server-side dedupe — product choice.
- Deleted file retention for compliance (legal hold) conflicts with user “delete forever”.