
File Storage & Sync (e.g. Dropbox)

Problem statement

Users sync folders across devices with conflict resolution, version history, sharing links, and offline edits that merge later.

How it works

  • Chunking: split files into content-defined or fixed-size chunks; dedupe by hash (same chunk stored once).
  • Metadata tree: paths → chunk list + versions per device.
  • Sync: clients poll or use long polling / WebSocket for change notifications.

Analogy: like Lego bricks, many builds (files) reuse the same red 2×2 brick (a chunk, identified by its hash); the instruction booklet is the metadata.
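The chunk-and-dedupe step above can be sketched in a few lines. This is a minimal fixed-size chunking sketch (the text also mentions content-defined chunking, which is more involved); the `store` dict stands in for the blob store, and all names here are illustrative, not a real API.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB fixed-size chunks (illustrative choice)

def dedupe_store(store: dict[str, bytes], data: bytes) -> list[str]:
    """Split data into fixed-size chunks, store each chunk once by its
    SHA-256 hash, and return the file's chunk list (the metadata)."""
    hashes = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # same chunk stored once
        hashes.append(h)
    return hashes
```

Two files that share a chunk (the "red 2×2 brick") add only their unique chunks to the store; the returned hash lists are what the metadata tree records per file version.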

High-level design


Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| API Gateway | Authenticated edge for metadata and presign. | Validates OAuth before returning S3 presigned URLs; prevents anonymous unbounded uploads. |
| Upload / Read services | Coordinate chunks, permissions, conflict policy. | Separation lets you tune the read-heavy CDN path differently from the write-heavy hashing/dedupe path. |
| S3 / Blob | Durable object storage for file chunks. | Industry standard for large binary payloads; versioning helps recover from accidental overwrites. |
| PostgreSQL / DynamoDB | Metadata: paths, versions, ACLs, chunk maps. | Postgres for folder queries and sharing rules; DynamoDB if metadata access is strictly key-value per file id at extreme scale. |
| CDN cache | Edge cache for public immutable chunk GETs. | Reduces download latency and origin egress costs for popular files. |
| Pub/Sub or SQS | Async jobs: virus scan, thumbnail, search index. | Keeps the upload ACK fast; workers can retry independently with a DLQ. |

Shared definitions: 00-glossary-common-services.md

Low-level design

Chunk upload

  • Multipart upload to S3 with presigned URLs (short TTL).
  • Content hash (SHA-256) as idempotency key → skip re-upload if chunk exists.
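A client-side sketch of the dedupe check: hash every chunk, ask the server which hashes it already has, and request presigned URLs only for the rest. The `existing` set stands in for a hypothetical server-side hash index (e.g. a table keyed by SHA-256); function and parameter names are assumptions, not a real SDK.

```python
import hashlib

def plan_upload(chunks: list[bytes], existing: set[str]) -> tuple[list[str], list[int]]:
    """Return the full chunk-hash list for the file's metadata, plus the
    indices of chunks that still need uploading (and thus presigned URLs)."""
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    to_upload = [i for i, h in enumerate(hashes) if h not in existing]
    return hashes, to_upload
```

The hash doubles as the idempotency key: a retried upload of an already-stored chunk is a no-op, and the metadata write can reference the hash list regardless of which chunks were physically transferred.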

Metadata

  • PostgreSQL for folder hierarchy, ACLs, share tokens — rich transactional updates.
  • Global scale: CockroachDB / Spanner for multi-region strong semantics (expensive).

Conflicts

  • Last-write-wins (LWW) — simple but loses data.
  • Better: vector clocks or CRDT for text; for binary, “conflict copy” file Report (conflicted).docx.
  • Server: store version vector per file; client merges or prompts user.
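A minimal sketch of the version-vector comparison the server would run, assuming vectors are maps of device id → edit counter (names are illustrative): if neither vector dominates, the edits were concurrent and a conflict copy is created.

```python
def compare(vv_a: dict[str, int], vv_b: dict[str, int]) -> str:
    """Compare two version vectors: 'a<=b' (b is newer or equal),
    'b<=a' (a is newer), or 'conflict' (concurrent edits)."""
    keys = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(k, 0) <= vv_b.get(k, 0) for k in keys)
    b_le_a = all(vv_b.get(k, 0) <= vv_a.get(k, 0) for k in keys)
    if a_le_b:
        return "a<=b"
    if b_le_a:
        return "b<=a"
    return "conflict"  # fork: write "Report (conflicted).docx"
```

LWW would silently pick one side here; the vector comparison is what lets the server detect that both devices edited since the common ancestor and surface the conflict instead.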

Sharing

  • Signed URLs for read-only share; OAuth scopes for “edit” collaborators.

Security

  • KMS envelope encryption per tenant or per file.
  • ClamAV / Macie on upload path for malware and sensitive data discovery.

E2E: two devices edit offline


Tricky parts

| Problem | Solution |
| --- | --- |
| Large folder listings | Pagination + ETags; virtual folders in DB |
| Small file overhead | Pack small files into bundle objects |
| Latency worldwide | Regional buckets + CRR; read-your-writes tradeoffs |
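The "pack small files into bundle objects" row can be sketched with an in-memory bundle: concatenate small files into one object and keep an offset index, so each file is served with a byte-range read (a Range GET against the bundle object in production). All names here are illustrative.

```python
def pack(files: dict[str, bytes]) -> tuple[bytes, dict[str, tuple[int, int]]]:
    """Concatenate small files into one bundle object.
    Returns the bundle plus an index of name -> (offset, length)."""
    bundle = bytearray()
    index: dict[str, tuple[int, int]] = {}
    for name, data in files.items():
        index[name] = (len(bundle), len(data))
        bundle.extend(data)
    return bytes(bundle), index

def read_from_bundle(bundle: bytes, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Serve one file from the bundle via a byte-range read."""
    off, length = index[name]
    return bundle[off:off + length]
```

One PUT and one index entry per bundle replaces thousands of tiny objects, which is what amortizes per-object request and storage overhead.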

Caveats

  • True E2E encryption breaks server-side dedupe, since the server can no longer hash plaintext chunks — a product choice.
  • Deleted file retention for compliance (legal hold) conflicts with user “delete forever”.