Bloom Filters & Probabilistic Caching
Problem statement
Reduce unnecessary database / cache lookups for “probably does not exist” checks (username-taken check, CDN negative caching, duplicate-URL detection in a crawler) using tiny in-memory structures that may return false positives but never false negatives.
How it works
Bloom filter: an array of m bits and k hash functions. Insert: set bits h1(x)…hk(x). Query: if all k bits are 1 → maybe in set; if any bit is 0 → definitely not in set.
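The insert/query mechanics above can be sketched in a few lines of Python. This is an illustrative toy, not a production implementation; the per-index hashing scheme (salting SHA-256 with the index) is one simple choice among several, and real libraries usually use double hashing instead.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, packed into bytes

    def _positions(self, item: str):
        # Derive k bit positions by salting one digest with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # All k bits set -> "maybe in set"; any bit clear -> definitely not.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

Querying an item that was never added returns False unless every one of its k positions happens to collide with set bits, which is exactly the false-positive case.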
Analogy: a hotel “vacancy” neon sign: if it is off, there is definitely no room; if it is on, you may still need to ask the front desk to confirm.
High-level usage in systems
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Bloom filter (RedisBloom / local) | Probabilistic set membership test. | Cheap negative answers to avoid DB on definitely new keys (URLs, usernames). |
| Authoritative DB | Source of truth on “maybe” positives. | Bloom false positives must be confirmed here. |
| Frontier queue | Crawler scheduling system. | Accepts URLs that first passed the cheap Bloom check. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Libraries
- RedisBloom module (`BF.ADD` / `BF.EXISTS`).
- Guava `BloomFilter` in JVM services.
- Scalable Bloom or Counting Bloom when deletions are needed (a counter per slot instead of a single bit).
Sizing
- Choose the expected element count n and false-positive rate p, then compute m and k via the standard formulas or an online calculator.
- Growth: Scalable Bloom, or partitioned filters per shard (`bf:user:shard5`).
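The standard sizing formulas are m = -n·ln p / (ln 2)² bits and k = (m/n)·ln 2 hash functions. A small helper, as a sketch:

```python
import math


def bloom_sizing(n: int, p: float) -> tuple[int, int]:
    """Return (m bits, k hash functions) for n expected elements
    at target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k
```

For example, 1 million expected elements at p = 1% needs roughly 9.6 million bits (about 1.2 MB) and k = 7 hash functions, which is the kind of tiny footprint that makes the technique attractive.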
Where not to use
- Safety-critical negatives (“not poisoned”): a false positive would be catastrophic.
- Tiny sets — overhead not worth it.
E2E: crawler duplicate URL check
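The end-to-end flow can be sketched as follows. The `bf`, `seen_db`, and `frontier` objects and their method names are hypothetical interfaces, not from any particular library: the Bloom filter supplies cheap negatives, and the authoritative DB confirms every “maybe”.

```python
def submit_url(url: str, bf, seen_db, frontier) -> bool:
    """Enqueue url if unseen. Bloom gives cheap definite negatives;
    the authoritative DB confirms 'maybe' answers."""
    if bf.might_contain(url):
        # "Maybe seen" can be a false positive, so consult the source of truth.
        if seen_db.exists(url):
            return False  # true duplicate: drop without enqueuing
    # Definitely new (or a Bloom false positive): record and schedule.
    bf.add(url)
    seen_db.record(url)
    frontier.enqueue(url)
    return True
```

In the common case (a genuinely new URL) the DB read is skipped entirely; the DB is only consulted for true duplicates and the rare false positives.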
Tricky parts
| Problem | Solution |
|---|---|
| False positive overloads DB | Rebuild filter periodically from cold truth snapshot |
| Cannot delete in classic Bloom | Counting Bloom or Cuckoo filter |
| Hot key single BF | Sharding filters by hash prefix |
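The Counting Bloom variant mentioned above replaces each bit with a small counter so that deletion becomes possible. A minimal sketch (real implementations pack 4-bit counters rather than using a Python list):

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom sketch: a counter per slot enables deletion."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.counts[p] += 1

    def remove(self, item: str) -> None:
        # Only safe for items actually added; decrementing for an
        # absent item silently corrupts other items' membership.
        for p in self._positions(item):
            if self.counts[p] > 0:
                self.counts[p] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counts[p] > 0 for p in self._positions(item))
```

The cost is several bits of memory per slot instead of one, which is why Cuckoo filters are often preferred when deletions are frequent.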
Caveats
- Serialization across restarts — version filter parameters; rebuild after deploy if hash seeds change.
- Race: two threads may add the same URL concurrently; keep downstream processing idempotent.
CDN negative caching
- Cache 404s with a short TTL at the CDN for assets not found; this reduces origin hammering. Watch that attackers cannot force-cache personalized 404s (use Vary carefully).
Azure / AWS
- Azure CDN rules can set a 404 TTL. DynamoDB has no native Bloom filter support; implement one at the app layer or use Redis Enterprise (RedisBloom).