Bloom Filters & Probabilistic Caching
Problem statement
Reduce unnecessary database / cache lookups for “probably does not exist” checks (username-taken check, CDN negative caching, duplicate-URL detection in a crawler) using tiny in-memory structures that may return false positives but never false negatives.
How it works
Bloom filter: an array of m bits and k hash functions. Insert: set bits h1(x)…hk(x). Query: if all k bits are 1 → maybe in set; if any bit is 0 → definitely not in set.
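The insert/query mechanics above can be sketched in a few lines of Python. This is an illustrative toy, not a production implementation; the per-index hashing scheme (salting SHA-256 with the index) is one simple choice among several, and real libraries usually use double hashing instead.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, packed into bytes

    def _positions(self, item: str):
        # Derive k bit positions by salting one digest with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # All k bits set -> "maybe in set"; any bit clear -> definitely not.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

Querying an item that was never added returns False unless every one of its k positions happens to collide with set bits, which is exactly the false-positive case.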
Analogy: a hotel “vacancy” neon sign: if it is off, there is definitely no room; if it is on, you may still need to ask the front desk to confirm.
High-level usage in systems
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Bloom filter (RedisBloom / local) | Probabilistic set membership test. | Cheap negative answers to avoid DB on definitely new keys (URLs, usernames). |
| Authoritative DB | Source of truth on “maybe” positives. | Bloom false positives must be confirmed here. |
| Frontier queue | Crawler scheduling system. | Accepts URLs that first passed the cheap Bloom check. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Libraries
- RedisBloom module (`BF.ADD` / `BF.EXISTS`).
- Guava `BloomFilter` in JVM services.
- Scalable Bloom or Counting Bloom when deletions are needed (a counter per slot instead of a single bit).
Sizing
- Choose the expected element count n and false-positive rate p, then compute m and k via the standard formulas or an online calculator.
- Growth: Scalable Bloom, or partitioned filters per shard (`bf:user:shard5`).
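The standard sizing formulas are m = -n·ln p / (ln 2)² bits and k = (m/n)·ln 2 hash functions. A small helper, as a sketch:

```python
import math


def bloom_sizing(n: int, p: float) -> tuple[int, int]:
    """Return (m bits, k hash functions) for n expected elements
    at target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k
```

For example, 1 million expected elements at p = 1% needs roughly 9.6 million bits (about 1.2 MB) and k = 7 hash functions, which is the kind of tiny footprint that makes the technique attractive.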
Where not to use
- Safety-critical negatives (“not poisoned”): a false positive would be catastrophic.
- Tiny sets — overhead not worth it.
E2E: crawler duplicate URL check
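The end-to-end flow can be sketched as follows. The `bf`, `seen_db`, and `frontier` objects and their method names are hypothetical interfaces, not from any particular library: the Bloom filter supplies cheap negatives, and the authoritative DB confirms every “maybe”.

```python
def submit_url(url: str, bf, seen_db, frontier) -> bool:
    """Enqueue url if unseen. Bloom gives cheap definite negatives;
    the authoritative DB confirms 'maybe' answers."""
    if bf.might_contain(url):
        # "Maybe seen" can be a false positive, so consult the source of truth.
        if seen_db.exists(url):
            return False  # true duplicate: drop without enqueuing
    # Definitely new (or a Bloom false positive): record and schedule.
    bf.add(url)
    seen_db.record(url)
    frontier.enqueue(url)
    return True
```

In the common case (a genuinely new URL) the DB read is skipped entirely; the DB is only consulted for true duplicates and the rare false positives.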
Tricky parts
| Problem | Solution |
|---|---|
| False positive overloads DB | Rebuild filter periodically from cold truth snapshot |
| Cannot delete in classic Bloom | Counting Bloom or Cuckoo filter |
| Hot key single BF | Sharding filters by hash prefix |
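The Counting Bloom variant mentioned above replaces each bit with a small counter so that deletion becomes possible. A minimal sketch (real implementations pack 4-bit counters rather than using a Python list):

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom sketch: a counter per slot enables deletion."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.counts[p] += 1

    def remove(self, item: str) -> None:
        # Only safe for items actually added; decrementing for an
        # absent item silently corrupts other items' membership.
        for p in self._positions(item):
            if self.counts[p] > 0:
                self.counts[p] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counts[p] > 0 for p in self._positions(item))
```

The cost is several bits of memory per slot instead of one, which is why Cuckoo filters are often preferred when deletions are frequent.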
Caveats
- Serialization across restarts — version filter parameters; rebuild after deploy if hash seeds change.
- Race: two threads may add the same URL concurrently; keep downstream processing idempotent.
CDN negative caching
- Cache 404s with a short TTL at the CDN for assets not found; this reduces origin hammering. Watch that attackers cannot force-cache personalized 404s (use Vary carefully).
Azure / AWS
- Azure CDN rules can set a 404 TTL. DynamoDB has no native Bloom filter support; implement one at the app layer or use Redis Enterprise (RedisBloom).