SuryanandHome

Bloom Filters & Probabilistic Caching

Problem statement

Reduce unnecessary database / cache lookups for “probably does not exist” checks (username taken crawl, CDN negative caching, duplicate URL in crawler) using tiny memory structures that may have false positives but no false negatives.

How it works

Bloom filter: m bits array, k hash functions. Insert: set bits h1(x)...hk(x). Query: all bits must be 1 → maybe in set; any 0 → definitely not in set.

Analogy: Hotel “no vacancy” neon sign — if off, definitely no room; if on, maybe still need to ask front desk (confirm).

High-level usage in systems

Rendering diagram…

Components explained — this design

ComponentWhat it isWhy we use it here
Bloom filter (RedisBloom / local)Probabilistic set membership test.Cheap negative answers to avoid DB on definitely new keys (URLs, usernames).
Authoritative DBSource of truth on “maybe” positives.Bloom false positives must be confirmed here.
Frontier queueCrawler scheduling system.Accepts URLs that passed bloom cheap check first.

Shared definitions: 00-glossary-common-services.md

Low-level design

Libraries

  • RedisBloom module BF.ADD / BF.EXISTS.
  • Guava BloomFilter in JVM services.
  • Scalable Bloom / Counting Bloom when deletions needed (counters per bit instead of bit).

Sizing

  • Choose expected elements n, false positive rate p → compute m, k via formulas or online calculators.
  • Growth: Scalable Bloom or partitioned filters per shard (bf:user:shard5).

Where not to use

  • Safety-critical negatives (“not poisoned”) — false positive catastrophic.
  • Tiny sets — overhead not worth it.

E2E: crawler duplicate URL check

Rendering diagram…

Tricky parts

ProblemSolution
False positive overloads DBRebuild filter periodically from cold truth snapshot
Cannot delete in classic BloomCounting Bloom or Cuckoo filter
Hot key single BFSharding filters by hash prefix

Caveats

  • Serialization across restarts — version filter parameters; rebuild after deploy if hash seeds change.
  • Race: two threads add same url concurrently — idempotent downstream.

CDN negative caching

  • Cache 404 short TTL at CDN for asset not found — reduces origin hammering; watch so attackers cannot force-cache personalized 404s (use Vary carefully).

Azure / AWS

  • Azure CDN rules for 404 TTL; DynamoDB doesn’t natively bloom — app layer or Redis Enterprise.