Recommendation Engine (Homepage / “You may also like”)
Problem statement
Rank millions of items per user in under 100 ms using behavioral signals, handle cold start, and keep results fresh, all without filter bubbles becoming a PR issue.
How it works
- Retrieval: cheap candidate set (hundreds) via embedding ANN, co-visitation, graph walks.
- Ranking: heavier model scores candidates with context features.
- Re-rank: diversity and business rules (e.g., downrank out-of-stock items).
Analogy: a restaurant menu. The appetizer tray (retrieval) brings out eight bites; the chef (ranker) picks the best three given your known allergies; the manager (re-rank) makes sure a vegetarian option stays visible.
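The three stages above can be sketched as a minimal pipeline. Everything here is a placeholder: candidate generation and scoring are stubs standing in for the ANN service and the ranker.

```python
import random

def retrieve(user_id, n=200):
    # Cheap candidate generation: stand-in for ANN / co-visitation / graph walks.
    return [f"item_{i}" for i in range(n)]

def rank(user_id, candidates):
    # Heavier model scoring each candidate; random scores as a placeholder.
    return sorted(candidates, key=lambda item: random.random(), reverse=True)

def rerank(ranked, out_of_stock, k=10):
    # Business rules: push out-of-stock items to the back, then take the top k.
    in_stock = [i for i in ranked if i not in out_of_stock]
    return (in_stock + [i for i in ranked if i in out_of_stock])[:k]

recs = rerank(rank("u1", retrieve("u1")), out_of_stock={"item_3"})
```

The key structural point survives even in a toy: each stage narrows the set (millions, then hundreds, then a page), so the expensive model only ever sees a small candidate list.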
High-level design
(architecture diagram omitted)
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Clickstream Kafka | Raw user behavior events. | Durable input to both batch training and nearline features. |
| Flink feature pipeline | Real-time aggregations (CTR windows). | Powers fresh signals beyond nightly batch tables. |
| Feature Store (Redis + Parquet) | Online low-latency features + offline training snapshots. | Training-serving skew reduction by sharing definitions. |
| ANN service (FAISS/ScaNN) | Approximate nearest neighbor retrieval. | Millions of candidate items can’t be scored by heavy ranker; retrieve hundreds fast. |
| Ranker endpoint | XGBoost / neural ranker on candidates. | Adds contextual scoring using dense features. |
Shared definitions: 00-glossary-common-services.md
Low-level design
Feature store
- Online: Redis / DynamoDB for low-latency user features (`last_category`, `ctr_7d_bucket`).
- Offline: Snowflake / BigQuery for training joins.
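A sketch of the online read path, with a plain dict standing in for Redis and a hypothetical `user:{id}:features` key layout; the default values for cold-start users are also illustrative.

```python
# Dict stands in for the Redis online store; keys follow a hypothetical
# "user:{id}:features" layout shared with the offline feature definitions.
ONLINE_STORE = {
    "user:42:features": {"last_category": "shoes", "ctr_7d_bucket": 3},
}

# Cold-start fallbacks keep the ranker's input schema stable.
DEFAULTS = {"last_category": "unknown", "ctr_7d_bucket": 0}

def get_online_features(user_id: int) -> dict:
    # Single low-latency lookup; missing users fall back to defaults.
    stored = ONLINE_STORE.get(f"user:{user_id}:features", {})
    return {**DEFAULTS, **stored}

feats = get_online_features(42)   # known user
cold = get_online_features(999)   # cold-start user gets defaults
```

Sharing the feature definitions (names, defaults, bucketing) between this online path and the offline training join is what reduces training-serving skew.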
ANN index
- FAISS IVF+PQ for memory efficiency; periodic rebuild from nightly embeddings.
- Two-tower model: user tower + item tower cosine similarity.
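A numpy sketch of two-tower retrieval. The embeddings are random placeholders, and exact cosine search stands in for the FAISS IVF+PQ index a production system would use at this scale.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_ITEMS = 64, 10_000

# Item-tower output, L2-normalized so a dot product equals cosine similarity.
item_emb = rng.standard_normal((N_ITEMS, DIM)).astype(np.float32)
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

def retrieve_topk(user_vec: np.ndarray, k: int = 200) -> np.ndarray:
    # User-tower output scored against all items; exact search here,
    # approximate (IVF+PQ) in production for memory and latency.
    u = user_vec / np.linalg.norm(user_vec)
    scores = item_emb @ u
    return np.argpartition(-scores, k)[:k]  # top-k item ids, unordered

cands = retrieve_topk(rng.standard_normal(DIM).astype(np.float32))
```

Because both towers map into the same normalized space, the nightly rebuild only needs the item tower's outputs; user vectors can be computed at request time.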
Exploration
- Epsilon-greedy or Thompson sampling bandit layer for cold items.
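The epsilon-greedy variant is the simpler of the two and can be sketched in a few lines (slot choice and epsilon value are illustrative); Thompson sampling would instead sample from per-item Beta posteriors over click-through rate.

```python
import random

def epsilon_greedy_slot(ranked, cold_pool, epsilon=0.1):
    # With probability epsilon, swap the last slot for a cold item
    # so new inventory gets impressions and feedback.
    result = list(ranked)
    if cold_pool and random.random() < epsilon:
        result[-1] = random.choice(sorted(cold_pool))
    return result

page = ["a", "b", "c"]
always = epsilon_greedy_slot(page, {"new1"}, epsilon=1.0)  # forced exploration
never = epsilon_greedy_slot(page, {"new1"}, epsilon=0.0)   # pure exploitation
```

Using only the last slot keeps the exploration cost bounded: the user's top results are untouched while cold items still accumulate the clicks needed to enter the main ranker's training data.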
Fairness
- Demographic parity constraints in re-rank; audit dashboards per segment.
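One concrete way to enforce a parity-style constraint in the re-rank stage is a per-segment quota, sketched below; the 20% share, segment labels, and greedy fill are all illustrative, and production constraints are usually richer.

```python
def parity_rerank(ranked, segment_of, min_share=0.2, k=10):
    # First pass: reserve quota slots per segment from its best-ranked items.
    quota = max(1, int(min_share * k))
    picked = []
    for seg in sorted(set(segment_of.values())):
        picked.extend([i for i in ranked if segment_of[i] == seg][:quota])
    # Second pass: fill remaining slots in original score order.
    for item in ranked:
        if len(picked) >= k:
            break
        if item not in picked:
            picked.append(item)
    # Present the final page in original rank order.
    return sorted(picked[:k], key=ranked.index)

ranked = [f"i{j}" for j in range(20)]
segment_of = {f"i{j}": ("A" if j < 15 else "B") for j in range(20)}
page = parity_rerank(ranked, segment_of)
```

With these inputs the unconstrained top 10 would contain no segment-B items; the quota guarantees B at least two slots, which is exactly the kind of per-segment outcome the audit dashboards would track.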
E2E: homepage request
(request-flow diagram omitted)
Tricky parts
| Problem | Solution |
|---|---|
| Filter bubble | Inject exploration + editorial slots |
| Latency SLA | Timeout budget per stage; degrade to popularity baseline |
| Privacy | Federated learning optional; DP noise on sensitive features |
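The latency row above can be sketched as a per-stage budget with a popularity fallback. The budget split and the overrun-after-the-fact check are illustrative; a real serving path would cancel in-flight work rather than measure it afterwards.

```python
import time

# Illustrative split of a 100 ms SLA across stages.
STAGE_BUDGET_MS = {"retrieve": 30, "rank": 50, "rerank": 10}

POPULARITY_BASELINE = ["top_1", "top_2", "top_3"]  # precomputed fallback page

def with_budget(stage, fn, *args):
    # Run the stage; on error or budget overrun, return None so the
    # caller can degrade instead of blowing the end-to-end SLA.
    start = time.monotonic()
    try:
        out = fn(*args)
    except Exception:
        return None
    elapsed_ms = (time.monotonic() - start) * 1000
    return out if elapsed_ms <= STAGE_BUDGET_MS[stage] else None

def homepage(user_id, retrieve, rank_fn):
    cands = with_budget("retrieve", retrieve, user_id)
    if cands is None:
        return POPULARITY_BASELINE  # degrade straight to the baseline
    ranked = with_budget("rank", rank_fn, user_id, cands)
    return ranked if ranked is not None else POPULARITY_BASELINE
```

The design choice worth noting: the fallback is precomputed and requires no model call, so the degraded path is always cheaper than the path that just failed.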
Caveats
- Feedback loops amplify clickbait; mitigate with human eval sets and offline metrics beyond CTR.
- Seasonality: add time-based features and set a regular retrain cadence.
Managed alternatives
- Amazon Personalize, Google Recommendations AI, Azure Personalizer.