Recommendation Engine (Homepage / “You may also like”)

Problem statement

Rank millions of items per user in under 100 ms using behavioral signals, handling cold start and freshness without letting filter bubbles become a PR issue.

How it works

  • Retrieval: cheap candidate set (hundreds) via embedding ANN, co-visitation, graph walks.
  • Ranking: heavier model scores candidates with context features.
  • Re-rank: diversity, business rules (e.g., downrank out-of-stock items).

Analogy: a restaurant menu. The appetizer tray (retrieval) brings 8 bites; the chef (ranker) picks the best 3 given your known allergies; the manager (re-rank) makes sure a vegetarian option is visible.
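The three stages above can be sketched end to end. This is a toy illustration, not a production pipeline: the retrieval heuristic, scoring function, and all item/user fields (`popularity`, `affinity`, `in_stock`) are assumptions standing in for the real ANN index, ranker model, and business rules.

```python
# Minimal sketch of retrieve -> rank -> re-rank. All data is illustrative.

def retrieve(catalog, k=5):
    # Stand-in for ANN / co-visitation: cheap popularity cut.
    return sorted(catalog, key=lambda item: item["popularity"], reverse=True)[:k]

def rank(user, candidates):
    # Stand-in for the heavy ranker: popularity weighted by category affinity.
    def score(item):
        return item["popularity"] * user["affinity"].get(item["category"], 0.1)
    return sorted(candidates, key=score, reverse=True)

def rerank(ranked, in_stock):
    # Business rule: push out-of-stock items to the bottom (stable sort).
    return sorted(ranked, key=lambda item: not in_stock[item["id"]])

catalog = [
    {"id": 1, "category": "shoes", "popularity": 0.9},
    {"id": 2, "category": "books", "popularity": 0.8},
    {"id": 3, "category": "shoes", "popularity": 0.5},
]
user = {"affinity": {"shoes": 1.0, "books": 0.3}}
in_stock = {1: False, 2: True, 3: True}

final = rerank(rank(user, retrieve(catalog)), in_stock)
print([item["id"] for item in final])  # [3, 2, 1]: item 1 sinks, out of stock
```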

High-level design

[Diagram: high-level architecture]

Components explained — this design

| Component | What it is | Why we use it here |
|---|---|---|
| Clickstream Kafka | Raw user behavior events. | Durable input to both batch training and nearline features. |
| Flink feature pipeline | Real-time aggregations (CTR windows). | Powers fresh signals beyond nightly batch tables. |
| Feature Store (Redis + Parquet) | Online low-latency features + offline training snapshots. | Reduces training-serving skew by sharing feature definitions. |
| ANN service (FAISS/ScaNN) | Approximate nearest neighbor retrieval. | Millions of candidate items can't be scored by the heavy ranker; retrieve hundreds fast. |
| Ranker endpoint | XGBoost / neural ranker on candidates. | Adds contextual scoring using dense features. |

Shared definitions: 00-glossary-common-services.md

Low-level design

Feature store

  • Online: Redis / DynamoDB low-latency user features (last_category, ctr_7d_bucket).
  • Offline: Snowflake / BigQuery for training joins.
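The online read path can be sketched as below; a plain dict stands in for Redis/DynamoDB, and the key scheme (`user:{id}`), feature names, and staleness cutoff are all assumptions for illustration.

```python
import json
import time

# Dict standing in for a Redis key -> JSON value store.
ONLINE_STORE = {
    "user:42": json.dumps({
        "last_category": "shoes",
        "ctr_7d_bucket": "high",
        "updated_at": 1700000000,
    })
}

def get_user_features(user_id, max_age_s=86400, now=None):
    raw = ONLINE_STORE.get(f"user:{user_id}")
    if raw is None:
        return {}  # cold start: caller falls back to defaults
    feats = json.loads(raw)
    if (now or time.time()) - feats["updated_at"] > max_age_s:
        return {}  # stale features can be worse than none
    return feats

feats = get_user_features(42, now=1700000100)
print(feats["last_category"])  # shoes
```

Returning an empty dict on miss or staleness keeps the ranker's fallback logic in one place instead of scattering null checks.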

ANN index

  • FAISS IVF+PQ for memory efficiency; periodic rebuild from nightly embeddings.
  • Two-tower model: user tower + item tower cosine similarity.
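A sketch of the two-tower serving path, using brute-force cosine similarity over random toy embeddings as a stand-in for the FAISS IVF+PQ index (at millions of items you would query the ANN index instead; sizes and dimensions here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Item tower output: unit-norm embeddings, rebuilt nightly in the real system.
item_emb = rng.standard_normal((1000, 64)).astype("float32")
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

def retrieve_topk(user_vec, k=100):
    # Cosine similarity reduces to a dot product on unit-norm vectors.
    u = user_vec / np.linalg.norm(user_vec)
    scores = item_emb @ u
    return np.argsort(-scores)[:k]  # indices of the k most similar items

user_vec = rng.standard_normal(64).astype("float32")  # user tower output
candidates = retrieve_topk(user_vec, k=100)
print(len(candidates))  # 100
```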

Exploration

  • Epsilon-greedy or Thompson sampling bandit layer for cold items.
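The epsilon-greedy variant is the simplest to sketch: with probability epsilon, one ranked slot is swapped for a random cold item so new inventory accrues impressions. The slot choice and epsilon value below are illustrative.

```python
import random

def inject_exploration(ranked, cold_items, eps=0.1, rng=None):
    # With probability eps, replace one slot with a random cold item.
    rng = rng or random.Random()
    if cold_items and rng.random() < eps:
        ranked = list(ranked)
        ranked[rng.randrange(len(ranked))] = rng.choice(cold_items)
    return ranked

rng = random.Random(7)
out = inject_exploration([1, 2, 3, 4], cold_items=[99, 100], eps=1.0, rng=rng)
print(out)  # one slot now holds a cold item (99 or 100)
```

Thompson sampling replaces the fixed epsilon with per-item posteriors, so exploration shrinks automatically as a cold item gathers feedback.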

Fairness

  • Demographic parity constraints in re-rank; audit dashboards per segment.
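A greedy sketch of an exposure constraint in re-rank: guarantee each segment at least one top-k slot, preserving score order otherwise. Real demographic-parity constraints are usually solved as a constrained optimization; the segment names and scores here are assumptions.

```python
def rerank_with_coverage(scored, k=3):
    # scored: list of (item_id, segment, score)
    scored = sorted(scored, key=lambda t: -t[2])
    picked, seen = [], set()
    # First pass: best-scoring item from each not-yet-covered segment.
    for item in scored:
        if item[1] not in seen and len(picked) < k:
            picked.append(item)
            seen.add(item[1])
    # Second pass: fill remaining slots purely by score.
    for item in scored:
        if item not in picked and len(picked) < k:
            picked.append(item)
    return [i for i, _, _ in sorted(picked, key=lambda t: -t[2])]

out = rerank_with_coverage([
    ("a", "big_brand", 0.9),
    ("b", "big_brand", 0.8),
    ("c", "small_seller", 0.4),
], k=2)
print(out)  # ['a', 'c']: 'c' displaces 'b' to cover small_seller
```

The audit dashboards then track realized exposure per segment to confirm the constraint holds in production, not just in the re-ranker's intent.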

E2E: homepage request

[Diagram: end-to-end homepage request flow]

Tricky parts

| Problem | Solution |
|---|---|
| Filter bubble | Inject exploration + editorial slots |
| Latency SLA | Timeout budget per stage; degrade to popularity baseline |
| Privacy | Federated learning optional; DP noise on sensitive features |
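The latency-SLA row can be sketched with a per-stage timeout and a popularity fallback. The 50 ms budget, the deliberately slow ranker, and the baseline list are all illustrative.

```python
import concurrent.futures
import time

POPULARITY_BASELINE = [1, 2, 3]  # precomputed fallback ranking

def slow_ranker(candidates):
    time.sleep(0.2)  # simulate a ranker that blows the budget
    return candidates[::-1]

def rank_with_budget(candidates, budget_s=0.05):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_ranker, candidates)
        try:
            return future.result(timeout=budget_s)
        except concurrent.futures.TimeoutError:
            return POPULARITY_BASELINE  # degrade gracefully, never 5xx

print(rank_with_budget([7, 8, 9]))  # falls back: [1, 2, 3]
```

Serving a stale-but-plausible page beats an error page: the SLA is met on every request, and the ranker's miss rate becomes a monitored metric rather than a user-facing failure.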

Caveats

  • Feedback loops amplify clickbait; counter with human eval sets and offline metrics beyond CTR.
  • Seasonality: use time-based features and a regular retrain cadence.

Managed

  • Amazon Personalize, Google Recommendations AI, Azure Personalizer.