
Real-Time Fraud Detection

Problem statement

Score payments, logins, and transfers in <100 ms using rules + ML, with human review queues, false-positive management, and explainability for regulators.

How it works

  • Features assembled from Redis (velocity counters), a graph DB (devices shared across accounts), and warehouse aggregates.
  • Rules engine fast path → ML model heavy path → risk score + action (allow, step_up, block).
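The fast-path/heavy-path flow above can be sketched in a few lines. This is a minimal illustration, not the actual service: `Rule`, `ml_score`, the rule list, and the 0.8/0.5 thresholds are all assumed names and values for the sketch.

```python
# Sketch: cheap rules short-circuit obvious cases; the ML model only
# runs on the remainder and its score maps to an action.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    predicate: Callable[[dict], bool]  # True -> immediate block

# Illustrative rules, editable by risk ops in the real system.
RULES = [
    Rule("country_mismatch", lambda tx: tx["ip_country"] != tx["card_country"]),
    Rule("velocity_spike", lambda tx: tx["cards_last_hour"] > 5),
]

def ml_score(tx: dict) -> float:
    # Stand-in for the hosted model endpoint; returns a fraud probability.
    return 0.9 if tx["amount"] > 5000 else 0.1

def score_transaction(tx: dict) -> tuple[str, str]:
    # Fast path: rules give an explainable, immediate decision.
    for rule in RULES:
        if rule.predicate(tx):
            return "block", f"rule:{rule.name}"
    # Heavy path: ML risk score mapped to allow / step_up / block.
    risk = ml_score(tx)
    if risk > 0.8:
        return "block", "ml:high_risk"
    if risk > 0.5:
        return "step_up", "ml:medium_risk"
    return "allow", "ml:low_risk"
```

Returning a reason string alongside the action is what makes rule-based blocks auditable for regulators.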

Analogy: airport security. The metal detector (rules) catches obvious knives; random extra screening (ML risk) covers subtle patterns; a supervisor (human) handles edge cases.

High-level design


Components explained — this design

| Component | What it is | Why we use it here |
| --- | --- | --- |
| Kafka transaction events | Immutable stream of payment attempts. | Replay to re-score with new models; audit trail for regulators. |
| Feature joiner (Flink) | Computes rolling aggregates (velocity). | Real-time features that cannot be precomputed statically. |
| Scoring API | Orchestrates rules + ML within an SLA budget. | Hard timeouts per stage protect checkout latency. |
| Drools / custom rules | Declarative if-then policies editable by risk ops. | Explainable blocks ("blocked because country mismatch"). |
| SageMaker endpoint | Hosted ML model inference. | Offloads GPU/CPU scaling for models instead of embedding them in the monolith. |
| Case management UI | Human review for edge-case fraud. | Covers the model's uncertainty region; feedback labels improve training. |
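The Scoring API's hard per-stage timeouts can be sketched as a shared latency budget that each stage checks before running. `LatencyBudget`, `score_with_budget`, and the step-up fallback are illustrative names, not the real service's API.

```python
# Sketch: one budget for the whole request; each stage checks it, and an
# exhausted budget degrades to a safe fallback instead of stalling checkout.
import time

class BudgetExceeded(Exception):
    pass

class LatencyBudget:
    def __init__(self, total_ms: float):
        self.deadline = time.monotonic() + total_ms / 1000.0

    def remaining_ms(self) -> float:
        return max(0.0, (self.deadline - time.monotonic()) * 1000.0)

    def check(self, stage: str) -> None:
        if self.remaining_ms() == 0.0:
            raise BudgetExceeded(stage)

def score_with_budget(tx, run_rules, run_ml, total_ms=100.0):
    budget = LatencyBudget(total_ms)
    try:
        budget.check("rules")
        if run_rules(tx) == "block":
            return "block"
        budget.check("ml")           # skip the model if rules ate the budget
        return "block" if run_ml(tx) > 0.8 else "allow"
    except BudgetExceeded:
        return "step_up"             # degrade safely rather than stall checkout
```

Degrading to `step_up` on timeout is one reasonable policy; some teams prefer `allow` plus async review, depending on loss tolerance.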

Shared definitions: 00-glossary-common-services.md

Low-level design

Features

  • Velocity: countDistinct(card_hash, 1h) in a Flink window keyed by user.
  • Device fingerprint: hash raw signals; minimize retention for GDPR.
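The velocity feature can be approximated in plain Python as a per-user sliding window of distinct card hashes, mirroring what the Flink window computes. `VelocityTracker` and its method names are assumptions for the sketch.

```python
# Sketch: countDistinct(card_hash, 1h) keyed by user, as a sliding window.
from collections import defaultdict, deque

WINDOW_S = 3600  # 1-hour window

class VelocityTracker:
    def __init__(self):
        # user_id -> deque of (event_time, card_hash), oldest first
        self.events = defaultdict(deque)

    def observe(self, user_id: str, ts: float, card_hash: str) -> int:
        q = self.events[user_id]
        q.append((ts, card_hash))
        # Evict events that fell out of the 1-hour window.
        while q and q[0][0] <= ts - WINDOW_S:
            q.popleft()
        # Distinct card hashes seen in the window.
        return len({h for _, h in q})
```

A real Flink job would additionally handle out-of-order events with watermarks; this sketch assumes in-order timestamps.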

Actions

  • Step-up: 3DS challenge on card; OTP on bank transfer.
  • Shadow mode: log ML score without blocking — calibrate before enforcement.
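Shadow mode reduces to a one-branch decision function: the ML score is always logged, but only enforced once the flag flips. `decide`, `shadow_log`, and the 0.8 threshold are illustrative.

```python
# Sketch: in shadow mode the ML score is recorded for calibration while the
# enforced verdict comes from rules alone.
shadow_log = []  # stand-in for the real metrics sink

def decide(tx: dict, rules_verdict: str, ml_score: float, shadow: bool = True) -> str:
    if shadow:
        shadow_log.append({"tx_id": tx["id"], "ml_score": ml_score})
        return rules_verdict                      # ML never blocks in shadow mode
    return "block" if ml_score > 0.8 else rules_verdict
```

Comparing `shadow_log` scores against later fraud labels is what lets the team pick an enforcement threshold before the model can cost any revenue.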

Human loop

  • Queue prioritization by expected loss; SLA timers per case; feedback labels feed model retraining.
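Prioritizing the review queue by expected loss is a small heap exercise. `ReviewQueue` and its fields are assumed names; expected loss here is simply `fraud_prob * amount`.

```python
# Sketch: analysts pull the case with the highest expected loss first;
# the SLA deadline breaks ties between equal-loss cases.
import heapq

class ReviewQueue:
    def __init__(self):
        self._heap = []

    def enqueue(self, case_id: str, fraud_prob: float, amount: float,
                sla_deadline: float) -> None:
        expected_loss = fraud_prob * amount
        # heapq is a min-heap, so negate loss to pop the largest first.
        heapq.heappush(self._heap, (-expected_loss, sla_deadline, case_id))

    def next_case(self) -> str:
        return heapq.heappop(self._heap)[2]
```

A production queue would also re-prioritize cases as their SLA timers approach expiry, which this sketch omits.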

E2E: payment scoring


Tricky parts

| Problem | Solution |
| --- | --- |
| Adversarial drift | Continuous training + champion/challenger |
| Bias in ML | Fairness constraints; segmented evaluation |
| Latency budget | Precomputed features where possible |
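The champion/challenger pattern from the table can be sketched as routing: the champion's score is always enforced, while a sampled fraction of traffic is also scored by the challenger and logged for offline comparison. Function and variable names here are illustrative.

```python
# Sketch: champion scores are enforced; challenger scores on a traffic
# sample are logged only, so a drifting champion can be caught and replaced.
import random

challenger_log = []  # stand-in for the real comparison store

def score(tx: dict, champion, challenger,
          sample_rate: float = 0.1, rng=random.random) -> float:
    enforced = champion(tx)
    if rng() < sample_rate:                       # sampled shadow scoring
        challenger_log.append((tx["id"], challenger(tx)))
    return enforced
```

Injecting `rng` keeps the sampling decision testable; in production the sample rate is typically ramped up as the challenger proves itself.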

Caveats

  • Explainability vs accuracy tradeoff — SHAP offline; simple rule overlays for auditors.
  • False positives erode revenue — A/B impact of stricter rules on GMV.

Azure

  • Azure Fraud Protection; Sentinel for account takeover patterns across products.