Real-Time Fraud Detection

Problem statement

Score payments, logins, and transfers in under 100 ms using rules plus ML, with human review queues, false-positive management, and explainability for regulators.

How it works

Features are assembled from Redis (velocity counters), a graph DB (devices shared across accounts), and warehouse aggregates.
Rules engine fast path → ML model heavy path → risk score + action (allow, step_up, block).
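
The fast-path/heavy-path flow can be sketched as follows. This is a minimal illustration, not the production logic: the country-mismatch rule, the amount cutoff, the two thresholds (0.5 and 0.9), and the names `rules_fast_path`, `score_to_action`, and `score_transaction` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    action: str   # "allow", "step_up", or "block"
    score: float  # risk score in [0, 1]
    reason: str   # human-readable explanation for auditors


def rules_fast_path(txn: dict) -> Optional[Decision]:
    """Return a decision if a hard rule fires, else None to fall through to ML."""
    # Hypothetical rule: card country differs from IP country on a large payment.
    if txn["card_country"] != txn["ip_country"] and txn["amount"] > 1000:
        return Decision("block", 1.0, "country mismatch on high-value payment")
    return None


def score_to_action(score: float, step_up_at: float = 0.5, block_at: float = 0.9) -> str:
    """Map a risk score to an action using two illustrative thresholds."""
    if score >= block_at:
        return "block"
    if score >= step_up_at:
        return "step_up"
    return "allow"


def score_transaction(txn: dict, ml_model: Callable[[dict], float]) -> Decision:
    """Run rules first; call the (more expensive) model only if no rule fired."""
    decision = rules_fast_path(txn)
    if decision is not None:
        return decision
    score = ml_model(txn)
    return Decision(score_to_action(score), score, "ml risk score")
```

Returning the reason alongside the action keeps rule-based blocks explainable, while the model is only invoked when no rule has already decided the outcome.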

Analogy: airport security. The metal detector (rules) catches obvious knives; risk-based extra screening (ML risk) picks up subtle patterns; a supervisor (human) handles edge cases.

High-level design


Components explained — this design

Component	What it is	Why we use it here
Kafka transaction events	Immutable stream of payment attempts.	Replay to re-score with new models; audit for regulators.
Feature joiner Flink	Computes rolling aggregates (velocity).	Real-time features impossible to precompute statically.
Scoring API	Orchestrates rules + ML with SLA budget.	Hard timeouts per stage to protect checkout latency.
Drools / custom rules	Declarative if-then policies editable by risk ops.	Explainable blocks (“blocked because country mismatch”).
SageMaker endpoint	Hosted ML model inference.	Offloads GPU/CPU scaling of model serving instead of embedding the model in the monolith.
Case management UI	Human review of borderline cases.	Covers the region where the model is uncertain; reviewer labels feed back into training.

Shared definitions: 00-glossary-common-services.md

Low-level design

Features

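Velocity: countDistinct(card_hash, 1h), i.e. the number of distinct cards a user has tried in the last hour, computed in a Flink window keyed by user.
Device fingerprint: hash raw signals rather than storing them; keep retention short in line with GDPR data minimization.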
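
Both features can be illustrated in a few lines. The sketch below is an in-memory Python stand-in, not Flink code: `DistinctVelocity` mimics a sliding one-hour distinct count keyed by user, and `fingerprint` uses a keyed HMAC-SHA256 over canonicalized signals. The class name, function name, signal keys, and the choice of HMAC are assumptions for the example.

```python
import hashlib
import hmac
from collections import defaultdict, deque


class DistinctVelocity:
    """Count distinct card hashes seen per user within a sliding time window."""

    def __init__(self, window_seconds: int = 3600):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> deque of (timestamp, card_hash)

    def add(self, user_id: str, card_hash: str, ts: float) -> int:
        """Record an event and return the distinct-card count in the window."""
        events = self._events[user_id]
        events.append((ts, card_hash))
        # Evict events that fell out of the window (assumes in-order timestamps).
        while events and events[0][0] <= ts - self.window_seconds:
            events.popleft()
        return len({card for _, card in events})


def fingerprint(signals: dict, secret_key: bytes) -> str:
    """Hash raw device signals so the raw values need not be retained."""
    # Sort keys so the same signals always produce the same digest.
    canonical = "|".join(f"{k}={signals[k]}" for k in sorted(signals))
    return hmac.new(secret_key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()
```

A real stream processor would also have to handle out-of-order events and state expiry, which this sketch ignores.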

Actions

Step-up: 3DS challenge on card; OTP on bank transfer.
Shadow mode: log the ML score without acting on it, so thresholds can be calibrated before enforcement.
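
One way to picture shadow mode is a single flag that decides whether the ML outcome is enforced or only logged. The function name `decide`, the 0.9 block threshold, and the "take the stricter action" policy in enforcement mode are assumptions for this sketch.

```python
import logging

logger = logging.getLogger("fraud.shadow")

# Higher number means a stricter action.
SEVERITY = {"allow": 0, "step_up": 1, "block": 2}


def decide(rules_action: str, ml_score: float, shadow: bool, block_at: float = 0.9) -> str:
    """Return the enforced action; in shadow mode the ML result is logged only."""
    ml_action = "block" if ml_score >= block_at else "allow"
    if shadow:
        # Record what ML would have done so it can be compared with outcomes later.
        logger.info(
            "shadow ml_score=%.3f ml_action=%s enforced=%s",
            ml_score, ml_action, rules_action,
        )
        return rules_action
    # Enforcement mode: take the stricter of the rules and ML actions.
    return max(rules_action, ml_action, key=SEVERITY.__getitem__)
```

Comparing the logged shadow decisions against confirmed fraud labels gives the data needed to pick a threshold before switching `shadow` off.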

Human loop

Prioritize the review queue by expected loss; track an SLA timer per case; reviewer decisions become labels for model retraining.
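
Expected-loss prioritization can be sketched with a heap. Here expected loss is taken to be fraud probability times transaction amount, which is one common definition and an assumption of this example, as are the `ReviewQueue` class and its method names.

```python
import heapq
import itertools


class ReviewQueue:
    """Review queue that serves the case with the highest expected loss first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, case_id: str, fraud_probability: float, amount: float) -> None:
        expected_loss = fraud_probability * amount
        # heapq is a min-heap, so negate to pop the largest expected loss first.
        heapq.heappush(self._heap, (-expected_loss, next(self._counter), case_id))

    def pop(self) -> str:
        """Remove and return the case id with the highest expected loss."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

With this ordering, a low-probability case on a large transfer can outrank a high-probability case on a small payment, which matches the goal of minimizing money lost rather than the number of fraud cases missed.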

E2E: payment scoring


Tricky parts

Problem	Solution
Adversarial drift	Continuous training + champion/challenger
Bias in ML	Fairness constraints; segmented evaluation
Latency budget	Precomputed features where possible
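
Alongside precomputed features, the latency budget is protected by the hard per-stage timeouts noted for the Scoring API. A minimal sketch using `asyncio.wait_for` is shown below; the helper `call_with_budget`, the stand-in `slow_model`, and the specific timeout values are assumptions for illustration.

```python
import asyncio


async def call_with_budget(coro, timeout_s: float, fallback):
    """Await coro, returning fallback if it does not finish within timeout_s."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        # The stage overran its budget; wait_for has already cancelled it.
        return fallback


async def slow_model(delay_s: float) -> float:
    """Stand-in for a remote inference call that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return 0.87


async def main():
    # Finishes well inside the budget, so the real score is returned.
    fast = await call_with_budget(slow_model(0.001), timeout_s=0.1, fallback=None)
    # Exceeds the budget, so the caller gets the fallback instead.
    slow = await call_with_budget(slow_model(1.0), timeout_s=0.1, fallback=None)
    return fast, slow
```

When the ML stage times out, the caller can fall back to the rules-only decision rather than holding up checkout; which fallback is appropriate is a policy choice this sketch leaves open.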

Caveats

Explainability vs accuracy tradeoff: compute SHAP explanations offline; use simple rule overlays for auditors.
False positives erode revenue: A/B test the impact of stricter rules on GMV before rolling them out.

Azure

Microsoft Dynamics 365 Fraud Protection; Microsoft Sentinel for account takeover patterns across products.