Real-Time Fraud Detection
Problem statement
Score payments, logins, and transfers in under 100 ms using rules plus ML, with human review queues, false-positive management, and explainability for regulators.
How it works
- Features assembled from Redis (velocity), graph DB (device shared across accounts), warehouse aggregates.
- Rules engine fast path → ML model heavy path → risk score + action (`allow`, `step_up`, `block`).
Analogy: Airport security — metal detector (rules) catches obvious knives; random extra screening (ML risk) for subtle patterns; supervisor (human) for edge cases.
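The rules-then-model cascade can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the transaction fields, the `1000` amount cutoff, and the `0.5` / `0.9` thresholds are assumptions made up for the example, and `model` stands in for whatever inference call is actually used.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    action: str   # "allow", "step_up", or "block"
    score: float  # risk score in [0, 1]
    reason: str   # human-readable explanation for reviewers and auditors


def rules_fast_path(txn: dict) -> Optional[Decision]:
    """Return a decision when a rule fires, else None to defer to the model."""
    # Hypothetical rule: card country differs from IP country on a large payment.
    if txn["card_country"] != txn["ip_country"] and txn["amount"] > 1000:
        return Decision("block", 1.0, "country mismatch on high-value payment")
    return None


def score_to_action(score: float, step_up_at: float = 0.5, block_at: float = 0.9) -> str:
    """Map a risk score to an action using assumed thresholds."""
    if score >= block_at:
        return "block"
    if score >= step_up_at:
        return "step_up"
    return "allow"


def decide(txn: dict, model: Callable[[dict], float]) -> Decision:
    """Rules first (cheap, explainable); the model runs only if no rule fires."""
    ruled = rules_fast_path(txn)
    if ruled is not None:
        return ruled
    score = model(txn)
    return Decision(score_to_action(score), score, "ml risk score")
```

Keeping the reason string on every decision is what lets a rule-based block be explained later without re-running anything.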
High-level design
Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| Kafka transaction events | Immutable stream of payment attempts. | Replay to re-score with new models; audit for regulators. |
| Feature joiner (Flink) | Computes rolling aggregates (velocity). | Real-time features cannot be precomputed statically. |
| Scoring API | Orchestrates rules + ML with SLA budget. | Hard timeouts per stage to protect checkout latency. |
| Drools / custom rules | Declarative if-then policies editable by risk ops. | Explainable blocks (“blocked because country mismatch”). |
| SageMaker endpoint | Hosted ML model inference. | Offloads GPU/CPU scaling of inference instead of embedding the model in the monolith. |
| Case management UI | Human review for edge-case fraud. | Covers the model's uncertainty region; reviewer labels feed back into training. |
Shared definitions: 00-glossary-common-services.md
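One way to enforce the per-stage timeouts mentioned for the Scoring API is to run each stage on a worker pool and give up when its budget expires. The sketch below assumes a 10 ms rules budget, a 60 ms model budget, and a degraded-mode policy of `step_up` when the model does not answer in time; all three are illustrative choices, not values from this design.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Any, Callable

# Shared pool: a per-request `with ThreadPoolExecutor()` would wait for slow
# calls on exit and defeat the timeout.
_POOL = ThreadPoolExecutor(max_workers=8)


def call_with_budget(fn: Callable[[dict], Any], txn: dict, budget_s: float, fallback: Any) -> Any:
    """Run fn(txn), returning fallback if it exceeds budget_s seconds."""
    future = _POOL.submit(fn, txn)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already-running thread is not interrupted
        return fallback


def score(txn: dict, rules, model, rules_budget_s: float = 0.010, model_budget_s: float = 0.060) -> str:
    """Rules stage, then model stage, each with its own hard budget."""
    rule_action = call_with_budget(rules, txn, rules_budget_s, fallback=None)
    if rule_action is not None:
        return rule_action
    risk = call_with_budget(model, txn, model_budget_s, fallback=None)
    if risk is None:
        return "step_up"  # assumed degraded-mode policy when the model times out
    return "block" if risk >= 0.9 else "step_up" if risk >= 0.5 else "allow"
```

Whether a timeout should fail open (`allow`) or closed (`step_up` / `block`) is a business decision; the sketch only shows where that decision plugs in.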
Low-level design
Features
- Velocity: `countDistinct(card_hash, 1h)` in a Flink window keyed by user.
- Device fingerprint: hash raw signals; minimize retention per GDPR.
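To make the velocity feature concrete, here is a plain-Python stand-in for the windowed distinct count. It is not Flink API code; the class name and the in-order timestamp assumption are illustrative, and a real Flink job would use keyed state and event-time windows instead.

```python
from collections import defaultdict, deque


class DistinctCardVelocity:
    """Distinct card hashes seen per user within a sliding time window."""

    def __init__(self, window_s: int = 3600):
        self.window_s = window_s
        # user_id -> deque of (timestamp, card_hash), oldest first
        self._events = defaultdict(deque)

    def observe(self, user_id: str, card_hash: str, ts: float) -> int:
        """Record one event and return the user's distinct-card count in the window."""
        events = self._events[user_id]
        events.append((ts, card_hash))
        # Evict events that fell out of the window (assumes in-order timestamps).
        while events and events[0][0] <= ts - self.window_s:
            events.popleft()
        return len({card for _, card in events})
```

A user cycling through many cards in an hour produces a rising count, which is the signal the rules and the model consume.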
Actions
- Step-up: 3DS challenge on card; OTP on bank transfer.
- Shadow mode: log ML score without blocking — calibrate before enforcement.
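Shadow mode amounts to computing the action the model would take, logging it, and returning `allow` regardless until enforcement is switched on. A minimal sketch, assuming an `enforce` flag and the same illustrative `0.5` / `0.9` thresholds (the function and logger names are hypothetical):

```python
import logging

logger = logging.getLogger("fraud.shadow")


def apply_model(txn_id: str, score: float, enforce: bool,
                step_up_at: float = 0.5, block_at: float = 0.9) -> str:
    """Return the enforced action; in shadow mode only log what would have happened."""
    if score >= block_at:
        would_be = "block"
    elif score >= step_up_at:
        would_be = "step_up"
    else:
        would_be = "allow"
    # Logged in both modes so shadow and enforced decisions can be compared later.
    logger.info("txn=%s score=%.3f would_be=%s enforce=%s", txn_id, score, would_be, enforce)
    return would_be if enforce else "allow"
```

The logged `would_be` values, joined against later chargeback or review labels, are what thresholds get calibrated on before flipping `enforce`.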
Human loop
- Prioritize the queue by expected loss; enforce SLA timers; feed reviewer labels back into model retraining.
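A review queue ordered by expected loss can be sketched with a heap. Here expected loss is taken to be fraud probability times transaction amount, which is an assumption for illustration; the class names and the one-hour default SLA are likewise hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Case:
    sort_key: float                       # negative expected loss (heapq is a min-heap)
    seq: int                              # tie-breaker: earlier cases first
    txn_id: str = field(compare=False)
    due_at: float = field(compare=False)  # SLA deadline


class ReviewQueue:
    def __init__(self, sla_s: float = 3600.0):
        self.sla_s = sla_s
        self._heap = []
        self._seq = itertools.count()

    def add(self, txn_id: str, fraud_prob: float, amount: float, now: float) -> None:
        expected_loss = fraud_prob * amount  # assumed definition of expected loss
        case = Case(-expected_loss, next(self._seq), txn_id, now + self.sla_s)
        heapq.heappush(self._heap, case)

    def next_case(self) -> Case:
        """Pop the case with the highest expected loss."""
        return heapq.heappop(self._heap)

    def overdue(self, now: float) -> list:
        """Transaction IDs whose SLA deadline has passed (unordered)."""
        return [c.txn_id for c in self._heap if c.due_at < now]
```

Note that a low-probability, high-amount case can outrank a near-certain small one, which is the point of ranking by loss rather than by score.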
E2E: payment scoring
Tricky parts
| Problem | Solution |
|---|---|
| Adversarial drift | Continuous training + champion/challenger |
| Bias in ML | Fairness constraints; segmented evaluation |
| Latency budget | Precomputed features where possible |
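For champion/challenger, one common approach is deterministic hash-based assignment, so the same entity always lands in the same arm. The sketch below assumes a 10% challenger share and a per-experiment salt; the function name and both parameters are illustrative.

```python
import hashlib


def assign_arm(entity_id: str, challenger_pct: int = 10, salt: str = "exp-1") -> str:
    """Deterministically route an entity to the champion or challenger model."""
    digest = hashlib.sha256(f"{salt}:{entity_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "challenger" if bucket < challenger_pct else "champion"
```

Changing the salt reshuffles assignments for the next experiment, so the same users are not always the ones exposed to the challenger.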
Caveats
- Explainability vs accuracy tradeoff: compute SHAP explanations offline; use simple rule overlays for auditors.
- False positives erode revenue: A/B test the impact of stricter rules on GMV.
Azure
- Dynamics 365 Fraud Protection; Microsoft Sentinel for account takeover patterns across products.