# AI / ML Inference Platform

## Problem statement
Serve low-latency model inference (LLM, vision, ranking) with GPU autoscaling, multi-tenant fairness, A/B model rollouts, and cost controls.
## How it works

- Model artifacts are versioned in a registry (MLflow, SageMaker Model Registry).
- Runtime: TorchServe, Triton Inference Server, or vLLM (for LLMs) behind a GPU node pool.
- Traffic: API Gateway → router selects the model version; batch jobs go to a queue.
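The traffic split above can be sketched as follows. This is a minimal illustration, not the platform's actual router; the request fields and return values are assumptions.

```python
def dispatch(request: dict, batch_queue: list) -> str:
    """Split traffic: online path for latency-sensitive calls, queue for batch.

    `batch_queue` stands in for the SQS batch queue in the design.
    """
    if request.get("mode") == "batch":
        batch_queue.append(request)  # lower-priority offline path
        return "queued"
    # Online path: in the real system the router would also pick a model
    # version here (see the Versioning section).
    return f"online:{request['model']}"
```

The key property is that batch work never competes with the online path for GPU time at the dispatch layer.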
Analogy: A taxi rank with regular and luxury cars (model sizes); dispatch must not let one corporate account book all limos (fairness).
## High-level design
## Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| API Gateway + auth | Validates caller + rate limits inference. | GPU endpoints are expensive; must block anonymous abuse. |
| Router / feature flags | Chooses model version / canary %. | Safe rollout of new weights without separate DNS per version. |
| Triton GPU pods | Serves batched inference efficiently. | Maximizes GPU utilization via dynamic batching and multiple models per server. |
| SQS batch inference | Lower-priority offline jobs. | Separates the latency-sensitive online path from massive batch scoring. |
| Prometheus + HPA/KEDA | Autoscale on GPU utilization / queue depth. | Scale-to-zero is optional for dev; keep min replicas for prod latency. |
Shared definitions: see `00-glossary-common-services.md`.
## Low-level design
### Autoscaling

- KEDA scales on queue depth plus Prometheus custom metrics (e.g. `gpu_utilization`).
- Cold starts for large models are slow: keep minimum replicas > 0 for hot models; KServe serverless trades cold-start latency for cost.
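A KEDA-style scaling decision can be sketched as below. The thresholds and the per-replica capacity are illustrative assumptions, not tuned values; a real deployment would express this as a ScaledObject rather than application code.

```python
import math

def desired_replicas(queue_depth: int, gpu_util: float,
                     target_per_replica: int = 50,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Scale on queue depth, with GPU utilization as a secondary signal.

    min_replicas > 0 keeps hot models resident, avoiding cold starts for
    large weights; max_replicas caps GPU spend.
    """
    by_queue = math.ceil(queue_depth / target_per_replica)
    by_util = 2 if gpu_util >= 0.8 else 1  # crude pressure signal from Prometheus
    return max(min_replicas, min(max_replicas, max(by_queue, by_util)))
```

For example, an empty queue holds the floor of one hot replica, while a 500-deep queue pins the pool at its cap.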
### Multi-tenant

- Per-tenant rate limits; noisy-neighbor isolation via separate deployments for big customers.
### Versioning

- Canary: route 5% of traffic to the new model first; shadow mode logs the new model's outputs without affecting users.
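A minimal sketch of the canary-plus-shadow split, assuming hypothetical model names (`model-v1`, `model-v2`). Hashing the request id makes the assignment sticky, so a given caller consistently sees the same version:

```python
import hashlib

def route(request_id: str, canary_pct: float = 0.05, shadow: bool = True):
    """Return (primary_model, shadow_model_or_None).

    A `canary_pct` slice of traffic is served by the new model; with shadow
    mode on, the new model also scores the remaining traffic, but only its
    logs are kept -- users see the old model's response.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest()[:8], 16) / 0xFFFFFFFF
    serve_new = bucket < canary_pct
    primary = "model-v2" if serve_new else "model-v1"
    shadow_target = "model-v2" if (shadow and not serve_new) else None
    return primary, shadow_target
```

Auto-rollback then reduces to setting `canary_pct` back to zero when the new model's shadow or canary metrics regress.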
### GPU sharing

- MIG partitions on A100 GPUs; time-slicing for small models, but watch for latency interference.
## E2E: online inference
## Tricky parts
| Problem | Solution |
|---|---|
| OOM on long context | PagedAttention (vLLM); max tokens guardrails |
| Drift | Continuous evaluation on holdout; auto rollback |
| PII in prompts | TLS + log redaction + customer VPC endpoints |
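The max-tokens guardrail from the table can be sketched as a pre-admission check; the context limit and cap here are assumed numbers, tuned per model in practice.

```python
def check_request(prompt_tokens: int, max_new_tokens: int,
                  context_limit: int = 8192, new_token_cap: int = 1024):
    """Reject requests that could OOM the GPU before they reach the server.

    Bounding prompt + generation length up front is cheaper than recovering
    from an out-of-memory failure mid-generation.
    """
    if prompt_tokens + max_new_tokens > context_limit:
        return False, "context_limit_exceeded"
    if max_new_tokens > new_token_cap:
        return False, "max_tokens_too_large"
    return True, "ok"
```

This complements, rather than replaces, vLLM's PagedAttention, which reduces memory fragmentation for requests that are admitted.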
## Caveats

- Regulatory: EU AI Act logging requirements are emerging; design the audit trail early.
- Cost: spot GPUs for batch; on-demand for latency-sensitive workloads.
## Cloud

- Managed alternatives: AWS SageMaker Endpoints, Azure ML Online Endpoints, Vertex AI.