# AI / ML Inference Platform

## Problem statement
Serve low-latency model inference (LLM, vision, ranking) with GPU autoscaling, multi-tenant fairness, A/B model rollouts, and cost controls.
## How it works

- Model artifacts are versioned in a registry (MLflow, SageMaker Model Registry).
- Runtime: TorchServe, Triton Inference Server, or vLLM (for LLMs) behind a GPU node pool.
- Traffic: API Gateway → router selects the model version; batch jobs go to a queue.
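The traffic split above can be sketched as follows. This is a minimal illustration, not the platform's actual router; the request fields and return values are assumptions.

```python
def dispatch(request: dict, batch_queue: list) -> str:
    """Split traffic: online path for latency-sensitive calls, queue for batch.

    `batch_queue` stands in for the SQS batch queue in the design.
    """
    if request.get("mode") == "batch":
        batch_queue.append(request)  # lower-priority offline path
        return "queued"
    # Online path: in the real system the router would also pick a model
    # version here (see the Versioning section).
    return f"online:{request['model']}"
```

The key property is that batch work never competes with the online path for GPU time at the dispatch layer.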
Analogy: A taxi rank with regular and luxury cars (model sizes); dispatch must not let one corporate account book all limos (fairness).
## High-level design
## Components explained — this design
| Component | What it is | Why we use it here |
|---|---|---|
| API Gateway + auth | Validates caller + rate limits inference. | GPU endpoints are expensive; must block anonymous abuse. |
| Router / feature flags | Chooses model version / canary %. | Safe rollout of new weights without separate DNS per version. |
| Triton GPU pods | Serves batched inference efficiently. | Maximizes GPU utilization via dynamic batching and multiple models per server. |
| SQS batch inference | Lower-priority offline jobs. | Separates the latency-sensitive online path from massive batch scoring. |
| Prometheus + HPA/KEDA | Autoscale on GPU utilization / queue depth. | Scale-to-zero is optional for dev; keep min replicas for prod latency. |
Shared definitions: see `00-glossary-common-services.md`.
## Low-level design
### Autoscaling

- KEDA scales on queue depth plus Prometheus custom metrics (e.g. `gpu_utilization`).
- Cold starts for large models are slow: keep minimum replicas > 0 for hot models; KServe serverless trades cold-start latency for cost.
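A KEDA-style scaling decision can be sketched as below. The thresholds and the per-replica capacity are illustrative assumptions, not tuned values; a real deployment would express this as a ScaledObject rather than application code.

```python
import math

def desired_replicas(queue_depth: int, gpu_util: float,
                     target_per_replica: int = 50,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Scale on queue depth, with GPU utilization as a secondary signal.

    min_replicas > 0 keeps hot models resident, avoiding cold starts for
    large weights; max_replicas caps GPU spend.
    """
    by_queue = math.ceil(queue_depth / target_per_replica)
    by_util = 2 if gpu_util >= 0.8 else 1  # crude pressure signal from Prometheus
    return max(min_replicas, min(max_replicas, max(by_queue, by_util)))
```

For example, an empty queue holds the floor of one hot replica, while a 500-deep queue pins the pool at its cap.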
### Multi-tenant

- Per-tenant rate limits; noisy-neighbor isolation via separate deployments for big customers.
### Versioning

- Canary: route 5% of traffic to the new model first; shadow mode logs the new model's outputs without affecting users.
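A minimal sketch of the canary-plus-shadow split, assuming hypothetical model names (`model-v1`, `model-v2`). Hashing the request id makes the assignment sticky, so a given caller consistently sees the same version:

```python
import hashlib

def route(request_id: str, canary_pct: float = 0.05, shadow: bool = True):
    """Return (primary_model, shadow_model_or_None).

    A `canary_pct` slice of traffic is served by the new model; with shadow
    mode on, the new model also scores the remaining traffic, but only its
    logs are kept -- users see the old model's response.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest()[:8], 16) / 0xFFFFFFFF
    serve_new = bucket < canary_pct
    primary = "model-v2" if serve_new else "model-v1"
    shadow_target = "model-v2" if (shadow and not serve_new) else None
    return primary, shadow_target
```

Auto-rollback then reduces to setting `canary_pct` back to zero when the new model's shadow or canary metrics regress.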
### GPU sharing

- MIG partitions on A100 GPUs; time-slicing for small models, but watch for latency interference.
## E2E: online inference
## Tricky parts
| Problem | Solution |
|---|---|
| OOM on long context | PagedAttention (vLLM); max tokens guardrails |
| Drift | Continuous evaluation on holdout; auto rollback |
| PII in prompts | TLS + log redaction + customer VPC endpoints |
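The max-tokens guardrail from the table can be sketched as a pre-admission check; the context limit and cap here are assumed numbers, tuned per model in practice.

```python
def check_request(prompt_tokens: int, max_new_tokens: int,
                  context_limit: int = 8192, new_token_cap: int = 1024):
    """Reject requests that could OOM the GPU before they reach the server.

    Bounding prompt + generation length up front is cheaper than recovering
    from an out-of-memory failure mid-generation.
    """
    if prompt_tokens + max_new_tokens > context_limit:
        return False, "context_limit_exceeded"
    if max_new_tokens > new_token_cap:
        return False, "max_tokens_too_large"
    return True, "ok"
```

This complements, rather than replaces, vLLM's PagedAttention, which reduces memory fragmentation for requests that are admitted.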
## Caveats

- Regulatory: EU AI Act logging requirements are emerging; design the audit trail early.
- Cost: spot GPUs for batch; on-demand for latency-sensitive workloads.
## Cloud

- Managed alternatives: AWS SageMaker Endpoints, Azure ML Online Endpoints, Vertex AI.