
AI / ML Inference Platform

Problem statement

Serve low-latency model inference (LLM, vision, ranking) with GPU autoscaling, multi-tenant fairness, A/B models, and cost controls.

How it works

  • Model artifacts versioned in registry (MLflow, SageMaker Model Registry).
  • Runtime: TorchServe, Triton Inference Server, or vLLM for LLMs behind GPU node pool.
  • Traffic: API Gateway → router picks the model version; a queue handles batch jobs.

Analogy: A taxi rank with regular and luxury cars (model sizes); dispatch must not let one corporate account book all limos (fairness).

High-level design

(Diagram: client → API Gateway + auth → router → Triton GPU pods; batch jobs via SQS queue.)

Components explained — this design

| Component | What it is | Why we use it here |
|---|---|---|
| API Gateway + auth | Validates caller and rate-limits inference. | GPU endpoints are expensive; must block anonymous abuse. |
| Router / feature flags | Chooses model version / canary %. | Safe rollout of new weights without separate DNS per version. |
| Triton GPU pods | Serves batched inference efficiently. | Maximizes GPU utilization via dynamic batching and multiple models per server. |
| SQS batch inference | Lower-priority offline jobs. | Separates the latency-sensitive online path from massive batch scoring. |
| Prometheus + HPA/KEDA | Autoscale on GPU util / queue depth. | Scale-to-zero optional for dev; min replicas for prod latency. |
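The dynamic batching that makes Triton GPU pods efficient can be sketched roughly as follows. This is a simplified illustration, not Triton's actual implementation; `max_batch` and `max_wait_ms` are assumed knobs mirroring Triton's batch-size and queue-delay settings.

```python
import time
from queue import Queue, Empty

def dynamic_batcher(request_queue: Queue, max_batch: int = 8, max_wait_ms: float = 5.0):
    """Collect requests into one batch: flush on max_batch or max_wait_ms, whichever first."""
    batch = [request_queue.get()]  # block until the first request arrives
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough; ship a partial batch to keep latency bounded
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break
    return batch
```

The trade-off is visible in the two knobs: a larger `max_wait_ms` fills batches (better GPU utilization) at the cost of per-request latency.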

Shared definitions: 00-glossary-common-services.md

Low-level design

Autoscaling

  • KEDA on queue depth + Prometheus gpu_utilization custom metrics.
  • Cold start of large models is slow; keep minimum replicas > 0 for hot models. KServe serverless trades latency for cost.
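KEDA and the HPA express this declaratively, but the underlying scaling decision can be sketched as a function of the two signals above. The thresholds `target_per_replica` and `target_util` are illustrative assumptions, not recommended values.

```python
import math

def desired_replicas(queue_depth: int, gpu_util: float, current: int,
                     target_per_replica: int = 50, target_util: float = 0.7,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Scale on whichever signal demands more capacity, clamped to [min, max]."""
    by_queue = math.ceil(queue_depth / target_per_replica)      # KEDA-style queue scaler
    by_util = math.ceil(current * gpu_util / target_util)       # HPA-style utilization scaler
    return max(min_replicas, min(max_replicas, max(by_queue, by_util)))
```

Note the `min_replicas` floor: setting it above zero is exactly the "keep hot models warm" mitigation for cold starts.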

Multi-tenant

  • Per-tenant rate limits; noisy neighbor isolation via separate deployments for big customers.

Versioning

  • Canary: route 5% of traffic to the new model; shadow mode logs its outputs without affecting users.
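A sketch of canary plus shadow routing, under assumptions: the version names and the `infer`/`shadow_infer` callables are hypothetical, and a real router would read the canary percentage from a feature-flag service.

```python
import hashlib

def pick_version(request_id: str, canary_pct: float = 5.0) -> str:
    """Deterministic per-request split so retries of the same request hit the same version."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "v2-canary" if bucket < canary_pct else "v1-stable"

def serve(request_id: str, prompt: str, infer, shadow_infer=None):
    """Serve from the routed version; optionally shadow-call the new model for logging only."""
    version = pick_version(request_id)
    response = infer(version, prompt)
    if shadow_infer is not None:
        try:
            shadow_infer("v2-shadow", prompt)  # output compared offline, never returned
        except Exception:
            pass  # shadow failures must not affect the user-facing response
    return version, response
```

Hashing the request ID (rather than random sampling) keeps routing stable, which simplifies debugging and per-version metrics.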

GPU sharing

  • MIG partitions on A100; time-slicing for small models, but watch for latency interference between co-located workloads.

E2E: online inference

(Diagram: client request → gateway auth + rate limit → router picks version → Triton pod batches and infers → response; metrics exported to Prometheus.)
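The online path can be sketched end to end; every helper here is an illustrative stub standing in for the real gateway, limiter, router, and GPU backend.

```python
VALID_KEYS = {"k1"}
TENANT_BUDGET = {"acme": 2}  # toy budget standing in for a real rate limiter

def authenticate(api_key: str) -> bool:
    return api_key in VALID_KEYS

def rate_limit_ok(tenant: str) -> bool:
    if TENANT_BUDGET.get(tenant, 0) <= 0:
        return False
    TENANT_BUDGET[tenant] -= 1
    return True

def route(request_id: str) -> str:
    return "v1-stable"  # a real router would apply the canary split here

def gpu_infer(version: str, prompt: str) -> str:
    return f"[{version}] echo: {prompt}"  # stand-in for the batched Triton call

def handle_request(req: dict) -> dict:
    """Hypothetical online path: auth → rate limit → route → infer."""
    if not authenticate(req["api_key"]):
        return {"status": 401}
    if not rate_limit_ok(req["tenant"]):
        return {"status": 429}
    version = route(req["request_id"])
    output = gpu_infer(version, req["prompt"])
    return {"status": 200, "model": version, "output": output}
```

Note the ordering: authentication and rate limiting both fail fast before any GPU work is scheduled, which is the whole point of gating expensive endpoints.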

Tricky parts

| Problem | Solution |
|---|---|
| OOM on long context | PagedAttention (vLLM); max-token guardrails |
| Drift | Continuous evaluation on holdout; auto rollback |
| PII in prompts | TLS + log redaction + customer VPC endpoints |
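The max-token guardrail from the table can be as simple as clamping the requested output length so prompt plus output never exceeds the context window. The window size and hard cap below are assumed example values.

```python
def clamp_max_tokens(prompt_tokens: int, requested: int,
                     context_window: int = 8192, hard_cap: int = 1024) -> int:
    """Never allow prompt + output to exceed the context window, or a per-request cap."""
    available = max(0, context_window - prompt_tokens)
    return min(requested, hard_cap, available)
```

PagedAttention reduces memory fragmentation, but a clamp like this is still needed so a single long-context request cannot demand unbounded KV-cache growth.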

Caveats

  • Regulatory: EU AI Act logging requirements are emerging; design the audit trail early.
  • Cost: Spot GPUs for batch jobs; on-demand for latency-sensitive traffic.

Cloud

  • AWS SageMaker Endpoints, Azure ML Online Endpoints, Vertex AI.