Inference Serving

Training a model is a project; serving it is a system. Production inference serving has its own discipline distinct from training.

This page covers the practical concerns.

Core requirements

A production inference service must:

- Accept requests (HTTP, gRPC, queue)

- Run models efficiently

- Scale with traffic

- Recover from failures

- Be observable

- Update models safely

Serving frameworks

TorchServe

PyTorch's official serving framework. Decent default for PyTorch models.

TensorFlow Serving

Mature, performant. Strong for TF models.

NVIDIA Triton

Multi-framework. Excellent batching and GPU utilization. Industry standard for GPU serving.

Ray Serve

Python-native, flexible composition. Good for complex pipelines.

vLLM, TGI

LLM-specific. Continuous batching and PagedAttention give major throughput gains.

Custom (FastAPI/Flask + model)

Quick to build; loses out on optimizations like batching.

Choose based on:

- Framework you trained in

- Model type (LLM vs traditional)

- Scale requirements

- Team experience

Batching strategies

No batching

One request → one inference. Wastes hardware.

Static batching

Wait for N requests, then batch. Adds latency.

Dynamic batching

Form batches based on queue + max wait time. Tunable latency-throughput tradeoff.

Continuous batching (LLMs)

Requests can join/leave the batch mid-generation. Major throughput improvement for autoregressive models.

vLLM and TGI implement this.

Latency budget

Define p50, p95, p99 latency targets.

Components:

- Network in

- Queueing

- Preprocessing

- Inference

- Postprocessing

- Network out

Profile each. Common surprises:

- Tokenization can dominate for short LLM requests

- Preprocessing/postprocessing in Python is slow

- TCP setup adds tens of ms for cold connections

Autoscaling

Scale up to handle load; scale down to save money.

Metrics

- QPS / RPS (request-rate based)

- GPU/CPU utilization

- Queue length / age (best for ML)

- Custom metrics

Cold start

GPUs and large models load slowly. Cold start can be 30s+.

Mitigations:

- Keep minimum replicas warm

- Pre-warm on scale-up signals

- Use smaller models for low traffic

Spot instances

For non-critical, interruption-tolerant workloads, spot instances cut cost dramatically.

Multi-model serving

Multi-model on one instance

Multiple models share resources. Saves money for low-traffic models.

Model ensemble pipelines

Output of one model feeds another. Common for vision + classification, retrieval + reranking.

Triton's ensemble feature; Ray Serve composition.

Routing

Choose model per request based on input characteristics.

Versioning and rollout

Blue-green

Two environments; switch traffic atomically.

Canary

Send small % to new version. Monitor metrics. Increase gradually.

Shadow / mirror

Run new version in parallel without serving its responses. Compare quality.

A/B testing

Send different traffic to different versions. Measure business impact.

For ML models, output drift between versions is common. Shadow testing catches surprises.

Monitoring

Latency

p50, p95, p99 — track all three.

Throughput

QPS over time. Detect traffic anomalies.

Errors

Inference errors, timeouts, OOM.

Quality

This is unique to ML:

- Distribution shift detection

- Output statistics

- Sample-based human review

- Online metrics where available

Cost

Cost per request, by model. Surprising things happen.

Caching

Response caching

Identical input → cached output. Effective for queries with repetition.

Embedding caching

For pipelines with embeddings, cache by content hash.

Prompt caching (LLMs)

Cache prefix computations. Major savings for system prompts and RAG.

Resource isolation

Multi-tenant models can interfere:

- One model's batch starves another

- OOM in one impacts all

- GPU memory fragmentation

Mitigations:

- Per-model resource limits

- Queue isolation

- Separate processes/pods for isolation

Failure handling

Graceful degradation

Model down? Fall back to:

- Cached results

- Simpler model

- Static response

- Explicit "unavailable" message

Circuit breakers

If error rate spikes, stop sending traffic. Lets the system recover.

Retries

Retry transient failures. Avoid retry storms.

Timeouts

Per-request timeouts prevent slow requests from blocking workers.

Hardware utilization

GPU utilization

GPU "utilization" metric is misleading. A GPU at 100% utilization may be memory-bandwidth-bound.

Better: tokens/second, requests/second.

Memory

Out-of-memory is the most common failure mode. Monitor headroom.

CPU/GPU split

For small models, CPU may be cheaper. See [CPUInference](CPUInference).

Common failure patterns

Optimizing inference but not the rest

Tokenization, preprocessing, network — often the bottleneck.

Insufficient observability

Without metrics, you can't optimize.

Under-provisioning for tail latency

p99 matters for user experience even when p50 looks fine.

Cold start surprises

Autoscaling that creates 30s of timeouts.

Model swap regressions

New model deploys and quality silently drops.

Batch starvation

One slow request blocks an entire batch. Mitigate with timeouts and dynamic batching.

Operational maturity

Stages:

1. **Notebook to API**: hosted demo, no SLA

2. **Production POC**: serves real traffic, manual ops

3. **Scaled production**: autoscaling, monitoring, on-call

4. **Optimized production**: cost optimization, multi-model, sophisticated routing

Most teams under-invest in stages 3-4.

Build vs buy

Hosted inference services (Replicate, Together, Modal, AWS SageMaker, Vertex AI) handle a lot of this.

For small teams, hosted is often the right call until cost becomes prohibitive.

Further Reading

- [CostEffectiveInference](CostEffectiveInference) — Cost optimization

- [CPUInference](CPUInference) — CPU-based serving

- [MlModelDeployment](MlModelDeployment) — Deployment processes

- [ML Hub](MLHub) — Cluster index