ML Model Deployment

Going from "trained model" to "model serving production traffic" involves more than uploading a file. Deployment touches packaging, versioning, infrastructure, monitoring, and team practices.

This page covers the full process.

Why ML deployment is harder than software deployment

Software:

- Code is the artifact

- Behavior is deterministic

- Failures are usually obvious

ML:

- Code + weights + data are the artifact

- Behavior depends on data distribution

- Failures can be silent (degraded predictions, no exceptions)

This is why ML deployment needs ML-specific practices.

Packaging a model

What needs to be deployed:

- **Weights** (the trained parameters)

- **Architecture** (model code)

- **Preprocessing** (feature engineering, tokenization)

- **Postprocessing** (output formatting)

- **Dependencies** (libraries, versions)

- **Metadata** (training data, metrics, hyperparameters)

Single Python files don't capture this. Use:

- Container images (Docker)

- Model registry tools (MLflow, Weights & Biases, Hugging Face Hub)

- Standardized formats (ONNX, TorchScript, TensorFlow SavedModel)

Model registry

Centralized model storage. Tracks:

- Versions

- Lineage (training data, code, hyperparameters)

- Metrics

- Deployment status

Tools:

- MLflow

- Weights & Biases

- Hugging Face Hub

- SageMaker Model Registry

- Vertex AI Model Registry

A registry separates "model artifacts" from "code repos."

Versioning

Three things to version together:

- Code (git commit)

- Model weights (model registry version)

- Data (dataset version, schema version)

For reproducibility, all three must align.

Schemes:

- Semantic (1.0.0, 1.0.1)

- Date-based (2026-04-26)

- Hash-based (commit + data hash)

Pick one and stick to it.

Deployment patterns

Real-time inference

Synchronous request/response. Used for:

- User-facing predictions

- API integrations

- Interactive systems

Latency-sensitive.

Batch inference

Score large datasets offline.

Used for:

- Email targeting

- Daily/hourly scoring jobs

- Recommendation pre-computation

Throughput-sensitive; latency rarely matters.

Streaming inference

Continuous data through model:

- Fraud detection on transactions

- Real-time content moderation

Backpressure and ordering matter.

Embedded / edge

Model runs on user device. Different constraints (memory, power).

Rollout strategies

Big bang

Deploy new version, switch traffic. Risky for ML.

Canary

Route small % to new version. Monitor. Expand if good.

Most teams should default to canary.

Shadow

New version receives traffic but responses are discarded. Compare quality offline.

Doesn't risk users; doesn't validate behavior under real conditions.

A/B test

Different users see different versions. Measure business metrics.

Requires statistical rigor.

Multi-armed bandit

Dynamically route traffic based on observed performance.

Sophisticated; needed only when frequent retraining matters.

Pre-deployment checks

Before any deployment:

- Model passes accuracy thresholds on held-out test set

- No data leakage in training

- Inference latency meets SLA

- Memory footprint within budget

- Edge cases tested

- Bias/fairness checks if applicable

Make these automated. Manual checks get skipped.

Monitoring

Operational metrics

- Latency

- Throughput

- Error rate

- Resource usage

These are software-deployment standard.

ML-specific metrics

- Input distribution drift

- Prediction distribution drift

- Quality metrics where ground truth is available

- Confidence/calibration metrics

- Per-segment metrics (don't trust the average)

Sample-based human review

Some ML failures are only detectable by humans. Sample outputs regularly.

Alerts

Set thresholds. Alert on:

- Latency regression

- Quality regression

- Distribution shift

- Drop in coverage

Rollback

Plan rollback before deployment.

Rollback artifacts:

- Previous model version available

- Quick switch mechanism (canary reversal)

- Tested rollback path

Time-to-rollback matters. Aim for minutes, not hours.

Retraining cadence

Some models age:

- Recommender systems: hours/days

- Fraud detection: weeks

- Image classification: months/years

Decide:

- Manual retraining or automatic?

- Triggered by drift or scheduled?

- New version per retrain?

Automatic retraining + monitoring is the goal but adds complexity.

Feature stores

For consistent feature engineering between training and serving:

- Feast (open source)

- Tecton, Hopsworks (managed)

- Custom built

Solves: training/serving skew where features computed differently.

Worth it when features are complex or shared across models.

CI/CD for ML

Pipelines should:

- Run tests on code

- Validate data

- Train (or at least eval) the model

- Compare to baseline

- Deploy if quality clears bar

- Run deployment-time tests

Tools: Kubeflow, MLflow, Vertex AI Pipelines, GitHub Actions.

Common failure patterns

Training-serving skew

Features computed differently in training vs serving. Subtle quality regression.

Prevention: shared feature pipeline, integration tests.

No baseline

Without a baseline model, you can't tell if changes help.

Eval set rot

Test set used for hyperparameter tuning becomes contaminated. Need fresh holdout.

No human eval

Some failures only humans can spot.

Insufficient monitoring

Quality silently degrades. Discovered weeks later from business metrics.

Skipping shadow / canary

Risk-aversion theater (lots of pre-deploy checks) doesn't substitute for real-traffic validation.

One-time deployment thinking

Models need redeployment. Build for repeated deploys, not one-shot.

Organizational concerns

Who owns a deployed model?

ML team? Platform team? Application team?

Without clear ownership, models rot.

On-call

Models in production need on-call coverage. Including ML-specific incidents (drift, quality drops).

Documentation

Model cards: what does this model do, what data was it trained on, what are its limitations.

Practical maturity model

1. **Manual**: ML engineer manually deploys on request

2. **Pipeline**: scripted deployment, manual quality gates

3. **CI/CD**: automated deployment, automated quality gates

4. **Continuous training**: automated retraining and deployment with monitoring

Most teams are at level 1-2. Reach level 3 before automating retraining.

Further Reading

- [InferenceServing](InferenceServing) — Serving infrastructure

- [CostEffectiveInference](CostEffectiveInference) — Cost optimization

- [CrossValidationAndModelEvaluation](CrossValidationAndModelEvaluation) — Evaluation

- [ML Hub](MLHub) — Cluster index