ML Model Deployment

Going from "trained model" to "model serving production traffic" involves more than uploading a file. Deployment touches packaging, versioning, infrastructure, monitoring, and team practices.

This page covers the full process, from packaging and rollout through monitoring and team ownership.

Why ML deployment is harder than software deployment

Software:

Code is the artifact
Behavior is largely deterministic
Failures are usually obvious

ML:

Code + weights + data are the artifact
Behavior depends on data distribution
Failures can be silent (degraded predictions, no exceptions)

This is why ML deployment needs ML-specific practices.

Packaging a model

What needs to be deployed:

Weights (the trained parameters)
Architecture (model code)
Preprocessing (feature engineering, tokenization)
Postprocessing (output formatting)
Dependencies (libraries, versions)
Metadata (training data, metrics, hyperparameters)

A single Python file or a bare weights file doesn't capture all of this. Use:

Container images (Docker)
Model registry tools (MLflow, Weights & Biases, Hugging Face Hub)
Standardized formats (ONNX, TorchScript, TensorFlow SavedModel)
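
As a rough sketch of how the pieces listed above can travel together, a manifest can record each item next to a checksum of the weights. The `ModelManifest` fields and the `build_manifest` helper are illustrative assumptions, not a standard format, and every value in the example is made up.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ModelManifest:
    # All field names are hypothetical, chosen to mirror the list above.
    name: str
    weights_sha256: str
    architecture: str      # e.g. import path of the model class
    preprocessing: str     # identifier of the preprocessing code
    postprocessing: str    # identifier of the output-formatting code
    dependencies: dict = field(default_factory=dict)  # package -> pinned version
    metadata: dict = field(default_factory=dict)      # metrics, hyperparameters, data refs


def build_manifest(name, weights_bytes, **kwargs):
    """Hash the weights so the manifest is tied to one exact artifact."""
    digest = hashlib.sha256(weights_bytes).hexdigest()
    return ModelManifest(name=name, weights_sha256=digest, **kwargs)


# Made-up values, for illustration only.
manifest = build_manifest(
    "churn-model",
    b"fake-weights-for-illustration",
    architecture="models.churn.ChurnNet",
    preprocessing="features.v3",
    postprocessing="format.v1",
    dependencies={"numpy": "1.26.4"},
    metadata={"auc": 0.91},
)
manifest_json = json.dumps(asdict(manifest), indent=2, sort_keys=True)
```

A file like this would typically ship inside the container image or be stored beside the artifact in the registry.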

Model registry

Centralized model storage. Tracks:

Versions
Lineage (training data, code, hyperparameters)
Metrics
Deployment status

Tools:

MLflow
Weights & Biases
Hugging Face Hub
SageMaker Model Registry
Vertex AI Model Registry

A registry separates "model artifacts" from "code repos": large binary artifacts and their metadata get their own versioned store instead of living in git.

Versioning

Three things to version together:

Code (git commit)
Model weights (model registry version)
Data (dataset version, schema version)

For reproducibility, all three must align.

Schemes:

Semantic (1.0.0, 1.0.1)
Date-based (2026-04-26)
Hash-based (commit + data hash)

Pick one and stick to it.
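
A minimal sketch of the hash-based scheme, assuming the three identifiers are available as strings. The `version_id` helper and the input values are hypothetical placeholders.

```python
import hashlib


def version_id(git_commit: str, weights_version: str, data_hash: str) -> str:
    """Derive one short ID from the three things that must align."""
    joined = "\n".join([git_commit, weights_version, data_hash])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


# Placeholder inputs, not real identifiers.
v1 = version_id("a1b2c3d", "registry-v7", "data-9f8e")
v2 = version_id("a1b2c3d", "registry-v7", "data-0000")  # only the data changed
```

Changing any one of the three inputs yields a different ID, which is the property that makes the scheme useful for reproducibility.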

Deployment patterns

Real-time inference

Synchronous request/response. Used for:

User-facing predictions
API integrations
Interactive systems

Latency-sensitive.

Batch inference

Score large datasets offline.

Used for:

Email targeting
Daily/hourly scoring jobs
Recommendation pre-computation

Throughput-sensitive; latency rarely matters.
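
A minimal sketch of offline scoring in fixed-size chunks, which keeps memory bounded while giving the model full batches. `predict_batch` is a stand-in for a real model call, and the batch size is an arbitrary assumption.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of up to batch_size rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def predict_batch(chunk):
    # Stand-in for a real model call; here it just doubles each input.
    return [2 * x for x in chunk]


def score_all(rows, batch_size=1000):
    """Score every row, one chunk at a time."""
    results = []
    for chunk in batched(rows, batch_size):
        results.extend(predict_batch(chunk))
    return results


scores = score_all(range(10), batch_size=4)
```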

Streaming inference

Data flows through the model continuously. Used for:

Fraud detection on transactions
Real-time content moderation

Backpressure and ordering matter.

Embedded / edge

Model runs on user device. Different constraints (memory, power).

Rollout strategies

Big bang

Deploy the new version and switch all traffic at once. Risky for ML, because a silent quality failure reaches every user before anyone notices.

Canary

Route small % to new version. Monitor. Expand if good.

Most teams should default to canary.
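
One way to route a small percentage is to hash a stable identifier, so each user consistently sees the same version and raising the percentage only adds users. This is a sketch under that assumption; the `route` function and the user ID format are hypothetical.

```python
import hashlib


def route(user_id: str, canary_percent: float) -> str:
    """Deterministically send a stable slice of users to the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    # canary_percent * 100 converts percent to the same 0..10000 scale.
    return "canary" if bucket < canary_percent * 100 else "stable"


users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u, 5.0) == "canary" for u in users) / len(users)
```

Rolling back the canary is then a matter of setting the percentage to 0.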

Shadow

The new version receives a copy of live traffic, but its responses are logged rather than returned to users. Compare quality offline.

Doesn't risk users, and exercises the model on real inputs, but can't show how users react to the new outputs.
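
A minimal sketch of shadow serving, assuming both models are plain callables and the log is an in-memory list. A real system would usually call the shadow asynchronously so it cannot add latency; that part is omitted here. All names are hypothetical.

```python
def serve_with_shadow(request, primary, shadow, log):
    """Return the primary's answer; record the shadow's answer for offline comparison."""
    response = primary(request)
    try:
        shadow_response = shadow(request)
    except Exception as exc:  # a shadow failure must never reach the user
        shadow_response = f"error: {exc!r}"
    log.append({"request": request, "primary": response, "shadow": shadow_response})
    return response


# Toy models: two thresholds on a score.
log = []
old_model = lambda x: x > 0.5
new_model = lambda x: x > 0.4
answers = [serve_with_shadow(x, old_model, new_model, log) for x in (0.3, 0.45, 0.9)]
disagreements = [e for e in log if e["primary"] != e["shadow"]]
```

The disagreement list is the starting point for offline review: it shows exactly which requests the two versions would have handled differently.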

A/B test

Different users see different versions. Measure business metrics.

Requires statistical rigor.
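
As one example of the rigor involved, a two-proportion z-test checks whether a difference in conversion rate between two versions is larger than chance would explain. This is a sketch of that one test, not a full experiment design (it ignores sample-size planning and repeated peeking). The counts are made up.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 10.0% vs 11.0% conversion on 10,000 users each.
z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```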

Multi-armed bandit

Dynamically route traffic based on observed performance.

Sophisticated; worth the complexity mainly when models are retrained so often that fixed A/B tests can't keep up.
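
A minimal sketch of one bandit strategy, Thompson sampling with a Beta prior over a binary success signal. The choice of algorithm, the `ThompsonRouter` class, and the simulated success rates are all assumptions for illustration.

```python
import random


class ThompsonRouter:
    """Beta-Bernoulli Thompson sampling over model versions."""

    def __init__(self, versions, seed=None):
        self.rng = random.Random(seed)
        # [successes + 1, failures + 1], i.e. a uniform Beta(1, 1) prior.
        self.stats = {v: [1, 1] for v in versions}

    def choose(self):
        """Sample a plausible success rate per version and pick the best draw."""
        draws = {v: self.rng.betavariate(a, b) for v, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, version, success):
        """Record one observed outcome for the version that served the request."""
        self.stats[version][0 if success else 1] += 1


# Simulation with made-up true success rates.
true_rates = {"v1": 0.1, "v2": 0.6}
router = ThompsonRouter(true_rates, seed=0)
sim = random.Random(1)
picks = {v: 0 for v in true_rates}
for _ in range(2000):
    v = router.choose()
    picks[v] += 1
    router.update(v, sim.random() < true_rates[v])
```

Over the run, traffic shifts toward the version with the higher observed success rate while the weaker one still gets occasional exploration.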

Pre-deployment checks

Before any deployment:

Model passes accuracy thresholds on held-out test set
No data leakage in training
Inference latency meets SLA
Memory footprint within budget
Edge cases tested
Bias/fairness checks if applicable

Make these automated. Manual checks get skipped.
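
A minimal sketch of an automated gate over some of the checks above, assuming an evaluation step has already produced a report dictionary. The key names and threshold values are hypothetical.

```python
def run_predeploy_checks(report, thresholds):
    """Return the names of failed checks; an empty list means the gate passes."""
    failures = []
    if report["accuracy"] < thresholds["min_accuracy"]:
        failures.append("accuracy")
    if report["p95_latency_ms"] > thresholds["max_p95_latency_ms"]:
        failures.append("latency")
    if report["memory_mb"] > thresholds["max_memory_mb"]:
        failures.append("memory")
    if report["edge_cases_failed"] > 0:
        failures.append("edge_cases")
    return failures


# Made-up report: accurate enough, but too slow.
report = {"accuracy": 0.93, "p95_latency_ms": 180, "memory_mb": 900, "edge_cases_failed": 0}
thresholds = {"min_accuracy": 0.90, "max_p95_latency_ms": 150, "max_memory_mb": 1024}
failures = run_predeploy_checks(report, thresholds)
```

In a pipeline, a non-empty result would fail the job so the deploy step never runs.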

Monitoring

Operational metrics

Latency
Throughput
Error rate
Resource usage

These are standard for any software deployment.

ML-specific metrics

Input distribution drift
Prediction distribution drift
Quality metrics where ground truth is available
Confidence/calibration metrics
Per-segment metrics (don't trust the average)
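
Drift needs a number before it can be tracked. One common choice, not prescribed by this page, is the Population Stability Index (PSI) between a reference sample and a live sample of the same feature or prediction. The sketch below uses equal-width bins derived from the reference sample. A widely quoted rule of thumb treats values above roughly 0.25 as a significant shift, but thresholds should be tuned per feature.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in reference]  # synthetic drift
psi_same = psi(reference, reference)
psi_shifted = psi(reference, shifted)
```

Running the same calculation per segment, rather than only on the pooled data, follows the "don't trust the average" advice above.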

Sample-based human review

Some ML failures are only detectable by humans. Sample outputs regularly.

Alerts

Set thresholds. Alert on:

Latency regression
Quality regression
Distribution shift
Drop in coverage
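
A minimal sketch of threshold-based alerting, assuming metrics are compared against a stored baseline by relative change. The metric names and limits are hypothetical. Using the absolute change means a sudden improvement is flagged too, which is often worth a look.

```python
def check_alerts(baseline, current, max_relative_change):
    """Flag metrics whose relative change from baseline exceeds its limit."""
    alerts = []
    for metric, limit in max_relative_change.items():
        change = (current[metric] - baseline[metric]) / baseline[metric]
        if abs(change) > limit:
            alerts.append((metric, round(change, 3)))
    return alerts


# Made-up values for illustration.
baseline = {"p95_latency_ms": 100.0, "accuracy": 0.90, "coverage": 0.98}
current = {"p95_latency_ms": 130.0, "accuracy": 0.89, "coverage": 0.80}
limits = {"p95_latency_ms": 0.2, "accuracy": 0.05, "coverage": 0.1}
alerts = check_alerts(baseline, current, limits)
```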

Rollback

Plan rollback before deployment.

Rollback artifacts:

Previous model version available
Quick switch mechanism (canary reversal)
Tested rollback path

Time-to-rollback matters. Aim for minutes, not hours.
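
One way to get rollback down to minutes is to have serving follow a "production" pointer rather than a hard-coded version, so rolling back is a pointer change and not a rebuild. The `ModelRouter` class below is a hypothetical in-memory sketch of that idea.

```python
class ModelRouter:
    """Keeps a production pointer plus history so rollback is one call."""

    def __init__(self, initial_version):
        self.history = [initial_version]

    @property
    def production(self):
        """The version currently serving traffic."""
        return self.history[-1]

    def promote(self, version):
        """Point production at a new version, keeping the old one available."""
        self.history.append(version)

    def rollback(self):
        """Drop the current version and return to the previous one."""
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        return self.history.pop()


router = ModelRouter("v1")
router.promote("v2")
removed = router.rollback()
```

Exercising `rollback` in a staging environment before each release is what makes it a tested rollback path.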

Retraining cadence

Models go stale at different rates. Rough retraining intervals:

Recommender systems: hours/days
Fraud detection: weeks
Image classification: months/years

Decide:

Manual retraining or automatic?
Triggered by drift or scheduled?
New version per retrain?

Automatic retraining + monitoring is the goal but adds complexity.
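
The drift-versus-schedule decision can be combined: retrain when either condition fires. A minimal sketch, assuming a drift score is already computed elsewhere. The 30-day age limit and the 0.25 drift threshold are placeholder values, not recommendations.

```python
from datetime import datetime, timedelta


def should_retrain(last_trained, now, drift_score,
                   max_age=timedelta(days=30), drift_threshold=0.25):
    """Retrain when drift crosses the threshold or the model is too old."""
    if drift_score >= drift_threshold:
        return True, "drift"
    if now - last_trained >= max_age:
        return True, "schedule"
    return False, "none"


now = datetime(2026, 4, 26)
decision = should_retrain(now - timedelta(days=45), now, drift_score=0.05)
```

Returning the reason along with the decision makes it easier to audit why a given retrain happened.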

Feature stores

For consistent feature engineering between training and serving:

Feast (open source)
Tecton, Hopsworks (managed)
Custom built

Solves training/serving skew, where features are computed differently in training and in serving.

Worth it when features are complex or shared across models.

CI/CD for ML

Pipelines should:

Run tests on code
Validate data
Train (or at least eval) the model
Compare to baseline
Deploy if quality clears bar
Run deployment-time tests

Tools: Kubeflow, MLflow, Vertex AI Pipelines, GitHub Actions.
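
Independent of the tool, the pipeline steps above amount to ordered stages where any failure stops the run before deployment. A toy sketch of that control flow, with stand-in stages and made-up metrics; it does not use the API of any of the tools listed.

```python
def run_pipeline(stages):
    """Run (name, callable) stages in order; stop at the first that returns False."""
    completed = []
    for name, stage in stages:
        if not stage():
            return completed, name  # name of the failing stage
        completed.append(name)
    return completed, None


candidate = {"accuracy": 0.91}  # made-up evaluation results
baseline = {"accuracy": 0.90}
deployed = []


def deploy():
    deployed.append("candidate")
    return True


stages = [
    ("test_code", lambda: True),        # stand-in for unit tests
    ("validate_data", lambda: True),    # stand-in for schema checks
    ("compare_to_baseline", lambda: candidate["accuracy"] >= baseline["accuracy"]),
    ("deploy", deploy),
    ("smoke_test", lambda: True),       # stand-in for deployment-time tests
]
completed, failed_at = run_pipeline(stages)
```

Because the baseline comparison sits before the deploy stage, a candidate that scores below the baseline never reaches production.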

Common failure patterns

Training-serving skew

Features are computed differently in training and in serving, which causes a subtle quality regression with no errors raised.

Prevention: shared feature pipeline, integration tests.
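
When the two code paths can't be merged into one shared pipeline, an integration test can at least assert that they agree on sample inputs. The feature functions below are hypothetical examples, and the buggy variant is included only to show the test catching a mismatch.

```python
import math


def features_offline(row):
    """Hypothetical training-time feature code."""
    return {"log_amount": math.log1p(row["amount"]), "is_weekend": row["weekday"] >= 5}


def features_online(event):
    """Hypothetical serving-time feature code, written separately."""
    return {"log_amount": math.log1p(event["amount"]), "is_weekend": event["weekday"] >= 5}


def features_online_buggy(event):
    """Same as above but with an off-by-one weekend rule."""
    return {"log_amount": math.log1p(event["amount"]), "is_weekend": event["weekday"] >= 6}


def parity_mismatches(rows, offline_fn, online_fn, tol=1e-9):
    """Return (row index, feature name) pairs where the two code paths disagree."""
    mismatches = []
    for i, row in enumerate(rows):
        off, on = offline_fn(row), online_fn(row)
        for key in sorted(off.keys() | on.keys()):
            a, b = off.get(key), on.get(key)
            if isinstance(a, float) and isinstance(b, float):
                same = math.isclose(a, b, abs_tol=tol)
            else:
                same = a == b
            if not same:
                mismatches.append((i, key))
    return mismatches


samples = [{"amount": 12.5, "weekday": 2}, {"amount": 0.0, "weekday": 5}]
clean = parity_mismatches(samples, features_offline, features_online)
skewed = parity_mismatches(samples, features_offline, features_online_buggy)
```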

No baseline

Without a baseline model, you can't tell if changes help.

Eval set rot

A test set reused for hyperparameter tuning becomes contaminated, and scores on it overstate real quality. Keep a fresh holdout.

No human eval

Some failures only humans can spot.

Insufficient monitoring

Quality silently degrades. Discovered weeks later from business metrics.

Skipping shadow / canary

Risk-aversion theater (lots of pre-deploy checks) doesn't substitute for real-traffic validation.

One-time deployment thinking

Models need redeployment. Build for repeated deploys, not one-shot.

Organizational concerns

Who owns a deployed model?

ML team? Platform team? Application team?

Without clear ownership, models rot.

On-call

Models in production need on-call coverage, including for ML-specific incidents (drift, quality drops).

Documentation

Model cards: what does this model do, what data was it trained on, what are its limitations.

Practical maturity model

Level 1, Manual: an ML engineer deploys by hand on request
Level 2, Pipeline: scripted deployment, manual quality gates
Level 3, CI/CD: automated deployment, automated quality gates
Level 4, Continuous training: automated retraining and deployment with monitoring

Most teams are at level 1 or 2. Reach level 3 before automating retraining, since automated retraining without automated quality gates ships models nobody has checked.