Model Selection

You have a problem and many possible models. Which to choose?

Model selection has two dimensions:

Choosing the algorithm/architecture
Choosing hyperparameters within an algorithm

Both involve evaluation and tradeoffs.

Choosing the algorithm

Match to problem characteristics

Tabular data, mixed features: gradient boosting (XGBoost, LightGBM, CatBoost) is usually best
Lots of structured data, simple features: linear models are surprisingly competitive
Images: CNNs or vision transformers
Text: transformer-based models
Time series: depends on the data; gradient boosting often beats specialized models on shorter series, while LSTMs or transformers can pay off on longer ones
Tabular with very few examples (~100s): linear models, gradient boosting with regularization

Match to constraints

Latency budget: simpler models are faster at inference
Memory budget: small models or quantization
Interpretability requirement: linear models, decision trees, GAMs
Lots of labeled data: deep learning
Limited labeled data: pretrained models, simpler architectures
Online learning required: pick algorithms that support incremental updates (for example, linear models trained with SGD); many others need full retraining

Match to team

Skills you have
Tooling already in place
Maintenance burden

A model nobody can debug is a liability.

Establishing baselines

Before complex models, establish baselines:

Random / majority class: lowest bar
Simple heuristic: domain-knowledge baseline
Linear model: classical baseline
Boosting (XGBoost): strong baseline for tabular

If complex models don't beat boosting, you don't need them.

Hyperparameter tuning

Once you've chosen an algorithm, find good hyperparameters.

Grid search

Try all combinations of a discrete grid. Exhaustive, but the number of runs multiplies with every parameter you add.
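
A minimal sketch of grid search using only the standard library. The function validation_error is a hypothetical stand-in for "train a model with these settings and return its validation error"; the parameter names and values are illustrative assumptions.

```python
from itertools import product


def validation_error(learning_rate, max_depth):
    # Hypothetical stand-in for training a model and measuring validation error.
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 4) ** 2


grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [2, 4, 8],
}

# Every combination is evaluated: 3 x 3 = 9 runs here.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
best_config = min(configs, key=lambda cfg: validation_error(**cfg))

print(len(configs), best_config)
```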

Random search

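Sample randomly from parameter ranges. Often better than grid search in high-dimensional search spaces: usually only a few hyperparameters matter, and every random trial tests a new value of each one.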
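
A minimal sketch of random search under the same kind of assumption: validation_error is a hypothetical stand-in for a real training run, and the ranges are illustrative. The learning rate is sampled log-uniformly, a common choice for scale-like parameters.

```python
import math
import random


def validation_error(learning_rate, max_depth):
    # Hypothetical stand-in for training a model and measuring validation error.
    return (math.log10(learning_rate) + 1) ** 2 + 0.01 * (max_depth - 4) ** 2


def sample_config(rng):
    return {
        # Log-uniform over [1e-3, 1]: each order of magnitude is equally likely.
        "learning_rate": 10 ** rng.uniform(-3, 0),
        "max_depth": rng.randint(2, 10),
    }


rng = random.Random(0)  # fixed seed so the search is reproducible
trials = [sample_config(rng) for _ in range(30)]
best_config = min(trials, key=lambda cfg: validation_error(**cfg))

print(best_config)
```

The trial count is set directly, so the budget does not grow when another parameter is added to the search.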

Bayesian optimization

Use a probabilistic model of the objective to choose the next parameters to try. Usually needs fewer trials than grid or random search, which matters when each trial is expensive.

Tools: Optuna, scikit-optimize, Hyperopt.

Population-based / evolutionary

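Maintain a population of configurations and evolve it: keep the best performers, mutate or recombine them, and discard the rest.

Used for neural architecture search and, in population-based training, for adapting hyperparameters during training.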

Hyperband / ASHA

Allocate budget adaptively: start many configurations on a small budget, stop the poor ones early, and give the survivors more.

Effective for deep learning where each run is expensive.
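
The core idea can be sketched with successive halving, the building block that Hyperband and ASHA extend. This is a simplified illustration, not either algorithm in full: toy_validation_loss is a hypothetical stand-in for "train this configuration for `budget` epochs", and the reduction factor eta=2 is an assumption.

```python
import random


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round; give survivors eta times more budget."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score each surviving config at the current budget (lower is better).
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_validation_loss(learning_rate, budget):
    # Hypothetical loss curve: improves with budget toward a floor set by the learning rate.
    return (learning_rate - 0.1) ** 2 + 1.0 / budget


rng = random.Random(0)
candidates = [rng.uniform(0.0, 0.5) for _ in range(16)]
best = successive_halving(candidates, toy_validation_loss)

print(round(best, 3))
```

In this toy loss the ranking is the same at every budget. Real learning curves can cross, which is why Hyperband runs several such brackets with different starting budgets.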

Bias-variance tradeoff

Two sources of error:

Bias: model can't represent the true relationship. Underfitting.

Variance: model is sensitive to training data. Overfitting.

Expected squared error = bias² + variance + irreducible noise.
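
Written out for squared-error loss, the decomposition reads as below. The notation is an assumption introduced here: the data follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the model fit on a random training set, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```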

Symptoms:

High training error (test error is then high as well) → high bias (underfit) → use a more flexible model or better features
Low training error, high test error → high variance (overfit) → regularize, add data, or simplify the model
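
The symptom table can be turned into a rough rule-of-thumb check. The function name, the target_error argument, and the gap_tolerance default are all assumptions for illustration; sensible thresholds depend on the problem.

```python
def diagnose(train_error, test_error, target_error, gap_tolerance=0.05):
    """Rough bias/variance diagnosis from training and test error."""
    if train_error > target_error:
        # The model cannot even fit the training data well enough.
        return "high bias: try a more flexible model or better features"
    if test_error - train_error > gap_tolerance:
        # Fits the training data but does not generalize.
        return "high variance: regularize, add data, or simplify the model"
    return "ok: training and test error are both acceptable"


print(diagnose(train_error=0.30, test_error=0.32, target_error=0.10))
print(diagnose(train_error=0.02, test_error=0.25, target_error=0.10))
```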

Most modern models have enough flexibility to fit the training data perfectly, so variance control becomes the main concern.

Regularization

Reduce variance:

L1 / L2 penalty

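Add a weight-magnitude term to the loss. L1 (sum of absolute values) pushes some weights to exactly zero; L2 (sum of squares) shrinks all weights toward zero. Smaller weights → simpler model.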
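
A small worked example of the shrinkage effect, assuming the simplest possible case: one weight, no intercept, squared-error loss with an L2 penalty. The data values are made up for illustration.

```python
def ridge_slope(xs, ys, l2_penalty):
    """Minimize sum((y - w*x)^2) + l2_penalty * w^2 over a single weight w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + l2_penalty).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2_penalty)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

slopes = [ridge_slope(xs, ys, penalty) for penalty in [0.0, 1.0, 10.0, 100.0]]
for penalty, slope in zip([0.0, 1.0, 10.0, 100.0], slopes):
    print(penalty, round(slope, 3))
```

A larger penalty pulls the fitted weight further toward zero, trading a little bias for lower variance.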

Dropout

Randomly zero activations during training. Forces redundancy.
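
A minimal sketch of the mechanism on a plain list of activations, using the common "inverted dropout" variant, which rescales the kept activations so their expected value is unchanged. The function signature is an assumption for illustration; deep learning frameworks provide their own dropout layers.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each activation with probability drop_prob, rescale the rest."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At inference time the activations pass through unchanged.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
print(dropout([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], drop_prob=0.5, rng=rng))
```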

Early stopping

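Stop training when validation loss has not improved for a set number of epochs (the patience), and keep the checkpoint with the best validation loss. A single uptick is often just noise.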
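
A minimal sketch of early stopping with a patience counter, applied to a recorded list of validation losses. The function name and the loss values are illustrative assumptions; in a real loop the same check runs after each epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop and restore the best checkpoint.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


val_losses = [0.90, 0.70, 0.60, 0.62, 0.58, 0.59, 0.61, 0.60, 0.63]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)  # prints "4 7"
```

The uptick at epoch 3 does not stop training, because the loss improves again at epoch 4 before the patience runs out.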

Data augmentation

Apply label-preserving transformations (crops, flips, added noise) to get more effective training data without more labels.

Ensembling

Multiple models; average predictions. Reduces variance.
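
A small simulation of why averaging reduces variance. It makes an idealized assumption: each model's error is independent, unbiased Gaussian noise. Real models trained on the same data have correlated errors, so the reduction is smaller in practice.

```python
import random
import statistics

rng = random.Random(0)
true_value = 1.0
n_models, n_trials = 10, 2000


def noisy_prediction():
    # Hypothetical model: unbiased prediction with independent Gaussian error.
    return true_value + rng.gauss(0.0, 1.0)


single = [noisy_prediction() for _ in range(n_trials)]
ensemble = [
    statistics.fmean(noisy_prediction() for _ in range(n_models))
    for _ in range(n_trials)
]

single_var = statistics.variance(single)
ensemble_var = statistics.variance(ensemble)
print("single-model variance:", round(single_var, 3))
print("ensemble variance:", round(ensemble_var, 3))
```

With independent errors, the variance of the average of n models is roughly 1/n of a single model's variance.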

Ensembles

Combining models often beats any single model.

Bagging

Train models on bootstrap samples; average their predictions. Random Forest is bagged trees plus a random subset of features considered at each split.

Boosting

Train models sequentially, each correcting the previous. Gradient boosting.

Stacking

Train meta-model on predictions of base models.

Ensembles add training, serving, and maintenance cost. In production, the accuracy gain may not be worth it.

Selection criterion

Choose the metric that matches the business goal:

Accuracy / F1 / AUC: classification quality
MAE / RMSE: regression
NDCG / MRR: ranking
Calibration: probability quality
Latency, cost: production constraints

Often combine: maximize accuracy subject to latency budget.
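
Constrained selection of this kind can be sketched as a filter followed by a maximum. The candidate names and numbers below are hypothetical measurements for illustration only.

```python
candidates = [
    # Hypothetical measurements, not real benchmark results.
    {"name": "linear", "accuracy": 0.86, "latency_ms": 2},
    {"name": "boosting", "accuracy": 0.91, "latency_ms": 15},
    {"name": "deep_ensemble", "accuracy": 0.93, "latency_ms": 120},
]


def select(candidates, latency_budget_ms):
    """Return the most accurate candidate within the latency budget, or None."""
    feasible = [c for c in candidates if c["latency_ms"] <= latency_budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["accuracy"])


print(select(candidates, latency_budget_ms=50))
```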

Multi-objective selection

Real model selection has multiple criteria:

Accuracy
Latency
Cost
Interpretability
Fairness
Memory

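Pareto front: the models that no other model dominates. One model dominates another if it is at least as good on every criterion and strictly better on at least one. Choose from the front based on priorities.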
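
A minimal sketch of computing a Pareto front over two criteria, higher accuracy and lower latency. The model names and numbers are hypothetical; more criteria would be added to the dominance check in the same way.

```python
def dominates(a, b):
    """a dominates b if it is at least as good on both criteria and strictly better on one."""
    at_least_as_good = (
        a["accuracy"] >= b["accuracy"] and a["latency_ms"] <= b["latency_ms"]
    )
    strictly_better = (
        a["accuracy"] > b["accuracy"] or a["latency_ms"] < b["latency_ms"]
    )
    return at_least_as_good and strictly_better


def pareto_front(models):
    return [m for m in models if not any(dominates(other, m) for other in models)]


models = [
    # Hypothetical measurements, not real benchmark results.
    {"name": "linear", "accuracy": 0.86, "latency_ms": 2},
    {"name": "small_tree", "accuracy": 0.84, "latency_ms": 3},  # dominated by linear
    {"name": "boosting", "accuracy": 0.91, "latency_ms": 15},
    {"name": "deep_net", "accuracy": 0.90, "latency_ms": 80},  # dominated by boosting
    {"name": "deep_ensemble", "accuracy": 0.93, "latency_ms": 120},
]

front = [m["name"] for m in pareto_front(models)]
print(front)  # prints "['linear', 'boosting', 'deep_ensemble']"
```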

Cross-validation for selection

Cross-validate each candidate. Choose by CV performance.

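Pitfall: with many candidates, the best CV score is optimistically biased, because the maximum over many noisy scores partly reflects luck. Set aside a held-out test set before selection and evaluate the chosen model on it once.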
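
The workflow can be sketched end to end with the standard library. Everything here is an illustrative assumption: the data are synthetic, the candidates are k-nearest-neighbor regressors that differ only in k, and the helper names are made up. The point is the ordering: split off the test set first, select by CV on the rest, then evaluate the winner once.

```python
import random
import statistics


def k_fold_indices(n, n_folds, rng):
    """Shuffle indices once and deal them into n_folds roughly equal folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return statistics.fmean(y for _, y in nearest)


def mse(train, evaluation, k):
    return statistics.fmean((y - knn_predict(train, x, k)) ** 2 for x, y in evaluation)


def cv_score(data, k, folds):
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [p for i, p in enumerate(data) if i not in held_out]
        validation = [data[i] for i in fold]
        scores.append(mse(train, validation, k))
    return statistics.fmean(scores)


rng = random.Random(0)
points = []
for _ in range(120):
    x = rng.uniform(-1.0, 1.0)
    points.append((x, x * x + rng.gauss(0.0, 0.2)))  # synthetic 1-D regression data

# Set the test set aside before any selection happens.
dev, test = points[:90], points[90:]
folds = k_fold_indices(len(dev), n_folds=5, rng=rng)  # same folds for every candidate

cv_scores = {k: cv_score(dev, k, folds) for k in [1, 3, 5, 10, 20]}
best_k = min(cv_scores, key=cv_scores.get)

# Touch the test set once, only for the selected candidate.
test_mse = mse(dev, test, best_k)
print(best_k, round(cv_scores[best_k], 4), round(test_mse, 4))
```

The test-set figure, not the winning CV score, is the number to report.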

When to stop

Diminishing returns: each new model gives less improvement.

Set a budget (time, compute). When you hit it, ship the best so far.

Perfect is the enemy of deployed.

Common failure patterns

Skipping baselines

Going straight to deep learning when XGBoost would suffice.

Tuning on test set

Hyperparameters tuned on test data → optimistic estimates.

Single split

One train/val/test split gives a high-variance estimate, especially on small datasets. Use CV.

Comparing on different data

Different splits, different preprocessing → invalid comparison.

Optimizing for the wrong metric

Optimizing validation accuracy when the business cares about precision.

Ignoring production constraints

The best CV model is useless if it doesn't meet the latency budget.

Over-tuning

Each tuning round yields less, and the model ends up overfit to the validation set.

A practical recipe

Define the metric and constraints
Establish baselines (majority class, simple heuristic, linear model)
Try 2-3 strong candidates with default parameters
Pick the best; tune hyperparameters
Evaluate once on the held-out test set
Profile production characteristics
Ship

Resist over-engineering. Most ML wins come from data and feature work, not model selection.