Model Selection

You have a problem and many possible models. Which to choose?

Model selection has two dimensions:

1. Choosing the algorithm/architecture

2. Choosing hyperparameters within an algorithm

Both involve evaluation and tradeoffs.

Choosing the algorithm

Match to problem characteristics

- **Tabular data, mixed features**: gradient boosting (XGBoost, LightGBM, CatBoost) is usually best

- **Lots of structured data, simple features**: linear models surprisingly competitive

- **Images**: CNNs or vision transformers

- **Text**: transformer-based models

- **Time-series**: depends on data — gradient boosting often beats specialized models on shorter series; LSTM/transformer for longer

- **Tabular with very few examples (~100s)**: linear models, gradient boosting with regularization

Match to constraints

- **Latency budget**: simpler models faster

- **Memory budget**: small models or quantization

- **Interpretability requirement**: linear models, decision trees, GAMs

- **Lots of labeled data**: deep learning

- **Limited labeled data**: pretrained models, simpler architectures

- **Online learning required**: some algorithms support, others don't

Match to team

- Skills you have

- Tooling already in place

- Maintenance burden

A model nobody can debug is a liability.

Establishing baselines

Before complex models, establish baselines:

1. **Random / majority class**: lowest bar

2. **Simple heuristic**: domain-knowledge baseline

3. **Linear model**: classical baseline

4. **Boosting (XGBoost)**: strong baseline for tabular

If complex models don't beat boosting, you don't need them.

Hyperparameter tuning

Once you've chosen an algorithm, find good hyperparameters.

Grid search

Try all combinations of a discrete grid. Exhaustive; expensive for many parameters.

Random search

Sample randomly from parameter ranges. Often better than grid for high-dim search spaces.

Bayesian optimization

Use a probabilistic model of the objective to choose next parameters. More efficient.

Tools: Optuna, scikit-optimize, Hyperopt.

Population-based / evolutionary

Maintain a population of configurations; evolve.

Used for neural architecture search.

Hyperband / ASHA

Allocate budget adaptively. Stop bad runs early.

Effective for deep learning where each run is expensive.

Bias-variance tradeoff

Two sources of error:

**Bias**: model can't represent the true relationship. Underfitting.

**Variance**: model is sensitive to training data. Overfitting.

Total error = bias² + variance + irreducible noise.

Symptoms:

- High training error → high bias (underfit) → use bigger model

- Low training error, high test error → high variance (overfit) → regularize, more data, simpler model

- High training and test error → high bias

Most modern models have flexibility to fit training data perfectly. Variance control becomes the main concern.

Regularization

Reduce variance:

L1 / L2 penalty

Add weight magnitude term to loss. Smaller weights → simpler model.

Dropout

Randomly zero activations during training. Forces redundancy.

Early stopping

Stop training when validation loss starts increasing.

Data augmentation

More effective data without more labels.

Ensembling

Multiple models; average predictions. Reduces variance.

Ensembles

Combining models often beats any single model.

Bagging

Train models on bootstrap samples; average. Random Forest is bagged trees.

Boosting

Train models sequentially, each correcting the previous. Gradient boosting.

Stacking

Train meta-model on predictions of base models.

Ensembles add cost. For production, may not be worth it.

Selection criterion

Choose the metric that matches the business goal:

- Accuracy / F1 / AUC: classification quality

- MAE / RMSE: regression

- NDCG / MRR: ranking

- Calibration: probability quality

- Latency, cost: production constraints

Often combine: maximize accuracy subject to latency budget.

Multi-objective selection

Real model selection has multiple criteria:

- Accuracy

- Latency

- Cost

- Interpretability

- Fairness

- Memory

Pareto front: models not strictly dominated. Choose from the front based on priorities.

Cross-validation for selection

Cross-validate each candidate. Choose by CV performance.

Pitfall: with many candidates, the best CV score is optimistic. Use a held-out test set after selection.

When to stop

Diminishing returns: each new model gives less improvement.

Set a budget (time, compute). When you hit it, ship the best so far.

Perfect is the enemy of deployed.

Common failure patterns

Skipping baselines

Going straight to deep learning when XGBoost would suffice.

Tuning on test set

Hyperparameters tuned on test data → optimistic estimates.

Single split

One train/val/test split is high variance. Use CV.

Comparing on different data

Different splits, different preprocessing → invalid comparison.

Optimizing for the wrong metric

Validation accuracy when business cares about precision.

Ignoring production constraints

Best CV model that doesn't meet latency budget is useless.

Over-tuning

Diminishing returns; model overfit to validation set.

A practical recipe

1. Define the metric and constraints

2. Establish baselines (simple model)

3. Try 2-3 strong candidates with default parameters

4. Pick the best; tune hyperparameters

5. Evaluate on test set

6. Profile production characteristics

7. Ship

Resist over-engineering. Most ML wins come from data and feature work, not model selection.

Further Reading

- [CrossValidationAndModelEvaluation](CrossValidationAndModelEvaluation) — Evaluation methods

- [ModelSelectionEfficiency](ModelSelectionEfficiency) — Efficiency considerations

- [TreeBasedModels](TreeBasedModels) — A common strong baseline

- [ML Hub](MLHub) — Cluster index