Tree-Based Models
For tabular data, tree-based models, particularly gradient boosting, are usually the right answer. Modern gradient boosting libraries (XGBoost, LightGBM, CatBoost) regularly win Kaggle competitions, power many production systems, and match or outperform deep learning on most tabular tasks.
Knowing when and how to use tree-based models is essential ML knowledge.
Decision trees
A decision tree recursively splits the data on feature values until each leaf can predict an outcome.
A tree is a series of if-else questions:
- "Is age < 30?"
  - Yes: "Is income > 50K?"
    - Yes: predict class A
    - No: predict class B
  - No: ...
At each node, the split is chosen greedily to maximize information gain, which is the same as minimizing the weighted impurity (Gini or entropy) of the child nodes.
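A minimal sketch of how one split could be chosen, assuming Gini impurity and a single numeric feature. The function names and the toy data are illustrative and not taken from any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try each midpoint between sorted distinct values and return the
    (threshold, weighted_impurity) pair with the lowest impurity."""
    n = len(values)
    distinct = sorted(set(values))
    best = (None, gini(labels))  # no split: impurity of the parent node
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Toy data: the classes separate cleanly between age 28 and age 35.
ages = [22, 25, 28, 35, 41, 52]
classes = ["A", "A", "A", "B", "B", "B"]
threshold, impurity = best_threshold(ages, classes)
```

A full tree repeats this search over every feature at every node, then recurses into the two child nodes.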
Pros
- Highly interpretable
- Handle mixed data types
- No feature scaling needed
- Capture non-linear relationships
- Handle missing values natively in many implementations
Cons
- Single trees overfit
- Poor extrapolation
- Sensitive to small data changes
Random forests
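Many trees, each trained on a bootstrap sample, with their predictions averaged (regression) or voted (classification).
Each tree:
- Is trained on a bootstrap sample of the rows (drawn with replacement)
- Considers a random subset of features at each split
Result: much lower variance than a single tree at roughly the same bias.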
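The two sources of randomness can be sketched in a few lines. This is an illustration only: the helper names are made up, and the square-root subset size is a common convention for classification, assumed here rather than stated above.

```python
import math
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) candidate features for one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
n_rows = 10_000

sample = bootstrap_indices(n_rows, rng)
in_bag = set(sample)

# Rows never drawn are "out of bag" for this tree and can be used to
# estimate its error without a separate validation set.
out_of_bag = [i for i in range(n_rows) if i not in in_bag]
oob_fraction = len(out_of_bag) / n_rows  # near (1 - 1/n)^n, about 0.368

candidate_features = random_feature_subset(16, rng)
```

Because roughly a third of the rows are left out of each tree, every row ends up out of bag for many trees, which is what makes the out-of-bag error estimate possible.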
Pros
- Strong out-of-the-box performance
- Hard to overfit: adding more trees does not increase overfitting
- Few hyperparameters
- Parallelizable training
- Out-of-bag error estimation (a validation estimate at almost no extra cost)
Cons
- Less interpretable than single tree
- Larger models
- Slower inference than single tree
Random forests are an excellent baseline.
Gradient boosting
Trees are built sequentially, each one correcting the errors of the ensemble so far.
Mathematical framing: each tree fits the negative gradient of the loss with respect to the current predictions. For squared error, that is simply the residual.
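A toy sketch of the boosting loop, assuming squared-error loss (so the negative gradient is the residual), one numeric feature, and depth-1 trees (stumps). It illustrates the idea only and is not how XGBoost, LightGBM, or CatBoost are implemented.

```python
def fit_stump(xs, residuals):
    """Find the threshold that minimizes squared error when each side
    predicts the mean of its residuals. Assumes at least two distinct xs."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals
    and add a shrunken copy of it to the ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data with a step between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
model_mse = mse(model, xs, ys)
```

The learning rate shrinks each tree's contribution, which is why a lower rate needs more trees to reach the same training fit.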
XGBoost
The library that popularized gradient boosting. Highly optimized, with extensive functionality.
LightGBM
Microsoft's implementation. Often faster than XGBoost; comparable quality.
Grows trees leaf-wise (always splitting the leaf with the largest gain) rather than depth-wise (level by level).
CatBoost
Yandex's implementation. Native categorical handling, ordered boosting.
Often best for categorical-heavy data.
Practical defaults
- LightGBM: usually fastest, strong default
- XGBoost: well-known, mature, GPU support
- CatBoost: when many categorical features
All three give similar quality with reasonable tuning.
Why tree-based wins on tabular
Reasons deep learning struggles on tabular data:
1. **Tabular data is heterogeneous**: a mix of numeric, categorical, and ordinal features on different scales. Trees handle this naturally.
2. **Non-smooth target functions**: real-world rules are step-like ("if age > 65 AND income < X then..."). Trees represent these directly.
3. **Less data per task**: deep learning needs lots of data. Tabular tasks often have 10K-1M rows, where trees excel.
4. **Engineered features matter**: domain knowledge encoded as features works well with trees.
5. **Robustness to outliers**: tree splits depend only on the ordering of feature values, so extreme inputs affect trees far less than they affect neural networks.
This is why many financial, healthcare, and e-commerce ML systems run on gradient boosting.
Hyperparameters
Common tunables:
Tree-level
- **Max depth**: typically 3-8. Deeper = more capacity, more overfitting risk.
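- **Min samples per leaf**: larger values regularize by preventing tiny leaves
- **Number of leaves** (LightGBM): more direct control over tree complexity than max depth, because trees grow leaf-wise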
Boosting
- **Number of trees**: 100-10000. Use early stopping.
- **Learning rate**: 0.01-0.3. Lower usually generalizes better but needs more trees.
- **Subsample**: row sampling per tree (regularization)
- **Column subsample**: feature sampling per tree
Regularization
- **L1, L2 penalties**: on leaf weights
- **Min gain to split**: don't split if gain too small
Default starting point
- 1000 trees with early stopping on validation
- Learning rate 0.05
- Max depth 6
- Subsample 0.8
- Column subsample 0.8
Tune from there.
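Written as a parameter dictionary, the starting point might look like the sketch below. The keys assume the parameter names used by LightGBM's scikit-learn wrapper (XGBoost's wrapper shares most of them). The `subsample_freq` entry is an addition not listed above: in LightGBM, row subsampling only takes effect when the bagging frequency is non-zero.

```python
# Hypothetical starting configuration; values mirror the defaults listed above.
starting_params = {
    "n_estimators": 1000,     # upper bound; early stopping picks the real count
    "learning_rate": 0.05,
    "max_depth": 6,
    "subsample": 0.8,         # row sampling per tree
    "subsample_freq": 1,      # assumption: resample rows at every iteration (LightGBM)
    "colsample_bytree": 0.8,  # feature sampling per tree
}

def describe(params):
    """Render the configuration as sorted 'key=value' strings for logging."""
    return [f"{key}={value}" for key, value in sorted(params.items())]

summary = describe(starting_params)
```

The early-stopping patience is not specified above, so it is left out here. In LightGBM it is typically supplied at fit time through the `lightgbm.early_stopping` callback together with a validation set.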
Categorical features
One-hot encoding
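Convert each category to its own binary column. Standard, but high-cardinality features create many sparse columns that trees split on inefficiently.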
Ordinal encoding
Map categories to integers. Use only when order is meaningful.
Target encoding
Replace each category with the mean target value for that category. Risk of leakage: compute the encoding out-of-fold (on cross-validation splits), never on the rows being encoded.
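A minimal sketch of leakage-safe target encoding, assuming rows are already shuffled so that assigning folds by row index is acceptable. The function name and toy data are illustrative.

```python
from collections import defaultdict

def out_of_fold_target_encode(categories, targets, n_folds=5):
    """Encode each row with the mean target of its category computed on
    the other folds only, so a row never sees its own label."""
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [global_mean] * n  # fallback for categories unseen out-of-fold
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Accumulate statistics from every row outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        # Encode the rows inside the current fold with those statistics.
        for i in range(n):
            if i % n_folds == fold and counts[categories[i]] > 0:
                encoded[i] = sums[categories[i]] / counts[categories[i]]
    return encoded

cities = ["a", "a", "b", "b", "a", "b"]
labels = [1, 0, 1, 1, 0, 0]
encoded = out_of_fold_target_encode(cities, labels, n_folds=3)
```

Row 0 has label 1 but receives an encoding of 0.0, because the other "a" rows outside its fold both have label 0. That mismatch is the point: the encoding carries no information from the row's own target.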
Native categorical
CatBoost and LightGBM handle categorical features natively, and recent XGBoost versions offer this as well. Native handling often outperforms manual encoding.
Feature importance
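Tree ensembles expose several feature importance measures:
- Frequency: how often a feature is used in splits
- Gain: average loss reduction from the splits that use the feature
- Permutation importance: drop in performance when the feature's values are shuffled (model-agnostic, best computed on held-out data)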
Useful for:
- Feature selection
- Model debugging
- Stakeholder communication
Caveat: importance scores can be misleading with correlated features, because credit is shared or arbitrarily assigned among them.
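Permutation importance is simple enough to sketch without a library. The `predict` function below is a hypothetical stand-in for a trained model, and accuracy is assumed as the metric.

```python
import random

def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, column, rng, n_repeats=10):
    """Average drop in accuracy after shuffling the values of one column."""
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + [v] + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats

# Hypothetical stand-in for a trained model: it only looks at column 0.
def predict(row):
    return int(row[0] > 0.5)

rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [predict(r) for r in rows]

importance_col0 = permutation_importance(predict, rows, labels, 0, rng)
importance_col1 = permutation_importance(predict, rows, labels, 1, rng)
```

Shuffling column 1 changes nothing because the model never reads it, so its importance is exactly zero, while shuffling column 0 destroys most of the accuracy.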
SHAP values
Per-prediction feature attribution: SHAP values tell you how much each feature contributed to a specific prediction.
For tree-based models, SHAP can be computed exactly and efficiently.
Use for:
- Model interpretation
- Regulatory compliance
- Debugging surprising predictions
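To make the attribution idea concrete, the sketch below computes exact Shapley values by brute force over feature orderings. This is not the efficient tree algorithm, and it simplifies by using a single baseline row in place of a background dataset. The hand-written `predict` tree and the feature values are hypothetical.

```python
from itertools import permutations

def shapley_values(predict, row, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every feature ordering. Features not yet revealed keep the
    value from the baseline row."""
    n = len(row)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for feature in order:
            current[feature] = row[feature]
            value = predict(current)
            totals[feature] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]

# Hypothetical model: a tiny hand-written tree on (age, income in thousands).
def predict(features):
    age, income = features
    if age < 30:
        return 0.9 if income > 50 else 0.4
    return 0.1

row = [25, 80]
baseline = [45, 30]
contributions = shapley_values(predict, row, baseline)
```

The contributions always sum to the difference between the prediction for the row and the prediction for the baseline. Brute force scales factorially with the number of features, which is why a dedicated tree algorithm (for example the `shap` package's `TreeExplainer`) is used in practice.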
When to use deep learning instead
Deep learning may beat trees when the problem involves:
- Very large datasets (10M+ rows)
- High-dimensional unstructured inputs (text, images)
- Strong feature interactions that are hard to engineer
- Sequential structure (time series, where RNNs and similar sequence models apply)
- Opportunities for transfer learning
For most tabular ML in industry, trees are still the right answer.
Common workflows
Kaggle / competition
1. Quick LightGBM baseline
2. Feature engineering
3. Hyperparameter tuning
4. Stacking/blending multiple models
Production
1. Train LightGBM with reasonable defaults
2. Establish baseline
3. Iterate on features
4. Tune hyperparameters once
5. Deploy
6. Monitor and retrain
Common failure patterns
Overfitting on small validation sets
Tuning against a single random split overfits the hyperparameters to that split. Use cross-validation.
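A minimal k-fold splitter, as a sketch of what cross-validation does with the rows. The `score_fn` argument is a hypothetical callback that trains on one index list and returns a validation score on the other.

```python
import random

def k_fold_indices(n_rows, n_folds, seed=0):
    """Yield (train_indices, valid_indices) pairs; every row is used for
    validation exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

def cross_validated_score(score_fn, n_rows, n_folds=5):
    """Average a hypothetical score_fn(train_idx, valid_idx) over all folds."""
    scores = [score_fn(train, valid)
              for train, valid in k_fold_indices(n_rows, n_folds)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(n_rows=10, n_folds=5))
```

Comparing hyperparameter settings on the averaged score is less sensitive to the luck of one particular split.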
Target leakage
The model is trained on features that would not be available at prediction time. Common with engineered features.
Insufficient data for the chosen depth
Deep trees on small data overfit. Reduce depth or get more data.
Wrong objective
Using a regression objective when the task is classification (or the reverse). Subtle, but it happens.
Ignoring class imbalance
The majority class dominates the loss. Use class weights, scale_pos_weight, or focal loss variants.
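A common starting value for scale_pos_weight in binary classification is the ratio of negative to positive examples. The helper below is an illustrative sketch, assuming labels are encoded as 0 and 1.

```python
def positive_class_weight(labels):
    """Ratio of negative to positive examples, a common starting value
    for scale_pos_weight."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos

labels = [0] * 950 + [1] * 50
weight = positive_class_weight(labels)  # 950 / 50
```

Treat the ratio as a starting point and validate it: reweighting changes the predicted probabilities, so they may need recalibration afterwards.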
Not using early stopping
Without early stopping, you have to tune the number of trees by hand.
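The stopping rule itself is short. The sketch below applies it to a recorded list of validation losses, one per boosting round. The `patience` name and the loss values are illustrative.

```python
def best_iteration(validation_losses, patience=50):
    """Return the 1-based round with the lowest validation loss, ending
    the scan once `patience` rounds pass without improvement."""
    best_loss = float("inf")
    best_round = 0
    for round_number, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_number
        elif round_number - best_round >= patience:
            break
    return best_round

# Validation loss improves for four rounds, then starts to rise.
losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.52]
stopped_at = best_iteration(losses, patience=3)
```

The boosting libraries apply this rule during training, so the number of trees becomes an upper bound instead of a value to tune.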
Inconsistent training/inference preprocessing
Separate feature engineering code for training and inference drifts apart over time. Use one shared pipeline for both.
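One simple way to share the pipeline is a single feature-building function imported by both the training job and the serving code. The record fields and transformations below are hypothetical.

```python
import math

def build_features(record):
    """Single source of truth for feature engineering; both the training
    job and the serving path call this function."""
    return [
        math.log1p(record["income"]),         # compress the long income tail
        1.0 if record["age"] >= 65 else 0.0,  # retirement-age indicator
    ]

# Training: applied to historical records.
training_rows = [
    build_features(r)
    for r in [{"age": 70, "income": 0.0}, {"age": 30, "income": 50000.0}]
]

# Serving: the same function applied to a single incoming record.
serving_row = build_features({"age": 70, "income": 0.0})
```

Because both paths run the same code, a change to a transformation cannot reach training without also reaching inference.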
Practical advice
1. Start with LightGBM defaults
2. Get a baseline working end-to-end before tuning
3. Focus on feature engineering more than hyperparameters
4. Use early stopping
5. Use CV for tuning
6. Compare to simpler baselines (logistic regression)
7. Use SHAP to understand predictions
Tree-based models are reliable, fast, and accurate. They should be the default for tabular data unless proven otherwise.
Further Reading
- [ModelSelection](ModelSelection) — General selection
- [CrossValidationAndModelEvaluation](CrossValidationAndModelEvaluation) — Evaluation
- [ModelSelectionEfficiency](ModelSelectionEfficiency) — Efficiency tradeoffs
- [ML Hub](MLHub) — Cluster index