Tree-Based Models

For tabular data, tree-based models, particularly gradient boosting, are usually the right answer. Modern gradient boosting libraries (XGBoost, LightGBM, CatBoost) regularly win Kaggle competitions, power production systems, and outperform deep learning on most tabular tasks.

Knowing when and how to use tree-based models is essential ML knowledge.

Decision trees

A decision tree recursively splits the data on feature values and predicts an outcome at each leaf.

A tree is a series of if-else questions:

"Is age < 30?"
- Yes: "Is income > 50K?"
  - Yes: predict class A
  - No: predict class B
- No: ...

Splits are chosen to maximize information gain (or minimize impurity).
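
As a rough illustration of how a split is scored, the sketch below computes Gini impurity and the impurity decrease for a threshold on a single numeric feature. The function names and the toy data are made up for this example; real libraries use faster, histogram-based or sorted-scan versions of the same idea.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(values, labels, threshold):
    """Impurity decrease from splitting on `value < threshold`."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, gain)."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(((t, split_gain(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])

# Toy data (hypothetical): age perfectly separates the two classes.
ages = [22, 25, 28, 35, 41, 52]
classes = ["A", "A", "A", "B", "B", "B"]
threshold, gain = best_threshold(ages, classes)
```

On this toy data the best threshold falls between 28 and 35, and the gain equals the parent impurity because both children are pure.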

Pros

Highly interpretable
Handle mixed data types
No feature scaling needed
Capture non-linear relationships
Handle missing values natively in many implementations

Cons

Single trees overfit
Poor extrapolation
Sensitive to small data changes

Random forests

Many trees trained on bootstrap samples; predictions averaged.

Each tree:

Trained on random subset of data
Considers random subset of features at each split

Result: lower variance than a single tree, at roughly the same bias.
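
The two sources of randomness can be sketched in a few lines. This is not a full random forest, only the sampling step; the helper names, the row and feature counts, and the choice of 4 candidate features (the square root of 16, a common heuristic for classification) are assumptions made for illustration.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Sample n_rows indices with replacement (one tree's training set)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate features for one split."""
    return rng.sample(range(n_features), k)

rng = random.Random(0)
n_rows, n_features = 1000, 16

in_bag = bootstrap_indices(n_rows, rng)
# Rows never drawn are "out of bag" for this tree and can serve as
# held-out data when estimating its error.
out_of_bag = set(range(n_rows)) - set(in_bag)
oob_fraction = len(out_of_bag) / n_rows

candidates = random_feature_subset(n_features, k=4, rng=rng)
```

With sampling with replacement, roughly 37% of rows are left out of each bootstrap sample, which is what makes out-of-bag error estimation possible.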

Pros

Strong out-of-the-box performance
Hard to overfit by adding more trees, though very deep trees can still fit noise
Few hyperparameters
Parallelizable training
Out-of-bag error estimation (free CV)

Cons

Less interpretable than single tree
Larger models
Slower inference than single tree

Random forests are an excellent baseline.

Gradient boosting

Build trees sequentially, each correcting errors of the previous.

Mathematical framing: each tree fits the negative gradient of the loss with respect to the current predictions. For squared error, that negative gradient is simply the residual.
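
To make the sequential idea concrete, here is a minimal sketch of boosting with squared error on one feature, using depth-1 trees (stumps) as the weak learner. For squared error the negative gradient of the loss is the residual, so each stump is fit to the current residuals. The helper names and toy data are hypothetical, and this omits everything that makes the real libraries fast or regularized.

```python
def fit_stump(xs, targets):
    """Best single-threshold split on one feature under squared error."""
    best = None
    distinct = sorted(set(xs))
    for a, b in zip(distinct, distinct[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x < t else rmean

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit stumps one after another; return (predict, training MSE per round)."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_trees):
        # Negative gradient of 0.5 * (y - p)^2 with respect to p is (y - p).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history

# Toy regression problem (hypothetical): y = x^2 on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
predict, mse_history = boost(xs, ys)
```

The training error in `mse_history` shrinks round by round; whether the validation error keeps shrinking is a separate question, which is why early stopping matters.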

XGBoost

The original popularizer. Highly optimized; many features.

LightGBM

Microsoft's implementation. Often faster than XGBoost; comparable quality.

Uses leaf-wise growth (vs depth-wise).

CatBoost

Yandex's implementation. Native categorical handling, ordered boosting.

Often best for categorical-heavy data.

Practical defaults

LightGBM: usually fastest, strong default
XGBoost: well-known, mature, GPU support
CatBoost: when many categorical features

All three give similar quality with reasonable tuning.

Why tree-based wins on tabular

Reasons deep learning struggles on tabular:

Tabular data is heterogeneous: mix of numeric, categorical, ordinal, with different scales. Trees handle naturally.
Non-smooth target functions: real-world rules are step-like ("if age > 65 AND income < X then..."). Trees represent these directly.
Less data per task: deep learning needs lots of data. Tabular tasks often have 10K-1M rows, where trees excel.
Engineered features matter: domain knowledge encoded as features works well with trees.
Robustness to outliers: splits depend only on the ordering of feature values, so extreme inputs affect trees far less than they affect neural networks.

This is why financial, healthcare, and e-commerce ML systems run on gradient boosting.

Hyperparameters

Common tunables:

Tree-level

Max depth: typically 3-8. Deeper = more capacity, more overfitting risk.
Min samples per leaf: regularization
Number of leaves (LightGBM): more direct control

Boosting

Number of trees: 100-10000. Use early stopping.
Learning rate: 0.01-0.3. Lower values usually generalize better but need more trees.
Subsample: row sampling per tree (regularization)
Column subsample: feature sampling per tree

Regularization

L1, L2 penalties: on leaf weights
Min gain to split: don't split if gain too small

Default starting point

1000 trees with early stopping on validation
Learning rate 0.05
Max depth 6
Subsample 0.8
Column subsample 0.8

Tune from there.
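
Written as a configuration, the starting point above might look like the following. The keys follow the parameter names of LightGBM's scikit-learn style estimators as far as I know them; check them against the version you use. The `num_leaves` value and `subsample_freq` setting are additions of this sketch, not part of the list above.

```python
# Hypothetical starting configuration; verify names against your LightGBM version.
lgbm_params = {
    "n_estimators": 1000,     # upper bound; early stopping picks the real count
    "learning_rate": 0.05,
    "max_depth": 6,
    "num_leaves": 63,         # assumption: kept below 2**max_depth = 64
    "subsample": 0.8,         # row sampling
    "subsample_freq": 1,      # LightGBM applies row sampling only when this is > 0
    "colsample_bytree": 0.8,  # column sampling per tree
}
```

XGBoost and CatBoost expose the same ideas under partly different names, so treat the dictionary as a template rather than something portable across libraries.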

Categorical features

One-hot encoding

Convert categories to binary columns. Standard.

Ordinal encoding

Map categories to integers. Use only when order is meaningful.
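
The two encodings above can be sketched without any library. The function names and example categories are made up; in practice you would usually reach for an existing encoder, but the logic is the same.

```python
def one_hot(values):
    """Return (column_names, rows) with one binary column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def ordinal(values, order):
    """Map categories to integers using an explicit, meaningful order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

colors = ["red", "green", "red", "blue"]
columns, encoded = one_hot(colors)
# columns == ["blue", "green", "red"]; encoded[0] == [0, 0, 1]

sizes = ["small", "large", "medium"]
size_codes = ordinal(sizes, order=["small", "medium", "large"])
# size_codes == [0, 2, 1]
```

Note that `ordinal` takes the order as an argument instead of inferring it: sorting the strings alphabetically would put "large" before "small", which is exactly the kind of meaningless order to avoid.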

Target encoding

Replace category with mean target value. Risk of leakage; needs cross-validation.
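
One common way to limit the leakage is out-of-fold encoding: each row is encoded with category means computed on the other folds only. The sketch below assumes the rows are already shuffled (folds are assigned by index modulo `n_folds`) and leaves out smoothing, which real implementations usually add. All names are hypothetical.

```python
from collections import defaultdict

def target_encode_oof(categories, targets, n_folds=5):
    """Out-of-fold mean target encoding.

    Each row is encoded using category means computed on the other folds,
    so a row's own target never contributes to its encoding.
    """
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        for i in range(n):
            if i % n_folds != fold:          # fit on the other folds
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        for i in range(fold, n, n_folds):    # encode the held-out fold
            c = categories[i]
            # Fall back to the global mean for categories unseen in training folds.
            encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded

cities = ["a", "a", "b", "b", "a", "b"]
clicked = [1, 0, 1, 1, 1, 0]
encoded = target_encode_oof(cities, clicked, n_folds=3)
# encoded == [0.5, 1.0, 1.0, 0.5, 1.0, 1.0]
```

The noisy values on such a tiny dataset show why smoothing toward the global mean is usually added for rare categories.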

Native categorical

CatBoost and LightGBM handle categoricals natively. This often outperforms manual encoding.

Feature importance

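Tree ensembles support several feature importance measures:

Frequency: how often a feature is used in splits
Gain: average improvement in the split criterion when the feature is used
Permutation importance: change in performance when the feature's values are shuffled (model-agnostic, not specific to trees)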

Useful for:

Feature selection
Model debugging
Stakeholder communication

Caveat: importance scores can be misleading with correlated features.
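
Permutation importance is simple enough to sketch directly. The version below measures the mean drop in accuracy after shuffling one column; the `model` here is a hypothetical stand-in (a plain function that only looks at feature 0), not a trained tree ensemble.

```python
import random

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, rng, n_repeats=10):
    """Mean drop in accuracy after shuffling one feature column."""
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:feature] + [v] + r[feature + 1:]
                    for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / n_repeats

# Hypothetical stand-in for a trained model: it only looks at feature 0.
model = lambda row: int(row[0] > 0.5)

rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

importance_used = permutation_importance(model, rows, labels, 0, rng)
importance_unused = permutation_importance(model, rows, labels, 1, rng)
```

The unused feature scores exactly zero, while the used one loses a large share of accuracy. With two strongly correlated features a real model can split its reliance between them, which is the situation the caveat above warns about.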

SHAP values

Per-prediction feature attribution. Tells you why a specific prediction was made.

For tree-based models, SHAP can be computed exactly and efficiently.

Use for:

Model interpretation
Regulatory compliance
Debugging surprising predictions
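
To show what a per-prediction attribution means, the sketch below computes exact Shapley values by brute force, averaging each feature's marginal contribution over every feature ordering. This is not the tree-specific algorithm the SHAP library uses, and it replaces the usual background dataset with a single background row; the toy `predict` function is hypothetical.

```python
from itertools import permutations

def shapley_values(predict, x, background):
    """Exact Shapley values by averaging marginal contributions over all
    feature orderings. Features not yet revealed keep the background value."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(background)
        previous = predict(current)
        for j in order:
            current[j] = x[j]             # reveal feature j
            value = predict(current)
            phi[j] += value - previous    # marginal contribution of j
            previous = value
    return [p / len(orderings) for p in phi]

# Hypothetical toy model with an interaction between features 0 and 1;
# feature 2 (tenure) has no effect.
def predict(row):
    age, income, tenure = row
    return 2.0 * age + (3.0 if age > 0 and income > 0 else 0.0) + 0.0 * tenure

x = [1.0, 1.0, 5.0]
background = [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, background)
# phi == [3.5, 1.5, 0.0]; the values sum to predict(x) - predict(background)
```

The interaction term of 3.0 is split evenly between the two features involved, and the attributions add up to the difference between the prediction and the baseline. Brute force scales factorially with the number of features, which is why the efficient tree-specific computation matters.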

When to use deep learning instead

Deep learning may beat trees when:

Very large datasets (10M+ rows)
High-dimensional features (text, images)
Strong feature interactions hard to engineer
Sequential structure (time series and other ordered data, where sequence models apply)
Transfer learning matters

For most tabular ML in industry: trees are still the right answer.

Common workflows

Kaggle / competition

Quick LightGBM baseline
Feature engineering
Hyperparameter tuning
Stacking/blending multiple models

Production

Train LightGBM with reasonable defaults
Establish baseline
Iterate on features
Tune hyperparameters once
Deploy
Monitor and retrain

Common failure patterns

Overfitting on small validation sets

Tuning against a single random split overfits to that split. Use cross-validation (CV).
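
A minimal k-fold splitter, as a sketch of what cross-validation does under the hood. `fit` and `score` are hypothetical callables supplied by the caller; for time-ordered or grouped data a plain shuffled split like this one would itself leak information.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices once, then yield (train, validation) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def cross_val_score(fit, score, rows, labels, k=5):
    """Average a validation score over k folds.

    `fit(rows, labels)` returns a model; `score(model, rows, labels)` returns
    a number. Both are hypothetical callables supplied by the caller.
    """
    scores = []
    for train, val in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        scores.append(score(model,
                            [rows[i] for i in val],
                            [labels[i] for i in val]))
    return sum(scores) / k

splits = list(k_fold_indices(10, k=5))
```

Every row appears in exactly one validation fold, so a hyperparameter setting has to do well on all of the data rather than on one lucky split.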

Target leakage

Features that wouldn't be available at prediction time leak the target into training. Common with engineered features.

Insufficient data for the chosen depth

Deep trees on small data overfit. Reduce depth or get more data.

Wrong objective

Using a regression objective when the task is classification. Subtle, but it happens.

Ignoring class imbalance

Majority class dominates. Use class weights, scale_pos_weight, or focal loss variants.
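
Both weighting options reduce to simple arithmetic on the label counts. The sketch below computes a negatives-to-positives ratio, a common starting value for `scale_pos_weight`, and per-row weights that give each class the same total weight. The function names and the 95/5 toy labels are made up for illustration.

```python
def scale_pos_weight(labels):
    """Ratio of negatives to positives, a common starting value."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return negatives / positives

def balanced_sample_weights(labels):
    """Per-row weights so each class contributes equally in total."""
    counts = {c: labels.count(c) for c in set(labels)}
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]

labels = [0] * 95 + [1] * 5
ratio = scale_pos_weight(labels)          # 95 / 5 = 19.0
weights = balanced_sample_weights(labels)
```

Reweighting changes the predicted probabilities as well as the decision boundary, so recalibrate or re-pick thresholds on validation data afterwards.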

Not using early stopping

Without early stopping, you have to tune the number of trees manually.
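
The stopping rule itself is small. This sketch takes a list of per-round validation losses and returns the round to keep, stopping once `patience` rounds pass without improvement. The function name and the loss values are invented for the example; the boosting libraries implement the same rule internally.

```python
def best_iteration(validation_losses, patience=50):
    """Return the 1-based round with the lowest validation loss seen before
    `patience` consecutive rounds pass without improvement."""
    best_loss, best_round = float("inf"), 0
    for round_number, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_number
        elif round_number - best_round >= patience:
            break
    return best_round

losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.44, 0.49, 0.50, 0.51]
stopped_early = best_iteration(losses, patience=2)   # 4
stopped_later = best_iteration(losses, patience=3)   # 7
```

With a patience of 2 the loop gives up before the later improvement at round 7, which is why patience is usually set generously when the learning rate is low.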

Inconsistent training/inference preprocessing

Separate feature engineering pipelines for training and inference drift apart. Use a shared pipeline.

Practical advice

Start with LightGBM defaults
Get a baseline working end-to-end before tuning
Focus on feature engineering more than hyperparameters
Use early stopping
Use CV for tuning
Compare to simpler baselines (logistic regression)
SHAP for understanding predictions

Tree-based models are reliable, fast, and accurate. They should be the default for tabular data unless proven otherwise.
