Bayesian Hyperparameter Tuning

Tuning hyperparameters (like learning rate, dropout, or layer depth) is fundamentally a non-convex, derivative-free optimization problem. Evaluating the "loss function" requires training the entire model, which is computationally expensive. Bayesian Optimization solves this by building a probabilistic surrogate model of the objective function.

1. The Failure of Grid and Random Search

Grid Search: Suffers from the "Curse of Dimensionality." It is exponentially expensive and often wastes time searching unpromising areas of the parameter space.
Random Search: Better than Grid Search because it explores more unique values per dimension, but it is "memoryless"—it does not learn from past evaluations.

2. The Bayesian Approach

Bayesian optimization treats hyperparameter tuning as a sequence of decisions driven by Bayesian Inference.

A. The Surrogate Model

Instead of evaluating the true objective function $f(x)$ blindly, the algorithm builds a surrogate model (a probabilistic approximation). The most common surrogate is a Gaussian Process (GP), which provides not just a prediction for the loss at point $x$ , but a confidence interval (uncertainty).

B. The Acquisition Function

The algorithm uses an Acquisition Function (like Expected Improvement (EI) or Upper Confidence Bound (UCB)) to decide where to sample next. This function explicitly balances:

Exploitation: Sampling where the surrogate model predicts the loss will be lowest.
Exploration: Sampling where the surrogate model has the highest uncertainty.

3. Tree-structured Parzen Estimators (TPE)

While Gaussian Processes work well for continuous variables, they struggle with categorical or conditional hyperparameters (e.g., "If optimizer=Adam, then tune beta1; else..."). Modern frameworks like Optuna and Hyperopt use TPE. Instead of modeling $P(y|x)$ (probability of loss given parameters), TPE models $P(x|y)$ and $P(y)$ . It divides past trials into "good" and "bad" groups and builds two separate distributions, sampling new points from the "good" distribution.

See Also: