Bayesian Hyperparameter Tuning

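Tuning hyperparameters (such as learning rate, dropout rate, or network depth) is a black-box optimization problem: the objective is generally non-convex and usually offers no usable gradients with respect to the hyperparameters. Each evaluation of the objective requires training the model, which is computationally expensive. **Bayesian Optimization** addresses this by building a probabilistic surrogate model of the objective function and using it to choose which configuration to evaluate next, so that fewer full training runs are needed.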

1. The Failure of Grid and Random Search

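* **Grid Search**: Suffers from the "Curse of Dimensionality." The number of evaluations grows exponentially with the number of hyperparameters (for example, 10 values for each of 5 hyperparameters means $10^5$ training runs), and much of that budget is spent in unpromising regions of the parameter space.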

* **Random Search**: Usually better than Grid Search for the same number of trials because it explores more unique values per dimension, but it is "memoryless": it does not learn from past evaluations.
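
The "unique values per dimension" point can be checked with a small sketch. The helper names (`grid_points`, `random_points`, `unique_per_dim`) and the unit-square search space are illustrative assumptions, not part of any library.

```python
import itertools
import random

def grid_points(values_per_dim, n_dims):
    """All combinations of evenly spaced values in [0, 1] per dimension."""
    axis = [i / (values_per_dim - 1) for i in range(values_per_dim)]
    return list(itertools.product(axis, repeat=n_dims))

def random_points(n_points, n_dims, seed=0):
    """Uniform random points in the unit hypercube (seeded for repeatability)."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(n_dims)) for _ in range(n_points)]

def unique_per_dim(points):
    """Number of distinct values tried along each dimension."""
    n_dims = len(points[0])
    return [len({p[d] for p in points}) for d in range(n_dims)]

grid = grid_points(values_per_dim=3, n_dims=2)      # 3^2 = 9 trials
rand = random_points(n_points=len(grid), n_dims=2)  # same budget of 9 trials

grid_unique = unique_per_dim(grid)  # only 3 distinct values per dimension
rand_unique = unique_per_dim(rand)  # one distinct value per trial, per dimension

# The grid budget grows exponentially with the number of dimensions.
grid_budget_5d = len(grid_points(values_per_dim=3, n_dims=5))  # 3^5 trials
```

With the same budget, the grid only ever tries three learning rates (say), while random search tries a different one in every trial. This matters most when only a few hyperparameters strongly affect the loss.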

2. The Bayesian Approach

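Bayesian optimization treats hyperparameter tuning as a sequence of decisions driven by **Bayesian Inference**: after each evaluation, the belief about the objective function is updated, and the updated belief determines which configuration to try next.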

A. The Surrogate Model

Instead of evaluating the true objective function $f(x)$ blindly, the algorithm fits a surrogate model (a cheap probabilistic approximation of $f$) to the evaluations collected so far. The most common surrogate is a **Gaussian Process (GP)**, which provides not just a prediction for the loss at point $x$, but also an **uncertainty estimate**: a predictive variance, from which an interval around the prediction can be derived.
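
As a rough illustration of what "prediction plus uncertainty" means, the following sketch computes the posterior mean and standard deviation of a one-dimensional GP in plain Python. The zero-mean prior, the RBF kernel, the `length_scale` and `noise` values, and the sample data are all assumptions chosen for the example; production code would normally use a numerical library instead.

```python
import math

def rbf(a, b, length_scale=0.3):
    """Squared-exponential kernel: similarity decays with distance."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation at x_new (zero-mean GP prior)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)      # K^{-1} y
    v = solve(K, k_star)      # K^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))

# Hypothetical observations: (hyperparameter value, validation loss).
xs = [0.1, 0.4, 0.9]
ys = [0.8, 0.2, 0.6]

mean_seen, std_seen = gp_posterior(xs, ys, 0.4)   # at an evaluated point
mean_gap, std_gap = gp_posterior(xs, ys, 0.65)    # between evaluated points
mean_far, std_far = gp_posterior(xs, ys, 3.0)     # far from all data
```

At an already evaluated point the surrogate reproduces the observed loss with almost no uncertainty. Far from all observations it falls back to the prior, with a standard deviation close to 1.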

B. The Acquisition Function

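The algorithm uses an Acquisition Function (like **Expected Improvement (EI)** or **Upper Confidence Bound (UCB)**, which becomes a lower confidence bound when the objective is a loss to be minimized) to decide where to sample next. The acquisition function is cheap to evaluate because it depends only on the surrogate, not on the true objective. It explicitly balances: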

* **Exploitation**: Sampling where the surrogate model predicts the loss will be lowest.

* **Exploration**: Sampling where the surrogate model has the highest uncertainty.
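
A minimal sketch of both acquisition functions for a loss that is being minimized, assuming the surrogate returns a Gaussian prediction with mean `mean` and standard deviation `std` at the candidate point. The function names and the `xi` and `kappa` parameters are illustrative choices rather than a specific library's API.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, best_loss, xi=0.0):
    """Expected amount by which a candidate beats best_loss (minimization).

    xi is an optional margin: larger values favor exploration.
    """
    gap = best_loss - mean - xi
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(gap, 0.0)
    z = gap / std
    # First term rewards a low predicted loss (exploitation),
    # second term rewards high uncertainty (exploration).
    return gap * normal_cdf(z) + std * normal_pdf(z)

def lower_confidence_bound(mean, std, kappa=2.0):
    """Optimistic loss estimate; the candidate with the smallest value is chosen."""
    return mean - kappa * std

# Two candidates predicted to be slightly worse than the best loss so far (0.4):
ei_confident = expected_improvement(mean=0.5, std=0.05, best_loss=0.4)
ei_uncertain = expected_improvement(mean=0.5, std=0.20, best_loss=0.4)
```

Both candidates have the same predicted loss, but the more uncertain one receives the higher expected improvement, which is how exploration enters the decision.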

3. Tree-structured Parzen Estimators (TPE)

While Gaussian Processes work well for continuous variables, they struggle with categorical or conditional hyperparameters (e.g., "If optimizer=Adam, then tune beta1; else...").

Frameworks such as **Optuna** and **Hyperopt** implement **TPE**; in Optuna it is the default sampler.

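Instead of modeling $P(y|x)$ (probability of loss given parameters), TPE models $P(x|y)$ and $P(y)$. It splits past trials at a loss quantile into "good" and "bad" groups and fits a separate density over $x$ to each, commonly written $l(x)$ for the good group and $g(x)$ for the bad group. New points are sampled from the "good" density, and the candidate with the largest ratio $l(x)/g(x)$ is evaluated next.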
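
The idea can be sketched for a single continuous hyperparameter. This is a simplified illustration, not the implementation used by Optuna or Hyperopt: the fixed-bandwidth Gaussian kernel density estimate, the names `kde` and `tpe_suggest`, and the defaults for `gamma`, `n_candidates`, and `bandwidth` are all assumptions.

```python
import math
import random

def kde(points, bandwidth):
    """Gaussian kernel density estimate over a list of 1-D points."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm
    return density

def tpe_suggest(trials, gamma=0.25, n_candidates=24, bandwidth=0.1, seed=0):
    """Suggest the next value given past (x, loss) trials.

    The best gamma fraction of trials forms the "good" group, the rest the
    "bad" group. Candidates are drawn from the good density l(x), and the one
    maximizing l(x) / g(x) is returned.
    """
    if len(trials) < 2:
        raise ValueError("need at least two trials to form both groups")
    rng = random.Random(seed)
    ranked = sorted(trials, key=lambda t: t[1])  # lower loss is better
    n_good = min(max(1, math.ceil(gamma * len(ranked))), len(ranked) - 1)
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    # Sampling from a Gaussian KDE: pick a center, then add Gaussian noise.
    candidates = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))

# Hypothetical history: loss is lowest near x = 0.7.
history = [(i / 10, (i / 10 - 0.7) ** 2) for i in range(11)]
suggestion = tpe_suggest(history)
```

Because the densities are over $x$ rather than over the loss, the same scheme extends to categorical and conditional parameters by swapping in a suitable distribution per parameter, which is the "tree-structured" part of the name.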

---

**See Also:**

- [Bayesian Inference](BayesianInference)

- [Optimization Algorithms](OptimizationAlgorithms)