Bayesian Inference: The Calculus of Belief
Bayesian inference provides a mathematically rigorous mechanism for updating beliefs in the face of new evidence. While frequentist statistics treats parameters as fixed, unknown constants, Bayesian inference treats parameters as random variables governed by probability distributions. This paradigm allows for the coherent integration of prior knowledge with observed data.
1. The Mathematical Framework
Bayesian inference is the continuous and multivariate extension of Bayes' Theorem applied to parameter estimation.
1.1 Bayes' Theorem in Continuous Spaces
For a parameter vector $\boldsymbol{\theta}$and observed data$\mathcal{D}$, the posterior distribution is defined as:$$p(\boldsymbol{\theta} | \mathcal{D}) = \frac{p(\mathcal{D} | \boldsymbol{\theta}) p(\boldsymbol{\theta})}{p(\mathcal{D})}$$- **Prior$p(\boldsymbol{\theta})$:** The belief state regarding the parameters before observing$\mathcal{D}$.
- **Likelihood$p(\mathcal{D} | \boldsymbol{\theta})$:** The probability density of observing the data given a specific$\boldsymbol{\theta}$.
- **Posterior$p(\boldsymbol{\theta} | \mathcal{D})$:** The updated belief state.
- **Evidence (Marginal Likelihood)$p(\mathcal{D})$:** The normalizing constant, defined as the integral of the likelihood over the entire prior space:$$p(\mathcal{D}) = \int_{\Theta} p(\mathcal{D} | \boldsymbol{\theta}) p(\boldsymbol{\theta}) d\boldsymbol{\theta}$$### 1.2 The Intractability of Evidence
In high-dimensional models (e.g., neural networks), the integral for$p(\mathcal{D})$over$\mathbb{R}^n$lacks a closed-form solution and is computationally intractable. This bottleneck necessitates approximation methods like Markov Chain Monte Carlo (MCMC) and Variational Inference (VI).
2. Geometric Intuition: Projection and Manifolds
Bayesian updating can be visualized spatially, particularly when using approximate inference.
2.1 Information Projection (Variational Inference)
In Variational Inference, we seek a tractable distribution$q(\boldsymbol{\theta})$that closely approximates the true, intractable posterior$p(\boldsymbol{\theta} | \mathcal{D})$.
Geometrically, we are taking the true posterior, which lives on a complex manifold of probability measures, and **projecting** it onto a simpler, constrained sub-manifold (e.g., the manifold of all independent Gaussian distributions). The distance minimized is the Kullback-Leibler (KL) divergence:$$q^*(\boldsymbol{\theta}) = \text{argmin}_{q \in \mathcal{Q}} \text{KL}(q(\boldsymbol{\theta}) \parallel p(\boldsymbol{\theta} | \mathcal{D}))$$Visually, imagine a highly irregular, curved 3D surface representing the true posterior. VI finds the point on a flat 2D plane (the proxy family$\mathcal{Q}$) that is "closest" to the peak of that 3D surface.
2.2 Prior as a Geometric Regularizer
A Gaussian prior$\mathcal{N}(0, \sigma^2 I)$geometrically restricts the volume of the parameter space. It pulls the posterior distribution toward the origin, acting as a gravitational anchor that prevents the likelihood from stretching out indefinitely along axes where data is sparse.
3. Quantitative Foundations: Conjugacy and Matrices
When the prior and posterior belong to the same probability family, they are **conjugate**. This yields exact, closed-form updates.
3.1 Gaussian-Gaussian Conjugate Update
Suppose the likelihood is normal with known variance$\sigma^2$, and the prior on the mean$\mu$is$\mathcal{N}(\mu_0, \sigma_0^2)$. Given$n$observations with sample mean$\bar{x}$, the posterior for$\mu$is also normal,$\mathcal{N}(\mu_n, \sigma_n^2)$, where:$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}$$
$$\mu_n = \sigma_n^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2} \right)$$#### 3.1.1 Precision Interpretation
Let precision$\lambda = 1/\sigma^2$. The posterior precision is the sum of the prior precision and the data precision ($\lambda_n = \lambda_0 + n\lambda$). The posterior mean is a precision-weighted average of the prior mean and sample mean.
Table 1: Common Conjugate Priors
| Likelihood | Conjugate Prior | Posterior Distribution | Application |
| :--- | :--- | :--- | :--- |
| **Bernoulli** | Beta($\alpha, \beta$) | Beta($\alpha + \Sigma x, \beta + n - \Sigma x$) | A/B Testing |
| **Poisson** | Gamma($\alpha, \beta$) | Gamma($\alpha + \Sigma x, \beta + n$) | Traffic Modeling |
| **Multinomial** | Dirichlet($\mathbf{\alpha}$) | Dirichlet($\boldsymbol{\alpha} + \mathbf{x}$) | Topic Modeling (LDA) |
4. Real-World Applications
4.1 Robotics: The Kalman Filter
The Kalman Filter is a recursive Bayesian estimator for linear dynamical systems with Gaussian noise.
1. **Prediction Step (Prior):** The system uses physics (e.g.,$x_{t} = v t$) to propagate the previous state forward, widening the covariance matrix (uncertainty grows).
2. **Update Step (Posterior):** The system takes a noisy GPS measurement (Likelihood) and applies the Gaussian conjugate update formula, fusing the prior and measurement to yield a sharper posterior state estimate.
4.2 Structural Engineering and Reliability
In civil engineering, assessing the failure probability of a bridge requires fusing historical material strength data (Prior) with localized, non-destructive ultrasound test results (Likelihood). Bayesian inference yields a posterior probability of failure that accounts for both generalized knowledge and site-specific anomalies.
See Also
- [ProbabilityTheory]
- [BayesianReasoning]
- [StatisticalInference]
- [MarkovChainFundamentals]