Regression Analysis: The Geometry of Projection
Regression analysis is the primary statistical tool for modeling the relationship between a dependent variable $y$ and one or more independent variables $X$. At its mathematical heart, linear regression is an exercise in **orthogonal projection** within a high-dimensional vector space.
1. Ordinary Least Squares (OLS)
The goal of OLS is to find the vector of coefficients $\boldsymbol{\beta}$ that minimizes the sum of squared residuals.
1.1 The Matrix Formulation
Given $n$ observations and $k$ predictors, we define the model in matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

- $\mathbf{y} \in \mathbb{R}^n$: The vector of observations.
- $\mathbf{X} \in \mathbb{R}^{n \times k}$: The design matrix of predictors.
- $\boldsymbol{\beta} \in \mathbb{R}^k$: The vector of unknown coefficients.
- $\boldsymbol{\epsilon} \in \mathbb{R}^n$: The vector of errors.
The OLS solution that minimizes $\| \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|^2$ is found by solving the **Normal Equations**:

$$\mathbf{X}^T \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^T \mathbf{y}$$

When $\mathbf{X}$ has full column rank, $\mathbf{X}^T \mathbf{X}$ is invertible and the solution is unique:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

## 2. Geometric Intuition: Orthogonal Projection
The most powerful way to understand regression is through the geometry of linear algebra.
2.1 The Column Space of X
The matrix $\mathbf{X}$ defines a subspace in $\mathbb{R}^n$ called the **Column Space** (or Range) of $\mathbf{X}$, denoted $\text{Col}(\mathbf{X})$. Any prediction $\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}$ must lie within this subspace.
Because the observed vector $\mathbf{y}$ likely contains noise, it does not lie exactly within $\text{Col}(\mathbf{X})$.
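The normal equations from Section 1.1 and the column-space claim can both be checked numerically. The following is a minimal sketch on synthetic data; the sample size, the coefficients in `beta_true`, and the noise level are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3

# Design matrix: an intercept column plus two synthetic predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])   # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X^T X beta = X^T y without forming an inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_normal

# Appending y_hat to X leaves the rank unchanged (y_hat lies in Col(X));
# appending the noisy y raises the rank by one (y lies outside Col(X)).
rank_X = np.linalg.matrix_rank(X)
rank_with_fit = np.linalg.matrix_rank(np.column_stack([X, y_hat]))
rank_with_obs = np.linalg.matrix_rank(np.column_stack([X, y]))

print(beta_normal, beta_lstsq)
print(rank_X, rank_with_fit, rank_with_obs)
```

Solving the linear system with `np.linalg.solve` rather than inverting $\mathbf{X}^T \mathbf{X}$ explicitly is a common numerical choice; `np.linalg.lstsq` is usually preferable when $\mathbf{X}$ may be poorly conditioned.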
2.2 The Projection (Hat) Matrix
The "best fit" $\hat{\mathbf{y}}$ is the point in $\text{Col}(\mathbf{X})$ that is **closest** to $\mathbf{y}$ in terms of Euclidean distance. Geometrically, this is the **orthogonal projection** of $\mathbf{y}$ onto the column space.
The transformation that performs this projection is the **Hat Matrix** $\mathbf{H}$:

$$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$$

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$$

- **Properties of H:** It is symmetric ($\mathbf{H}^T = \mathbf{H}$) and idempotent ($\mathbf{H}^2 = \mathbf{H}$).
- **The Residual Vector:** $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}$. Geometrically, the residuals are the vector component of $\mathbf{y}$ that is orthogonal to the column space of $\mathbf{X}$.
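The properties listed above are easy to verify on a small example. This sketch builds $\mathbf{H}$ for a randomly generated design matrix (an assumption made purely for illustration) and checks symmetry, idempotence, and the orthogonality of the residuals to every column of $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3

# Synthetic design matrix and response (illustrative only).
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

# H = X (X^T X)^{-1} X^T, built with solve() instead of an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y
e = y - y_hat   # equivalently (I - H) @ y

is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)

# X^T e = 0: the residual is orthogonal to every column of X.
residual_orthogonal = np.allclose(X.T @ e, 0.0, atol=1e-10)

# Orthogonality gives the Pythagorean split ||y||^2 = ||y_hat||^2 + ||e||^2.
pythagoras_holds = np.isclose(y @ y, y_hat @ y_hat + e @ e)

# The trace of a projection matrix equals the dimension of its range, here k.
trace_H = np.trace(H)

print(is_symmetric, is_idempotent, residual_orthogonal, pythagoras_holds, trace_H)
```

Forming the full $n \times n$ matrix is fine for a demonstration; for large $n$ one would normally compute $\hat{\mathbf{y}}$ directly from the fitted coefficients instead.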
3. Quantitative Foundations: The Gauss-Markov Theorem
Why is OLS so widely used? The Gauss-Markov theorem provides the justification.
3.1 BLUE (Best Linear Unbiased Estimator)
Under the assumptions of linearity, full rank, exogeneity ($\mathbb{E}[\epsilon \mid X] = 0$), and homoscedasticity ($\text{Var}(\epsilon) = \sigma^2 I$), the OLS estimator $\hat{\boldsymbol{\beta}}$ is the **Best Linear Unbiased Estimator**.
"Best" here means that it has the **minimum variance** among all linear unbiased estimators.
Table 1: ANOVA for Linear Regression (here $k$ counts the columns of $\mathbf{X}$, including the intercept)
| Source | Degrees of Freedom | Sum of Squares (SS) | Mean Square (MS) |
| :--- | :--- | :--- | :--- |
| **Model** | $k - 1$ | $\sum (\hat{y}_i - \bar{y})^2$ | $\text{SSM} / (k-1)$ |
| **Error** | $n - k$ | $\sum (y_i - \hat{y}_i)^2$ | $\text{SSE} / (n-k)$ |
| **Total** | $n - 1$ | $\sum (y_i - \bar{y})^2$ | |
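The ratio $F = \text{MSM} / \text{MSE}$ tests whether the model as a whole is statistically significant: under the null hypothesis that all slope coefficients are zero, and with normally distributed errors, $F$ follows an $F$ distribution with $k - 1$ and $n - k$ degrees of freedom.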
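The entries of Table 1 can be computed directly from a fitted model. This is a minimal sketch on synthetic data (the coefficients and noise are assumptions), with `k` counting the columns of the design matrix including the intercept so that the degrees of freedom match the table.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 60, 3   # k counts the columns of X, including the intercept

# Synthetic data with hypothetical coefficients.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.5, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
y_bar = y.mean()

# Sums of squares from Table 1.
ssm = np.sum((y_hat - y_bar) ** 2)   # model
sse = np.sum((y - y_hat) ** 2)       # error
sst = np.sum((y - y_bar) ** 2)       # total

# Mean squares and the F ratio.
msm = ssm / (k - 1)
mse = sse / (n - k)
f_stat = msm / mse

print(f"SSM={ssm:.3f}  SSE={sse:.3f}  SST={sst:.3f}  F={f_stat:.3f}")
```

Because the model contains an intercept, the decomposition $\text{SSM} + \text{SSE} = \text{SST}$ holds exactly up to rounding. If SciPy is available, a p-value could be obtained with `scipy.stats.f.sf(f_stat, k - 1, n - k)`.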
4. Real-World Applications
4.1 Finance: The Capital Asset Pricing Model (CAPM)
In finance, regression is used to measure the market-related (systematic) risk of an asset. The "Beta" ($\beta$) of a stock is the slope of a linear regression in which the market return is the independent variable and the stock return is the dependent variable. A $\beta > 1$ indicates that the stock tends to amplify market movements and is therefore more volatile than the market (a "leveraged" projection).
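As a sketch of how such a beta might be estimated, the example below regresses simulated stock returns on simulated market returns. All numbers are synthetic and `true_beta` is a hypothetical value chosen for illustration; no real market data is involved.

```python
import numpy as np

rng = np.random.default_rng(4)
n_days = 250

# Synthetic daily returns (not real market data).
market = rng.normal(loc=0.0004, scale=0.01, size=n_days)
true_beta = 1.3   # hypothetical sensitivity to the market
stock = 0.0001 + true_beta * market + rng.normal(scale=0.008, size=n_days)

# Regress stock returns on market returns with an intercept (alpha).
X = np.column_stack([np.ones(n_days), market])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, stock, rcond=None)

# With a single regressor, the OLS slope equals Cov(stock, market) / Var(market).
beta_cov = np.cov(stock, market)[0, 1] / np.var(market, ddof=1)

print(f"alpha={alpha_hat:.5f}  beta={beta_hat:.3f}  cov/var={beta_cov:.3f}")
```

The agreement between `beta_hat` and `beta_cov` reflects the geometry of the earlier sections: the slope is the coordinate of the projection of the centered stock returns onto the centered market returns.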
4.2 Machine Learning and Regularization
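When the number of predictors $k$ is large, or when predictors are strongly correlated, the matrix $\mathbf{X}^T \mathbf{X}$ may be ill-conditioned (nearly singular). Geometrically, the columns of $\mathbf{X}$ are nearly linearly dependent, so the column space is very "thin" or "flat" in some directions.
**Ridge Regression** addresses this by adding $\lambda \mathbf{I}$ (with $\lambda > 0$) to $\mathbf{X}^T \mathbf{X}$, giving $\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$. This raises every eigenvalue of $\mathbf{X}^T \mathbf{X}$ by $\lambda$, which geometrically "inflates" the flat directions and prevents the coefficients from exploding. The same estimator arises from Bayesian reasoning: with Gaussian errors, it is the posterior mode under a zero-mean Gaussian prior on $\boldsymbol{\beta}$.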
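The effect of the $\lambda \mathbf{I}$ term can be seen on a deliberately ill-conditioned problem. In this sketch, two nearly collinear synthetic predictors make $\mathbf{X}^T \mathbf{X}$ almost singular; the helper `ridge` and the penalty `lam = 1.0` are illustrative assumptions, not a recommended setting.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50

# Two nearly collinear predictors make X^T X almost singular.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=n)


def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam * I)^{-1} X^T y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)


lam = 1.0                       # hypothetical penalty strength
beta_ols = ridge(X, y, 0.0)     # lam = 0 reduces to OLS
beta_ridge = ridge(X, y, lam)

# Condition numbers before and after adding lam * I.
cond_ols = np.linalg.cond(X.T @ X)
cond_ridge = np.linalg.cond(X.T @ X + lam * np.eye(X.shape[1]))

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
print(f"cond(X^T X) = {cond_ols:.2e}, with ridge = {cond_ridge:.2e}")
```

The OLS coefficients are dominated by noise along the flat direction, while the ridge coefficients stay small. In practice $\lambda$ is usually chosen by cross-validation rather than fixed in advance.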
See Also
- [StatisticsFundamentals]
- [StatisticalInference]
- [LinearAlgebra]
- [MathematicsHub]