Optimization Algorithms: Finding the Global Minimum

Optimization is the science of finding the input $x$ that minimizes a function $f(x)$ , typically representing a "loss" or "cost." In modern engineering, it is the engine that allows a neural network to "learn." In 2024, the field has shifted from mere gradient descent to Full-Parameter Efficiency, allowing massive models to be trained on consumer hardware.

1. The Landscape of Optimization: Gradient vs. Curvature

To find the minimum of a function, an algorithm must decide: Which direction to go? and How far to step?

1.1 First-Order Methods (The "Foggy Mountain" View)

Gradient Descent uses the first derivative (the gradient, $\nabla f$ ) to find the direction of steepest descent.

Geometric Intuition: Imagine standing on a foggy mountain at night. You can only see the ground immediately beneath your feet. To descend, you feel for the steepest downward slope and take a small step.
Limitation: It is "blind" to the curvature. If you take a step that is too large, you might overshoot the valley.

1.2 Second-Order Methods (The "Flashlight" View)

Newton's Method uses the gradient and the second derivative (the Hessian matrix, $\mathbf{H}$ ), which represents curvature.

Geometric Intuition: You use a flashlight to see the shape of the bowl around you. At every step, the algorithm fits a paraboloid (a quadratic bowl) to the local surface and jumps directly to its bottom.
The Advantage: It adjusts its step size automatically. On flat plateaus, it takes massive leaps; in tight corners, it takes precise, tiny steps.

2. Unconstrained Optimization: The Engines of AI

2.1 Adam and AdamW (The Memory-Heavy Standard)

Adam (Adaptive Moment Estimation) tracks both the average gradient (momentum) and the average squared gradient (uncentered variance).

The 2024 Memory Crisis: For every model parameter, AdamW stores two 32-bit states. For a 7B parameter model, the optimizer states alone require 56GB of VRAM, far exceeding consumer hardware limits.
The 2024 Solution (8-bit Optimizers): Quantizing these states to 8-bit reduces memory by 75% with negligible accuracy loss.

2.2 2024 Breakthroughs: GaLore and DoRA

GaLore (Gradient Low-Rank Projection): Instead of full-rank updates, GaLore projects gradients into a low-rank subspace. It reduces optimizer memory by 65.5%, allowing pre-training of a 7B model from scratch on a single 24GB GPU.
DoRA (Weight-Decomposed LoRA): Decomposes weights into magnitude and direction. It outperforms traditional LoRA by allowing the model to learn directional updates more effectively, bridging the gap to full-parameter tuning.

3. Constrained Optimization: The Geometry of Boundaries

Many problems have rules: "total weight must equal 1" ( $g(x) = 0$ ) or "budget cannot be negative."

3.1 Lagrange Multipliers: The Tangency Condition

Geometric Intuition: Imagine your objective's contour lines (topographic map). Your constraint is a path. You reach the optimum at the exact moment your path kisses (is tangent to) a contour line.
The Equation: At this point, the gradients must be parallel: $\nabla f = \lambda \nabla g$ .

3.2 KKT Conditions: The Balance of Forces

The Karush-Kuhn-Tucker (KKT) conditions generalize Lagrange for inequality constraints ( $h(x) \le 0$ ).

Geometric Intuition: Think of a ball rolling into a corner where two walls meet. The ball stops because the "force" of gravity ( $-\nabla f$ ) is perfectly balanced by the "push" of the walls ( $\lambda \nabla h$ ).
Complementary Slackness: If the ball isn't touching a wall (inactive constraint), that wall's "pushing force" ( $\lambda$ ) must be zero.

4. Convexity and Jensen's Inequality

A function is convex if it is "bowl-shaped" everywhere. Convex problems are the "holy grail" because any local minimum is guaranteed to be the global minimum.

4.1 Visualizing Jensen's Inequality

Jensen's Inequality states: $E[f(X)] \ge f(E[X])$ .

Visual Intuition: Draw a secant line between any two points on a convex curve. The line always stays above the curve.
- $E[f(X)]$ is a point on the line (the average of the outputs).
- $f(E[X])$ is a point on the curve (the output of the average).
- Geometrically, the "Average of the Function" is always greater than the "Function of the Average."

5. Quantitative Foundation: Comparative Performance

Optimizer	Order	Memory (7B Model)	Conv. Speed	Best Use Case
SGD + Momentum	1st	~28 GB	Linear	Vision / Simple Models
AdamW (32-bit)	1st (Adap)	56 GB	Fast	General NLP / LLMs
8-bit AdamW	1st (Adap)	14 GB	Fast	Consumer Fine-tuning
GaLore (2024)	1st (Adap)	~19 GB	Fast	7B Pre-training
L-BFGS	2nd (Quasi)	Very High	Quadratic	Scientific Computing

5.1 Learning Rate Schedulers

Finding the minimum is not just about the algorithm, but the "cooling" schedule.

Cosine Annealing: Gradually reduces the step size following a cosine curve, allowing the model to settle into the deepest, narrowest valleys of the loss landscape at the end of training.