Gradient Descent and Optimizers

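Optimizing a neural network means navigating a high-dimensional, non-convex loss landscape. A global minimum is rarely reachable or even identifiable in practice, so the practical goal is a low-loss region that generalizes well. The choice of optimizer and learning rate (LR) schedule affects both convergence speed and how well the final model generalizes.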

1. Stochastic Gradient Descent (SGD) with Momentum

SGD updates the weights using the gradient computed on a subset (mini-batch) of the data rather than the full dataset. Momentum adds a fraction of the previous update to the current one, which dampens oscillations across high-curvature directions and can carry the optimizer through flat regions and shallow local minima.

Update Rule:

$$v_t = \gamma v_{t-1} + \eta \nabla L(\theta_{t-1})$$

$$\theta_t = \theta_{t-1} - v_t$$

Where $v_t$ is the velocity (the accumulated update), $\gamma \approx 0.9$ is the momentum coefficient, and $\eta$ is the learning rate.
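The update rule can be traced by hand on a toy problem. The sketch below is a minimal pure-Python illustration, not a production optimizer: the function names and the toy loss $L(\theta) = \frac{1}{2}\theta^2$ (whose gradient is simply $\theta$) are assumptions chosen for the example.

```python
def sgd_momentum_step(theta, velocity, grad, lr=0.1, gamma=0.9):
    """Apply one step: v_t = gamma * v_{t-1} + lr * grad, theta_t = theta_{t-1} - v_t."""
    velocity = gamma * velocity + lr * grad
    theta = theta - velocity
    return theta, velocity


def minimize_quadratic(theta0=5.0, steps=200, lr=0.1, gamma=0.9):
    """Run the update on the toy loss L(theta) = 0.5 * theta**2 (hypothetical example)."""
    theta, velocity = theta0, 0.0
    for _ in range(steps):
        grad = theta  # dL/dtheta for the toy loss
        theta, velocity = sgd_momentum_step(theta, velocity, grad, lr, gamma)
    return theta


if __name__ == "__main__":
    print(minimize_quadratic())
```

Because the velocity carries over between steps, the iterate overshoots the minimum and oscillates with shrinking amplitude before settling, which is the characteristic behavior of momentum on a curved loss.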

2. Adaptive Optimizers: Adam and AdamW

Adam (Adaptive Moment Estimation) computes individual learning rates for each parameter by tracking the first moment (mean) and second moment (uncentered variance) of the gradients.

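AdamW, now a widely used default, decouples weight decay from the gradient-based update. When L2 regularization is combined with Adam, the penalty's gradient passes through the same adaptive scaling as the loss gradient, so the effective regularization strength varies from parameter to parameter in a way that is hard to control. AdamW instead subtracts the decay term directly from the weights, outside the adaptive scaling.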
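The difference is easiest to see on a single scalar weight. The following is a rough sketch under stated assumptions: `adam_step` and its `decoupled` flag are hypothetical names for this illustration, not a library API, and the hyperparameter defaults are the commonly quoted ones.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One Adam-style update on a scalar parameter (t is the 1-based step count).

    decoupled=False: L2 style, the decay term is folded into the gradient.
    decoupled=True:  AdamW style, the decay is applied to the weight directly.
    """
    if not decoupled:
        grad = grad + weight_decay * theta  # penalty goes through adaptive scaling
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta_new = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        theta_new -= lr * weight_decay * theta  # decay bypasses adaptive scaling
    return theta_new, m, v


if __name__ == "__main__":
    # Zero loss gradient, so only the regularization moves the weight.
    for wd in (0.1, 0.001):
        l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, 1, lr=0.01,
                             weight_decay=wd, decoupled=False)
        aw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, 1, lr=0.01,
                             weight_decay=wd, decoupled=True)
        print(wd, l2, aw)
```

With a zero loss gradient, the L2 variant's first step has nearly the same size for both decay values, because the adaptive normalization cancels the decay coefficient. The decoupled variant shrinks the weight in proportion to `weight_decay`.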

3. Lion (EvoLved Sign Momentum)

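Lion is a memory-efficient optimizer that applies only the sign of its update rather than the magnitude, so every parameter moves by the same amount (scaled by the learning rate) at each step. It tracks only a momentum buffer (the first moment) and no second moment, which halves the optimizer state memory compared to AdamW.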
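A scalar sketch of the Lion update may help make the sign-based step concrete. It follows the update rule as published (interpolate momentum and gradient, take the sign, then refresh the momentum with a second coefficient). The function names and default values here are illustrative assumptions.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion-style update on a scalar parameter.

    The only optimizer state is the momentum buffer m.
    """
    # Direction comes from the sign of an interpolation of momentum and gradient.
    update = sign(beta1 * m + (1 - beta1) * grad)
    # Decoupled weight decay is added to the signed update.
    theta = theta - lr * (update + weight_decay * theta)
    # The momentum buffer is refreshed with a separate coefficient.
    m = beta2 * m + (1 - beta2) * grad
    return theta, m


if __name__ == "__main__":
    # A large and a tiny gradient produce the same step size.
    print(lion_step(1.0, 250.0, 0.0, lr=0.1))
    print(lion_step(1.0, 0.001, 0.0, lr=0.1))
```

Since the step magnitude no longer depends on the gradient scale, Lion is typically run with a smaller learning rate than AdamW. The exact ratio is a tuning choice.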

4. Learning Rate Schedules

A constant learning rate is rarely the best choice: a value large enough for fast early progress tends to be too large to settle near a minimum later in training. Schedules adjust $\eta$ over time.

Concrete Example: Cosine Annealing with Warmup in PyTorch

Warmup ramps the LR up from zero so that large early updates do not destabilize the model. Cosine annealing then smoothly decays the LR to a minimum value, which is zero in this example.

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR
import math

# Hyperparameters
max_lr = 3e-4
warmup_steps = 1000
total_steps = 10000

# Model and Optimizer
model = torch.nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.01)

def lr_lambda(current_step):
    # 1. Linear Warmup
    if current_step < warmup_steps:
        return float(current_step) / float(max(1, warmup_steps))
    
    # 2. Cosine Annealing
    progress = float(current_step - warmup_steps) / float(max(1, total_steps - warmup_steps))
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

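# Training loop (random tensors stand in for a real data loader)
for step in range(total_steps):
    inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the schedule once per optimizer step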
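
To sanity-check the schedule without running a training loop, the multiplier can be evaluated directly at a few steps. This is a standalone sketch that mirrors the logic of `lr_lambda` above. The name `warmup_cosine_multiplier` and the chosen probe steps are assumptions made for illustration.

```python
import math


def warmup_cosine_multiplier(step, warmup_steps=1000, total_steps=10000):
    """Return the factor that scales the peak LR at a given step."""
    if step < warmup_steps:
        # Linear warmup from 0 up to 1.
        return step / max(1, warmup_steps)
    # Cosine decay from 1 down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    max_lr = 3e-4
    for step in (0, 500, 1000, 5500, 10000):
        print(step, max_lr * warmup_cosine_multiplier(step))
```

The multiplier is 0 at step 0, reaches 1 at the end of warmup, passes through roughly 0.5 halfway through the decay phase, and returns to 0 at the final step. Note that the very first optimizer step therefore runs with an LR of zero.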

5. Gradient Clipping

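To prevent "exploding gradients" in deep networks (especially Transformers), gradients are clipped by their global norm. If the norm $\|g\|$, computed over all parameters together, exceeds a threshold (commonly 1.0), every gradient is scaled by the same factor so that the norm matches the threshold. This limits the step size while preserving the gradient's direction. The call goes after loss.backward() and before optimizer.step():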

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
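
The rescaling itself is simple enough to write out. Below is a pure-Python sketch of global-norm clipping on nested lists. It is meant to illustrate the arithmetic, not to reproduce PyTorch's implementation exactly. The helper name `clip_by_global_norm` and the small `eps` guard are assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one shared factor if their joint L2 norm exceeds max_norm.

    grads is a list of per-parameter gradient lists. Returns the (possibly)
    rescaled gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # The factor is capped at 1 so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


if __name__ == "__main__":
    clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
    print(norm, clipped)
```

Because a single factor is applied to every parameter, the relative sizes of the per-parameter gradients are unchanged. Only the overall length of the update is limited.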
