Natural Language Processing: The Transformer Era

Modern NLP has transitioned from sequential recurrent models (RNNs/LSTMs) to parallelizable, attention-based architectures. This shift, pioneered by the Transformer, allows for modeling long-range dependencies and massive scaling.

1. The Attention Mechanism: Scaled Dot-Product

The core of the Transformer is Self-Attention. For an input sequence of embeddings $X$ , we derive three matrices: Query ( $Q$ ), Key ( $K$ ), and Value ( $V$ ) via linear projections:

Q = XW_Q, \quad K = XW_K, \quad V = XW_V

1.1 The Attention Formula

The attention weights are calculated by measuring the compatibility between queries and keys, normalized by a softmax function:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

$QK^T$ : Computes the raw similarity scores between every pair of tokens.
$\sqrt{d_k}$ : Scaling factor (where $d_k$ is the dimension of the keys). This prevents the dot products from growing too large in magnitude, which would push the softmax into regions with extremely small gradients.
Softmax: Normalizes the scores into a probability distribution.

2. KV-Caching: Optimizing Inference

In auto-regressive generation (predicting the next token one by one), the model re-processes the entire prefix at every step. This is computationally redundant. KV-Caching is the standard optimization to eliminate this $\mathcal{O}(N^2)$ overhead.

2.1 The Problem: Quadratic Redundancy

Without caching, to generate token $N+1$ , the model computes $K$ and $V$ for all tokens $1 \dots N$. At step $N+2$ , it re-computes $K$ and $V$ for $1 \dots N+1$.

2.2 The Solution: Incremental Updates

Since the embeddings for tokens $1 \dots N$ do not change when token $N+1$ is added, we store their $K$ and $V$ vectors in memory (the "KV Cache").

Step $t$ : Compute $q_t, k_t, v_t$ only for the new token.
Append: Add $k_t$ and $v_t$ to the cache: $K_{cache} = [K_{prev}; k_t]$ , $V_{cache} = [V_{prev}; v_t]$ .
Compute: Run attention using the current $q_t$ against the entire $K_{cache}$ and $V_{cache}$ .

2.3 Memory Constraints

The KV cache size grows linearly with sequence length ( $L$ ) and batch size ( $B$ ):

\text{Memory} \approx 2 \times \text{layers} \times B \times L \times \text{heads} \times d_{head} \times \text{precision\_bytes}

For a 70B parameter model with a 4k context window, the KV cache can consume tens of gigabytes of VRAM, making PagedAttention (used in vLLM) necessary to manage fragmented memory.

3. Beyond Self-Attention: Multi-Head and GQA

Multi-Head Attention (MHA): Runs multiple attention "heads" in parallel, allowing the model to attend to different parts of the sequence (e.g., syntax vs. semantics) simultaneously.
Grouped-Query Attention (GQA): A middle ground between MHA and Multi-Query Attention (MQA). It shares a single set of Key/Value heads across a "group" of Query heads. This significantly reduces the KV cache size (and thus memory bandwidth) with minimal impact on accuracy.

4. Training Objectives

Causal Language Modeling (CLM): Standard for GPT-style models. Predicts $w_t$ given $w_{1 \dots t-1}$ .
Masked Language Modeling (MLM): Standard for BERT. Predicts randomly masked tokens using bidirectional context.

Modern NLP research is currently focused on Long-Context modeling (e.g., Ring Attention) and Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA, which allow adapting these massive models to specific tasks by only updating a tiny fraction of the weights.