Natural Language Processing: The Transformer Era

Modern NLP has transitioned from sequential recurrent models (RNNs/LSTMs) to parallelizable, attention-based architectures. This shift, pioneered by the Transformer, allows for modeling long-range dependencies and massive scaling.

1. The Attention Mechanism: Scaled Dot-Product

The core of the Transformer is **Self-Attention**. For an input sequence of embeddings $X$, we derive three matrices: **Query ($Q$)**, **Key ($K$)**, and **Value ($V$)** via linear projections:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

1.1 The Attention Formula

The attention weights are calculated by measuring the compatibility between queries and keys, normalized by a softmax function:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

* **$QK^T$:** Computes the raw similarity scores between every pair of tokens.

* **$\sqrt{d_k}$:** Scaling factor (where $d_k$ is the dimension of the keys). This prevents the dot products from growing too large in magnitude, which would push the softmax into regions with extremely small gradients.

* **Softmax:** Normalizes the scores into a probability distribution.

2. KV-Caching: Optimizing Inference

In auto-regressive generation (predicting the next token one by one), the model re-processes the entire prefix at every step. This is computationally redundant. **KV-Caching** is the standard optimization to eliminate this $\mathcal{O}(N^2)$ overhead.

2.1 The Problem: Quadratic Redundancy

Without caching, to generate token $N+1$, the model computes $K$ and $V$ for all tokens $1 \dots N$. At step $N+2$, it re-computes $K$ and $V$ for $1 \dots N+1$.

2.2 The Solution: Incremental Updates

Since the embeddings for tokens $1 \dots N$ do not change when token $N+1$ is added, we store their $K$ and $V$ vectors in memory (the "KV Cache").

1. **Step $t$:** Compute $q_t, k_t, v_t$ only for the new token.

2. **Append:** Add $k_t$ and $v_t$ to the cache: $K_{cache} = [K_{prev}; k_t]$, $V_{cache} = [V_{prev}; v_t]$.

3. **Compute:** Run attention using the current $q_t$ against the entire $K_{cache}$ and $V_{cache}$.

2.3 Memory Constraints

The KV cache size grows linearly with sequence length ($L$) and batch size ($B$):

$$\text{Memory} \approx 2 \times \text{layers} \times B \times L \times \text{heads} \times d_{head} \times \text{precision\_bytes}$$

For a 70B parameter model with a 4k context window, the KV cache can consume tens of gigabytes of VRAM, making **PagedAttention** (used in vLLM) necessary to manage fragmented memory.

3. Beyond Self-Attention: Multi-Head and GQA

* **Multi-Head Attention (MHA):** Runs multiple attention "heads" in parallel, allowing the model to attend to different parts of the sequence (e.g., syntax vs. semantics) simultaneously.

* **Grouped-Query Attention (GQA):** A middle ground between MHA and Multi-Query Attention (MQA). It shares a single set of Key/Value heads across a "group" of Query heads. This significantly reduces the KV cache size (and thus memory bandwidth) with minimal impact on accuracy.

4. Training Objectives

* **Causal Language Modeling (CLM):** Standard for GPT-style models. Predicts $w_t$ given $w_{1 \dots t-1}$.

* **Masked Language Modeling (MLM):** Standard for BERT. Predicts randomly masked tokens using bidirectional context.

Modern NLP research is currently focused on **Long-Context** modeling (e.g., Ring Attention) and **Parameter-Efficient Fine-Tuning (PEFT)** techniques like LoRA, which allow adapting these massive models to specific tasks by only updating a tiny fraction of the weights.