Transformer Architecture
The transformer architecture, introduced in "Attention is All You Need" (2017), powers modern LLMs and most of contemporary deep learning. Understanding transformers is foundational to working with LLMs.
This page covers how they work and why they matter.
The core innovation: attention
Pre-transformer sequence models (RNNs, LSTMs) processed tokens sequentially. Information had to flow through hidden states.
Transformers process all tokens in parallel. Each token "attends" directly to every other token via the attention mechanism.
Self-attention
For each token:
1. Compute query (Q), key (K), value (V) vectors
2. Score against all other tokens: Q · K
3. Softmax over scores
4. Weighted sum of values
Result: each token's output is informed by all input tokens, weighted by relevance.
Multi-head attention
Run attention multiple times in parallel with different projections. Concatenate.
Different heads learn different relationship types (syntax, semantics, coreference, etc.).
Typical: 12-128 heads.
Computational cost
Self-attention is O(n²) in sequence length. This is the dominant cost for long contexts.
Many efficiency variants try to reduce this (sparse attention, linear attention, etc.). Standard attention with hardware-aware implementations (FlashAttention) remains common.
Architecture components
Embedding layer
Convert tokens to vectors. Typically 1024-12288 dimensions for modern LLMs.
Positional encoding
Tokens have no inherent position in self-attention. Positional encodings inject position info.
Variants:
- **Sinusoidal** (original): fixed encoding by position
- **Learned**: trainable position embeddings
- **Relative position**: encode distances, not absolute positions
- **RoPE** (Rotary Position Embedding): rotation in attention; standard in modern LLMs
- **ALiBi**: linear bias based on position; helps extrapolation
Transformer block
Per layer:
1. Layer norm
2. Multi-head attention
3. Residual connection
4. Layer norm
5. Feed-forward network (typically 4x hidden dim)
6. Residual connection
Modern variants tweak normalization placement (pre-norm vs post-norm), use SwiGLU instead of standard FFN, etc.
Layers
Stack N transformer blocks. Modern LLMs: 32-128+ layers.
Output
Final layer norm + linear projection to vocabulary size.
For generation: sample from the resulting distribution.
Variants
Encoder-only (BERT)
Bidirectional attention. Used for understanding tasks (classification, NER, embedding).
Decoder-only (GPT)
Causal (left-to-right) attention. Used for generation.
Modern LLMs (GPT, Claude, Llama, etc.) are decoder-only.
Encoder-decoder (T5, original transformer)
Encoder processes input; decoder generates output. Used for translation, summarization.
Less common for LLMs but still used (T5, FLAN).
Vision transformers (ViT)
Treat image patches as tokens. Same architecture; different input pipeline.
Multimodal
Combine text and image (or audio, etc.) tokens. Input modality has its own encoding; the rest of the architecture is shared.
Training
Pretraining
Self-supervised on huge text corpora.
Objective:
- BERT: masked language modeling (predict masked tokens)
- GPT: next-token prediction
Pretraining is expensive ($1M-100M+ for state-of-the-art LLMs).
Fine-tuning
Adapt pretrained model to specific task with labeled data.
RLHF (Reinforcement Learning from Human Feedback)
Train reward model from human preferences; train policy to maximize reward.
Aligns model with human preferences. Powers ChatGPT, Claude, etc.
Constitutional AI
Variant where principles ("be helpful, harmless") guide self-critique. Anthropic's approach.
Key innovations enabling LLM scale
FlashAttention
Memory-efficient attention. Critical for long contexts.
Mixture of Experts (MoE)
Multiple FFN "experts"; route each token to a subset.
Effective parameter count > active parameters per inference.
Models: Mixtral, DBRX, GPT-4 (rumored), DeepSeek-V2.
Grouped query attention (GQA)
Share keys/values across query heads. Reduces memory.
Sliding window attention
Local attention beyond a window; reduces compute for long context.
Long context
Original transformers: 512-2K tokens.
Modern: 8K-1M+ tokens.
Achieved through:
- Better positional encodings (RoPE)
- Sparse / efficient attention
- Memory optimizations
Inference
Prefill
Process the prompt. Compute K and V for all tokens. Cache them.
This is the parallelizable, fast phase.
Decode
Generate tokens one at a time. Each token requires reading the entire KV cache.
Memory-bandwidth bound. Why LLM inference is "slow."
KV cache
Stores K and V from previous tokens. Reused across decode steps.
Memory grows linearly with context length.
Optimizations
- **Continuous batching**: vLLM, TGI
- **PagedAttention**: efficient KV cache management
- **Speculative decoding**: small model proposes; big model verifies
- **Quantization**: INT8, INT4 weights and activations
Scaling laws
Empirical findings (Kaplan et al., Chinchilla):
- Performance scales as power law in model size, data, compute
- Compute-optimal training has specific data-to-model ratios
Roughly: bigger model + more data → better performance, predictably.
Emergent capabilities
Some abilities appear suddenly at scale:
- In-context learning
- Chain-of-thought reasoning
- Tool use
- Multi-turn instruction following
Why exactly is unclear. Architecture + scale + training data interact.
Limitations
Hallucinations
Plausible but wrong outputs. Models predict tokens, not truth.
Computational cost
Even small LLMs are expensive at scale.
Context window
Limited; long-context models exist but quality degrades with length.
Lack of grounding
Without retrieval/tools, models reason from training data only.
Reasoning limitations
Symbolic / mathematical reasoning is fragile. Better with chain-of-thought, tools.
Training data cutoff
Models don't know recent events without retrieval.
Why transformers won
- Parallelization (vs sequential RNNs)
- Long-range dependencies (attention can be global)
- Scalability (architecture works at huge scale)
- Hardware fit (matrix multiplications align with GPU/TPU strengths)
- Pretraining + fine-tuning paradigm
These advantages compounded; transformers eclipsed RNNs across NLP within a few years.
Where transformers might be challenged
- State space models (Mamba): linear-time alternative
- Hybrid architectures: combine attention with other mechanisms
- Specialized architectures for specific modalities
For now, transformers remain dominant.
Practical takeaways
For practitioners:
- You don't need to implement transformers; use libraries
- Architecture details matter less than data and training
- Pretrained models are the starting point for almost everything
- Inference optimization is its own discipline
For researchers: the architecture is mature; innovation happens at training, data, and adjacent components.
Further Reading
- [AgentPromptEngineering](AgentPromptEngineering) — Working with LLMs
- [OpenSourceLlmEcosystem](OpenSourceLlmEcosystem) — Open models
- [NLPOverview](NLPOverview) — Broader NLP context
- [Generative AI Hub](GenerativeAIHub) — Cluster index