Transformer Architecture

The transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), underlies modern LLMs and much of contemporary deep learning. Understanding it is foundational to working with LLMs.

This page covers how they work and why they matter.

The core innovation: attention

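Pre-transformer sequence models (RNNs, LSTMs) processed tokens one at a time. Information from early tokens had to pass step by step through a fixed-size hidden state, which made long-range dependencies hard to learn and prevented parallel training across the sequence.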

Transformers process all tokens in parallel. Each token "attends" directly to every other token via the attention mechanism.

Self-attention

For each token:

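Compute query (Q), key (K), and value (V) vectors via learned linear projections
Score against every token (itself included): Q · K, scaled by 1/√d_k
Softmax over the scores to get attention weights
Take the weighted sum of the value vectors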

Result: each token's output is informed by all input tokens, weighted by relevance.
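The four steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the weight names (w_q, w_k, w_v), the sizes, and the random inputs are assumptions made for the example, and the 1/√d_k scaling follows the standard scaled dot-product formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (n_tokens, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # step 1: project to Q, K, V
    scores = q @ k.T / np.sqrt(k.shape[-1])    # step 2: scaled dot products, (n, n)
    weights = softmax(scores, axis=-1)         # step 3: each row sums to 1
    return weights @ v, weights                # step 4: weighted sum of values

# Hypothetical sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Row i of weights shows how much token i draws from every token in the sequence.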

Multi-head attention

Run attention several times in parallel, each head with its own learned projections. Concatenate the head outputs and apply a final linear projection.

Different heads can learn different relationship types (syntax, semantics, coreference, etc.).

Typical: 12-128 heads.
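As a rough sketch of the idea, the heads can be computed together by reshaping the projected tensors. The function and weight names here are hypothetical, and the output projection w_o follows the usual formulation rather than anything specific to this page.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (n_tokens, d_model); all weight matrices: (d_model, d_model)."""
    n, d_model = x.shape
    d_head = d_model // n_heads  # assumes d_model is divisible by n_heads

    def split(t):
        # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, n, n)
    heads = softmax(scores) @ v                             # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads
    return concat @ w_o

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w_q, w_k, w_v, w_o = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads=4)
```

Each head works in a d_model / n_heads subspace, so the total cost stays close to that of one full-width head.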

Computational cost

Self-attention costs O(n²) time in sequence length n, and O(n²) memory if the full score matrix is materialized. This is the dominant cost for long contexts.

Many efficiency variants try to reduce this (sparse attention, linear attention, etc.). Standard attention with hardware-aware implementations (FlashAttention) remains common.
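To make the quadratic growth concrete, the following back-of-the-envelope sketch estimates the memory needed to materialize one layer's score matrices. The head count (32) and 2-byte values (FP16) are assumed figures, not measurements from any particular model.

```python
def attention_score_bytes(n_tokens, n_heads, bytes_per_value=2):
    """Bytes needed to hold one layer's (n x n) score matrix for every head."""
    return n_heads * n_tokens * n_tokens * bytes_per_value

# Hypothetical configuration: 32 heads, FP16 scores.
for n in (2_048, 8_192, 32_768):
    gib = attention_score_bytes(n, n_heads=32) / 2**30
    print(f"{n} tokens: {gib:.2f} GiB")
```

Under these assumptions the figures are 0.25 GiB, 4 GiB, and 64 GiB: every 4x increase in length costs 16x the memory. This is why implementations such as FlashAttention avoid storing the full matrix.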

Architecture components

Embedding layer

Convert tokens to vectors. Typically 1024-12288 dimensions for modern LLMs.

Positional encoding

Tokens have no inherent position in self-attention. Positional encodings inject position info.

Variants:

Sinusoidal (original): fixed encoding by position
Learned: trainable position embeddings
Relative position: encode distances, not absolute positions
RoPE (Rotary Position Embedding): rotation in attention; standard in modern LLMs
ALiBi: linear bias based on position; helps extrapolation
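As one concrete instance, the sinusoidal variant from the original paper can be written as below. This is a sketch that assumes an even d_model; the function name is made up for the example.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Fixed encodings: sin on even dimensions, cos on odd dimensions."""
    positions = np.arange(n_positions)[:, None]    # (n_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices 2i
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(n_positions=16, d_model=8)
# In the original design, pe is added to the token embeddings before the first block.
```

Each dimension pair oscillates at a different wavelength, so every position gets a distinct pattern without any trained parameters.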

Transformer block

Per layer (pre-norm ordering shown):

Layer norm
Multi-head attention
Residual connection
Layer norm
Feed-forward network (inner dimension typically 4x the hidden dimension)
Residual connection

Modern variants change where normalization is placed (pre-norm vs post-norm), replace the standard FFN with a gated variant such as SwiGLU, etc.
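The per-layer sequence can be sketched as follows. To stay short, this illustration uses a single attention head, a ReLU feed-forward network, and layer norm without learned scale or shift. All of these are simplifying assumptions, and the parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector; learned scale and shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(x, w_q, w_k, w_v, w_o):
    # Single head for brevity; a real block would use multi-head attention.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_o

def feed_forward(x, w_in, w_out):
    # Expand to 4x the hidden size, apply ReLU, project back.
    return np.maximum(x @ w_in, 0.0) @ w_out

def transformer_block(x, p):
    # Norm -> attention -> residual, then norm -> FFN -> residual.
    x = x + attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"], p["w_o"])
    x = x + feed_forward(layer_norm(x), p["w_in"], p["w_out"])
    return x

rng = np.random.default_rng(0)
d = 16
shapes = {"w_q": (d, d), "w_k": (d, d), "w_v": (d, d), "w_o": (d, d),
          "w_in": (d, 4 * d), "w_out": (4 * d, d)}
p = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
x = rng.normal(size=(6, d))
y = transformer_block(x, p)
```

Because input and output shapes match, blocks can be stacked by feeding y into the next block.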

Layers

Stack N transformer blocks. Modern LLMs: 32-128+ layers.

Output

Final layer norm + linear projection to vocabulary size.

For generation: sample from the resulting distribution.
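A minimal sketch of the sampling step, assuming the model has already produced a vector of logits over the vocabulary. The temperature parameter is a common addition that this page does not otherwise discuss, and the function name is hypothetical.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Turn logits into probabilities with a softmax, then draw one token id."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature      # lower temperature -> sharper distribution
    scaled = scaled - scaled.max()     # numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

logits = np.array([1.0, 2.0, 0.5, 3.0])   # toy 4-token vocabulary
token_id, probs = sample_next_token(logits, temperature=0.8,
                                    rng=np.random.default_rng(0))
```

As the temperature approaches zero, sampling approaches greedy decoding (always picking the highest-logit token).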

Variants

Encoder-only (BERT)

Bidirectional attention. Used for understanding tasks (classification, NER, embedding).

Decoder-only (GPT)

Causal (left-to-right) attention. Used for generation.

Most modern LLMs (GPT, Claude, Llama, etc.) are decoder-only.
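Causal attention is typically implemented by masking out future positions before the softmax. The sketch below takes an assumed, precomputed score matrix and shows only the masking step.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over scores (n, n) with position i blocked from seeing j > i."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)
    w = np.exp(masked)                                   # exp(-inf) == 0
    return w / w.sum(axis=-1, keepdims=True)

weights = causal_attention_weights(np.zeros((4, 4)))
# With equal scores, row i spreads its weight evenly over positions 0..i.
```

An encoder-only model simply omits the mask, so every position sees the whole sequence.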

Encoder-decoder (T5, original transformer)

Encoder processes input; decoder generates output. Used for translation, summarization.

Less common for LLMs but still used (T5, FLAN-T5).

Vision transformers (ViT)

Treat image patches as tokens. Same architecture; different input pipeline.
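The "patches as tokens" step might look like the sketch below, which assumes a height x width x channels array and square, non-overlapping patches. In a real ViT each flattened patch is then passed through a learned linear projection, which is omitted here.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (gh, gw, patch, patch, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

image = np.random.default_rng(0).random((224, 224, 3))
tokens = image_to_patches(image, patch_size=16)   # shape (196, 768)
```

A 224 x 224 RGB image with 16 x 16 patches yields 196 "tokens" of 768 values each, which then flow through the same transformer blocks as text.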

Multimodal

Combine text and image (or audio, etc.) tokens. Each input modality has its own encoder; the rest of the architecture is shared.

Training

Pretraining

Self-supervised on huge text corpora.

Objective:

BERT: masked language modeling (predict masked tokens)
GPT: next-token prediction
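The next-token objective is cross-entropy between the model's prediction at position t and the actual token at position t+1. A minimal sketch, assuming logits of shape (sequence length, vocabulary size) are already available:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy for predicting token t+1 from position t."""
    preds, targets = logits[:-1], token_ids[1:]   # shift by one position
    preds = preds - preds.max(axis=-1, keepdims=True)
    log_probs = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy check: uniform logits over a 10-token vocabulary give a loss of ln(10).
token_ids = np.array([1, 2, 3, 4, 5])
loss = next_token_loss(np.zeros((5, 10)), token_ids)
```

Masked language modeling uses the same cross-entropy, but only at the masked positions and with bidirectional context.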

Pretraining is expensive ($1M-100M+ for state-of-the-art LLMs).

Fine-tuning

Adapt a pretrained model to a specific task with labeled data.

RLHF (Reinforcement Learning from Human Feedback)

Train a reward model on human preference comparisons, then train the policy (the LLM) to maximize that reward.

Aligns model behavior with human preferences. Used in training ChatGPT, Claude, etc.
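One common way to train the reward model is a pairwise loss that pushes the reward of the preferred response above that of the rejected one. The sketch below assumes this Bradley-Terry style formulation; the page itself does not specify a loss, and the reward values are made up.

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over comparison pairs."""
    margin = np.asarray(reward_chosen) - np.asarray(reward_rejected)
    # -log(sigmoid(m)) == log(1 + exp(-m)); logaddexp keeps this stable.
    return np.logaddexp(0.0, -margin).mean()

# Hypothetical reward-model scores for three (chosen, rejected) pairs.
loss = preference_loss([2.0, 0.5, 1.0], [1.0, 0.7, -1.0])
```

The loss is ln(2) when the reward model cannot tell the two responses apart, and approaches zero as the preferred response is scored increasingly higher.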

Constitutional AI

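Variant in which a written set of principles (for example, "be helpful and harmless") guides the model's own critique and revision of its outputs, with AI feedback replacing much of the human preference labeling. Anthropic's approach.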

Key innovations enabling LLM scale

FlashAttention

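Exact attention computed in tiles that fit in fast on-chip memory, so the full n × n score matrix is never written to GPU main memory. Critical for long contexts.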

Mixture of Experts (MoE)

Multiple FFN "experts"; route each token to a subset.

Total parameter count is much larger than the parameters active for any one token, so capacity grows faster than per-token compute.

Models: Mixtral, DBRX, GPT-4 (rumored), DeepSeek-V2.
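A toy sketch of top-k routing, written with an explicit per-token loop for clarity rather than speed. The router design (softmax over the selected experts' logits) is one common choice and is an assumption here. Names and sizes are hypothetical.

```python
import numpy as np

def moe_layer(x, w_router, experts, top_k=2):
    """x: (n_tokens, d). experts: list of (w_in, w_out) FFN weight pairs."""
    logits = x @ w_router                             # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]     # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        gates = np.exp(chosen - chosen.max())
        gates = gates / gates.sum()                   # softmax over chosen experts
        for gate, e in zip(gates, top[t]):
            w_in, w_out = experts[e]
            # Only the selected experts run for this token.
            out[t] += gate * (np.maximum(x[t] @ w_in, 0.0) @ w_out)
    return out, top

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
           for _ in range(n_experts)]
x = rng.normal(size=(3, d))
out, top = moe_layer(x, rng.normal(size=(d, n_experts)), experts)
```

With 4 experts and top_k=2, each token touches only half of the expert parameters, which is the gap between total and active parameter counts.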

Grouped query attention (GQA)

Groups of query heads share a single key/value head. This shrinks the KV cache and the memory bandwidth needed during decoding.
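A minimal sketch of the sharing, assuming per-head tensors are already projected. Real implementations avoid physically repeating K and V; np.repeat is used here only to make the grouping explicit.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, n, d); k and v: (n_kv_heads, n, d)."""
    group = q.shape[0] // k.shape[0]     # query heads per shared K/V head
    k = np.repeat(k, group, axis=0)      # K/V head g serves query heads g*group ...
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 4))    # 8 query heads
k = rng.normal(size=(2, 5, 4))    # only 2 K/V heads are stored
v = rng.normal(size=(2, 5, 4))
out = grouped_query_attention(q, k, v)
```

In this example the cache holds 2 K/V heads instead of 8, a 4x reduction. With one K/V head per query head this reduces to standard multi-head attention.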

Sliding window attention

Each token attends only to a fixed-size window of recent tokens rather than the full sequence; reduces compute and memory for long context.
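Sliding window attention can be expressed as a mask in which each position sees only itself and the preceding window - 1 positions. The sketch assumes a causal (decoder-style) window; the function name is hypothetical.

```python
import numpy as np

def sliding_window_mask(n_tokens, window):
    """True where query position i may attend to key position j."""
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(n_tokens=6, window=3)
# The last row allows only positions 3, 4, and 5.
```

Each row has at most window allowed entries, so the work per token is bounded by the window size instead of growing with sequence length.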

Long context

Original transformers: 512-2K tokens. Modern: 8K-1M+ tokens.

Achieved through:

Better positional encodings (RoPE)
Sparse / efficient attention
Memory optimizations

Inference

Prefill

Process the prompt. Compute K and V for all tokens. Cache them.

This phase is parallel across prompt tokens, so it is fast per token and typically compute-bound.

Decode

Generate tokens one at a time. Each step reads the model weights and the entire KV cache to produce a single token.

Memory-bandwidth bound, which is why LLM inference is "slow."
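The two phases can be illustrated with a single attention head. In this sketch (toy sizes, random weights, hypothetical names) the prompt's K and V are computed once, and the decode step computes K and V only for the new token.

```python
import numpy as np

def attend(q, k, v):
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Prefill: compute and cache K and V for the whole prompt in one pass.
prompt = rng.normal(size=(5, d))
k_cache, v_cache = prompt @ w_k, prompt @ w_v

# Decode: project only the new token, append to the cache, attend over all of it.
new_token = rng.normal(size=(1, d))
k_cache = np.concatenate([k_cache, new_token @ w_k])
v_cache = np.concatenate([v_cache, new_token @ w_v])
out_cached = attend(new_token @ w_q, k_cache, v_cache)

# Reference: recompute from scratch. The last position sees every earlier
# token under causal masking, so its row needs no explicit mask.
full = np.concatenate([prompt, new_token])
out_full = attend(full @ w_q, full @ w_k, full @ w_v)[-1:]
```

The cached and recomputed outputs for the new token agree, but the cached path does one token's worth of projection instead of six.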

KV cache

Stores K and V from previous tokens. Reused across decode steps.

Memory grows linearly with context length.
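The linear growth follows from the cache layout: two tensors (K and V) per layer, each holding one vector per KV head per token. The configuration below is hypothetical, chosen only to produce round numbers.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes for K and V across all layers, for one sequence."""
    return 2 * n_layers * n_kv_heads * d_head * n_tokens * bytes_per_value

# Hypothetical model: 32 layers, 32 KV heads of size 128, FP16 values.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=32, d_head=128)
at_4k = kv_cache_bytes(4_096, n_layers=32, n_kv_heads=32, d_head=128)
with_gqa = kv_cache_bytes(4_096, n_layers=32, n_kv_heads=8, d_head=128)
```

Under these assumptions the cache costs 0.5 MiB per token and 2 GiB for a 4,096-token sequence; cutting to 8 KV heads brings that down to 0.5 GiB.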

Optimizations

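Continuous batching: add and remove requests from the running batch at each decode step (vLLM, TGI)
PagedAttention: store the KV cache in fixed-size blocks to reduce fragmentation
Speculative decoding: small model proposes; big model verifies
Quantization: store weights, and sometimes activations, in INT8 or INT4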
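Of these, quantization is the easiest to show in a few lines. The sketch uses symmetric per-tensor INT8 quantization, one simple scheme among many and an assumption here; production methods typically use per-channel or per-group scales.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w is approximated by q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
q, scale = quantize_int8(w)
max_error = np.abs(dequantize(q, scale) - w).max()
```

Each weight now takes 1 byte instead of 2 (FP16) or 4 (FP32), at the price of a rounding error of at most about half a quantization step.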

Scaling laws

Empirical findings (Kaplan et al., 2020; Hoffmann et al., 2022, "Chinchilla"):

Loss falls as a power law in model size, data, and compute
Compute-optimal training has a specific data-to-model ratio (Chinchilla: roughly 20 tokens per parameter)

Roughly: bigger model + more data → better performance, predictably.
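Two widely used rules of thumb make this concrete: training compute is approximately 6 FLOPs per parameter per token, and the Chinchilla analysis suggests roughly 20 training tokens per parameter. Both are approximations, not exact laws, and the function names are hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget for a given model size."""
    return tokens_per_param * n_params

n_params = 70 * 10**9                    # a 70B-parameter model
n_tokens = chinchilla_tokens(n_params)   # 1.4 trillion tokens
flops = training_flops(n_params, n_tokens)
```

For a 70B-parameter model this gives about 1.4 trillion tokens and roughly 5.9e23 FLOPs of training compute.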

Emergent capabilities

Some abilities appear suddenly at scale:

In-context learning
Chain-of-thought reasoning
Tool use
Multi-turn instruction following

Exactly why is unclear. Architecture, scale, and training data all interact.

Limitations

Hallucinations

Plausible but wrong outputs. Models predict tokens, not truth.

Computational cost

Even small LLMs are expensive to serve at scale.

Context window

Limited; long-context models exist but quality degrades with length.

Lack of grounding

Without retrieval/tools, models reason from training data only.

Reasoning limitations

Symbolic / mathematical reasoning is fragile. Better with chain-of-thought, tools.

Training data cutoff

Models don't know recent events without retrieval.

Why transformers won

Parallelization (vs sequential RNNs)
Long-range dependencies (attention can be global)
Scalability (architecture works at huge scale)
Hardware fit (matrix multiplications align with GPU/TPU strengths)
Pretraining + fine-tuning paradigm

These advantages compounded; transformers eclipsed RNNs across NLP within a few years.

Where transformers might be challenged

State space models (Mamba): linear-time alternative
Hybrid architectures: combine attention with other mechanisms
Specialized architectures for specific modalities

For now, transformers remain dominant.

Practical takeaways

For practitioners:

You don't need to implement transformers; use libraries
Architecture details matter less than data and training
Pretrained models are the starting point for almost everything
Inference optimization is its own discipline

For researchers: the architecture is mature; innovation happens at training, data, and adjacent components.
