Compiler Design Basics
Most engineers won't write a compiler for a real language. They will write parsers, evaluators, query optimisers, DSL processors, configuration interpreters, code transformations — all compiler-shaped problems.
Knowing the canonical compiler pipeline helps. This page is the working tour: what each phase does, what tools handle it, and what each phase fails at.
The phases
A traditional compiler's pipeline:
```
source code
│
▼ lexer (tokenise)
tokens
│
▼ parser (parse to AST)
abstract syntax tree
│
▼ semantic analysis (type-check, resolve names)
annotated AST
│
▼ lowering to IR
intermediate representation
│
▼ optimisation passes
optimised IR
│
▼ codegen (target-specific)
machine code / bytecode / target language
```
Each phase has well-known techniques, tools, and failure modes.
Lexer (tokeniser)
Converts text into tokens — keywords, identifiers, literals, punctuation.
```
"if (x > 5) { return; }"
→ IF, LPAREN, IDENT("x"), GT, INT(5), RPAREN, LBRACE, RETURN, SEMICOLON, RBRACE
```
Implementation:
- **Regular expressions** for token patterns.
- **State machine** built from those regexes; runs in linear time.
Tools:
- **`re2c`** — generates C lexers.
- **`flex`** — UNIX classic.
- **`lex.rs`, `logos`** — Rust.
- **`ply.lex`** — Python.
- **`antlr4`** lexer subset.
For ad-hoc parsing, hand-rolled lexers are common; for production languages, generators help.
Parser
Converts tokens into a structured tree (Abstract Syntax Tree).
Two main parsing styles:
Recursive descent (top-down)
Hand-write functions for each grammar rule:
```python
def parse_expression():
left = parse_term()
while peek() in (PLUS, MINUS):
op = consume()
right = parse_term()
left = BinaryOp(left, op, right)
return left
def parse_term():
... etc
```
Strengths: easy to write, debug, and understand. Good error messages. Used by most modern compilers (Rust, Go, GCC, clang) for production use.
Weaknesses: have to handle precedence and associativity manually; can be tedious.
Generated parsers (bottom-up)
Specify grammar in a generator's syntax; tool produces a parser.
Tools:
- **`yacc` / `bison`** — UNIX classic; LALR(1).
- **`antlr4`** — modern; LL(*); more flexibility.
- **`tree-sitter`** — incremental parsing for editors; recovery is excellent.
- **PEG parsers** (`pest`, `lark`, `peg.js`) — easier-to-read grammars; potentially slower.
Generated parsers handle precedence in the grammar; less hand-coding. Trade-offs: error messages from generated parsers are often worse; debugging is awkward; a grammar conflict surfaces as a generator error you have to interpret.
For most production compilers, recursive descent has won. For DSLs and quick languages, generators are faster to ship.
Pratt parsing
A specific recursive-descent style that elegantly handles operator precedence. Once-niche; increasingly the default for new compilers.
```
prefix table: { LPAREN: parse_paren, MINUS: parse_unary, ... }
infix table: { PLUS: (binop, 10), TIMES: (binop, 20), ... }
parse_expr(min_precedence) {
left = prefix[token]()
while infix[peek()].precedence >= min_precedence:
...
}
```
Concise; flexible; production-grade. Worth learning.
Semantic analysis
After parsing, the AST is structurally valid but not semantically correct. This phase:
- **Resolves names**: identifiers refer to which declarations.
- **Type-checks**: do types match where they're used.
- **Catches static errors**: undefined variables, type mismatches, illegal references.
Implementation: walk the AST; maintain a symbol table per scope; emit errors when invariants violated.
For complex type systems (Rust's borrow checker, Haskell's type inference), this phase contains most of the language's complexity. Hindley-Milner inference, generic resolution, trait bound checking — all here.
Intermediate Representation (IR)
After type-checking, the AST is lowered to a simpler representation. The IR strips syntax noise and expresses computation in a canonical form.
Common IR shapes:
- **Three-address code (TAC)** — flat sequences of `result = op a b` instructions.
- **Static Single Assignment (SSA)** — each variable assigned once; common in modern optimising compilers.
- **Continuation-passing style (CPS)** — control flow via continuations.
The IR is what optimisation passes operate on. AST optimisations are limited; IR optimisations compose well.
Production compilers:
- **LLVM IR** — the dominant IR for new compilers (Rust, Swift, modern C/C++).
- **MLIR** — Multi-Level IR, for higher-level abstractions before going to LLVM.
- **JVM bytecode** for Java family.
- **CIL** for .NET.
Targeting LLVM means you skip the codegen phase — LLVM's backend produces the actual machine code.
Optimisation
Transformations that preserve semantics while improving performance / size / clarity.
Common optimisations:
- **Constant folding** — compute compile-time-constant expressions.
- **Dead code elimination** — remove code that has no effect.
- **Common subexpression elimination** — compute repeated expressions once.
- **Loop unrolling** — replace small loops with their unrolled form.
- **Function inlining** — replace function calls with the function's body.
- **Strength reduction** — replace expensive operations (multiplication) with cheaper ones (shifts).
- **Tail-call optimisation** — turn tail-recursive functions into loops.
Each is well-understood; LLVM has decades of accumulated passes.
For your hand-rolled language: constant folding and dead-code elimination are worth implementing; the rest depends on what your language does.
Codegen
The final phase: generate the target output.
Targets:
- **Machine code** (x86-64, ARM64). Hardest; most compilers offload to LLVM.
- **Bytecode** (JVM, .NET, WebAssembly). Simpler; portable.
- **Another high-level language** (transpilers — TypeScript→JavaScript, ReScript→JavaScript).
- **An interpreter's IR** for direct execution.
Codegen for native machine code involves register allocation, instruction selection, scheduling. Each is a research area; LLVM handles all of them.
For most modern language projects: target LLVM IR; LLVM does the codegen for every supported architecture.
Common compiler-shaped problems
When you'll write compiler-like code without knowing it:
Query optimisers
A query is parsed; a logical plan is built; rewrites apply; cost-based optimisation picks an execution plan.
Compiler patterns: parsing, semantic analysis, IR transformations, plan selection. PostgreSQL, BigQuery, every database has one.
Templating engines
Mustache, Jinja2, Handlebars. Parse the template; walk the AST applying values; produce output.
Configuration languages
YAML / JSON / TOML / HCL / Cue. Each is parsed; some have semantics (HCL has expressions; Cue has constraints); generate effective config.
DSL processors
A specific in-house mini-language. Parse, validate, execute / transpile.
Code transformers
Babel for JavaScript, AST tools for Python, Rust's macro_rules and proc_macro. Walk an AST; apply transformations; emit modified code.
Knowing compiler basics makes these tasks tractable instead of feeling exotic.
A pragmatic mini-compiler
For a small expression language:
```
Source: "(2 + 3) * 4"
Lexer (regex / hand-rolled):
LPAREN, INT(2), PLUS, INT(3), RPAREN, TIMES, INT(4)
Parser (recursive descent / Pratt):
BinaryOp(BinaryOp(2, '+', 3), '*', 4)
Evaluator (walk AST):
visit(BinaryOp(2, '+', 3)) → 5
visit(BinaryOp(5, '*', 4)) → 20
```
That's interpretation. For compilation, instead of evaluating, emit code:
```
PUSH 2
PUSH 3
ADD
PUSH 4
MUL
```
This minimal pipeline scales surprisingly far. Most production DSLs aren't much more complicated.
What to skip if you're not building a compiler
- **Optimisation passes for fun.** Use LLVM if you need them.
- **Register allocation.** LLVM, again.
- **Assembly output.** LLVM.
Most engineers writing compiler-shaped code stop at IR generation. The interpretation or transformation step takes over from there.
Tools worth knowing
- **`tree-sitter`** — incremental parsing; powers GitHub code-aware features and many editors. For analysing code in arbitrary languages, it's the right tool.
- **`antlr4`** — for grammar-driven parsing.
- **LLVM** — if you're producing native code.
- **`ast-grep`, `comby`** — pattern-based code transformation.
- **`Rascal`, `JetBrains MPS`** — language-workbench tools for serious DSL development.
A pragmatic recommendation
For most engineers:
1. **Learn recursive descent + Pratt parsing.** Sufficient for most real DSL needs.
2. **AST visitors** as the standard way to analyse / transform.
3. **Don't write a full compiler** unless your job is "language designer."
4. **For analysing other languages' code**, use `tree-sitter` or the language's official AST library.
5. **For building DSLs**, evaluate "is this really a DSL or just a configuration language?" Often the latter is enough.
Further reading
- [DataStructures]() — ASTs, symbol tables
- [TypeSystemsComparison]() — semantic analysis context
- [AbstractAlgebra]() — for the formal-language theory underneath