Compiler Design Basics

Most engineers won't write a compiler for a real language. They will write parsers, evaluators, query optimisers, DSL processors, configuration interpreters, code transformations — all compiler-shaped problems.

Knowing the canonical compiler pipeline helps. This page is the working tour: what each phase does, what tools handle it, and what each phase fails at.

The phases

A traditional compiler's pipeline:

source code
    │
    ▼ lexer (tokenise)
tokens
    │
    ▼ parser (parse to AST)
abstract syntax tree
    │
    ▼ semantic analysis (type-check, resolve names)
annotated AST
    │
    ▼ lowering to IR
intermediate representation
    │
    ▼ optimisation passes
optimised IR
    │
    ▼ codegen (target-specific)
machine code / bytecode / target language

Each phase has well-known techniques, tools, and failure modes.

Lexer (tokeniser)

Converts text into tokens — keywords, identifiers, literals, punctuation.

"if (x > 5) { return; }"

→ IF, LPAREN, IDENT("x"), GT, INT(5), RPAREN, LBRACE, RETURN, SEMICOLON, RBRACE

Implementation:

Regular expressions for token patterns.
State machine built from those regexes; runs in linear time.

Tools:

re2c — generates C lexers.
flex — UNIX classic.
lex.rs, logos — Rust.
ply.lex — Python.
antlr4 lexer subset.

For ad-hoc parsing, hand-rolled lexers are common; for production languages, generators help.

Parser

Converts tokens into a structured tree (Abstract Syntax Tree).

Two main parsing styles:

Recursive descent (top-down)

Hand-write functions for each grammar rule:

def parse_expression():
    left = parse_term()
    while peek() in (PLUS, MINUS):
        op = consume()
        right = parse_term()
        left = BinaryOp(left, op, right)
    return left

def parse_term():
    # ... etc

Strengths: easy to write, debug, and understand. Good error messages. Used by most modern compilers (Rust, Go, GCC, clang) for production use.

Weaknesses: have to handle precedence and associativity manually; can be tedious.

Generated parsers (bottom-up)

Specify grammar in a generator's syntax; tool produces a parser.

Tools:

yacc / bison — UNIX classic; LALR(1).
antlr4 — modern; LL(*); more flexibility.
tree-sitter — incremental parsing for editors; recovery is excellent.
PEG parsers (pest, lark, peg.js) — easier-to-read grammars; potentially slower.

Generated parsers handle precedence in the grammar; less hand-coding. Trade-offs: error messages from generated parsers are often worse; debugging is awkward; a grammar conflict surfaces as a generator error you have to interpret.

For most production compilers, recursive descent has won. For DSLs and quick languages, generators are faster to ship.

Pratt parsing

A specific recursive-descent style that elegantly handles operator precedence. Once-niche; increasingly the default for new compilers.

prefix table: { LPAREN: parse_paren, MINUS: parse_unary, ... }
infix table:  { PLUS: (binop, 10), TIMES: (binop, 20), ... }

parse_expr(min_precedence) {
    left = prefix[token]()
    while infix[peek()].precedence >= min_precedence:
        ...
}

Concise; flexible; production-grade. Worth learning.

Semantic analysis

After parsing, the AST is structurally valid but not semantically correct. This phase:

Resolves names: identifiers refer to which declarations.
Type-checks: do types match where they're used.
Catches static errors: undefined variables, type mismatches, illegal references.

Implementation: walk the AST; maintain a symbol table per scope; emit errors when invariants violated.

For complex type systems (Rust's borrow checker, Haskell's type inference), this phase contains most of the language's complexity. Hindley-Milner inference, generic resolution, trait bound checking — all here.

Intermediate Representation (IR)

After type-checking, the AST is lowered to a simpler representation. The IR strips syntax noise and expresses computation in a canonical form.

Common IR shapes:

Three-address code (TAC) — flat sequences of result = op a b instructions.
Static Single Assignment (SSA) — each variable assigned once; common in modern optimising compilers.
Continuation-passing style (CPS) — control flow via continuations.

The IR is what optimisation passes operate on. AST optimisations are limited; IR optimisations compose well.

Production compilers:

LLVM IR — the dominant IR for new compilers (Rust, Swift, modern C/C++).
MLIR — Multi-Level IR, for higher-level abstractions before going to LLVM.
JVM bytecode for Java family.
CIL for .NET.

Targeting LLVM means you skip the codegen phase — LLVM's backend produces the actual machine code.

Optimisation

Transformations that preserve semantics while improving performance / size / clarity.

Common optimisations:

Constant folding — compute compile-time-constant expressions.
Dead code elimination — remove code that has no effect.
Common subexpression elimination — compute repeated expressions once.
Loop unrolling — replace small loops with their unrolled form.
Function inlining — replace function calls with the function's body.
Strength reduction — replace expensive operations (multiplication) with cheaper ones (shifts).
Tail-call optimisation — turn tail-recursive functions into loops.

Each is well-understood; LLVM has decades of accumulated passes.

For your hand-rolled language: constant folding and dead-code elimination are worth implementing; the rest depends on what your language does.

Codegen

The final phase: generate the target output.

Targets:

Machine code (x86-64, ARM64). Hardest; most compilers offload to LLVM.
Bytecode (JVM, .NET, WebAssembly). Simpler; portable.
Another high-level language (transpilers — TypeScript→JavaScript, ReScript→JavaScript).
An interpreter's IR for direct execution.

Codegen for native machine code involves register allocation, instruction selection, scheduling. Each is a research area; LLVM handles all of them.

For most modern language projects: target LLVM IR; LLVM does the codegen for every supported architecture.

Common compiler-shaped problems

When you'll write compiler-like code without knowing it:

Query optimisers

A query is parsed; a logical plan is built; rewrites apply; cost-based optimisation picks an execution plan.

Compiler patterns: parsing, semantic analysis, IR transformations, plan selection. PostgreSQL, BigQuery, every database has one.

Templating engines

Mustache, Jinja2, Handlebars. Parse the template; walk the AST applying values; produce output.

Configuration languages

YAML / JSON / TOML / HCL / Cue. Each is parsed; some have semantics (HCL has expressions; Cue has constraints); generate effective config.

DSL processors

A specific in-house mini-language. Parse, validate, execute / transpile.

Code transformers

Babel for JavaScript, AST tools for Python, Rust's macro_rules and proc_macro. Walk an AST; apply transformations; emit modified code.

Knowing compiler basics makes these tasks tractable instead of feeling exotic.

A pragmatic mini-compiler

For a small expression language:

Source: "(2 + 3) * 4"

Lexer (regex / hand-rolled):
  LPAREN, INT(2), PLUS, INT(3), RPAREN, TIMES, INT(4)

Parser (recursive descent / Pratt):
  BinaryOp(BinaryOp(2, '+', 3), '*', 4)

Evaluator (walk AST):
  visit(BinaryOp(2, '+', 3)) → 5
  visit(BinaryOp(5, '*', 4)) → 20

That's interpretation. For compilation, instead of evaluating, emit code:

PUSH 2
PUSH 3
ADD
PUSH 4
MUL

This minimal pipeline scales surprisingly far. Most production DSLs aren't much more complicated.

What to skip if you're not building a compiler

Optimisation passes for fun. Use LLVM if you need them.
Register allocation. LLVM, again.
Assembly output. LLVM.

Most engineers writing compiler-shaped code stop at IR generation. The interpretation or transformation step takes over from there.

Tools worth knowing

tree-sitter — incremental parsing; powers GitHub code-aware features and many editors. For analysing code in arbitrary languages, it's the right tool.
antlr4 — for grammar-driven parsing.
LLVM — if you're producing native code.
ast-grep, comby — pattern-based code transformation.
Rascal, JetBrains MPS — language-workbench tools for serious DSL development.

A pragmatic recommendation

For most engineers:

Learn recursive descent + Pratt parsing. Sufficient for most real DSL needs.
AST visitors as the standard way to analyse / transform.
Don't write a full compiler unless your job is "language designer."
For analysing other languages' code, use tree-sitter or the language's official AST library.
For building DSLs, evaluate "is this really a DSL or just a configuration language?" Often the latter is enough.