Database Migration Strategies

Database migrations are where engineering hits operations hardest. A schema change that takes 15 minutes in dev can take a database lock for 15 minutes in production, blocking every write. The discipline is making changes that look additive to the database while behaviour evolves in the application.

What can go wrong

The classic outage scenarios:

ALTER TABLE takes an ACCESS EXCLUSIVE lock; the table is unwritable for the duration.
CREATE INDEX (without CONCURRENTLY) locks the table.
Long-running migration on a billion-row table takes hours; another transaction conflicts; the database wedges.
Application deploys the new code that requires the new column; column not yet there; everyone errors.
Migration drops the old column while half the fleet still reads it.

Each of these is a real production incident pattern. The migration patterns below are how to avoid them.

The migration tools

A migration tool versions every schema change and runs them in order. Standard options:

Flyway — Java; widely used; mature.
Liquibase — Java/XML; more declarative; also widely used.
Alembic — Python; for SQLAlchemy.
db-migrate / Knex / Prisma Migrate — Node.js.
golang-migrate / sqlx-cli / sqlc — Go.
dbmate — language-agnostic, simple.
sqlx migrate — Rust.

The tool's job: apply migrations idempotently, in order, recording what's been applied. The hard part isn't the tool; it's the migration content.

For most teams, pick the standard tool for your stack and don't overthink it.

Migration discipline rules

Before any migration:

It's a versioned file, named with a sequence number or timestamp.
It's idempotent: re-running it produces the same outcome (use IF NOT EXISTS, ON CONFLICT DO NOTHING).
It's reviewed: in the same PR as the application change that needs it.
It's never edited after merge: if you got it wrong, write a new migration.
It's tested: applied to staging that reflects production-scale data.

The "never edited after merge" rule prevents the worst kind of drift: prod has migration v17 that's different from v17 in dev because someone rewrote it.

What's safe and what's not

For Postgres specifically (other databases similar but vary):

Operation	Lock level	Safe online?
`CREATE TABLE`	None	Yes
`DROP TABLE`	ACCESS EXCLUSIVE	Yes if no concurrent reads (which is hard to guarantee)
`ADD COLUMN` (with default)	ACCESS EXCLUSIVE briefly (Postgres 11+)	Yes, with caveats
`ADD COLUMN` (without default)	None / minimal	Yes
`DROP COLUMN`	ACCESS EXCLUSIVE	Yes briefly; but app must be ready
`ALTER COLUMN TYPE`	ACCESS EXCLUSIVE; rewrites table	No — rewrites everything
`ALTER COLUMN SET NOT NULL`	ACCESS EXCLUSIVE	Yes if the column has no nulls; brief
`ADD CONSTRAINT FOREIGN KEY`	SHARE ROW EXCLUSIVE; validates	Use `NOT VALID` then `VALIDATE`
`ADD CONSTRAINT CHECK`	ACCESS EXCLUSIVE while validating	Use `NOT VALID` then `VALIDATE`
`CREATE INDEX`	SHARE	No — blocks writes
`CREATE INDEX CONCURRENTLY`	None	Yes — slower but doesn't block
`DROP INDEX`	ACCESS EXCLUSIVE briefly	Yes
`DROP INDEX CONCURRENTLY`	None	Yes
`RENAME COLUMN`	ACCESS EXCLUSIVE briefly	Don't — see expand/contract

In MySQL, similar concerns plus its specific implementation quirks; tools like gh-ost and pt-online-schema-change handle the table-rewrite cases without locking.

The expand-contract pattern

For any change that isn't a simple "add a column with a default":

1. EXPAND  — Add the new structure. Old code still works.
2. MIGRATE — Backfill data. Application uses both old and new in parallel.
3. CONTRACT — Remove the old structure. Application uses only new.

Each phase is independently safe. The cost is more migrations and more code paths during transition; the benefit is no big-bang outages.

Example: rename a column

Bad: ALTER TABLE users RENAME COLUMN phone TO phone_number;

This is fast (no rewrite) but the application has the old name; deploys are fragile.

Better, expand-contract:

Expand: ALTER TABLE users ADD COLUMN phone_number TEXT;. Application writes both phone and phone_number. Reads still use phone.
Backfill: UPDATE users SET phone_number = phone WHERE phone_number IS NULL; (in batches if needed).
Migrate readers: deploy code that reads from phone_number, falls back to phone. Then deploy code that reads only phone_number.
Migrate writers: deploy code that writes only to phone_number.
Contract: ALTER TABLE users DROP COLUMN phone;. Now safe — nobody uses it.

5+ deploys instead of 1; zero downtime; reversible at every step.

Example: change a column type

Bad: ALTER TABLE orders ALTER COLUMN amount TYPE NUMERIC(10,2);

On a 1B-row table, this rewrites the whole table — hours of ACCESS EXCLUSIVE.

Better, expand-contract:

Expand: ADD COLUMN amount_v2 NUMERIC(10,2);.
Trigger copies new writes: CREATE TRIGGER ... amount_v2 := amount;.
Backfill: in batches, UPDATE orders SET amount_v2 = amount::numeric WHERE amount_v2 IS NULL; with batch size and pauses.
Migrate readers to use amount_v2.
Contract: drop amount, drop trigger, optionally rename amount_v2 → amount.

Days instead of seconds, but no downtime.

Example: add a NOT NULL column

Bad: ALTER TABLE orders ADD COLUMN status TEXT NOT NULL; — fails on existing rows.

Better:

ADD COLUMN status TEXT; — nullable.
Backfill: UPDATE orders SET status = 'unknown' WHERE status IS NULL; (in batches).
ALTER TABLE orders ALTER COLUMN status SET NOT NULL; — fast in Postgres if all rows have values.
Application starts writing real status values.

Three migrations; no big locks; reversible.

Locking under load

Even "safe" operations interact poorly with long-running transactions:

ADD COLUMN waits for all current transactions to finish before it can take its momentary lock.
A long-running SELECT can block a quick ADD COLUMN from acquiring its lock.
Other queries queue behind the waiting ADD COLUMN.

Result: a "fast" migration takes hours because of one transaction holding things open.

Defences:

SET lock_timeout = '5s' before the migration — fail fast instead of holding everything up.
Set statement_timeout to prevent runaway migrations.
Run during low-traffic windows when possible, even for "safe" migrations.
Kill long-running competitors if needed (they probably shouldn't be running anyway).

Postgres-specific tooling: pg_repack for table rewrites without long locks; pg_squeeze for similar.

Backfills at scale

For large tables, backfilling all rows in one transaction is dangerous (locks, replication lag, transaction-id wraparound). Batch:

DO $$DECLARE
    batch_size INT := 10000;
    last_id BIGINT := 0;
BEGIN
    LOOP
        UPDATE orders 
        SET status = 'unknown' 
        WHERE id > last_id AND status IS NULL
        ORDER BY id LIMIT batch_size;
        EXIT WHEN NOT FOUND;
        last_id := (SELECT MAX(id) FROM orders WHERE status IS NULL);
        PERFORM pg_sleep(0.1);  -- ease pressure
    END LOOP;
END$$;

For very large tables, use a job system (background worker) rather than SQL DO blocks — better observability and recoverability.

Rollbacks

The classic question: "should migrations have down scripts?"

Pragmatic answer: rarely useful. By the time you'd run a down migration, the application has likely already deployed code expecting the new schema. The down migration becomes a fresh forward migration ("add the column back").

Rollback strategy that works: don't migrate to changes you can't safely roll back from. Use expand-contract; each phase is reversible by deploying older code.

Some tools support down migrations for development convenience (resetting a dev database). Treat those as dev-only; don't rely on them in production.

Multi-database / multi-tenant migrations

For schema-per-tenant or DB-per-tenant systems, migrations apply to many databases. Considerations:

Idempotency is even more critical (some tenants may already be migrated, others not).
Parallel application speeds up large fleets but raises concurrency concerns.
Per-tenant rollout lets you stop on first failure.

Tools: most migration tools handle this if you script the loop; some have built-in multi-database support.

Failure modes seen in the wild

Migration ran in dev, looks fine; took down prod. Different scale; different lock behaviour. Always test on production-equivalent data volumes.
Migration completed but data wasn't backfilled. App expected backfilled values; got nulls; surprised. Always audit post-migration.
Migration deployed; old code still running on some hosts. Mid-rollout. Code must tolerate either schema during deploy.
Migration applied but tracking table didn't update. Migration runs again; double-applied. Use idempotent migrations.
Long-running transaction blocks migration; everything queues; site goes down. Set lock_timeout.

Each of these is a real outage. Each has a cheap mitigation. Apply them.