Database Indexing Strategies

An index is a data structure the database maintains alongside a table so that some queries can answer in O(log N) instead of O(N). Pick the wrong index, or no index, and the query planner does what it has to do — read every row.

Most index problems are one of two: (a) the right index doesn't exist, or (b) the index exists but the planner doesn't use it. This page is a working catalogue, focused on Postgres because it's the most expressive but the principles transfer.

The index types you actually use

Type	Postgres name	What it indexes	Best for
B-tree	`BTREE` (default)	Ordered scalar columns	Equality, range, ORDER BY, LIMIT
Hash	`HASH`	Hashed scalar columns	Equality only — almost always B-tree wins instead
GIN	`GIN`	Composite values (arrays, JSONB, full-text)	"Does the array contain X", "is this JSON path matching"
GiST	`GIST`	Geometric/spatial, range types, custom	Geographic queries, range overlap
BRIN	`BRIN`	Block range summaries	Huge tables with naturally clustered values (time-series)
Bloom	`bloom` extension	Set membership	Many low-selectivity columns combined
Hash (unique)	n/a (use B-tree)	n/a	Don't reach for hash. Postgres B-tree handles equality fine.

For 90% of OLTP queries, B-tree is the answer. The rest of the table matters when your workload has specific shapes.

When each index type wins

B-tree

Default. Sorted, supports equality and range. Multi-column B-tree on (a, b, c) supports queries with a, with (a, b), with (a, b, c) — but NOT just b or just c. The "leftmost-prefix" rule.

Use covering indexes (INCLUDE) to add non-key columns the query needs without paying for them in the sort key:

CREATE INDEX idx_orders_user ON orders (user_id) INCLUDE (status, total);

This makes index-only scans work for SELECT status, total FROM orders WHERE user_id = ?. Saves the trip to the heap.

GIN

For arrays, JSONB, full-text, trigram. The right answer for:

WHERE tags @> ARRAY['urgent', 'paid'] — JSONB or array containment.
WHERE doc_tsv @@ to_tsquery('database & sharding') — full-text.
WHERE name % 'jak' — trigram for fuzzy match.

Trade-off: GIN indexes are slower to update (especially under heavy write workloads) than B-tree. For read-heavy JSONB columns, very much worth it. For write-heavy, profile before adding.

BRIN

For tables where physical order on disk correlates with the indexed column — typically time-series. Stores summaries of block ranges (min/max per range). Tiny on disk. Useful only when the correlation is real.

CREATE INDEX idx_metrics_time ON metrics USING BRIN (recorded_at)
  WITH (pages_per_range = 32);

A BRIN on a 1 TB time-series table might be 50 MB compared to a 100 GB B-tree. Lookup is approximate but precise enough — the planner narrows to a few hundred MB then scans them.

If you don't have natural ordering, BRIN doesn't help. CLUSTER rebuilds the table in index order (offline, expensive); for time-series, ordering tends to be natural.

GiST

Geographic queries (PostGIS), range type queries, custom data types. Pick when those apply; otherwise ignore.

Partial indexes

An index with a WHERE clause. Indexes only the rows that match.

CREATE INDEX idx_orders_active ON orders (user_id) WHERE status = 'active';

Wins when:

The filtered subset is small (< 10% of the table).
Queries always include the same filter.

The index is smaller, faster to maintain, and can dramatically speed up the targeted query. Underused in practice.

Selectivity decides whether the index runs

The query planner uses an index only when it estimates the index will be selective enough that the savings outweigh the heap-fetch overhead. A common rule of thumb: at > 10% of rows, sequential scan often wins.

This is why:

An index on is_deleted (boolean) usually doesn't help — half the table matches.
An index on country_code is bad if 95% of users are in one country — for that country, the planner picks seq-scan.
The fix for low-selectivity columns is usually composing them: (country_code, signup_date) cuts the matching set faster than either alone.

Check the planner's choice with EXPLAIN (ANALYZE, BUFFERS). If you expected an index scan and got a sequential scan, the planner probably has accurate statistics and is right; investigate the row counts before "fixing" with a hint.

Multi-column indexes: the order matters

Index (a, b, c) supports queries on a, (a, b), (a, b, c). It does not support queries on b alone or c alone — they'd need their own indexes.

Heuristic for ordering:

Equality columns first. WHERE a = 1 AND b BETWEEN 10 AND 20 — equality on a first.
High-selectivity columns first within the equality group.
Range / sort columns last. ORDER BY columns at the end of the index let the index satisfy the sort.

Common mistake: indexing (created_at, user_id) when the query is WHERE user_id = ? ORDER BY created_at. The index won't help — it's ordered by created_at first. Reverse the order.

Index-only scans

When all columns the query needs are in the index (key columns or INCLUDE columns), Postgres can answer entirely from the index without touching the heap. This is a 5–50× speedup.

For an index-only scan, Postgres also needs the visibility map up to date — VACUUM regularly. A table with high update churn and no recent vacuum can defeat index-only scans even if the index covers all needed columns.

Index maintenance, the part people skip

Indexes degrade. They bloat under heavy update workloads. They become invalid after partition operations. They develop dead tuples after lots of deletes.

Periodic REINDEX CONCURRENTLY. Postgres bloat accumulates; rebuilding compacts. Quarterly for high-churn indexes.
Watch pg_stat_user_indexes. idx_scan = 0 over months means no one's using this index. Drop it; reduce write amplification.
Watch pg_stat_indexes for bloat. pgstattuple shows free-space ratios. Above 50% free space = candidate for REINDEX.
Don't over-index. Each index slows down inserts and updates by a measurable amount. 5 indexes on a hot table can cut write throughput by 30–40% versus 1.

The bias most teams have is "more indexes are better." It isn't. Profile the actual queries, index for those, drop the rest.

Failure modes

Function on the indexed column kills the index. WHERE LOWER(email) = ? doesn't use an index on email. Either index LOWER(email) (functional index) or store the lowercased version.

Implicit type casts kill the index. WHERE id = '42' (int column, string literal) sometimes prevents index use. Match types.

OR queries often kill index usage. Combine with UNION ALL of two indexed queries, or use a multi-column index that covers both branches.

Statistics out of date. After bulk loads, ANALYZE to update stats. Old stats lead to wrong plans (sequential scan when index would win, or vice versa).

Index used but heap is the bottleneck. Index says "go fetch row X"; row X is on cold disk. Profile via EXPLAIN BUFFERS; consider covering index or partitioning.

Tools

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) — first stop for any slow query. Read the plan; understand where time goes.
pg_stat_statements — top N queries by total time. Aggregate; great for finding what to fix first.
pgbadger — log-based slow-query report; useful for retrospective analysis.
pgstattuple — bloat measurement.
HypoPG — try a hypothetical index without creating it; planner shows whether it would help.

A diagnostic workflow

Find slow query (via pg_stat_statements).
EXPLAIN ANALYZE it. What does the planner do?
If the wrong index is chosen → check stats with ANALYZE, check selectivity assumptions.
If no index exists → add one. Pick the right type. Verify the planner uses it.
If index exists but isn't used → check function-on-column, type cast, OR clause.
Re-run; verify before/after.

This loop catches 80% of indexing problems. The other 20% are workload-shape issues (write-heavy + read-heavy on the same hot row, partitioning needed, sharding needed) that go beyond indexing.