Database Performance Monitoring

Database problems are usually slow-developing — the system gets gradually worse over weeks until it tips over and a query that used to be 10ms is 30s. Catching this requires monitoring. The right metrics turn slow degradation into a Tuesday-morning "we should look at that" instead of a 3am page.

This page is the working set of metrics for Postgres specifically; principles transfer.

The signals that matter

Five categories. Get these and you catch most database problems.

Category	Top metrics
Connections	Active + idle counts, max-connection limit utilisation, pool wait time
Queries	p95/p99 latency, slow-query rate, top queries by total time
Locks	Lock wait time, deadlock count, longest-held locks
I/O & cache	Cache hit ratio, dirty page rate, disk I/O wait
Replication & WAL	Replication lag, WAL volume, archive failures

Each of these has a sane Postgres view to read from. Most observability stacks have prebuilt exporters (postgres_exporter for Prometheus); use them.

Connections

Connection pressure is the single most common Postgres issue. Postgres uses one process per connection; max_connections defaults to 100, can go higher but doesn't scale linearly.

Track:

pg_stat_activity — current connections by state (active, idle, idle in transaction).
idle in transaction count — these are bugs in your application (transactions started but not committed). High counts indicate connection leaks.
Connection-pool wait time — how long client requests wait for a free connection. Should be near zero.

Alert:

Active connections > 80% of max_connections for 5+ minutes.
idle in transaction count > 5 for any sustained period.
Pool wait time p95 > 100ms.

The almost-universal fix for connection pressure is PgBouncer in transaction mode in front of the database. PgBouncer multiplexes thousands of client connections to a pool of hundreds (or fewer) backend connections. Mandatory at any meaningful scale.

Queries: pg_stat_statements

pg_stat_statements is the most useful Postgres extension. It records normalised query stats — total_exec_time, calls, mean_exec_time, rows — for every query the database has seen.

Top-by-total-time view:

SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

This tells you what to optimise. The query at the top of "total time" is where the database is actually spending its life — even if it's individually fast, high call counts add up.

Alert:

Mean exec time of any top-50 query > 100ms (or whatever your threshold is).
New query appears in top-10 (regression from a deploy).
Total exec time per minute trending up week-over-week (workload growth).

Slow query log

Set log_min_duration_statement = 500 (or wherever your latency threshold is). Slow queries land in the Postgres log. Tools like pgBadger summarise.

Use it as the complement to pg_stat_statements — the latter aggregates, the former gives you full bound-parameter examples to reproduce.

Locks and waits

pg_locks joined with pg_stat_activity shows currently-held locks and waiting queries.

Key metrics:

Long-held locks (> 1s for ROW EXCLUSIVE; > 100ms for ACCESS EXCLUSIVE).
Lock-wait queue length — queries waiting on others.
Deadlock count (pg_stat_database.deadlocks) — should be near zero. Non-zero means application code with conflicting transaction ordering.

Most problems show up here:

DDL operations holding ACCESS EXCLUSIVE waiting on a long-running query, blocking everything.
Application code with SELECT FOR UPDATE holding rows; another transaction hangs.
Autovacuum bumping into a long transaction, can't proceed.

I/O and cache

Postgres uses shared buffers (configurable) plus the OS page cache.

pg_stat_database.blks_hit / (blks_hit + blks_read) — buffer cache hit ratio. > 99% is healthy; < 95% means working set doesn't fit and you're going to disk constantly.
pg_statio_user_tables — per-table I/O stats. Identifies which tables are I/O-heavy.
Disk wait time on the host — iostat, node-exporter. High await or util indicates I/O bottleneck.

Solutions:

More shared buffers (typical: 25% of RAM up to ~16GB).
Larger machine (more RAM = more page cache).
Better indexes (less data to scan).
Partitioning to keep hot working sets smaller.

Replication

For setups with replicas:

Replication lag in seconds — pg_stat_replication.replay_lag on the primary, or pg_last_xact_replay_timestamp() on the replica.
WAL volume produced/sec — surge often indicates a runaway query writing many rows.
Replication slot status — physical/logical slots that fall behind cause WAL retention on the primary; disk fills.

Alert:

Replica lag > 30s (or whatever your tolerance is).
WAL volume > 2× baseline.
Any replication slot inactive for > 5 minutes.

Autovacuum / bloat

Postgres MVCC creates dead tuples; autovacuum cleans them up. When autovacuum can't keep up, tables bloat.

Track:

Dead tuple ratio per table (pg_stat_user_tables.n_dead_tup / n_live_tup).
Last autovacuum time per table — should be recent for active tables.
Autovacuum running count — surge means catching up; sustained = workers full = problem.

Tooling: pgstattuple extension shows true bloat. pg_repack rebuilds bloated tables online.

A high-write table with > 50% dead-tuple ratio means autovacuum is losing. Tune autovacuum_vacuum_scale_factor lower for that table; check that long-running transactions aren't blocking vacuum.

What to put on the dashboard

The dashboard a DBA actually looks at:

Connection panel — active, idle, idle in transaction, max.
Query latency panel — p50, p95, p99 over time. Top 10 slow queries by total time.
Lock panel — current waiting queries, longest held lock, recent deadlocks.
Cache panel — hit ratio, top tables by I/O.
Replication panel — replica lag, WAL produced/sec.
Bloat panel — top bloated tables, autovacuum activity.

If you're staring at this for the first time and one of the panels is missing, that's where the next outage is hiding.

Tools

pg_stat_statements — non-negotiable. Enable in postgresql.conf.
postgres_exporter + Prometheus + Grafana — open-source stack. Many grafana dashboards exist; start with the official Postgres one.
pgBadger — log analysis, slow-query reports.
Datadog / Sentry / commercial APM — paid options; integrate well.
pgstattuple — bloat measurement, when needed.
auto_explain — automatically logs query plans for slow queries; invaluable for "why is this slow."

A minimum starter stack

For a team setting up Postgres monitoring from scratch:

- Enable pg_stat_statements
- Enable auto_explain (log_min_duration = 1000, log_analyze = on)
- Run pgBouncer in transaction mode in front of Postgres
- Run postgres_exporter; ship to Prometheus
- Build grafana dashboard with the six panels above
- Set log_min_duration_statement = 500 to catch slow queries
- Configure pgBadger to run nightly on logs
- Set up alerts on the metrics above

A day's work; permanent operational visibility.

Failure modes monitoring catches

Real examples this monitoring stack catches:

"Why is the app slow?" → connection pool wait time spiked → check idle in transaction → the new feature has a missing commit(). Fix.
"Why did this query slow down?" → pg_stat_statements shows it changed plan → auto_explain shows missing index → add index.
"We seem to be running out of disk?" → replication slot inactive → WAL retention growing forever → fix the dead replica.
"Updates seem stuck?" → lock-wait dashboard shows row lock held by long-running transaction → kill it.

Without monitoring, each of these is a frantic investigation. With it, a click and a fix.

Database Performance Monitoring

The signals that matter

Connections

Queries: pg_stat_statements

Slow query log

Locks and waits

I/O and cache

Replication

Autovacuum / bloat

What to put on the dashboard

Tools

A minimum starter stack

Failure modes monitoring catches

Further reading