Database Backup Strategies

Database backups exist to recover from data loss: hardware failure, accidental deletion, malicious action, application bugs that corrupt data. The goal isn't "we have backups" — it's "we can restore."

The difference matters. Many organizations have backups they've never restored. They're unverified. The first restore attempt during a real incident is the worst time to discover problems.

This page covers the practices that produce restorable backups.

Backup types

Full backup

Complete copy of the database. Largest and slowest to take, but the simplest to restore because it depends on no other backup.

Incremental backup

Changes since the last backup (full or incremental). Smaller and faster to take, but a restore needs the last full backup plus every incremental since, applied in order.

Differential backup

Changes since the last full backup. Larger than an incremental, but restore is simpler: the last full backup plus the latest differential.
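
To make the restore chains concrete, here is a minimal sketch that picks which backups a restore needs. The `Backup` record and the `restore_chain` function are hypothetical names for illustration, and the sketch assumes a schedule that uses either incrementals or differentials between full backups, not both.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    kind: str           # "full", "incremental", or "differential"
    taken_at: datetime


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the backups to apply, in order, to reach the newest state at or before target."""
    usable = sorted((b for b in backups if b.taken_at <= target), key=lambda b: b.taken_at)
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup at or before the target time")

    # Every chain starts from the most recent full backup.
    last_full = full_positions[-1]
    later = usable[last_full + 1:]
    differentials = [b for b in later if b.kind == "differential"]
    incrementals = [b for b in later if b.kind == "incremental"]
    if differentials and incrementals:
        raise ValueError("mixed schedules are not handled in this sketch")

    if differentials:
        # A differential holds everything since the full, so only the newest is needed.
        return [usable[last_full], differentials[-1]]
    # Incrementals each hold changes since the previous backup, so all are needed.
    return [usable[last_full], *incrementals]
```

With a Sunday full backup and daily backups afterwards, a Wednesday restore needs four files under an incremental schedule and two under a differential schedule.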

Continuous archiving / WAL shipping

PostgreSQL: write-ahead log files shipped continuously. Enables point-in-time recovery to any moment.

For most production systems, continuous archiving + periodic full backups is the standard.

Recovery objectives

RPO (Recovery Point Objective)

How much data are you willing to lose? With daily backups, up to 24 hours. With continuous archiving, typically seconds to a few minutes, depending on how often log segments are shipped.

RTO (Recovery Time Objective)

How long can recovery take? Tied to backup type and size.

Match RPO/RTO to business needs. Tighter requirements cost more.
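
A rough way to check a plan against these objectives is simple arithmetic. The sketch below is illustrative only: the function names, the sizes, and the throughput figure are assumptions, and real restore time also includes provisioning, log replay, and validation.

```python
def estimated_restore_hours(backup_size_gb: float,
                            throughput_mb_per_s: float,
                            log_replay_hours: float = 0.0) -> float:
    """Estimate restore time as copy time plus any log replay time."""
    copy_seconds = backup_size_gb * 1024 / throughput_mb_per_s
    return copy_seconds / 3600 + log_replay_hours


def meets_objectives(backup_interval_hours: float,
                     restore_hours: float,
                     rpo_hours: float,
                     rto_hours: float) -> bool:
    """Worst-case data loss equals the interval between recovery points."""
    return backup_interval_hours <= rpo_hours and restore_hours <= rto_hours


# Hypothetical example: a 500 GB backup restored at 200 MB/s.
restore = estimated_restore_hours(500, 200)
# Daily backups cannot meet a one-hour RPO, however fast the restore is.
daily_ok = meets_objectives(24, restore, rpo_hours=1, rto_hours=4)
```

Measured numbers from a real restore test are a better input than estimates like these.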

Cloud-managed databases

For RDS, Aurora, Cloud SQL, etc., backups are largely automatic:

Automated backups

- Daily snapshots

- Continuous transaction log backup (WAL or binlog, depending on the engine)

- Configurable retention (1-35 days typically)

- Point-in-time recovery within retention window

Manual snapshots

In addition to automated. For long-term retention; before major changes.

For most cloud databases, automated + manual snapshots covers most needs.
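
As an illustration of a manual snapshot before a major change, the sketch below assumes Amazon RDS and a boto3 RDS client passed in by the caller (for example `boto3.client("rds")`). The helper names, the instance name, and the label are hypothetical; `create_db_snapshot` is the boto3 call it relies on.

```python
import re
from datetime import datetime, timezone
from typing import Optional


def snapshot_identifier(instance_id: str, label: str, now: datetime) -> str:
    """Build a name from letters, digits, and single hyphens only."""
    raw = f"{instance_id}-{label}-{now:%Y%m%d-%H%M%S}".lower()
    cleaned = re.sub(r"[^a-z0-9-]+", "-", raw)
    return re.sub(r"-{2,}", "-", cleaned).strip("-")


def snapshot_before_change(rds_client, instance_id: str, label: str,
                           now: Optional[datetime] = None) -> str:
    """Request a manual snapshot and return its identifier."""
    now = now or datetime.now(timezone.utc)
    identifier = snapshot_identifier(instance_id, label, now)
    # rds_client is assumed to be a boto3 RDS client supplied by the caller.
    rds_client.create_db_snapshot(
        DBInstanceIdentifier=instance_id,
        DBSnapshotIdentifier=identifier,
    )
    return identifier
```

The call only starts the snapshot. A real change procedure would wait until the snapshot is available before proceeding.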

Self-managed databases

For self-hosted databases, you implement backup yourself.

PostgreSQL

`pg_basebackup` for full backups; WAL archiving for continuous.

Tools: pgBackRest, Barman, WAL-E/WAL-G. Production-grade backup tools handle compression, encryption, retention, parallel restore.
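
For orientation, a bare `pg_basebackup` invocation can be wrapped as below. This is a sketch, not a replacement for the tools above: the host, user, and target directory are placeholders, and the wrapper functions are hypothetical. The options themselves are standard `pg_basebackup` flags.

```python
import subprocess
from typing import List


def pg_basebackup_command(host: str, user: str, target_dir: str) -> List[str]:
    """Build the argument list for a compressed full backup that includes WAL."""
    return [
        "pg_basebackup",
        "--host", host,
        "--username", user,
        "--pgdata", target_dir,      # directory that receives the backup
        "--format", "tar",           # write tar archives instead of a plain data directory
        "--gzip",                    # compress the tar output
        "--wal-method", "stream",    # stream the WAL needed to make the backup consistent
        "--checkpoint", "fast",      # start immediately rather than waiting for a checkpoint
        "--progress",
    ]


def run_full_backup(host: str, user: str, target_dir: str) -> None:
    """Run the backup and raise CalledProcessError on a non-zero exit."""
    subprocess.run(pg_basebackup_command(host, user, target_dir), check=True)
```

Raising on failure matters: a scheduler that ignores the exit code is how backups silently stop.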

MySQL

`mysqldump` for logical backups; Percona XtraBackup for hot backups.

MongoDB

`mongodump` for logical; filesystem snapshots for hot.

What to back up

Data

The actual database contents.

Configuration

Database configuration, user accounts, roles, schemas. A rebuild from scratch requires these too.

Application code

Without the application code, the database alone doesn't help.

Infrastructure

VPC, security groups, IAM. Terraform usually handles this.

Storage location

Same region

Fast access; vulnerable to region failure.

Cross-region

DR-ready. Costs more (transfer + storage).

Cross-cloud / off-cloud

Multi-cloud DR. Highest cost; protects against cloud-provider failure.

For most production systems: same-region for speed; cross-region for disaster recovery.

Encryption

Backups should be encrypted at rest and in transit. The "tar file in S3" pattern is standard:

- Encrypted with a KMS key

- Stored in S3 with SSE-KMS or SSE-S3

- Versioned bucket

If a single compromised key can decrypt every backup, that's a security vulnerability. Restrict who can use the key, and keep it separate from the credentials that access the database.
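
A minimal sketch of the upload step, assuming a boto3 S3 client passed in by the caller. The function names, bucket, object key, and KMS key ID are placeholders. `upload_file` with `ExtraArgs` is the boto3 call used, and storing a SHA-256 checksum as object metadata is an addition of this sketch, not something the pattern requires.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_encrypted_backup(s3_client, path: str, bucket: str,
                            key: str, kms_key_id: str) -> str:
    """Upload with SSE-KMS and record a checksum for later restore verification."""
    checksum = sha256_of_file(path)
    # s3_client is assumed to be a boto3 S3 client; boto3 uses HTTPS by default.
    s3_client.upload_file(
        Filename=path,
        Bucket=bucket,
        Key=key,
        ExtraArgs={
            "ServerSideEncryption": "aws:kms",
            "SSEKMSKeyId": kms_key_id,
            "Metadata": {"sha256": checksum},
        },
    )
    return checksum
```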

Testing restores

The single most important practice. Backups that have never been restored are aspirational.

Periodic restore tests

Monthly or quarterly: restore a backup; verify it works.

What to verify:

- The restore completes

- Data is intact

- Application can connect

- Recent transactions are present (or absent for a specific point-in-time test)

Automated restore tests

Spin up a fresh database; restore latest backup; run smoke tests; tear down.

Run in CI/CD or on a schedule for continuous verification.
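
The shape of such a test can be sketched with SQLite from the Python standard library standing in for the real database. The `orders` table and both function names are hypothetical. A production version would restore into a fresh instance of the actual engine and run application-level smoke tests.

```python
import sqlite3


def take_backup(source: sqlite3.Connection, backup_path: str) -> None:
    """Copy the live database into a backup file."""
    dest = sqlite3.connect(backup_path)
    try:
        source.backup(dest)
    finally:
        dest.close()


def restore_and_verify(backup_path: str, expected_min_rows: int) -> bool:
    """Restore into a throwaway database, then check integrity and row count."""
    backup = sqlite3.connect(backup_path)
    restored = sqlite3.connect(":memory:")   # the fresh database
    try:
        backup.backup(restored)
        intact = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        rows = restored.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        return intact and rows >= expected_min_rows
    finally:
        backup.close()
        restored.close()                      # tear down
```

The row-count threshold is a crude stand-in for checking that recent transactions are present.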

Real disaster simulation

At least annually: pretend the primary is gone. Restore in the DR region; fail over applications. Time it.

This finds problems automated tests miss: documentation gaps, manual steps, organizational coordination.

Retention

Daily backups

Keep 7-30 days. Recent enough for normal recovery.

Weekly backups

Keep for 1-3 months. Catches issues discovered later.

Monthly backups

Keep for 1-7 years. Compliance retention.

Annual archives

Long-term retention as required.
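
The daily, weekly, and monthly tiers above can be expressed as a pruning rule. This is one possible sketch with a hypothetical function name. It assumes the first backup seen in each ISO week and each calendar month is the one promoted to the weekly and monthly tier, and it uses the upper ends of the ranges above as defaults.

```python
from datetime import date
from typing import Iterable, Set


def backups_to_keep(backup_dates: Iterable[date], today: date,
                    daily_days: int = 30,
                    weekly_days: int = 90,
                    monthly_days: int = 7 * 365) -> Set[date]:
    """Return the backup dates a tiered retention policy would keep."""
    keep: Set[date] = set()
    weeks_seen = set()
    months_seen = set()
    for d in sorted(set(backup_dates)):       # oldest first
        age = (today - d).days
        if age < 0:
            continue
        week = tuple(d.isocalendar())[:2]     # (ISO year, ISO week)
        month = (d.year, d.month)
        is_weekly = week not in weeks_seen
        is_monthly = month not in months_seen
        weeks_seen.add(week)
        months_seen.add(month)
        if (age <= daily_days
                or (is_weekly and age <= weekly_days)
                or (is_monthly and age <= monthly_days)):
            keep.add(d)
    return keep
```

Everything not in the returned set is a candidate for deletion. Running the rule in a dry-run mode first avoids deleting the wrong backups.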

Lifecycle from hot storage to cold (Glacier, etc.) saves cost as backups age. See [CloudStorageOptions](CloudStorageOptions).
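
For backups stored in S3, the transition can be declared as a lifecycle rule. The sketch below builds the configuration as a Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The function name, rule ID, prefix, and day counts are assumptions for illustration.

```python
from typing import Any, Dict


def backup_lifecycle_rules(prefix: str,
                           cold_after_days: int = 30,
                           expire_after_days: int = 7 * 365) -> Dict[str, Any]:
    """Move backups under prefix to Glacier, then expire them."""
    return {
        "Rules": [
            {
                "ID": "backups-to-cold-storage",   # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": cold_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }
```

It would be applied with something like `s3_client.put_bucket_lifecycle_configuration(Bucket=bucket, LifecycleConfiguration=backup_lifecycle_rules("db-backups/"))`, where `s3_client` is a boto3 S3 client.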

Specific scenarios

Accidental DELETE / DROP TABLE

Point-in-time recovery to just before the bad operation.

Without PITR: restore last full backup; lose data since then.
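
For PostgreSQL, point-in-time recovery is driven by a few settings plus a `recovery.signal` file in the restored data directory (PostgreSQL 12 and later). The sketch below renders those settings. The function name, the archive path, the plain `cp` restore command, and the five-second safety margin are assumptions; tools such as pgBackRest supply their own restore command.

```python
from datetime import datetime, timedelta, timezone


def pitr_settings(bad_operation_at: datetime, archive_dir: str,
                  safety_margin: timedelta = timedelta(seconds=5)) -> str:
    """Render recovery settings that stop shortly before the bad operation."""
    if bad_operation_at.tzinfo is None:
        raise ValueError("use a timezone-aware timestamp to avoid ambiguity")
    target = bad_operation_at - safety_margin
    return "\n".join([
        # %f is the WAL file name and %p the path PostgreSQL wants it copied to.
        f"restore_command = 'cp {archive_dir}/%f \"%p\"'",
        f"recovery_target_time = '{target.isoformat(sep=' ')}'",
        # Open the database for writes once the target is reached.
        "recovery_target_action = 'promote'",
    ])


settings = pitr_settings(
    datetime(2024, 3, 1, 12, 0, 0, tzinfo=timezone.utc),
    "/mnt/wal-archive",   # hypothetical archive location
)
```

Restoring into a separate instance first, rather than over the damaged one, lets you confirm the chosen time before committing to it.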

Compromised database

Attacker may have planted persistence. Restore to before compromise; verify.

If the compromise happened long ago, you may need to restore a very old backup and replay known-good transactions, or accept the loss of recent data.

Logical corruption

Application bug corrupted data. Restore to before bug; reapply known-good transactions if possible.

Region outage

Restore in another region from cross-region backup. Re-point applications.

Common failure patterns

Backups never tested

Restore fails when actually needed.

Backups in same place as data

Region failure loses both.

Restore documentation outdated

Procedures don't match current environment.

No alerting on backup failures

Backups silently stop; nobody notices.
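
A freshness check is a simple guard against this. The sketch below assumes the timestamp of the last successful backup is available from the backup tool or storage listing. The function name and the 26-hour threshold (a daily schedule plus slack) are illustrative choices.

```python
from datetime import datetime, timedelta
from typing import Optional


def backup_is_stale(last_success: Optional[datetime], now: datetime,
                    max_age: timedelta = timedelta(hours=26)) -> bool:
    """True when no backup has ever succeeded or the newest one is too old."""
    if last_success is None:
        return True
    return now - last_success > max_age
```

Alerting on the age of the last success, rather than on failure events, also catches the case where the backup job never runs at all.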

No retention policy

Storage cost grows; eventually backups are deleted to save money; the wrong ones get deleted.

Manual backup steps

Human forgets; backups missing.

Backup credentials stored with the data

Compromise of database = compromise of backups.

A reasonable starter

For typical production databases:

1. Cloud-managed if possible (handles backups automatically)

2. Daily automated; PITR enabled

3. Cross-region backup copies for DR

4. Manual snapshots before major changes

5. Lifecycle to cold storage after 30 days

6. Monthly restore test

7. Quarterly DR drill

8. Documented restore procedure

9. Alerts on backup failures

Further Reading

- [ReadReplicasAndReplication](ReadReplicasAndReplication) — Adjacent practice

- [DatabaseConnectionSecurity](DatabaseConnectionSecurity) — Backup security

- [CloudDisasterRecovery](CloudDisasterRecovery) — Broader DR

- [CloudStorageOptions](CloudStorageOptions) — Where backups land