Runbook Automation

A runbook is operational documentation: when this alert fires, take these steps. The on-call engineer at 3am shouldn't have to improvise; they follow the runbook.

Once runbooks exist, the next step is automating the recoverable parts. Why have a human run the same sequence of commands when automation can?

This page covers runbook design and automation patterns.

Anatomy of a good runbook

```markdown

Alert: Database connection pool exhausted

Symptoms

- Alert: db.connection_pool.in_use > 90%

- Customer impact: API errors, latency spikes

Initial actions

1. Check current connection count: `bin/db-stats.sh`

2. Look for runaway query: `bin/long-queries.sh`

3. Check application metrics for unusual patterns

Common causes

1. Long-running query holding connections

2. Application bug leaking connections

3. Genuine load spike

Resolution

If long-running query

~~~sql

SELECT pid, query, state, age(now(), xact_start) FROM pg_stat_activity

WHERE state != 'idle' ORDER BY xact_start;

~~~

Cancel the offending query: `SELECT pg_cancel_backend(<pid>);` (if the session itself must be dropped, use `SELECT pg_terminate_backend(<pid>);`)

If connection leak

Restart the affected service:

`kubectl rollout restart deployment/api`

If load spike

Scale up:

`kubectl scale deployment/api --replicas=10`

Escalation

- After 30 minutes if not resolved: page secondary

- After 60 minutes: page database team

Related

- Dashboard: ...

- Recent changes: ...

```

The runbook has specific commands, common causes, and escalation criteria. The on-call engineer doesn't have to invent the response.
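
Escalation criteria are easier to apply under pressure when they are data rather than prose. The following is a minimal sketch that encodes the two thresholds from the example runbook; the names `ESCALATION_STEPS` and `escalation_targets` are hypothetical and not part of any paging tool.

```python
# Thresholds taken from the example runbook above: (minutes unresolved, who to page).
ESCALATION_STEPS = [
    (30, "secondary on-call"),
    (60, "database team"),
]


def escalation_targets(minutes_unresolved: float) -> list[str]:
    """Return everyone who should have been paged by this point in the incident."""
    return [who for threshold, who in ESCALATION_STEPS if minutes_unresolved >= threshold]


if __name__ == "__main__":
    for minutes in (10, 30, 75):
        print(minutes, escalation_targets(minutes))
```

Keeping the thresholds in one table means the runbook text and any paging automation can be checked against the same source.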

Writing principles

Specific commands

Not "check the database connections." Specific: `psql -c "SELECT count(*) FROM pg_stat_activity"`.

Tested

Runbooks rot. A command that worked last year may fail today because a host was renamed, a script moved, or a flag was deprecated. Test runbooks periodically through game days and dry runs.

Linked from alerts

Each alert has a link to its runbook. On-call gets the link in the alert payload.
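
One way to enforce this is a check that fails when an alert has no runbook link. The sketch below assumes alert rules have already been loaded into simple objects; `AlertRule`, `runbook_url`, and the example URL are illustrative assumptions, not the schema of any particular monitoring system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical alert definition; real rules would be loaded from monitoring config."""
    name: str
    runbook_url: Optional[str] = None


def alerts_missing_runbook(rules: list[AlertRule]) -> list[str]:
    """Return the names of alerts whose runbook link is absent or blank."""
    return [rule.name for rule in rules if not (rule.runbook_url or "").strip()]


if __name__ == "__main__":
    rules = [
        AlertRule("db_connection_pool_exhausted", "https://wiki.example.com/runbooks/db-pool"),
        AlertRule("api_latency_high"),  # no runbook yet
    ]
    print(alerts_missing_runbook(rules))
```

Run as part of CI for the alerting configuration, a check like this stops new alerts from shipping without a runbook.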

Maintained

Runbooks that nobody updates become wrong. Make updates part of incident postmortems.

What to automate

Recoverable failures

If the response is "restart the service," automate the restart. Kubernetes liveness probes already do this: a container that fails its probe is restarted without any custom tooling.

Auto-scaling

Load spikes? Scale up automatically. CPU-based, queue-depth-based, custom metrics.
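
Whatever the metric, most autoscalers apply a proportional rule: scale the replica count by the ratio of the observed metric to its target. The sketch below uses the formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(current_replicas * current_metric / target_metric)`, and clamps the result. The function name and default bounds are assumptions for illustration, and real autoscalers add tolerances and stabilization windows that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Proportional scaling: grow or shrink replicas by the metric-to-target ratio."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so a metric spike or a bad reading cannot scale without bound.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas at 90% CPU with a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

The same function works for queue depth or a custom metric, as long as the metric is expected to fall as replicas are added.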

Failover

Primary region down? Route traffic to secondary. Health checks + DNS failover.

Rollback

Recent deploy is causing errors? Auto-rollback if error rate exceeds threshold.

Cleanup

Stuck jobs? Old logs? Dead resources? Scheduled cleanup tasks.

What not to automate

Decisions requiring judgment

"Is this a real customer impact or a flaky monitoring blip?" Humans decide. Automation paging the human is fine; automation deciding the response usually isn't.

Destructive actions

"Drop the database table": never automate this, however confident you are in the trigger.

High-impact actions

Cross-region failover, data migration, etc. Manual approval required.

Untested automation

Automation that has never been exercised against a realistic failure can do more harm than doing nothing.

Specific patterns

Self-healing systems

Health checks → automatic restarts → automatic scaling → fewer pages.

For workloads where this fits, the on-call gets paged less.

Auto-rollback on canary failure

Deploy canary; monitor metrics for 10 minutes; auto-rollback if errors exceed baseline.
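
The rollback decision itself can be a small, testable function. The sketch below compares the canary's error rate with the baseline over the monitoring window. `Window`, `should_rollback`, and all threshold defaults are hypothetical choices for illustration; the minimum-traffic and absolute-rate guards are there so that normal fluctuation on a low-traffic canary does not trigger a rollback.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and error counts collected over the monitoring window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_rollback(
    canary: Window,
    baseline: Window,
    min_requests: int = 500,      # assumed: below this, the sample is too small to judge
    max_ratio: float = 2.0,       # assumed: canary may be at most 2x the baseline rate
    min_abs_rate: float = 0.01,   # assumed: ignore error rates under 1%
) -> bool:
    """Return True when the canary is clearly worse than the baseline."""
    if canary.requests < min_requests:
        return False
    if canary.error_rate < min_abs_rate:
        return False
    return canary.error_rate > baseline.error_rate * max_ratio


if __name__ == "__main__":
    baseline = Window(requests=10_000, errors=100)   # 1% errors
    canary = Window(requests=1_000, errors=50)       # 5% errors
    print(should_rollback(canary, baseline))
```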

Circuit breakers

Service fails repeatedly → circuit opens → traffic stops hitting it for a period → tries again.

Application-level resilience that doesn't need on-call involvement.
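
A minimal circuit breaker sketch follows, assuming a simple consecutive-failure count and a fixed cool-down. The class name, parameters, and defaults are illustrative rather than taken from any library. After the cool-down, one trial call is allowed through; if it fails, the circuit reopens.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or reopen if the trial call failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping outbound calls as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` stands in for any dependency call) lets callers fail fast instead of piling requests onto a struggling service.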

ChatOps for response

`@bot, restart api in production` runs the restart. The bot logs the action; team sees what was done. Cleaner than SSH-ing in.
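
The core of such a bot is a strict command parser plus an audit trail. The sketch below accepts only allow-listed actions and environments and records every accepted command; the grammar, the allow-lists, and the names `parse_command` and `handle` are assumptions for illustration, and the actual restart is left as a comment.

```python
import re
from dataclasses import dataclass

ALLOWED_ACTIONS = {"restart", "status"}        # assumed allow-list
ALLOWED_ENVS = {"staging", "production"}       # assumed environments
COMMAND_RE = re.compile(
    r"^@bot,?\s+(?P<action>\w+)\s+(?P<service>[\w-]+)\s+in\s+(?P<env>\w+)$"
)


@dataclass(frozen=True)
class Command:
    action: str
    service: str
    env: str
    requested_by: str


def parse_command(text: str, user: str):
    """Return a Command for well-formed, allow-listed input; otherwise None."""
    match = COMMAND_RE.match(text.strip())
    if match is None:
        return None
    if match["action"] not in ALLOWED_ACTIONS or match["env"] not in ALLOWED_ENVS:
        return None
    return Command(match["action"], match["service"], match["env"], user)


audit_log: list[Command] = []


def handle(text: str, user: str) -> str:
    """Record the command, then reply in the channel so the team sees what happened."""
    command = parse_command(text, user)
    if command is None:
        return "Command not recognized or not allowed."
    audit_log.append(command)
    # A real bot would trigger the restart here, using its own credentials.
    return f"{command.requested_by} requested {command.action} of {command.service} in {command.env}"


if __name__ == "__main__":
    print(handle("@bot, restart api in production", "alice"))
```

Rejecting anything outside the allow-list keeps the bot from becoming a general-purpose remote shell.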

Kill switches

Feature flags that disable problematic functionality. On-call can flip without code change.
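
In code, a kill switch is a guard around the risky path with a safe fallback. The sketch below uses an in-memory flag store; `KillSwitches`, `get_recommendations`, and the `"recommendations"` feature name are hypothetical, and a real system would read flags from a shared flag service so that a flip takes effect across all instances.

```python
class KillSwitches:
    """In-memory store of disabled features (illustrative only)."""

    def __init__(self):
        self._disabled = set()

    def disable(self, feature: str) -> None:
        self._disabled.add(feature)

    def enable(self, feature: str) -> None:
        self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled


def get_recommendations(user_id: str, switches: KillSwitches) -> list[str]:
    """Serve recommendations unless the feature has been switched off."""
    if not switches.is_enabled("recommendations"):
        return []  # degrade gracefully instead of calling the problematic code path
    return [f"recommended-item-for-{user_id}"]


if __name__ == "__main__":
    switches = KillSwitches()
    print(get_recommendations("u1", switches))
    switches.disable("recommendations")   # on-call flips the switch
    print(get_recommendations("u1", switches))
```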

The progression

Operations teams typically mature through this progression:

1. **Manual response**: human follows runbook

2. **Automated diagnosis**: tools tell you what's wrong faster

3. **Automated recovery for common cases**: alert fires; automation acts; human reviews

4. **Self-healing for known patterns**: alert doesn't even fire because system recovered

Each step reduces on-call load. The investment pays back over time.

Common failure patterns

- **Runbooks that are stale.** A misleading runbook is worse than a missing one.

- **Runbooks no one wrote.** Tribal knowledge.

- **Alerts without runbooks.** On-call invents in the moment.

- **Overly aggressive automation.** For example, auto-rollback triggered by normal load fluctuation.

- **Automation that fails silently.** Things go wrong; nobody knows.

- **No escalation criteria.** On-call doesn't know when to call for help.
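
Silent failure in particular is cheap to guard against: wrap each automated action so that both success and failure are reported. The sketch below assumes a `notify` callable that posts to a channel or pages someone; `run_remediation` and the message format are illustrative names, not an existing API.

```python
import logging
from typing import Callable

logger = logging.getLogger("remediation")


def run_remediation(name: str, action: Callable[[], None], notify: Callable[[str], None]) -> bool:
    """Run an automated action and always report the outcome; return True on success."""
    try:
        action()
    except Exception as exc:
        logger.exception("remediation %s failed", name)
        notify(f"remediation '{name}' FAILED: {exc}")
        return False
    notify(f"remediation '{name}' succeeded")
    return True


if __name__ == "__main__":
    def restart_api():
        raise RuntimeError("deployment not found")

    run_remediation("restart-api", restart_api, notify=print)
```

Reporting successes as well as failures gives humans a record to review and makes it obvious when an automation has stopped running altogether.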

A starter pattern

For a service with a new on-call rotation:

1. Document each alert's runbook (manual response)

2. Automate trivially recoverable cases (auto-restart on liveness fail)

3. Add canary deployment with auto-rollback

4. Ensure escalation paths are defined

5. Game day: simulate incident; test runbooks

6. Iterate based on real incidents

Runbook coverage and automation grow over months, not weeks.

Further Reading

- [OnCallPractices](OnCallPractices) — On-call rotations

- [ToilReductionStrategies](ToilReductionStrategies) — SRE concept

- [ScheduledTaskManagement](ScheduledTaskManagement) — Adjacent automation

- [CodeDocumentationBestPractices](CodeDocumentationBestPractices) — Documentation parallels

- [DevOpsAndSre Hub](DevOpsAndSreHub) — Cluster index