Toil Reduction Strategies

Toil, in SRE terminology: manual, repetitive, automatable operational work that doesn't scale. The manual database failover; the weekly backup verification; the incident-after-incident with the same root cause.

Reducing toil is a primary SRE practice. Without it, operational load grows with the size of the system. With it, the team scales beyond linear.

What toil is (and isn't)

The Google SRE definition. Toil is work that's:

Manual: a human does it
Repetitive: same task multiple times
Automatable: a machine could do it
Tactical: reactive, not strategic
No enduring value: doesn't make the system better
O(n) with service growth: scales linearly with users/load

If your service handles 10× more traffic, does this work increase 10×? If yes, it's toil.

What's NOT toil:

New feature work
Novel problem solving
Strategic improvements
One-off projects

These have enduring value or aren't repetitive.

Why toil reduction matters

Scaling

Without toil reduction, headcount grows with system size. Eventually unsustainable.

Burnout

Manual repetitive work is unfulfilling. Engineers burn out. See BurnoutPreventionInTech.

Speed

Toil consumes time that could be spent on improvements. Operations falls behind.

Reliability

Manual work has more errors than automation. Toil produces incidents.

SRE's 50% rule

Google SRE: limit operational load to 50% of an SRE's time. The other 50% is engineering — building tools, reducing toil.

If toil exceeds 50%, the team can't keep up. Either:

Hire more (linear growth; bad)
Reduce toil (sub-linear growth; good)

The 50% target is a forcing function. When toil rises above it, prioritize reduction.

Identifying toil

Track on-call work

What did the on-call do this week? Categorize:

Real incidents (necessary)
Toil (could be automated)
Engineering (reduces future toil)

The toil category gets attention.

Recurring incidents

Same alert fires monthly with the same response. That's toil — the response should be automated.

Manual deploys

Manual production deploys are pure toil. Automate.

Manual scaling

Capacity adjusted by humans on schedule. Auto-scaling.

Health-check verification

Manual checks that things are running. Automated monitoring with alarming.

Onboarding/offboarding

Manual setup of new accounts, access, etc. Self-service or automated.

Reduction strategies

Automate the response

The on-call follows the runbook; automation can follow the same runbook. See RunbookAutomation.

Self-service for common requests

Instead of tickets to the platform team for common requests (new database, scaling change, access grant), self-service portals.

Better tooling

Tools that make manual operations one-click instead of multi-step. Click-to-deploy, click-to-rollback, click-to-scale.

Eliminate the failure mode

The best toil reduction is making the failure unnecessary. Database failover toil? Multi-master so failover is automatic. Backup verification toil? Continuous verification.

Push to product

Some operational complexity reflects product complexity. Simplifying the product reduces toil.

What's worth automating

The math: cost of automation vs. recurring cost of manual work.

Toil consuming 1 hour/week: 50 hours/year. Worth automating if automation is < 50 hours.
Toil consuming 1 hour/day: 250 hours/year. Worth automating up to 250 hours.

Plus second-order effects: error reduction, speed improvement, engineer satisfaction. Often makes automation worth more than the raw math.

What NOT to automate

Rare events

Once-a-year manual work. Cost of automation exceeds savings.

High-judgment work

Decisions requiring expertise. Automation paging a human is fine; automation deciding is risky.

Destructive operations

Drop database, kill cluster, etc. Even if rare and repetitive, the failure mode is too severe.

Premature automation

The work hasn't stabilized. The "right" automation is still emerging. Premature automation locks in the wrong pattern.

The political dimension

Reducing toil sometimes means:

The team that runs the manual process loses scope
The cost shifts (the platform team gets the work)
The organization needs to invest in tooling

These can be politically charged. Frame as: "we're freeing up capacity for higher-value work" rather than "your job is being automated."

Specific patterns

Auto-scaling

Manual capacity changes → automatic based on metrics.

Self-healing

Manual restart of stuck services → liveness probes + automatic restart.

Self-service infrastructure

Tickets for new resources → Terraform module that engineers run.

Automated certificate rotation

Manual cert installs → ACME (Let's Encrypt) or AWS Certificate Manager.

Automated database backups

Manual backup verification → continuous validation.

Common failure patterns

No measurement. Don't know how much toil there is.
Treating toil as inevitable. "It's just operations work."
Burning out the team to handle toil. Should be reducing it instead.
Over-automating in low-value places. Automating once-a-year work.
Building bespoke tools when off-the-shelf works. Reinvented wheels.
Automation that's brittle. Fails worse than the manual process.

A reasonable approach

For teams looking to reduce toil:

Measure: track operational time
Categorize: which categories are biggest?
Prioritize: highest-volume toil first
Automate: replace manual with tools
Iterate: new toil emerges; address it
Sustain: make toil reduction part of regular planning

This is ongoing work, not a project. Toil constantly emerges; reduction must be constant.