Toil Reduction Strategies
Toil, in SRE terminology: manual, repetitive, automatable operational work that doesn't scale. The manual database failover; the weekly backup verification; the incident-after-incident with the same root cause.
Reducing toil is a primary SRE practice. Without it, operational load grows with the size of the system. With it, the team scales beyond linear.
What toil is (and isn't)
The Google SRE definition. Toil is work that's:
- **Manual**: a human does it
- **Repetitive**: same task multiple times
- **Automatable**: a machine could do it
- **Tactical**: reactive, not strategic
- **No enduring value**: doesn't make the system better
- **O(n) with service growth**: scales linearly with users/load
If your service handles 10× more traffic, does this work increase 10×? If yes, it's toil.
What's NOT toil:
- New feature work
- Novel problem solving
- Strategic improvements
- One-off projects
These have enduring value or aren't repetitive.
Why toil reduction matters
Scaling
Without toil reduction, headcount grows with system size. Eventually unsustainable.
Burnout
Manual repetitive work is unfulfilling. Engineers burn out. See [BurnoutPreventionInTech](BurnoutPreventionInTech).
Speed
Toil consumes time that could be spent on improvements. Operations falls behind.
Reliability
Manual work has more errors than automation. Toil produces incidents.
SRE's 50% rule
Google SRE: limit operational load to 50% of an SRE's time. The other 50% is engineering — building tools, reducing toil.
If toil exceeds 50%, the team can't keep up. Either:
- Hire more (linear growth; bad)
- Reduce toil (sub-linear growth; good)
The 50% target is a forcing function. When toil rises above it, prioritize reduction.
Identifying toil
Track on-call work
What did the on-call do this week? Categorize:
- Real incidents (necessary)
- Toil (could be automated)
- Engineering (reduces future toil)
The toil category gets attention.
Recurring incidents
Same alert fires monthly with the same response. That's toil — the response should be automated.
Manual deploys
Manual production deploys are pure toil. Automate.
Manual scaling
Capacity adjusted by humans on schedule. Auto-scaling.
Health-check verification
Manual checks that things are running. Automated monitoring with alarming.
Onboarding/offboarding
Manual setup of new accounts, access, etc. Self-service or automated.
Reduction strategies
Automate the response
The on-call follows the runbook; automation can follow the same runbook. See [RunbookAutomation](RunbookAutomation).
Self-service for common requests
Instead of tickets to the platform team for common requests (new database, scaling change, access grant), self-service portals.
Better tooling
Tools that make manual operations one-click instead of multi-step. Click-to-deploy, click-to-rollback, click-to-scale.
Eliminate the failure mode
The best toil reduction is making the failure unnecessary. Database failover toil? Multi-master so failover is automatic. Backup verification toil? Continuous verification.
Push to product
Some operational complexity reflects product complexity. Simplifying the product reduces toil.
What's worth automating
The math: cost of automation vs. recurring cost of manual work.
- Toil consuming 1 hour/week: 50 hours/year. Worth automating if automation is < 50 hours.
- Toil consuming 1 hour/day: 250 hours/year. Worth automating up to 250 hours.
Plus second-order effects: error reduction, speed improvement, engineer satisfaction. Often makes automation worth more than the raw math.
What NOT to automate
Rare events
Once-a-year manual work. Cost of automation exceeds savings.
High-judgment work
Decisions requiring expertise. Automation paging a human is fine; automation deciding is risky.
Destructive operations
Drop database, kill cluster, etc. Even if rare and repetitive, the failure mode is too severe.
Premature automation
The work hasn't stabilized. The "right" automation is still emerging. Premature automation locks in the wrong pattern.
The political dimension
Reducing toil sometimes means:
- The team that runs the manual process loses scope
- The cost shifts (the platform team gets the work)
- The organization needs to invest in tooling
These can be politically charged. Frame as: "we're freeing up capacity for higher-value work" rather than "your job is being automated."
Specific patterns
Auto-scaling
Manual capacity changes → automatic based on metrics.
Self-healing
Manual restart of stuck services → liveness probes + automatic restart.
Self-service infrastructure
Tickets for new resources → Terraform module that engineers run.
Automated certificate rotation
Manual cert installs → ACME (Let's Encrypt) or AWS Certificate Manager.
Automated database backups
Manual backup verification → continuous validation.
Common failure patterns
- **No measurement.** Don't know how much toil there is.
- **Treating toil as inevitable.** "It's just operations work."
- **Burning out the team to handle toil.** Should be reducing it instead.
- **Over-automating in low-value places.** Automating once-a-year work.
- **Building bespoke tools when off-the-shelf works.** Reinvented wheels.
- **Automation that's brittle.** Fails worse than the manual process.
A reasonable approach
For teams looking to reduce toil:
1. Measure: track operational time
2. Categorize: which categories are biggest?
3. Prioritize: highest-volume toil first
4. Automate: replace manual with tools
5. Iterate: new toil emerges; address it
6. Sustain: make toil reduction part of regular planning
This is ongoing work, not a project. Toil constantly emerges; reduction must be constant.
Further Reading
- [RunbookAutomation](RunbookAutomation) — Specific automation pattern
- [OnCallPractices](OnCallPractices) — Where toil shows up
- [DevOpsFundamentals](DevOpsFundamentals) — Broader practice
- [BurnoutPreventionInTech](BurnoutPreventionInTech) — Why toil matters for people
- [DevOpsAndSre Hub](DevOpsAndSreHub) — Cluster index