On-Call Practices
On-call means someone is responsible for production at all times. When something breaks, that person is paged. They diagnose, mitigate, and involve others if needed.
Done well, on-call is sustainable. Done poorly, it's miserable and people quit.
This page covers the practices that work.
The on-call function
The on-call's job:
1. **Acknowledge alerts**: someone is on it
2. **Triage**: severity? blast radius? customer impact?
3. **Mitigate**: stop the bleeding; revert, scale, kill switch
4. **Investigate**: find root cause (often after mitigating)
5. **Communicate**: status updates; involve others if needed
6. **Postmortem**: capture lessons after
The first priority is mitigation. Root cause investigation comes after.
Rotation patterns
Single-person rotation
One person at a time. Common for small teams.
Pros: simple; one point of contact.
Cons: long shifts are tiring; the primary is alone during a complex incident.
Primary + secondary
Primary takes pages; secondary backs up. Both rotate.
Pros: complex incidents have backup; reduced pressure on primary.
Cons: more rotation slots needed.
Follow-the-sun
Different time zones cover different hours. For example, on one shared reference clock: EU primary 8am-5pm; US primary 5pm-2am; APAC overnight.
Pros: no one paged at 3am.
Cons: requires global team; handoffs are weak points.
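The handoff boundaries are where coverage gaps tend to hide, so it can help to write the shift table down as data and check it. The sketch below is illustrative only: it assumes the hours in the example are on a single shared clock, and the APAC window (02:00-08:00) is an assumption that fills the remaining "overnight" hours. `SHIFTS`, `region_on_call`, and `coverage_gaps` are hypothetical names.

```python
# Hypothetical shift table: (start_hour, end_hour, region) on one shared clock.
# EU and US windows follow the example above; the APAC window is an assumption.
SHIFTS = [
    (8, 17, "EU"),
    (17, 2, "US"),    # wraps past midnight
    (2, 8, "APAC"),
]


def region_on_call(hour: int) -> str:
    """Return the region holding primary at the given hour (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    for start, end, region in SHIFTS:
        if start < end:
            covered = start <= hour < end
        else:  # window wraps past midnight
            covered = hour >= start or hour < end
        if covered:
            return region
    raise LookupError(f"no region covers hour {hour}")  # gap in the table


def coverage_gaps() -> list[int]:
    """Hours of the day that no shift covers; this should be empty."""
    gaps = []
    for hour in range(24):
        try:
            region_on_call(hour)
        except LookupError:
            gaps.append(hour)
    return gaps
```

Running `coverage_gaps()` whenever the table changes is a cheap way to catch an hour that nobody owns.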
For most teams, primary + secondary is the right model.
Rotation length
Weekly
Sunday to Sunday, or Monday to Monday. Common.
Daily
Each person on call for one day. Reduces fatigue but more handoffs.
Multi-week
Two-week rotations. Fewer handoffs and longer gaps between shifts, but each shift is a heavier burden.
For most teams, weekly is the sweet spot.
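As a rough illustration of a weekly primary + secondary rotation, the sketch below generates a schedule from a list of engineers. It assumes one design choice not stated above: the secondary for a week is the next person in the list, so everyone serves as secondary the week before taking primary. `weekly_rotation` and the names in the usage example are hypothetical.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], start: date, weeks: int) -> list[dict]:
    """Build a weekly primary + secondary schedule starting at `start`."""
    if len(engineers) < 2:
        raise ValueError("primary + secondary needs at least two engineers")
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        schedule.append({
            "start": shift_start,
            "end": shift_start + timedelta(weeks=1),  # handoff day
            "primary": engineers[week % len(engineers)],
            # Next person in the list: this week's secondary is next week's primary.
            "secondary": engineers[(week + 1) % len(engineers)],
        })
    return schedule


if __name__ == "__main__":
    # 2024-01-01 is a Monday, giving a Monday-to-Monday rotation.
    for shift in weekly_rotation(["ana", "ben", "chi"], date(2024, 1, 1), 4):
        print(shift["start"], shift["primary"], shift["secondary"])
```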
Alerting discipline
The hardest part. Alerts must:
Fire only when human action is needed
If automation can handle it (auto-scaling, auto-restart), let automation handle it. Don't page humans for things they can't act on.
Be actionable
Each alert has a runbook. The on-call knows what to do.
Match severity
Page-worthy: real customer impact; high severity.
Non-paging: warning; investigate during business hours.
Tickets: low priority; backlog.
The default should be a ticket; escalate as needed.
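The three tiers above can be expressed as a small routing function. This is a sketch under assumptions: the severity labels `"high"`, `"warning"`, and `"low"` are hypothetical, and treating a high-severity alert without customer impact as non-paging is one possible reading of the rule, not the only one.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"      # wake someone up now
    NOTIFY = "notify"  # non-paging; investigate during business hours
    TICKET = "ticket"  # low priority; backlog


def route_alert(customer_impact: bool, severity: str) -> Route:
    """Pick a route for an alert; anything unrecognised falls back to a ticket."""
    if customer_impact and severity == "high":
        return Route.PAGE
    if severity in ("high", "warning"):
        return Route.NOTIFY
    return Route.TICKET  # the default
```

Making the ticket the fall-through case means a new or mislabelled alert cannot page anyone by accident.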
Have ownership
Each alert has an owning team. Stray alerts that nobody owns get ignored.
Be tuned
Alerts that fire often without requiring action become noise. Tune until each alert is actionable.
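Several of these requirements (a runbook, an owner, a severity) can be checked mechanically before an alert ships. The following is a minimal sketch, assuming alert definitions are available as simple records; `AlertRule`, its fields, and `lint` are hypothetical and do not correspond to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    owner: str = ""        # owning team
    runbook_url: str = ""  # link to the runbook
    severity: str = ""     # "page", "notify", or "ticket"


VALID_SEVERITIES = {"page", "notify", "ticket"}


def lint(rule: AlertRule) -> list[str]:
    """Return the problems found in one alert definition (empty if none)."""
    problems = []
    if not rule.owner.strip():
        problems.append("no owning team")
    if not rule.runbook_url.strip():
        problems.append("no runbook")
    if rule.severity not in VALID_SEVERITIES:
        problems.append(f"unknown severity {rule.severity!r}")
    return problems
```

A check like this could run in code review or CI so that unowned or runbook-less alerts never reach the pager.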
Alert fatigue
The single biggest on-call failure mode. Symptoms:
- "I'll check it in the morning"
- Alerts ignored on the assumption that they are non-urgent
- Real incidents missed in the noise
Causes:
- Too many alerts
- Alerts on causes instead of symptoms
- Alerts without runbooks
- Inherited alerts no one owns
Fix: ruthless tuning. Every alert that fires should lead to action. If it did not, remove or downgrade it.
The 80/20 rule: a few alert types cause most of the noise. Eliminate them and on-call quality improves dramatically.
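One way to find the few noisy alert types is to count, per alert, how many times it fired without leading to action, then walk the list from noisiest down until a chosen share of the noise is covered. The sketch below assumes you can export fired alerts as `(alert_name, led_to_action)` pairs; that input format and the function names are hypothetical.

```python
from collections import Counter


def noise_report(fired: list[tuple[str, bool]]) -> list[tuple[str, int, float]]:
    """Return (alert_name, noisy_count, cumulative_share), noisiest first.

    `fired` holds one (alert_name, led_to_action) pair per fired alert.
    """
    noisy = Counter(name for name, acted in fired if not acted)
    total = sum(noisy.values())
    report = []
    running = 0
    for name, count in noisy.most_common():
        running += count
        report.append((name, count, running / total))
    return report


def top_offenders(fired: list[tuple[str, bool]], share: float = 0.8) -> list[str]:
    """Smallest prefix of the noisiest alerts that covers `share` of the noise."""
    offenders = []
    for name, _count, cumulative in noise_report(fired):
        offenders.append(name)
        if cumulative >= share:
            break
    return offenders
```

The alerts returned by `top_offenders` are the first candidates for removal or downgrading.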
Runbooks
Each alert points to a runbook. The runbook says: when this happens, here's how to respond.
Good runbooks:
- Specific commands
- Common causes
- Escalation criteria
- Links to dashboards
See [RunbookAutomation](RunbookAutomation).
Escalation
When to escalate:
- Beyond your expertise
- Severity higher than expected
- Customer-visible impact growing
- Stuck on an issue
Define escalation paths in advance: secondary on-call, manager, specific subject-matter experts. Not "Bob, but he's on vacation."
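A path defined in advance can be as simple as an ordered list of roles, resolved to whoever currently holds each role at the moment of escalation. The sketch below assumes that approach; `ESCALATION_PATH`, `next_contact`, `resolve`, and the roster format are all hypothetical.

```python
from typing import Optional

# Roles, not names: a role is resolved to a person only when it is needed,
# so one person's vacation does not break the path.
ESCALATION_PATH = [
    "primary on-call",
    "secondary on-call",
    "manager",
    "subject-matter expert",
]


def next_contact(current: str, path: list[str] = ESCALATION_PATH) -> Optional[str]:
    """Return the role to escalate to from `current`, or None at the end."""
    idx = path.index(current)  # raises ValueError for an unknown role
    return path[idx + 1] if idx + 1 < len(path) else None


def resolve(role: str, roster: dict[str, str]) -> str:
    """Look up who currently holds a role; fail loudly if nobody does."""
    person = roster.get(role)
    if not person:
        raise LookupError(f"nobody is assigned to {role!r}")
    return person
```

Failing loudly on an unassigned role surfaces a broken path during a quiet-time check instead of mid-incident.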
Mitigation playbook
Common mitigation tools:
- **Rollback**: revert the recent deploy
- **Kill switch / feature flag**: disable the broken feature
- **Scale up**: add capacity
- **Failover**: switch to backup region
- **Restart**: when the symptom is "service stuck"
Practice these during quiet times. An incident should not be the first time you try them.
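To make the kill switch idea concrete, here is a minimal in-memory sketch. It is an illustration only: a real system would keep flags in a shared store so that every instance sees a change, and `KillSwitches`, `switches`, and `recommendations` are hypothetical names.

```python
class KillSwitches:
    """In-memory kill switches; features are enabled unless explicitly disabled."""

    def __init__(self) -> None:
        self._disabled: set[str] = set()

    def disable(self, feature: str) -> None:
        self._disabled.add(feature)

    def enable(self, feature: str) -> None:
        self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled


switches = KillSwitches()


def recommendations(user_id: int) -> list[str]:
    """Hypothetical feature guarded by a kill switch."""
    if not switches.is_enabled("recommendations"):
        return []  # degrade gracefully instead of failing the whole page
    return [f"item-{user_id}-1", f"item-{user_id}-2"]
```

During an incident, `switches.disable("recommendations")` turns the feature off without a deploy; the guarded code path is worth exercising in advance so the degraded behaviour is known to work.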
Communication during incidents
For high-severity incidents:
Incident commander
Owns the incident. Coordinates. Not the same as the technical lead.
Status updates
Periodic updates (every 15-30 min for active incidents). Even if nothing has changed: "still investigating." Silence is worse than slow progress.
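The update cadence is easy to forget mid-incident, so some teams automate a reminder. A minimal sketch of the timing check, assuming a 30-minute default interval (the upper end of the range above); the function names are hypothetical.

```python
from datetime import datetime, timedelta


def next_update_due(last_update: datetime, interval_minutes: int = 30) -> datetime:
    """When the next status update should go out."""
    return last_update + timedelta(minutes=interval_minutes)


def update_overdue(last_update: datetime, now: datetime,
                   interval_minutes: int = 30) -> bool:
    """True once the interval has elapsed without a new update."""
    return now >= next_update_due(last_update, interval_minutes)
```

A bot could call `update_overdue` on a timer and nudge the incident commander, even if the update is only "still investigating."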
Customer-facing communication
Status page; sometimes targeted emails. See [StatusPageBestPractices](StatusPageBestPractices).
Internal communication
A dedicated Slack channel for the incident, with everyone involved in it. Follow up with an after-action report.
Postmortems
After any meaningful incident:
Blameless
Focus on system causes, not individual blame. People made decisions with the information they had. The system shouldn't have allowed the failure.
Timeline
Reconstruct what happened, when, and in what order.
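Reconstruction usually means merging events from several places into one ordered list. The sketch below assumes each source can be exported as `(timestamp, description)` pairs with timezone-aware timestamps; `build_timeline` and the source names in the usage example are hypothetical.

```python
from datetime import datetime, timezone


def build_timeline(*sources: list[tuple[datetime, str]]) -> list[str]:
    """Merge (timestamp, description) events into one UTC-ordered timeline."""
    events = [event for source in sources for event in source]
    for ts, _text in events:
        if ts.tzinfo is None:
            # Naive timestamps from different machines cannot be ordered safely.
            raise ValueError("timestamps must be timezone-aware")
    events.sort(key=lambda event: event[0])  # stable: ties keep source order
    return [f"{ts.astimezone(timezone.utc):%H:%M}Z {text}" for ts, text in events]


if __name__ == "__main__":
    utc = timezone.utc
    deploys = [(datetime(2024, 1, 1, 10, 0, tzinfo=utc), "deploy started")]
    alerts = [(datetime(2024, 1, 1, 10, 5, tzinfo=utc), "error-rate alert fired")]
    for line in build_timeline(deploys, alerts):
        print(line)
```

Normalising everything to UTC avoids ordering mistakes when sources log in different local time zones.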
Root cause analysis
Not just "Bob deployed the bad code." Why did the bad code pass review? Why didn't tests catch it? Why was monitoring late?
Action items
Specific changes with owners. Not "we should improve testing"; specific tests, specific tools, specific timeline.
Sharing
Share postmortems widely so that other teams learn from the incident.
Compensation and time-off
On-call is real work; compensate appropriately.
- **On-call pay**: stipend or extra compensation
- **Comp time**: time off after busy on-call shifts
- **Right to disconnect**: no expectation of work after the rotation ends
- **Vacation coverage**: arranged in advance
Companies that don't compensate on-call lose engineers to companies that do.
Common failure patterns
- **Alerts ignored.** Real incidents missed.
- **No runbooks.** On-call invents response in the moment.
- **Hero on-call.** One person handles everything; burnout.
- **Blame culture.** Postmortems find scapegoats; people hide mistakes.
- **No escalation path.** On-call alone with hard problems.
- **No compensation.** Engineers leave for better deals.
- **No alert tuning.** Noise drowns signal forever.
Further Reading
- [RunbookAutomation](RunbookAutomation) — Automate the manual response
- [ToilReductionStrategies](ToilReductionStrategies) — Reduce on-call load
- [StatusPageBestPractices](StatusPageBestPractices) — Customer communication
- [CloudMonitoring](CloudMonitoring) — Where alerts come from
- [DevOpsAndSre Hub](DevOpsAndSreHub) — Cluster index