Scheduled Task Management
Scheduled tasks: scripts or jobs that run on a schedule. Daily reports, hourly cleanup, weekly billing, periodic reconciliation. Almost every system has them.
The simple version: a cron job on a server. Works for tiny systems. For real production, you need more.
The simple case: cron
```bash
crontab -e
0 2 * * * /path/to/script.sh
```
Runs the script daily at 02:00 in the server's local time zone. Cron is fine for:
- Low-stakes tasks
- Single-machine deployments
- Tasks that can tolerate occasional failures
For production at scale, cron has problems:
- Single point of failure (the server with the crontab)
- No retry on failure
- No alerting on failure
- No history of runs
- Hard to audit ("did this run yesterday?")
Kubernetes CronJobs
For Kubernetes-deployed apps:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: my-cleanup:latest
          restartPolicy: OnFailure
```
Pros: integrated with the cluster; failed pods are retried (up to the Job's `backoffLimit`); logs available.
Cons: Kubernetes-specific; cluster needs to be running.
Cloud-native schedulers
AWS EventBridge / CloudWatch Events
Cloud-native cron equivalent. Triggers Lambda, Step Functions, or other AWS services.
```yaml
ScheduleExpression: rate(1 hour) # or cron-like expression
Target: arn:aws:lambda:...
```
Serverless; managed; pay-per-invocation.
GCP Cloud Scheduler
Similar; GCP-native. Triggers HTTP endpoints, Pub/Sub topics, or App Engine targets on a cron schedule.
Azure Logic Apps / Functions
Azure equivalent: Logic Apps recurrence triggers or Functions timer triggers.
For cloud-native deploys, the cloud scheduler is usually the right choice. Less operational overhead than self-hosted.
What scheduled tasks need
Idempotency
Tasks may run twice (network retry, infrastructure restart). Running the same task twice should leave the system in the same state as running it once.
Don't:
- Append to a log without deduplication
- Send notifications without a "did I send this already?" check
- Insert without `ON CONFLICT DO NOTHING`
See [IdempotencyPatterns](IdempotencyPatterns).
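A minimal sketch of the "did I send this already?" check, using an in-memory SQLite table as the record of what has been sent. The table name, columns, and `send_report_once` helper are hypothetical, and `ON CONFLICT DO NOTHING` assumes SQLite 3.24 or newer.

```python
import sqlite3

# Hypothetical schema for illustration: one row per (report_date, recipient).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sent_reports ("
    " report_date TEXT NOT NULL,"
    " recipient TEXT NOT NULL,"
    " PRIMARY KEY (report_date, recipient))"
)


def send_report_once(conn, report_date, recipient, send):
    """Call send() only if this (date, recipient) pair was not recorded before."""
    cur = conn.execute(
        "INSERT INTO sent_reports (report_date, recipient) VALUES (?, ?) "
        "ON CONFLICT DO NOTHING",
        (report_date, recipient),
    )
    conn.commit()
    if cur.rowcount == 0:
        return False  # row already existed: an earlier run handled it
    send(recipient)
    return True


# Simulate the scheduler firing the same task twice.
sent = []
first = send_report_once(conn, "2024-01-01", "ops@example.com", sent.append)
second = send_report_once(conn, "2024-01-01", "ops@example.com", sent.append)
print(first, second, sent)
```

This version records the send before performing it, so a crash between the two steps drops the notification rather than duplicating it. Whether that trade-off is acceptable depends on the task.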
Locking / single execution
Some tasks should run only once even if scheduled twice. A distributed lock (Redis, database row) ensures single execution; give the lock an expiry so a crashed holder cannot block future runs.
Observability
- Did the task run?
- Did it succeed?
- How long did it take?
- What did it process?
Logs to a central system; metrics on success/failure; alerts on missed runs.
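One way to answer those four questions is to wrap every task so that it emits a single structured record per run. The sketch below is illustrative only: `run_with_record`, the field names, and the convention that a task returns a count of processed items are all assumptions.

```python
import json
import time
from datetime import datetime, timezone


def run_with_record(task_name, task, emit=print):
    """Run task() and emit one JSON record describing the run."""
    record = {
        "task": task_name,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    start = time.monotonic()
    try:
        record["processed"] = task()  # assumed: task returns an item count
        record["status"] = "success"
    except Exception as exc:
        record["status"] = "failure"
        record["error"] = repr(exc)
        raise  # still fail loudly so the scheduler sees the failure
    finally:
        record["duration_seconds"] = round(time.monotonic() - start, 3)
        emit(json.dumps(record))  # in practice, ship this to the central log system
    return record


records = []
result = run_with_record("nightly-cleanup", lambda: 42, emit=records.append)
```

Because the record is emitted in `finally`, a failed run still leaves a trace with its duration and error.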
Retry policy
A network blip should trigger a retry, but not every error should be retried. Decide:
- Transient errors: retry with backoff
- Permanent errors: alert; don't retry
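The split between transient and permanent errors might look like the following sketch. `TransientError` and `PermanentError` are hypothetical marker classes; a real task would map its own exceptions (timeouts, validation errors) onto them.

```python
import time


class TransientError(Exception):
    """Hypothetical marker: worth retrying (timeouts, dropped connections)."""


class PermanentError(Exception):
    """Hypothetical marker: a retry cannot fix it (bad input, missing permissions)."""


def run_with_retry(task, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff; let everything else propagate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller alert
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
        # PermanentError is not caught, so it propagates on the first attempt.


calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("connection reset")
    return "done"


delays = []
result = run_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting. Adding random jitter to the delay is a common refinement when many jobs retry against the same dependency.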
Alerting on failure
If the cleanup job fails, someone needs to know. Email, Slack, page — depends on severity.
Alerting on missed runs
The job didn't run at all (scheduler down), so failure alerts never fire. Detect this with a monitor that expects a check-in on schedule and alerts when one is missing.
Specific patterns
Heartbeat checks
Job sends a heartbeat to a monitoring service after each success. The service alerts if no heartbeat arrives within the expected window.
Tools: Healthchecks.io, Cronitor, Better Stack heartbeats.
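The monitoring side of a heartbeat check reduces to one comparison. The sketch below shows that logic in isolation; `is_overdue` and its parameters are illustrative and not the API of any of the tools listed.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_heartbeat, period, grace, now):
    """True if the next heartbeat should have arrived by `now` but has not."""
    return now > last_heartbeat + period + grace


# The job last reported success at 02:05 UTC and is expected once a day.
last = datetime(2024, 1, 1, 2, 5, tzinfo=timezone.utc)
period = timedelta(days=1)
grace = timedelta(minutes=30)  # slack for normal variation in run time

on_time = is_overdue(last, period, grace, datetime(2024, 1, 2, 2, 20, tzinfo=timezone.utc))
missed = is_overdue(last, period, grace, datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc))
print(on_time, missed)
```

The same check catches both a job that failed and a job that never started, because neither sends a heartbeat.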
Dead letter handling
Job processes a queue. Failed messages go to a dead letter queue for investigation.
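A minimal in-memory sketch of that flow, assuming each queue entry is a `(message, attempts)` pair. Real queue services usually provide the dead letter queue as configuration, so the `drain` function here is purely illustrative.

```python
from collections import deque


def drain(queue, handle, dead_letters, max_attempts=3):
    """Process every message; park messages that keep failing instead of blocking the queue."""
    processed = 0
    while queue:
        message, attempts = queue.popleft()
        try:
            handle(message)
            processed += 1
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Keep the error alongside the message for later investigation.
                dead_letters.append(
                    {"message": message, "error": repr(exc), "attempts": attempts}
                )
            else:
                queue.append((message, attempts))  # requeue for another try
    return processed


def handle(message):
    if message == "bad":
        raise ValueError("cannot parse message")


queue = deque([("a", 0), ("bad", 0), ("b", 0)])
dead_letters = []
processed = drain(queue, handle, dead_letters)
print(processed, dead_letters)
```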
Distributed locks for clusters
Multiple instances might try to run the job. Lock prevents duplicate execution:
```python
# redis_lock stands in for whatever lock helper the codebase provides
with redis_lock("nightly-cleanup", timeout=300):
    do_cleanup()
```
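To show what such a lock helper does internally, here is a sketch of the same acquire, expire, and release shape built on a database row instead of Redis, so it runs with only the standard library. The `job_locks` table, `row_lock`, and `LockHeld` are hypothetical names.

```python
import sqlite3
import time
import uuid
from contextlib import contextmanager


class LockHeld(Exception):
    """Raised when another run currently holds the lock."""


@contextmanager
def row_lock(conn, name, timeout, now=time.time):
    owner = uuid.uuid4().hex  # identifies this holder so it only releases its own lock
    try:
        with conn:  # one transaction: clear an expired lease, then try to take the lock
            conn.execute(
                "DELETE FROM job_locks WHERE name = ? AND expires_at <= ?",
                (name, now()),
            )
            conn.execute(
                "INSERT INTO job_locks (name, owner, expires_at) VALUES (?, ?, ?)",
                (name, owner, now() + timeout),
            )
    except sqlite3.IntegrityError:  # primary key clash: someone else holds it
        raise LockHeld(name) from None
    try:
        yield
    finally:
        with conn:
            conn.execute(
                "DELETE FROM job_locks WHERE name = ? AND owner = ?", (name, owner)
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_locks (name TEXT PRIMARY KEY, owner TEXT NOT NULL, expires_at REAL NOT NULL)"
)

events = []
with row_lock(conn, "nightly-cleanup", timeout=300):
    events.append("first run started")
    try:
        with row_lock(conn, "nightly-cleanup", timeout=300):
            events.append("second run started")
    except LockHeld:
        events.append("second run skipped")
print(events)
```

The expiry is a lease, not a guarantee: if a run takes longer than `timeout`, another instance can acquire the lock while the first is still working, so the timeout should comfortably exceed the longest expected run.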
Schedules in code, not config
Define schedules in deployable code (Terraform, Kubernetes manifests). Not in someone's crontab on a specific server.
Time zones
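Which time zone does a cron schedule use? Classic cron uses the server's local time; Kubernetes CronJobs use the kube-controller-manager's time zone unless `spec.timeZone` is set; EventBridge rule cron expressions are evaluated in UTC. Match user expectations or document the choice explicitly.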
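A short illustration of why the choice matters: "02:00 local" maps to different UTC times across the year when the local zone observes daylight saving. Fixed offsets stand in for a real zone here to keep the sketch dependency-free (an assumption: US Eastern, UTC-5 in winter and UTC-4 in summer).

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: fixed offsets approximating US Eastern time.
EASTERN_WINTER = timezone(timedelta(hours=-5))
EASTERN_SUMMER = timezone(timedelta(hours=-4))


def local_2am_in_utc(year, month, day, tz):
    """Return the UTC instant at which a '02:00 local' job would fire."""
    return datetime(year, month, day, 2, 0, tzinfo=tz).astimezone(UTC)


winter = local_2am_in_utc(2024, 1, 15, EASTERN_WINTER)
summer = local_2am_in_utc(2024, 7, 15, EASTERN_SUMMER)
print(winter.strftime("%H:%M"), summer.strftime("%H:%M"))  # 07:00 06:00
```

A schedule written in UTC therefore drifts by an hour relative to local business hours twice a year. In real code, `zoneinfo.ZoneInfo("America/New_York")` (Python 3.9+) handles the transitions instead of hand-picked offsets.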
Common failure patterns
Server-specific cron
Crontab on one server. Server dies; tasks stop. Nobody notices for weeks.
No alerting on failure
Job fails silently. Real impact only visible when something downstream breaks.
No idempotency
Retries cause duplicates. Daily report sent twice.
Long-running jobs without checkpointing
Job runs for 2 hours; fails 1.5 hours in; restart from scratch. With checkpointing, restart from where it failed.
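A minimal checkpointing sketch, assuming the work is an ordered list of items and the checkpoint is a small JSON file. The function name and file layout are illustrative.

```python
import json
import os
import tempfile


def process_with_checkpoint(items, handle, checkpoint_path):
    """Process items in order, saving the next index after each one so a restart can resume."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for index in range(start, len(items)):
        handle(items[index])
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)  # atomic rename: never a half-written checkpoint
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)  # finished: the next scheduled run starts fresh


items = ["a", "b", "c", "d", "e"]
handled = []
checkpoint = os.path.join(tempfile.mkdtemp(), "cleanup.checkpoint.json")


def fail_on_d(item):
    if item == "d":
        raise RuntimeError("simulated crash")
    handled.append(item)


try:
    process_with_checkpoint(items, fail_on_d, checkpoint)
except RuntimeError:
    pass  # first run dies at "d"; the checkpoint records that "a" to "c" are done

process_with_checkpoint(items, handled.append, checkpoint)  # second run resumes at "d"
print(handled)
```

A crash between `handle` and the checkpoint write means that item is processed again on restart, so checkpointing still relies on `handle` being idempotent.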
Tight schedules
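Job scheduled every 5 minutes; sometimes takes 10. Multiple copies run simultaneously and conflict. Either lengthen the interval or forbid overlap (a lock, or `concurrencyPolicy: Forbid` on a Kubernetes CronJob).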
No history
"Did the cleanup run yesterday?" — nobody knows.
Manual re-runs
When a job fails, someone has to kick it off again by hand. A re-run should be one click; ideally it is automatic.
A reasonable starter pattern
For new scheduled tasks:
1. Schedule defined in code (Terraform, Kubernetes manifest, or cloud scheduler)
2. Idempotent execution
3. Distributed lock if multiple instances might run
4. Heartbeat to monitoring service
5. Alert on failure
6. Logs to central system
For an existing chaotic cron-based setup, migrate one task at a time to a structured framework.
Failure patterns in summary
- **Tasks that just stop running.** No detection.
- **Tasks that run on the wrong schedule.** Wrong timezone; off-by-one.
- **Tasks that fail without alerting.** Silent rot.
- **Tasks that run twice.** No idempotency or locking.
- **Tasks that take too long.** Runs outlast the schedule interval and overlap.
Further Reading
- [RunbookAutomation](RunbookAutomation) — Adjacent automation
- [DevOpsFundamentals](DevOpsFundamentals) — Broader practice
- [AwsLambdaPatterns](AwsLambdaPatterns) — Cloud function scheduling
- [DevOpsAndSre Hub](DevOpsAndSreHub) — Cluster index