Cron jobs are the backbone of automated tasks on Unix systems. But when they fail silently, you might not know until critical processes break. Here's how to set up reliable monitoring for all your cron jobs.
Why Monitor Cron Jobs?
Cron jobs fail silently. Common issues include:
- Job never starts - Typo in crontab, wrong path
- Job crashes - Runtime errors, missing dependencies
- Job hangs - Infinite loops, deadlocks
- Job runs but fails - Database errors, network issues
- Server reboots - Jobs scheduled during the downtime are skipped, and the cron daemon may not be enabled at boot
Without monitoring, you discover these problems when users complain or data goes stale.
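Some of the "never starts" causes can be caught before the job is ever scheduled. The following is a minimal sketch, not a complete validator: check_job_path is a hypothetical helper name, and it only tests that the path exists and is executable.

```shell
#!/bin/bash
# check_job_path PATH
# Hypothetical helper (not part of cron): reports whether a script path
# exists and is executable, the two path problems cron will not tell you about.
check_job_path() {
    if [ ! -e "$1" ]; then
        echo "missing: $1"
        return 1
    elif [ ! -x "$1" ]; then
        echo "not executable: $1"
        return 1
    fi
    echo "ok: $1"
}
```

For example, check_job_path /home/user/backup.sh prints one of the three messages and returns non-zero when the path is unusable.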
Method 1: Heartbeat Monitoring (Recommended)
The most reliable approach: your cron job pings a monitoring service each time it completes successfully, and the service alerts you when an expected ping doesn't arrive.
Step 1: Get Your Ping URL
Create a heartbeat monitor in your monitoring service. You'll receive a unique URL like:
https://wizstatus.com/ping/abc123
Step 2: Modify Your Cron Job
Add a ping after successful execution:
Before:
0 2 * * * /home/user/backup.sh
After:
0 2 * * * /home/user/backup.sh && curl -fsS --retry 3 https://wizstatus.com/ping/abc123
The && ensures the ping runs only if the script exits with status 0. Any non-zero exit skips the ping, and the missing ping is what triggers the alert.
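The short-circuit behavior is easy to confirm locally. This sketch uses stand-in functions instead of a real job and a real ping, so it makes no network calls; all three function names are hypothetical.

```shell
#!/bin/bash
# Stand-ins so the demo needs no network access (all three names are hypothetical).
ping_monitor() { echo "ping sent"; }
succeeding_job() { return 0; }
failing_job() { return 3; }

# Exit status 0: the right-hand side of && runs.
succeeding_job && ping_monitor

# Non-zero exit status: && short-circuits and the ping is skipped.
failing_job && ping_monitor

# The chain's status is the job's status, so the failure is still visible.
echo "chain exit status: $?"
```

Running it prints "ping sent" once, followed by "chain exit status: 3".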
Step 3: Configure Expected Schedule
In your monitoring dashboard:
- Set schedule: "Daily at 2:00 AM"
- Set grace period: 30-60 minutes (depending on job duration)
Step 4: Test the Setup
Run the job manually, then verify both paths:
- Success: the ping appears in your monitoring dashboard
- Failure: make the job fail on purpose and confirm you get an alert once the grace period expires
Method 2: Wrapper Script
For complex jobs, create a wrapper:
#!/bin/bash
# cron-wrapper.sh: run a command and ping the monitor only if it succeeds
PING_URL="$1"
shift
# Run the command exactly as passed; "$@" preserves arguments that contain spaces
"$@"
EXIT_CODE=$?
# Ping only on success
if [ "$EXIT_CODE" -eq 0 ]; then
    curl -fsS --retry 3 "$PING_URL"
else
    echo "Job failed with exit code $EXIT_CODE" >&2
fi
exit "$EXIT_CODE"
Usage:
0 2 * * * /home/user/cron-wrapper.sh https://wizstatus.com/ping/abc123 /home/user/backup.sh
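A wrapper like this waits for the command indefinitely, so a job that hangs also holds up the wrapper. One possible extension, assuming GNU coreutils timeout is installed, is to cap the run time. run_with_timeout is a hypothetical helper name, and the limit is whatever value you pass in.

```shell
#!/bin/bash
# run_with_timeout LIMIT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: assumes GNU coreutils `timeout` is on the PATH.
run_with_timeout() {
    local limit="$1"
    local status=0
    shift
    # "$@" keeps each argument intact, including ones that contain spaces
    timeout "$limit" "$@" || status=$?
    # `timeout` exits with 124 when the time limit was reached
    if [ "$status" -eq 124 ]; then
        echo "Job exceeded ${limit}s and was stopped" >&2
    fi
    return "$status"
}
```

Inside the wrapper this could replace the line that runs the command, for example run_with_timeout 3600 "$@". A timed-out job returns non-zero, so no ping is sent and the monitor raises the alert.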
Method 3: Email + Log Monitoring
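Traditional but less reliable approach: cron emails any output the job produces to the MAILTO address, and the job appends the same output to a log file: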
MAILTO=alerts@yourcompany.com
0 2 * * * /home/user/backup.sh 2>&1 | tee -a /var/log/backup.log
Drawbacks:
- Email can be delayed or filtered
- Doesn't catch jobs that don't run at all
- Requires parsing logs
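To give a sense of what "parsing logs" involves, here is a rough sketch that scans a log for failure keywords. The function name and the keyword list are assumptions for illustration; real job output will need its own patterns.

```shell
#!/bin/bash
# log_has_errors FILE
# Hypothetical helper: succeeds (exit 0) if FILE contains a line that looks
# like a failure. The keyword list is an assumption; adjust it to what your
# job actually prints.
log_has_errors() {
    grep -qiE 'error|failed|fatal' "$1"
}
```

A check such as log_has_errors /var/log/backup.log still says nothing about a job that never ran, and because the log is appended to, an old error keeps matching until the file is rotated.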
Cron Syntax Refresher
┌───────────── minute (0-59)
│ ┌───────────── hour (0-23)
│ │ ┌───────────── day of month (1-31)
│ │ │ ┌───────────── month (1-12)
│ │ │ │ ┌───────────── day of week (0-6, Sunday=0)
│ │ │ │ │
* * * * * command
Common schedules:
* * * * * # Every minute
0 * * * * # Every hour
0 0 * * * # Daily at midnight
0 2 * * * # Daily at 2 AM
0 0 * * 0 # Weekly on Sunday
0 0 1 * * # Monthly on the 1st
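A schedule with a missing or extra field is an easy typo to make. As a small sanity check (a sketch only: it does not validate ranges or special strings such as @daily), the hypothetical helper below counts the whitespace-separated fields in a schedule expression. The standard form shown above has five.

```shell
#!/bin/bash
# count_cron_fields "EXPR"
# Hypothetical helper: prints how many whitespace-separated fields EXPR has.
count_cron_fields() {
    # Run in a subshell so `set -f` (which stops "*" expanding to file names)
    # and the changed positional parameters do not leak into the caller
    (
        set -f
        # Deliberately unquoted: word splitting does the counting
        set -- $1
        echo "$#"
    )
}

count_cron_fields "0 2 * * *"    # prints 5
count_cron_fields "0 2 * *"      # prints 4 (a field is missing)
```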
Monitoring Different Job Types
Backup Jobs
#!/bin/bash
# backup.sh
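# Compute the file name once so the dump and the check use the same path,
# even if the job runs across midnight
BACKUP_FILE="/backup/db-$(date +%Y%m%d).sql"
# Ping only if pg_dump succeeds and the backup file is non-empty
if pg_dump production > "$BACKUP_FILE" && [ -s "$BACKUP_FILE" ]; then
    curl -fsS https://wizstatus.com/ping/backup-token
fi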
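A non-empty check passes even when the dump was cut short. A slightly stricter variant, sketched here with a hypothetical helper and a threshold you would choose from the size of past backups, compares the file size against a minimum.

```shell
#!/bin/bash
# file_at_least FILE MIN_BYTES
# Hypothetical helper: succeeds if FILE exists and holds at least MIN_BYTES bytes.
file_at_least() {
    [ -f "$1" ] || return 1
    local size
    # Arithmetic expansion strips any padding some `wc` versions print
    size=$(( $(wc -c < "$1") ))
    [ "$size" -ge "$2" ]
}
```

In backup.sh this could stand in for the -s test, for example file_at_least "/backup/db-$(date +%Y%m%d).sql" 1048576, where the 1 MiB figure is only a placeholder.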
Report Generation
#!/bin/bash
# daily-report.sh
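# Cron starts jobs in the user's home directory; use an absolute path
# if generate_report.py lives elsewhere
python generate_report.py
REPORT_EXIT=$?
# Ping only if the generator succeeded and today's report exists
if [ "$REPORT_EXIT" -eq 0 ] && [ -f "/reports/daily-$(date +%Y%m%d).pdf" ]; then
    curl -fsS https://wizstatus.com/ping/report-token
fi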
Data Sync Jobs
#!/bin/bash
# sync.sh
rsync -avz /source/ /destination/
RSYNC_EXIT=$?
if [ "$RSYNC_EXIT" -eq 0 ]; then
    curl -fsS https://wizstatus.com/ping/sync-token
else
    echo "Rsync failed with code $RSYNC_EXIT" >&2
fi
# Propagate rsync's status so callers and wrappers see the failure
exit "$RSYNC_EXIT"
Queue Processors
For continuous processors, ping periodically:
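import time
import requests

PING_URL = "https://wizstatus.com/ping/queue-token"
PING_INTERVAL = 300  # 5 minutes
last_ping = 0.0

while True:
    process_next_job()  # your job handler, defined elsewhere
    if time.time() - last_ping > PING_INTERVAL:
        try:
            # Time out quickly so a slow monitor never stalls the queue
            requests.get(PING_URL, timeout=10)
        except requests.RequestException:
            pass  # a failed ping must not crash the processor
        last_ping = time.time()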
Handling Job Duration
For jobs that might exceed the grace period:
Start/End Pings
#!/bin/bash
# Ping start
curl -fsS https://wizstatus.com/ping/job-token/start
# Long-running job: send the completion ping only if it succeeds
if ./long-backup-process.sh; then
    curl -fsS https://wizstatus.com/ping/job-token
fi
Dynamic Grace Periods
Estimate job duration and set grace accordingly:
- Short jobs (< 5 min): 10 minute grace
- Medium jobs (5-30 min): 45 minute grace
- Long jobs (30+ min): Job duration + 30 minutes
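The tiers above translate directly into a small lookup. This is a sketch with a hypothetical function name; it takes the typical job duration in whole minutes and assumes that a job of exactly 30 minutes counts as "long", since the tiers overlap at that value.

```shell
#!/bin/bash
# grace_minutes DURATION_MINUTES
# Hypothetical helper: prints a grace period in minutes using the tiers above.
# Assumption: a job of exactly 30 minutes is treated as "long".
grace_minutes() {
    local duration="$1"
    if [ "$duration" -lt 5 ]; then
        echo 10
    elif [ "$duration" -lt 30 ]; then
        echo 45
    else
        echo $(( duration + 30 ))
    fi
}

grace_minutes 3     # prints 10
grace_minutes 20    # prints 45
grace_minutes 90    # prints 120
```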
Common Issues and Solutions
Issue: Ping fails due to network
# Add retries and timeout
curl -fsS --retry 3 --retry-delay 10 --max-time 30 "$PING_URL"
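When the ping is sent from inside a script rather than chained with && in the crontab, a monitoring outage arguably should not change the job's own exit status. One way to arrange that is sketched below with a hypothetical send_ping helper; the curl options are the same ones shown above.

```shell
#!/bin/bash
# send_ping URL
# Hypothetical helper: tries to ping with retries and a timeout, and always
# returns 0 so that a failed ping never changes the job's exit status.
send_ping() {
    if ! curl -fsS --retry 3 --retry-delay 10 --max-time 30 "$1" > /dev/null; then
        echo "Warning: could not reach monitor at $1" >&2
    fi
    return 0
}
```

The trade-off is that a ping lost this way still shows up as a missed heartbeat in the dashboard, which is usually the safer direction to fail in.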
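Issue: Job runs as root but curl isn't found
Cron runs jobs with a minimal PATH, whichever user owns the crontab. Use the full path to curl (find it with command -v curl) and the literal ping URL, since a variable such as $PING_URL is empty unless you define it in the crontab:
0 2 * * * /path/to/backup.sh && /usr/bin/curl -fsS https://wizstatus.com/ping/abc123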
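Issue: Environment variables not available
Cron does not load your shell profile. Define the variables the job needs in the crontab or in the script:
PATH=/usr/local/bin:/usr/bin:/bin
PING_URL=https://wizstatus.com/ping/abc123
0 2 * * * /path/to/backup.sh && curl -fsS "$PING_URL"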
Issue: Job output clutters logs
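Redirect output to a log file. Note that > overwrites the log on every run; use >> to append instead:
0 2 * * * /path/to/job.sh > /var/log/job.log 2>&1 && curl -fsS https://wizstatus.com/ping/abc123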
Testing Your Setup
Verify cron is running
systemctl status cron   # the service is named crond on RHEL, CentOS, and Fedora
# or
service cron status
Test job execution
# Run manually
/path/to/your/script.sh
# Check if ping was received
# (verify in monitoring dashboard)
Simulate failure
# Temporarily add this line near the top of the script, then run the job
exit 1
# Verify no ping is sent
# Verify alert is triggered after grace period
# Remove the line once the alert has been confirmed
Checklist
- Identified all cron jobs to monitor
- Created heartbeat monitors for each job
- Matched monitor schedule to cron schedule
- Set appropriate grace periods
- Modified cron entries to ping on success
- Tested successful execution pings
- Tested failure scenarios
- Set up notification channels
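For the first checklist item, a quick way to see what is actually scheduled is to filter the crontab down to its job lines. The helper below is a sketch with a hypothetical name; it reads crontab text on standard input and drops comments, blank lines, and variable settings such as MAILTO.

```shell
#!/bin/bash
# active_cron_lines
# Hypothetical helper: reads crontab text on stdin and prints only the job
# lines, dropping comments, blank lines, and NAME=value settings.
active_cron_lines() {
    grep -Ev '^[[:space:]]*(#|$)' |
        grep -Ev '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*[[:space:]]*='
}
```

Typical usage is crontab -l | active_cron_lines. That covers only the current user's crontab; system-wide entries in /etc/crontab and /etc/cron.d need a separate pass.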