Backups are only useful if they actually run. Too many organizations discover their backup failed only when they need to restore. Here's how to set up reliable monitoring for all your backup jobs.
The Backup Monitoring Problem
Backup failures are uniquely dangerous because:
- They're silent - No user notices when a backup doesn't run
- They're infrequent - Daily or weekly jobs are easy to forget
- They're critical - Discovered too late, the consequences are severe
- They're assumed - "The backup runs every night... right?"
Without monitoring, failed backups go unnoticed for days, weeks, or until disaster strikes.
What to Monitor
Backup Completion
Did the backup job finish successfully?
#!/bin/bash
pg_dump database > backup.sql
if [ $? -eq 0 ]; then
curl https://wizstatus.com/ping/backup-complete
fi
Backup File Validity
Did it actually produce a valid backup?
# Check file exists and has size
if [ -s backup.sql ] && [ $(stat -f%z backup.sql) -gt 1000 ]; then
curl https://wizstatus.com/ping/backup-valid
fi
Backup Duration
Is it taking longer than expected?
START=$(date +%s)
pg_dump database > backup.sql
END=$(date +%s)
DURATION=$((END - START))
curl "https://wizstatus.com/ping/backup?duration=$DURATION"
Storage Space
Is there room for backups?
USED=$(df /backup | tail -1 | awk '{print $5}' | sed 's/%//')
if [ $USED -gt 90 ]; then
# Alert separately for disk space
echo "Backup disk nearly full: $USED%"
fi
Database Backup Monitoring
PostgreSQL
#!/bin/bash
# postgres-backup.sh
BACKUP_DIR="/backup/postgres"
DATE=$(date +%Y%m%d)
BACKUP_FILE="$BACKUP_DIR/prod-$DATE.sql.gz"
# Create backup
pg_dump production | gzip > "$BACKUP_FILE"
# Verify
if [ $? -eq 0 ] && [ -s "$BACKUP_FILE" ]; then
# Test backup integrity
gunzip -t "$BACKUP_FILE"
if [ $? -eq 0 ]; then
curl -fsS https://wizstatus.com/ping/postgres-backup
fi
fi
MySQL/MariaDB
#!/bin/bash
# mysql-backup.sh
mysqldump --all-databases | gzip > /backup/mysql-$(date +%Y%m%d).sql.gz
if [ ${PIPESTATUS[0]} -eq 0 ] && [ -s /backup/mysql-$(date +%Y%m%d).sql.gz ]; then
curl -fsS https://wizstatus.com/ping/mysql-backup
fi
MongoDB
#!/bin/bash
# mongo-backup.sh
mongodump --out=/backup/mongo-$(date +%Y%m%d)
if [ $? -eq 0 ]; then
curl -fsS https://wizstatus.com/ping/mongo-backup
fi
File Backup Monitoring
Rsync Backups
#!/bin/bash
# rsync-backup.sh
rsync -avz --delete /data/ /backup/data/
if [ $? -eq 0 ]; then
curl -fsS https://wizstatus.com/ping/file-backup
fi
Restic/Borg Backups
#!/bin/bash
# restic-backup.sh
restic backup /important-data
if [ $? -eq 0 ]; then
restic check
if [ $? -eq 0 ]; then
curl -fsS https://wizstatus.com/ping/restic-backup
fi
fi
Cloud Backup Monitoring
AWS S3 Backup
#!/bin/bash
# s3-backup.sh
aws s3 sync /data s3://my-backup-bucket/$(date +%Y%m%d)/
if [ $? -eq 0 ]; then
curl -fsS https://wizstatus.com/ping/s3-backup
fi
Monitoring Cloud Backup Services
For services like AWS Backup, Azure Backup, or GCP:
- Use cloud-native events (SNS, Event Grid)
- Create a webhook endpoint that pings your monitor
- Or poll backup status via API
Setting Up Heartbeat Monitoring
Step 1: Create Monitors
For each backup job:
- Create a heartbeat monitor
- Set the expected schedule (e.g., "Daily at 2:00 AM")
- Set grace period (backup duration + 30 minutes)
Step 2: Configure Alerts
Backup failures are critical:
- Email to backup admin
- Slack to ops channel
- SMS for database backups
- PagerDuty for production systems
Step 3: Add Pings to Scripts
Modify each backup script to ping on success:
# At the end of successful backup
curl -fsS --retry 3 https://wizstatus.com/ping/YOUR-TOKEN
Verification Beyond Monitoring
Monitoring confirms the job ran. Also verify:
Test Restores
Periodically restore backups to verify they work:
#!/bin/bash
# Monthly restore test
createdb test_restore
pg_restore -d test_restore /backup/latest.dump
if [ $? -eq 0 ]; then
# Run some validation queries
psql -d test_restore -c "SELECT count(*) FROM users"
dropdb test_restore
curl https://wizstatus.com/ping/restore-test
fi
Backup Size Monitoring
Alert if backup size changes dramatically:
TODAY_SIZE=$(stat -f%z today-backup.sql)
YESTERDAY_SIZE=$(stat -f%z yesterday-backup.sql)
CHANGE=$(( (TODAY_SIZE - YESTERDAY_SIZE) * 100 / YESTERDAY_SIZE ))
if [ $CHANGE -lt -50 ] || [ $CHANGE -gt 100 ]; then
echo "Backup size changed by ${CHANGE}%"
# Send alert about unusual change
fi
Common Backup Failures
Disk Full
# Check before starting
AVAILABLE=$(df /backup | tail -1 | awk '{print $4}')
REQUIRED=10000000 # 10GB in KB
if [ $AVAILABLE -lt $REQUIRED ]; then
echo "Insufficient space for backup"
exit 1
fi
Permission Errors
# Test write access
touch /backup/.write-test
if [ $? -ne 0 ]; then
echo "Cannot write to backup directory"
exit 1
fi
rm /backup/.write-test
Network Timeouts
# For remote backups, check connectivity first
if ! nc -z backup-server 22 2>/dev/null; then
echo "Cannot reach backup server"
exit 1
fi
Best Practices
1. Ping Only on Verified Success
Don't just check exit code. Verify the backup:
- File exists
- File has reasonable size
- File passes integrity check (gunzip -t, etc.)
2. Keep Backup Logs
exec > /var/log/backup-$(date +%Y%m%d).log 2>&1
# All output goes to log
3. Alert on Long Duration
If backup usually takes 30 minutes but takes 3 hours, investigate.
4. Separate Monitoring per Environment
Production and staging backups should have separate monitors.
5. Document Retention Policy
Know how many backups you keep and verify rotation works.
Monitoring Checklist
- All databases have backup monitoring
- All file backups have monitoring
- Heartbeat monitors match backup schedules
- Grace periods account for backup duration
- Critical alerts go to on-call
- Monthly restore tests are scheduled
- Backup logs are retained
- Disk space is monitored separately