How to Monitor Backup Jobs and Get Alerts on Failure

Backups are only useful if they actually run. Too many organizations discover their backup failed only when they need to restore. Here's how to set up reliable monitoring for all your backup jobs.

The Backup Monitoring Problem

Backup failures are uniquely dangerous because:

They're silent - No user notices when a backup doesn't run
They're infrequent - Daily or weekly jobs are easy to forget
They're critical - Discovered too late, the consequences are severe
They're assumed - "The backup runs every night... right?"

Without monitoring, failed backups go unnoticed for days, weeks, or until disaster strikes.

What to Monitor

Backup Completion

Did the backup job finish successfully?

#!/bin/bash
# Ping the monitor only if pg_dump exits successfully
if pg_dump database > backup.sql; then
  curl -fsS https://wizstatus.com/ping/backup-complete
fi

Backup File Validity

Did it actually produce a valid backup?

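# Check the file exists and is larger than 1000 bytes
# (wc -c is portable; stat -f%z works only on BSD/macOS, stat -c%s on GNU/Linux)
if [ -s backup.sql ] && [ "$(wc -c < backup.sql)" -gt 1000 ]; then
  curl -fsS https://wizstatus.com/ping/backup-valid
fi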

Backup Duration

Is it taking longer than expected?

START=$(date +%s)
pg_dump database > backup.sql
END=$(date +%s)
DURATION=$((END - START))

curl "https://wizstatus.com/ping/backup?duration=$DURATION"

Storage Space

Is there room for backups?

# df -P keeps each filesystem on one line, so the column positions are stable
USED=$(df -P /backup | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$USED" -gt 90 ]; then
  # Alert separately for disk space
  echo "Backup disk nearly full: ${USED}%"
fi

Database Backup Monitoring

PostgreSQL

#!/bin/bash
# postgres-backup.sh

BACKUP_DIR="/backup/postgres"
DATE=$(date +%Y%m%d)
BACKUP_FILE="$BACKUP_DIR/prod-$DATE.sql.gz"

# Make the pipeline fail if pg_dump fails, not only if gzip fails
set -o pipefail

# Create the backup, then verify it is non-empty and passes gzip's integrity test
if pg_dump production | gzip > "$BACKUP_FILE" \
    && [ -s "$BACKUP_FILE" ] \
    && gunzip -t "$BACKUP_FILE"; then
  curl -fsS https://wizstatus.com/ping/postgres-backup
fi

MySQL/MariaDB

#!/bin/bash
# mysql-backup.sh

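# Compute the file name once so the dump and the check use the same path
BACKUP_FILE="/backup/mysql-$(date +%Y%m%d).sql.gz"

mysqldump --all-databases | gzip > "$BACKUP_FILE"

# PIPESTATUS[0] is mysqldump's exit status; $? alone would report only gzip's
if [ "${PIPESTATUS[0]}" -eq 0 ] && [ -s "$BACKUP_FILE" ]; then
  curl -fsS https://wizstatus.com/ping/mysql-backup
fi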

MongoDB

#!/bin/bash
# mongo-backup.sh

mongodump --out=/backup/mongo-$(date +%Y%m%d)

if [ $? -eq 0 ]; then
  curl -fsS https://wizstatus.com/ping/mongo-backup
fi

File Backup Monitoring

Rsync Backups

#!/bin/bash
# rsync-backup.sh

rsync -avz --delete /data/ /backup/data/

if [ $? -eq 0 ]; then
  curl -fsS https://wizstatus.com/ping/file-backup
fi

Restic/Borg Backups

#!/bin/bash
# restic-backup.sh

restic backup /important-data

if [ $? -eq 0 ]; then
  restic check
  if [ $? -eq 0 ]; then
    curl -fsS https://wizstatus.com/ping/restic-backup
  fi
fi

Cloud Backup Monitoring

AWS S3 Backup

#!/bin/bash
# s3-backup.sh

aws s3 sync /data s3://my-backup-bucket/$(date +%Y%m%d)/

if [ $? -eq 0 ]; then
  curl -fsS https://wizstatus.com/ping/s3-backup
fi

Monitoring Cloud Backup Services

For services like AWS Backup, Azure Backup, or GCP:

Use cloud-native events (SNS, Event Grid)
Create a webhook endpoint that pings your monitor
Or poll backup status via API
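
The polling option could look like the following for AWS Backup. This is a minimal sketch, assuming the AWS CLI is installed and has credentials; the function names `count_backup_jobs` and `check_aws_backup` and the `aws-backup` monitor name in the ping URL are hypothetical.

```shell
#!/bin/bash
# Sketch: ping the monitor only if AWS Backup reports at least one
# completed job and no failed jobs created after a given timestamp.
# Function names and the "aws-backup" monitor name are illustrative assumptions.

count_backup_jobs() {
  # $1 = job state (COMPLETED or FAILED), $2 = ISO 8601 timestamp
  aws backup list-backup-jobs \
    --by-state "$1" \
    --by-created-after "$2" \
    --query 'length(BackupJobs)' \
    --output json
}

check_aws_backup() {
  local since="$1" completed failed

  # Treat a CLI error as a failure rather than as "zero jobs"
  completed=$(count_backup_jobs COMPLETED "$since") || return 2
  failed=$(count_backup_jobs FAILED "$since") || return 2

  if [ "$completed" -gt 0 ] && [ "$failed" -eq 0 ]; then
    curl -fsS https://wizstatus.com/ping/aws-backup
  else
    echo "AWS Backup: $completed completed, $failed failed since $since" >&2
    return 1
  fi
}
```

Run from cron shortly after the backup window, a call such as `check_aws_backup 2024-01-01T00:00:00Z` (with the timestamp set to the start of the window) keeps the heartbeat alive only when the service reports a clean run.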

Setting Up Heartbeat Monitoring

Step 1: Create Monitors

For each backup job:

Create a heartbeat monitor
Set the expected schedule (e.g., "Daily at 2:00 AM")
Set the grace period (typical backup duration + 30 minutes)
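
As a worked example of the grace-period rule, the sketch below adds the 30-minute buffer to a typical backup duration. The function name `grace_period_minutes` and the 45-minute figure are illustrative assumptions.

```shell
#!/bin/bash
# grace_period_minutes: typical backup duration (in minutes) plus a 30-minute buffer
grace_period_minutes() {
  local typical_duration_min="$1"
  echo $(( typical_duration_min + 30 ))
}

# A backup that usually takes 45 minutes gets a 75-minute grace period
grace_period_minutes 45   # prints 75
```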

Step 2: Configure Alerts

Backup failures are critical:

Email to backup admin
Slack to ops channel
SMS for database backups
PagerDuty for production systems

Step 3: Add Pings to Scripts

Modify each backup script to ping on success:

# At the end of successful backup
curl -fsS --retry 3 https://wizstatus.com/ping/YOUR-TOKEN
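
Rather than editing every script by hand, one option is a small wrapper that runs any backup command and pings only when it succeeds. This is a sketch, not a required pattern; the function name `run_and_ping` is an assumption, and `YOUR-TOKEN` remains a placeholder.

```shell
#!/bin/bash
# run_and_ping: run a backup command and ping the monitor only if it succeeds.
# Usage: run_and_ping PING_URL COMMAND [ARGS...]
# The function name is an illustrative assumption.

run_and_ping() {
  local ping_url="$1"
  shift

  if "$@"; then
    # Retry the ping so a brief network blip does not look like a failed backup
    curl -fsS --retry 3 "$ping_url" > /dev/null
  else
    local status=$?
    echo "Backup command failed with exit code $status: $*" >&2
    return "$status"
  fi
}
```

For example, `run_and_ping https://wizstatus.com/ping/YOUR-TOKEN rsync -a --delete /data/ /backup/data/` pings only when rsync exits with status 0. If the command fails, the ping is skipped and the heartbeat monitor raises the alert once the grace period expires.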

Verification Beyond Monitoring

Heartbeat monitoring confirms the job ran, not that the backup can be restored. Also verify:

Test Restores

Periodically restore backups to verify they work:

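#!/bin/bash
# Monthly restore test
# Assumes /backup/latest.dump is a custom-format dump (pg_dump -Fc);
# plain SQL dumps are restored with psql instead of pg_restore.

# Start from a clean scratch database
dropdb --if-exists test_restore
createdb test_restore

# Restore, then run a validation query; ping only if both succeed
if pg_restore -d test_restore /backup/latest.dump \
    && psql -d test_restore -c "SELECT count(*) FROM users"; then
  curl -fsS https://wizstatus.com/ping/restore-test
fi

# Drop the scratch database whether or not the restore succeeded
dropdb test_restore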

Backup Size Monitoring

Alert if backup size changes dramatically:

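# wc -c is portable; stat -f%z works only on BSD/macOS
TODAY_SIZE=$(wc -c < today-backup.sql)
YESTERDAY_SIZE=$(wc -c < yesterday-backup.sql)

# Skip the comparison if yesterday's backup is empty (avoids division by zero)
if [ "$YESTERDAY_SIZE" -gt 0 ]; then
  CHANGE=$(( (TODAY_SIZE - YESTERDAY_SIZE) * 100 / YESTERDAY_SIZE ))

  if [ "$CHANGE" -lt -50 ] || [ "$CHANGE" -gt 100 ]; then
    echo "Backup size changed by ${CHANGE}%"
    # Send alert about unusual change
  fi
fi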

Common Backup Failures

Disk Full

# Check before starting (df -Pk reports available space in 1 KB blocks)
AVAILABLE=$(df -Pk /backup | tail -1 | awk '{print $4}')
REQUIRED=10000000  # about 10 GB, in KB

if [ "$AVAILABLE" -lt "$REQUIRED" ]; then
  echo "Insufficient space for backup"
  exit 1
fi

Permission Errors

# Test write access
touch /backup/.write-test
if [ $? -ne 0 ]; then
  echo "Cannot write to backup directory"
  exit 1
fi
rm /backup/.write-test

Network Timeouts

# For remote backups, check connectivity first
# -w 5 gives up after 5 seconds instead of hanging on an unreachable host
if ! nc -z -w 5 backup-server 22 2>/dev/null; then
  echo "Cannot reach backup server"
  exit 1
fi

Best Practices

1. Ping Only on Verified Success

Don't just check the exit code. Verify the backup:

File exists
File has reasonable size
File passes integrity check (gunzip -t, etc.)
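
The three checks above can be combined into a single function that gates the ping. This is a minimal sketch for gzip-compressed dumps; the function name `verify_backup` and the 1000-byte default threshold are assumptions to adjust for your own backups.

```shell
#!/bin/bash
# verify_backup: succeed only if the file exists, is at least min_bytes
# in size, and passes gzip's integrity test.
# The function name and the 1000-byte default are illustrative assumptions.

verify_backup() {
  local file="$1"
  local min_bytes="${2:-1000}"
  local size

  # 1. File exists
  [ -f "$file" ] || { echo "Missing: $file" >&2; return 1; }

  # 2. File has a reasonable size
  size=$(wc -c < "$file")
  [ "$size" -ge "$min_bytes" ] || { echo "Too small ($size bytes): $file" >&2; return 1; }

  # 3. File passes the integrity check
  gzip -t "$file" || { echo "Corrupt archive: $file" >&2; return 1; }
}
```

A script would then end with something like `verify_backup "$BACKUP_FILE" && curl -fsS https://wizstatus.com/ping/postgres-backup`, so the monitor hears nothing unless all three checks pass.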

2. Keep Backup Logs

exec > /var/log/backup-$(date +%Y%m%d).log 2>&1
# All output goes to log

3. Alert on Long Duration

If a backup that usually takes 30 minutes takes 3 hours, investigate.
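
One way to automate this comparison is to measure the run and test it against a baseline. The sketch below assumes you already know the usual duration; the function name `duration_is_abnormal` and the default factor of 2 are illustrative choices.

```shell
#!/bin/bash
# duration_is_abnormal: succeed (exit 0) when the measured duration exceeds
# the usual duration multiplied by a threshold factor.
# The function name and the default factor of 2 are illustrative assumptions.

duration_is_abnormal() {
  local duration_s="$1"     # measured duration in seconds
  local baseline_s="$2"     # usual duration in seconds
  local factor="${3:-2}"    # integer multiple that counts as abnormal

  [ "$duration_s" -gt $(( baseline_s * factor )) ]
}

# Example: a 3-hour run against a 30-minute baseline
if duration_is_abnormal $((3 * 3600)) $((30 * 60)); then
  echo "Backup took far longer than usual; investigate" >&2
fi
```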

4. Separate Monitoring per Environment

Production and staging backups should have separate monitors.
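
A shared backup script can select the monitor from the environment it runs in. This is only a sketch; the function name `monitor_url` and the `PRODUCTION-TOKEN` and `STAGING-TOKEN` placeholders are hypothetical and stand in for your own monitor tokens.

```shell
#!/bin/bash
# monitor_url: map an environment name to its own heartbeat URL.
# The function name and both token placeholders are illustrative assumptions.

monitor_url() {
  case "$1" in
    production) echo "https://wizstatus.com/ping/PRODUCTION-TOKEN" ;;
    staging)    echo "https://wizstatus.com/ping/STAGING-TOKEN" ;;
    *)          echo "Unknown environment: $1" >&2; return 1 ;;
  esac
}
```

Failing on an unknown environment is deliberate: a typo in the environment name then skips the ping and surfaces as a missed heartbeat instead of silently pinging the wrong monitor.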

5. Document Retention Policy

Know how many backups you keep and verify rotation works.
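
Rotation can be checked by counting the backups on disk against the documented retention. The sketch below assumes backups are flat files in a single directory; the function names `count_backups` and `rotation_ok` are hypothetical.

```shell
#!/bin/bash
# Sketch: verify that rotation keeps the number of backups within policy.
# Function names are illustrative assumptions.

# count_backups: number of regular files in a directory matching a name pattern
count_backups() {
  local dir="$1" pattern="$2"
  find "$dir" -maxdepth 1 -type f -name "$pattern" | wc -l | tr -d ' '
}

# rotation_ok: succeed when at least one backup exists and no more than
# max_keep backups are present
rotation_ok() {
  local dir="$1" pattern="$2" max_keep="$3"
  local count
  count=$(count_backups "$dir" "$pattern")
  [ "$count" -ge 1 ] && [ "$count" -le "$max_keep" ]
}
```

With a 7-day retention (an assumed figure), `rotation_ok /backup/postgres 'prod-*.sql.gz' 7 || echo "Retention check failed"` flags both a rotation job that has stopped deleting old backups and a directory with no backups at all.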

Monitoring Checklist

All databases have backup monitoring
All file backups have monitoring
Heartbeat monitors match backup schedules
Grace periods account for backup duration
Critical alerts go to on-call
Monthly restore tests are scheduled
Backup logs are retained
Disk space is monitored separately

Never discover a failed backup when you need to restore. Set up heartbeat monitoring for all your backup jobs with WizStatus.