Traditional uptime monitoring checks whether services are available. Heartbeat monitoring does the opposite: it waits for your jobs to check in, and if they don't, something is wrong. This guide covers how to monitor scheduled tasks and background processes with heartbeats.
What is Heartbeat Monitoring?
Heartbeat monitoring (also called dead man's switch or cron monitoring) works differently from traditional monitoring:
Traditional monitoring: "Let me check if your service is up."
Heartbeat monitoring: "Tell me when you've completed your job."
Your scheduled task sends a "ping" to the monitoring service after successful completion. If the ping doesn't arrive within the expected timeframe, you get alerted.
How It Works
1. Configure expected schedule (e.g., "every hour")
2. Set grace period (e.g., "allow 5 minutes late")
3. Add ping URL to your job
4. Job runs and pings on success
5. If ping doesn't arrive → Alert
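The alerting logic behind these steps can be sketched in a few lines. This is a simplified model of what a heartbeat service does server-side; the function and parameter names are illustrative, not any particular product's API:

```python
from datetime import datetime, timedelta

def is_overdue(last_ping: datetime, expected_interval: timedelta,
               grace: timedelta, now: datetime) -> bool:
    """A job is overdue when no ping arrived within interval + grace."""
    deadline = last_ping + expected_interval + grace
    return now > deadline

# Hourly job with a 5-minute grace period, last pinged at 2:00 AM:
last = datetime(2024, 1, 1, 2, 0)
assert not is_overdue(last, timedelta(hours=1), timedelta(minutes=5),
                      datetime(2024, 1, 1, 3, 4))   # 4 minutes late: still OK
assert is_overdue(last, timedelta(hours=1), timedelta(minutes=5),
                  datetime(2024, 1, 1, 3, 6))       # 6 minutes late: alert
```

Note that the monitor never contacts your job; it only tracks the timestamp of the most recent ping.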
Why Traditional Monitoring Fails for Scheduled Tasks
You can't monitor a cron job with HTTP checks because:
- Cron jobs aren't web services: there's no endpoint to check
- They run intermittently: they aren't continuously available
- Failures are silent: there's no error page to detect
- They might "succeed" but produce wrong results
A backup job might crash silently. A data sync might fail to start. An email queue processor might hang. Without heartbeat monitoring, you won't know until it's too late.
Common Use Cases
Backup Verification
Your database backup runs every night at 2 AM. Is it actually running?
```bash
#!/bin/bash
# backup.sh
pg_dump mydb > backup.sql
if [ $? -eq 0 ]; then
  curl -fsS --retry 3 https://wizstatus.com/ping/your-token
fi
```
If the backup fails or never runs, you'll know.
Scheduled Data Processing
ETL jobs, report generation, data synchronization:
```python
import requests

def daily_report():
    generate_report()
    send_report()
    # Signal successful completion
    requests.get("https://wizstatus.com/ping/your-token")
```
Queue Worker Health
Ensure your background workers are processing jobs:
```javascript
// Process jobs and ping after each batch
async function processQueue() {
  await processBatch();
  await fetch('https://wizstatus.com/ping/your-token');
}
```
Scheduled Maintenance Tasks
Cache clearing, log rotation, cleanup scripts:
```bash
# cleanup.sh
find /tmp -type f -mtime +7 -delete
curl https://wizstatus.com/ping/your-token
```
Setting Up Heartbeat Monitoring
Step 1: Create a Heartbeat Monitor
- Log into your monitoring dashboard
- Create new monitor → Select "Heartbeat" type
- Name it descriptively (e.g., "Database Backup - Production")
- Configure schedule
Step 2: Define Expected Schedule
Common schedule options:
| Schedule | Use Case |
|---|---|
| Every 1 minute | Queue processors, real-time sync |
| Every 5 minutes | Frequent batch jobs |
| Every hour | Hourly reports, cache refresh |
| Every day | Daily backups, reports |
| Custom cron | Specific schedules |
Step 3: Set Grace Period
The grace period determines how late a ping can be before alerting. Set it based on:
- Job's typical duration
- Acceptable delay tolerance
- Network latency buffer
Example: Job runs at 2:00 AM and takes up to 30 minutes. Set grace period to 45 minutes.
Step 4: Add Ping to Your Job
Add a simple HTTP request at the end of successful execution:
```bash
# Using curl
curl -fsS --retry 3 https://wizstatus.com/ping/your-token

# Using wget
wget -q --spider https://wizstatus.com/ping/your-token
```
Important: Only ping on successful completion, not at the start.
Best Practices
Ping Only on Success
```python
import requests

PING_URL = "https://wizstatus.com/ping/your-token"

try:
    do_important_work()
    # Only ping if work completed successfully
    requests.get(PING_URL)
except Exception as e:
    log_error(e)
    # Don't ping - let the timeout trigger an alert
```
Handle Network Failures
The ping itself might fail. Use retries:
```bash
curl -fsS --retry 3 --retry-delay 5 "$PING_URL"
```
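The same retry pattern in Python, using only the standard library. This is a sketch; the retry count, delay, and timeout values are illustrative defaults you should tune for your environment:

```python
import time
import urllib.error
import urllib.request

def ping(url: str, retries: int = 3, delay: float = 5.0) -> bool:
    """Attempt the heartbeat ping a few times before giving up."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status < 400:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # transient network failure; retry after a short delay
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

Returning a boolean instead of raising keeps a failed ping from crashing the job that just succeeded.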
Use Descriptive Names
Bad: "Cron Job 1"
Good: "Daily Database Backup - Production PostgreSQL"
Set Appropriate Grace Periods
- Too short: False alarms from slow jobs
- Too long: Delayed detection of real failures
Monitor the Monitoring
If your job runs but the ping fails, you'll get false alerts. Consider:
- Retrying pings
- Logging ping status
- Alerting on repeated ping failures
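Logging each ping's outcome makes repeated delivery failures visible in your own logs, separately from the monitoring service. A minimal sketch using the standard library (function name and log format are illustrative):

```python
import logging
import urllib.error
import urllib.request

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("heartbeat")

def ping_with_logging(url: str) -> bool:
    """Record every ping attempt so repeated failures show up in logs."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            log.info("heartbeat ping ok: status %s", resp.status)
            return True
    except (urllib.error.URLError, OSError) as exc:
        # The job itself succeeded; only the ping delivery failed.
        log.warning("heartbeat ping failed: %s", exc)
        return False
```

A log alert on repeated "heartbeat ping failed" entries distinguishes a broken job from a broken network path to the monitor.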
Advanced Patterns
Start and End Pings
Some systems support tracking job duration:
```bash
# Signal job started
curl https://wizstatus.com/ping/your-token/start

# Do the work
./run-backup.sh

# Signal job completed
curl https://wizstatus.com/ping/your-token
```
This catches jobs that start but hang.
Exit Code Reporting
Send the exit code for detailed tracking:
```bash
./my-script.sh
EXIT_CODE=$?
curl "https://wizstatus.com/ping/your-token?exit_code=$EXIT_CODE"
```
Conditional Pinging
Only ping under certain conditions:
```python
if records_processed > 0:
    requests.get(PING_URL)
else:
    # Something might be wrong if no records
    notify_team("Zero records processed - investigate")
```
Integrating with Different Systems
Cron (Linux)
```bash
0 2 * * * /path/to/backup.sh && curl -fsS https://wizstatus.com/ping/token
```
Task Scheduler (Windows)
Create a batch wrapper:
```batch
@echo off
call your-script.bat
if %ERRORLEVEL%==0 (
  curl https://wizstatus.com/ping/your-token
)
```
Docker Containers
```dockerfile
CMD ["sh", "-c", "./run-job.sh && curl https://ping-url"]
```
Kubernetes CronJobs
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: backup:latest  # replace with your backup image
              command:
                - /bin/sh
                - -c
                - "./backup.sh && curl https://wizstatus.com/ping/token"
```
CI/CD Pipelines
GitHub Actions:
```yaml
- name: Notify completion
  if: success()
  run: curl https://wizstatus.com/ping/token
```
Serverless Functions
AWS Lambda:
```python
import requests

PING_URL = "https://wizstatus.com/ping/your-token"

def handler(event, context):
    process_data()
    requests.get(PING_URL)
    return {"statusCode": 200}
```
Troubleshooting
False Alarms
Causes:
- Job taking longer than grace period
- Network issues preventing ping delivery
- Clock skew between systems
Solutions:
- Increase grace period
- Add ping retries
- Check system time synchronization
Missed Actual Failures
Causes:
- Ping sent before job truly completes
- Job partially completes then fails
- Wrong success criteria
Solutions:
- Ping at true end of job
- Use start/end pings for long jobs
- Validate job output before pinging
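The last point, validating job output before pinging, might look like this for a backup job. A sketch using only the standard library; the size threshold and file path are illustrative assumptions, not fixed rules:

```python
import os
import urllib.request

PING_URL = "https://wizstatus.com/ping/your-token"  # example URL from this guide
MIN_BACKUP_BYTES = 1024  # a near-empty dump usually means the backup failed

def verify_and_ping(backup_path: str) -> bool:
    """Ping only if the backup file exists and is plausibly non-empty."""
    if not os.path.exists(backup_path):
        return False  # job never produced output
    if os.path.getsize(backup_path) < MIN_BACKUP_BYTES:
        return False  # job "succeeded" but the result looks wrong
    urllib.request.urlopen(PING_URL, timeout=10)
    return True
```

This catches the case the guide warns about earlier: a job that exits zero but produces wrong results.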
Heartbeat Monitoring Checklist
- Identified all scheduled jobs needing monitoring
- Created heartbeat monitors with descriptive names
- Set schedules matching job frequency
- Configured appropriate grace periods
- Added ping calls to jobs (only on success)
- Implemented ping retries for network resilience
- Tested alert flow end-to-end
- Documented monitor configurations
Conclusion
Heartbeat monitoring catches the silent failures that traditional monitoring misses. Any job that runs on a schedule (backups, reports, sync processes, cleanup tasks) should have heartbeat monitoring.
The principle is simple: if your job doesn't check in on time, something is wrong. This early warning prevents cascading failures and data loss.