Definition
Mean Time To Detect (MTTD) measures the average time between when a failure occurs and when it is detected by monitoring systems or users. MTTD is critical because you can't fix what you don't know is broken. Reducing MTTD through better monitoring directly improves MTTR and overall system availability. Effective synthetic monitoring, alerting, and observability practices are key to minimizing MTTD.
Examples
MTTD Impact on Total Downtime
How detection time affects total incident duration.
// Total incident time breakdown
const incident = {
failureTime: '10:00:00', // When failure occurred
detectionTime: '10:15:00', // When detected (MTTD: 15 min)
diagnosisTime: '10:25:00', // When root cause found
recoveryTime: '10:45:00', // When service restored
mttd: 15, // minutes
mttr: 30, // minutes (detection to recovery)
totalDowntime: 45 // minutes
};
// 33% of downtime was detection time!
// Better monitoring could reduce MTTD to < 1 minuteUse Cases
Evaluating monitoring system effectiveness
Optimizing alert thresholds and conditions
Justifying monitoring infrastructure investment
Incident post-mortem analysis
Best Practices
- Use synthetic monitoring with frequent checks (30-60 seconds)
- Monitor from multiple locations to avoid false negatives
- Set up multi-channel alerting for redundancy
- Implement anomaly detection for proactive alerts
- Regularly test monitoring systems
FAQ
Put MTTD Knowledge Into Practice
Start monitoring your infrastructure with WizStatus.
No credit card required • 20 free monitors forever