Definition
Mean Time To Repair (or Recovery) is a key reliability metric that measures the average time taken to restore a system to operational status after a failure. MTTR starts when a failure is detected and ends when normal service is restored. A lower MTTR indicates more efficient incident response and recovery processes. MTTR is one of the four key DevOps metrics (alongside deployment frequency, lead time, and change failure rate) used to measure software delivery performance.
Examples
MTTR Calculation
How to calculate MTTR from incident data.
// MTTR Calculation Example
const incidents = [
{ duration: 45 }, // 45 minutes
{ duration: 30 }, // 30 minutes
{ duration: 60 }, // 60 minutes
{ duration: 15 }, // 15 minutes
];
const totalDowntime = incidents.reduce((sum, i) => sum + i.duration, 0);
const mttr = totalDowntime / incidents.length;
// MTTR = 150 / 4 = 37.5 minutesUse Cases
Best Practices
- Automate detection to reduce time-to-detect
- Maintain runbooks for common failure scenarios
- Practice incident response through game days
- Implement automated recovery where possible
- Track MTTR trends over time to measure improvement
FAQ
Related Articles
best-practices
Alert Fatigue Prevention: Strategies for Effective Monitoring
Combat alert fatigue with proven prevention strategies. Learn how to reduce noise, prioritize alerts, and maintain effective monitoring without overwhelming your team.
devops
Chaos Engineering Monitoring: Measure Resilience in Action
Learn to monitor chaos engineering experiments effectively. Discover metrics, observability patterns, and analysis techniques for resilience testing.
devops
CI/CD Pipeline Monitoring: Ensure Fast, Reliable Deployments
Master CI/CD pipeline monitoring for reliable software delivery. Learn key metrics, alerting strategies, and optimization techniques for deployment pipelines.
Put MTTR Knowledge Into Practice
Start monitoring your infrastructure with WizStatus.
No credit card required • 20 free monitors forever