A well-crafted uptime SLA protects both service providers and customers. It sets clear expectations, a measurement methodology, and consequences for missed targets.
Vague or poorly structured SLAs lead to disputes and damaged business relationships. This guide provides a comprehensive framework for effective agreements.
What is an Uptime SLA?
An uptime SLA is a formal commitment defining the minimum availability level a provider guarantees to customers.
Key Components
A complete SLA specifies:
- Uptime percentage target (e.g., 99.9%)
- Measurement period (monthly, quarterly, annually)
- What counts as downtime
- Exclusions from calculations
- How availability is measured
- Compensation when targets are missed
Purpose
SLAs serve multiple purposes:
- Set customer expectations
- Drive provider accountability
- Provide contractual protection for both parties
- Serve as competitive differentiators
Why Effective SLAs Matter
Poorly constructed SLAs create significant problems for both parties.
Common Issues with Bad SLAs
Vague language around downtime leads to disputes:
Provider interpretation: "Maintenance doesn't count"
Customer expectation: "I expected maintenance to count as downtime"
Result: Conflict during incidents
Unclear measurement makes verification impossible:
Provider: "Our monitoring shows 99.95%"
Customer: "Our monitoring shows 99.7%"
Result: Distrust, disputes
Missing remedies leave customers without recourse when providers fail.
Risks of Overly Aggressive SLAs
Promising 99.99% when your infrastructure supports 99.9% leads to:
- Inevitable failures
- Financial penalties
- Reputation damage
Risks of Overly Weak SLAs
SLAs that don't commit to meaningful availability:
- May cost competitive deals
- Signal lack of confidence in reliability
- Provide no customer assurance
Essential SLA Template Components
A comprehensive uptime SLA includes these sections.
1. Service Definition
Precisely specify what's covered:
Covered Services:
- Production web application (app.example.com)
- Customer-facing API (api.example.com)
- User authentication service
Not Covered:
- Development/staging environments
- Internal admin tools
- Third-party integrations
2. Uptime Commitment
State the guaranteed percentage and period clearly:
Uptime Commitment: 99.9% monthly availability
This allows roughly 43 minutes of unplanned downtime per month
(43.2 minutes in a 30-day month; the exact figure varies with month length).
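The exact downtime allowance depends on how many days the month has. The following is a minimal sketch of that arithmetic, assuming the commitment is applied per calendar month; the function name `downtime_budget_minutes` is hypothetical and not part of any SLA tooling.

```python
import calendar


def downtime_budget_minutes(target_pct: float, year: int, month: int) -> float:
    """Minutes of downtime permitted in one calendar month at target_pct availability."""
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    # The allowance is the share of the month not covered by the commitment.
    return total_minutes * (100 - target_pct) / 100
```

At 99.9%, this gives about 40.3 minutes for a 28-day February, 43.2 minutes for a 30-day month, and 44.6 minutes for a 31-day month.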
3. Downtime Definition
Explicitly define what constitutes downtime:
Downtime is defined as:
- Service returns HTTP 5xx errors for >50% of requests
- Service response time exceeds 30 seconds
- Service is completely unreachable
Downtime is NOT:
- Degraded but functional performance
- Issues affecting <5% of users
- Scheduled maintenance (with proper notice)
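A definition like this is easiest to apply when it can be evaluated mechanically for each measurement interval. Here is a rough sketch, assuming one aggregated sample per minute; the `MinuteSample` structure and its field names are illustrative assumptions, and the thresholds mirror the example criteria above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MinuteSample:
    total_requests: int
    server_errors: int                 # responses with an HTTP 5xx status
    response_time_s: Optional[float]   # None means the service was unreachable


def is_downtime(sample: MinuteSample,
                error_ratio_threshold: float = 0.5,
                max_response_time_s: float = 30.0) -> bool:
    """Return True if the sample meets any of the example downtime criteria."""
    if sample.response_time_s is None:
        return True  # completely unreachable
    if sample.response_time_s > max_response_time_s:
        return True  # response time exceeds the limit
    if sample.total_requests > 0:
        # Strictly more than the threshold share of requests returned 5xx.
        return sample.server_errors / sample.total_requests > error_ratio_threshold
    return False
```

Scheduled maintenance is not handled here; under this sketch it would be filtered out before or after classification, using the notice rules the SLA defines.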
4. Measurement Methodology
Explain how availability is calculated:
Measurement Methodology:
- Checked every 1 minute from 5 global locations
- Downtime recorded when 3+ locations report failure
- Measured by [third-party monitoring service]
- Monthly availability calculated as:
(Total minutes - Downtime minutes) / Total minutes × 100
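As an illustration of the methodology above, the sketch below counts a minute as downtime only when at least 3 probe locations report a failure, then applies the availability formula. The input shape (one list of per-location failure flags per minute) is an assumption made for the example, not the format of any particular monitoring service.

```python
def availability_pct(checks: list[list[bool]], quorum: int = 3) -> float:
    """Monthly availability from per-minute probe results.

    checks[i] holds the results for minute i, one bool per location,
    where True means that location reported a failure.
    """
    total_minutes = len(checks)
    if total_minutes == 0:
        raise ValueError("no checks recorded")
    # A minute counts as downtime only when the failure quorum is reached.
    downtime_minutes = sum(1 for minute in checks if sum(minute) >= quorum)
    return (total_minutes - downtime_minutes) / total_minutes * 100
```

For a 30-day month (43,200 minutes) with 30 minutes of quorum-confirmed failure, this yields roughly 99.93%, which meets a 99.9% commitment.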
5. Exclusions
List what doesn't count toward downtime:
Exclusions (not counted as downtime):
- Scheduled maintenance with 72-hour notice
- Emergency maintenance for security issues
- Customer-caused issues (API misuse, etc.)
- Force majeure events
- Third-party service failures confirmed on the vendor's status page (AWS, Stripe, etc.)
- Issues during beta feature usage
6. Maintenance Windows
Specify maintenance policies:
Scheduled Maintenance:
- Occurs: Sundays 02:00-06:00 UTC
- Notice: Minimum 72 hours advance notice
- Maximum: 4 hours per month
- Communication: Email to designated contacts
Emergency Maintenance:
- For critical security issues only
- Best-effort advance notice
- Post-incident report within 24 hours
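Whether a maintenance event qualifies for exclusion can be checked against these rules directly. The sketch below assumes the example policy above (Sundays 02:00-06:00 UTC, 72 hours notice, 4 hours per month) and timezone-aware datetimes; the function and constant names are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

WINDOW_START = time(2, 0)          # 02:00 UTC
WINDOW_END = time(6, 0)            # 06:00 UTC
MIN_NOTICE = timedelta(hours=72)
MONTHLY_CAP = timedelta(hours=4)


def qualifies_as_scheduled(start: datetime, end: datetime, notice_sent: datetime,
                           already_used: timedelta = timedelta(0)) -> bool:
    """Check a maintenance event against the example scheduled-maintenance policy."""
    # Normalize to UTC; all inputs must be timezone-aware.
    start, end, notice_sent = (t.astimezone(timezone.utc)
                               for t in (start, end, notice_sent))
    in_window = (
        start.weekday() == 6                       # Sunday
        and start.date() == end.date()
        and WINDOW_START <= start.time() < end.time() <= WINDOW_END
    )
    enough_notice = start - notice_sent >= MIN_NOTICE
    within_cap = already_used + (end - start) <= MONTHLY_CAP
    return in_window and enough_notice and within_cap
```

Maintenance that fails any of these checks would count as ordinary downtime under this reading of the policy.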
7. Remedy Provisions
Define compensation for missed targets:
Service Credits:
| Monthly Uptime | Credit (% of monthly fee) |
|---------------------|---------------------------|
| 99.0% to < 99.9% | 10% |
| 95.0% to < 99.0% | 25% |
| Below 95.0% | 50% |
Maximum credit: 50% of monthly fee
Credits applied to future invoices
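The tier lookup can be expressed as a small function. This sketch assumes each tier includes its lower bound (so exactly 99.0% earns the 10% credit, not 25%) and that meeting the 99.9% target earns no credit; both function names are illustrative.

```python
def service_credit_pct(monthly_uptime_pct: float) -> int:
    """Credit as a percentage of the monthly fee, per the example tiers."""
    if monthly_uptime_pct >= 99.9:
        return 0      # target met, no credit
    if monthly_uptime_pct >= 99.0:
        return 10
    if monthly_uptime_pct >= 95.0:
        return 25
    return 50         # also the maximum credit


def service_credit_amount(monthly_fee: float, monthly_uptime_pct: float) -> float:
    """Credit in currency units, to be applied to a future invoice."""
    return monthly_fee * service_credit_pct(monthly_uptime_pct) / 100
```

For example, a customer paying 1,000 per month who sees 99.5% uptime would receive a credit of 100.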
8. Claim Process
Explain how customers request credits:
Claim Process:
1. Submit claim within 30 days of incident
2. Include: Date, time, duration, impact description
3. Submit to: support@example.com
4. Response within 5 business days
5. Credit applied within 1 billing cycle if approved
Required Evidence:
- Timestamp of first detected issue
- Duration of impact
- Description of business impact
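A provider could screen incoming claims against these rules before a human reviews them. The following is a loose sketch, assuming the 30-day window is measured in calendar days from the incident date; the `Claim` fields and the `claim_problems` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

CLAIM_WINDOW = timedelta(days=30)


@dataclass
class Claim:
    incident_date: date
    submitted_on: date
    duration_minutes: int
    impact_description: str


def claim_problems(claim: Claim) -> list[str]:
    """Return a list of reasons the claim is incomplete or late (empty if none)."""
    problems = []
    if claim.submitted_on < claim.incident_date:
        problems.append("submission date precedes the incident")
    elif claim.submitted_on - claim.incident_date > CLAIM_WINDOW:
        problems.append("submitted more than 30 days after the incident")
    if claim.duration_minutes <= 0:
        problems.append("missing duration of impact")
    if not claim.impact_description.strip():
        problems.append("missing description of business impact")
    return problems
```

Returning the list of problems, instead of a bare yes/no, lets the provider tell the customer exactly what to correct.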
SLA Template Example
Here's a condensed template you can adapt:
SERVICE LEVEL AGREEMENT
1. SERVICES COVERED
[List specific services]
2. UPTIME COMMITMENT
Provider commits to [X]% monthly availability
for the services listed above.
3. DOWNTIME DEFINITION
Downtime means [specific criteria].
Partial degradation below [threshold] counts as downtime.
4. MEASUREMENT
Availability measured by [methodology].
Calculations based on [time period].
5. EXCLUSIONS
The following are excluded from downtime calculations:
- [List exclusions]
6. SCHEDULED MAINTENANCE
- Window: [Days/times]
- Notice required: [Hours/days]
- Maximum duration: [Hours per period]
7. REMEDIES
| Availability Level | Credit |
|---|---|
| [X]% - [Y]% | [Z]% |
8. CLAIM PROCESS
Claims must be submitted within [days] to [contact].
Response within [days].
SLA Creation Best Practices
Start with Achievable Commitments
Base SLAs on historical performance, not aspirational targets.
If historical uptime = 99.95%
Then SLA commitment = 99.9% (with buffer)
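One way to make this rule of thumb concrete is to pick the highest common SLA tier that sits strictly below your worst recent month. This is only a sketch of one possible heuristic: the tier list and the choice of the worst month as the baseline are assumptions, not an industry standard.

```python
STANDARD_TIERS = (99.99, 99.95, 99.9, 99.5, 99.0)  # assumed candidate commitments


def suggest_commitment(monthly_uptime_history, tiers=STANDARD_TIERS):
    """Highest tier strictly below the worst observed month, or None if none fits."""
    worst_month = min(monthly_uptime_history)
    for tier in sorted(tiers, reverse=True):
        if tier < worst_month:
            return tier
    return None  # history does not support any listed tier
```

With a history whose worst month is 99.95%, this suggests a 99.9% commitment, matching the example above.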
Define Terms Precisely
Avoid vague language:
| Vague | Precise |
|---|---|
| "Reasonable notice" | "72 hours advance notice" |
| "Brief outages" | "Outages lasting less than 1 minute" |
| "Best effort response" | "Response within 15 minutes" |
Use Objective Measurement
Third-party monitoring prevents disputes:
Good: "Measured by Datadog/WizStatus/Pingdom"
Bad: "Measured by provider's internal systems"
Both parties can verify the data independently.
Make Exclusions Reasonable
Balance provider protection with customer value:
Too broad: "Any issue involving third parties"
(Could exclude almost anything)
Appropriate: "AWS regional outages confirmed by
AWS status page"
(Specific, verifiable)
Align Remedies with Impact
Credits should be meaningful but sustainable:
Too weak: 5% credit for major outage
(Doesn't motivate reliability)
Too severe: 100% refund for any downtime
(Unsustainable, discourages SLAs)
Balanced: 10-50% tiered credits
(Meaningful motivation, sustainable)
Review Periodically
SLAs are living documents. Review annually for:
- Alignment with current capabilities
- Market competitiveness
- Customer feedback
- Incident learnings
Common SLA Mistakes to Avoid
Mistake 1: Unmeasurable Commitments
Bad: "We guarantee great uptime"
Good: "We guarantee 99.9% monthly availability
measured by external monitoring"
Mistake 2: No Exclusion Boundaries
Bad: "Excludes any third-party issues"
Good: "Excludes documented AWS outages
affecting multiple customers"
Mistake 3: Unclear Credit Calculation
Bad: "Credits provided at provider discretion"
Good: "10% credit per 0.1% below target,
maximum 50% of monthly fee"
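A formula like the one in the "Good" example can be computed without any discretion. The sketch below assumes a 99.9% target and that any partial 0.1-point shortfall rounds up to a full step; the wording above does not settle that point, so a real SLA should state it. Decimal strings are used to avoid floating-point rounding at tier boundaries.

```python
from decimal import Decimal, ROUND_CEILING


def linear_credit_pct(uptime_pct: str, target_pct: str = "99.9") -> int:
    """10% credit per 0.1 percentage point below target, capped at 50%."""
    shortfall = Decimal(target_pct) - Decimal(uptime_pct)
    if shortfall <= 0:
        return 0  # target met
    # Assumption: a partial step counts as a whole step.
    steps = int((shortfall / Decimal("0.1")).to_integral_value(rounding=ROUND_CEILING))
    return min(steps * 10, 50)
```

Under these assumptions, 99.7% uptime yields a 20% credit, and anything at or below 99.4% hits the 50% cap.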
Mistake 4: Impossible Claim Process
Bad: "Submit detailed technical logs
within 24 hours"
Good: "Submit date/time of issue
within 30 days"
Negotiating SLAs
As a Provider
- Start conservative, offer higher tiers at premium pricing
- Ensure exclusions are clear and documented
- Commit to a level slightly below actual historical performance to leave a reasonable buffer
As a Customer
- Request measurement methodology details
- Clarify all exclusions before signing
- Negotiate meaningful remedies
- Ask for historical uptime data
Conclusion
Effective uptime SLAs require attention to clarity, measurability, and fairness. Follow the template framework and best practices in this guide to create agreements that protect both parties.
Remember that the best SLA is one you can consistently achieve and whose compliance you can clearly demonstrate. It should strengthen the provider-customer relationship, not damage it.