Service Level Agreements (SLI / SLO / SLA)
Reliability is the most important feature of any system. To manage it, we use a tiered framework of indicators, objectives, and agreements.
- **SLI (Service Level Indicator):** A quantitative measure of some aspect of the level of service that is provided (e.g., Latency, Throughput, Availability).
- **SLO (Service Level Objective):** A target value or range of values for a service level that is measured by an SLI (e.g., 99.9% of requests succeed).
- **SLA (Service Level Agreement):** A legal contract that defines what happens if the SLO is not met (e.g., financial credits to the customer).
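To make the distinction between an indicator and an objective concrete, here is a minimal Python sketch. The function names `availability_sli` and `meets_slo` and the request counts are hypothetical, chosen only for illustration; they are not part of any library.

```python
def availability_sli(successes: int, total: int) -> float:
    """SLI: the measured fraction of requests that succeeded (0.0 to 1.0)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: is the measured SLI at or above the target?"""
    return sli >= slo_target


# Hypothetical measurements for one window (assumed values).
sli = availability_sli(successes=999_500, total=1_000_000)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLI is what you measure and the SLO is the threshold you compare it against. The SLA sits outside the code entirely, because it describes the contractual consequence of a missed objective.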
The Math of Availability
Availability is typically expressed in "nines." Each additional nine cuts the allowed downtime by a factor of ten, so moving from three nines to four nines demands roughly an order of magnitude more operational rigor.
The "Nines" Table (30-Day Window)
| Availability % | Downtime per Month | Downtime per Year | Description |
|---|---|---|---|
| **99%** (Two) | 7.2 hours | 3.65 days | Standard for internal/non-critical tools. |
| **99.9%** (Three) | 43.2 minutes | 8.77 hours | Typical for high-quality SaaS products. |
| **99.95%** (3.5) | 21.6 minutes | 4.38 hours | Standard for critical enterprise services. |
| **99.99%** (Four) | 4.32 minutes | 52.6 minutes | "Gold Standard"; requires full automation. |
| **99.999%** (Five) | 25.9 seconds | 5.26 minutes | Global infrastructure (Carrier Grade). |
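Downtime allowances like these come from multiplying the window length by the unavailable fraction. The sketch below shows that arithmetic. The helper name `allowed_downtime` is hypothetical, and the window lengths (30 days, and 365.25 days for a year) are assumptions you can change.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, window: timedelta) -> timedelta:
    """Maximum downtime permitted in `window` at the given availability."""
    unavailable_fraction = 1 - availability_pct / 100
    return timedelta(seconds=window.total_seconds() * unavailable_fraction)


# Assumed window lengths: a 30-day month and a 365.25-day year.
MONTH = timedelta(days=30)
YEAR = timedelta(days=365.25)

for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    monthly = allowed_downtime(pct, MONTH)
    yearly = allowed_downtime(pct, YEAR)
    print(f"{pct}%: {monthly} per 30 days, {yearly} per year")
```

Passing the window as a parameter keeps the window-length assumption explicit, which matters because a calendar month and a 30-day window give slightly different allowances.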
Availability Calculation Formula
$$\text{Availability} = \frac{\text{Total Time} - \text{Downtime}}{\text{Total Time}} \times 100\%$$

## Error Budgets: The Discipline of Risk
An Error Budget is the amount of unreliability you are willing to tolerate in a given window. It is the bridge between Product (feature velocity) and Engineering (reliability).
Calculation
For a 99.9% SLO over a 30-day window:
- **Total Requests:** $1{,}000{,}000$
- **Allowed Failures:** $1{,}000{,}000 \times (1 - 0.999) = 1{,}000$

If 800 requests have already failed, 200 of the 1,000 allowed failures are left, so **20% of your error budget remains**.
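The same calculation can be sketched in Python. The function names `error_budget` and `budget_remaining_fraction` are hypothetical, and the request counts are the ones from the example above.

```python
def error_budget(total_requests: int, slo_target: float) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return total_requests * (1 - slo_target)


def budget_remaining_fraction(
    total_requests: int, slo_target: float, failures: int
) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(total_requests, slo_target)
    return (budget - failures) / budget


budget = error_budget(1_000_000, 0.999)
remaining = budget_remaining_fraction(1_000_000, 0.999, failures=800)
print(f"Allowed failures: {budget:.0f}, budget remaining: {remaining:.0%}")
```

A negative return value signals that the SLO has already been violated for the window, which is why the sketch does not clamp the result at zero.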
Burn Rate: The Proactive Signal
Burn rate is how fast you are consuming your error budget relative to the time window. It is the primary signal used for SRE paging.
Burn Rate Formula
$$\text{Burn Rate} = \frac{\frac{\text{Budget Consumed}}{\text{Time Window Consumed}}}{\frac{\text{Total Budget}}{\text{Total Time Window}}}$$
- **Burn Rate = 1:** You will consume exactly your entire budget by the end of the window.
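- **Burn Rate > 1:** If sustained, you will exhaust the budget before the window ends and violate your SLO unless you take action.
- **Burn Rate = 14.4:** On a 30-day window, you will consume 100% of your budget in 50 hours (about 2 days), which is the same as spending 2% of the budget every hour. This usually triggers a **Critical Page**.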
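As a rough sketch of the burn rate formula, the snippet below works with fractions of the budget and of the window, which is algebraically the same ratio. The function names are hypothetical, and the 30-day (720-hour) window is an assumption.

```python
def burn_rate(budget_consumed_fraction: float, window_elapsed_fraction: float) -> float:
    """Budget spend relative to an even spend across the whole window."""
    if window_elapsed_fraction <= 0:
        raise ValueError("window_elapsed_fraction must be positive")
    return budget_consumed_fraction / window_elapsed_fraction


def hours_to_exhaustion(rate: float, window_hours: float) -> float:
    """Hours until a full budget is gone if `rate` is sustained from zero."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return window_hours / rate


WINDOW_HOURS = 30 * 24  # assumed 30-day window

# 2% of the budget spent during 1 hour of a 720-hour window.
rate = burn_rate(0.02, 1 / WINDOW_HOURS)
print(f"burn rate: {rate:.1f}")
print(f"hours to exhaustion: {hours_to_exhaustion(rate, WINDOW_HOURS):.1f}")
```

Because the time to exhaustion is the window length divided by the burn rate, the same burn rate means a different number of hours on a 28-day window than on a 30-day one.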
Implementing SLOs
1. Identify "Golden Signals"
- **Latency:** Time it takes to service a request.
- **Traffic:** Demand placed on the system.
- **Errors:** Rate of requests that fail.
- **Saturation:** How "full" the service is (e.g., CPU/Memory).
2. Define the Window
A **28-day rolling window** is often preferred over a calendar month. It always contains exactly four of each weekday, so weekly patterns such as a recurring "bad Tuesday" carry the same weight in every window, and it avoids the 28/30/31-day variance between months.
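The weekday argument is easy to check with a short sketch. The helper `weekday_counts` is hypothetical and the start date is arbitrary; any start date gives the same result for a 28-day span.

```python
from collections import Counter
from datetime import date, timedelta


def weekday_counts(start: date, days: int) -> Counter:
    """Count how many times each weekday (0 = Monday) occurs in the span."""
    return Counter((start + timedelta(days=i)).weekday() for i in range(days))


start = date(2024, 1, 1)  # arbitrary example date
print("28-day window:", sorted(weekday_counts(start, 28).values()))
print("31-day month: ", sorted(weekday_counts(start, 31).values()))
```

In the 28-day span every weekday appears four times, while a 31-day month gives three weekdays a fifth occurrence, so month-to-month comparisons can be skewed by which weekdays happen to repeat.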
3. Set Alerting Thresholds
- **Fast Burn:** Consume 2% of budget in 1 hour (Page).
- **Slow Burn:** Consume 5% of budget in 24 hours (Ticket/Slack).
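One way to turn these thresholds into alert conditions is to convert each "X% of budget in Y hours" rule into an equivalent burn rate. This is a sketch under stated assumptions: the function names, the `"page"`/`"ticket"`/`"ok"` labels, and the idea of passing in burn rates measured over the last 1 hour and last 24 hours are all illustrative rather than a prescribed implementation.

```python
def burn_rate_threshold(
    budget_fraction: float, alert_hours: float, window_days: int
) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent in `alert_hours`."""
    window_hours = window_days * 24
    return budget_fraction * window_hours / alert_hours


def classify(rate_1h: float, rate_24h: float, window_days: int) -> str:
    """Map observed burn rates to an alert action (labels are illustrative)."""
    # Fast burn: 2% of the budget in 1 hour.
    if rate_1h >= burn_rate_threshold(0.02, 1, window_days):
        return "page"
    # Slow burn: 5% of the budget in 24 hours.
    if rate_24h >= burn_rate_threshold(0.05, 24, window_days):
        return "ticket"
    return "ok"


for days in (28, 30):
    fast = burn_rate_threshold(0.02, 1, days)
    slow = burn_rate_threshold(0.05, 24, days)
    print(f"{days}-day window: fast >= {fast:.2f}, slow >= {slow:.2f}")
```

Keeping `window_days` as an explicit argument matters because the burn rate that corresponds to a given percentage depends on the SLO window you chose in the previous step.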
Further Reading
- [SiteReliabilityEngineering](SiteReliabilityEngineering)
- [IncidentManagement](IncidentManagement)
- [MonitoringAndAlerting](MonitoringAndAlerting)
- [ObservabilityDesign](ObservabilityDesign)