SLAs, SLOs & SLIs

How to define and measure service reliability using Service Level Agreements, Objectives, and Indicators. Real examples from production systems.

Why Reliability Needs Definitions

When an engineer says a system is 'reliable,' what does that mean? Reliable enough for an internal dashboard? For a payment processor? For a pacemaker? Without precise definitions, 'reliable' is meaningless. The SLI/SLO/SLA framework, popularized by Google's Site Reliability Engineering (SRE) book, provides a precise vocabulary for measuring reliability, setting targets for it, and writing those targets into contracts.

Service Level Indicators (SLIs)

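An SLI is a carefully defined quantitative measure of some aspect of the level of service. It is usually expressed as a ratio or percentage (good events divided by total events), although some SLIs, such as latency percentiles, are raw measurements. In every case it is derived from real measurements of the system. Good SLIs are: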

Measurable — Derivable from actual system telemetry (logs, metrics, traces)
Meaningful — Correlated with user experience; if the SLI is good, users are happy
Actionable — When it degrades, engineers know what to investigate

Service Type	Common SLIs
Web service / API	Request success rate, P99 latency, error rate
Storage system	Read/write success rate, durability (data loss rate), throughput
Data pipeline	Freshness (how current is the data), correctness rate, throughput
Message queue	Message delivery rate, end-to-end latency, redelivery rate

📌

Example SLI definition

Availability SLI = (Number of successful HTTP requests with status < 500) / (Total HTTP requests) × 100%, measured over a rolling 30-day window, excluding maintenance windows pre-announced at least 72 hours in advance.
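The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `Request` record and its `in_maintenance` flag are hypothetical names, and it assumes the caller has already selected the requests that fall inside the 30-day window.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one HTTP request (field names are assumptions)."""
    status: int                   # HTTP status code returned to the client
    in_maintenance: bool = False  # True if served during a pre-announced window


def availability_sli(requests: Iterable[Request]) -> float:
    """Return the availability SLI as a percentage.

    Requests served during pre-announced maintenance are dropped from both
    the numerator and the denominator, matching the definition above.
    """
    counted = [r for r in requests if not r.in_maintenance]
    if not counted:
        raise ValueError("no eligible requests in the window")
    good = sum(1 for r in counted if r.status < 500)
    return 100.0 * good / len(counted)


# Made-up sample data for illustration only.
sample = (
    [Request(200)] * 9_980
    + [Request(404)] * 15                       # client errors still count as successful
    + [Request(503)] * 5                        # server errors count against the SLI
    + [Request(503, in_maintenance=True)] * 40  # excluded entirely
)
print(f"{availability_sli(sample):.2f}%")  # 99.95%
```

Note that a 404 counts as a success here because the definition only treats status codes of 500 and above as failures; whether that matches user experience is a design decision for each service.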

Service Level Objectives (SLOs)

An SLO is a target value or range of values for an SLI. It is the internal engineering commitment — the threshold below which the team considers the service degraded and acts accordingly. SLOs should be set based on user needs and business requirements, not on what is easy to achieve.

Examples of SLOs: 'Availability SLI >= 99.9% measured over 30 days', 'P99 latency <= 200ms for API requests', 'Error rate <= 0.1% for checkout requests'.
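As a rough sketch, the three example SLOs can be checked against measured SLIs as follows. The function names and sample latencies are assumptions made for illustration, and the P99 uses the nearest-rank method; monitoring systems often estimate percentiles differently (for example from histograms).

```python
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be an integer in 1..100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n, 1-based
    return ordered[rank - 1]


def check_slos(
    availability_pct: float,
    latencies_ms: Sequence[float],
    error_rate_pct: float,
) -> dict[str, bool]:
    """Compare measured SLIs against the three example SLO targets."""
    return {
        "availability >= 99.9%": availability_pct >= 99.9,
        "p99 latency <= 200ms": percentile_nearest_rank(latencies_ms, 99) <= 200,
        "error rate <= 0.1%": error_rate_pct <= 0.1,
    }


# Made-up latency samples: 1,000 requests, 5 of them very slow.
healthy = [50.0] * 985 + [180.0] * 10 + [900.0] * 5
print(check_slos(99.95, healthy, 0.05))
# {'availability >= 99.9%': True, 'p99 latency <= 200ms': True, 'error rate <= 0.1%': True}
```

With 15 slow requests instead of 5, more than 1% of samples exceed 200ms, so the P99 lands on a slow request and the latency SLO fails even though the median is unchanged.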

⚠️

The 100% SLO trap

Never set a 100% SLO. It is unachievable (all software has bugs and all hardware fails eventually), and it eliminates the error budget, the core mechanism that lets engineering teams make risk-based deployment decisions. Google's SRE book argues that 100% is the wrong reliability target for almost any service; in practice, teams commonly choose targets such as 99.9% or 99.99%.

Service Level Agreements (SLAs)

An SLA is an explicit or implicit contract with your users that includes the consequences of meeting or missing the SLOs it contains. SLAs are external-facing and typically include financial remedies (service credits, refunds) for breaches. An SLA is usually set looser than the corresponding internal SLO (for example, a 99.5% SLA backed by a 99.9% internal SLO) so that the team is alerted and can react before a contractual breach occurs.

Service	SLA Availability	Remedy for Breach
AWS EC2	99.99%	Service credits (10%–30% of monthly bill)
AWS S3	99.9%	Service credits (10%–25% of monthly bill)
Google Cloud SQL	99.95%	Service credits up to 50% of monthly bill
Stripe API	99.99%	Service credits
Twilio SMS	99.95%	Service credits up to 25%
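Service credits are typically tiered: the further measured availability falls below the SLA, the larger the credit. The sketch below shows that mechanism only. The tier thresholds, credit percentages, and the 99.5% SLA are hypothetical and are not taken from any provider in the table.

```python
# Hypothetical credit schedule: (availability floor in %, credit as % of monthly bill).
# Illustrative values only; NOT any real provider's SLA terms.
CREDIT_TIERS = [
    (99.5, 0),   # SLA met: no credit owed
    (99.0, 10),
    (95.0, 25),
    (0.0, 50),
]


def service_credit(measured_availability_pct: float, monthly_bill: float) -> float:
    """Return the credit owed for a month under the hypothetical schedule above."""
    if not 0 <= measured_availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    # Tiers are ordered from highest floor to lowest; the first match wins.
    for floor, credit_pct in CREDIT_TIERS:
        if measured_availability_pct >= floor:
            return monthly_bill * credit_pct / 100
    raise AssertionError("unreachable: the last tier has a floor of 0")


print(service_credit(99.7, 2000))  # 0.0   (SLA met)
print(service_credit(99.2, 2000))  # 200.0 (10% credit)
print(service_credit(97.0, 2000))  # 500.0 (25% credit)
```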

Error Budgets

The error budget is the acceptable amount of unreliability derived from the SLO. If your SLO is 99.9% availability, your error budget is 0.1% of the measurement window: 43.2 minutes of downtime over a 30-day window (about 43.8 minutes over an average calendar month) or 8.76 hours per year.
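The conversion from an SLO percentage to allowed downtime is simple arithmetic. A minimal sketch (the function name is an assumption) that reproduces the figures above:

```python
def error_budget_minutes(slo_pct: float, window_days: float) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0 < slo_pct < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    if window_days <= 0:
        raise ValueError("window must be positive")
    window_minutes = window_days * 24 * 60
    return (100 - slo_pct) / 100 * window_minutes


print(f"{error_budget_minutes(99.9, 30):.1f} min per 30 days")             # 43.2
print(f"{error_budget_minutes(99.9, 365 / 12):.1f} min per average month")  # 43.8
print(f"{error_budget_minutes(99.9, 365) / 60:.2f} hours per year")        # 8.76

# Each extra "nine" cuts the budget by a factor of ten.
for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo, 30):.2f} min per 30 days")
```

The loop at the end shows why each additional nine is expensive: 99% allows 432 minutes per 30 days, 99.9% allows 43.2, and 99.99% allows only 4.32.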

The error budget is a shared resource between the development team (which wants to ship features quickly, accepting some risk) and the reliability team (which wants stability). If the budget is being consumed too fast (too many incidents), the team slows down risky releases. If the budget has headroom, the team can take more risks with aggressive deployments.

# Error budget calculation example

SLO target: 99.9% availability over 30 days
Measurement window: 30 days = 30 × 24 × 60 = 43,200 minutes

Error budget = (100% - 99.9%) × 43,200 minutes
             = 0.1% × 43,200
             = 43.2 minutes of allowed downtime

Current month usage:
  - Deploy incident on Day 5: 12 minutes downtime
  - Database failover on Day 18: 8 minutes downtime
  - Subtotal: 20 minutes consumed

Remaining error budget: 43.2 - 20 = 23.2 minutes
Budget consumed: 20 / 43.2 = 46.3% at Day 18 (60% of the window elapsed)
→ On track: the budget is being spent more slowly than the window is elapsing. Team can proceed with planned risky release.
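The same bookkeeping can be automated. The sketch below reproduces the numbers from the worked example and adds one assumption that is not in the example: it defines burn rate as the fraction of budget consumed divided by the fraction of the window elapsed, so a value above 1.0 means the budget would run out before the window ends. The `BudgetStatus` type and the "proceed if burn rate is at most 1.0" rule are illustrative, not a standard policy.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BudgetStatus:
    budget_minutes: float
    consumed_minutes: float
    remaining_minutes: float
    consumed_fraction: float  # share of the total budget already spent
    burn_rate: float          # consumed fraction / elapsed fraction of the window


def budget_status(
    slo_pct: float,
    window_days: float,
    elapsed_days: float,
    incident_minutes: Sequence[float],
) -> BudgetStatus:
    """Summarise error budget usage part-way through a measurement window."""
    if not 0 < slo_pct < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    if not 0 < elapsed_days <= window_days:
        raise ValueError("elapsed_days must be in (0, window_days]")
    budget = (100 - slo_pct) / 100 * window_days * 24 * 60
    consumed = sum(incident_minutes)
    consumed_fraction = consumed / budget
    elapsed_fraction = elapsed_days / window_days
    return BudgetStatus(
        budget_minutes=budget,
        consumed_minutes=consumed,
        remaining_minutes=budget - consumed,
        consumed_fraction=consumed_fraction,
        burn_rate=consumed_fraction / elapsed_fraction,
    )


# Numbers from the worked example: 12 + 8 minutes of downtime by Day 18.
status = budget_status(99.9, window_days=30, elapsed_days=18, incident_minutes=[12, 8])
print(f"remaining: {status.remaining_minutes:.1f} min")  # 23.2 min
print(f"consumed:  {status.consumed_fraction:.1%}")      # 46.3%
print(f"burn rate: {status.burn_rate:.2f}")              # 0.77

# Hypothetical release policy: proceed only while burn rate is at most 1.0.
print("proceed with release" if status.burn_rate <= 1.0 else "freeze risky releases")
```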

Putting It Together

[Diagram: the SLI/SLO/SLA hierarchy and their relationships.]

💡

Interview Tip

In system design interviews, mentioning SLOs proactively is a strong signal. After discussing the architecture, say: 'For the availability SLO, I'd target 99.9% — that gives us 43 minutes of error budget per month. We'd measure it with a success rate SLI from our load balancer logs. The SLA with enterprise customers would be 99.5% with service credits.' This shows operational maturity and end-to-end thinking beyond just architecture.
