SRE: SLOs, SLAs & Error Budgets Explained

Site Reliability Engineering, or SRE, is the discipline that bridges software engineering and operations. Coined at Google in 2003, SRE answers a fundamental question: how do you run large-scale software systems reliably without sacrificing the speed of feature development? The answer lies in a framework of measurable reliability targets — SLIs, SLOs, SLAs — and the concept of the error budget that turns reliability into a shared engineering resource.
The three terms form a hierarchy. A Service Level Indicator (SLI) is a raw measurement — the percentage of HTTP requests that returned a 2xx status code, or the fraction of requests completing in under 200ms. A Service Level Objective (SLO) is an internal target built on top of an SLI — for example, 99.9% of requests must succeed over a 30-day rolling window. A Service Level Agreement (SLA) is a contractual commitment to customers, usually set below the SLO with financial penalties for breach.
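In Prometheus terms, an SLI is typically precomputed as a recording rule over raw request metrics. A minimal sketch, assuming the generic http_requests_total counter and http_request_duration_seconds histogram used later in this article; the rule names and the 0.2s bucket are illustrative:
# Recording rules that turn raw metrics into SLIs
groups:
  - name: sli-recording-rules
    rules:
      # Availability SLI: fraction of requests that did not return a 5xx,
      # measured over a 5-minute window.
      - record: sli:http_availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )
      # Latency SLI: fraction of requests completing in under 200ms
      # (assumes the histogram exposes a 0.2s bucket).
      - record: sli:http_latency_under_200ms:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))
An SLO is then simply a target value for one of these recorded series over its window.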
A good SLO is specific, measurable, and time-bound. For alerting on one, multi-window burn-rate alerts are the industry-standard Prometheus approach: a fast window (1h) catches sudden spikes, while a slow window (6h or 24h) catches gradual burns. An alert fires when the burn rate, the speed at which the error budget is being consumed, exceeds a threshold that would exhaust the budget long before the SLO window ends.
# Prometheus alerting rules — simplified single-window SLO alerts
groups:
  - name: slo-alerts
    rules:
      # Availability SLO: 99.9% of requests succeed over 30 days
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.001  # 0.1% error rate threshold
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn rate"
          description: >
            Error budget is burning at {{ $value | humanizePercentage }}
            — SLO at risk if this continues.

      # Latency SLO: p99 < 500ms
      - alert: LatencySLOViolation
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency exceeds 500ms SLO"

The error budget is the allowed amount of unreliability per SLO window. For a 99.9% SLO over 30 days, the error budget is 43.2 minutes of downtime. When the budget is healthy, teams can deploy frequently and take calculated risks. When the budget is nearly exhausted, engineering effort shifts to reliability work — fixing flaky tests, reducing toil, improving deployment safety. The error budget transforms reliability from an ops concern into a shared engineering metric.
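The rules above use single windows and fixed thresholds, which is simpler but noisier than the multi-window burn-rate pattern described earlier. A sketch of that pattern for the same assumed http_requests_total metric and a 99.9% / 30-day SLO follows; the 14.4 and 6 burn-rate factors are the commonly cited starting points from Google's SRE Workbook and should be tuned to your own budget and windows:
# Multi-window, multi-burn-rate alerts (sketch) for a 99.9% SLO.
# Burn rate = observed error ratio / allowed error ratio (0.001 here).
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Fast burn: 14.4x consumption, roughly 2% of the 30-day budget per hour.
      # Both the 1h and 5m windows must agree, so the alert clears quickly
      # once the problem stops.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn (14.4x)"
      # Slow burn: 6x consumption sustained over 6 hours.
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
              / sum(rate(http_requests_total[6h])) > (6 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
              / sum(rate(http_requests_total[30m])) > (6 * 0.001)
          )
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn (6x)"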
Error budget tracking is best implemented as a Grafana dashboard fed by Prometheus. The dashboard shows the current SLI value, the SLO target, the remaining budget as a percentage, and a burn-rate graph. When remaining budget drops below 10%, automated alerts page the on-call team and trigger a feature freeze policy — no new deployments until reliability is restored.
# Error budget calculation
# SLO target: 99.9% availability over a 30-day window
SLO_TARGET=0.999
WINDOW_DAYS=30
total_minutes=$((WINDOW_DAYS * 24 * 60))  # 43200
allowed_downtime=$(echo "$total_minutes * (1 - $SLO_TARGET)" | bc)
# => 43.2 minutes of allowed downtime per 30 days

# Remaining budget query (Prometheus / Grafana):
# error_budget_remaining =
#   1 - (
#         sum(increase(http_requests_total{status=~"5.."}[30d]))
#         /
#         sum(increase(http_requests_total[30d]))
#       ) / (1 - SLO_TARGET)
#
# If the result drops below 0, the budget is exhausted: freeze feature releases.

Set your SLO conservatively at first — start at 99.5% and tighten it as you understand your system's failure modes. The difference matters: over 30 days, 99.5% allows roughly 3.6 hours of downtime, 99.9% allows 43.2 minutes, and 99.99% only about 4.3 minutes. Overly aggressive SLOs (99.99% for a service that can't actually hit that) create alert fatigue and burn through your error budget on planned maintenance windows.
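Whatever target you choose, the remaining-budget query above can be materialised as a Prometheus recording rule, and the 10% freeze threshold mentioned earlier becomes an ordinary alert. A sketch, again assuming the http_requests_total metric and a 99.9% target; rule and alert names are illustrative:
# Remaining error budget as a recording rule, plus the freeze-policy alert
groups:
  - name: error-budget
    rules:
      # 1.0 = budget untouched, 0 = exhausted, negative = SLO already violated.
      - record: slo:error_budget_remaining:ratio_30d
        expr: |
          1 - (
            (
              sum(increase(http_requests_total{status=~"5.."}[30d]))
              /
              sum(increase(http_requests_total[30d]))
            )
            / (1 - 0.999)
          )
      # Page and start the feature freeze when less than 10% of the budget remains.
      - alert: ErrorBudgetNearlyExhausted
        expr: slo:error_budget_remaining:ratio_30d < 0.10
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% of the 30-day error budget remains"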
Toil is the work that is manual, repetitive, automatable, and scales with service growth. Google's SRE book recommends keeping toil below 50% of an SRE's time. Common sources of toil in Kubernetes environments include manually restarting crashlooping pods, manually scaling deployments for traffic spikes, and manually running database migrations. Each of these can and should be automated away.
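For the traffic-spike case, the scaling toil can be handed to Kubernetes itself rather than scripted: a HorizontalPodAutoscaler watches utilisation and replaces the manual kubectl scale step. A minimal sketch; the deployment name, replica bounds, and CPU target are illustrative:
# Eliminating scaling toil with a HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out when average CPU exceeds 70%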
Runbook automation is the first step to reducing toil. Rather than a human reading a runbook and executing steps manually, encode the runbook as a script triggered by an alert. Pod restarts, cache flushes, and health check overrides are all good candidates. More advanced automation uses Kubernetes Operators to encode operational knowledge directly into the cluster.
#!/usr/bin/env bash
# Toil reduction: automate repetitive restarts with a runbook
# on-call runbook: auto-restart deployments whose pods are crashlooping
NAMESPACE=production
THRESHOLD=5  # restart count before auto-remediation kicks in

kubectl get pods -n "$NAMESPACE" -o json |
  jq -r --argjson threshold "$THRESHOLD" '
    .items[]
    | select(any(.status.containerStatuses[]?; .restartCount > $threshold))
    | .metadata.name' |
  while read -r pod; do
    app=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.metadata.labels.app}')
    echo "Restarting deployment ${app} (pod ${pod} has more than ${THRESHOLD} restarts)"
    kubectl rollout restart deployment/"$app" -n "$NAMESPACE"
  done

Your SLA must always be looser than your SLO to give you a buffer. If your internal SLO is 99.9% availability and you promise customers 99.9% in the SLA, any SLO violation immediately breaches your contract. A common pattern is to set the SLA at least 0.1 percentage points below the SLO, giving engineering room to investigate and remediate before customer penalties kick in.
When systems fail — and they will — SRE relies on structured incident management to minimise mean time to recovery (MTTR) and learn from every event. The blameless postmortem is the cornerstone: a document that records the incident timeline, contributing factors, and action items without assigning blame to individuals. The goal is to fix systems, not people. Every severity-1 incident should produce a postmortem within 48 hours, with action items tracked to completion.
Mastering SRE starts with fluency in its core vocabulary: SLI, SLO, SLA, error budget, toil, and the blameless postmortem.
The journey to mature SRE practice is incremental. Start by identifying your two or three most important user-facing metrics — request success rate, latency, and throughput are a good starting point. Define SLOs for them, build Prometheus/Grafana dashboards, and wire up paging alerts. Review the error budget in your weekly engineering meeting. As your team gains confidence, expand to more services and refine your alert thresholds. The goal is not perfect uptime — it's the right amount of reliability for your users and business.