Site Reliability Engineering (SRE)

Keep the internet running: reliability, observability, and on-call excellence

150h total · 10 courses · 3 stages

What you'll be able to do

  • Define and track SLIs, SLOs, and error budgets
  • Instrument systems with metrics, logs, and traces
  • Design for reliability and run incident response
  • Automate toil and operate at scale

Before you start

  • Comfort with Linux and the command line
  • Basic scripting (Bash or Python)
  • Experience running or deploying an application

Phase 1 · SRE Foundations

Linux Deep Dive for SRE

beginner · 18h

Process internals, memory model, file descriptors, networking stack, kernel tuning, and performance analysis with perf, strace, and eBPF.

  • Diagnose a CPU bottleneck with top/perf
  • Trace system calls with strace
  • Analyse a memory leak with valgrind and /proc
  • Network tuning: sysctl TCP parameters

SRE Principles: Google SRE Book

beginner · 14h

SLIs, SLOs, SLAs, error budgets, toil elimination, and the SRE vs DevOps mental model. Chapter-by-chapter reading with applied exercises.

  • Write SLOs for a hypothetical e-commerce service
  • Calculate error budget burn rate
  • Identify toil in a daily workflow and automate it
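
As a rough illustration of the burn-rate exercise above: burn rate is the observed error rate divided by the error budget, where the budget is 1 minus the SLO target. The function names below are hypothetical and the request counts are made-up inputs, so treat this as a minimal sketch rather than a reference implementation.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under the SLO (0.001 for a 99.9% target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(slo_target: float, failed: int, total: int) -> float:
    """Observed error rate divided by the error budget.

    1.0 means the budget is being spent exactly as fast as the SLO allows;
    2.0 means it would run out in half the SLO window.
    """
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / error_budget(slo_target)
```

For example, 20 failures out of 10,000 requests against a 99.9% SLO is an error rate of 0.2% against a 0.1% budget, which is a burn rate of about 2.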

Python & Go for SRE Automation

beginner · 16h

Scripting, tooling, and automation: Python for ops scripts, Go for reliable CLI tools and services. Focus on practical SRE use cases.

  • Python: health-check script that pages on failure
  • Go: build a simple HTTP load-test tool
  • Automate runbook steps as a script
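
The health-check exercise above might start from something like the following sketch. It uses only the standard library; `check_and_page` and the `page` callback are hypothetical names, and the callback stands in for whatever paging integration you actually use (this sketch does not call any real paging API).

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, and similar


def check_and_page(
    url: str,
    page: Callable[[str], None],
    fetch: Callable[[str], int] = http_status,
) -> bool:
    """Probe the URL once and call page(message) if it is not healthy."""
    status = fetch(url)
    healthy = 200 <= status < 300
    if not healthy:
        page(f"health check failed for {url}: status={status}")
    return healthy
```

Passing `fetch` and `page` as parameters keeps the check testable without a network or a real pager; in a cron job you would pass a function that notifies your on-call tool.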

Phase 2 · Observability Stack

Metrics: Prometheus & Grafana

intermediate · 16h

PromQL, recording rules, alerting rules, Grafana dashboards, and the RED and USE methods for service health.

  • Instrument a Node/Python service with a Prometheus client library
  • PromQL: p99 latency, error rate, saturation
  • Grafana: RED dashboard (Rate, Errors, Duration)
  • Alertmanager: page on SLO breach
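
To make the p99 latency exercise more concrete: PromQL's `histogram_quantile` estimates a quantile from cumulative histogram buckets by interpolating linearly inside the bucket that contains the target rank. The Python below is a simplified sketch of that idea, not Prometheus's exact implementation (for instance, it does not special-case the `+Inf` bucket). The function name and bucket values are illustrative assumptions.

```python
def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets holds (upper_bound, cumulative_count) pairs sorted by upper_bound,
    similar in shape to the `le` buckets of a Prometheus histogram.
    """
    if not buckets or not 0.0 <= q <= 1.0:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations recorded
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            # Assume observations are spread evenly inside this bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]
```

The accuracy of the estimate depends on bucket boundaries: a p99 that falls inside a wide bucket can only be resolved to somewhere within that bucket.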

Logging: ELK / Loki + Structured Logs

intermediate · 12h

Structured JSON logging, log aggregation with Loki or Elasticsearch, Logstash pipelines, Kibana/Grafana queries, and log-based alerting.

  • Structured log format: trace_id, service, level
  • LogQL: find all 5xx errors in last 1h
  • Correlation: trace a request across 3 services
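
One possible shape for the structured log format above, using only Python's standard `logging` and `json` modules. The field names `trace_id`, `service`, and `level` come from the exercise list; the `ts` and `message` fields, the class name, and the `build_logger` helper are assumptions added for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            # service and trace_id arrive via logger.info(..., extra={...})
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger("checkout").info("payment authorised", extra={"service": "checkout", "trace_id": "abc123"})` then emits a single JSON line that an aggregator can filter by `trace_id` across services.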

Distributed Tracing: OpenTelemetry & Jaeger

intermediate · 10h

Spans, traces, context propagation, sampling strategies, and correlating traces with metrics and logs, the three pillars of observability.

  • Instrument a service with the OTel SDK
  • View trace in Jaeger: identify slow span
  • Correlate a trace to a Prometheus spike
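
Context propagation is easier to reason about once you see what travels between services. The sketch below hand-rolls the W3C Trace Context `traceparent` header (version `00`: a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2 hex digits of flags). It is an illustration only and does not use the OpenTelemetry SDK, which handles this for you; all names in it are hypothetical.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Only version 00 of the header is recognised in this sketch.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def new_root() -> SpanContext:
    """Start a new trace with random IDs."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)


def child_of(parent: SpanContext) -> SpanContext:
    """Create a child span: same trace ID, fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def inject(ctx: SpanContext, headers: dict) -> None:
    """Write the context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: dict) -> Optional[SpanContext]:
    """Read the context from incoming headers, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

Because every hop keeps the same trace ID and only replaces the span ID, a backend such as Jaeger can stitch the spans from all services into one trace.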

Phase 3 · Reliability Engineering

Incident Management & On-Call

intermediate · 12h

Incident response lifecycle, postmortem culture, on-call best practices, PagerDuty setup, runbooks, and blameless retrospectives.

  • Write a postmortem for a real or simulated incident
  • Create a runbook for top-3 alert types
  • Configure PagerDuty escalation policy

Chaos Engineering & Resilience Testing

advanced · 12h

Game days, fault injection, Chaos Monkey, Gremlin, blast radius limiting, and recovery testing.

  • Run a game day: kill a service instance
  • Verify graceful degradation under load
  • Chaos experiment: latency injection on a dependency
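
For the latency-injection experiment above, dedicated tools such as Gremlin inject faults at the network or host level. A much smaller, application-level version of the same idea can be sketched as a decorator that delays a fraction of calls to a dependency. The decorator name and parameters are hypothetical, and the `probability` argument is one simple way to limit blast radius.

```python
import functools
import random
import time


def inject_latency(delay_s: float, probability: float,
                   rng=random.random, sleep=time.sleep):
    """Delay a fraction of calls to the wrapped function by delay_s seconds.

    rng and sleep are injectable so the behaviour can be tested deterministically.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                sleep(delay_s)  # simulate a slow dependency
            return func(*args, **kwargs)
        return wrapper

    return decorator
```

Wrapping a client call with `@inject_latency(delay_s=0.5, probability=0.1)` would slow roughly one call in ten, which is enough to observe whether timeouts and fallbacks behave as intended.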

Kubernetes for SRE

advanced · 16h

Pod autoscaling (HPA/VPA/KEDA), disruption budgets, priority classes, resource quotas, node affinity, and SRE-focused Kubernetes patterns.

  • Configure HPA on a deployment
  • PodDisruptionBudget: survive a node drain
  • Resource requests/limits: avoid OOMKill
  • Pass CKA exam (target)
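
When configuring an HPA it helps to know the scaling rule documented by Kubernetes: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1 (0.1 by default). The Python below is a simplified sketch of that rule; it ignores stabilization windows, pod readiness, and missing metrics, and the `min_replicas` and `max_replicas` defaults are arbitrary assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, do not scale
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds set on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas at 90% average CPU against a 60% target, the ratio is 1.5, so the sketch asks for 6 replicas.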

Capstone: SRE for a Production Service

advanced · 24h

Apply everything to a real service: define SLOs, instrument metrics + logs + traces, build dashboards, configure alerts, write runbooks, and run a chaos game day.

  • SLO document approved by stakeholders
  • Dashboards covering RED + USE methods
  • Alerting with no false positives for 2 weeks
  • Game day executed and postmortem written

Frequently asked

Is the Site Reliability Engineering (SRE) roadmap free?

Yes. The entire Site Reliability Engineering (SRE) roadmap and every curated resource are free to follow on Commit. You can track your progress, keep a daily streak, and earn a shareable certificate at no cost; there is no paywall.

How long does the Site Reliability Engineering (SRE) roadmap take to complete?

About 150 hours of focused study across 10 courses and 3 stages. At roughly one hour a day, that is about 5 months; you can move faster by studying more each day.

Do I get a certificate for finishing the Site Reliability Engineering (SRE) roadmap?

Yes. When you complete the roadmap on Commit you receive a verifiable certificate of completion that you can add to LinkedIn and your public Commit profile as proof of what you finished.

Make it stick

Copy this roadmap into Commit and turn it into a tracked program with a streak graph, study logging, and a shareable certificate when you finish. Free forever.