What is it
SRE, short for Site Reliability Engineering, is a set of principles and practices for operating large-scale software systems.
The term was pioneered by Google and is explained in depth in their Site Reliability Engineering book:
What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team.
Benjamin Treynor Sloss - founder of Google SRE
SRE at Einride
Many people are surprised that Spotify does not actually have an SRE organization. We don’t have a central SRE team or even SRE-only teams, yet our ability to scale over time has been dependent on our ability to apply SRE principles in everything we do. Given this unusual setup, other companies have approached us to learn how our model (“Ops-in-Squads”) works. Some have adopted a similar model.
When to use it
We aim to apply the SRE principles throughout our entire software development life cycle, from how we build our applications with reliability and observability in mind to how we organize into autonomous teams that own and operate their own systems.
How to learn it
Learning SRE involves a mix of learning the underlying theory and applying it in practice.
Learning the theory
Start by reading the original Site Reliability Engineering book to learn the fundamental principles and methodology of SRE.
Follow up by reading Seeking SRE to gain insight into how different organizations have implemented the SRE principles.
Read the Building Secure & Reliable Systems book to gain further insight into best practices for designing, implementing and maintaining systems according to the SRE principles.
Putting it into practice
Besides applying the SRE principles in the design, implementation and deployment of your systems, ask to join the on-call rotation that operates your team's backend services.