SRE

Recommendation

Updated

Moved

USE

2021-10-03

What is it

SRE, short for Site Reliability Engineering, is a set of principles and practices for operating large-scale software systems.

The term was pioneered by Google and is explained in depth in their Site Reliability Engineering book:

What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team.

Benjamin Treynor Sloss - founder of Google SRE

SRE at Einride

At Einride, in keeping with our principle of taking ownership, we aim for a variant of SRE modeled on Spotify's "SRE without SRE" approach, also known as "Ops-in-Squads":

Many people are surprised that Spotify does not actually have an SRE organization. We don’t have a central SRE team or even SRE-only teams, yet our ability to scale over time has been dependent on our ability to apply SRE principles in everything we do. Given this unusual setup, other companies have approached us to learn how our model (“Ops-in-Squads”) works. Some have adopted a similar model.

When to use it

We aim to apply the SRE principles throughout our entire software development life cycle, from how we build our applications with reliability and observability in mind to how we organize into autonomous teams that own and operate their own systems.

How to learn it

Learning SRE involves a mix of learning the underlying theory and applying it in practice.

Learning the theory

Start by reading the original Site Reliability Engineering book, to learn the fundamental principles and methodology of SRE.
Follow up by reading Seeking SRE to gain insight into how different organizations have implemented the SRE principles.
Read the Building Secure & Reliable Systems book to gain further insight into best practices for designing, implementing and maintaining systems according to the SRE principles.

Putting it into practice

Besides applying the SRE principles in the design, implementation, and deployment of your systems, ask to join the on-call rotation that operates your team's backend services.
