Category: Articles

Check out our written research on SRE and software operations topics ⬇️⬇️⬇️

Articles, SRE Digital Transformation

What is Site Reliability Engineering?

This article is intended to help non-technical stakeholders better understand Site Reliability Engineering. It is part of the SRE Digital Transformation series exploring how to integrate SRE into your organization. I highly recommend that you start by listening to this episode of the SREpath Podcast for a deeper discussion on “What is Site Reliability Engineering…

April 4, 2023
Articles, Team Development

How to convert developers into Site Reliability Engineers (SREs)

Introduction: Hiring in the Site Reliability Engineering (SRE) space is notoriously difficult, so it makes sense to figure out how to expand the hiring pool beyond existing SREs. One way to do so is to recruit developers (also known as SWEs) and gradually advance them…

February 9, 2023
Articles, Case Studies

Rundown of LinkedIn’s SRE practices

Introduction: LinkedIn has one of the most robust Site Reliability Engineering (SRE) practices around. After all, as the social network of record for jobseekers and salespeople, it is the 6th most trafficked website in the world, with over 1.5 billion unique visits per month. LinkedIn’s Site Reliability Engineers (SREs) ensure all that traffic gets served…

January 25, 2023
Articles, Team Development

Analysis of SRE and platform setup at 10+ tech companies

In this article, you will see a breakdown of the platform setup and SRE practices within 12 non-FAANG technology companies. This is based on the case studies by Andrios Robert. “There is a lot of content available on how Google did [Site Reliability Engineering]; let’s uncover what happens with the rest of the world.” —…

November 22, 2022
Articles, Opinion

Is platform engineering at risk of shiny object syndrome?

There has been much debate lately about the emergence of “Platform Engineering” as a solution to software operations problems. It’s an interesting proposition. However, it is not a silver bullet that will fix everything that didn’t work out with Dev versus Ops, DevOps, or SRE. We are missing something very important in our…

November 13, 2022
Articles, Reliability Strategy

Reduce software outage risk with passive guardrails

Shocking fact: only 10-25% of software outages are caused by hardware or network failure. The rest are the result of human error such as misconfiguration (paraphrasing Martin Kleppmann, Designing Data-Intensive Applications). In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity…

October 26, 2022
Articles, Team Development

Where in team topologies does Site Reliability Engineering fit in?

We will explore the workings of the Team Topologies model and how Site Reliability Engineering (SRE) teams can fit into it. Overview of team topologies: Team Topologies is a relatively new model/framework, having been officially introduced in 2019. It’s a response by…

October 12, 2022
Articles, Case Studies

Rundown of Uber’s SRE practice

Introduction: Every time you push a button to request an Uber ride, you activate a sequence of (micro)service requests. You’d never know unless you looked under the hood, because most of these services run solely in the background. Yet every service contributes to the start and completion of the Uber ride…

July 20, 2022
Articles, Mildly Technical

How Jaeger tracing fits into software observability

In this article, I will share how tracing, and more specifically Jaeger tracing, can fit into your wider software observability strategy. Before we get into tracing, let’s define observability. What is observability? Observability is a comprehensive means of gaining data on how software services perform in production. This data gives you a picture of the…

June 15, 2022
Articles, Reliability Strategy

SRE’s role in safer infrastructure-as-code

This article explores two simple ways for SREs to drive better practices and code hygiene within infrastructure-as-code (IAC) tooling like Terraform. Why bother? Because IAC is central to cloud infrastructure efficiency, it’s highly likely that you will get involved with an IAC problem at some point in your SRE career. I will mention Terraform from…

June 11, 2022