-
Review of Google’s Site Reliability Engineering Hierarchy
Google’s book on SRE, Site Reliability Engineering (2016), has won wide acclaim in the software operations world. One of the book’s most discussed aspects in SRE circles is its SRE hierarchy. The hierarchy has merit, but it’s also flawed in a way that would prevent you from educating people about SRE. I’ll get…
-
How cloud infrastructure teams evolve – from start to maturity
I recently read a post by Will Larson, who started SRE at Uber. The post is called “Trunks and branches model for scaling infrastructure organizations”. Several passages in the post cover how infrastructure teams can evolve from the startup phase. I felt it would be easier to comprehend the dense, rich advice with a visual…
-
Cloud infrastructure success is a fine balance of budget and service quality
The visual summary below is based on a post by Will Larson, who started the SRE function at Uber. His post elaborates on a “trunks and branches” model for developing infrastructure-facing teams. It also covers an interesting perspective on the balancing act of budget and service quality. I will explain the visual summary underneath it.…
-
How 6 system resilience patterns increase software reliability
System resilience thinking can inform better Site Reliability Engineering decisions. Specifically, it can affect how the SRE culture unfolds and handles critical situations. The system resilience concept is rooted in theoretical computer science. Don’t panic. I will explain how it can – in a practical way – support increased software reliability in production. We…
-
Site Reliability Engineering Culture Patterns
Despite its now antiquated-sounding name, Site Reliability Engineering (SRE) as a discipline holds strong promise for proactively improving software reliability in production. As software complexity continues to increase, so will the need for ever better SRE practice. It is undoubtedly an exciting but enigmatic field, with…
-
Rundown of Netflix’s SRE practice
A lot goes on in the background every time you load up your favorite Netflix movie or series. Engineers spread across Chaos Engineering, Performance Engineering and Site Reliability Engineering (SRE) are working non-stop to ensure the magic keeps happening. 📊 Here are some performance statistics for Netflix: When it was alone on top of…
-
25+ Site Reliability Engineering OKRs
Incident Response OKRs, System performance and resilience OKRs, Developer support OKRs, DevSecOps OKRs, FinOps (Cloud Cost Control) OKRs, Work practices OKRs. Feel free to reach out if you have any questions about the above OKRs or want us to add a new OKR.
-
Runbooks for better incident response
I can confidently tell you that runbooks form a critical part of the incident response toolkit. I will also tell you that SREs are well-placed to start and oversee the development of runbooks. If you don’t have a runbook yet, let me entice you with the thought of checklist-type documentation to follow when you’re…
-
SRE is not a monolithic role
As SRE gains traction, a misconception is gaining steam among senior stakeholders: that SRE is a monolithic role, like what “programmers” were in the 90s. Let’s burst that misconception… SRE is a broad, overarching responsibility that needs a multitude of role considerations to pull off properly. It is not a monolithic role where…