{"id":5680,"date":"2023-07-11T23:47:25","date_gmt":"2023-07-11T13:47:25","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=5680"},"modified":"2024-01-01T12:29:21","modified_gmt":"2024-01-01T02:29:21","slug":"inside-spotifys-site-reliability-engineering-sre-practice","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/inside-spotifys-site-reliability-engineering-sre-practice\/","title":{"rendered":"Inside Spotify’s Site Reliability Engineering (SRE) practice"},"content":{"rendered":"\n

You’ve undoubtedly caught wind of the latest Netflix series, dubbed “The Playlist,” a show loosely inspired by the birth of Spotify.<\/p>\n\n\n\n

Chances are, you may have already devoured it in one glorious binge-watching session.<\/p>\n\n\n\n

As for me, I only got around to it recently.<\/p>\n\n\n\n

I was enticed by a YouTube ad that hinted at a captivating tale of the inner workings behind Spotify’s software operations.<\/p>\n\n\n\n

And boy, was I hooked.<\/p>\n\n\n\n

What fascinated me was how much of Spotify’s early success hinged on the wizardry of its server operations.<\/p>\n\n\n\n

Fear not, dear reader, I will delve deeper into this in just a moment.<\/p>\n\n\n\n

It had me wondering whether Spotify’s practice of Site Reliability Engineering (SRE) would be just as enthralling.<\/p>\n\n\n\n

And let me assure you, it most certainly is.<\/p>\n\n\n\n

I’ve got an interesting story related to this toward the end of this piece.<\/p>\n\n\n\n

Brace yourselves as I take you on a journey through the intricate web of Spotify’s SRE practice.<\/p>\n\n\n

History of SRE at Spotify<\/h2>\n\n

Before SRE came in at Spotify<\/h3>\n\n\n

The magic of server-side work was part of Spotify\u2019s early charm.<\/p>\n\n\n\n

The Netflix series showed Daniel Ek (CEO of Spotify) challenging the former CTO Andreas Ehn to make Spotify fast with a song load time of less than 200ms.<\/strong><\/p>\n\n\n\n

A load time under 200ms is what the human ear perceives as instantaneous.<\/p>\n\n\n\n

Remember, this was in 2006, when internet connections did not readily or consistently allow for sub-second latency.<\/p>\n\n\n\n

To achieve the \u201ctrick\u201d, the engineers at Spotify created a hybrid fetch model.<\/p>\n\n\n\n

This approach predicted what the user would want to listen to next and prefetched it through the peer-to-peer network, decreasing server load by 90%.<\/p>\n\n\n\n

For the remaining 10% of the time, the request went to the servers to find and play the relevant songs.<\/p>\n\n\n\n

Client-level caching and prefetching songs 30 seconds before changeover also helped optimize playback and achieve a latency of 245ms.<\/p>\n\n\n\n

But this server sorcery proved not to be enough to support future growth.<\/p>\n\n\n

SRE was a response to hypergrowth<\/h3>\n\n\n

In 2011, Spotify faced the inevitable challenges of growth.<\/p>\n\n\n\n

Monthly active users (MAU) more than doubled around that time<\/strong> from approximately 3 million users in 2010<\/a> to over 7.4 million users in 2011<\/a>.<\/p>\n\n\n\n

This rapid expansion forced a quick build-out of supporting infrastructure, which in turn increased its underlying complexity.<\/p>\n\n\n\n

Growing complexity constantly challenged the reliability and scalability of the system.<\/p>\n\n\n\n

To tackle this challenge, Spotify officially introduced Site Reliability Engineering, a strategic move to combat its growing pains and conquer the obstacles that lay ahead.<\/p>\n\n\n\n

It was a pivotal moment as Spotify\u2019s software-in-production was on the cusp of reaching hyper-scale proportions.<\/p>\n\n\n\n

The stage was set for an audacious leap into uncharted (SRE) territory.<\/p>\n\n\n\n

Spotify was inspired by Google’s success with Site Reliability Engineering practices<\/a><\/strong>, which were developed by Ben Treynor<\/a>.<\/p>\n\n\n\n

Its engineering leaders aimed to adopt a similar approach but one that was tailored to its unique challenges.<\/p>\n\n\n\n

As Spotify’s user base and infrastructure continued to grow, the company scaled its SRE practices accordingly.<\/p>\n\n\n\n

Scaling SRE practices at Spotify included:<\/p>\n\n\n\n