{"id":5680,"date":"2023-07-11T23:47:25","date_gmt":"2023-07-11T13:47:25","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=5680"},"modified":"2024-01-01T12:29:21","modified_gmt":"2024-01-01T02:29:21","slug":"inside-spotifys-site-reliability-engineering-sre-practice","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/inside-spotifys-site-reliability-engineering-sre-practice\/","title":{"rendered":"Inside Spotify’s Site Reliability Engineering (SRE) practice"},"content":{"rendered":"\n
You’ve undoubtedly caught wind of the latest Netflix series, dubbed “The Playlist,” a show loosely inspired by the birth of Spotify.<\/p>\n\n\n\n
Chances are, you may have already devoured it in one glorious binge-watching session.<\/p>\n\n\n\n
As for me, I only got around to it recently.<\/p>\n\n\n\n
I was enticed by a YouTube ad that hinted at a captivating tale of the inner workings behind Spotify’s software operations.<\/p>\n\n\n\n
And boy, was I hooked.<\/p>\n\n\n\n
What fascinated me was how much of Spotify’s early success hinged on the wizardry of their server operations.<\/p>\n\n\n\n
Fear not, dear reader, I will delve deeper into this in just a moment.<\/p>\n\n\n\n
It had me wondering whether Spotify’s practice of Site Reliability Engineering (SRE) would be just as enthralling.<\/p>\n\n\n\n
And let me assure you, it most certainly is.<\/p>\n\n\n\n
I’ve got an interesting story related to this toward the end of this piece.<\/p>\n\n\n\n
Brace yourselves as I take you on a journey through the intricate web of Spotify’s SRE practice.<\/p>\n\n\n
The magic of server-side work was part of Spotify\u2019s early charm.<\/p>\n\n\n\n
The Netflix series showed Daniel Ek (CEO of Spotify) challenging the former CTO Andreas Ehn to make Spotify fast with a song load time of less than 200ms.<\/strong><\/p>\n\n\n\n Sub-200ms is the load time that is perceivable by the human ear as instantaneous.<\/p>\n\n\n\n Remember this was in 2006 when Internet capabilities did not readily and consistently allow for sub-second latency.<\/p>\n\n\n\n To achieve the \u201ctrick\u201d, the engineers at Spotify created a hybrid fetch model.<\/p>\n\n\n\n This approach predicted what the user would want to listen to next and prefetched it through the peer-to-peer network, decreasing server load by 90%.<\/p>\n\n\n\n For the remaining 10% of the time, the search went to the servers to play songs that were related.<\/p>\n\n\n\n Client-level caching and prefetching songs 30 seconds before changeover also helped optimize playback and achieve a latency of 245ms.<\/p>\n\n\n\n But this server sorcery proved not to be enough to support future growth.<\/p>\n\n\n In 2011, Spotify faced the inevitable challenges of growth.<\/p>\n\n\n\n Monthly active users (MAU) more than doubled around that time<\/strong> from approximately 3 million users in 2010<\/a> to over 7.4 million users in 2011<\/a>.<\/p>\n\n\n\n This fast expansion led to the quick development of supporting infrastructure, which in turn increased the underlying complexity of the infrastructure.<\/p>\n\n\n\n Growing complexity constantly challenged the reliability and scalability of the system.<\/p>\n\n\n\n To tackle this challenge, Spotify officially introduced Site Reliability Engineering, a strategic move to combat their growing pains and conquer the obstacles that lay ahead.<\/p>\n\n\n\n It was a pivotal moment as Spotify\u2019s software-in-production was on the cusp of reaching hyper-scale proportions.<\/p>\n\n\n\n The stage was set for an audacious leap into uncharted (SRE) territory.<\/p>\n\n\n\n Spotify was inspired by Google’s success with Site Reliability 
Engineering practices<\/a><\/strong>, which were developed by Ben Treynor<\/a>.<\/p>\n\n\n\n Its engineering leaders aimed to adopt a similar approach but one that was tailored to its unique challenges.<\/p>\n\n\n\n As Spotify’s user base and infrastructure continued to grow, the company scaled its SRE practices accordingly.<\/p>\n\n\n\n Scaling SRE practices at Spotify included:<\/p>\n\n\n\n The automation practices devised by Spotify’s SREs have proved to be a godsend for developers and product teams alike.<\/p>\n\n\n\n By relieving developers of the tiresome burden of manual and repetitive tasks, these practices allow them to channel their energy into the craftsmanship of feature design and higher-quality code.<\/p>\n\n\n\n The result?<\/p>\n\n\n\n A surge in productivity that paves the way for the efficient delivery of code, without compromising on quality.<\/p>\n\n\n SREs at Spotify excel in collaborating with software engineers, seamlessly integrating reliability and operational considerations into the development process.<\/p>\n\n\n\n This collaboration helps regular developers gain a deeper understanding of the operational aspects of their code<\/strong> and encourages them to write more reliable and resilient software from the outset.<\/p>\n\n\n\n Part of this effort involves regular measurement and monitoring of service and system performance.<\/p>\n\n\n\n This helps SREs give developers valuable insights into the behavior and performance of their applications.<\/p>\n\n\n\n By leveraging metrics and monitoring tools, developers can:<\/p>\n\n\n\n At Spotify, both developers and Site Reliability Engineers (SREs) play a role in responding to incidents.<\/p>\n\n\n\n When it comes to the question, \u201cWho goes in first?\u201d in incident response, SREs take the lead and are the first responders<\/strong>.<\/p>\n\n\n\n There are several reasons for this at Spotify:<\/p>\n\n\n\n SREs have the necessary tools, training, and knowledge to diagnose and mitigate 
the issues promptly<\/p>\n\n\n\n SREs follow well-defined incident management processes and participate in on-call rotations to ensure 24\/7 coverage<\/p>\n\n\n\n SREs leverage their specialized expertise in system reliability and operations.<\/p>\n\n\n\n However, developers also play a vital role in this process.<\/p>\n\n\n\n They actively contribute their skills and knowledge, working hand in hand with SREs to tackle and resolve incidents effectively.<\/p>\n\n\n\n Depending on the nature and severity of the incident, developers provide their expertise in understanding the codebase <\/strong>and identifying potential root causes.<\/p>\n\n\n\n They collaborate closely with the SREs to investigate the incident, analyze relevant logs, metrics, and system behavior, and contribute to resolving the issue.<\/p>\n\n\n\n It’s a unified effort where both parties bring their strengths to the table, ensuring a comprehensive response to any challenges that arise.<\/p>\n\n\n\n Post-incident retrospectives, also known in Google\u2019s SRE model as postmortems, involve a broader group of stakeholders, including developers.<\/p>\n\n\n\n These postmortems provide an opportunity for developers to contribute their insights, share lessons learned, and collectively work toward preventing similar incidents in the future.<\/p>\n\n\n\n By participating in incident response and postmortem processes, developers gain a deeper understanding of system failures and root causes.<\/p>\n\n\n\n This knowledge helps them:<\/p>\n\n\n\n This combined effort toward effective response and continuous improvement ultimately leads to more reliable and robust software.<\/p>\n\n\n Spotify has cultivated a culture of reliability engineering that permeates every nook and cranny of the organization.<\/p>\n\n\n\n It\u2019s instilled in its engineers the values of:<\/p>\n\n\n\n But how does this culture manifest itself?<\/strong><\/p>\n\n\n\n Spotify involves both its SREs and developers in incident response and 
postmortem.<\/p>\n\n\n\n This approach leverages the expertise of SREs while harnessing the deep understanding of the codebase possessed by developers to address incidents effectively and enhance the overall reliability of Spotify's services.<\/p>\n\n\n\n It's a true collaboration fostering a culture of learning and shared responsibility for the reliability of their systems.<\/p>\n\n\n\n\n In true Spotify fashion, the company revolutionized not only how music is consumed but also how organizations are structured.<\/p>\n\n\n\n SREs are embedded within cross-functional product development teams known as \u201cSquads\u201d.<\/p>\n\n\n\n But they are also part of communities of practice known as \u201cGuilds\u201d.<\/p>\n\n\n\n This \u201cSpotify model\u201d has created quite a buzz in the last 5 or so years.<\/p>\n\n\n\n It's a topic that permeates conversations on Agile practices, with the idea of squads, tribes, chapters, and guilds claiming plenty of mindshare.<\/p>\n\n\n\n But here's the kicker: Spotify's ingenious model has transcended the tech realm.<\/p>\n\n\n\n It has spread like wildfire to unexpected domains such as supermarket chains and, believe it or not, even banks.<\/p>\n\n\n\n It's a testament to the far-reaching impact of Spotify's innovative practices.<\/p>\n\n\n\n Despite detractors trying to undermine it, I doubt the model is going anywhere.<\/p>\n\n\n\n Let\u2019s briefly cover the model\u2019s four building blocks (squads, tribes, chapters, and guilds) for context:<\/p>\n\n\n A group of individuals with different skills working together for a specific objective<\/p>\n\n\n\n They have the right to make their own decisions while aligning their roadmap with the company's vision<\/p>\n\n\n\n Example squad: \u201cRecommendation algorithm\u201d squad, which is specifically focused on developing and optimizing algorithms<\/p>\n\n\n Several squads come together to form a tribe, which works towards a shared mission, promoting alignment and collaboration<\/p>\n\n\n\n Example tribe: \u201cDiscovery\u201d tribe, which is 
focused on the broader mandate of enhancing music discovery and recommendations<\/p>\n\n\n Individuals with similar skills or interests gather in chapters to exchange knowledge and develop their expertise<\/p>\n\n\n\n Example chapter: \u201cMachine Learning and Data Science\u201d, which is focused on enhancing work with data and algorithms<\/p>\n\n\n Guilds unite individuals across squads and tribes who share a common interest or passion<\/p>\n\n\n\n They allow knowledge and creativity to flow freely, resulting in breakthrough ideas and cross-pollination of talents.<\/p>\n\n\n\n Example guild: \u201cSite Reliability Engineering\u201d, which is focused on increasing the reliability of systems at scale<\/p>\n\n\n\n Using the examples listed above, the \u201cRecommendation Algorithm\u201d squad members might learn about data reliability by being part of the \u201cSRE\u201d guild.<\/p>\n\n\n\n They could then cross-pollinate this idea with their \u201cMachine Learning and Data Science\u201d chapter.<\/p>\n\n\n\n SRE as a guild within Spotify spans:<\/p>\n\n\n\n This allowed for effective seeding of the SRE practices throughout the organization.<\/p>\n\n\n\n It also enabled close collaboration between SREs and other technologists.<\/p>\n\n\n Introducing the Backstage internal developer platform (IDP)<\/p>\n\n\n\n Few tools are as public a testament to Spotify\u2019s engineer-first culture as one in particular.<\/p>\n\n\n\n I am referring to the Backstage platform, Spotify\u2019s born-and-bred internal developer platform (IDP).<\/p>\n\n\n\n Backstage plays a crucial role in fostering Spotify's engineering culture by promoting developer autonomy and end-to-end service ownership.<\/p>\n\n\n\n How does Backstage support developer autonomy?<\/p>\n\n\n\n Through it, Spotify engineers gain access to a centralized hub for managing and supporting services and for sharing knowledge.<\/p>\n\n\n\n In terms of tangible examples, the platform provides a space for engineers 
to:<\/p>\n\n\n\n This provides a few key benefits. Backstage helps:<\/p>\n\n\n\n Backstage incorporates the concept of “golden paths” as a way to provide streamlined and standardized processes for developers.<\/p>\n\n\n\n Golden paths are predefined and recommended paths that guide developers through the necessary steps and best practices for common tasks or workflows.<\/p>\n\n\n\n In the context of Backstage, golden paths are predefined templates, workflows, and guidelines<\/strong> that help developers follow proven practices.<\/p>\n\n\n\n This ensures consistency across projects.<\/p>\n\n\n\n Golden paths serve as a starting point or steps for specific tasks such as:<\/p>\n\n\n\n By following golden paths, developers can:<\/p>\n\n\n\n These paths are not static. Developers can build upon the golden paths to enhance their effect.<\/strong><\/p>\n\n\n\n What does this have to do with Site Reliability Engineering or more broadly speaking, software operations?<\/p>\n\n\n\n Here\u2019s the answer: a consistent approach to launching services to production means that they are more likely to be reliable in production.<\/p>\n\n\n\n Backstage serves as a valuable resource, contributing to increased productivity, code quality, and overall reliability of the Spotify service ecosystem.<\/p>\n\n\n Spotify hit a critical crossroads in its growth story. At one point, the cost of infrastructure outpaced revenue growth.<\/p>\n\n\n\n Management scrambled to find ways to curb cloud costs.<\/p>\n\n\n\n But they felt that they couldn’t impose new cost controls from above. After all, Spotify cherishes engineer autonomy above all.<\/p>\n\n\n\n So they looked at the issue as an engineering problem.<\/p>\n\n\n\n Spotify\u2019s Insights Cost team devised a brilliant strategy that leveraged the popularity of the Backstage platform.<\/p>\n\n\n\nSRE was a response to hypergrowth<\/h3>\n\n\n
\n
How SRE has helped Spotify\u2019s tech work<\/h2>\n\n
Spotify SREs automate to cut repetitive work<\/h3>\n\n\n
Spotify SREs support DevOps practice adoption<\/h3>\n\n\n
\n
Spotify SREs support developer response to incidents<\/h3>\n\n\n
\n
What is Spotify\u2019s SRE culture like?<\/h2>\n\n
At Spotify, everyone cares about reliability<\/h3>\n\n\n
\n
SRE spread thanks to the famous \u201cSpotify model\u201d<\/h3>\n\n\n
Squads<\/h4>\n\n\n
Tribes<\/h4>\n\n\n
Chapters<\/h4>\n\n\n
Guilds<\/h4>\n\n\n
\n
Spotify\u2019s greatest gift to software operations<\/h2>\n\n\n
\n
\n
Golden paths form the backbone of Backstage<\/h3>\n\n\n
\n
\n
Developer empowerment led to cloud cost savings<\/h3>\n\n\n