{"id":5312,"date":"2023-01-25T07:07:18","date_gmt":"2023-01-24T21:07:18","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=5312"},"modified":"2024-01-01T12:27:43","modified_gmt":"2024-01-01T02:27:43","slug":"rundown-of-linkedins-sre-practices","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/rundown-of-linkedins-sre-practices\/","title":{"rendered":"Rundown of LinkedIn’s SRE practices"},"content":{"rendered":"
LinkedIn has one of the most robust Site Reliability Engineering (SRE) practices around. <\/p>\n\n\n\n
After all, as the social network of record for jobseekers and salespeople, it is the 6th most trafficked website in the world, with over 1.5 billion unique visits per month<\/strong>. <\/p>\n\n\n\n LinkedIn’s Site Reliability Engineers (SREs) ensure all that traffic gets served with minimal dropouts and performance degradation. <\/p>\n\n\n\n SRE efforts will only continue to grow, as the company has an ambitious goal to \u201ccreate economic opportunity for every member of the global workforce\u201d. <\/p>\n\n\n\n LinkedIn’s management aims for it to be more than an online resume: it wants to build an \u201ceconomic graph\u201d, similar to Facebook\u2019s social graph, that maps every aspect of the global economy, including companies, jobs, schools, and skills.<\/p>\n\n\n Each Site Reliability Engineer at LinkedIn is responsible for ~500 machines, but that can go up or down depending on the needs of the system on the day.<\/p>\n\n\n\n \u201cIt is impressive to be able to handle this kind of [work]. This contributes to the challenge of finding the right people. Finding SREs who can operate at this level is a very big challenge.\u201d<\/p>\n\u2014 excerpt from video talk on SRE Hiring<\/a> by Greg Leffler, SRE manager at LinkedIn from 2012 to 2016<\/cite><\/blockquote>\n\n\n\n LinkedIn’s SRE managers found that many SRE candidates sought work that would allow them to leave a legacy when they moved on<\/strong>. <\/p>\n\n\n\n This trait correlates with the capability to work with broad-spanning complex systems in ambiguous circumstances.<\/p>\n\n\n\n If you can do out-of-the-box thinking and work in tough situations, you’d want that work to have a lasting impact.<\/p>\n\n\n In the mid-2010s, many SREs were distributed across product engineering teams, i.e. embedded into each product team. <\/p>\n\n\n\n The reasoning behind this was that centralized teams eventually face issues that come with becoming a \u201cshared service\u201d, i.e. 
not so critical to the day-to-day but only consulted if really necessary. <\/p>\n\n\n\n The centralized model would not have been conducive to LinkedIn’s SRE needs at the time. The thing to remember is that SRE is a continuously high-involvement practice and not only for when things go wrong.<\/p>\n\n\n LinkedIn’s engineering management focused strongly on creating SRE teams in India. This may be partly due to hiring difficulties endemic to the SRE field in North America and Europe. <\/p>\n\n\n\n There were initial difficulties as SRE had a stigma within talent pools in India. <\/p>\n\n\n\n Many with years of experience perceived SRE as a relabelled or glorified systems administration role. LinkedIn’s SRE hiring managers instead focused on turning junior engineers and graduates into high-potential SREs. <\/p>\n\n\n\n They emphasized from the outset that SRE was not a typical operations role and that it would involve broad-spanning work in ambiguous settings.<\/p>\n\n\n\n As of 2017, the Bangalore office of LinkedIn had 60 SREs in 10 teams. <\/p>\n\n\n In 2013, LinkedIn rebranded teams in the existing discipline of AppOps to SRE. The newly minted SREs worked alongside stratified operations teams in verticals such as systems, networks, applications, and DBA. <\/p>\n\n\n\n This way of running software operations proved difficult as LinkedIn continued its growth trajectory. Several SREs noted issues like:<\/p>\n\n\n\n SREs originally started off in this environment as firefighters but evolved during a drastic shift in the software operations at LinkedIn.<\/p>\n\n\n There was a need for change in the mid-2010s considering 100 million members relied on LinkedIn at the time. <\/p>\n\n\n\n LinkedIn could not mess up its increasingly complex software operations.<\/p>\n\n\n\n Something had to give, as LinkedIn was suffering frequent outages and performance issues despite now having an SRE team. 
<\/p>\n\n\n\n LinkedIn SREs and the wider organization had to challenge several antipatterns:<\/p>\n\n\n\n The decision was then made across the organization to radically change several aspects of software operations. This included a drastic shift in software architecture and, for the first time, developer involvement in operations.<\/p>\n\n\n\n To achieve this lofty goal, product development was stopped for 3 months<\/strong>. <\/p>\n\n\n\n\n\n This signified a gutsy commitment to drive the necessary change.<\/p>\n\n\n\n Several factors supported the transition to more effective software operations:<\/p>\n\n\n\n SREs initially advocated a few human-related practices to drive the change, including:<\/p>\n\n\n\n A few of the technical changes included:<\/p>\n\n\n\n Here’s a quick rundown of the cadence for self-service deployments for developers<\/strong> at LinkedIn:<\/p>\n\n\n\n This self-service model has supported 15,000 commits and 600+ feature ramp-ups per day.<\/p>\n\n\n\n Now that the self-service model is well entrenched in the LinkedIn culture, SREs have turned to other problems in areas such as performance, platforms, resilience, and architecture.<\/p>\n\n\n\n\ud83d\udcca Here are some performance statistics for LinkedIn<\/h2>\n\n\n
As of 2022, LinkedIn had:<\/h3>\n\n\n
\n
In 2017, LinkedIn had:<\/h3>\n\n\n
\n
\n
\ud83d\udd31 How SRE fits into LinkedIn’s engineering culture<\/h2>\n\n
Team formation<\/h3>\n\n\n
\n
Many SREs at LinkedIn work as “embedded SREs”<\/strong><\/h4>\n\n\n
LinkedIn has a large SRE base in India<\/strong><\/h4>\n\n\n
History of SRE at LinkedIn<\/h3>\n\n
Prior to a modern SRE discipline (the early 2010s)<\/strong><\/h4>\n\n\n
\n
Transitioning to SRE from previous operational discipline<\/strong><\/h4>\n\n\n
Antipattern identifier<\/strong><\/td> Antipattern description<\/strong><\/td> Shift to these propatterns\u2026<\/strong><\/td><\/tr> Firefighter \ud83d\udc68\ud83c\udffe\u200d\ud83d\ude92<\/td> \u274c react to incidents as they happen, just to keep the company functioning one more day<\/td> \u2705 automate the manual work
\u2705 deliver instrumentation to get rich data on issues and alerts
\u2705 understand the stack at a deeper level<\/td><\/tr>Gatekeeper \ud83d\udc82\ud83c\udffd<\/td> \u274c control releases to protect the site from developers
\u274c \u201ctalk to me if you want to touch production\u201d
\u274c push a button to deploy someone else\u2019s work
\u274c have arbitrary schedules for releases<\/td>\u2705 develop automated gatekeepers to assure quality
\u2705 let developers own their work in production
\u2705 support developers in self-service deployment with the use of pre-launch checklists and release guidance <\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n\n
How the change was actioned toward a modern SRE discipline<\/strong><\/h4>\n\n\n
\n
\n
\n