{"id":5312,"date":"2023-01-25T07:07:18","date_gmt":"2023-01-24T21:07:18","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=5312"},"modified":"2024-01-01T12:27:43","modified_gmt":"2024-01-01T02:27:43","slug":"rundown-of-linkedins-sre-practices","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/rundown-of-linkedins-sre-practices\/","title":{"rendered":"Rundown of LinkedIn&#8217;s SRE practices"},"content":{"rendered":"<h2 class=\"gb-headline gb-headline-a1b666e4 gb-headline-text\" id=\"introduction\">Introduction<\/h2>\n\n\n<p>LinkedIn has one of the most robust Site Reliability Engineering (SRE) practices around. <\/p>\n\n\n\n<p>After all, as the social network of record for jobseekers and salespeople, it is the <strong>6th most trafficked website in the world, with over 1.5 billion unique visits per month<\/strong>.  <\/p>\n\n\n\n<p>LinkedIn&#8217;s Site Reliability Engineers (SREs) ensure all that traffic gets served with minimal dropouts and performance degradation. <\/p>\n\n\n\n<p>SRE efforts will only continue to grow, as the company has an ambitious goal to \u201ccreate economic opportunity for every member of the global workforce\u201d. <\/p>\n\n\n\n<p>LinkedIn&#8217;s management aims for it to be more than an online resume and make an \u201ceconomic graph\u201d; similar to Facebook\u2019s social graph. 
This graph would map every aspect of the global economy &#8211; companies, jobs, schools, skills, etc.<\/p>\n\n\n<h2 class=\"gb-headline gb-headline-3325a492 gb-headline-text\" id=\"%25f0%259f%2593%258a-here-are-some-performance-statistics-for-linkedin\">\ud83d\udcca Here are some performance statistics for LinkedIn<\/h2>\n\n\n<figure class=\"gb-block-image gb-block-image-a925ba7b\"><img loading=\"lazy\" decoding=\"async\" width=\"1057\" height=\"319\" class=\"gb-image gb-image-a925ba7b\" src=\"https:\/\/sysmit.com\/cf22\/wp-content\/uploads\/linkedin-sre-practice-statistics-1.png\" alt=\"\" title=\"linkedin-sre-practice-statistics-1\" srcset=\"https:\/\/sysmit.com\/cf22\/wp-content\/uploads\/linkedin-sre-practice-statistics-1.png 1057w, https:\/\/sysmit.com\/cf22\/wp-content\/uploads\/linkedin-sre-practice-statistics-1-300x91.png 300w, https:\/\/sysmit.com\/cf22\/wp-content\/uploads\/linkedin-sre-practice-statistics-1-1024x309.png 1024w, https:\/\/sysmit.com\/cf22\/wp-content\/uploads\/linkedin-sre-practice-statistics-1-768x232.png 768w\" sizes=\"(max-width: 1057px) 100vw, 1057px\" \/><\/figure>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"as-of-2022-linkedin-had\">As of 2022, LinkedIn had:<\/h3>\n\n\n<ul>\n<li>850 million members<\/li>\n\n\n\n<li>Between 1.25 and 1.5 billion unique visits per month&nbsp;<\/li>\n\n\n\n<li>In May 2022, close to 1.5 billion unique global visitors visited LinkedIn.com, up from 1.3 billion visitors in December 2021<\/li>\n\n\n\n<li>39% (or 340 million) of users as Premium members<\/li>\n<\/ul>\n\n\n<h3 class=\"gb-headline gb-headline-1edd296a gb-headline-text\" id=\"in-2017-linkedin-had\">In 2017, LinkedIn had:<\/h3>\n\n\n<ul>\n<li>1500+ services in production<\/li>\n\n\n\n<li>600 TB of stored data&nbsp;<\/li>\n\n\n\n<li>ranking of 8th busiest website in the world<\/li>\n\n\n\n<li>100+ SREs supporting 1000+ SW engineers (ratio 10:1 makes 
sense)<\/li>\n\n\n\n<li>20,000+ production machines in operations<\/li>\n\n\n\n<li>300+ RESTful services<\/li>\n\n\n\n<li>performance targets of 10ms latency for 99% of services<\/li>\n\n\n\n<li>100+ Kafka clusters handling 7 trillion messages per day<\/li>\n<\/ul>\n\n\n\n<p>Each Site Reliability Engineer at LinkedIn is responsible for ~500 machines, but that can go up or down depending on the needs of the system on the day.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>\u201cIt is impressive to be able to handle this kind of [work]. This contributes to the challenge of finding the right people. Finding SREs who can operate at this level is a very big challenge.\u201d<\/p>\n<cite>\u2014 <a href=\"https:\/\/youtu.be\/ZemNg9GYvOA?t=311\">excerpt from video talk on SRE Hiring<\/a> by Greg Leffler, SRE manager at LinkedIn from 2012-2016<\/cite><\/blockquote>\n\n\n\n<p>LinkedIn&#8217;s SRE managers found that many SRE candidates sought <strong>work that would allow them to leave a legacy when they move on<\/strong>. 
<\/p>\n\n\n\n<p>This trait correlates with the capability to work with broad-spanning complex systems in ambiguous circumstances.<\/p>\n\n\n\n<p>If you can do out-of-the-box thinking and work in tough situations, you&#8217;d want that work to have a lasting impact.<\/p>\n\n\n<h2 class=\"gb-headline gb-headline-3116586b gb-headline-text\" id=\"%25f0%259f%2594%25b1-how-sre-fits-into-linkedins-engineering-culture\">\ud83d\udd31 How SRE fits into LinkedIn&#8217;s engineering culture<\/h2>\n\n<h3 class=\"gb-headline gb-headline-ba7cb332 gb-headline-text\" id=\"team-formation\">Team formation<\/h3>\n\n\n<ul>\n<li>300+ SREs across 4 global offices \u2014 Bay Area (San Francisco, Palo Alto), New York City, Bangalore<\/li>\n\n\n\n<li>Team size depends on whether SREs are embedded in a product team or part of a dedicated SRE team<\/li>\n\n\n\n<li>SREs get to pick the kind of team they want to be part of&nbsp;<\/li>\n\n\n\n<li>Hybrid roles exist, i.e. SREs who focus on systems or software development<\/li>\n<\/ul>\n\n\n<h4 class=\"gb-headline gb-headline-9aae1075 gb-headline-text\" id=\"many-sres-at-linkedin-as-embedded-sres\"><strong>Many SREs at LinkedIn work as &#8220;embedded SREs&#8221;<\/strong><\/h4>\n\n\n<p>In the mid-2010s, many SREs were distributed across product engineering teams, i.e. embedded into the product team. <\/p>\n\n\n\n<p>The reasoning behind this was that centralized teams eventually face issues that come with becoming a \u201cshared service\u201d, i.e. not critical to the day-to-day and consulted only when really necessary. <\/p>\n\n\n\n<p>The centralized model would not have been conducive to LinkedIn&#8217;s SRE needs at the time. 
The thing to remember is that SRE is a continuously high-involvement practice and not only for when things go wrong.<\/p>\n\n\n<h4 class=\"gb-headline gb-headline-c2bc8b88 gb-headline-text\" id=\"linkedin-has-a-large-sre-base-in-india\"><strong>LinkedIn has a large SRE base in India<\/strong><\/h4>\n\n\n<p>LinkedIn&#8217;s engineering management focused strongly on creating SRE teams in India. This may be partly due to hiring difficulties endemic to the SRE field in North America and Europe. <\/p>\n\n\n\n<p>There were initial difficulties as SRE had a stigma within talent pools in India. <\/p>\n\n\n\n<p>Many with years of experience perceived SRE as a relabelled or glorified systems administration role. LinkedIn&#8217;s SRE hiring managers instead focused on turning junior engineers and graduates into high-potential SREs. <\/p>\n\n\n\n<p>They emphasized from the outset that SRE was not a typical operations role and that it would involve broad-spanning work in ambiguous settings.<\/p>\n\n\n\n<p>As of 2017, the Bangalore office of LinkedIn had 60 SREs in 10 teams.&nbsp; <\/p>\n\n\n<h3 class=\"gb-headline gb-headline-f317bbcc gb-headline-text\" id=\"history-of-sre-at-linkedin\">History of SRE at LinkedIn<\/h3>\n\n<h4 class=\"gb-headline gb-headline-43d29896 gb-headline-text\" id=\"prior-to-a-modern-sre-discipline-the-early-2010s\"><strong>Prior to a modern SRE discipline (the early 2010s)<\/strong><\/h4>\n\n\n<p>In 2013, LinkedIn rebranded teams in the existing discipline of AppOps to SRE. The newly minted SREs worked alongside stratified operations teams in verticals such as systems, networks, applications, and DBA. <\/p>\n\n\n\n<p>This way of running software operations proved difficult as LinkedIn continued its growth trajectory. 
Several SREs noted issues like:<\/p>\n\n\n\n<ul>\n<li>Teams cooperated only by sending tickets to each other&nbsp;<\/li>\n\n\n\n<li>There were walls not only between dev and ops but within ops itself<\/li>\n\n\n\n<li>Developers did not have access to production or even most non-production environments. The attitude was \u201chand me the code, and I will deploy it\u201d, so ops were bogged down with release operations<\/li>\n\n\n\n<li>Pager fatigue was extreme, with operations people training significant others to keep an eye out for alerts on the BlackBerry while they slept (3 alerts every 5 minutes on average)<\/li>\n\n\n\n<li>Outages hit Mon-Wed between 6 and 10am due to capacity issues, with little visibility into demand because there was no observability instrumentation<\/li>\n\n\n\n<li>Outages were so frequent that there was one every day of the calendar year<\/li>\n\n\n\n<li>MTTR was 1500 minutes (25 hours), meaning issues were not getting resolved on the same day<\/li>\n<\/ul>\n\n\n\n<p>SREs originally started off in this environment as firefighters but evolved during a drastic shift in the software operations at LinkedIn.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"transitioning-to-sre-from-previous-operational-discipline\"><strong>Transitioning to SRE from previous operational discipline<\/strong><\/h4>\n\n\n<p>There was a need for change in the mid-2010s considering 100 million members relied on LinkedIn at the time. <\/p>\n\n\n\n<p>LinkedIn could not mess up its increasingly complex software operations.<\/p>\n\n\n\n<p>Something had to give, as LinkedIn was suffering frequent outages and performance issues despite now having an SRE team. 
<\/p>\n\n\n\n<p>LinkedIn SREs and the wider organization had to challenge several antipatterns:<\/p>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><tbody><tr><td><strong>Antipattern identifier<\/strong><\/td><td><strong>Antipattern description<\/strong><\/td><td><strong>Shift to these propatterns\u2026<\/strong><\/td><\/tr><tr><td>Firefighter  \ud83d\udc68\ud83c\udffe\u200d\ud83d\ude92<\/td><td>\u274c react to handle incidents that happen to keep the company functioning one more day<\/td><td>\u2705 automate the manual work<br>\u2705 deliver instrumentation to get rich data on issues and alerts<br>\u2705 understand the stack at a deeper level<\/td><\/tr><tr><td>Gatekeeper \ud83d\udc82\ud83c\udffd<\/td><td>\u274c control releases to protect the site from developers<br>\u274c \u201ctalk to me if you want to touch production\u201d<br>\u274c push a button to deploy someone else\u2019s work<br>\u274c have arbitrary schedules for releases<\/td><td>\u2705 develop automated gatekeepers to assure quality<br>\u2705 let developers own their work in production<br>\u2705 support developers in self-service deployment with the use of pre-launch checklists and release guidance <\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The decision was then made across the organization to radically change several aspects of software operations. This included a drastic shift in software architecture and developer involvement in operations for the first time.<\/p>\n\n\n\n<p>To achieve this lofty goal, <strong>product development was stopped for 3 months<\/strong>. <\/p>\n\n\n\n\n\n<p>This signified a gutsy commitment to drive the necessary change.<\/p>\n\n\n\n<p>Several factors supported the transition to more effective software operations:<\/p>\n\n\n\n<ul>\n<li>The transition received solid management support from all affected VPs<\/li>\n\n\n\n<li>The umbrella operations organization with sysadmins, network engineers, DB admins, etc. 
remained, but communication made it clear that it would not be business as usual&nbsp;<\/li>\n\n\n\n<li>Major changes to how the software was architected came with support from experienced SREs and other seasoned engineers<\/li>\n\n\n\n<li>Developers were increasingly involved in discussions about their ownership of software in production <\/li>\n<\/ul>\n\n\n<h4 class=\"gb-headline gb-headline-74d285d6 gb-headline-text\" id=\"how-the-change-was-actioned-toward-a-modern-sre-discipline\"><strong>How the change was actioned toward a modern SRE discipline<\/strong><\/h4>\n\n\n<p>SREs initially advocated a few human-related practices to drive the change, including:<\/p>\n\n\n\n<ul>\n<li>encouraging and supporting developers in joining the on-call roster; some development teams embraced this, while others kept their distance from on-call&nbsp;<\/li>\n\n\n\n<li>implementing DevOps, where developers are involved in operational thinking and behaviors around self-deploying code &#8211; \u201cIf it\u2019s checked-in, it\u2019s ready to go to production\u201d<\/li>\n<\/ul>\n\n\n\n<p>A few of the technical changes included:<\/p>\n\n\n\n<ul>\n<li>Move to a service-oriented architecture (SOA) and reduce dependence on monolithic artifacts<\/li>\n\n\n\n<li>Develop self-service portals for consuming metrics around services<\/li>\n\n\n\n<li>Enable graceful degradation of services rather than returning 500 errors for the whole request load<\/li>\n\n\n\n<li>Adopt a distributed database model over monolithic databases to reduce SPOF (single point of failure) risk<\/li>\n\n\n\n<li>Switch from hardware to software-based load balancing<\/li>\n\n\n\n<li>Adopt a centralized branch model to simplify the commit process compared to feature branches<\/li>\n\n\n\n<li>Implement multiple testing methodologies like pre-commit, PCS, and PCL <\/li>\n\n\n\n<li>Develop an auto-remediation system to replace manual network operations center (NOC) processes \u2014 resulting in a career transition program for traditional 
operations roles into SRE<\/li>\n\n\n\n<li>Help developers run A\/B tests to ensure backward compatibility of their commit before users see it&nbsp;<\/li>\n\n\n\n<li>Support self-service deployment with a <a href=\"https:\/\/engineering.linkedin.com\/blog\/2016\/12\/mttd-and-mttr-are-key\">canary push to production<\/a> to ensure stability before rollout to the entire cluster <\/li>\n<\/ul>\n\n\n\n<p>Here&#8217;s a quick rundown of the <strong>cadence for self-service deployments for developers<\/strong> at LinkedIn:<\/p>\n\n\n\n<ol>\n<li>Canary to a single production instance<\/li>\n\n\n\n<li>Check the automated metrics-based validation of success<\/li>\n\n\n\n<li>Promote deployment to a single production data center<\/li>\n\n\n\n<li>Promote deployment to remaining production data centers<\/li>\n\n\n\n<li>Ramp to a progressively larger share of the member base&nbsp;<\/li>\n<\/ol>\n\n\n\n<p>This self-service model has supported 15,000 commits and 600+ feature ramp-ups per day.<\/p>\n\n\n\n<p>With the self-service model now well entrenched in LinkedIn&#8217;s culture, SREs have turned to other problems in areas like performance, platforms, resilience, and architecture.<\/p>\n\n\n\n<p>They use observability data to inform how they develop and <a href=\"https:\/\/youtu.be\/C8EL8rw4A8w?t=1246\">implement tools that scale the LinkedIn application in production<\/a>. 
<\/p>\n\n\n\n<p>To create these tools, SREs often draw on computer science skills, applying the scientific method and statistical analysis and implementing machine learning models.<\/p>\n\n\n<h3 class=\"gb-headline gb-headline-2d1d3dcf gb-headline-text\" id=\"what-is-linkedins-sre-culture-like\">What is LinkedIn&#8217;s SRE culture like?<\/h3>\n\n<h4 class=\"gb-headline gb-headline-d5d987b4 gb-headline-text\" id=\"3-key-principles\"><strong>3 key principles<\/strong> <\/h4>\n\n\n<p>Ben Purgason, Director of SRE at LinkedIn from 2017 to 2018, summarized LinkedIn&#8217;s SRE principles as:<\/p>\n\n\n\n<ol>\n<li>Site up <\/li>\n\n\n\n<li>Empower developer ownership<\/li>\n\n\n\n<li>Operations is an engineering problem<\/li>\n<\/ol>\n\n\n\n<p>Let&#8217;s explore each of these 3 principles in further detail:<\/p>\n\n\n\n<ul>\n<li>&#8220;Site up&#8221; is a simple catchphrase with the aim of keeping reliability front of mind for every engineer and in every system<\/li>\n\n\n\n<li>&#8220;Empower developer ownership&#8221; asserts that developers own their work end-to-end, not just in the development phase<\/li>\n\n\n\n<li>&#8220;Operations is an engineering problem&#8221; aims to involve engineering prowess in solving issues, not just push-button solutions<\/li>\n<\/ul>\n\n\n<h4 class=\"gb-headline gb-headline-05f01e62 gb-headline-text\" id=\"hiring-culture-at-linkedin\"><strong>Hiring culture at LinkedIn<\/strong><\/h4>\n\n\n<p>Hiring culture at LinkedIn SRE aims for a good &#8220;culture fit&#8221;. This is a common practice among many modern tech companies. <\/p>\n\n\n\n<p>However, there is an important distinction in LinkedIn SRE&#8217;s definition of culture fit. <\/p>\n\n\n\n<p>Greg Leffler says, &#8220;Don\u2019t just hire someone you\u2019d want to hang out with after hours or on the weekend, or they\u2019ve won hackathons or went to elite schools, but someone who can do the job&#8221;. 
<\/p>\n\n\n\n<p>After all, Site Reliability Engineering is a unique line of work where success calls for more than past accolades and social prowess.<\/p>\n\n\n<h4 class=\"gb-headline gb-headline-3a83a392 gb-headline-text\" id=\"enable-everyone-to-selfservice-their-needs\"><strong>Enable everyone to self-service their needs<\/strong><\/h4>\n\n\n<p>As the LinkedIn member base grew, more services were added, and developer demand for hands-on support intensified. The need for a self-service approach became apparent as operations teams became stretched thin.<\/p>\n\n\n\n<p>Over the years, LinkedIn SREs have developed several self-service tools and portals. <\/p>\n\n\n\n<p>In particular, they focused on helping teams consume metrics around their specific service. Such metrics allowed teams to benchmark their service against other services. <\/p>\n\n\n\n<p>This also allowed SREs to show teams whether their service consumed outsized resources and needed to be optimized.<\/p>\n\n\n<h4 class=\"gb-headline gb-headline-90f006fd gb-headline-text\" id=\"fight-the-hero-worship-mentality\"><strong>Fight the &#8220;hero worship&#8221; mentality<\/strong><\/h4>\n\n\n<p>A long-lived philosophy in the software operations space has been to perform heroics to rescue systems from critical failures, e.g. &#8220;I survived going through a 36-hour war room to solve an ops crisis&#8221;. <\/p>\n\n\n\n<p>LinkedIn&#8217;s change moved engineers away from that mindset. They have been steered to <strong>aim for proactive solutions that prevent the need for war rooms and heroic efforts in the first place<\/strong>.&nbsp;<\/p>\n\n\n<h4 class=\"gb-headline gb-headline-c4b9b77e gb-headline-text\" id=\"hunt-the-problems-without-a-straightforward-fix\"><strong>Hunt the problems without a straightforward fix<\/strong><\/h4>\n\n\n<p>Ben Purgason believed that it was important for LinkedIn&#8217;s SREs to hunt problems that don\u2019t have a straightforward fix. 
<\/p>\n\n\n\n<p>The same went for problems that fit into the \u201ctoo hard\u201d basket for regular or old-school operations mindsets.<\/p>\n\n\n\n<p>An example would be Ben&#8217;s <a href=\"https:\/\/engineering.linkedin.com\/blog\/2016\/11\/every-day-is-monday-in-operations\">\u201cEvery day is Monday\u201d<\/a> story about resolving a Python issue that had to be traced across all systems running the compiler. Let&#8217;s get into the details&#8230;<\/p>\n\n\n\n<p>The fix was not a straightforward triage and took 750 engineer hours. However, the payoff justified the effort. The fix saved close to 233 hours <em>per day<\/em> that would have been wasted in managing deployment errors, so the investment paid for itself in a little over three days.&nbsp; <\/p>\n\n\n\n<p>From a distance, the initial issue seemed to be a minor hindrance for some engineers. But the gravity of the situation grew fast as more systems and dependencies were found to be affected. <\/p>\n\n\n\n<p>The error stemmed from a change that affected a subsystem in an unexpected way. <\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><em>\u201cThe high-profile failures get dealt with immediately (roll back the change), but the minor ones, like an integration test with a 1% increased chance of failure, often slip through the cracks. These minor problems build on top of each other, resulting in a major problem with no obvious single cause. Regardless of the failure type, it is critical that we be aware of changes that occurred around the same time.\u201d<\/em> <\/p>\n<cite>\u2014 Ben Purgason, Director of SRE at LinkedIn from 2017 to 2018<\/cite><\/blockquote>\n\n\n<h2 class=\"gb-headline gb-headline-20bb6558 gb-headline-text\" id=\"parting-words\">Parting words <\/h2>\n\n\n<p>Countless issues happen in software systems all the time. They are often too peculiar or opaque for day-to-day teams to power through, especially if they have less proactive mindsets. 
<\/p>\n\n\n\n<p>Truth is that systems are constantly changing, and someone needs to be able to navigate this. They do not have to know everything but investigate, experiment, and find solutions through unpaved paths.&nbsp;<\/p>\n\n\n\n<p>As you have read above, the Site Reliability Engineers at LinkedIn are now well-equipped to handle tricky situations.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction LinkedIn has one of the most robust Site Reliability Engineering (SRE) practices around. After all, as the social network of record for jobseekers and salespeople, it is the 6th most trafficked website in the world, with over 1.5 billion unique visits per month. LinkedIn&#8217;s Site Reliability Engineers (SREs) ensure all that traffic gets served [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[60,8],"tags":[49,78],"_links":{"self":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/5312"}],"collection":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/comments?post=5312"}],"version-history":[{"count":62,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/5312\/revisions"}],"predecessor-version":[{"id":5754,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/5312\/revisions\/5754"}],"wp:attachment":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/media?parent=5312"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/categories?post=5312"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/tags?post=5312"}],"
curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}