{"id":5379,"date":"2023-02-09T08:38:26","date_gmt":"2023-02-08T22:38:26","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=5379"},"modified":"2024-01-01T12:16:52","modified_gmt":"2024-01-01T02:16:52","slug":"convert-developers-site-reliability-engineers-sre-guide","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/convert-developers-site-reliability-engineers-sre-guide\/","title":{"rendered":"How to convert developers into Site Reliability Engineers (SREs)"},"content":{"rendered":"\n
In this article, you will learn the following:<\/p>\n\n\n\n
Hiring in the Site Reliability Engineering (SRE) space is notoriously difficult. So it makes sense to figure out how to expand the hiring pool beyond existing SREs<\/strong>.<\/p>\n\n\n\n One way to increase the hiring pool is to recruit developers (also known as SWEs) and gradually advance them into SRE work.<\/p>\n\n\n\n We will exclusively explore the aforementioned method in this article.<\/p>\n\n\n The difficulty in hiring SREs has continued despite the ongoing tech downturn that has been in play since mid-2022.<\/p>\n\n\n\n It\u2019s an interesting dichotomy. <\/p>\n\n\n\n While some of the larger tech companies are offloading some SRE positions, other companies are struggling to source the right kind of reliability talent<\/strong>.<\/p>\n\n\n\n The reasoning behind this could be that many laid-off SREs seek like-for-like opportunities. For example, an ex-FAANG SRE may want similar pay and benefits (and possibly hiring brand cachet) as before. <\/p>\n\n\n\n Most smaller companies would not be able to come close to the perks and pay that were being offered at the \u201ctop end of town\u201d in tech.<\/p>\n\n\n\n What is a hiring manager in a less well-known or less \u201cgenerous\u201d company to do?<\/p>\n\n\n\n In the middle of every difficulty lies opportunity.<\/p>\n\n\n A common pattern for filling SRE talent gaps in organizations \u2014 especially non-tech companies \u2014 is to turn existing sysadmins into SREs.<\/p>\n\n\n\n The challenge that comes with this is that SRE work may involve digging through code or at least understanding how it works<\/strong>. Not all, but many admins don\u2019t have this experience or interest.<\/p>\n\n\n\n Having mastery of at least one programming language is advantageous.<\/p>\n\n\n\n An SRE should be able to configure open-source tools in their codebase and also make custom tools<\/strong>. 
This gives a natural advantage to people who work with code all the time: developers.<\/p>\n\n\n\n Shashank Katlaparthi, a Site Reliability Engineer at Red Hat, shared his thoughts regarding the coding prowess of SREs:<\/p>\n\n\n\n In my current role, we SREs not only code our product\/application, we set up a CI\/CD process and automate the scaling\/provisioning\/error-handling of the Infrastructure to test our product code. To summarize, we not only code for the application, but we code the Infrastructure itself. And no, we are not using tools as-is; we are building new ones.<\/p>\n<\/blockquote>\n\n\n\n Adjacent roles also find developer backgrounds useful.<\/p>\n\n\n\n Performance Engineering is a role and skillset that has a lot of crosslinking with SRE as a practice. It\u2019s interesting to note that performance engineering managers also find it useful to work with people who have software development expertise:<\/p>\n\n\n\n In my current role, I interview candidates applying for Performance Engineer (PE) or Senior Performance Engineer roles and find most don\u2019t have any or enough software background to be effective on my team. In the end, the team I currently lead is mostly software developers, who happen to work on performance-related things. \u2014 Anonymous PE Manager<\/p>\n<\/blockquote>\n\n\n\n Hiring a developer with several years of experience should be an advantage.<\/p>\n\n\n\n Okay, but\u2026<\/p>\n\n\n Here are a few reasons why developers may consider the move:<\/p>\n\n\n\n One other aspect is that there will always be demand for SREs.<\/p>\n\n\n\n Companies will be compelled to hire SREs as they scale up or increase the complexity of their infrastructure and software architecture.<\/p>\n\n\n\n If they aim to scale up linearly by adding more operations engineers, they will need an extremely large number of such people. 
<\/p>\n\n\n\n SRE acts like a fulcrum where one SRE can automate to the level that would normally call for several non-software-driven operations people<\/strong>.<\/p>\n\n\n\n The prospect for SREs seems good, right?<\/p>\n\n\n\n There is an elephant in the room that I want to address.<\/p>\n\n\n\n SRE has had a rough start as a practice in many organizations with more traditional backgrounds and without robust investment in operations and infrastructure.<\/p>\n\n\n\n If you go on the r\/SRE subreddit or Slack chats, you may have noticed that many SREs are asking how to switch over to developer work.<\/p>\n\n\n\n This might have you doubting the possibility of turning developers into SREs.<\/p>\n\n\n\n I initially thought, \u201cThey must prefer coding or working on product\u201d, but I was wrong.<\/p>\n\n\n\n Many of these posts by SREs had a common thread: they felt the grass would be greener on the other side i.e. on feature teams. I noticed a common thread among these people. They had SRE job titles, but more than 75-90% of their time was consumed by firefighting work.<\/p>\n\n\n\n Many of these folks were solely focused on handling tickets and infra-provisioning.<\/p>\n\n\n\n These tasks are something a junior SRE might do to understand the system and issues, but they should not form the ongoing career progression of an SRE.<\/p>\n\n\n\n Site Reliability Engineering work is so much more than that.<\/p>\n\n\n\n In plain terms, you\u2019ll need to be proactively working on problems rather than firefighting<\/strong>, have a broader view of systems, and more, which is kinda exciting.<\/p>\n\n\n\n\n Developers have an advantage as they have coding ability that\u2019s useful to SRE work, but there\u2019s a lot more to how the role functions. 
<\/p>\n\n\n\n Developers moving into SRE roles will need to:<\/p>\n\n\n\n Here are some more reasons why it\u2019s difficult to get that mindset shift into SRE as a developer:<\/p>\n\n\n Site Reliability Engineers often find themselves doing the following:<\/p>\n\n\n\n A Site Reliability Engineer can expect to work with a combination of the following:<\/p>\n\n\n\n 40\u201350% are [SRE] candidates who were very close to the Google Software Engineering qualifications (i.e., 85\u201399% of the skill set required), and who, in addition,<\/strong><\/em> had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek. \u2014 Stephen Thorne<\/a>, Staff SRE @ Google<\/p>\n<\/blockquote>\n\n\n\n As stated earlier, an SRE ideally needs development and systems-oriented skills \u2013 a *Pi-shaped* skill set<\/a>, so to speak.<\/p>\n\n\n\n For this type of skill set, an SRE has to be proficient in both trades. Two areas of deep expertise with a broader understanding of other areas to form the pi shape.<\/p>\n\n\n\n Not just one or the other, which is the hallmark of a T-shaped skill set.<\/p>\n\n\n Site Reliability Engineering is still a relatively rare role in the broader software community.<\/p>\n\n\n\n However, there\u2019s little denying that the approach of Site Reliability Engineering is the future of software operations.<\/p>\n\n\n\n Here are some things that make SREs a unique breed in software work:<\/p>\n\n\n Ask any developer what they\u2019re working on, and you\u2019ll see a tiny sliver of the whole codebase. 
That makes sense for the kind of work that is coding up a feature or update.<\/p>\n\n\n\n Systems work, on the other hand, needs a holistic view of significant complexity in order to make sure the whole unit works harmoniously.<\/p>\n\n\n Because they have a scope spanning the entirety of a software system, SREs can end up working on various types of problems.<\/p>\n\n\n\n They may solve challenging problems which could take days, weeks, or months to resolve.<\/p>\n\n\n\n The old adage of \u201chow long is a piece of string?\u201d can apply to SREs estimating a fix for issues.<\/p>\n\n\n\n Some problems may be well-defined<\/strong>, like spooling up infrastructure based on known demand.<\/p>\n\n\n\n Other problems may be more abstract<\/strong>, like working out how to cost-effectively autoscale a service that has inconsistent usage patterns and needs high performance.<\/p>\n\n\n Most developers work within agile frameworks like Scrum or XP. SREs may also use these frameworks when planning software build work.<\/p>\n\n\n\n That essentially timeboxes their efforts, which is fine but\u2026<\/p>\n\n\n\n That might work for estimable problems but does not always work for production-level work.<\/p>\n\n\n\n Can an SRE stop working on a problem because it does not fit into the mold of a sprint? That could spell disaster for production software. Daniel Wilhite answers the question of \u201cCan scrum be used effectively by SRE teams?\u201d<\/a> very well.<\/p>\n\n\n You\u2019d expect SREs to get used to developers throwing the code over the wall, but no. Many are ex-developers, so they will spend much of their time coding up solutions for infrastructure and software performance.<\/p>\n\n\n\n Sometimes, they may participate in feature teams for job rotation. This helps them get a better understanding of their developer counterparts\u2019 priorities.<\/p>\n\n\n SREs come in many shapes and sizes. 
In smaller companies, a single SRE may be the one-stop shop for all site reliability matters. As a company grows, SRE roles may get divided into specialized work.<\/p>\n\n\n\n For example, one SRE may focus on supporting platforms like Kubernetes.<\/p>\n\n\n\n Another SRE may spend most of their time supporting developers in taking up DevSecOps.<\/p>\n\n\n\n Yet another may have general SRE responsibilities like being an incident commander.<\/p>\n\n\n The two roles are chalk and cheese, so it\u2019s worth considering key differences in how SREs work compared to software developers.<\/p>\n\n\n\n Chances are they will need to collaborate closely to make sure the software works well in production.<\/p>\n\n\n\n I took inspiration from a Google recruiter\u2019s interview with an SRE, Ciara Kamahele (link here<\/a>).<\/p>\n\n\n\n The key differences I uncovered are in table form below:<\/p>\n\n\n\n Here\u2019s a quick rundown of the steps a developer could take for a smooth transition to SRE work:<\/p>\n\n\n\n Sylvia Fronczak, a Senior Developer at Shopify, made a good point<\/a>: <\/p>\n\n\n\n \u201cAvoiding a sink or swim approach is important if you value inclusivity. Sink or swim breeds stress, frustration, attrition, and imposter syndrome.\u201d<\/em><\/p>\n<\/blockquote>\n\n\n\n These sink-or-swim results are all the things we don\u2019t want our new SREs to develop<\/strong>. When they do appear, they may highlight a larger structural problem in how the SRE team itself handles the work.<\/p>\n\n\n\n Google\u2019s 2016 Site Reliability Engineering<\/em> book named this particular issue \u201ctrial by fire\u201d.<\/p>\n\n\n\n Here\u2019s a quote from that book to highlight the issue:<\/p>\n\n\n\n This "trial by fire" method of orienting one\u2019s newbies is often born out of a team\u2019s current environment. Ops-driven, reactive SRE teams "train" their newest members by making them\u2026well, react! 
Over and over again\u2026 the trial-by-fire approach also presumes that many or most aspects of a team can be taught strictly by doing, rather than by reasoning. If the set of work one encounters in a ticket queue will adequately provide training for said job, then this is not an SRE position<\/strong>. (source<\/a>)<\/p>\n<\/blockquote>\n\n\n Google\u2019s SRE book also has a particularly strong stance on apprenticeship.<\/p>\n\n\n\n Essentially, newly minted SREs should not be doing operations work all the time<\/strong>, but should also see what supervisors and senior SREs are doing.<\/p>\n\n\n\n Newly transitioning SREs will look up to the rest of the team, especially senior people, to show them what a typical day should look like over time.<\/p>\n\n\n\n Henri Devieux, SRE at Dropbox, recommends the following for SRE apprentices<\/a>:<\/p>\n\n\n\n \u201cIt\u2019s important to create a focused plan for what you need to learn. The fundamentals will always be more important to nail down than any one awesome new piece of software. (editor\u2019s note: including ChatGPT<\/em>) Stay focused.\u201d<\/p>\n<\/blockquote>\n\n\n\n One of New Relic\u2019s SREs, Yonathan Schultz, has broken down his day in the life of an SRE<\/a> to give a picture of a \u201ctypical day\u201d of an experienced SRE.<\/p>\n\n\n\n But of course, there is no typical day for SREs.<\/p>\n\n\n\n New challenges emerge daily, and SREs do many complex activities that may seem alien to an apprentice.<\/p>\n\n\n\n Henri outlined that his day as an SRE<\/a> could comprise a myriad of tasks like:<\/p>\n\n\n\n To avoid causing cognitive overload, I suggest an onboarding program<\/strong> that eases apprentice SREs into increasingly complex work patterns.<\/p>\n\n\n\n Google\u2019s Site Reliability Engineering<\/em> book has a brilliant progression framework for onboarding SREs<\/a>. 
<\/p>\n\n\n\n A formal apprenticeship program may be useful for SRE teams that can get VP support for it.<\/p>\n\n\n\n Tammy Bryant Butow created an SRE apprenticeship program when she first started working as an SRE manager at Dropbox.<\/p>\n\n\n\n This was a direct response to the difficulty she experienced hiring SREs. Here are the trigger events supporting the program:<\/p>\n\n\n\n Her apprenticeship program was in a 6-month format.<\/p>\n\n\n\n People from software engineering and technical manager backgrounds had 1:1 mentoring from an experienced SRE.<\/p>\n\n\n\n They could contribute through architecture ideas, sit in on postmortem meetings and ask questions about various aspects of the system.<\/p>\n\n\n\n Concepts were explained from grassroots all the way down to how to pronounce terms like NGINX. Tammy ensured that the curse of knowledge \u2014 \u201cthis should be obvious\u201d \u2014 did not affect her mentees\u2019 learning.<\/p>\n\n\n\n End results were:<\/p>\n\n\n\n The full rundown of Tammy\u2019s experience with her SRE apprenticeship program can be heard here<\/a><\/p>\n\n\n\n One final point: Tammy\u2019s apprenticeship program at Dropbox did not guarantee a job at the end of it, but the pass rate was high.<\/p>\n\n\n\n Once your apprentices are ready to start a fully-fledged role, consider your onboarding work.<\/p>\n\n\n Your aim is to minimize time-to-productivity for your newly hired SRE.<\/p>\n\n\n\n Some prickly issues will come in the way. 
I picked up several tips from a post written by Gergely Orosz on ways more senior engineers get stuck when they take on a new role.<\/p>\n\n\n\n I\u2019ve noticed that these issues affect new SREs too:<\/p>\n\n\n\n I will aim to help you solve some of these prickly issues:<\/p>\n\n\n\n I recently read an engineer onboarding post by Luca Rossi, who is an Italian developer influencer.<\/p>\n\n\n\n One of his recommendations was to break the work down into small pieces of a bigger puzzle and give them to the new hire for quick wins.<\/p>\n\n\n\n This made me think about the SRE context. Here are my thoughts:<\/p>\n\n\n\n You could potentially start junior SREs on a high process orientation i.e. Do this, then review this<\/em>, etc. But as they progress, you need to let them work outside the boundaries of the \u201cprocess\u201d.<\/p>\n\n\n\n Google\u2019s SRE book considers process orientation an antipattern.<\/p>\n\n\n\n In my experience, initially having a predetermined path for the work in the form of processes can alleviate a lot of ambiguity and stress<\/strong>.<\/p>\n\n\n\n The idea is to give training wheels and slowly take them off as the new SRE gets more confident.<\/p>\n\n\n\n Then it\u2019s time to encourage tinkering, using statistical methods and scientific processes<\/strong> to solve ambiguous, complex problems.<\/p>\n\n\n Final emphasis: at no point should SREs become the main destination for resolving operations tickets<\/strong> or taking the entire on-call load.<\/p>\n\n\n\n That turns them into operations in the traditional sense of responding to tickets as issues arise. Your SRE team will NOT benefit from their developer skills if they\u2019re too busy putting out fires.<\/p>\n\n\n\n Developers will make effective SREs as long as you let them solve operations issues as software engineering problems<\/em>.<\/p>\n\n\n I have been working on something exciting in the background for a while now. 
It\u2019s not yet ready, but I would like to have conversations with SRE managers to refine the approach.<\/p>\n\n\n\n It will be a software-based tool but oriented to the human side of SRE rather than the technology side. So nothing like an observability dashboard or incident triage tool.<\/p>\n\n\n\n Using it, SRE managers can map out the team\u2019s work as it stands. They can then use it for 3 purposes:<\/p>\n\n\n\n #1 and #2 will naturally help with integrating new hires and evolving their work over time.<\/p>\n\n\n\nWhy are SREs so hard to hire despite the tech downturn?<\/h2>\n\n\n
Hiring developers overcomes an SRE hiring issue<\/h2>\n\n\n
\n
\n
Why would developers switch to SRE?<\/h2>\n\n\n
\n
Developers may still find it hard to switch to SRE<\/h2>\n\n\n
\n
An SRE\u2019s scope of work is W-I-D-E<\/h3>\n\n\n
\n
An SRE\u2019s toolbox is H-U-G-E<\/h3>\n\n\n
\n
\n
Difference between working styles of SREs and developers<\/h2>\n\n\n
SREs look at the broader picture<\/strong><\/h3>\n\n\n
SREs thrive in ambiguity<\/strong><\/h3>\n\n\n
SREs work beyond constraints like Scrum<\/strong><\/h3>\n\n\n
SREs don\u2019t stay in their lane<\/strong><\/h3>\n\n\n
SREs don\u2019t have a monolith job description<\/strong><\/h3>\n\n\n
Comparison with software developers<\/strong><\/h3>\n\n\n
Pre-game plan for developers aiming for SRE jobs<\/h2>\n\n\n
\n
Helping developers safely enter SRE work<\/h2>\n\n
Avoid places with a sink-or-swim mentality<\/h3>\n\n\n
\n
\n
Become an apprentice<\/h3>\n\n\n
\n
\n
\n
\n
How to onboard newly minted SREs effectively<\/h2>\n\n\n
\n
Here are some concrete onboarding tips for SREs<\/h3>\n\n\n
\n
Luca Rossi\u2019s \u201csmall part of the puzzle\u201d onboarding technique<\/h3>\n\n\n
\n
Evolving the workload as new hires become more effective<\/h3>\n\n\n
\n
Related work<\/h2>\n\n\n
\n