{"id":5379,"date":"2023-02-09T08:38:26","date_gmt":"2023-02-08T22:38:26","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=5379"},"modified":"2024-01-01T12:16:52","modified_gmt":"2024-01-01T02:16:52","slug":"convert-developers-site-reliability-engineers-sre-guide","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/convert-developers-site-reliability-engineers-sre-guide\/","title":{"rendered":"How to convert developers into Site Reliability Engineers (SREs)"},"content":{"rendered":"\n
In this article, you will learn the following:<\/p>\n\n\n\n
Hiring in the Site Reliability Engineering (SRE) space is notoriously difficult. So it makes sense to figure out how to expand the hiring pool beyond existing SREs<\/strong>.<\/p>\n\n\n\n One way to increase the hiring pool is to recruit developers (also known as SWEs) and gradually advance them into SRE work.<\/p>\n\n\n\n This article focuses exclusively on that approach.<\/p>\n\n\n The difficulty in hiring SREs has continued despite the ongoing tech downturn that has been under way since mid-2022.<\/p>\n\n\n\n It\u2019s an interesting dichotomy. <\/p>\n\n\n\n While some of the larger tech companies are offloading some SRE positions, other companies are struggling to source the right kind of reliability talent<\/strong>.<\/p>\n\n\n\n One reason could be that many laid-off SREs seek like-for-like opportunities. For example, an ex-FAANG SRE may want similar pay and benefits (and possibly employer brand cachet) as before. <\/p>\n\n\n\n Most smaller companies would not be able to come close to the perks and pay that were being offered at the \u201ctop end of town\u201d in tech.<\/p>\n\n\n\n What is a hiring manager in a less well-known or less \u201cgenerous\u201d company to do?<\/p>\n\n\n\n In the middle of every difficulty lies opportunity.<\/p>\n\n\n A common pattern for filling SRE talent gaps in organizations, especially non-tech companies, is to turn existing sysadmins into SREs.<\/p>\n\n\n\n The challenge is that SRE work may involve digging through code or at least understanding how it works<\/strong>. Many admins, though not all, lack this experience or interest.<\/p>\n\n\n\n Having mastery of at least one programming language is advantageous.<\/p>\n\n\n\n An SRE should be able to configure open-source tools in their codebase and also build custom tools<\/strong>. 
This gives a natural advantage to people who work with code all the time: developers.<\/p>\n\n\n\n Shashank Katlaparthi, a Site Reliability Engineer at Red Hat, shared his thoughts regarding the coding prowess of SREs:<\/p>\n\n\n\n In my current role, we SREs not only code our product\/application, we set up a CI\/CD process and automate the scaling\/provisioning\/error-handling of the Infrastructure to test our product code. To summarize, we not only code for the application, but we code the Infrastructure itself. And no, we are not using tools as-is; we are building new ones.<\/p>\n<\/blockquote>\n\n\n\n Adjacent roles also find developer backgrounds useful.<\/p>\n\n\n\n Performance Engineering is a role and skillset that overlaps heavily with SRE as a practice. Performance engineering managers also find it useful to work with people who have software development expertise:<\/p>\n\n\n\n In my current role, I interview candidates applying for Performance Engineer (PE) or Senior Performance Engineer roles and find most don’t have any or enough software background to be effective on my team. In the end, the team I currently lead is mostly software developers, who happen to work on performance-related things. \u2014 Anonymous PE Manager<\/p>\n<\/blockquote>\n\n\n\n Hiring a developer with several years of experience should be an advantage.<\/p>\n\n\n\n Okay, but\u2026<\/p>\n\n\n Here are a few reasons why developers may consider the move:<\/p>\n\n\n\n One other aspect is that there will always be demand for SREs.<\/p>\n\n\n\n Companies will be compelled to hire SREs as they scale up or increase the complexity of their infrastructure and software architecture.<\/p>\n\n\n\n If they aim to scale up linearly by adding more operations engineers, they will need an extremely large number of such people. 
<\/p>\n\n\n\n SRE acts like a fulcrum: one SRE can automate work that would normally call for several non-software-driven operations people<\/strong>.<\/p>\n\n\n\n The prospects for SREs seem good, right?<\/p>\n\n\n\n There is an elephant in the room that I want to address.<\/p>\n\n\n\n SRE has had a rough start as a practice in many organizations with more traditional backgrounds and without robust investment in operations and infrastructure.<\/p>\n\n\n\n If you browse the r\/SRE subreddit or SRE Slack chats, you may notice that many SREs are asking how to switch over to developer work.<\/p>\n\n\n\n This might have you doubting the possibility of turning developers into SREs.<\/p>\n\n\n\n I initially thought, \u201cThey must prefer coding or working on product\u201d, but I was wrong.<\/p>\n\n\n\n Many of these posts by SREs had a common thread: they felt the grass would be greener on the other side, i.e., on feature teams. The people posting also had something in common. They had SRE job titles, but 75-90% of their time was consumed by firefighting work.<\/p>\n\n\n\n Many of these folks were solely focused on handling tickets and infrastructure provisioning.<\/p>\n\n\n\n These tasks are something a junior SRE might do to understand the system and issues, but they should not define an SRE\u2019s ongoing career.<\/p>\n\n\n\n Site Reliability Engineering work is so much more than that.<\/p>\n\n\n\n In plain terms, you\u2019ll need to be proactively working on problems rather than firefighting<\/strong>, have a broader view of systems, and more, which is kinda exciting.<\/p>\n\n\n\n\n Developers have an advantage as they have coding ability that\u2019s useful to SRE work, but there\u2019s a lot more to how the role functions. 
<\/p>\n\n\n\n Developers moving into SRE roles will need to:<\/p>\n\n\n\n Here are some more reasons why it\u2019s difficult for a developer to make the mindset shift into SRE:<\/p>\n\n\n Site Reliability Engineers often find themselves doing the following:<\/p>\n\n\n\n A Site Reliability Engineer can expect to work with a combination of the following:<\/p>\n\n\n\n 40\u201350% are [SRE] candidates who were very close to the Google Software Engineering qualifications (i.e., 85\u201399% of the skill set required), and who, in addition,<\/strong><\/em> had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek. \u2014 Stephen Thorne<\/a>, Staff SRE @ Google<\/p>\n<\/blockquote>\n\n\n\n As stated earlier, an SRE ideally needs development and systems-oriented skills \u2013 a pi-shaped skill set<\/a>, so to speak.<\/p>\n\n\n\n For this type of skill set, an SRE has to be proficient in both trades. Two areas of deep expertise, combined with a broader understanding of other areas, form the pi shape.<\/p>\n\n\n\n Deep expertise in just one or the other is the hallmark of a T-shaped skill set.<\/p>\n\n\n Site Reliability Engineering is still a relatively rare role in the broader software community.<\/p>\n\n\n\n However, there\u2019s little denying that the Site Reliability Engineering approach is the future of software operations.<\/p>\n\n\n\n Here are some things that make SREs a unique breed in software work:<\/p>\n\n\n Ask any developer what they\u2019re working on, and you\u2019ll see a tiny sliver of the whole codebase. 
That makes sense when the work is coding up a feature or update.<\/p>\n\n\n\n Systems work, on the other hand, needs a holistic view of significant complexity in order to make sure the whole unit works harmoniously.<\/p>\n\n\n Because they have a scope spanning the entirety of a software system, SREs can end up working on various types of problems.<\/p>\n\n\n\n They may solve challenging problems which could take days, weeks, or months to resolve.<\/p>\n\n\n\n The old adage of \u201chow long is a piece of string?\u201d can apply to SREs estimating a fix for issues.<\/p>\n\n\n\n Some problems may be well-defined<\/strong>, like spinning up infrastructure based on known demand.<\/p>\n\n\n\n Other problems may be more abstract<\/strong>, like working out how to cost-effectively autoscale a service that has inconsistent usage patterns and needs high performance.<\/p>\n\n\n Most developers work within agile frameworks like Scrum or XP. SREs may also use these frameworks when planning software build work.<\/p>\n\n\n\n That essentially timeboxes their efforts, which is fine but\u2026<\/p>\n\n\n\n Timeboxing might work for estimable problems but does not always work for production-level work.<\/p>\n\n\n\n Can an SRE stop working on a problem because it does not fit into the mold of a sprint? That could spell disaster for production software. Daniel Wilhite answers the question of \u201cCan scrum be used effectively by SRE teams?\u201d<\/a> very well.<\/p>\n\n\n You\u2019d expect SREs to get used to developers throwing the code over the wall, but no. Many are ex-developers, so they will spend much of their time coding up solutions for infrastructure and software performance.<\/p>\n\n\n\n Sometimes, they may participate in feature teams for job rotation. This helps them get a better understanding of their developer counterparts\u2019 priorities.<\/p>\n\n\n SREs come in many shapes and sizes. 
In smaller companies, a single SRE may be the one-stop shop for all site reliability matters. As a company grows, SRE roles may get divided into specialized work.<\/p>\n\n\n\n For example, one SRE may focus on supporting platforms like Kubernetes.<\/p>\n\n\n\n Another SRE may spend most of their time supporting developers in adopting DevSecOps.<\/p>\n\n\n\n Yet another may have general SRE responsibilities like being an incident commander.<\/p>\n\n\n The two roles are like chalk and cheese, so it\u2019s worth considering key differences in how SREs work compared to software developers.<\/p>\n\n\n\n Chances are they will need to collaborate closely to make sure the software works well in production.<\/p>\n\n\n\n I took inspiration from a Google recruiter\u2019s interview with an SRE, Ciara Kamahele (link here<\/a>).<\/p>\n\n\n\n The key differences I uncovered are in table form below:<\/p>\n\n\n\nWhy are SREs so hard to hire despite the tech downturn?<\/h2>\n\n\n
Hiring developers overcomes an SRE hiring issue<\/h2>\n\n\n
\n
\n
Why would developers switch to SRE?<\/h2>\n\n\n
\n
Developers may still find it hard to switch to SRE<\/h2>\n\n\n
\n
An SRE\u2019s scope of work is W-I-D-E<\/h3>\n\n\n
\n
An SRE\u2019s toolbox is H-U-G-E<\/h3>\n\n\n
\n
\n
Difference between working styles of SREs and developers<\/h2>\n\n\n
SREs look at the broader picture<\/strong><\/h3>\n\n\n
SREs thrive in ambiguity<\/strong><\/h3>\n\n\n
SREs work beyond constraints like Scrum<\/strong><\/h3>\n\n\n
SREs don\u2019t stay in their lane<\/strong><\/h3>\n\n\n
SREs don\u2019t have a monolith job description<\/strong><\/h3>\n\n\n
Comparison with software developers<\/strong><\/h3>\n\n\n