{"id":559,"date":"2022-04-29T19:14:56","date_gmt":"2022-04-29T09:14:56","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=559"},"modified":"2023-12-13T15:28:02","modified_gmt":"2023-12-13T05:28:02","slug":"agile-software-teams-need-site-reliability-engineers-to-support-ongoing-success","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/agile-software-teams-need-site-reliability-engineers-to-support-ongoing-success\/","title":{"rendered":"Why Agile software teams need SRE support"},"content":{"rendered":"\n
 Agile software delivery is de rigueur<\/em> in modern software. However, as complexity increases, there’s a high risk that frequent, high-velocity releases will break software-in-production.<\/p>\n\n\n\n Software-in-production is software that is accessible to users. <\/p>\n\n\n\n That’s where Site Reliability Engineers (SREs) can come in to support the Agile software team’s efforts.<\/p>\n\n\n Site Reliability Engineers (SREs) are a specialist form of software operations engineers<\/strong> who are committed to ensuring your software remains reliable after being deployed across the release train. <\/p>\n\n\n\n They may help build and improve the platform that your software is launched into. However, it’s rare for SREs to own the platform work entirely. They are typically highly experienced engineers.<\/p>\n\n\n\n The unique value of Site Reliability Engineers is in:<\/p>\n\n\n\n In other words, Agile software work needs SRE support<\/strong>. But SREs also need Agile.<\/p>\n\n\n\n It’s not an equal relationship, however. Let’s modify the above statement for accuracy:<\/p>\n\n\n\n Over the years I have worked in several startups. These ventures depended on Agile methodology to release software with high frequency. <\/p>\n\n\n\n I am well-versed in Agile, being a certified Scrum Master. The culture at these places made Agile work well. This was a saving grace in fast-moving environments. <\/p>\n\n\n\n My last role (before starting SREpath) was as an operations director at a healthcare company. It was certainly more traditional and structured than working at a software startup.<\/p>\n\n\n\n I owned software vendor relations as part of my portfolio. All of the software vendors we had relationships with have switched over to cloud and<\/em> Agile delivery since the pandemic began in 2020. <\/p>\n\n\n\n These vendors<\/strong> launched more features in the last two years than in the previous eight<\/strong>. Those same two years correlate with the most unstable period for these systems. <\/p>\n\n\n\n Our end users constantly complained of not being able to access critical systems during business hours. 
The vendors’ ability to deliver features faster increased, but their ability to do so as a reliable service is open to question. <\/p>\n\n\n\n Part of this fragility comes from their lack of insight into how to increase the software’s reliability in production. <\/p>\n\n\n\n Site Reliability Engineering excels at increasing the reliability of software. <\/p>\n\n\n\n If I were to sum up SRE work, it would be to reduce software fragility and increase resilience to black swan events while supporting Agile developer success<\/strong> at the same time. <\/p>\n\n\n The concept of Site Reliability Engineers originated at Google way back in 2003. <\/p>\n\n\n\n An executive at Google, Ben Treynor Sloss, determined that the only way to handle Google’s mega-scale user requests was to create a new operations engineering discipline. <\/p>\n\n\n\n His north star early on was to reduce the likelihood of HTTP 500 errors, i.e. responses indicating that the server encountered an unexpected condition that prevented it from fulfilling the request.<\/p>\n\n\n\n It seems fair to say that Ben’s vision succeeded. How often does Google’s service stop working for you? Rarely compared to other services, right? SREs are the secret sauce behind this.<\/strong><\/p>\n\n\n The unfortunate problem with bringing SRE into the Agile mix is that it fits into the wrong software arena. <\/p>\n\n\n\n Reliability slots into the black box of non-functional requirements (NFRs). In a typical organization, no one wants to look at risk until something is going wrong<\/strong>, i.e. when it’s too late.<\/p>\n\n\n\n I’ve had many conversations about addressing reliability, uptime, and error handling go the way of, “Uh huh, we’ll look into it next quarter. But first, let’s release this exciting new QR code tool!”<\/p>\n\n\n\n The problem then compounds.<\/p>\n\n\n\n The more Agile work these vendors do, the more fragile their software becomes.<\/strong> At least, that\u2019s the case in production, where our many employees and I got to see their work. 
<\/p>\n\n\n\n The assertions that I’m making here are:<\/p>\n\n\n\n In a standard Agile timeframe, you’re altering your software every 4-6 weeks through continuous deployments. <\/p>\n\n\n\n These continuous deployments compound over time, so software put into production on Day 0 will morph into a very different beast by Day 30, 60, 90, 180, etc. <\/p>\n\n\n\n By Day 365, you may not be able to recognize the software you launched on Day 0<\/strong>.<\/p>\n\n\n\n The more services you add or modify over time, the more drastic the difference will be. This long but apt quote describes the problem we face:<\/p>\n\n\n\n There is a fallacy in computer programming circles that all applications are ultimately decomposable – that is to say, you can break down complex applications into many more simple ones. In point of fact, however, you often cannot get more complex behaviors to actually start working until you have the right combination of components working, and even then you will run into problems with synchronization of data availability, memory usage and deallocation and race conditions – problems that will only become apparent when you’ve built most of the plumbing. This is why “but will it scale?” entered the lexicon of programmers everywhere. Scale problems only show up once you’ve built the system out almost completely and attempt to make it work under more extreme conditions. The solutions often entail scrapping significant parts of what you’ve just built, much to the consternation of managers everywhere. \u2014 Kurt Cagle, Community Editor @ Data Science Central<\/p>\n<\/blockquote>\n\n\n\n Increased complexity in microservices architecture means that software is less resilient to even minor changes in conditions.<\/p>\n\n\n\n Extreme conditions are becoming the rule rather than the exception<\/strong> for commercial software, even the kind with a small-ish user base. 
<\/p>\n\n\n\n I will unpack this issue of extreme conditions in terms of performance:<\/p>\n\n\n\n Despite this tinderbox, many newly minted Agile houses still treat software-in-production as if users were still using software installed on their PCs or Macs<\/strong>. <\/p>\n\n\n Site Reliability Engineering practices put a proverbial security blanket on top of the deployment mess highlighted above, which can grow and grow and grow. SRE gives:<\/p>\n\n\n\n The reduced risk of excessive downtime can justify the initial cost of creating an SRE function. Downtime can cost serious money in almost every industry<\/strong> now that so many production and service areas are heavily software-dependent. <\/p>\n\n\n Business models now depend on cloud-based internet infrastructure, which increases production risk due to its complexity. Downtime means money lost, even if it’s for mere minutes.<\/strong> It can cost thousands, sometimes millions of dollars.<\/p>\n\n\n\n Users want great new features, but now they also need reliability. After all, what’s the point of a feature if you can’t even access it when you need it?<\/strong> <\/p>\n","protected":false},"excerpt":{"rendered":" Agile software delivery is de rigueur in modern software. However, as complexity increases, there’s a high risk that frequent, high-velocity releases will break software-in-production. Software-in-production is software that is accessible to users. That’s where Site Reliability Engineers (SREs) can come in to support the Agile software team’s efforts. Who are Site Reliability Engineers? 
Site Reliability Engineers (SREs) […]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[60,1],"tags":[],"_links":{"self":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/559"}],"collection":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/comments?post=559"}],"version-history":[{"count":24,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/559\/revisions"}],"predecessor-version":[{"id":5014,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/posts\/559\/revisions\/5014"}],"wp:attachment":[{"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/media?parent=559"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/categories?post=559"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sysmit.com\/cf22\/wp-json\/wp\/v2\/tags?post=559"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}Who are Site Reliability Engineers? <\/h2>\n\n\n
My relationship with Agile practices <\/h2>\n\n\n
moonlighted<\/s> worked in several startups. These ventures depended on Agile methodology to release software with high frequency. <\/p>\n\n\n\nWhere did SRE start?<\/h2>\n\n\n
Many software outfits don’t care about reliability<\/h2>\n\n\n
What is the point in building great features if users can’t load them?<\/em><\/h2><\/blockquote>\n\n\n\n
Site Reliability Engineering can rescue production software<\/h2>\n\n\n
Concluding words<\/h2>\n\n\n