{"id":801,"date":"2022-10-26T03:59:46","date_gmt":"2022-10-25T17:59:46","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=801"},"modified":"2024-01-01T12:25:56","modified_gmt":"2024-01-01T02:25:56","slug":"reduce-software-outage-risk-with-passive-guardrails","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/reduce-software-outage-risk-with-passive-guardrails\/","title":{"rendered":"Reduce software outage risk with passive guardrails"},"content":{"rendered":"\n
> **Shocking fact:** only 10–25% of software outages are caused by hardware or network failure. The rest are the result of human error, such as misconfiguration. — paraphrasing Martin Kleppmann, *Designing Data-Intensive Applications*
In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity of incidents and outages.

We will cover:

- why passive guardrails are important, and
- how they can be implemented
## Why passive guardrails are important
Passive guardrails save Site Reliability Engineers (SREs) from becoming the secret police that management uses to shame developers for their mistakes. Let me explain.

In extreme cases, an incident or outage may end with management openly chastising developers, either as a group or individually.

Senior managers might delegate this task to the people who own reliability. Guess who that is? Yes, you, the humble Site Reliability Engineer.
**Reality check:** we can push blameless culture as much as we want within engineering circles, but we have to accept that a large part of management doesn't buy into our cultural fancies.
A way to prevent this peril is to employ passive guardrails that keep developers within safe confines.

From a developer's perspective, a guardrail would simply seem to be a means of staying on a well-trodden "golden path", as described by Spotify's platform engineers.

By definition, guardrails are controls that prevent deviations from required behaviors.

Let's delineate active versus passive guardrails.
- **Active guardrails** are the rules and policies that govern day-to-day behaviors and must be consciously considered when deploying code or altering the system. Circumventing them is treated as a *punishable* act.
- **Passive guardrails**, on the other hand, drive behavior more subtly. They create boundaries that, over time, shape a largely unconscious workflow. Bypassing them is treated as *a mishap or accident*.
**Here's a real-life, non-software example of a passive guardrail:**

Think of when you drive along the highway. You are not thinking about the median strip and lines that set a passive boundary between your vehicle and the traffic going in the opposite direction. But they're there to keep your mind focused in the right direction.

Remember, developers are focused on launching features as quickly and efficiently as possible. Very few would (or at least should) be distraught at not having root access to production servers or being intentionally guided around tools and platforms.
Some benefits of passive guardrails for you and your SRE team will be:

- less time, mental bandwidth, and energy spent on enforcing policies and procedures
- less animosity from developers
- reduced manual toil due to more automated processes and built-in mechanisms (passive guardrails inherently rely on automation)
Developers benefit as well, as they should. They can:

- move faster with launching services to production without your active involvement
- cut their risk of accidentally bypassing best practices and protocols because these will already be baked into the workflow
## Techniques for implementing passive guardrails
We will cover in detail seven techniques that support the passive software guardrail concept:

1. Doubtless software system design
2. Clone production to full-featured sandbox
3. Pre-production checklist for developers to follow
4. 2-person authentication for deploys
5. Stagger rollout of code changes
6. Have an early warning system for failure
7. Service snapshots for rapid rollback
These techniques are an amalgamation of ideas I've noted across several books, including *Seeking SRE* (Blank-Edelman, 2018) and *Designing Data-Intensive Applications* (Kleppmann, 2017), as well as SREcon talks like *Confessions of a Systems Engineer* by David Argent.

Let's begin.
### Doubtless software system design
In my opinion, a well-designed system is the first step to setting passive guardrails for developers. It should remove any ambiguity or doubt from developers' minds when they get around to launching their service.

This means having **discretely developed and well-documented service boundaries, APIs, and admin interfaces**.

When ambiguity is removed, the path to launch becomes crystal clear, and the potential for developer improvisation (and, subsequently, error) approaches zero.
Here's an example from Netflix of system design acting as a passive guardrail:

> "Some changes we incorporate into tooling might be called a guardrail. A concrete example is an additional step in deployment added to Spinnaker that detects when someone attempts to remove all cluster capacity while still taking significant amounts of traffic. This guardrail helps call attention to a possibly dangerous action, which the engineer was unaware was risky. This is a method of 'just in time' context." — in *Seeking SRE: Conversations About Running Production Systems at Scale* by David N. Blank-Edelman
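To make this concrete, here is a minimal sketch of what such a "just in time" guardrail check could look like. The `ClusterStats` shape, the function name, and the traffic threshold are all hypothetical illustrations rather than Spinnaker's actual API; a real implementation would query your deployment platform's metrics.

```python
# Hypothetical sketch of a "just in time" deployment guardrail.
# ClusterStats and the traffic threshold are illustrative, not a real API.
from dataclasses import dataclass


@dataclass
class ClusterStats:
    name: str
    requests_per_second: float
    instances_remaining_after_change: int


TRAFFIC_THRESHOLD_RPS = 10.0  # assumed cutoff for "significant" traffic


def check_capacity_removal(stats: ClusterStats) -> None:
    """Block a change that removes all capacity from a busy cluster."""
    if (stats.instances_remaining_after_change == 0
            and stats.requests_per_second > TRAFFIC_THRESHOLD_RPS):
        raise RuntimeError(
            f"Guardrail: {stats.name} still serves "
            f"{stats.requests_per_second:.0f} req/s; removing all capacity "
            "looks dangerous. Confirm explicitly to proceed."
        )


# Example: this change is stopped, with just-in-time context for the engineer.
try:
    check_capacity_removal(ClusterStats("checkout-api", 850.0, 0))
except RuntimeError as err:
    print(err)
```

The point is not the specific check but where it lives: inside the deployment tool itself, where the developer cannot forget to run it.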
Achieving this kind of result may involve creating tool features or prompts that guide developers through key steps. All of this is a balancing act, because you risk making these components too restrictive or minimalistic.

Give these components enough power that developers stay happy with them. This means consistently taking honest feedback from all stakeholders and improving the system design to meet their changing needs.
### Clone production to full-featured sandbox
One of the most common complaints I have heard from operations engineers about developers is that "they code on 'monster' local machines with 32GB of RAM and then wonder why VMs in production with much smaller RAM allocations keep struggling".

Developers almost always work on features in isolation from production. There's a good rationale for this: to prevent experimental work from negatively impacting real-world services and users.

And so, developers code away at their services with unknowns at play. Some developers may have:
- some idea of what the data will look and play like, but not the complete picture
- an indication of resource demand, but nowhere near the true numbers
- little insight into how hard users are pushing the software in other areas of the system
The end result is that developers risk writing code for an idealized world that doesn't truly reflect your software's user base or data.

You can't blame a developer for any of the above. They are working in a sandbox, after all.

A solution to this unfortunate problem is to continually give developers a realistic sandbox that reflects the service as it stands in production.

By doing this, you give developers a safe environment in which to manipulate and test code, one that:
1. reflects the "real world" system, AND
2. continues to safeguard the production system from experimental work
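Below is a sketch of what "cloning production" might mean in practice. The spec fields, the scaling factor, and the anonymized-replica naming are assumptions for illustration; in reality you would pull the spec from your orchestrator and scrub data through your own pipeline.

```python
# Illustrative sketch: derive a sandbox environment from a production
# service spec. The spec fields and sanitization step are assumptions.
import copy

production_spec = {
    "service": "payments-api",
    "namespace": "prod",
    "replicas": 12,
    "memory_mb": 4096,
    "datastore": "payments-prod-db",
}


def clone_to_sandbox(spec: dict, scale: float = 0.25) -> dict:
    """Copy a production spec into a sandbox, scaled down but proportional."""
    sandbox = copy.deepcopy(spec)
    sandbox["namespace"] = "sandbox"
    # Keep resource ratios realistic rather than an arbitrary dev default.
    sandbox["replicas"] = max(1, int(spec["replicas"] * scale))
    sandbox["memory_mb"] = max(512, int(spec["memory_mb"] * scale))
    # Point at a scrubbed copy of production data, never the live datastore.
    sandbox["datastore"] = f"{spec['datastore']}-anonymized-replica"
    return sandbox


print(clone_to_sandbox(production_spec))
```

Automating this refresh on a schedule keeps the sandbox from drifting away from production over time.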
### Pre-production checklist for developers to follow
Software teams launch services to production more frequently than ever before, and they make many ongoing tweaks to these services. It is critical to help them launch changes effectively.

Google has a dedicated team of launch coordination engineers (LCEs) for this effort, but we are not Google. We'll forgo the extra title and cost, as most organizations that aren't Google-scale cannot justify it.

In many organizations, SREs review, consult, collaborate, and even contribute, but the final responsibility for production delivery remains with the product engineering team that owns a given service.

With tight resources in place, SREs can create pre-launch assessments that make sense for developers and prevent future mishaps in production.

The pre-launch assessment goes by a myriad of names depending on the organization, such as *Production Readiness Checklist* or *Operational Readiness Review*.
Google's book, *Site Reliability Engineering* (2016), has a section on ensuring reliable product launches. Below are examples of the kinds of questions you'll find in their pre-launch checklist:

- Are you storing persistent data? If yes, make sure you back up the data (here are instructions).
- Could a user abuse your service? If yes, implement rate limiting and query quotas (here's the link to a service to help you do this).
The book's authors assert that "in practice, there is a near-infinite number of questions to ask about any system, and it is easy for the checklist to grow to an unmanageable size." So they follow a few logical rules to keep it on the right path:

- the importance of each question must be substantiated by experience, such as a past launch disaster
- instructions given to developers must be concrete, practical, and reasonable
- stay on top of changes in the system and reflect them in the question/instruction set
- run regular reviews (once or twice a year) of the pre-launch checklist to ensure the above
You may also develop your own SRE checklist covering the key infrastructure components that the service will rely on (a sketch of such a checklist in code follows the list below). That checklist may cover issues concerning some or all of the following:
- Security – authentication, secrets management, TLS, vulnerability scanning
- Observability – availability metrics, tracing, monitoring, alerting
- Storage and backups – statefulness, backup availability, and practices
- Networking – VPCs, subnets, IPs, service discovery, mesh, and more
- Performance – benchmarking, load testing, tuning components, etc.
- Capacity – horizontal scaling, vertical scaling, availability zoning
- Cost management – reserved vs. spot resources, closing underused resources
- Testing – automated testing after commits, scheduled testing, test coverage
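To make a checklist like this act as a passive guardrail rather than a document nobody reads, you can encode it as an automated gate in CI. The item names and check functions below are hypothetical placeholders; each would be wired to whatever your CI, monitoring, or IaC tooling actually exposes.

```python
# Minimal sketch of a machine-checkable pre-launch checklist.
# Item names and check logic are hypothetical placeholders.

CHECKLIST = [
    ("Persistent data is backed up", lambda svc: svc.get("backups", False)),
    ("Rate limiting is configured", lambda svc: svc.get("rate_limit_rps") is not None),
    ("Alerting rules exist", lambda svc: bool(svc.get("alert_rules"))),
    ("TLS is enforced", lambda svc: svc.get("tls", False)),
]


def run_prelaunch_review(service: dict) -> bool:
    """Run every checklist item and report failures instead of silently passing."""
    failures = [name for name, check in CHECKLIST if not check(service)]
    for name in failures:
        print(f"FAIL: {name}")
    return not failures


# Example descriptor; a real one would come from your service registry.
new_service = {"backups": True, "rate_limit_rps": 100, "alert_rules": [], "tls": True}
print("Ready to launch:", run_prelaunch_review(new_service))
```

Failing the CI job on an unmet item turns the checklist from advice into a guardrail developers cannot accidentally skip.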
You will also need to factor in how dependencies evolve. New ones will emerge, and existing ones will change or be deprecated. These changes will have upstream and downstream impacts on the components that rely on them.
### Implement 2-person authentication
Not to be confused with 2-factor authentication (2FA), which relies on a single user confirming their intent to perform an action through a secondary device. You may have seen 2FA when logging into sensitive systems like your banking service.
Why not have the same level of corroboration when, for example:

- your system is expecting a major commit, or
- a critical service is due for an update?
But instead of the lone developer making these necessary commits by authenticating with their mobile device, have 2 people — both engineers — sign off on the work.

What's the rationale behind this?

1. It puts a second pair of eyes on the code and the pre-launch checklist
2. It may instill in developers an unconscious need not to let their peers down

The humble engineer may think twice about their code quality before it reaches production. After all, "I don't want sloppy code to get in the way of the good rapport I have with my colleague."
Check out this infrastructure-as-code (IaC) example for 2-person authentication.
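As a complement, here is a minimal sketch of what a two-person gate could look like at the deploy step. The function, its arguments, and the change ID format are hypothetical stand-ins for your real CI/CD integration; note that the author deliberately cannot count as one of the two approvers.

```python
# Hypothetical two-person deploy gate; approvals would come from your
# real code-review or CI/CD system, not hard-coded sets.

def deploy_with_two_person_rule(change_id: str, author: str, approvals: set) -> None:
    """Refuse to deploy unless two engineers other than the author signed off."""
    independent = approvals - {author}  # the author cannot approve themselves
    if len(independent) < 2:
        raise PermissionError(
            f"Change {change_id} needs 2 independent approvals, "
            f"has {len(independent)}."
        )
    print(f"Deploying {change_id}, approved by {sorted(independent)}")


# Blocked: "alice" authored the change, so only one independent approval.
try:
    deploy_with_two_person_rule("CHG-1042", "alice", {"alice", "bob"})
except PermissionError as err:
    print(err)

# Allowed: two approvers who are not the author.
deploy_with_two_person_rule("CHG-1042", "alice", {"bob", "carol"})
```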
### Stagger rollout of code changes
Pushing code to production carries inherent risk. Pushing code across the entire customer base, region, or organization at once amplifies that risk. How can you mitigate it?

I recommend a segment-based staggered rollout to minimize the blast radius of erroneous code changes in production. Examples of segments include:
- the least fussy customer base first, then the general user base, and finally the more discerning customers
- a single VM, then a group of VMs, then a region, then across regions
This process should be as automated as possible, without forcing the developer to think about it. To implement this guardrail effectively, you will need to work out:

- how to segment your users and VMs for an appropriate blast radius
- how quickly to roll out the changes across these segmented groups
Set a reasonable time interval between segments to allow for the detection of anomalies and failures in production.
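A sketch of the rollout loop is below. The segment names, bake time, and health check are illustrative assumptions; in practice the health signal would come from your monitoring stack, and bake times would run minutes to hours.

```python
# Illustrative staggered rollout: widen the blast radius only after each
# segment has baked without anomalies. All names and timings are assumed.
import time

SEGMENTS = ["canary-vm", "single-region", "all-regions"]
BAKE_TIME_SECONDS = 5  # demo value; real bake times run minutes to hours


def healthy(segment: str) -> bool:
    """Placeholder: query error rates and latency from monitoring here."""
    return True


def staggered_rollout(version: str) -> None:
    for segment in SEGMENTS:
        print(f"Rolling {version} out to {segment}")
        time.sleep(BAKE_TIME_SECONDS)  # let telemetry accumulate
        if not healthy(segment):
            print(f"Anomaly in {segment}; halting rollout and rolling back")
            return
    print(f"{version} fully rolled out")


staggered_rollout("v43")
```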
Besides segmenting by audience or machine, you may also consider approaches such as canary releases, feature flags, and blue-green deployments to reduce blast radius. I will explore them in more depth in a future write-up.
### Have an early warning system for failure
In a way, software systems are not that different from weather systems. You want to know as early as possible when problems are beginning to surface so that you can prevent a problem from turning into a full-blown disaster.

Or at least minimize the damage from the incoming disaster.

Observability is the name of this game, and it should serve as the bread and butter of any software engineering team. You may consider setting up a full telemetry suite that tracks performance as well as service uptime. This would include:
- *logging* relevant system events
- *tracing* across services for issues
- *monitoring* resource usage and performance
As an early warning system, observability can tell developers and SREs where tweaks and more resources are required — before a deluge of users and usage takes the system down.
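As a toy illustration of the "early" part, the sketch below watches a short rolling window of error-rate samples and pages before a sustained failure matures. The threshold, window size, and alert sink are all assumptions.

```python
# Toy early-warning check: alert when a short rolling window of error-rate
# samples trends above a threshold. Threshold and sink are assumptions.
from collections import deque

ERROR_RATE_THRESHOLD = 0.02  # assumed: page above 2% average errors
window = deque(maxlen=5)     # last 5 one-minute samples


def alert(message: str) -> None:
    print(f"PAGE ON-CALL: {message}")  # stand-in for PagerDuty, Slack, etc.


def record_sample(error_rate: float) -> None:
    window.append(error_rate)
    average = sum(window) / len(window)
    if average > ERROR_RATE_THRESHOLD:
        alert(f"Error rate averaging {average:.1%} over last {len(window)} min")


# A rising failure signal: the alert fires well before errors hit 10%.
for sample in [0.001, 0.004, 0.02, 0.06, 0.09]:
    record_sample(sample)
```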
I will cover observability in depth in a dedicated write-up in the future, as the breadth and complexity of the capability warrant it.
### Service snapshots for rapid rollback
Let's say you've done all of the above, and yet something still goes wrong. What will you do?

A viable solution is to take time-interval snapshots of services in production, as well as of platform configurations (like .tf and YAML files). See an error crop up after the latest update? Roll back to a point that doesn't break your software in production.

The time interval will depend on your ability and appetite to automate snapshots, as well as on how frequently you deploy and change the platform.

You will essentially have multiple viable versions of the same code that can be switched between as operational needs and challenges dictate.
Netflix SREs employ this practice in their Spinnaker continuous delivery platform, which ensures new code changes are automatically deployed in a blue-green fashion.

> *I think we've all worked at companies where we upgrade something, and it turns out it was bad, and we spend the night firefighting it because we're trying to get the site back up because the patch didn't work. [Instead of that] when we go to put new code into production… we just push a new version alongside the current code base. It's that red/black or blue/green – whatever people call it at their company… if something breaks in the canary… we immediately roll back. This brings recovery down from hours to minutes.* — Coburn Watson, Director of Reliability, Performance and Cloud Infrastructure at Netflix, in *Seeking SRE: Conversations About Running Production Systems at Scale* (2018)
In effect, the older code is still active on one VM while the newly deployed code runs on another. If the new code deployment fails, the system reverts to the most recent reliably running code.
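The mechanic boils down to a pointer swap. Here is a minimal sketch, where the `router` dict stands in for a real load balancer or traffic-shifting API:

```python
# Sketch of the blue/green flip described above. The router dict is a
# stand-in for a real load balancer or traffic-management API.

router = {"live": "v42", "standby": "v41"}  # v41 = last known-good version


def deploy_new_version(version: str) -> None:
    """Push new code alongside the current one, then shift traffic to it."""
    router["standby"], router["live"] = router["live"], version
    print(f"Traffic on {router['live']}; {router['standby']} kept warm")


def rollback() -> None:
    """Flip traffic back to the previous version."""
    router["live"], router["standby"] = router["standby"], router["live"]
    print(f"Rolled back to {router['live']}")


deploy_new_version("v43")
rollback()  # the canary failed: recovery is a pointer swap, not a rebuild
```

Because the previous version stays warm, rollback costs a traffic shift rather than a redeploy, which is what compresses recovery from hours to minutes.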
## Bibliography
1. Kleppmann, M. (2017). *Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems*. O'Reilly Media.
2. Blank-Edelman, D. N. (2018). *Seeking SRE: Conversations About Running Production Systems at Scale*. O'Reilly Media.
3. *SREcon Conversations with David Argent, Amazon* (August 2020). [online] Available at: https://www.youtube.com/watch?v=F1HLaTUJy_s [Accessed 20-22 Oct. 2022].
4. *5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code*. [online] Available at: https://www.youtube.com/watch?v=RTEgE2lcyk4 [Accessed 18-20 Oct. 2022].