{"id":5809,"date":"2023-08-01T13:30:07","date_gmt":"2023-08-01T03:30:07","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=5809"},"modified":"2023-12-13T15:26:50","modified_gmt":"2023-12-13T05:26:50","slug":"starting-sre-at-startups-and-smaller-organizations","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/starting-sre-at-startups-and-smaller-organizations\/","title":{"rendered":"Starting SRE at startups and smaller organizations"},"content":{"rendered":"
❌ SRE at a very small startup with few users rarely makes a difference until you've reached a fair userbase size or are feeling growing pains.

❌ Many organizations without a strong financial or legal incentive (e.g., SLAs tied to their operations) cannot justify diving into a complex field like SRE.

✅ Scaleup startups should 100% read this, as you're now facing growing pains in your software operations.

✅ Smaller organizations that build in-house software and/or run large cloud workloads should also pay attention.

✅ Engineering groups of fewer than 100 engineers.
Most of the original thinking behind SRE focuses on implementing it in large-scale systems.

I believe that any organization that has software at the foundation of its core business should at the very least pay attention to SRE principles.

You can always pare hyperscale ideas down to your level of need, which we will explore later in this article.

Call me biased, but I know from first-hand experience that SRE can be useful for smaller organizations.

In 2014 and 2015, I worked with developers and operators in a small software company that started experiencing growing pains as it took on enterprise clients.

The company struggled to service the new volume of customers with its existing infrastructure architecture and practices.

This bogged them down with slow performance, excessive downtime, and random outages in key regions. They failed SLAs several times.

Does this sound familiar?

That's precisely what SRE, as a practice, aims to solve.
I have not known a single engineering leader who thought one day, "SRE sounds cool. Let's do it!"
There was always an underlying reason for taking up SRE practices: an issue they were trying to fix.

The problems I have witnessed are the same kinds of growing pains described above: slow performance, excessive downtime, random outages, and missed SLAs.

SRE is a force multiplier. It uses automation, passive guardrails, and similar practices to deliver more sustainable operations than hiring more headcount for manual work.
## 3 antipatterns to starting SRE in smaller organizations
### 1. Hiring before you know what you need from SRE

It's a great feeling to be excited about starting your SRE journey. Enough to jump right in.

But that's a risky proposition if you don't know what problems you need SRE to solve.

How would this manifest? Imagine hiring SREs who look at your problems and say, "I don't know how to solve this."

This can happen even if you hire experienced SREs. SRE is a huge space, and even experienced SREs may not be "experienced" in the problem you need solved.

Hiring experienced SREs may also not be right for you. Some hires come with preconceptions about how things should be done.

Be very clear about what you need and what resources you have available during the hiring process. Plan your required capabilities ahead of time and save yourself the headache of backtracking later.
### 2. Copying Google's SRE model

At last count, Google had over 3,000 Site Reliability Engineers.

If we go by a very conservative average salary (by Google's standards) of $200k per engineer, that amounts to an SRE payroll of over $600 million a year.

You can call me out on this, but I'm guessing you are not starting out on SRE with that kind of budget.

That's why it's not a good idea to start off by copying Google's model as written out in its 2016 *Site Reliability Engineering* book.

Larger organizations can get away with copying Google's SRE model because what looks like an enormous budget to your smaller organization may be a rounding error on their balance sheet.

Tailoring the SRE function to your organization's specific needs reduces the risk of failure. A failure is also far more noticeable in a smaller organization because of the money and engineering time involved.
### 3. Starting off with too many SRE practices at once

Chances are that, because you are just starting out on SRE, you have limited resources.

The worst thing you can do at this point is to try to cover too many areas of this new function.

SRE has a myriad of capability areas that you can cover and call it "SRE".

Some organizations start out with observability (good choice). Others start right out of the gate with incident response (okay-ish choice).

And some poor souls start their SRE journey with observability, incident response, capacity planning, release management, DevSecOps, and more, all at once.

See the problem here? The poor souls have a focus problem. They are trying to do too many things at once, and it blunts their ability to drive substantial value early in their SRE journey.

Those two ingredients, substantial and early value creation, are critical to success.

So…
## How can you start SRE on limited resources?
### You don't need to hire an SRE to start it

Starting off with a whole team dedicated to Site Reliability Engineering (SRE) might not always be doable. You might not even be able to budget for a single SRE.

But you *can* find someone who's already working in operations to take on some of the foundational SRE work.

One way is to have them start by setting up observability instrumentation across a few services.

What does it mean to instrument for observability? It means gathering metrics, logs, and traces to get a better idea of how that part of the system is working.

Once they've got that down for one service, they can repeat it for other services or show other teams how it's done.
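To make "instrumenting for observability" concrete, here is a minimal, framework-free sketch of the idea: a decorator that records a request count, a latency measurement, and a structured log line for one endpoint. It is illustrative only; in practice you would emit this telemetry through something like OpenTelemetry or a Prometheus client library rather than logging it yourself, and the `handle_checkout` handler and metric names are hypothetical.

```python
import functools
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout-service")


def instrumented(endpoint):
    """Wrap a handler so every call emits a metric sample and a structured log line."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            trace_id = uuid.uuid4().hex  # stand-in for a real distributed trace ID
            start = time.monotonic()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                duration_ms = (time.monotonic() - start) * 1000
                # In a real setup these values would feed a counter and a
                # latency histogram in your metrics backend, not just a log line.
                log.info(
                    "metric endpoint=%s status=%s duration_ms=%.1f trace_id=%s",
                    endpoint, status, duration_ms, trace_id,
                )
        return wrapper
    return decorator


@instrumented("POST /checkout")
def handle_checkout(order_id):
    # Hypothetical business logic standing in for the real handler.
    time.sleep(0.05)
    return {"order_id": order_id, "state": "confirmed"}


if __name__ == "__main__":
    handle_checkout("order-123")
```

Once one service emits counts, latencies, and a correlation ID like this, repeating the pattern on the next service is mostly copy and adapt.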
### Convince a developer to become an SRE

Another option is to have a software engineer who's really good at coding fully take on SRE responsibilities. It's a promotion 😉

This person would need to know the various systems and services the business runs inside and out, and be able to fix issues like a boss.

With some help and guidance, a skilled software engineer can become *the* foundational SRE at your organization.

They can play a pivotal role on the SRE team and make sure everything keeps running like butter. This is also a good selling point when you need to convince them to make the transition.

For more on pulling off this kind of transition, read the guide on turning a developer into an SRE.
## Low-volume production environments suffer from a low-data problem

I've already mentioned that a lot of SRE work *à la Google* was designed for handling production systems at hyperscale.

SREs at larger organizations have the luxury (and sometimes the curse) of hordes of production data coming in to analyze and make proactive adjustments from.

This is much harder to do in smaller environments with low traffic or low throughput, for a simple reason: not enough data.

The lack of data manifests in the four ways below.
### 1. Limited type and quantity of data

In larger organizations, with high-traffic and high-throughput systems, SREs have access to a wealth of production data: metrics such as response times, error rates, resource utilization, and other performance indicators.

This abundance of data allows them to perform in-depth analysis, identify patterns, and make data-driven decisions for optimizing the system. They can track trends over time, compare different periods, and establish baselines to measure improvements.

In contrast, smaller environments with low traffic or low throughput generate significantly less data.

The limited data availability restricts the scope and depth of analysis that can be performed. SREs may struggle to extract meaningful insights or identify patterns because of the scarcity of data points.

This makes it difficult to establish reliable baselines or perform accurate trend analysis, leading to a higher degree of uncertainty when making proactive adjustments.
### 2. Harder to determine statistical significance

In data analysis, statistical significance plays a crucial role in drawing reliable conclusions. It ensures that observed patterns or changes are not merely random fluctuations but have a meaningful impact.

With larger datasets, SREs can apply statistical techniques such as hypothesis testing, confidence intervals, and statistical modeling to validate their findings and make accurate predictions and decisions.

In smaller environments with limited data, achieving statistical significance becomes challenging. Small sample sizes may not provide enough data points to establish statistically significant relationships or patterns.

This makes it difficult to tell real performance issues apart from random variation. SREs may need to rely on alternative approaches, such as simulation or extrapolation from other data sources, to compensate.
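A rough way to see why small samples hurt: the confidence interval around an observed error rate shrinks with the square root of the sample size. This short sketch uses the normal approximation for a proportion; the traffic numbers are made up for illustration.

```python
import math


def error_rate_confidence_interval(errors, requests, z=1.96):
    """Approximate 95% confidence interval for an error rate (normal approximation)."""
    p = errors / requests
    half_width = z * math.sqrt(p * (1 - p) / requests)
    return max(p - half_width, 0.0), min(p + half_width, 1.0)


# A low-traffic service: 2 errors out of 200 requests in a day.
print(error_rate_confidence_interval(2, 200))          # roughly (0.000, 0.024)

# A high-traffic service with the same 1% error rate: 2,000 out of 200,000.
print(error_rate_confidence_interval(2_000, 200_000))  # roughly (0.0096, 0.0104)
```

At the same observed 1% error rate, the small service cannot tell whether its true error rate is 0.1% or 2%, while the large one pins it down tightly.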
### 3. Less sophisticated instrumentation

Larger-scale systems typically have comprehensive monitoring and observability solutions in place, giving SREs real-time insight into system behavior.

They can access detailed metrics, logs, traces, and other telemetry, which lets them diagnose issues quickly and take proactive action. These systems often incorporate sophisticated anomaly detection that flags deviations from expected behavior, triggering alerts and enabling prompt responses.

In smaller environments, where resources may be limited, operational visibility is often reduced. Monitoring tools and observability solutions might be less comprehensive or absent altogether.

As a result, SREs may struggle to get timely, detailed information about the system's performance, and the lack of real-time insight hinders their ability to make proactive adjustments and respond promptly to potential issues.
### 4. Experimentation is much harder

One of the benefits of having a large volume of production data is the ability to experiment and iterate rapidly.

SREs at larger organizations can test changes, optimizations, or new features on subsets of traffic, measure their impact, and iterate on the results. This continuous fine-tuning leads to better performance and reliability over time.

In smaller environments with low traffic or low throughput, the limited data restricts the scope for experimentation and iteration.

SREs may need to rely on external benchmarks, industry best practices, or theoretical knowledge to make adjustments, because there may not be enough data to evaluate the impact of changes. This increases the risk of making suboptimal adjustments or missing opportunities for improvement.

It's not all doom and gloom for smaller organizations, because…
## Alerting is one area where you can have early SRE wins

One area where the low-data disadvantage of smaller organizations is obvious is the alerting process.

I've learned a few tips from several SREs that help alleviate this issue:
### Simplify your alerts

Eliminate alerts triggered by individual events. Instead, focus on identifying and responding to broader patterns or trends. By doing so, you reduce the noise and avoid overwhelming your team with unnecessary notifications.
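One way to read "alert on patterns, not events" in code: page only when failures accumulate within a window, instead of on every failed request. A small sketch, with an assumed threshold and window size.

```python
import time
from collections import deque


class WindowedAlert:
    """Fire only when at least `threshold` failures occur within `window_seconds`."""

    def __init__(self, threshold=5, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.failures = deque()

    def record_failure(self, now=None):
        now = now if now is not None else time.time()
        self.failures.append(now)
        # Drop failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        return len(self.failures) >= self.threshold  # True means "page someone"


alert = WindowedAlert(threshold=5, window_seconds=300)
# A single failed request does not page; a burst of five within 5 minutes does.
```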
### Adopt percentage-based Service Level Objectives (SLOs)

If your system has low throughput, consider extending the evaluation time range for your SLOs.

This adjustment allows for a more accurate assessment of system performance over a reasonable timeframe, providing a better understanding of potential issues or deviations from expected behavior.
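To see why a longer evaluation window helps a low-throughput service, compare the same single failure against a 1-hour and a 30-day window. The request rate and the 99.5% target below are assumptions for illustration.

```python
def slo_compliance(good_events, total_events):
    """Fraction of good events; compare against the SLO target."""
    return good_events / total_events if total_events else 1.0


target = 0.995  # assumed 99.5% availability SLO

# A service doing ~10 requests/hour. One failure in the last hour:
hour = slo_compliance(good_events=9, total_events=10)
print(hour, hour >= target)              # 0.9 False -> a 1-hour window screams "breach"

# The same single failure over a 30-day window (~7,200 requests):
month = slo_compliance(good_events=7_199, total_events=7_200)
print(round(month, 5), month >= target)  # 0.99986 True -> well within budget
```

One bad request simply carries too much weight in a short window when traffic is low; the longer window measures what users actually experience.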
### Manage alarm notifications effectively

If you're concerned about receiving alarms during off-hours but are unable to extend the evaluation window, customize your alarm's notification policy to limit alerts to business hours only.
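A minimal sketch of that policy: the alert still fires and is recorded, but paging is suppressed outside working hours. The hours, days, and timezone handling here are placeholders; a real on-call tool would own this logic.

```python
from datetime import datetime

BUSINESS_HOURS = range(9, 18)  # 09:00-17:59 local time, an assumed policy
BUSINESS_DAYS = range(0, 5)    # Monday-Friday


def should_page(now=None):
    """Return True if a non-urgent alert is allowed to page right now."""
    now = now or datetime.now()
    return now.weekday() in BUSINESS_DAYS and now.hour in BUSINESS_HOURS


def notify(alert_name):
    if should_page():
        print(f"PAGE on-call: {alert_name}")
    else:
        print(f"Queued for business hours: {alert_name}")


notify("checkout-latency-degraded")
```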
### Inject synthetic traffic

Injecting synthetic traffic can help overcome the low-data problem.

Generating synthetic traffic means simulating artificial requests or interactions with your system to mimic real user behavior.

With synthetic traffic in place, you have more data with which to gauge the health and performance of your system and respond with proactive measures. This approach helps identify potential issues before they affect real users and trigger alarms.

Synthetic traffic can be scheduled to run at regular intervals or triggered by specific events, providing continuous monitoring and alerting coverage.
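A simple form of synthetic traffic is a scheduled probe that exercises a key endpoint and records the outcome like any other request, giving your alerting something to evaluate even when real users are idle. The URL, interval, and success criteria below are assumptions.

```python
import time
import urllib.error
import urllib.request

PROBE_URL = "https://example.internal/healthz"  # hypothetical endpoint
INTERVAL_SECONDS = 60


def probe_once(url=PROBE_URL, timeout=5):
    """Issue one synthetic request and return (success, latency_ms)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        ok = False
    return ok, (time.monotonic() - start) * 1000


if __name__ == "__main__":
    while True:  # in practice, run this from a scheduler or a monitoring agent
        ok, latency_ms = probe_once()
        print(f"synthetic_probe ok={ok} latency_ms={latency_ms:.0f}")
        time.sleep(INTERVAL_SECONDS)
```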
### Group your services for less alert noise

The chaos of having a myriad of services only compounds in the incident response process. Grouping related services can reduce the headache.

It involves categorizing related components or microservices into logical groups based on their dependencies and functionality. By doing so, you can establish more meaningful and manageable alerting rules.

Instead of setting up alerts for individual services, you define alerts at the group level. This approach reduces alert noise and helps you prioritize and troubleshoot issues more effectively.

Grouping services also lets you create aggregated metrics and observability dashboards, giving a holistic view of system health.

By incorporating synthetic traffic and grouping services, small SRE organizations can streamline their alerting, improve incident response times, and enhance overall system reliability.
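Grouping can start as something as simple as a mapping from services to logical groups, so alerts are labeled, routed, and aggregated per group rather than per service. The group names and services in this sketch are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping of services to logical groups.
SERVICE_GROUPS = {
    "checkout-api": "payments",
    "payment-worker": "payments",
    "catalog-api": "storefront",
    "search-indexer": "storefront",
}


def group_alerts(alerts):
    """Collapse per-service alerts into one summary per group."""
    grouped = defaultdict(list)
    for alert in alerts:
        group = SERVICE_GROUPS.get(alert["service"], "ungrouped")
        grouped[group].append(alert["summary"])
    return dict(grouped)


alerts = [
    {"service": "checkout-api", "summary": "error rate above SLO"},
    {"service": "payment-worker", "summary": "queue backlog growing"},
]
print(group_alerts(alerts))
# {'payments': ['error rate above SLO', 'queue backlog growing']}
```

The same mapping can drive dashboards and on-call routing, so one noisy group pages one team instead of five separate alerts paging whoever happens to be listed.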
## How to compose your SRE team and function

So you've had the chance to build a greenfield SRE team. How are you going to pull it off? Let's run through a few considerations.
### Start with this absolutely essential capability first

The first step is to focus on observability.

You *must* achieve a degree of proficiency here before you can make any tangible gains from automation, capacity planning, or anything else.

In simple terms, observability lets you see how well the infrastructure, and the services on top of it, are performing, and helps you find issues.

Once you have this down, you can use data to make decisions about where and how to develop automation, for example.

You may still have to deal with the low-data problem mentioned earlier, but you will at least have some direction versus no data at all.
### Your next challenge after handling observability

The next challenge is something that all SREs deal with: incident response. In reality, it should be called incident management.

SREs should not be the only people looking into incidents on a 24/7 basis. They should be the ones taking charge of developing an effective system around incident response. Hence **incident management**.

I've outlined below some tactics you can deploy to develop your organization's incident management capability.

**Core tactics**

**Premier League tactics**
### Fix release management and many worries will go away

Many of the SRE managers I talk with are still struggling with release management as a capability.

This might come as a shock to those working in cloud-native companies or ones that give priority to software engineering. But here's the truth: it's still a big challenge for many smaller organizations.

The first thing I recommend is documenting everything in the software system.

This documentation can serve as a foundation for improving and optimizing the deployment processes, making them more efficient and reliable.
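One lightweight way to start "documenting everything" is a machine-readable record per service that captures how it is built, deployed, and rolled back. The fields below are an illustration of what you might capture, not a prescribed schema, and the runbook reference is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """A minimal, hypothetical service-catalog entry for release management."""
    name: str
    owner_team: str
    repo: str
    deploy_method: str       # e.g. "CI pipeline -> Kubernetes"
    rollback_procedure: str  # the command or runbook link to undo a release
    dependencies: list = field(default_factory=list)


checkout = ServiceRecord(
    name="checkout-api",
    owner_team="payments",
    repo="git@example.com:shop/checkout-api.git",
    deploy_method="CI pipeline, blue/green deploy",
    rollback_procedure="redeploy previous image tag; see rollback runbook",
    dependencies=["payment-worker", "catalog-api"],
)
```

Even a handful of entries like this makes it obvious where deployments are manual or undocumented, which is usually where release pain starts.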
### Push for "you build it, you run it"

Remember that you have limited resources, and your SREs will need help with incident response and other aspects of the work. Developers are best placed to help through the model of "you build it, you run it", but this is easier said than done.

I've seen a lot of resistance from developers. To them, it's additional work that doesn't add to their bottom line.

It then becomes your job to instill a sense of interest in them. How?

Let's go through three ways to communicate the **what's in it for me** for a developer to build *and* run a service.

**Differentiation in the job market**

Developers who embrace the "you build it, you run it" philosophy stand out in the job market. Employers appreciate engineers who can take responsibility, understand the operational aspects, and deliver strong, dependable software.

This can lead to better job opportunities, higher-paying positions, and more bargaining power during salary negotiations.

**Entrepreneurial opportunities**

Taking responsibility for running their own code aligns with an entrepreneurial mindset. Developers who can deliver and manage their software from start to finish are better able to identify business opportunities and create their own products or services.

This path can (still) offer significant financial rewards and the satisfaction of being in control of their own destiny.

**Reduce future pain of fixing technical debt**

Developers gain a better understanding of how their code performs in production by running it themselves. They can identify potential issues and address them early on, which helps prevent the accumulation of technical debt.

With direct visibility into the code's behavior, they can make improvements and optimize its performance before it becomes a bigger issue to fix.
### Philosophy of SRE teaming at small organizations

Allan Shone leads Infrastructure and Platform at a startup in Australia. He recently discussed important aspects of SRE at a LinuxCon event.

He said, "We need to focus on the right things at the right time to get the best benefits and accomplish what we need to accomplish."