Who should pay attention to this article
❌ SRE at a very small startup with few users rarely makes a difference until you reach a sizable user base or start feeling growing pains
❌ Organizations without a strong financial or legal incentive (e.g. SLAs tied to their operations) cannot justify diving into a complex field like SRE
✅ Scaleup startups should 100% read this as you’re now facing growing pains in your software operations
✅ Smaller organizations that build in-house software and/or run large cloud workloads should also pay attention
✅ Engineering groups of fewer than 100 engineers
Should smaller organizations even bother with SRE?
Most of the original thinking behind SRE focuses on implementing it in large-scale systems.
I believe that any organization that has software at the foundation of its core business should at the very least pay attention to SRE principles.
You can always pare hyperscale ideas down to your level of need, which we will explore later in this article.
My own experience
Call me biased, but I know from first-hand experience that SRE can be useful for smaller organizations.
In 2014 and 2015, I worked with developers and operators in a small software company that started experiencing growing pains as it took on enterprise clients.
The company struggled to service the new volume of customers with its existing infrastructure architecture and practices.
This bogged them down with slow performance, excessive downtime, and random outages in key regions. They breached their SLAs several times.
Does this sound familiar?
That’s precisely what SRE, as a practice, aims to solve.
SRE solves several problems for smaller organizations
I have not known a single engineering leader who thought one day, “SRE sounds cool. Let’s do it!”
There was always an underlying reason for taking up SRE practices.
An issue that they were trying to fix.
Some of the problems I have witnessed include:
- Not having enough operations staff to handle manual processes
- Regular downtime with negative customer feedback
- Performance meltdowns across software services
- Developers deploying outside the confines of arcane policies
- Inability to handle volatility in traffic patterns
SRE is a force multiplier: automation, passive guardrails, and similar practices deliver more sustainable operations than simply hiring more headcount for manual work.
3 antipatterns to starting SRE in smaller organizations
1. Hiring before you know what you need from SRE
It’s a great feeling to be excited about starting your SRE journey.
Enough to jump right in.
But that’s a risky proposition if you don’t know what problems you need to solve with SRE.
How would this manifest?
Imagine hiring SREs, only to have them look at your problems and say, “I don’t know how to solve this.”
This can happen even with experienced hires: SRE is a huge space, and experience in one corner of it doesn’t guarantee experience with the specific problem you need solved.
Hiring experienced SREs may also not be right for you.
Some hires may come with preconceptions of how things should be done.
Be very clear about what you need and what resources you have available during the hiring process.
Plan your required capabilities ahead of time and save the headache of backtracking to solve problems.
2. Copying Google’s SRE model
At last count, Google had over 3000 Site Reliability Engineers.
If we go by a very conservative average salary (by Google’s standards) of $200k per engineer, that amounts to an SRE payroll of at least $600 million a year.
You can call me out on this, but I’m guessing that you are not starting out on SRE with this kind of budget.
That’s why it’s not a good idea to start off by copying Google’s model as written out in its 2016 Site Reliability Engineering book.
Larger organizations can get away with copying Google’s SRE model because what counts as a large budget for your smaller organization may be a rounding error on their balance sheet.
Tailoring the SRE function to your organization’s specific needs reduces the risk of failure.
A failed initiative is far more visible in a smaller organization because the money and engineering time involved represent a much larger share of what you have.
3. Starting off with too many SRE practices at once
Chances are that, because you are just starting out on SRE, you have limited resources. The worst thing you can do at this point is try to cover too many areas of this new function at once.
SRE has a myriad of capability areas that you could cover and still call it “SRE”.
Some organizations start out with observability (good choice).
Others start right out the door with incident response (okay-ish choice).
While some poor souls start out their SRE journey with observability, incident response, capacity planning, release management, DevSecOps, and more.
See the problem here?
The poor souls have a focus problem.
They are trying to do too many things all at once, and it will blunt their ability to drive substantial value early on in their SRE journey.
Those two ingredients of substantial and early value creation are critical to success.
So…
How can you start SRE on limited resources?
Starting off with a whole team dedicated to Site Reliability Engineering (SRE) might not always be doable.
You might not even be able to budget for a single SRE.
You don’t need to hire an SRE to get started
But you can find someone who’s already working in operations to take on some of the foundational SRE work.
One way is to have them start by setting up observability instrumentation across a few services.
What would it mean to instrument observability?
It means gathering metrics, logs, and traces to get a better idea of how that part of the system is working.
Once they’ve got that down for one service, they could keep on doing it for other services or show other teams how it’s done.
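To make this concrete, here is a minimal sketch of what “instrumenting observability” for one service might look like, assuming OpenTelemetry’s Python SDK with console exporters as stand-ins for whatever backend you actually use; the service, span, and metric names are placeholders.

```python
# Minimal sketch: instrumenting a single service with OpenTelemetry's Python SDK.
# Names like "checkout" and "requests_total" are placeholders, not a prescription.
import time

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    PeriodicExportingMetricReader,
    ConsoleMetricExporter,
)

# Traces: print spans to stdout for now; swap in an OTLP exporter later.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("checkout")

# Metrics: a request counter and a latency histogram.
metrics.set_meter_provider(
    MeterProvider(metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())])
)
meter = metrics.get_meter("checkout")
request_counter = meter.create_counter("requests_total")
latency_ms = meter.create_histogram("request_latency_ms")

def handle_request(order_id: str) -> None:
    # One span per request, with the request counted and timed.
    start = time.monotonic()
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)
        # ... real business logic goes here ...
    request_counter.add(1, {"service": "checkout"})
    latency_ms.record((time.monotonic() - start) * 1000, {"service": "checkout"})
```

Starting with console exporters keeps the first iteration cheap; pointing the exporters at a real backend later doesn’t change the instrumentation calls themselves.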
Convince a developer to become an SRE
Another option is to have a software engineer who’s really good at coding fully take on SRE responsibilities. It’s a promotion 😉
This person would need to know everything about the various systems and services that the business uses, and be able to fix issues like a boss.
With some help and guidance, a skilled software engineer could become the foundational SRE at your organization.
They could play a pivotal role as a key player on the SRE team and make sure that everything keeps running like butter.
This is also a good selling point when you need to convince them to transition to SRE.
For more on pulling off this kind of transition, read the guide on turning a developer into an SRE.
Low-volume production environments suffer from a low-data problem
I’ve already mentioned that a lot of SRE work à la Google was designed for handling production systems at hyperscale.
SREs at larger organizations have the luxury (and sometimes curse) of having hordes of production data coming in to analyze and make proactive adjustments.
This is much harder to do in smaller environments with low traffic or low throughput because of a simple problem: not enough data.
This scarcity manifests as:
- Limited type and quantity of data
- Harder to determine statistical significance
- Less sophisticated instrumentation
- Experimentation is much harder
1. Limited type and quantity of data
In larger organizations, with high-traffic and high-throughput systems, SREs have access to a wealth of production data.
This includes metrics such as response times, error rates, resource utilization, and various performance indicators.
This abundance of data allows them to perform in-depth analysis, identify patterns, and make data-driven decisions for optimizing the system.
They can track trends over time, compare different periods, and establish baselines to measure improvements.
In contrast, smaller environments with low traffic or low throughput generate significantly less data.
The limited data availability restricts the scope and depth of analysis that can be performed.
SREs may face challenges in obtaining meaningful insights and identifying patterns due to the scarcity of data points.
This makes it difficult to establish reliable baselines or perform accurate trend analysis, leading to a higher degree of uncertainty when making proactive adjustments.
2. Harder to determine statistical significance
In data analysis, statistical significance plays a crucial role in drawing reliable conclusions. It ensures that observed patterns or changes are not merely random fluctuations but have a meaningful impact.
With larger datasets, SREs can apply statistical techniques to validate their findings and ensure confidence in the results.
They can perform hypothesis testing, confidence interval calculations, and statistical modeling to make accurate predictions and decisions.
However, in smaller environments with limited data, achieving statistical significance becomes challenging.
The smaller sample sizes may not provide enough data points to establish statistically significant relationships or patterns.
This makes it difficult to differentiate between real system performance issues and random variations.
SREs may need to rely on alternative approaches, such as simulation or extrapolation from other data sources, to compensate for the lack of statistical significance.
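A small worked example illustrates the gap. The sketch below computes a Wilson score confidence interval for an observed error rate; the request counts are made up, but the arithmetic shows how wide the uncertainty stays when traffic is low.

```python
# Minimal sketch: why small sample sizes make conclusions shaky.
# Wilson score interval for an observed error rate; the numbers are illustrative.
import math

def wilson_interval(errors: int, requests: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for the true error rate."""
    p = errors / requests
    denom = 1 + z**2 / requests
    center = (p + z**2 / (2 * requests)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / requests + z**2 / (4 * requests**2))
    return center - half, center + half

# Same 2% observed error rate, very different certainty:
print(wilson_interval(1, 50))        # ~ (0.004, 0.105): anywhere from 0.4% to 10%
print(wilson_interval(1000, 50000))  # ~ (0.019, 0.021): tightly pinned near 2%
```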
3. Less sophisticated instrumentation
Larger-scale systems typically have comprehensive monitoring and observability solutions in place, providing SREs with real-time insights into system behavior.
They can access detailed metrics, logs, traces, and other telemetry data, which enables them to diagnose issues quickly and take proactive actions.
These systems often incorporate sophisticated anomaly detection algorithms that can identify deviations from expected behavior, triggering alerts and enabling prompt responses.
In smaller environments, where resources may be limited, the level of operational visibility may be reduced. Monitoring tools and observability solutions might be less comprehensive or absent altogether.
As a result, SREs may face challenges in obtaining timely and detailed information about the system’s performance.
The lack of real-time insights can hinder their ability to make proactive adjustments and respond promptly to potential issues.
4. Experimentation is much harder
One of the benefits of having a large volume of production data is the ability to experiment and iterate rapidly.
SREs at larger organizations can test changes, optimizations, or new features on subsets of the data, measure their impact, and iterate based on the results.
This iterative approach allows them to fine-tune systems continuously, leading to better performance and reliability over time.
In smaller environments with low traffic or low throughput, the limited data available restricts the scope for experimentation and iteration.
SREs may need to rely on external benchmarks, industry best practices, or theoretical knowledge to make adjustments, as they may not have enough data to evaluate the impact of changes.
This increases the risk of making suboptimal adjustments or missing opportunities for improvement.
It’s not all doom and gloom for smaller organizations because…
Alerting is one area you can have early SRE wins
One area where the low-data disadvantage for smaller organizations is obvious is the alerting process.
I’ve learned a few tips from several SREs to help alleviate this issue:
Simplify your alerts
Eliminate alerts triggered by individual events. Instead, focus on identifying and responding to broader patterns or trends. By doing so, you can reduce the noise and avoid overwhelming your team with unnecessary notifications.
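As a tool-agnostic sketch of what that looks like, the snippet below evaluates a rolling error-rate window instead of paging on each individual failure; the window length and threshold are purely illustrative.

```python
# Minimal sketch, not tied to any specific alerting tool: page on a sustained
# error rate over a rolling window rather than on every individual failure.
from collections import deque
import time

WINDOW_SECONDS = 300          # look at the last 5 minutes
ERROR_RATE_THRESHOLD = 0.05   # page only if more than 5% of requests failed

events = deque()  # (timestamp, was_error) pairs

def record(was_error: bool) -> None:
    now = time.time()
    events.append((now, was_error))
    # Drop events that have aged out of the window.
    while events and events[0][0] < now - WINDOW_SECONDS:
        events.popleft()

def should_page() -> bool:
    if not events:
        return False
    error_rate = sum(1 for _, was_error in events if was_error) / len(events)
    return error_rate > ERROR_RATE_THRESHOLD
```

Most alerting tools can express the same idea natively as a rate-over-time-window rule, so you rarely need to hand-roll this logic in production.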
Adopt percentage-based Service Level Objectives (SLOs)
If your system has low throughput, consider extending the evaluation time range for your SLOs.
This adjustment allows for a more accurate assessment of system performance over a reasonable timeframe, providing a better understanding of potential issues or deviations from expected behavior.
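Here is a rough illustration of why the longer window helps, using made-up numbers (about 100 requests per day against a 99.5% availability SLO):

```python
# Minimal sketch of why a longer SLO window helps at low throughput.
# Assumed numbers: ~100 requests/day against a 99.5% availability SLO.
REQUESTS_PER_DAY = 100
SLO_TARGET = 0.995

for window_days in (1, 7, 28):
    total = REQUESTS_PER_DAY * window_days
    error_budget = total * (1 - SLO_TARGET)  # failed requests you can absorb
    print(f"{window_days:>2}-day window: {total} requests, "
          f"error budget = {error_budget:.1f} failed requests")

# 1-day window:  100 requests, budget = 0.5  -> a single failure breaches the SLO
# 7-day window:  700 requests, budget = 3.5  -> isolated blips no longer page you
# 28-day window: 2800 requests, budget = 14.0
```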
Manage alarm notifications effectively
If you’re concerned about receiving alarms during off-hours but unable to extend the evaluation window, customize your alarm’s notification policy to limit alerts to business hours only.
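In most paging tools this is a configuration setting rather than code, but as a rough sketch of the logic (assuming a “critical” severity that should always page, and placeholder hours):

```python
# Minimal, tool-agnostic sketch: suppress non-urgent pages outside business hours.
# Severity names and hours are placeholders; real notification policies usually
# live in your alerting/paging tool, not in application code.
from datetime import datetime

BUSINESS_HOURS = range(9, 18)  # 09:00-17:59 local time

def should_notify(severity: str, now: datetime | None = None) -> bool:
    now = now or datetime.now()
    if severity == "critical":
        return True  # always page for genuine emergencies
    return now.weekday() < 5 and now.hour in BUSINESS_HOURS
```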
Inject synthetic traffic
Injecting synthetic traffic can help overcome the low data problem.
Generating synthetic traffic involves simulating artificial requests or interactions with your system to mimic real user behavior.
By incorporating synthetic traffic, you have more data to gauge the health & performance of your system and respond with proactive measures.
This approach helps identify potential issues before they affect real users and trigger alarms.
Synthetic traffic can be scheduled to run at regular intervals or triggered by specific events, providing continuous monitoring and alerting capabilities.
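A minimal homegrown probe can be as simple as the sketch below, which uses only the Python standard library; the endpoint, interval, and latency budget are placeholders, and a hosted synthetic-monitoring service would do the same job with less upkeep.

```python
# Minimal sketch of a synthetic probe using only the standard library.
# The URL, interval, and thresholds are placeholders.
import time
import urllib.request

PROBE_URL = "https://example.com/healthz"   # hypothetical endpoint
INTERVAL_SECONDS = 60
LATENCY_BUDGET_MS = 500

def probe_once() -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=5) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    latency_ms = (time.monotonic() - start) * 1000
    healthy = ok and latency_ms <= LATENCY_BUDGET_MS
    # In practice, emit these as metrics so your alerting rules see a steady
    # stream of data even when real users are quiet.
    print(f"synthetic probe: ok={ok} latency={latency_ms:.0f}ms healthy={healthy}")

if __name__ == "__main__":
    while True:
        probe_once()
        time.sleep(INTERVAL_SECONDS)
```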
Group your services for less alert noise
The chaos from having a myriad of services only compounds in the incident response process.
Grouping related services might reduce the headache.
It involves categorizing related components or microservices into logical groups based on their dependencies and functionality.
By doing so, you can establish more meaningful and manageable alerting rules.
Instead of setting up alerts for individual services, you can define alerts at the group level. This approach reduces alert noise and helps prioritize and troubleshoot issues more effectively.
Grouping services also enables you to create aggregated metrics and observability dashboards, allowing for a holistic view of system health.
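As a rough sketch of the idea, with hypothetical group and service names, a group-level error rate can be computed like this and alerted on instead of per-service rates:

```python
# Minimal sketch: evaluate alerts per logical group rather than per service.
# Group names, service names, and thresholds are illustrative.
SERVICE_GROUPS = {
    "payments": ["billing-api", "invoice-worker", "payment-gateway"],
    "storefront": ["catalog", "search", "cart"],
}

def group_error_rate(errors_by_service: dict[str, int],
                     requests_by_service: dict[str, int],
                     group: str) -> float:
    services = SERVICE_GROUPS[group]
    errors = sum(errors_by_service.get(s, 0) for s in services)
    requests = sum(requests_by_service.get(s, 0) for s in services)
    return errors / requests if requests else 0.0

# One alert rule per group instead of one per service:
# page when group_error_rate(..., group="payments") exceeds your threshold.
```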
By incorporating synthetic traffic and grouping services, smaller organizations can streamline their alerting systems, improve incident response times, and enhance overall system reliability.
How to compose your SRE team and function
Start with this absolutely essential capability first
So you’ve had the chance to build a greenfield SRE team because:
- you got a few early wins, like improving incident response, or
- you got the mandate from an executive who likes the SRE concept
How are you going to pull it off? Let’s run through a few considerations.
The first step is to focus on observability.
You must achieve a degree of proficiency in this before you can make any tangible gains from automation, capacity planning, or anything else.
In simple terms, observability lets you see how well the infrastructure and services on top of it are performing and find any issues.
Once you have this down, you can then use data to make decisions about where and how to develop automation, for example.
You may still have to deal with the low-data problem I mentioned earlier, but you will at least have some direction versus no data at all.
Your next challenge after handling observability
This next challenge is something all SREs deal with: incident response. In reality, it should be called incident management.
SREs should not be the only people looking into incidents on a 24/7 basis.
They should be the ones taking charge of building an effective system around incident response. Hence, incident management.
I’ve outlined below some tactics you can deploy to develop your organization’s incident management capability.
Core tactics
- Have well-written runbooks
- Chart out an on-call rotation roster
- Use tools for effective pager setup
Premier League tactics
- Minimize alert fatigue (as I mentioned earlier)
- Utilize ChatOps to increase response efficiency
- Develop incident commander roles
- Evangelize “you build it, you run it” (I’ll get to this after covering release management)
Fix release management and many worries will go away
Many of the SRE managers I talk with are still struggling with release management as a capability.
This might come as a shock to those working in cloud-native companies or ones that give priority to software engineering.
But here’s the truth: it’s still a big challenge for many smaller organizations.
The first thing I recommend is documenting everything in the software system:
- What are the services?
- What tools are being used to deploy them?
- What processes do engineers have in place to deploy each service?
This documentation can serve as a foundation for improving and optimizing the deployment processes, making them more efficient and reliable.
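That documentation doesn’t have to be elaborate. Here is a minimal sketch of a machine-readable catalog entry, with hypothetical field names, values, and URLs, that answers those three questions for each service:

```python
# Minimal sketch of a machine-readable service catalog entry; every field name
# and value here is hypothetical. Even a simple checked-in file like this beats
# tribal knowledge about how each service gets deployed.
from dataclasses import dataclass, field

@dataclass
class ServiceRecord:
    name: str
    owner_team: str
    deploy_tool: str            # e.g. "GitHub Actions", "Jenkins", "manual script"
    deploy_process: str         # link to the runbook or a one-line summary
    dependencies: list[str] = field(default_factory=list)

CATALOG = [
    ServiceRecord(
        name="billing-api",
        owner_team="payments",
        deploy_tool="GitHub Actions",
        deploy_process="https://wiki.example.internal/billing-api-deploys",  # placeholder
        dependencies=["postgres", "payment-gateway"],
    ),
]
```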
Push for “you build it, you run it”
Remember that you have limited resources and your SREs will need help with incident response and other aspects of work. Developers are best placed to help with this through the model of “you build it, you run it”, but this is easier said than done.
I’ve seen a lot of places where developers resist this. To them, it’s additional work that doesn’t add to their bottom line.
It becomes your job then to instill a sense of interest in them. How?
Let’s go through three ways to communicate the “what’s in it for me” to a developer asked to build and run a service:
Differentiation in the job market
Developers who embrace the “you build it, you run it” philosophy stand out from others in the job market.
Employers appreciate engineers who can take responsibility, understand the operational aspects, and deliver strong, dependable software.
This can lead to better job opportunities, higher-paying positions, and more bargaining power during salary negotiations.
Entrepreneurial opportunities
Taking responsibility for running their own code is in line with an entrepreneurial mindset.
Developers who can deliver and manage their software from start to finish are better able to identify business opportunities and create their own products or services.
This path can (still) offer significant financial rewards and the satisfaction of being in control of their own destiny.
Reduce future pain of fixing technical debt
Developers can gain a better understanding of how their code performs in production by running it themselves.
This way, they can identify potential issues and address them early on, which helps prevent the accumulation of technical debt.
By having direct visibility into the code’s behavior, they can make improvements and optimize its performance before it becomes a bigger issue for them to fix.
Philosophy of SRE teaming at small organizations
Allan Shone works as a leader of Infrastructure and Platform at a startup in Australia. Recently, he discussed important aspects of SRE at a LinuxCon event.
He said, “We need to focus on the right things at the right time to get the best benefits and accomplish what we need to accomplish.”
Allan then went on to describe 5 SRE principles for smaller organizations:
- Everything should be reproducible. It is important to have all servers, load balancers, and anything in between and outside of them configured, version controlled, reviewed, shared, and easily reproducible.
- Not everything needs to be perfect. Aim to work consistently at problems and evaluate the benefits of changing before moving on. You need to understand that some level of manual labor is still necessary.
- Security should not be ignored. Addressing security can be as straightforward as adopting the right coding standards within a software service, for example a shared library that filters input. As long as anyone writing code that handles input uses that library, the input will be filtered.
- Simple is better than magical. Whether it’s the system’s name, description, or another aspect, it should be clear about what it does. This helps to avoid making assumptions. We’re less likely to encounter problems when people don’t have to assume that something means a certain thing.
- Visibility is key. Access is not. To make informed decisions, it’s important to have a clear understanding of all the data coming out of a system. Not everyone needs access to production systems to get that visibility, yet handing out such access is something that still happens in many smaller organizations.
Putting it all together
You should now have a more solid idea of how to implement Site Reliability Engineering (SRE) in a smaller organization with limited resources.
I recommend starting with observability and incident management as your first capability areas.
I also recommend that you avoid common mistakes like trying to start off your SRE journey by covering every facet at once or copying Google’s SRE model.
Please also consider the tips for improving alerting processes when dealing with limited data.
Finally, keep Allan Shone’s five SRE principles like reproducibility, simplicity, and visibility in mind. They are perfect for smaller organizations starting out their SRE journey.