4 “Golden Signals” in Site Reliability Engineering
Latency is the time it takes to serve a request, from the moment it is sent until the response has been fully delivered. It is typically measured in milliseconds (ms).
Throughput is the amount of work a system handles within a given period. It can be measured as data transferred in bits per second, or as requests served per second.
Error rate measures how often requests fail, whether because of bugs in code, network outages, or server-side failures such as HTTP 500 responses. It is expressed as a percentage of total requests.
Saturation is a measure of how heavily loaded your server resources are. It can include measures like CPU utilization and the share of memory and storage in use.
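To make the four definitions concrete, here is a minimal sketch that derives each signal from a handful of request records. The `Request` fields, the `golden_signals` function, and the sample numbers are all hypothetical, and treating only 5xx responses as errors is an assumption; real systems usually pull these values from a monitoring tool rather than computing them by hand.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code of the response
    bytes_sent: int     # size of the response in bytes


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarise one observation window (assumes cpu_utilization is 0.0 to 1.0)."""
    total = len(requests)
    # Assumption: only server-side (5xx) responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_ms_avg": mean(r.duration_ms for r in requests) if total else 0.0,
        "throughput_bits_per_s": sum(r.bytes_sent for r in requests) * 8 / window_seconds,
        "error_rate_pct": 100.0 * errors / total if total else 0.0,
        "saturation_pct": 100.0 * cpu_utilization,
    }


sample = [
    Request(120.0, 200, 1000),
    Request(80.0, 200, 500),
    Request(400.0, 500, 250),
    Request(200.0, 200, 250),
]
signals = golden_signals(sample, window_seconds=10, cpu_utilization=0.5)
print(signals)
```

An average is the simplest latency summary; many teams prefer percentiles (for example the 95th or 99th) because a few slow requests can hide behind a healthy mean.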
SLAs, SLOs and SLIs for measuring SRE success
Service Level Agreements (SLAs) are contractual obligations between the service provider and the service consumer or payer for a certain level of performance. If the provider breaks the SLA, the consumer may be entitled to compensation.
Service Level Objectives (SLOs) are the target levels of performance for engineers to aim for. They typically correlate with SLA requirements. For example, an SLO can be a goal for a certain level of availability for a service over a given period.
Service Level Indicators (SLIs) are measures of performance that allow engineers to understand whether they are meeting the SLOs for the system and, in turn, the business-level SLAs. For example, an SLI can be the uptime metric for a particular service.
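The relationship between the three terms can be sketched in a few lines: the SLI is what you measure, and the SLO and SLA are thresholds you compare it against. The target values and function names below are made up for illustration, and setting the internal SLO stricter than the contractual SLA is an assumption about how a team might choose to configure things.

```python
# Hypothetical targets: the SLO is set stricter than the SLA so that
# engineers get a warning before the contract is actually breached.
SLA_TARGET = 0.995  # contractual availability promised to customers
SLO_TARGET = 0.999  # internal availability goal


def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def evaluate(sli):
    """Compare the measured SLI against both thresholds."""
    return {
        "meets_slo": sli >= SLO_TARGET,
        "meets_sla": sli >= SLA_TARGET,
    }


sli = availability_sli(good_requests=99_850, total_requests=100_000)
status = evaluate(sli)
print(f"SLI = {sli:.4f}", status)
```

In this sample window the service misses its internal SLO but still meets the SLA, which is the kind of early warning the gap between the two is meant to provide.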
Software incident response lingo
On-call means that the engineer must be available to respond to incidents, including outside their normal working hours. This may mean evenings or weekends.
Pager Duty is a term used to refer to being on-call. It harks back to when operations engineers were required to carry a physical pager and respond if alerted by the device.
Follow the sun refers to an incident response model in which on-call responsibility is handed over between teams in different time zones, so that each team responds to incidents during its own daytime working hours.
Mean time to acknowledge (MTTA) is the average time between an incident being identified and paged to the engineer and the engineer acknowledging it.
Mean time to recovery (MTTR) is the average time for the engineer to resolve the incident from the moment the alerting system picks up the incident.
Mean time to failure (MTTF) is the average time a system or service is expected to function before it experiences a failure, such as a performance-degrading bug or outage.
Mean time between failures (MTBF) is the average time elapsed between consecutive incidents across a series of incidents.
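The four averages above can be computed from incident timestamps. The following is a rough sketch, assuming each incident records when it was detected, acknowledged, and resolved; the `Incident` class and the sample dates are hypothetical. MTBF is measured here from the start of one incident to the start of the next, and MTTF from one recovery to the next failure, which is one common convention rather than the only one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected: datetime      # alerting system picks up the incident
    acknowledged: datetime  # on-call engineer acknowledges the page
    resolved: datetime      # service is restored


def average(deltas):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents):
    return average([i.acknowledged - i.detected for i in incidents])


def mttr(incidents):
    return average([i.resolved - i.detected for i in incidents])


def mtbf(incidents):
    # Gap between the starts of consecutive incidents (needs at least two).
    ordered = sorted(incidents, key=lambda i: i.detected)
    return average([b.detected - a.detected for a, b in zip(ordered, ordered[1:])])


def mttf(incidents):
    # Healthy running time between one recovery and the next failure.
    ordered = sorted(incidents, key=lambda i: i.detected)
    return average([b.detected - a.resolved for a, b in zip(ordered, ordered[1:])])


incidents = [
    Incident(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 5), datetime(2024, 1, 1, 1, 0)),
    Incident(datetime(2024, 1, 3, 0, 0), datetime(2024, 1, 3, 0, 15), datetime(2024, 1, 3, 0, 30)),
    Incident(datetime(2024, 1, 7, 0, 0), datetime(2024, 1, 7, 0, 10), datetime(2024, 1, 7, 2, 0)),
]

print("MTTA:", mtta(incidents))  # 0:10:00
print("MTTR:", mttr(incidents))  # 1:10:00
print("MTBF:", mtbf(incidents))  # 3 days, 0:00:00
print("MTTF:", mttf(incidents))  # 2 days, 23:15:00
```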
What SREs do after an incident
Postmortems are reviews that engineers may hold after an incident has been resolved or contained. They go through logs and analyse the root cause to identify patterns and prevent similar incidents in the future.
Blameless is the cultural mindset that (most) Site Reliability Engineering teams aim to have when going through an incident. They aim to find out what and not who caused the problem. Even if they find the person behind the incident, they seek not to blame.
How SREs make better systems
Toil, in the site reliability engineering sense, is manual, repetitive work that should be automated away if possible. The catchphrase among SREs is to “eliminate toil”.
Error budgets are an allowance for unreliability: the amount of errors or downtime a service can accumulate over a period while still meeting its SLO. While budget remains, teams can spend it on releases and experimental work that may eliminate toil, which gives developers more breathing room. It is an advanced principle.
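One common way to quantify an error budget is as the downtime an availability SLO still permits over a window. The sketch below assumes a time-based budget, a hypothetical 99.9% target, and a 30-day window; the function names are illustrative and not taken from any particular tool.

```python
def error_budget_minutes(slo_target, window_days):
    """Downtime allowed over the window if the SLO is to be met."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, window_days, downtime_minutes):
    """Budget left after the downtime already incurred (negative means overspent)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(slo_target=0.999, window_days=30)
left = budget_remaining(slo_target=0.999, window_days=30, downtime_minutes=13.2)

print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

A team might treat a positive remainder as room for riskier changes, and a negative one as a signal to pause releases and focus on reliability.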