Site Reliability Engineering Glossary – Boost software reliability

4 “Golden Signals” in Site Reliability Engineering

Latency is the time it takes to serve a request, from the moment it is sent to the moment the response is fully received. It is typically measured in milliseconds (ms).

Throughput (also called traffic) is the amount of demand a system handles within a given period. It is commonly measured in requests per second for a service, or in bits per second for a network link.

Error rate measures how often requests fail, for example with an HTTP 500 response, whether the cause is a bug in the code or a network outage. It is expressed as a percentage of total requests.

Saturation is a measure of how close your server resources are to their limits. It can include measures like CPU utilization, memory use, and storage use.
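To make the four signals concrete, here is a minimal Python sketch that derives them from a handful of request records. The `Request` record shape, the `golden_signals` helper, the choice of the 95th percentile for latency, and the use of CPU as the saturation resource are all assumptions made for illustration, not something a particular monitoring tool prescribes.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    timestamp_s: float  # arrival time, in seconds from the window start
    latency_ms: float   # time taken to serve the request
    status: int         # HTTP status code of the response


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    """Summarise the four signals for one observation window."""
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "throughput_rps": len(requests) / window_s,
        "error_rate_pct": 100 * errors / len(requests),
        "saturation_pct": 100 * cpu_used / cpu_capacity,
    }


sample = [
    Request(0.0, 120.0, 200),
    Request(1.5, 80.0, 200),
    Request(3.0, 450.0, 500),
    Request(4.2, 95.0, 200),
]
signals = golden_signals(sample, window_s=10, cpu_used=3.0, cpu_capacity=4.0)
print(signals)
```

In practice these numbers usually come from a metrics system rather than raw records, but the arithmetic is the same.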

SLAs, SLOs and SLIs for measuring SRE success

Service Level Agreements (SLAs) are contractual commitments between a service provider and its customers to a certain level of performance. If the SLA is breached, the customer is typically entitled to compensation, such as service credits or a refund.

Service Level Objectives (SLOs) are the internal performance targets that engineers aim for. They are typically set at least as strict as the SLA requirements, so that a missed SLO gives warning before the SLA is breached. For example, an SLO can be a target level of availability for a service over a given period.

Service Level Indicators (SLIs) are measures of performance that allow engineers to understand whether they are meeting the SLOs for the system and, in turn, the business-level SLAs. For example, an SLI can be the measured uptime of a particular service.
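The relationship between the three terms can be sketched in a few lines of Python: an SLI is a measured number, and an SLO or SLA is a target that number is compared against. The function names and the target values (99.9% for the SLO, 99.5% for the SLA) are hypothetical, and setting the SLO stricter than the SLA is an assumption about how a team might choose to do it.

```python
def availability_sli(good_requests, total_requests):
    """SLI: percentage of requests that were served successfully."""
    if total_requests == 0:
        return 100.0  # assumption: a window with no traffic counts as available
    return 100 * good_requests / total_requests


def meets_target(sli_pct, target_pct):
    """True when the measured SLI is at or above the target."""
    return sli_pct >= target_pct


SLO_TARGET_PCT = 99.9  # hypothetical internal objective
SLA_TARGET_PCT = 99.5  # hypothetical contractual commitment

sli = availability_sli(good_requests=999_250, total_requests=1_000_000)
print(f"SLI: {sli:.3f}%")
print("SLO met:", meets_target(sli, SLO_TARGET_PCT))
print("SLA met:", meets_target(sli, SLA_TARGET_PCT))
```

With these sample figures the service is above both targets. If the SLI dropped to 99.7%, the SLO would be missed while the SLA would still hold.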

Software incident response lingo

On-call means that the engineer must be available to respond to incidents, including outside their normal working hours. This may mean evenings or weekends.

Pager Duty is a term used to refer to being on-call. It harks back to when operations engineers were required to carry a physical pager and respond if alerted by the device.

Follow the sun refers to an on-call model in which responsibility for incident response is handed over between teams in different time zones, so that each team covers incidents during its own daytime working hours.

Mean time to acknowledge (MTTA) is the average time it takes for an engineer to acknowledge an incident, measured from the moment the incident is identified and the engineer is paged.

Mean time to recovery (MTTR) is the average time for the engineer to resolve the incident from the moment the alerting system picks up the incident.

Mean time to failure (MTTF) is the average time a system or service is expected to function before it experiences a failure, such as a performance-degrading bug or outage.

Mean time between failures (MTBF) is the average time elapsed between one incident and the next across a series of incidents.
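These averages are simple to compute once each incident carries a few timestamps. The sketch below is a minimal Python illustration. The `Incident` record and its field names are hypothetical, and MTBF is measured here from the start of one incident to the start of the next, which is only one of several conventions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected: datetime      # alerting system picks up the incident
    acknowledged: datetime  # engineer starts looking at it
    resolved: datetime      # service is restored


def mean(deltas):
    """Average of a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents):
    return mean([i.acknowledged - i.detected for i in incidents])


def mttr(incidents):
    return mean([i.resolved - i.detected for i in incidents])


def mtbf(incidents):
    """Average gap between the starts of consecutive incidents."""
    ordered = sorted(incidents, key=lambda i: i.detected)
    gaps = [b.detected - a.detected for a, b in zip(ordered, ordered[1:])]
    return mean(gaps)


incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4),
             datetime(2024, 1, 1, 10, 40)),
    Incident(datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 6),
             datetime(2024, 1, 3, 10, 50)),
    Incident(datetime(2024, 1, 7, 10, 0), datetime(2024, 1, 7, 10, 5),
             datetime(2024, 1, 7, 10, 30)),
]
print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

MTTF is left out because it needs the length of each period of healthy operation before a failure, which these three timestamps alone do not capture.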

What SREs do after an incident

Postmortems are reviews that engineers undertake after an incident has been resolved or contained. They go through logs and analyse the root cause to identify patterns and prevent similar incidents in future.

Blameless is the cultural mindset that (most) Site Reliability Engineering teams aim to have when reviewing an incident. They aim to find out what caused the problem, not who. Even if they identify the person whose action triggered the incident, they do not assign blame.

How SREs make better systems

Toil, in the site reliability engineering sense, is manual, repetitive work that should be automated away if possible. The catchphrase among SREs is to “eliminate toil”.

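Error budgets are the amount of unreliability a service is allowed before it misses its SLO. For example, a 99.9% availability SLO leaves a 0.1% error budget. While budget remains, teams can spend it on releases and experimental work, including work that eliminates toil, which gives developers more breathing room. It is an advanced principle.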
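One common way to put a number on an error budget is to derive it from an availability SLO: whatever the SLO does not require is the allowance. The Python sketch below shows that arithmetic. The function names and the 30-day window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Downtime an availability SLO permits over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_pct) / 100


def budget_remaining(slo_pct, downtime_minutes, window_days=30):
    """Budget left after the downtime already incurred in the window."""
    return error_budget_minutes(slo_pct, window_days) - downtime_minutes


budget = error_budget_minutes(99.9)
left = budget_remaining(99.9, downtime_minutes=12)
print(f"Budget for a 99.9% SLO over 30 days: {budget:.1f} minutes")
print(f"Remaining after 12 minutes of downtime: {left:.1f} minutes")
```

A negative remainder means the budget is spent, which is typically the signal to slow down risky changes until the window rolls over.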

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…