Introduction
I can confidently tell you that runbooks form a critical part of the incident response toolkit. I will also tell you that SREs are well-placed to start and oversee the development of runbooks.
If you don’t have a runbook yet, let me entice you with the thought of checklist-type documentation to follow when you’re woken up to deal with a 3am production meltdown.
You won’t be the only one using the runbook. Its simplicity makes it easier to bring product teams into the incident response action. It also gives clarity to those who may not be as experienced as you when investigating faults with their work-in-production.
Runbooks are most useful when you find your incident response to be a case of “putting out the same fires over and over again”. They remove unnecessary thinking from incident response and help you focus on the task at hand.
Or at least carry out the work without an overwhelmed 🤯 feeling.
Why runbooks are useful in SRE incident response
Here are 3 reasons why runbooks are superior to “I’ll figure it out as it comes” as a strategy:
Ways that teams have set up their runbooks
Confluence: not designed specifically for managing runbooks, but it is an open-ended tool that works if you already have a solid idea of how to design an effective runbook
Jupyter Notebooks: an open-source tool that combines text, images, and live code snippets, so a decent option if you are happy to install and maintain it
Markdown files hosted in a git repo: maintenance might become an issue over time without strict guidelines within the team
Err… this ➝ “Sticky notes on someone’s desk. We’re thinking about getting a laminator to keep the coffee spills from being too serious of a problem.” 😅
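If you go the Markdown-in-git route, one way to make "strict guidelines" stick is to check them automatically. Here is a minimal sketch, not a definitive implementation: the section names in REQUIRED_SECTIONS and the example runbook are hypothetical, so swap in whatever your team agrees on.

```python
import re

# Hypothetical team guideline: every runbook must contain these sections.
REQUIRED_SECTIONS = ("Symptoms", "Diagnosis", "Mitigation", "Escalation")


def missing_sections(markdown_text, required=REQUIRED_SECTIONS):
    """Return the required section names that have no matching Markdown heading."""
    # Collect heading titles such as "## Symptoms" (levels # through ######).
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE
        )
    }
    # Preserve the order of the guideline so the report is predictable.
    return [name for name in required if name.lower() not in headings]


# Made-up runbook used only to illustrate the check.
example = """# Checkout API returns 5xx
## Symptoms
- Error rate alert fires
## Mitigation
- Roll back the latest deploy
"""

print(missing_sections(example))
```

A check like this could run in CI on every pull request that touches a runbook, so gaps show up in review and not at 3am.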