Stop writing runbooks no one will read.
Runbooks are how teams lie to themselves about preparedness. Here's the format that actually got opened during incidents — and the three sections I deleted from every template I had.
The first runbook I ever wrote was twenty-two pages long. The last useful one I wrote was a page and a half. The compression is the lesson.
What runbooks are for
A runbook exists for one moment: 03:14, your phone is buzzing, you have no caffeine, and the dashboard is screaming about a thing you last touched eight months ago. Everything else is an indulgence.
Most of the runbooks I’ve read are written for a different audience entirely — usually the author, two weeks after writing them. They include preamble, history, “context for new joiners,” and exhaustive lists of every flag the service supports. None of that helps at 03:14.
The template
The format I now insist on is three sections, in this order:
- What you’re seeing — the alert text, verbatim, with the exact dashboard link.
- What to do right now — the three commands you’ll run first, in order, with example output.
- What this might mean — branching paths, only after the first three commands.
That’s it. If something doesn’t fit, it goes somewhere else: design docs, postmortems, the wiki nobody reads. The runbook is the front door, not the museum.
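A template like this is easy to drift away from, so a small check can help. What follows is a minimal sketch, not something from my own tooling: it assumes the runbook is plain text with each section heading alone on its own line, and `check_runbook` and `REQUIRED_SECTIONS` are hypothetical names.

```python
# Hypothetical layout check: three sections, present and in order.
REQUIRED_SECTIONS = [
    "What you're seeing",
    "What to do right now",
    "What this might mean",
]


def check_runbook(text: str) -> list[str]:
    """Return a list of problems; an empty list means the layout matches."""
    # Assumption: each heading sits alone on a line. Curly apostrophes are
    # normalized so "you’re" and "you're" both match.
    headings = [line.strip().replace("\u2019", "'") for line in text.splitlines()]
    problems = []
    positions = []
    for name in REQUIRED_SECTIONS:
        if name in headings:
            positions.append(headings.index(name))
        else:
            problems.append(f"missing section: {name}")
    if positions != sorted(positions):
        problems.append("sections are out of order")
    return problems
```

Run against a runbook's text, an empty list means the three sections are there in the expected order; anything else is a list of what to fix.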
The three sections I delete
Whenever I inherit a runbook, the first thing I do is remove:
- Service overview — belongs in the README.
- Escalation paths — belongs in the on-call tool.
- Historical incidents — belongs in the postmortem archive.
The remaining file is usually a third the size and twice as useful.
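The same pruning can be roughed out in code when there is a pile of inherited runbooks to get through. This is only a sketch under assumptions: headings are taken to be lines that match the three names exactly (case-insensitive), and `sections_to_move` and `MOVE_TO` are made-up names for illustration.

```python
# Hypothetical mapping from a section heading to where it belongs instead.
MOVE_TO = {
    "service overview": "README",
    "escalation paths": "on-call tool",
    "historical incidents": "postmortem archive",
}


def sections_to_move(text: str) -> list[tuple[str, str]]:
    """List (heading, destination) pairs for sections that should leave the runbook."""
    found = []
    for line in text.splitlines():
        heading = line.strip()
        # Assumption: a heading is a whole line equal to one of the known names.
        destination = MOVE_TO.get(heading.lower())
        if destination is not None:
            found.append((heading, destination))
    return found
```

It only reports what it finds; deciding what to cut is still a job for a person reading the file.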
What “actually opened” looks like
I added a single line to our incident channel template: “runbook opened? y/n.” Six months in, the answer was “n” 70% of the time. The runbooks we had were not, by the strict definition, runbooks. They were documents that wished they were.
The new template’s open rate is 94%. We didn’t get smarter; we got shorter.
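The arithmetic behind an open rate is nothing more than counting the "y" answers. A minimal sketch, assuming the answers have already been collected from the incident channel as a list of strings; `open_rate` is a hypothetical helper, and ignoring anything that is not "y" or "n" is my assumption, not a rule from the channel template.

```python
def open_rate(answers: list[str]) -> float:
    """Fraction of incidents where the runbook was opened."""
    # Assumption: each answer should be "y" or "n"; anything else is ignored.
    valid = [a.strip().lower() for a in answers]
    valid = [a for a in valid if a in ("y", "n")]
    if not valid:
        raise ValueError("no y/n answers to count")
    return valid.count("y") / len(valid)
```

Multiply the result by 100 to get a percentage.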