Every company needs a plan for when things go wrong. I've written these plans many times now, and every time I've wished for a reference that reflects the way companies actually work today.
So here it is: our many years of collective knowledge and experience distilled into a practical guide to incident management for your whole organisation.
If you're looking for a quick entry point, or a round-up of the key points in the guide, this page is for you. See here for the full guide.
On-call
On-call isn't just for engineers: consider who else you might need in an emergency; they should be on-call too.
Invest in your training process: onboarding new on-callers well is critical. Each on-call rota should define a clear path to becoming ready to 'be on call', including learning domain-specific content as well as your incident response process.
Pay anyone who's on-call: compensate them for the inconvenience of being on-call. We recommend paying per hour spent on-call, and adjusting your compensation based on expectations.
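To make the per-hour idea concrete, here is a minimal sketch of how on-call pay could be totalled per person. The `OnCallShift` type, the `total_pay` helper and every rate used with them are hypothetical placeholders rather than recommended figures; the only thing taken from the guide is that pay scales with hours on-call and that the rate can vary with expectations.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class OnCallShift:
    """One stretch of time a person spent on-call (hypothetical model)."""

    responder: str
    start: datetime
    end: datetime
    hourly_rate: float  # assumption: a higher rate for stricter expectations

    def hours(self) -> float:
        # Length of the shift in hours, including fractions of an hour.
        return (self.end - self.start).total_seconds() / 3600

    def pay(self) -> float:
        # Pay for this shift: hours on-call multiplied by the shift's rate.
        return round(self.hours() * self.hourly_rate, 2)


def total_pay(shifts: list[OnCallShift]) -> dict[str, float]:
    """Sum the pay owed to each responder across all of their shifts."""
    totals: dict[str, float] = {}
    for shift in shifts:
        totals[shift.responder] = round(
            totals.get(shift.responder, 0.0) + shift.pay(), 2
        )
    return totals
```

A weeknight shift and a weekend shift can carry different rates simply by passing a different `hourly_rate` for each, which is one way to adjust compensation to what you expect of the person on-call.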
Be compassionate and understanding:
- Allow on-call teams to define their own schedules that best suit them
- Use overrides to give on-callers flexibility, and relieve pressure when things get tough
- Look out for anyone taking too much of the burden
Foundations
Create a shared understanding of an incident: an incident is anything that takes you away from planned work.
Declare more incidents: using your incident process frequently means that, when things go really wrong, your processes will run like a well-oiled machine.
Use 3 to 5 human-named severities: plain-English words such as minor, major and critical are easier for everyone to understand.
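If your tooling needs to store or compare severities, one possible sketch is an ordered enumeration whose names are the plain-English words themselves. The three levels come from the examples above; the `Severity` class, its numeric ordering and the `parse_severity` helper are illustrative assumptions, not part of any particular tool.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Plain-English severities, ordered from least to most serious."""

    MINOR = 1
    MAJOR = 2
    CRITICAL = 3

    @property
    def label(self) -> str:
        # Human-readable form for messages, e.g. "Major".
        return self.name.capitalize()


def parse_severity(text: str) -> Severity:
    """Turn user input such as ' major ' into a Severity (hypothetical helper)."""
    try:
        return Severity[text.strip().upper()]
    except KeyError:
        valid = ", ".join(s.label for s in Severity)
        raise ValueError(
            f"unknown severity {text!r}; expected one of: {valid}"
        ) from None
```

Because the values are ordered, a check such as `severity >= Severity.MAJOR` reads close to how people would say it out loud, and anything like "sev1" is rejected with a message listing the words people are expected to use.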
Every incident should have a lead: whether there's one responder or 30, someone has to play the lead role and drive the incident to a resolution.
Only use the roles you really need: you can often lean on actions (and your incident lead) to understand who's doing what.
Response
When you declare an incident:
- Create a fresh space that you can use to co-ordinate your response
- Announce the incident in a shared space so everyone's in the loop
- Assemble the team that you need to start investigating
As you respond to an incident:
- Identify what's broken & understand the impact
- Mitigate the immediate impact
- Take a pause
- Resolve the issue
- Close everything off, and assign follow-up actions
Send regular, easy-to-digest internal updates: using a predictable format helps busy stakeholders get the context that they need. Long gaps between updates can cause confusion or stress.
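As an illustration of a predictable format, the sketch below renders every update with the same headings and flags when the gap since the last update has grown too long. The field names, the wording of the template and the idea of a fixed maximum gap are all assumptions made for this example; pick whatever structure and cadence suit your stakeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentUpdate:
    """One internal update with a fixed set of fields (hypothetical shape)."""

    summary: str
    impact: str
    next_steps: str
    sent_at: datetime

    def render(self, next_update_in: timedelta) -> str:
        # Always use the same headings in the same order, so readers
        # know where to look for the part they care about.
        next_at = self.sent_at + next_update_in
        return "\n".join(
            [
                f"Update ({self.sent_at:%H:%M})",
                f"What's happening: {self.summary}",
                f"Impact: {self.impact}",
                f"Next steps: {self.next_steps}",
                f"Next update by: {next_at:%H:%M}",
            ]
        )


def is_update_overdue(last_sent: datetime, now: datetime, max_gap: timedelta) -> bool:
    """Return True when more than max_gap has passed since the last update."""
    return now - last_sent > max_gap
```

Stating when the next update is due sets an expectation, and a check like `is_update_overdue` could drive a reminder to the incident lead before the silence starts to worry people.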
Show your working: document your response in an incident channel, even if you're the only one there. It'll help you avoid bad assumptions or mistakes, and help your team learn from what you've done.
Keep your customers in the loop: clear and frequent communication builds trust, and can turn a negative into a positive. Use simple language, and tell everyone what you're doing and what they should do in the meantime.
Structure your thinking: use questions and theories to methodically work through a problem, being clear about any assumptions you make along the way.
Calm is contagious: take breaks and keep everyone well fed so your incident response can stay on track, even on the bad days.
When you're remote, over-communicate: to avoid a fragmented response, make sure that everything is in one place (the incident channel) and that it's really clear who's doing what.
Learn and Improve
Hold a debrief when there's value: the responders for an incident should have a good idea whether a debrief will be valuable. If debriefs become mandatory 'red tape', they'll turn into a useless checkbox exercise.
Make debriefs truly blameless: start with the assumption that everyone came to work to do their best, and don't hold individuals accountable for systemic failures.
Value the conversation over the artifact: having a document is a useful way to share knowledge asynchronously, but the most valuable part of a debrief is usually the conversation that precedes it.
Use incidents to level up your team: they broaden your horizons and teach you how to build resilient systems. Bring junior members into incidents, so that your teams get the full value from them.
Be transparent: building a transparent culture means that stakeholders and customers will trust you and give you space to fix what's broken.
Practice your incident process: just like any other skill, practice makes perfect. Dry-run your incident process regularly to get everyone up to speed and find the rough edges while the stakes are low.