When things go wrong in production, it is critical to fix them immediately. Good teams can do that. But great teams fix the problem in production and get to the root cause so this type of problem never impacts a user again.
The Blameless Postmortem is a technique I’ve been using for several years to help teams take a negative event in production and turn it into an engine for continuous improvement.
I first learned about the concept of a blameless postmortem from Etsy. The key feature of a blameless postmortem is its focus on how the problem occurred rather than on who took an action that led to a poor outcome.
I started evolving this format when I was the production support manager for an enterprise SaaS application (and full credit to the senior support developer who suggested we try it out). I’ve refined it over the years into this form.
With that in mind, let’s jump into how to do one.
More important than the steps and questions below is the mindset people need to have when approaching this process.
“Blameless” is a very powerful word. When you use it for the first time after a production issue, pay attention to how people react. Their reaction can tell you a lot about whether they believe individuals or systems are the cause of problems.
As Etsy’s write-up on the practice explains, if people feel that they will be punished for any mistakes they made, they will be reluctant to be open about how the issue unfolded.
Put more bluntly, you won’t get to the root cause of the problem if people are afraid they’ll be punished if they are honest.
I kick off each blameless postmortem with a reminder that the process is blameless. We want to get to the root cause of the problem and we have no interest in assigning blame to the unlucky person who made a mistake in the heat of the moment.
There are three steps for a blameless postmortem.
- Capture summary of the issue and timeline
- Meet to discuss the issue
- Identify and action remediation ideas
Summary (pre-fill ahead of the meeting)
A high-level summary of the issue, focusing on what is known at this point and what the impact to the customer was. Keep this to a sentence or two.
Rough Timeline (pre-fill ahead of the meeting)
A rough timeline of the issue. Depending on how fast-moving the issue was, this timeline could span a few minutes to a few hours to a few days. If your primary focus is on improving the team’s response times during emergencies, you’ll want this down to the second.
As you capture the timeline, be sure to include:
- When the issue was reported and by whom/what process
- What actions were taken
- When communication was made into and out of the team
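A timeline like the one described above can be kept as a small structured record, which makes it easy to measure gaps such as the time between the first report and the first action. The following is only a minimal sketch: the `TimelineEntry` fields, the `kind` labels and the sample incident are all made up for illustration and are not part of any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TimelineEntry:
    at: datetime  # when it happened
    kind: str     # hypothetical labels: "reported", "action", "communication"
    who: str      # the person or process involved
    note: str     # what happened


def time_to_first_action(timeline: list[TimelineEntry]) -> Optional[timedelta]:
    """Gap between the first report and the first action taken after it."""
    ordered = sorted(timeline, key=lambda entry: entry.at)
    reported = next((e for e in ordered if e.kind == "reported"), None)
    if reported is None:
        return None
    first_action = next(
        (e for e in ordered if e.kind == "action" and e.at >= reported.at),
        None,
    )
    if first_action is None:
        return None
    return first_action.at - reported.at


# Invented sample data covering the three things the timeline should include.
timeline = [
    TimelineEntry(datetime(2024, 1, 1, 9, 0, 0), "reported",
                  "uptime monitor", "Checkout page returning errors"),
    TimelineEntry(datetime(2024, 1, 1, 9, 4, 30), "communication",
                  "customer support", "Emailed production support"),
    TimelineEntry(datetime(2024, 1, 1, 9, 12, 0), "action",
                  "on-call developer", "Restarted the service"),
]

print(time_to_first_action(timeline))  # 0:12:00
```

Recording timestamps to the second only pays off if response time is what the team is trying to improve; for slower-moving issues, a coarser timeline in the same shape works just as well.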
When you meet to discuss the issue, invite everyone who worked the issue. This includes the production support team as well as any customer support team members who may have been involved.
Review the summary, review the timeline and add any missing parts, then move into the remediation ideas.
These questions are formulated to help the team take ownership of the problem. There are some issues that feel like they are outside of the team’s control (data center loses power, etc). But even in events like those, the team can still improve their reaction to the disaster.
Detect - How do we detect this problem or a problem like this sooner?
Assume this problem or a problem very much like it will happen again. How can the support team detect this problem faster and find it before a customer does?
React - How do we improve our reaction to issues like these?
Assume the issue is reported. How quick was the reaction? Were minutes lost while people were sending emails around trying to get someone to look at the problem?
The next time this issue happens, how can the team react more quickly or in a more organized fashion?
Quick Fix - How to stop the bleeding faster?
When this happens again, is there a ready workaround that we can provide the customer to reduce the impact of the problem?
If this is something that gets worse over time (like a DDoS attack), do we have a quick way to close the flood gates while we figure out the root cause?
Prevent - How to prevent or reduce the impact of this issue in the future?
This is often the only question teams ask in a postmortem. It is an important question and you should spend a lot of time here. However, if you ask only how to prevent an issue, you let yourself off the hook for the things within your control (like how you detect, react to or quick fix an issue).
As you brainstorm ideas, don’t limit yourself to technical fixes. Consider better monitoring, better communication paths, better training, and making sure the people in customer support know the people in production support by name.
Other Areas of Risk - What other areas share this same vulnerability?
Every issue is a hint at where your system is weak. Odds are, for each issue you find, there are dozens lurking in the shadows, yet to be found.
There are likely other parts of the system that share the same design assumptions or in some cases the same code (not that anyone would ever copy/paste code).
Spend a few minutes brainstorming for other places that are vulnerable in a similar way.
When teams are stressed and overworked, they will skip this step. I find that this is the most important question to ask to get the team into a proactive mindset and to reduce the occurrence of issues in the future.
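One way to keep the discussion from collapsing into “Prevent” alone is to record ideas under each of the five questions and check which ones are still empty before moving on. This is just a sketch of that idea; the `unanswered` helper and the sample ideas are hypothetical.

```python
from __future__ import annotations

# The five remediation questions from this post, in the order they are asked.
QUESTIONS = ["Detect", "React", "Quick Fix", "Prevent", "Other Areas of Risk"]


def unanswered(ideas: dict[str, list[str]]) -> list[str]:
    """Return the questions that have no remediation ideas recorded yet."""
    return [question for question in QUESTIONS if not ideas.get(question)]


# Invented example: a team that has only talked about detection and prevention.
ideas = {
    "Detect": ["Alert when the error rate climbs"],
    "Prevent": ["Add a regression test"],
}

print(unanswered(ideas))  # ['React', 'Quick Fix', 'Other Areas of Risk']
```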
After you’ve identified all the possible things you can do to improve how issues are detected, reacted to, quick fixed and prevented…and you’ve found the other areas of your application that need attention…move on to deciding which actions to take.
The way you prioritize these is up to you. But I do have a few pieces of advice.
- Get a name and a date on each one you plan to action before you leave the meeting
- If someone in the meeting is passionate about taking one of the actions, encourage them to, even if you think it might not be the most important thing to fix
Names and Dates
Generally, I’ve found that teams enjoy this exercise (provided you can create a blameless environment for the meeting). They like dissecting the problem and brainstorming solutions. However, everyone feels busy and overworked. Unless this meeting wraps with owners and dates next to the things that need to be done, the greatest likelihood is that none of the improvements will happen.
What will happen is that three weeks from now, when the same problem occurs in production (but this time in a bigger way), someone will say, “oh yeah, we talked about fixing that.” Not a great place to be.
To combat that, simply ensure that there is a name and date next to each action that the group wants to take.
- The “owner” doesn’t have to be the one who will do all the work. Maybe the action needs to be taken by someone outside the room. The owner in this case is who takes responsibility for ensuring the right people are contacted and that the action is taken.
- The date doesn’t have to be the date when the action will be complete. Some things can’t be done quickly. In this case, the date is the date when there will be measurable progress that can be reported back to the group.
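The names-and-dates rule is simple enough to check mechanically before anyone leaves the room. Below is a minimal sketch, assuming actions are tracked as plain records; the `ActionItem` fields, the helper name and the sample actions are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    # The owner makes sure the action happens; they need not do the work.
    owner: Optional[str] = None
    # Date of the next measurable progress report, not necessarily completion.
    check_in: Optional[date] = None


def missing_name_or_date(actions):
    """Return the actions that still lack an owner or a check-in date."""
    return [a for a in actions if not a.owner or a.check_in is None]


# Invented example actions from a hypothetical postmortem.
actions = [
    ActionItem("Add an alert on checkout error rate",
               owner="Priya", check_in=date(2024, 1, 15)),
    ActionItem("Document the rollback procedure", owner="Sam"),
    ActionItem("Review other services that share the same retry code"),
]

for item in missing_name_or_date(actions):
    print(f"Needs a name and a date: {item.description}")
```

In this example the second and third actions are flagged, so the meeting is not done yet.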
Follow the Excitement
If you find someone in the meeting is really excited about taking one of the actions, encourage them to, even if you don’t think it is the most important action to take. By all means, make sure you’re taking the most important action, but don’t try to dissuade someone from taking an improvement they’re passionate about.
As teams mature in their ability to do production support, the most important change on the team is a sense of ownership. When you own something, you take responsibility for its health and maintenance. And when someone wants to take proactive steps to improve things, that moves the team in the direction of ownership.
This technique is powerful in 3 ways:
- It reinforces that a learning (blameless) mindset is critical
- It helps the team get out of a reactive mindset and into a proactive one
- It helps the team learn how to react to and triage issues
If you try it out or have any questions, please drop me a line.