Stuff happens! Our IT systems are incredibly complex. Inevitably, things will break and customers will experience the consequences of these failures.
When we have experienced a major loss or degradation of our IT services, it is essential that we learn from what happened. A learning approach ensures either that the incident doesn’t happen again, or that we can remedy the situation more expediently than the first time around.
Major incident reviews, or incident postmortems, form an important part of any continual improvement program. These reviews are opportunities to improve both our IT infrastructure and, possibly more importantly, our processes for dealing with these events. A mature organization will see these events as valuable learning opportunities, rather than taking the opportunity to apportion blame for any errors.
Let’s explore incident postmortems, including the #1 factor for their success. Then, we’ll cover the benefits, rules, and best practices for your incident reviews.
Successful incident postmortems are blameless
In my opinion, the most critical success factor for incident reviews is that they are blameless.
To use a popular phrase: do not make your incident postmortem a witch hunt. ‘Blamestorming’ sessions do not benefit anyone. If your company culture seeks out the person who may have caused a major outage, through error or omission, it is extremely unlikely that you will get truthful answers during the review. In this culture, no smart person would be willing to raise their hand and admit a mistake. When that happens, your postmortem has failed before it’s begun.
Consider a company culture that rewards honesty rather than demonizing mistakes. People will put up their hand willingly to flag an error they may have made. Then, real and useful changes can be made to prevent it being made again in the future.
Invaluable benefits of incident reviews
Remember that incident reviews aren’t just for internal stakeholders. Ultimately, your incident reviews show your customers two important characteristics of your company, both of which provide invaluable benefits:
- Your willingness to learn from mistakes. Reporting the findings of your review back to your key customers, owning any errors or omissions, and giving them a roadmap of what you are doing to improve the reliability of your services will build your reputation as a valued business partner.
- Your ability to create new processes that ensure your customers aren’t impacted by the same issue again. Showing that you understand and take seriously the impact of IT outages on the wider business is essential to growing a relationship based on mutual respect.
How to conduct incident postmortems
A well-run postmortem allows your team to come together in a less stressful environment to achieve several goals:
- Work through what happened
- Discover previously unknown system vulnerabilities
- Mitigate the possibility of repeat incidents
- Uncover any potential process improvements that could speed up resolution of the next major incident
But you’d be hard-pressed to achieve these without some basic rules in place, so let’s set a few:
- Have a template. Create a template that you will work from for each review. This ensures you don’t miss anything. A template also provides the basis for the report that goes to your management team and for the communications that go out to affected customers and stakeholders.
- Define roles and owners. The owner of the review is responsible for managing the meeting and producing the subsequent report. The owner should be someone who has sufficient understanding of the technical details, familiarity with the incident, and an understanding of the business impact.
- Set rules around which incidents need reviews. You must have clear, well-defined rules about which incidents will trigger the postmortem process. A good rule of thumb is any incident that has been given a severity one rating. A review may be useful for other incidents, too. Consider establishing a process whereby service owners can request reviews of incidents that do not meet the severity criteria but that may have severely impacted their services and customers.
- Act promptly. Your team will almost always need some downtime after a critical incident, but do not delay any longer than necessary. Wait too long and important details are forgotten. So, when a critical incident occurs, convene within 24-48 hours if possible, and certainly do not delay more than a week.
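If you track incidents in a tool or script, the last two rules above can be written down so they are applied the same way every time. The following is only a minimal sketch: the `Incident` record, its field names, and the choice to measure the window from the time the incident was resolved are all assumptions made for illustration, not part of any particular ticketing system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple

# Values taken from the rules above; adjust them to your own policy.
TARGET_WINDOW = timedelta(hours=48)  # convene within 24-48 hours if possible
HARD_LIMIT = timedelta(days=7)       # certainly do not delay more than a week


@dataclass
class Incident:
    # Hypothetical record; the field names are illustrative only.
    identifier: str
    severity: int                    # 1 is the most severe rating
    resolved_at: datetime            # assumption: the clock starts at resolution
    review_requested_by_owner: bool = False


def needs_review(incident: Incident) -> bool:
    """Severity one always triggers a review; service owners may request others."""
    return incident.severity == 1 or incident.review_requested_by_owner


def review_window(incident: Incident) -> Tuple[datetime, datetime]:
    """Return the (target, latest) times by which the review should convene."""
    return (incident.resolved_at + TARGET_WINDOW,
            incident.resolved_at + HARD_LIMIT)
```

Keeping the trigger rule and the deadlines in one place makes it harder for a review to be quietly skipped or postponed.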
Best practices for incident reviews
Even with rules in place, an incident postmortem can go all over the place. Consider these best practices as you embark on your next incident review, and then revisit them with each postmortem iteration.
Conduct a review for every incident classified as ‘major’. Every major incident! Even if it’s too hard. Even if you already know the root cause or you’ve developed a permanent fix. Don’t skip any major incident review. Remember that not everyone is aware of the final resolution or the steps that were taken. The review is as much about reviewing how well your process performs as it is about finding the technical or true root cause.
Choose a moderator. Ensure that one person controls the room, so that it stays on track and doesn’t become a “blamestorming” session. Typically, the moderator is the owner of the incident review, whom you’ve already designated. If not, perhaps rely on a person who can command a room. The moderator is responsible for maintaining order and giving every participant the chance to speak.
Involve many people. Most major incidents involve many players from internal and vendor teams. The review gives everyone a chance to contribute their views and learn from the experience. Beyond this specific incident, being inclusive helps build trust and resiliency in the team, creating relationships that will help the next major incident war room run more smoothly.
Lay the ground rules at the start of your meeting. No finger pointing, no dismissing anyone’s ideas. Treat everyone with respect.
Single out no one. Successful postmortems are blameless postmortems. Do not single out any individuals as being responsible for the incident: it’s negative and it wastes time. Instead, you must concentrate on actions, results, and impact.
Use “The 5 Whys” technique. I like this technique and promote it often. First, make sure everyone is on the same page about the original problem and its details. Then, ask why that happened. As you get that answer, ask why again. Keep asking “Why?” at least five times. This ensures you uncover all the underlying factors that contributed to the incident. The information obtained from this exercise will also form the basis for the ongoing problem investigation.
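If it helps to record the chain of answers as you go, a small structure like the one below can keep the exercise honest. This is a sketch under my own assumptions: the `FiveWhys` class and its method names are hypothetical, and the only rule it encodes is the "at least five times" guideline described above.

```python
from dataclasses import dataclass, field
from typing import List

MIN_WHYS = 5  # ask "Why?" at least five times


@dataclass
class FiveWhys:
    # Hypothetical helper for note-taking during the review.
    problem: str
    answers: List[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        """Record the answer to the next "Why?" in the chain."""
        self.answers.append(answer)

    def is_complete(self) -> bool:
        """True once the chain is at least MIN_WHYS answers deep."""
        return len(self.answers) >= MIN_WHYS

    def render(self) -> str:
        """Format the problem and each answer as numbered lines for the report."""
        lines = [f"Problem: {self.problem}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"Why #{depth}: {answer}")
        return "\n".join(lines)
```

The rendered chain can be pasted straight into the review report, and `is_complete()` gives the moderator a simple check that the group did not stop at the first convenient answer.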
Don’t let participants shy away from uncomfortable truths. In group settings, it’s easy for participants to take the path of least resistance and settle on an easy or convenient consensus about the cause. The owner/moderator should prevent this from happening.
Do not skimp on time. Your incident review is all about detail—things that did not seem important during the heat of the incident may provide valuable insights that could help with understanding the root cause. Give everyone a chance to contribute, and consider each and every one of those contributions, no matter how far-fetched they may seem.
Review your postmortems. The last thing I will leave you with: reviewing your incident reviews encourages you to do better next time, and there will be a next time. Everything we do contributes to continual improvement.