Stuff happens! Our IT systems are incredibly complex. Inevitably, things will break and customers will experience the consequences of these failures.

When we have experienced a major loss or degradation of our IT services, it is essential that we learn from what happened. A learning approach ensures either that the incident doesn’t happen again, or that we can remedy the situation more expediently than the first time around.

Major incident reviews, or incident postmortems, form an important part of any continual improvement program. These reviews are opportunities to improve both our IT infrastructure and, possibly more importantly, our processes for dealing with these events. A mature organization will see these events as valuable learning opportunities, rather than taking the opportunity to apportion blame for any errors.

Let’s explore incident postmortems, including the #1 factor for their success. Then, we’ll cover the benefits, rules, and best practices for your incident reviews.

Successful incident postmortems are blameless

In my opinion, the most critical success factor for incident reviews is that they are blameless.

To use a popular phrase: do not make your incident postmortem a witch hunt. ‘Blamestorming’ sessions do not benefit anyone. If your company culture seeks out the person who may have caused, through error or omission, a major outage, it is extremely unlikely that you will get truthful answers during the review. In this culture, no smart person would be willing to raise their hand and admit a mistake. When that happens, your postmortem has failed before it’s begun.

Consider a company culture that rewards honesty rather than demonizing mistakes. People will put up their hand willingly to flag an error they may have made. Then, real and useful changes can be made to prevent it being made again in the future.

Invaluable benefits of incident reviews

Remember that incident reviews aren’t just for internal stakeholders. Ultimately, your incident reviews show your customers two important characteristics about your company, both of which provide invaluable benefits:

  • Your willingness to learn from mistakes. Reporting the findings of your review back to your key customers, owning any errors or omissions, and giving them a roadmap that shows what you are doing to improve the reliability of your services will build your reputation as a valued business partner.
  • Your ability to create new processes that ensure your customers aren’t impacted by the same issue again. Showing that you understand and take seriously the impact of IT outages on the wider business is essential to growing a relationship based on mutual respect.

How to conduct incident postmortems

A well-run postmortem allows your team to come together in a less stressful environment to achieve several goals:

  • Work through what happened
  • Discover previously unknown system vulnerabilities
  • Mitigate the possibility of repeat incidents
  • Uncover any potential process improvements that could speed up resolution of the next major incident

But you’d be hard-pressed to achieve these without some basic rules in place, so let’s set a few:

  1. Have a template. Create a template that you will work from for each review. This ensures you don’t miss anything. A template also provides the basis for the report that goes to your management team and the communications that go out to affected customers and stakeholders.
  2. Define roles and owners. The owner of the review is responsible for managing the meeting and producing the subsequent report. The owner(s) should be someone who has sufficient understanding of the technical details, familiarity with the incident, and an understanding of the business impact.
  3. Set rules around which incidents need reviews. You must have clear, well-defined rules about which incidents will trigger the postmortem process. A good rule of thumb is to review any incident that has been given a severity one rating. A review may also be useful for other incidents. Consider establishing a process whereby service owners can request reviews of incidents that do not meet the severity criteria but that may have severely impacted their services and customers.
  4. Act promptly. Your team will almost always need some downtime after a critical incident, but do not delay any longer than necessary. Wait too long and important details are forgotten. So, when a critical incident occurs, convene within 24-48 hours if possible, and certainly do not delay more than a week.
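
If you track incidents in a tool that can be scripted, rules 3 and 4 above are simple enough to automate. The following is only a minimal sketch in Python, not a feature of any particular product: the `Incident` record, its field names, and both helper functions are hypothetical, and I have assumed the 24-48 hour window is measured from the time the incident was resolved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; the field names are illustrative only.
    incident_id: str
    severity: int                      # 1 is the most severe rating
    resolved_at: datetime              # assumption: the clock starts at resolution
    review_requested_by_owner: bool = False


def needs_review(incident: Incident) -> bool:
    """Rule 3: severity one always triggers a review,
    and a service owner can request one for anything else."""
    return incident.severity == 1 or incident.review_requested_by_owner


def review_window(incident: Incident) -> tuple[datetime, datetime, datetime]:
    """Rule 4: aim to convene 24-48 hours after resolution,
    and never later than one week."""
    earliest = incident.resolved_at + timedelta(hours=24)
    target = incident.resolved_at + timedelta(hours=48)
    hard_limit = incident.resolved_at + timedelta(days=7)
    return earliest, target, hard_limit
```

A scheduler could call `needs_review` when an incident closes and use the middle value from `review_window` as the default meeting deadline. Treat the thresholds as starting points to adjust for your own organization.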

Best practices for incident reviews

Even with rules in place, an incident postmortem can go all over the place. Consider these best practices as you embark on your next incident review, and then revisit them with each postmortem iteration.

Conduct a review for every incident classified as ‘major’. Every major incident! Even if it’s too hard. Even if you already know the root cause or you’ve developed a permanent fix. Don’t skip any major incident review. Remember that not everyone is aware of the final resolution or the steps that were taken. The review is as much about reviewing how well your process performs as it is about finding the technical or true root cause.

Choose a moderator. Ensure that one person controls the room, so that it stays on track and doesn’t become a “blamestorming” session. Typically, the moderator is the owner of the incident review, whom you’ve already designated. If not, perhaps rely on a person who can command a room. The moderator is responsible for maintaining order and giving every participant the chance to speak.

Involve many people. Most major incidents involve many players from internal and vendor teams. The review gives everyone a chance to contribute their views and learn from the experience. Beyond this specific incident, being inclusive helps build trust and resiliency in the team, creating relationships that will help the next major incident war room run more smoothly.

Lay the ground rules at the start of your meeting. No finger pointing, no dismissing anyone’s ideas. Treat everyone with respect.

Single out no one. Successful postmortems are blameless postmortems. Do not single out any individuals as being responsible for the incident: it’s negative and it wastes time. Instead, you must concentrate on actions, results, and impact.

Use “The 5 Whys” technique. I like this technique and promote it often. First, make sure everyone is on the same page about the original problem and its details. Then, ask why that happened. As you get that answer, ask why again. Keep asking “Why?” at least five times. This ensures you uncover all the underlying factors that contributed to the incident. The information obtained from this exercise will also form the basis for the ongoing problem investigation.
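
If it helps to capture the chain of answers as you go, the technique can be recorded with a very small structure. This is a hedged sketch in Python rather than an established tool: the `FiveWhys` class and its method names are hypothetical, and the default minimum of five questions simply mirrors the guidance above.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    # Hypothetical helper for note-taking during a review.
    problem: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        """Record the answer to the next 'Why?' in the chain."""
        self.answers.append(answer)

    def is_deep_enough(self, minimum: int = 5) -> bool:
        """True once 'Why?' has been asked at least `minimum` times."""
        return len(self.answers) >= minimum

    def render(self) -> str:
        """Return the chain as plain text for the review report."""
        lines = [f"Problem: {self.problem}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"Why #{depth}: {answer}")
        return "\n".join(lines)
```

The moderator can keep calling `ask_why` until `is_deep_enough` returns true, then paste the output of `render` into the review template as the starting point for the problem investigation.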

Don’t let participants shy away from uncomfortable truths. In group settings, it’s easy for participants to choose the truth of least resistance, or to come to an easy or convenient consensus on the cause. The owner/moderator should prevent this from happening.

Do not skimp on time. Your incident review is all about detail: things that did not seem important during the heat of the incident may provide valuable insights that could help with understanding the root cause. Give everyone a chance to contribute, and consider each and every one of those contributions, no matter how far-fetched they may seem.

Review your postmortems. The last thing I will leave you with: reviewing your incident reviews encourages you to do better next time, and there will be a next time. Everything we do contributes to continual improvement.

Last updated: 12/04/2019

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

About the author

Kirstie Magowan

Kirstie has been active in service management since 2000, working in a wide range of organizations, from primary industry to large government entities, across New Zealand and Australia. Kirstie has spent much of the past 15 years working at a strategic level as an ITSM consultant. She regularly takes on operational assignments to remember what it's like to be on the ‘coal face’ of service management, as this allows her to provide real and actionable advice as a consultant. Kirstie first qualified as a V2 ITIL Manager in 2004 and spent four years working as the Chief Editor for itSMF International from 2012, where she built a strong global network of service management experts. Kirstie is a member of the authoring team for the ITIL4 book - Direct, Plan and Improve, and a contributing author to the ITIL4 practice guides.