Detecting Major Incidents using Automated Intelligence and Machine Learning

Detecting and managing live incident storms using AI
Ajoy Kumar

It’s a typical Monday morning and incidents are streaming in. As a service desk manager, you notice more incidents than usual—“something just does not look right for Skype business services.” You start chatting with a few service desk agents to understand their incidents and determine if there is a pattern among them, but while doing so more incidents pile up, impacting customer satisfaction and delaying resolution. You need quicker answers to these two questions:

  1. Is there a major incident brewing for any business service right now?
  2. How many duplicate incidents are being created and worked on by different service desk agents?

Early detection of major incidents is critical for achieving higher customer satisfaction while improving the efficiency of the service desk. Artificial intelligence and machine learning (AI/ML)-driven clustering can help address these challenges effectively.

Natural language processing (NLP) can be used to understand the meaning of each incident. Similarity-based ML algorithms can then continuously group the streaming incidents into meaningful, evolving clusters that are correlated by time, text, and business service.

Challenge #1: Is there a major incident brewing for any business service right now?

A major incident is defined as a critical and urgent issue that has widespread organizational impact and affects multiple users or regions. It is usually associated with an outage of a business service and can cause financial impact to the company.

The typical scenario is that service desk agents start to see a flood of critical incidents on a specific service. These could be a mix of both user-generated and infrastructure-triggered incidents. Service desk managers rely on “word of mouth” to detect major incidents by calling agents or having group chat sessions—ad-hoc, inconsistent, and non-repeatable solutions that can delay the detection of major incidents. Service desk managers need a better way.

Challenge #2: How many duplicate incidents are being created and worked on by different service desk agents?

Not every incident storm is a major incident. Duplicate incidents may reflect a very localized issue. For example, if I have five incidents related to Salesforce application file downloads, a traditional incident management system would have five agents working on these independently. This can cause huge inefficiencies, especially if all five incidents relate to the same underlying cause. Detecting duplicate incidents so they can be managed efficiently is critical to an organized service desk operation.

A better way

Let’s see how we can address these two challenges by using an AI/ML-driven clustering workflow:

1. Match and maintain cluster lifecycle. As new incidents are created, they are matched against existing incidents using common ML-based similarity algorithms to determine the degree of similarity. For text, NLP techniques such as TF-IDF (term frequency-inverse document frequency) or pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) can be used. For example, assume we get three new incidents within a few minutes, each having slightly different text but the same intent: “I cannot login to Skype,” “My Skype fails to login,” and “Skype login doesn’t work.” Algorithms based on NLP and language models will detect that all three incidents have the same meaning (“intent”) and group them into a single cluster.
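
To make the matching step concrete, here is a minimal sketch of TF-IDF with cosine similarity, written with only the Python standard library. It is an illustration under assumptions, not the method any particular product uses: the tokenizer, the smoothed IDF formula, and the extra Salesforce incident are all choices made for this example. A BERT-style model would replace the vectorization step but leave the comparison logic the same.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so that terms present in every document keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: count * idf[t] for t, count in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

incidents = [
    "I cannot login to Skype",
    "My Skype fails to login",
    "Skype login doesn't work",
    "Salesforce file download is slow",  # unrelated incident added for contrast
]
vectors = tfidf_vectors(incidents)

# Compare the first incident with every other one.
for text, vec in zip(incidents[1:], vectors[1:]):
    print(f"{cosine(vectors[0], vec):.2f}  {text}")
```

The three Skype incidents share terms and score above zero against each other, while the Salesforce incident shares no terms with the first one. Word-overlap measures like this miss paraphrases that use different vocabulary, which is the gap that pre-trained language models are meant to close.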

Once a match is determined with high confidence, a new cluster of incidents is formed and tracked by the system. As new incidents flow in, they are compared to existing clusters to determine which one they belong to. Clusters can evolve and grow as more incidents are matched and added.

Incident clusters are interesting and useful only for a short period of time, so the cluster is closed after a set time period (e.g., 30 days) if there are no additions. Incident clusters can also be closed if all incidents in that cluster are resolved. Automatically closing clusters is an important capability so that only the top “emerging” and “fresh” situations are presented to the service desk manager. The lifecycle of a cluster is managed by the system for incident creations, closures, and incident updates.
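
The lifecycle described above can be sketched as a small tracker that assigns each incoming incident to the most similar open cluster and closes clusters that go idle or are fully resolved. Everything here is hypothetical: the `ClusterTracker` class, the 0.25 similarity threshold, and the use of word-overlap (Jaccard) similarity as a stand-in for an NLP model. For simplicity the sketch also starts a one-incident cluster for every unmatched incident, rather than waiting for a high-confidence match.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timedelta

def tokens(text):
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Word-overlap similarity; a stand-in for an NLP-based similarity score."""
    return len(a & b) / len(a | b) if a | b else 0.0

@dataclass
class Incident:
    id: str
    text: str
    service: str
    created: datetime
    resolved: bool = False

@dataclass
class Cluster:
    service: str
    incidents: list = field(default_factory=list)
    closed: bool = False

    @property
    def last_added(self):
        return max(i.created for i in self.incidents)

class ClusterTracker:
    def __init__(self, threshold=0.25, max_idle=timedelta(days=30)):
        self.threshold = threshold  # assumed similarity cutoff
        self.max_idle = max_idle    # close clusters with no additions for this long
        self.clusters = []

    def add(self, incident):
        """Attach the incident to the best-matching open cluster, or start a new one."""
        best, best_score = None, 0.0
        for cluster in self.clusters:
            if cluster.closed or cluster.service != incident.service:
                continue
            score = max(jaccard(tokens(incident.text), tokens(member.text))
                        for member in cluster.incidents)
            if score > best_score:
                best, best_score = cluster, score
        if best is None or best_score < self.threshold:
            best = Cluster(service=incident.service)
            self.clusters.append(best)
        best.incidents.append(incident)
        return best

    def close_stale(self, now):
        """Close clusters that are idle too long or whose incidents are all resolved."""
        for cluster in self.clusters:
            idle = now - cluster.last_added > self.max_idle
            all_resolved = all(i.resolved for i in cluster.incidents)
            if idle or all_resolved:
                cluster.closed = True

    def open_clusters(self):
        return [c for c in self.clusters if not c.closed]

t0 = datetime(2024, 1, 1, 9, 0)
tracker = ClusterTracker()
tracker.add(Incident("INC1", "I cannot login to Skype", "Skype", t0))
tracker.add(Incident("INC2", "My Skype fails to login", "Skype", t0 + timedelta(minutes=1)))
tracker.add(Incident("INC3", "Skype login doesn't work", "Skype", t0 + timedelta(minutes=2)))
tracker.add(Incident("INC4", "Salesforce file download fails", "Salesforce", t0 + timedelta(minutes=3)))
```

Calling `tracker.close_stale(now)` on a schedule keeps only fresh clusters in `tracker.open_clusters()`, which is the list a service desk manager would see.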

2. Detect major incidents. Major incidents are detected by identifying fast-growing clusters of incidents, as well as those that have high criticality. Multiple factors should be considered, such as: incident count; average priority/criticality of the cluster; the importance of the business service that the cluster impacts; the region(s) where it is happening; and so on. Notifications based on these criteria, as well as drill-down visualization of these incident clusters, inform whether a major incident should be raised.

In our Skype example, a cluster of only three incidents is likely not a major incident. However, if you get 30 incidents on Skype within 10 to 20 minutes, whether all are low priority or a few are high priority, the cluster should be flagged as a major incident candidate based on predefined thresholds. The cluster’s aggregate-level properties (incident count and its trend, average priority, region, and service impacted) are important considerations in deciding whether the cluster is a major incident. Visualization tools and automated rules engines that apply customer-specified threshold criteria to flag and notify on major incident candidates help speed up detection.
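
A threshold rule of this kind might look like the following sketch. The function names, the priority scale (1 = critical, 4 = low), and every threshold value are assumptions for illustration. In practice they would be customer-specified, and factors such as region and service importance would be added to the rule.

```python
from datetime import datetime, timedelta

def summarize(created_times, priorities, now, window=timedelta(minutes=20)):
    """Aggregate a cluster into the properties a rule can evaluate."""
    recent = [t for t in created_times if now - window <= t <= now]
    return {
        "count": len(created_times),
        "recent_count": len(recent),  # growth within the window
        "avg_priority": sum(priorities) / len(priorities),
    }

def is_major_incident_candidate(summary, min_recent=30,
                                min_recent_high=10, high_priority_cutoff=2.0):
    """Flag fast-growing clusters, or smaller ones dominated by high priorities."""
    if summary["recent_count"] >= min_recent:
        return True
    return (summary["recent_count"] >= min_recent_high
            and summary["avg_priority"] <= high_priority_cutoff)

now = datetime(2024, 1, 1, 9, 30)

# Three low-priority Skype incidents: not a candidate.
small = summarize([now - timedelta(minutes=m) for m in (1, 2, 3)], [4, 4, 4], now)

# Thirty low-priority Skype incidents in about 15 minutes: a candidate.
storm_times = [now - timedelta(seconds=30 * k) for k in range(30)]
storm = summarize(storm_times, [4] * 30, now)

print(is_major_incident_candidate(small))
print(is_major_incident_candidate(storm))
```

Keeping the aggregation (`summarize`) separate from the rule makes it possible to show the same numbers in a drill-down view and to tune the thresholds without touching the clustering logic.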

3. Recommend and manage duplicates. Modern IT service management (ITSM) systems let you manage duplicate incidents by manually creating parent-child hierarchies: only one agent is assigned to the parent incident, so the child incidents do not need separate agents. This greatly improves the efficiency of a service desk.

Until now, detecting and managing duplicate incidents has been a largely manual, inefficient process that increases each agent’s workload without adding value. AI/ML clustering can recommend parent-child relationships to the user to streamline service desk operations. With those recommendations, you can assign one agent to the parent instead of one agent to each child. This can cut duplicate work by a factor of two to five and so improve efficiency.
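
As a rough sketch of such a recommendation, the function below takes the incidents in one cluster and proposes the earliest one as the parent and the rest as children. Choosing the earliest incident is an assumption made for this example. A real system might prefer the incident with the most detail or the one already being worked on, and the recommendation would still be confirmed by a person.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    id: str
    created: datetime

def recommend_parent_child(cluster):
    """Propose the earliest incident as parent and all others as its children."""
    ordered = sorted(cluster, key=lambda incident: incident.created)
    parent, children = ordered[0], ordered[1:]
    return {
        "parent": parent.id,
        "children": [c.id for c in children],
        # One agent works the parent, so each child frees up one assignment.
        "assignments_saved": len(children),
    }

start = datetime(2024, 1, 1, 9, 0)
# Five incidents about Salesforce file downloads, created a minute apart.
salesforce_cluster = [
    Incident(f"INC00{n}", start + timedelta(minutes=n)) for n in range(1, 6)
]
recommendation = recommend_parent_child(salesforce_cluster)
print(recommendation)
```

For the five Salesforce incidents, the recommendation assigns one agent to the parent and leaves the other four as children, so one agent does the work that would otherwise go to five.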

Conclusion

You need a major incident warning system to avoid extended outages and impacts. BMC Helix ITSM has recently released these ML-powered capabilities as a part of the ITSM Insights offering to help you achieve that.

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

About the author

Ajoy Kumar

With over 20 years of experience, Ajoy specializes in enterprise software, IT management, cloud architecture, SaaS, and ITSM. He currently serves as cloud architect at BMC, focused on understanding the needs of markets, customers and technology disruptions, and driving innovation.