A Fully Integrated, Open Observability and AIOps Solution

Organizations often struggle to maintain seamless functionality and uninterrupted service delivery due to the overwhelming amount of data and events they face. This can lead to operational inefficiencies, prolonged downtime, and fragmented insights. Without a clear understanding of their IT infrastructure, businesses find themselves in a cycle of reactive firefighting. This lack of visibility not only hampers agility and innovation, but also leaves organizations vulnerable to costly disruptions and reputational damage.

In this type of landscape, the need for a comprehensive solution that ingests data while also synthesizing it into actionable intelligence becomes increasingly apparent. I would like to tell you about BMC Helix IT Operations Management (ITOM), a suite of software that goes beyond telemetry and event data to address your pain points head-on, empowering your IT to prevent incidents and delivering composite artificial intelligence (AI)-powered services (predictive, causal, and generative AI) for fast innovation.

Observability—Your key to reliable service delivery across complex IT systems

In a recent customer briefing, we discussed observability and how BMC helps organizations cope with the overwhelming volume of data and events. We shared a short explanatory video and a few key differentiators that demonstrate how BMC goes beyond processing telemetry.

There are many key features that BMC offers as part of BMC Helix ITOM, but I will highlight three here:

1. The FIRST key feature we’ll explore is dynamic service modeling (DSM), which revolutionizes how IT teams discover and manage services within their infrastructure. Through automated, real-time service modeling, BMC Helix ITOM ingests topology from BMC and third-party sources, identifying dependencies and relationships across the IT landscape, including infrastructure, applications, and software, to provide crucial visibility and context. By ensuring accuracy and consistency through automated reconciliation, this approach transforms the traditional methods, offering a comprehensive and dynamic understanding of the IT environment.

2. The SECOND key feature I’d like to highlight is root cause isolation. The business service is a complex and ephemeral graph of configuration items (CIs) and their relationships. Without the connected topology you get with DSM, and a business service to provide context for the domain, root cause isolation would not be possible.

Telemetry, events, and change requests are automatically mapped to the business service and impacted CIs. Causal AI is used to identify the root cause CI and correlate any impactful change requests. This eliminates the blame game when dealing with thousands of changes in the system.
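
To make the change-correlation idea concrete, here is a minimal sketch (not BMC's implementation) of matching recent change requests to a root cause CI within a lookback window; the record structure, CI names, and 24-hour window are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Illustrative change records; in practice these would come from an ITSM system.
changes = [
    {"id": "CRQ001", "ci": "payment-api", "applied_at": datetime(2024, 4, 1, 9, 30)},
    {"id": "CRQ002", "ci": "core-switch-07", "applied_at": datetime(2024, 4, 1, 10, 5)},
    {"id": "CRQ003", "ci": "payment-api", "applied_at": datetime(2024, 3, 28, 14, 0)},
]

def correlate_changes(root_cause_ci, incident_start, changes, lookback_hours=24):
    """Return change requests against the root cause CI within the lookback window."""
    window_start = incident_start - timedelta(hours=lookback_hours)
    return [
        c for c in changes
        if c["ci"] == root_cause_ci and window_start <= c["applied_at"] <= incident_start
    ]

print(correlate_changes("payment-api", datetime(2024, 4, 1, 11, 0), changes))
# Only CRQ001 is returned; CRQ003 falls outside the 24-hour window.
```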

3. The THIRD key feature I’ll highlight today is the BMC HelixGPT-powered best action recommendation (BAR) capability. BMC Helix ITOM ingests textual data from telemetry, service management, and vulnerability systems, and ships a pre-trained large language model (LLM) with domain expertise.

The composite AI pipeline, based on predictions, impact, and root cause isolation, can then contextually ask BMC HelixGPT to summarize problem scenarios, surface log insights, and provide a BAR based on historical data.

As an example, imagine that you performed a code update, which resulted in increased CPU utilization that significantly strained host resources. A BAR can provide tailored recommendations by leveraging its pre-trained domain expertise and optionally fine-tuning it with customer data to address resource-related issues efficiently.

Achieving seamless functionality and uninterrupted service delivery is paramount. However, overwhelming data and events can lead to operational inefficiencies and prolonged downtime. With features like dynamic service modeling, root cause isolation, and best action recommendation, BMC Helix ITOM is the only fully integrated, open observability and AIOps solution with AI/ML-powered discovery, monitoring, optimization, automation, self-healing, and remediation of services, empowering IT to prevent incidents and innovate quickly. Click here to discover how BMC Helix ITOM can help you revolutionize your IT operations.

7 Ways BMC HelixGPT Reduces Manual Toil to Achieve Zero-Touch Operations

Modern enterprise applications are deployed in hybrid multi-cloud environments. The system of engagement for end users is supported by modern, cloud-scale architectures deployed as containerized microservices. The system of record requires hybrid architectures spanning cloud to mainframe for seamless integration between modern and legacy systems.

For end users—global customers, partners, or employees who interface with mission-critical business services for financial transactions or administrative tasks—business services are synonymous with the brand. Bad service quality or outages can have a negative impact, resulting in financial penalties and brand damage. The attention span of a mobile end user is measured in seconds; a bad mobile experience in any industry can result in subscriber churn. App stores place the power of switching loyalties in the hands of the end user, who can download, install, and delete applications at will in seconds. This is why it is critical to measure end user experience in the context of end-to-end business service performance and availability.

Organizations are constantly looking to improve operational efficiencies, reduce errors, and optimize productivity; however, these same organizations are also challenged with the burden of manual toil. The goal is to achieve zero-touch operations, where processes and operations require little to no manual intervention during unexpected disruptions to business services that span a multi-layered public and private IT landscape.

For zero-touch operations, the solution—in addition to ingesting observability artifacts and integrating with service management solutions for incident and change—will apply correlation, predictive, causal, and generative artificial intelligence, and machine learning (AI/ML) algorithms to recommend and automate actions to remove manual steps. Additionally, a conversational AI-based experience enriches and personalizes the user experience.

In this blog post, I would like to highlight seven core AIOps capabilities that are required by organizations to successfully achieve zero-touch operations (see Figure 1):


Figure 1. Seven core AIOps capabilities for managing complex and constantly changing hybrid cloud environments.

These seven core AIOps capabilities are used to apply domain context, derive actionable insights from data, and, finally, automate the best action with confidence.

Apply domain context and derive actionable insights

The criticality for an impacted business service is measured by the service impact. Figure 2 below overlays the hybrid multi-cloud deployment with the first four AIOps capabilities that model the service and build Situation awareness to identify root cause and assess service impact.


Figure 2. Model a dynamic service (1), build Situation awareness based on alert/event fatigue (2), perform root cause isolation (3), and assess service impact (4).

1. Dynamic service model

Dynamic service models are used to represent a business service (e.g., mobile banking, voicemail, etc.). The business service adds domain-specific knowledge, which in turn adds context and improves decision-making.

A business service is modeled from topology ingested by discovery and monitoring tools. The goal is to eliminate any manual tasks and automatically reconcile and dynamically update the end-to-end topology that spans application to network and cloud to mainframe. In Figure 2 above, the grey oval shape (1) encapsulates all the configuration items (CIs) and their relationships that represent a business service.

The dynamic service model provides the underlying connected topology as a foundation to apply AI/ML algorithms. The model represents domain-specific knowledge of the business service. Putting a boundary around the impactful CIs helps make informed predictions, root cause determination, and recommendations.

Business services are built automatically starting with a CI that represents a service (e.g., application name, database cluster name, VMware cluster name, Kubernetes name space, or network switch). Based on the starting point CI, related CIs are pattern-matched to automatically build and update the service model. This removes the need to manually build or maintain service models. You can refer to my previous blog post for additional details on service modeling and blueprints.
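
As a rough illustration of that pattern-matching idea (not the actual service modeling engine), the sketch below expands a service boundary from a starting CI by walking only to allowed CI types; the topology, CI names, and type filter are invented.

```python
from collections import deque

# Toy topology: CI -> (type, neighbours). Names and types are invented for illustration.
topology = {
    "mobile-banking-app": ("application", ["payments-svc", "auth-svc"]),
    "payments-svc":       ("software",    ["k8s-pod-1", "k8s-pod-2"]),
    "auth-svc":           ("software",    ["k8s-pod-3"]),
    "k8s-pod-1":          ("container",   ["host-a"]),
    "k8s-pod-2":          ("container",   ["host-b"]),
    "k8s-pod-3":          ("container",   ["host-a"]),
    "host-a":             ("host",        ["switch-7"]),
    "host-b":             ("host",        ["switch-7"]),
    "switch-7":           ("network",     []),
    "unrelated-db":       ("database",    []),
}

ALLOWED_TYPES = {"application", "software", "container", "host", "network"}

def build_service_model(seed_ci):
    """Breadth-first expansion from a seed CI, keeping only allowed CI types."""
    boundary, queue = {seed_ci}, deque([seed_ci])
    while queue:
        ci = queue.popleft()
        _, neighbours = topology[ci]
        for nbr in neighbours:
            if nbr not in boundary and topology[nbr][0] in ALLOWED_TYPES:
                boundary.add(nbr)
                queue.append(nbr)
    return boundary

print(sorted(build_service_model("mobile-banking-app")))  # unrelated-db stays out
```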

2. Situation awareness

A Situation represents awareness around an active issue that is impacting or has the potential to impact a business service (e.g., network switch issues could potentially impact mobile banking users if not triaged and fixed in a timely manner).

Alert/event fatigue is a common problem for the service desk that is compounded by the complex and distributed nature of modern applications. For example, a network problem can trigger an alert/event storm that impacts every layer of the business service, from application to network and everything in between.

Deciphering the signal from the noise is beyond human cognitive abilities and requires an AI/ML-based approach to build Situation awareness of the problem. Correlation AI/ML algorithms are used to group alerts/events across the dimensions of time and text using clustering and natural language processing (NLP) algorithms. The resulting noise reduction draws focus to the problematic cluster(s).
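
A toy sketch of the time-and-text grouping idea follows; it stands in for the real NLP and clustering models with a simple token-overlap similarity and a fixed time window, and all alerts and thresholds are invented.

```python
# Toy alerts: (timestamp in seconds, message). Values are invented for illustration.
alerts = [
    (0,   "CPU utilization high on host-a"),
    (30,  "CPU utilization critical on host-a"),
    (45,  "Response time degraded for payments-svc"),
    (900, "Disk space low on host-b"),
]

def similar(msg_a, msg_b, threshold=0.3):
    """Jaccard overlap of lowercase tokens as a stand-in for NLP text similarity."""
    a, b = set(msg_a.lower().split()), set(msg_b.lower().split())
    return len(a & b) / len(a | b) >= threshold

def cluster_alerts(alerts, time_window=300):
    """Group alerts that are close in time AND textually similar to a cluster member."""
    clusters = []
    for ts, msg in sorted(alerts):
        for cluster in clusters:
            if any(abs(ts - t) <= time_window and similar(msg, m) for t, m in cluster):
                cluster.append((ts, msg))
                break
        else:
            clusters.append([(ts, msg)])
    return clusters

for c in cluster_alerts(alerts):
    print(c)  # the two CPU alerts group together; the others form their own clusters
```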

The Situation clusters alerts/events in relation to the impacted CIs and is represented as a slice of the dynamic service model, as shown above in the blue oval shape (2) in Figure 2. While this Situation puts a lens on the brewing problem, at this stage, there is no root cause isolation or indication of current or future service impact. Noise reduction based on event/alert clustering is not enough to determine root cause.

3. Root cause isolation

Root cause isolation is required to identify the culprit in an active Situation, and then engage and/or automate best actions to restore an impacted business service.

Correlation AI/ML algorithms do a good job of noise reduction; however, they lack causation, which is required for root cause isolation. By the same token, algorithmic root cause isolation is required to eliminate the blame game that takes place in a typical war room. The ultimate goal is to minimize the impact of an outage and restore service.

To be deterministic and explainable, root cause isolation requires domain-specific knowledge. To achieve this, dynamic service models provide a third dimension of topology. Causal AI/ML algorithms are used to build a causal graph for the active Situation, perform graph traversal in the context of the alerts/events, and identify the root cause CI(s). In our example, the root cause CI is the network switch, represented by a green circle (3) in Figure 2 above.
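
The following is a simplified stand-in for causal graph traversal (not the product's causal AI): given a dependency graph and the set of alerted CIs, it surfaces the alerted CIs whose own dependencies are healthy, which is one way to approximate the root cause candidate. The topology and CI names are invented, and networkx is assumed to be available.

```python
import networkx as nx

# Directed dependency graph: an edge A -> B means "A depends on B".
g = nx.DiGraph()
g.add_edges_from([
    ("mobile-banking-app", "payments-svc"),
    ("payments-svc", "host-a"),
    ("auth-svc", "host-a"),
    ("host-a", "switch-7"),
])

alerted_cis = {"payments-svc", "host-a", "switch-7"}

def root_cause_candidates(graph, alerted):
    """Alerted CIs whose own dependencies (graph descendants) are not alerted."""
    candidates = set()
    for ci in alerted:
        dependencies = nx.descendants(graph, ci)
        if not dependencies & alerted:
            candidates.add(ci)
    return candidates

print(root_cause_candidates(g, alerted_cis))  # -> {'switch-7'}
```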

The active Situation is automatically updated with the root cause CI. However, at this stage, the impact and urgency of the Situation is unknown. This can be determined by predicting the service impact.

4. Service impact predictions

Service impact analysis is required during an active Situation to assess the current or potential future impact to a business service. This helps identify the criticality to determine the best action needed to engage with ticketing and automation systems.

Service impact to a mission-critical business service can result in brand damage and financial penalties for an organization. Proactively assessing service impact is critical to prioritize the criticality of the active Situation, as shown in the red circle (4) in Figure 2 above.

Service impact is linked to the key performance indicators (KPIs) that are used to measure and assess the health and performance of a business service. KPI examples are service- and industry-specific (e.g., end user response time for a mobile banking application, voice quality for a voicemail application, user sentiment from social media feeds, number of transactions processed to measure revenue, or saturation events for infrastructure resources).

Predictive AI/ML algorithms are used to assess current or future KPI impact to the business service. They are applied against historical and current data to identify patterns and predict future outcomes. Common proactive approaches include:

  • Predictions are used to report near-future deviations from threshold and/or normal behavior. This helps fix issues before users are impacted.
  • Forecasting is used to assess and report on resource saturation. Proactively fixing resource saturation helps with planning and avoids potential outages.
  • Univariate anomaly detection is used to observe metrics over time to find and report on outliers. This helps identify the needle(s) in the haystack, especially when looking across thousands of metrics (see the sketch that follows this list).
  • Multivariate anomaly detection is used to compare multiple metrics to find and report deviations from normal metric patterns. This helps find abnormal trends in data patterns across multiple dimensions.
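
To make the univariate case above concrete, here is a minimal rolling z-score sketch; the window size, threshold, and CPU series are arbitrary illustrations, and production anomaly detection uses far more robust, trained baselines.

```python
from statistics import mean, pstdev

def rolling_zscore_anomalies(series, window=10, threshold=3.0):
    """Flag points that deviate strongly from the trailing-window baseline."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), pstdev(baseline)
        if sigma == 0:
            continue
        z = (series[i] - mu) / sigma
        if abs(z) >= threshold:
            anomalies.append((i, series[i], round(z, 1)))
    return anomalies

# Synthetic CPU-utilization series with a spike at the end (values invented).
cpu = [41, 43, 42, 44, 40, 42, 43, 41, 42, 44, 43, 42, 97]
print(rolling_zscore_anomalies(cpu))  # the 97% sample is flagged as an outlier
```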

With the Situation in place, the AIOps solution has successfully modeled the business service, reduced noise, identified the root cause, and assessed the service impact to establish criticality. The Situation also provides contextual input (prompts) for generative AI algorithms to accurately build a human-readable summary as shown in Figure 3 below. Note how context is captured based on the root cause, service impact, and causal chain of alerts/events to write an accurate problem summary based on event/alert data.


Figure 3. Human-readable generative AI problem summary identifying root cause CI and its impact.
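
As a hedged illustration of how Situation context can become a prompt, the sketch below assembles root cause, impact, and the causal chain into a summarization request; the field names are invented, and the commented-out `llm.complete` call is a hypothetical placeholder, not the BMC HelixGPT API.

```python
def build_summary_prompt(situation):
    """Assemble Situation context into a prompt for a summarization model."""
    causal_chain = "\n".join(f"- {e}" for e in situation["events"])
    return (
        "Summarize this IT incident for a service desk audience.\n"
        f"Business service: {situation['service']}\n"
        f"Root cause CI: {situation['root_cause_ci']}\n"
        f"Service impact: {situation['impact']}\n"
        f"Causal chain of events:\n{causal_chain}\n"
        "Keep the summary to two sentences."
    )

situation = {  # Illustrative values only.
    "service": "Mobile Banking",
    "root_cause_ci": "core-switch-07",
    "impact": "login response time degraded for 12% of users",
    "events": [
        "Interface down on core-switch-07",
        "Packet loss on host-a",
        "Response time breach on payments-svc",
    ],
}

print(build_summary_prompt(situation))
# summary = llm.complete(prompt)  # hypothetical LLM client call
```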

Automate with confidence

Once the Situation has matured, the next step is recommending and automating the best action to take based on past Situations and historical ticket resolutions. The last three AIOps capabilities (5-7 below) help you make an informed decision and automate with confidence.

5. Best action recommendation

Best action recommendation is based on the analysis of past Situations represented in a knowledge graph and the processing of historical ticket data using generative AI.

Knowledge graph and past Situations

The knowledge graph is a graph-based reasoning framework used to represent past Situations. The Situation captures the causal chain of alerts/events across the different layers (cloud to mainframe), isolates the root cause CI, and identifies the end user impact.

The Situation enriches the knowledge graph with a semantic meaning that represents real-world knowledge, allowing for intelligent reasoning and inference. A good example is the generative AI Situation summary described in the previous section, which summarizes the problem across text, time, and topology for event/alert data. The knowledge graph also pattern-matches similar Situations from the past, as shown in Figure 4 below.


Figure 4. Similar Situation aggregated view for the past four months.

The knowledge graph learns and grows over time, recommending the best action based on past behavior. Past similar Situations in the knowledge graph are clustered and pattern-matched to provide automation recommendations based on the success or failure of past actions, as shown in Figure 5 below.


Figure 5. Automation recommendation based on past Situations.
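
A toy version of that pattern-matching step might look like the following: new Situation fingerprints are compared against past ones, and actions are scored by how often they succeeded in similar Situations. The fingerprint encoding, history, and similarity threshold are all invented for illustration.

```python
# Past Situations: fingerprint (set of signatures), the action taken,
# and whether it resolved the issue. All values are invented.
history = [
    ({"switch:interface_down", "host:packet_loss"}, "restart_interface", True),
    ({"switch:interface_down", "host:packet_loss"}, "reboot_host", False),
    ({"db:lock_contention", "app:slow_query"}, "kill_blocking_session", True),
]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def recommend_action(fingerprint, history, min_similarity=0.5):
    """Score actions from similar past Situations, weighting successful outcomes."""
    scores = {}
    for past_fp, action, succeeded in history:
        sim = jaccard(fingerprint, past_fp)
        if sim >= min_similarity:
            scores[action] = scores.get(action, 0.0) + (sim if succeeded else -sim)
    return max(scores, key=scores.get) if scores else None

new_situation = {"switch:interface_down", "host:packet_loss", "app:latency_breach"}
print(recommend_action(new_situation, history))  # -> 'restart_interface'
```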

Historical ticket and large language models

Additionally, historical ticket data (incident, change, defect) is used to train a large language model (LLM) that provides a best action recommendation based on historical resolutions. Figure 6 below shows a generative AI-based Situation event summary, along with two recommended actions based on the processing of historical ticket data. Using generative AI, the trained model is asked, for example, how to fix a storage issue. Based on the results, the user can automate or manually run a recommended automation, ask the LLM to generate an automation script using the code wizard, and/or chat with the model to get more information. In the example below, we click on “Ask BMC HelixGPT” to ask questions and better understand the issue’s impact and which team has solved the issue in the past.


Figure 6. Generative AI problem summary and best action recommendation with actionable insight and conversational UI.

The Situation at this stage has enough context to confidently engage with other solutions to create incident(s) and change request(s), and then act by running automation tasks using automation tools. Based on root cause isolation, service impact prediction, and recommended best action, the system can automate with confidence.

6. Automatic ticket management

With the root cause and service impact identified, we can now automate the creation, prioritization, and routing of a ticket.

Create a single incident ticket for the Situation (not one for each individual event) and target the right support group based on the CI(s) identified as the root cause. This eliminates the need for the first level of support to triage the Situation, and a second level of noise reduction is applied to reduce help desk incident fatigue. This also bypasses the need for a war room and eliminates the back and forth between different monitoring teams to establish root cause and ownership to fix the issue.

Assign ticket severity based on the current or predicted service impact. This helps prioritize the Situation so that support staff can focus and work on the most critical business-impacting issues.
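
A minimal sketch of those two steps, under the assumption of a simple routing table keyed by root cause CI type and a severity map keyed by predicted impact, might look like this; the field names are illustrative and not the BMC Helix ITSM schema.

```python
# Illustrative routing and severity tables; not an actual ITSM schema.
SUPPORT_GROUP_BY_CI_TYPE = {"network": "Network Ops", "host": "Infrastructure", "software": "App Support"}
SEVERITY_BY_IMPACT = {"none": 4, "degraded": 2, "outage": 1}

def situation_to_incident(situation):
    """Build one incident for the whole Situation instead of one per event."""
    return {
        "summary": situation["summary"],
        "assigned_group": SUPPORT_GROUP_BY_CI_TYPE.get(situation["root_cause_ci_type"], "Service Desk"),
        "severity": SEVERITY_BY_IMPACT.get(situation["predicted_impact"], 3),
        "related_events": situation["event_ids"],  # deduplicated events stay attached
    }

situation = {
    "summary": "Interface down on core-switch-07 impacting Mobile Banking",
    "root_cause_ci_type": "network",
    "predicted_impact": "degraded",
    "event_ids": ["EVT-101", "EVT-102", "EVT-103"],
}
print(situation_to_incident(situation))
```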

7. Intelligent automation

Run the recommended automation based on resolution insights from similar Situations and past tickets. The recommended action(s) is based on the success of past actions and ticket resolutions, eliminating the need for support teams to manually process historical Situations and tickets. Change request approvals can also be automated depending on the change request risk assessment (e.g., restarting pods or scaling out virtual machines may be considered low risk, allowing for automated approvals).

To summarize, zero-touch operations can help revolutionize how organizations minimize customer outages and improve the end user experience. Adoption of AI/ML and automation with the seven core AIOps capabilities discussed in this blog provides the foundation to apply domain-centric service context, derive actionable insights from monitoring data that spans a complex multi-cloud IT landscape, and, finally, automate a best action recommendation based on historical success with confidence.

The AIOps capabilities in our BMC Helix Operations Management and BMC Helix Discovery solutions provide the foundation, and BMC HelixGPT brings the domain knowledge—typically held by subject matter experts—to apply correlation, predictive, causal, and generative AI/ML algorithms to solve complex IT operational issues.

A Moving Target: Rethinking Service Modeling for Dynamic Workloads

In the news, we regularly hear about how changes or faults result in major outages at large corporations, impacting millions of end users (or thousands of airline passengers, in the case of the Notice to Air Missions (NOTAM) system outage earlier this year).

Changes or issues with downstream dependencies like applications, infrastructures, or networks can impact critical services, resulting in financial penalties and brand damage.

Traditional service modeling practices

Historically, configuration management databases (CMDBs) have helped organizations identify and resolve issues faster, manage risks associated with change, and make more informed decisions. Configuration item (CI) details are updated in the CMDB either manually or, more commonly, using a discovery tool that runs infrastructure and network scans on a periodic basis (daily or weekly in most cases). Manually maintaining CIs and their relationships has been a challenge. Most customers I have talked to regularly express concerns regarding the accuracy of their CMDB.

Service modeling is used to fence off targeted CIs to define a service boundary that represents a business, application, or technical service. Over time, users have modeled thousands of such services, which have proven difficult to maintain and keep current based on the shifting IT landscape.

Service models help organizations identify critical business services and their downstream dependencies in order to measure service risk and fault. When enriched with artificial intelligence (AI) and machine learning (ML) insights, models that represent a critical business service should help mitigate outages and prevent bad headlines; however, service modeling processes and technology need to evolve and take modern workloads into consideration.

The challenges created by modern workloads

Modern applications, infrastructures, and networks are software-defined, virtual, and ephemeral in nature. CIs and their relationships to the modern IT landscape change very frequently based on cloud-scale workloads. DevOps practices introduce new features daily, and cloud architecture can scale on demand depending on user load; this helps bring new features to market faster, optimize resources for cost savings, and improve the customer experience.

This continuous change, however, poses a major challenge for traditional service modeling practices. Discovery tools continue to play an important role in this modern landscape for asset discovery, change impact analysis, and dependency mappings; however, they are not enough. A dynamic approach to service modeling is required to deal with near real-time changes in hybrid-multicloud environments.

Moving to dynamic service modeling

Enrich service models with CIs from monitoring tools
Monitoring tools provide additional CI details and their relationships at a higher frequency and, in the case of real user and application performance monitoring, do it based on real user transactions, making it more accurate (refer to Figure 1 below for additional details). For example:

  • Real user or customer experience monitoring tools provide end user experience details with geolocation; this helps overlay service health and impact with customer experience metrics (e.g., click-through or bounce rate for websites or delay and jitter for voice).
  • Application performance monitoring tools provide application-specific details on software components and their relationships across containers, cloud, distributed, and mainframe topologies. This provides an accurate and up-to-date snapshot of an end-to-end user journey, from mobile to the mainframe.
  • Infrastructure and network monitoring tools provide low-level details and connect virtual and physical devices. These tools complement discovery tool scans.

Figure 1. Scan and update frequencies associated with different monitoring tools.

Take advantage of a highly performant graph database
Traditional CMDBs use relational databases for storing CI details and their relationships, which addresses asset and change management use cases. However, to account for modern operational use cases, there is also a need for a graph database to operate at scale, account for changing CI relationships, and maintain a history of the moving landscape.

Figure 2 below shows a typical CI landscape with siloed layers. There is a need to reconcile discovered CI relationships across the application, container, infrastructure (cloud, distributed), network, and mainframe layers. This is a hard problem and has been discussed in detail in my recent blog post, “How BMC HelixGPT-Powered AIOps Connects Observability Silos for Faster Probable Root Cause Isolation.”


Figure 2. Layers with reconciled topology and service CIs shown in red.

Fortunately, the next generation of service modeling datastores are purpose-built to handle highly complex and connected data structures and use highly performant graph traversal algorithms to query data.

Next, model your business service
Assuming we have a dynamic and reconciled graph database of our IT landscape, the next step is to model a business service, where the goal is to connect all the CIs representing the service as shown in red above. This is a three-step process:

  1. Identify one or more applications used by end users. If there is an application performance monitoring tool, then that is the best starting point. Provide application details as the inputs to the service modeling tool.
  2. Dynamically traverse all these layers to automatically stitch together all dependent CIs for the application service. This requires a reusable blueprint that dynamically filters and pattern-matches to build the end-to-end service. Our example shows end users connecting to cloud-based and distributed applications that eventually connect to a database on the mainframe (see Figure 3 below).
  3. Identify end user key performance indicators and use them to calculate the service health score (a minimal scoring sketch follows Figure 3 below). A problem with a network switch port flapping is only critical, for example, if it is impacting an end-user experience like response time or voice quality.

Figure 3. Business service with connected topology.
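
As a small illustration of step 3 above, the sketch below normalizes a couple of end-user KPIs against their targets and combines them into a weighted health score; the KPIs, targets, weights, and status bands are invented.

```python
# Illustrative KPI targets and weights for a business service; all values invented.
kpis = {
    # name: (current value, target value, weight); lower is better for both KPIs here.
    "mobile_response_time_ms": (450, 300, 0.6),
    "error_rate_pct":          (1.2, 1.0, 0.4),
}

def health_score(kpis):
    """Weighted score in [0, 100]; each KPI contributes target/current capped at 1."""
    score = 0.0
    for current, target, weight in kpis.values():
        score += weight * min(target / current, 1.0)
    return round(100 * score)

score = health_score(kpis)
status = "healthy" if score >= 90 else "at risk" if score >= 70 else "impacted"
print(score, status)  # -> 73 at risk
```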

Model a business service using reusable blueprints
Blueprints are reusable, can be created and maintained using a graphical editor, and can apply predefined patterns and filters to a connected topology stored in the graph database. Blueprints understand vendor- or industry-specific CI relationship mapping. Unique topology models exist for monitoring and infrastructure solutions like AppDynamics, Dynatrace, SolarWinds, VMware, Kubernetes, OpenShift, and mainframe, etc.

Figure 4 below shows an example blueprint for an AppDynamics and Kubernetes deployment. The AppDynamics blueprint starts with an application name and connects software components to virtual/physical hosts and the underlying network devices, whereas the Kubernetes blueprint starts with a namespace and then connects to deployments, pods, clusters, hosts, and network devices.


Figure 4. AppDynamics and Kubernetes blueprint example and reconciliation with virtual/physical infrastructure and network devices.
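
One way to picture a blueprint is as an ordered list of typed hops applied against a typed edge store, as in the hypothetical Kubernetes sketch below; the relationship labels, CI names, and hop order are invented and do not reflect the actual blueprint format.

```python
# Typed topology edges: (source CI, relationship, target CI, source type, target type).
edges = [
    ("payments-ns", "contains", "payments-deploy", "namespace", "deployment"),
    ("payments-deploy", "manages", "payments-pod-1", "deployment", "pod"),
    ("payments-pod-1", "runs_on", "worker-node-3", "pod", "host"),
    ("worker-node-3", "connected_to", "leaf-switch-9", "host", "network_switch"),
    ("billing-ns", "contains", "billing-deploy", "namespace", "deployment"),
]

# Blueprint: the ordered hop patterns to follow from the starting CI type.
K8S_BLUEPRINT = [
    ("namespace", "contains", "deployment"),
    ("deployment", "manages", "pod"),
    ("pod", "runs_on", "host"),
    ("host", "connected_to", "network_switch"),
]

def apply_blueprint(start_ci, blueprint, edges):
    """Expand a service model by following the blueprint's hop patterns in order."""
    service_cis, frontier = {start_ci}, {start_ci}
    for src_type, rel, dst_type in blueprint:
        next_frontier = {
            dst for src, r, dst, st, dt in edges
            if src in frontier and r == rel and st == src_type and dt == dst_type
        }
        service_cis |= next_frontier
        frontier = next_frontier
    return service_cis

print(sorted(apply_blueprint("payments-ns", K8S_BLUEPRINT, edges)))
# billing-ns CIs are excluded: the blueprint only expands from the chosen namespace.
```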

Figure 5 below shows the topology from the AppDynamics monitoring tool, which is limited to application-only topology and has no knowledge of the underlying physical infrastructure and network.


Figure 5. AppDynamics topology (limited to application components).

The blueprint and graph database will need to do the hard work to reconcile topology across the many layers (application, infrastructure, and network).

When building a dynamic service, the service modeler will just select the software cluster (which is an application name in the AppDynamics model). The blueprint will then automatically model a service that connects the application topology to physical hosts and network devices discovered using a discovery tool and further enriched by other monitoring tools. In seconds, the service modeler can automatically create a service model—that spans application to network and cloud to mainframe—for AIOps, change management, and asset management use cases. This service model is dynamic and will be automatically maintained based on the blueprint. If, in the future, CIs are added or removed, the pattern matching will dynamically account for any changes and maintain a current service model.

Figure 6 shows the result of using the AppDynamics blueprint. You can see the relationships between software components, runtime processes, virtual hosts, physical hosts, and interface cards/ports connected to a switch. This level of detail is required for change impact analysis and AI/ML algorithms to perform root cause analysis when a critical business service is at risk or at fault.


Figure 6. AppDynamics blueprint result (application to network topology).

Show inter-service impacts by modeling service dependencies
Blueprints for infrastructures or core networks can also be used to model low-level virtual or physical shared services, and customer-facing services can be modeled to depend on these shared services. A service hierarchy map helps provide answers on which services, for example, are impacted by a shared service like core network, infrastructure, or storage. Figure 7 below shows how a core network issue impacts a customer-facing retail outlet service.


Figure 7. Shared service impact on retail outlet application.

In a future blog post, I will show how we can easily enhance the AppDynamics blueprint to build an uber-blueprint that includes multi-cloud, mainframe, and storage for a distributed microservices application.

How BMC HelixGPT-Powered AIOps Connects Observability Silos for Faster Probable Root Cause Isolation

In the news last year, a core network change at a large service provider resulted in a major outage that not only impacted end consumers and businesses, but also critical services like 911 and Interac. At home, phone and internet services were down for the workday as millions were without service.

Modern distributed and ephemeral systems are connecting us better than ever before, and the latest ChatGPT phenomenon has opened the possibility for new and mind-blowing innovations. However, at the same time, our dependency on this connected world, along with its nonstop innovations, challenges our ethos with important questions and concerns around privacy, ethics, and security, and challenges our IT teams with outages of often unknown origin.

When it comes to system outages, artificial intelligence for IT operations (AIOps) solutions with the right foundation can help reduce the blame game so the right teams can spend valuable time restoring the impacted services rather than improving their mean time to innocence (MTTI) score. In fact, much of today’s innovation around ChatGPT-style algorithms can be used to significantly improve the triage process and user experience.

In the monitoring space, impact analysis for services spanning application to network or cloud to mainframe has known gaps that, if solved, can have a big impact on service availability. Today, these gaps require human intervention and result in never-ending bridge calls where each siloed team responsible for applications, infrastructure, network, and mainframe are in a race to improve their MTTI score. This, unfortunately, also has a direct impact on customer experience and brand quality.

The challenge faced by teams is a layering issue, or “layeritis.” Figure 1 below shows the different layers that can contribute to a typical business service like mobile banking or voicemail:


Figure 1: This diagram shows observability silos and the resulting challenge of reconciliation.

For each layer, different kinds of monitoring solutions are used. Each solution, in turn, has its own team and applies different techniques like code injection, polling, or network taps. This wide spectrum of monitoring techniques eventually generates key artifacts like metrics, events, logs, and topology that are unique and useful in the given solution but operate in silos and do not provide an end-to-end impact flow.

Tool spam leads to a noise reduction challenge, which many AIOps tools solve today with algorithmic event noise reduction using proven clustering algorithms. However, in practice, this has not been proven to reveal root cause. The hard problem is root cause isolation across the layers, which requires a connected topology (knowledge graph) that spans the multiple layers and can deterministically reconcile devices and configuration items (CIs) across the different layers.


Figure 2: The challenge of root cause isolation across the siloed application, infrastructure, network, and mainframe layers.

Seven steps to cure layeritis:

The solution to an AIOps layeritis challenge requires planning and multiple iterations to get it right. Once steps 1–3 are in a good state, steps 4–7 are left to AI/machine learning (ML) algorithms to decipher the signal from the noise and provide actionable insights. The seven steps are as follows:

  1. Data ingestion from monitoring tools representing the different layers to a common data lake that include metrics, events, topology, and logs.
  2. Automatic reconciliation across the different layers to establish end-to-end connectivity.
    a. Since end user experience is tied to service health score, include key end-user performance metrics like browser response time or voice quality.
    b. Application topology to underlying virtual and physical infrastructure for cloud, containers, and private data centers (e.g., application performance monitoring (APM) tools may connect to the virtual host, but will not provide visibility to the underlying physical infrastructure used to run the virtual hosts).
    c. Infrastructure connectivity to the underlying virtual and physical network devices like switches, routers, firewalls, and load balancers.
    d. Virtual and physical infrastructure connectivity to mainframe services like IBM® Db2®, IBM® MQ®, IBM® IMS, and IBM® CICS®.
  3. Dynamic service modeling to draw boundaries and build business services based on reconciled layers.
  4. Clustering algorithm for noise reduction of events from metrics, logs, and events within a service boundary.
  5. Page ranking and network centricity algorithms for root cause isolation using the connected topology and historical knowledge graph (a small sketch follows this list).
  6. Large language model (LLM) and generative AI (GPT) algorithm to build human-readable problem summaries. This helps less technical help desk resources quickly understand the issue.
  7. Knowledge graph updated with the causal series of events, a.k.a. a fingerprint, which is compared with historical occurrences to help make informed decisions on root cause, determine the next best action, or take proactive action on issues that could become major incidents.
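
As a small sketch of step 5, personalized PageRank can be run over the dependency topology with the random walk restarting at CIs that have active events, so that rank concentrates on the shared dependencies those events point back to; the topology is invented and networkx is assumed to be available.

```python
import networkx as nx

# Dependency topology: an edge A -> B means "A depends on B" (names invented).
g = nx.DiGraph([
    ("mobile-app", "payments-svc"), ("mobile-app", "auth-svc"),
    ("payments-svc", "host-a"), ("auth-svc", "host-a"),
    ("host-a", "switch-7"), ("host-b", "switch-7"),
])

alerted = {"payments-svc", "host-a", "switch-7"}

# Restart the random walk only at CIs with active events, so rank accumulates
# on the dependencies those events trace back to.
personalization = {n: (1.0 if n in alerted else 0.0) for n in g}
scores = nx.pagerank(g, personalization=personalization)

ranked = sorted(alerted, key=scores.get, reverse=True)
print(ranked)  # 'switch-7' ranks first as the likely root cause
```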

For algorithms to give positive results with a high level of confidence, good data ingestion is required. Garbage data will always give bad results. For data, organizations rely on proven monitoring tools for the different layers to provide artifacts like topology, metrics, events, and logs. Additionally, with metrics and logs, it’s possible to create meaningful events based on anomaly detection and advanced log processing.

Below are three use cases that focus on common issues today’s IT teams face, all of which can be resolved using AIOps in a single consolidated view to identify the root cause and automate the next best action. Note in each use case that 1) the generative AI-based problem summary reduces event noise and makes it easy for the help desk to understand the issue, 2) the clustering of all events, event deduplication, and single incident creation help with noise reduction, and 3) root cause isolation eliminates the blame game and improves mean time to repair (MTTR).

The reconciliation engine for BMC Helix AIOps capabilities is key to automatically reconciling and building a connected topology from application to network and cloud to mainframe. The reconciled topology is based on CIs and their relationships from monitoring and discovery tools.

Use case 1: Application-only issue (no infrastructure or network impact)

Isolate root cause to the application (an application issue where the infrastructure and network are not impacted), as shown in Figure 3.

In this example, the root cause was isolated to the application software components, monitored by an APM tool.


Figure 3: An application-only issue with no infrastructure or network impact.

In this scenario, AIOps will summarize the issues in Figure 3 as follows:

  1. Generative AI problem summary: “Business Transaction BookingService.storeBooking – Business Transaction Health had caused an increase in response times.”
  2. Noise reduction: Clusters and deduplicates events for the impacted service based on time, text, and topology. Opens a single ITSM incident for the problem cluster (“INC000000543651”).
  3. Root cause isolation: Root cause is the “BookingService.storeBooking” service. The situation explanation/fingerprint provides evidence on how the events started with the application software components and eventually impacted the software cluster.

Use case 2: Network issue impacting host and application

Isolate root cause from application to network (network issue where infrastructure and application are impacted, but not at fault), as shown in Figure 4.

In this example, the root cause was isolated to a network device, monitored by a network monitoring tool.


Figure 4: A network-only issue impacting the host and application.

In this scenario, AIOps will summarize the issues in Figure 4 as follows:

  1. Generative AI problem summary: “The Interface was down.”
  2. Noise reduction: Clusters and deduplicates events for the impacted service based on time, text, and topology. Opens a single ITSM incident for the problem cluster (“INC000000543662”).
  3. Root cause isolation: Root cause is the “pun-clm-n7k-wt2.bmc.com” network device. The situation explanation/fingerprint provides evidence on how the events started with the network device and eventually impacted the host and software pod.

Use case 3: Mainframe database issue impacting distributed applications

Isolate root cause from cloud to mainframe (a mainframe database issue where distributed applications are impacted), as shown in Figure 5.

In this example, the root cause was isolated to the mainframe Db2 database, monitored by a mainframe monitoring tool.


Figure 5: Mainframe database issue impacting upstream applications.

In this scenario, AIOps will summarize the issues in Figure 5 as follows:

  1. Generative AI problem summary: “The lock was held for a long time.”
  2. Noise reduction: Clusters and deduplicates events for the impacted service based on time, text, and topology. Opens a single ITSM incident for the problem cluster (“000000549830”).
  3. Root cause isolation: Root cause is the database lock held by the “DB2DB-DSNDIA-DIA1” database, which is impacting dependent application services. The situation explanation/fingerprint provides evidence on how the events started with a database issue and impacted a software instance.

With a defined service model and reconciled topology, AI/ML algorithms are used to derive insights and root cause with a high level of confidence.

In each use case above, AIOps removes the need for time-intensive investigation and guesswork so your team can see and respond to issues before they affect the business and instead focus on higher-value projects.

With today’s complex challenges across modern, distributed architectures, AIOps solutions can provide visibility and generate proactive insights across the entire application structure, from end user to cloud to data center to mainframe.

Signal-to-Noise Ratio: Bridging the ITSM-ITOM Divide

Over the past couple of years, I’ve been working with a large financial services organization and its director of IT operations, who has a mandate to improve operational efficiency, reduce costs, and rationalize the organization’s tool stack. Of course, they still need to deliver a five-star user experience while doing more with less.

While the organization’s digital transformation projects were delivering better customer self-service, the interaction between new (public cloud) and old (mainframe) technology stacks was proving to be a challenge. DevOps was pushing the boundaries with small and frequent releases, but monitoring was showing blind spots in end-to-end user interactions and slow recovery from system failures was impacting customer confidence. Key business stakeholders were getting nervous because increased customer churn could have a direct impact on revenue.

The director was facing three key challenges in:

  • Observability at a business service level for prioritizing resources during critical situations.
  • Noise reduction and artificial intelligence and machine learning (AI/ML)-based root cause recommendations to automate and speed recovery from poor performance and outages.
  • Operating expenditures (OpEx) costs attributed to service and operations management tool sprawl.

The director talked about a recent outage where his staff was overwhelmed with 30,000 events that spanned user-initiated complaints to the global help desk and system-generated alarms from multiple monitoring tools. While AI/ML techniques were reducing alarm noise, it was still difficult to narrow down root cause, which slowed the resolution time. The organization had no way to correlate incidents with change and monitoring events. They had plenty of tools, but it was like having dozens of watches where none of them could give an accurate time.

To put the problem in mathematical terms, he was dealing with a signal-to-noise ratio problem. Mission-critical business services are supporting an omnichannel user experience (mobile, web, voice, API); systems of engagement are hosted on cloud architectures; and systems of record hosted in private data centers are running distributed and mainframe applications. When something goes wrong, like the 30,000-event problem, a lot of noise is created from the events triggered by complex user transactions.

Noise is increased by end users opening service desk tickets, changes from DevOps automation, alarms based on static service level agreements (SLAs), faults from network equipment, and even automatically generated anomalies based on abnormal system behavior. Deciphering the signal from the noise to find the root cause of a system outage is a complex mathematical problem that’s best addressed by a machine-based algorithmic approach instead of the traditional approach of putting business users and different IT departments on bridge calls. That just adds more noise to the blame game as disparate teams race to improve their mean time to innocence (MTTI).

For the purposes of this discussion, I am going to focus on the need for a more holistic end-to-end approach to automating noise reduction and root cause analysis. A machine-based algorithmic approach that applies trained AI/ML models can help bridge the divide between different teams and their disparate tools to increase the odds of quicker resolution and overall cost savings.

I will use the example of a mobile application where a user transaction that starts on a mobile device depends both on modern microservices running on a public cloud and monolithic applications hosted in private data centers. Slow response time on the mobile device can be difficult to triage because it can be attributed to code execution, infrastructure resource constraints, or network congestion anywhere on this long and complex execution path, as shown in Figure 1.


Figure 1. End-to-End Mobile Transaction Flow

The two areas that we need to focus on to more quickly decipher the signal from the noise are IT service management (ITSM) and IT operations management (ITOM). Per our example above, a mobile application experiencing slow response times can potentially generate thousands of events. Figure 2 below shows how ITSM and ITOM systems exist in silos, with multiple tools that generate tickets, metrics, logs, and alarms, etc.


Figure 2. Event Noise from ITSM and ITOM

To address this, advanced analytics should be applied to ITSM and ITOM datasets to build situational awareness of impacted users and business services. ITSM systems deal with incident and change management and are a goldmine of historical and real-time data of user issues, change/work orders, and knowledge base articles. AI service management (AISM) applies AI/ML techniques like natural language processing (NLP), clustering, and classification to reduce incoming ticket noise, group major incidents/change events, recommend knowledge base articles, or automatically assign support groups.
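
A minimal sketch of the support-group assignment idea, using TF-IDF features and a linear classifier from scikit-learn as a stand-in for AISM's NLP models, might look like this; the ticket history and groups are invented and far smaller than anything realistic.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented history of resolved tickets: summary text -> support group.
summaries = [
    "mobile app login timeout after new release",
    "checkout page slow, API gateway latency",
    "packet loss between data centers",
    "switch port flapping in core network",
    "database lock contention on billing schema",
    "nightly batch job failed on mainframe db2",
]
groups = ["App Support", "App Support", "Network Ops", "Network Ops", "DBA", "DBA"]

# TF-IDF features plus a linear classifier stand in for NLP-based classification.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(summaries, groups)

new_ticket = "core switch interface down, users report packet loss"
print(model.predict([new_ticket])[0])  # likely 'Network Ops'
```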

Figure 3 illustrates how AISM can automatically group and create situational awareness, represented by the cluster of solid circles, diamonds, and squares. This automatic grouping of incidents and change events based on text, time, and relationships can help service desk agents focus their efforts on higher priority issues.


Figure 3. AISM Real-Time Incident Cluster

In our mobile application example, if the business service is impacted by slow response times, real-time incident correlation will help accelerate the triage process by eliminating the manual work that would have been done by multiple service desk agents. Figure 4 shows a real-time incident correlation dashboard where AISM automatically correlates and groups related incidents into a single view.

If the mobile application slowness is caused by a known issue, then this would trigger a runbook for quick remediation. However, in many cases this will require further investigation and correlation with the ITOM systems for further diagnosis. Speeding resolution requires an ITSM and ITOM integration strategy, which can be complex.

Our research shows only 23 percent of organizations have integrated the two disciplines. Without such integration, there are many hand-offs between teams and inefficient, error-prone manual processes that result in delays and customer dissatisfaction.


Figure 4. Real-Time Incident Correlation

ITOM systems deal with observability and automation to improve operational efficiency across applications, infrastructure, and networks. A typical mobile application journey is monitored by many tools that collect performance metrics, alarms, anomalies, logs, and topology information. Additionally, modern DevOps practices have increased the volume and frequency of changes in this dynamic landscape. Triaging and diagnosing production issues is akin to trying to find a needle in a haystack when dealing with very large and diverse datasets.

AI for IT operations (AIOps) applies AI/ML to these datasets to reduce noise and find root cause more quickly. AIOps creates situational awareness by applying algorithms for noise reduction, anomaly detection, clustering, and graph theory to automatically assess impact and arrive at a root cause, represented by the cluster of crossed circles, diamonds, and squares shown in Figure 5. Automatic grouping of alarms, metrics, logs, and topology for an impacted business service accelerates root cause analysis and recovery from outages.


Figure 5. AIOps Probable Root Cause Clusters

In our mobile banking application example, AIOps will accelerate the diagnosis process by identifying a probable root cause for the slow response time. Figure 6 below shows a service monitoring dashboard for an impacted business service where metrics, topology, events, and change are grouped together with an identified root cause. This saves support staff hours, if not days, of triaging and collecting evidence. A probability score for the determined root cause can also increase confidence and lead to automation for self-remediation in the future.


Figure 6. AIOps Service Monitoring Dashboard

Given that most organizations lack an integration strategy between ITSM and ITOM systems, there is still a need to correlate and bridge the gap between the insights discovered by AISM and AIOps to avoid error-prone hand-offs and inefficient manual processes. Figure 7 illustrates how the solid and crossed cluster groups can be automatically correlated using advanced analytics and common relationships. Automatic correlation between real-time incident clusters and probable root cause recommendations requires a service-centric approach that shares a common data model and datastore to correlate across shared resources.


Figure 7. AISM and AIOps Correlation
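
A toy illustration of that shared-data-model correlation: incident clusters and probable-root-cause clusters are linked when they reference the same business service or CIs within overlapping time windows. The records, fields, and 30-minute window are invented.

```python
from datetime import datetime, timedelta

# Invented AISM incident clusters and AIOps root-cause clusters sharing a data model.
incident_clusters = [
    {"id": "ICL-1", "service": "Mobile Banking", "cis": {"mobile-app"},
     "start": datetime(2022, 2, 1, 9, 0)},
    {"id": "ICL-2", "service": "Voicemail", "cis": {"voice-gw"},
     "start": datetime(2022, 2, 1, 9, 5)},
]
root_cause_clusters = [
    {"id": "RCL-7", "service": "Mobile Banking", "root_cause_ci": "core-switch-07",
     "cis": {"mobile-app", "host-a", "core-switch-07"},
     "start": datetime(2022, 2, 1, 8, 55)},
]

def correlate(incidents, root_causes, window_minutes=30):
    """Link clusters that share a service or CIs and start within the same window."""
    links = []
    for inc in incidents:
        for rc in root_causes:
            shares_context = inc["service"] == rc["service"] or inc["cis"] & rc["cis"]
            close_in_time = abs(inc["start"] - rc["start"]) <= timedelta(minutes=window_minutes)
            if shares_context and close_in_time:
                links.append((inc["id"], rc["id"], rc["root_cause_ci"]))
    return links

print(correlate(incident_clusters, root_cause_clusters))
# -> [('ICL-1', 'RCL-7', 'core-switch-07')]
```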

In our mobile example, the system would automatically correlate the reported slow response time incidents and change requests with the probable root cause, as shown in Figure 8 below. This empowers the service desk agent with the context to intelligently engage the right support groups, initiate tasks, or automate runbooks for quick remediation.


Figure 8. Incident and Probable Root Cause Correlation

The ability to improve service delivery with integrated ITSM and ITOM capabilities is what BMC refers to as ServiceOps. It brings technology and data together in one central platform (BMC Helix Platform) that spans organizational silos and has a common data store with open integrations to third-party tools, further strengthened by our recent StreamWeaver acquisition. The entire solution is designed to help your organization reduce the signal-to-noise ratio, improve response time, and provide a better customer experience.
