How BMC DevOps and SRE Teams Prevent Outages with AIOps and Observability
By Stela Udovicic | BMC Software Blogs | Wed, 07 Feb 2024 13:59:53 +0000
https://s7280.pcdn.co/how-bmc-devops-sre-teams-prevent-outages/

Proactively addressing potential situations before they become outages is highly important for site reliability engineers (SREs) and DevOps teams. In this blog, I cover how BMC’s SRE and DevOps teams use artificial intelligence for IT operations (AIOps) to resolve issues before they become problems. I spoke with BMC Senior Director of DevOps Jason Rush and BMC SRE Manager Jason Ferens about it. Their organization supports all BMC SaaS customers, monitoring the reliability of their services and infrastructure. Their charter is to prevent and mitigate customer outages. Since introducing the BMC Helix AIOps solution, this organization has seen a dramatic increase in the prevention of outages and faster resolution of incidents.

Navigating heavy alert load and short response times

Before deploying BMC Helix AIOps, BMC SREs dealt with high alert noise. Customers’ issues would sometimes auto-close, leaving the engineers without a way to track the issue down. Overall, the teams needed a more efficient way to proactively deal with issues that would otherwise become service quality risks in the long run. Enter BMC Helix.

Service reliability improved with AI and observability

The DevOps and SRE teams implemented the solution about a year ago and use it as part of their daily operations. They built custom dashboards to get detailed insight into the performance of all customer environments and critical trends. The impact was an overall reduction in alert noise, shorter mean time to resolution (MTTR), and better health of customers’ environments.

BMC Helix AI-powered Situations enable engineers to understand what’s happening in customer environments, where the root cause is, and what was impacted. This insight shifts them from being reactive, chasing incident after incident, to being proactive. Now, because the IT teams know the health of the services, they can proactively address emerging issues and prevent outages.

Every second counts when troubleshooting an issue, and the BMC DevOps and SRE teams use automation to pinpoint an incident’s root cause. SREs can remediate incidents more precisely and improve and optimize customer environments. Here are the results these teams have achieved since they deployed the BMC Helix AIOps solution:

  • 76 percent improvement in service health
  • 60 percent MTTR reduction (under 30 minutes resolution time)
  • 64 percent of outages prevented
  • 1,034 successful remediations from three intelligent automations in one month

Running health checks to prevent incidents

BMC DevOps and SRE teams also track how often outages are avoided through what they call health checks. Health checks run when no alerts are open; using trend analyses from BMC Helix AIOps, they identify when an issue will become an incident if the trend continues. As a result, engineers can remediate the customer issue, averting the problem altogether.
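To make the idea of a trend-based health check concrete, here is a minimal sketch in Python. It is not how BMC Helix AIOps performs its trend analysis, which this post does not describe. It assumes equally spaced samples of a single metric and a simple least-squares line; the function name `samples_until_threshold`, the sample readings, and the 90 percent threshold are all hypothetical.

```python
from typing import Optional, Sequence


def samples_until_threshold(values: Sequence[float], threshold: float) -> Optional[float]:
    """Fit a least-squares line to equally spaced samples and estimate how many
    more sampling intervals pass before that line reaches `threshold`.

    Returns None when there are fewer than two samples or the trend is flat or
    falling, and 0.0 when the fitted line is already at or above the threshold.
    """
    n = len(values)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no upward trend, so nothing to warn about
    intercept = mean_y - slope * mean_x
    fitted_now = intercept + slope * (n - 1)
    if fitted_now >= threshold:
        return 0.0
    return (threshold - fitted_now) / slope


# Hypothetical file system utilization readings (percent), one per hour.
readings = [60.0, 62.0, 64.0, 66.0, 68.0]
hours_left = samples_until_threshold(readings, threshold=90.0)
print(hours_left)
```

No alert would be open at 68 percent utilization, yet the estimate shows how long the team has before the metric crosses the line, which is the window in which a health check can avert an incident.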

Here is an example of how a potential incident was resolved preventively. Using BMC Helix Service Monitoring and BMC HelixGPT, the configuration item (CI) topology and analysis pointed to a root cause indicating critical status on the customer’s platform.

Figure 1: BMC Service Monitoring and BMC HelixGPT.

The details of the associated incident showed that a third-party tool had increased logging, resulting in high file system utilization. The team implemented BMC Helix Intelligent Automation, which ran when the alert fired and deleted the excessive logs before the file system utilization affected the customer.
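The remediation described above can be sketched as a small Python routine. This is an illustration only and does not use any BMC Helix Intelligent Automation API. The function names, the `*.log` pattern, and the utilization limit are assumptions, and the utilization probe is passed in as a parameter so the logic can be exercised without touching a real file system.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def disk_used_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def prune_oldest_logs(
    log_dir: str,
    max_used_fraction: float,
    used_fraction: Callable[[str], float] = disk_used_fraction,
    pattern: str = "*.log",
) -> List[str]:
    """Delete the oldest files matching `pattern` in `log_dir`, one at a time,
    until utilization is at or below `max_used_fraction`.

    Returns the paths that were deleted, oldest first.
    """
    deleted = []
    # Oldest files first, based on last modification time.
    candidates = sorted(Path(log_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    for path in candidates:
        if used_fraction(log_dir) <= max_used_fraction:
            break  # utilization is back under the limit; stop deleting
        path.unlink()
        deleted.append(str(path))
    return deleted


# Hypothetical invocation when a utilization alert fires:
# prune_oldest_logs("/var/log/thirdparty", max_used_fraction=0.80)
```

Checking utilization before every deletion keeps the cleanup as small as possible: the routine stops as soon as the file system is back under the limit rather than removing every log it finds.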

With BMC Helix AIOps, the Situation was cleared and the performance of other tools continued to be monitored, enabling self-healing operations, a nirvana for SREs.

Preventing outages with ServiceOps insights

The SRE team has something in common with BMC customers, since many of them use BMC Helix AIOps and ServiceOps solutions to manage their day-to-day operations, including alerts and incidents. For instance, SREs developed a custom dashboard that pools data from BMC Helix ITSM and BMC Helix AIOps. This combined visibility allows the SRE team to address open incidents in their full context and accelerate decision-making.

“Using BMC Helix AIOps and ServiceOps gives us a much better ability to prevent incidents.” – Jason Rush, Senior Director, DevOps, BMC

Engineers can easily track the number of incidents per customer and address them accordingly. As a result, the SRE team is more proactive, preventing outages with much higher precision.

An SRE’s 360-degree view of customers’ services and operations

As I mentioned previously, engineering teams can use BMC Helix to see the health of a customer’s environment. At BMC, the SRE team uses another BMC Helix dashboard, “BMC 360 Customer View,” to track the health of all our customer environments. SREs can see the health of all applications and infrastructure, resource utilization (such as CPU), historical ITSM tickets, and everything else they need to know about each customer’s environment. Based on the overview dashboard, the SRE team knows where the issues are and, as a result, can dive into the incidents and services that require their attention.

How SREs remediate Kubernetes issues with BMC Helix

Let’s examine two examples of how AIOps and observability help the BMC SRE team solve common operational issues. The first example is an out-of-memory issue that was restarting a customer’s Kubernetes pods regularly. While the customer was not affected by this issue, it needed to be resolved to prevent it from progressing. Using the BMC Helix AIOps solution, SREs knew that adding more memory to the configuration would resolve all related alerts in the future. This represents an example of a situation that can be further automated with BMC Helix Intelligent Automation.

The other issue is logs filling up on the customer’s Kubernetes pods, which causes what is known as “pod evictions.” A Kubernetes pod eviction occurs when a pod running on a node is terminated, typically because a resource such as memory or disk is running out, and is then rescheduled, possibly on another node. The resulting shutdowns and restarts cause instability. The solution for this issue is upgrading the environment.

The SREs use BMC Helix AIOps to see when alerts related to this issue arise and remediate the threat before the pod is forced into the eviction. The proactive work done by the SRE team prevents the degradation of customers’ service health.

Achieving self-healing operations with ServiceOps, AIOps, and observability

It has been a remarkable journey for this BMC department as they embrace the power of ServiceOps. They’ve evolved from a constant firefighting mode to achieving self-healing operations thanks to the BMC Helix AIOps solution. What’s even more exciting is how the combination of AIOps and observability is ushering in a new era by putting predictive solutions into the hands of DevOps and SRE teams.

“My day starts with AIOps.” – Jason Ferens, Senior SRE Manager, BMC

If you’d like your day to start smoothly, try BMC Helix solutions and follow this link.

New BMC Helix Release Helps IT Resolve Incidents Using Patented AI
Wed, 24 Jan 2024 14:00:04 +0000
https://www.bmc.com/blogs/bmc-helix-itom-intelligent-incident-resolution/

In today’s dynamic cloud environments, IT teams that include DevOps, IT operations, site reliability engineering (SRE), and platform engineering need a way to get accurate, easy-to-set-up insights from large volumes of observability data. Without proper tooling to glean comprehensive insight across thousands of key performance indicators (KPIs), IT teams face slow reaction times, which can lead to service degradation. Manual analyses are no longer enough. The 24.1 release of the BMC Helix IT Operations Management portfolio demonstrates our investment in applying more practical use cases for causal, generative, and predictive AI.

Figure 1: Best Action Recommendation Example

We have enhanced our solutions with Advanced Anomaly Detection and a patented Best Action Recommendation (BAR) capability for AIOps, powered by BMC HelixGPT. We also updated our observability solution, as described in further detail below. With these key enhancements, modern IT teams can:

Improve service reliability with Advanced Anomaly Detection

  • Autodetect all anomalies using one-click configuration across your cloud services and infrastructure
  • Fine-tune anomaly detection to unique environments with adjustable sensitivity
  • Combine static thresholds and machine learning (ML) to identify both the known unknowns and unknown unknowns

Resolve incidents quickly and easily with generative AI

  • Utilize knowledge from past incidents, situations, and remediation actions to reduce mean time to repair (MTTR)
  • Use patented BAR insights to accelerate your response
  • Get a sample code recommendation using BAR

Optimize performance and resource utilization

  • Understand and act on trends more quickly with a combination of Advanced Anomaly Detection and BAR
  • Find anomalies instantly, without domain knowledge or the need for query language
  • Detect performance or resource bottlenecks more quickly without tedious configuration steps

Improve service reliability with Advanced Anomaly Detection

Advanced Anomaly Detection improves identification of issues and helps IT teams proactively find both known unknowns and unknown unknowns. IT environments are unique and complex, which makes setting thresholds complicated and time-consuming. Advanced Anomaly Detection (univariate) adds an autoconfiguration option to existing BMC Helix ML-based anomaly detection. Now, all KPIs are automatically analyzed, and an alert is raised when an anomaly matches the user-defined sensitivity settings. A single click enables anomaly detection for the entire environment, helping IT teams find previously unknown problems and saving time by eliminating tedious parameter configuration. Policies can still override the global settings for a single time series (univariate anomaly detection) and can also manage anomalies across multiple time series (multivariate anomaly detection).

Figure 2: Advanced Anomaly Detection Example

In combination with BMC Helix AIOps functionalities, Advanced Anomaly Detection events further enhance the Situations functionality, creating a powerful solution that allows IT teams to be proactive and find and fix issues faster.
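As a rough illustration of combining a static threshold with a learned baseline and an adjustable sensitivity, consider the Python sketch below. It is not the algorithm behind Advanced Anomaly Detection, which this post does not specify. It swaps in a simple rolling mean and standard deviation for the ML component, and the function name, parameters, and sample data are all hypothetical.

```python
from statistics import mean, stdev
from typing import List, Optional, Sequence


def detect_anomalies(
    values: Sequence[float],
    window: int = 5,
    max_sigmas: float = 3.0,
    static_max: Optional[float] = None,
) -> List[int]:
    """Return the indices of samples flagged by either check.

    Static check: the value exceeds `static_max` (a known limit).
    Dynamic check: the value deviates from the mean of the preceding `window`
    samples by more than `max_sigmas` standard deviations. A smaller
    `max_sigmas` makes the detector more sensitive.
    """
    flagged = []
    for i, value in enumerate(values):
        over_static = static_max is not None and value > static_max
        unusual = False
        if i >= window:
            baseline = values[i - window:i]
            mu, sigma = mean(baseline), stdev(baseline)
            unusual = sigma > 0 and abs(value - mu) > max_sigmas * sigma
        if over_static or unusual:
            flagged.append(i)
    return flagged


# Hypothetical CPU utilization samples (percent).
cpu = [40, 42, 41, 39, 40, 41, 75, 40, 42, 92]
print(detect_anomalies(cpu, window=5, max_sigmas=3.0, static_max=90))
print(detect_anomalies(cpu, window=5, max_sigmas=3.0))
```

In this data, the jump to 75 never crosses the static limit and is caught only by the baseline check, while the final reading of 92 is caught only by the static limit because the earlier spike has widened the baseline. That is the practical argument for running both kinds of check together.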

BAR and BMC HelixGPT help IT remediate issues instantly

We are continuing to help IT teams be more productive with practical uses for our generative AI. Back in June, we announced BMC HelixGPT, embedded across the BMC Helix platform. BMC HelixGPT uses large language models (LLMs) trained on enterprise domain data. It becomes an expert in your IT environment. With the 24.1 release, we have added the BAR feature based on BMC HelixGPT to help IT practitioners resolve issues and eliminate days of troubleshooting.

Trained on past incidents, situations, and remediation actions, BAR uses generative AI algorithms to accelerate the time to resolution with actionable insights in a human-readable language—no need to learn another query language. Additionally, BAR can dramatically improve an IT team’s efficiency by using insights from similar correlated incidents to automatically generate code templates for the end user to fix an issue.

Figure 3: Best Action Recommendation with Ansible Code Snippet

Practical BAR examples

Let’s assume that your code update resulted in an increased CPU utilization that significantly strained host resources. To remediate the issue last time, an on-call SRE rolled back a code deploy, helping reduce CPU load. When a similar situation happens in the future, BAR will surface how the situation was resolved and help recommend a potential resolution with the steps provided to resolve the problem.
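The underlying idea of reusing a past resolution for a similar situation can be shown with a deliberately simple Python sketch. BAR itself is described here as using generative AI trained on past incidents; the sketch below is not that method. It substitutes plain word-overlap (Jaccard) similarity, and the incident records, function names, and `min_score` cutoff are hypothetical.

```python
from typing import Dict, List, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_resolution(
    description: str,
    history: List[Dict[str, str]],
    min_score: float = 0.3,
) -> Optional[str]:
    """Return the resolution of the most similar past incident, or None when
    nothing in `history` is similar enough."""
    query = tokens(description)
    best_score, best_resolution = 0.0, None
    for past in history:
        score = jaccard(query, tokens(past["summary"]))
        if score > best_score:
            best_score, best_resolution = score, past["resolution"]
    return best_resolution if best_score >= min_score else None


# Hypothetical incident history.
history = [
    {"summary": "high cpu utilization after code deploy",
     "resolution": "roll back the code deploy"},
    {"summary": "file system full from excessive logs",
     "resolution": "delete excessive logs"},
]
print(suggest_resolution("cpu utilization spike after deploy", history))
```

The cutoff matters as much as the ranking: returning nothing for an unfamiliar situation is safer than recommending the resolution of an unrelated incident.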

If you are experiencing longer than expected response times on your requests (slow queries or similar), it may be due to higher-than-expected resource utilization. BAR provides guidance on how to fix the issue. In this case, it recommends running the script to increase the storage space or other pegged resource and then restarting a Kubernetes pod.

Another practical use would be to help recommend patching. Let’s assume you missed patching a set of host instances with the latest security or operational updates. Based on past resolutions, BAR will be able to help you identify what was missed and recommend implementing the latest patch.

The applications of BAR are virtually limitless.

Observability enhancements bring more comprehensive visibility

New BMC Helix Intelligent Integrations enhance IT coverage

We have expanded BMC Helix Intelligent Integrations with Icinga, allowing our customers to get enhanced visibility into their tooling and to bring new data sources into BMC Helix and correlate them. In this release, we also enhanced our existing connectors, including Entuity, Zabbix, Prometheus, SolarWinds, Datadog, VMware vRealize Operations (vROps), Cisco AppDynamics, and CA UIM. With these updated connectors, BMC Helix IT Operations Management solutions provide better coverage and visibility into data from these tools, helping IT to quickly navigate to a specific issue. For details, please refer to our documentation.

Better control and security with flexible log index management

With this release, BMC Helix AIOps capabilities, specifically those in BMC Helix Log Analytics, deliver enhanced security and allow better control over log data. Now, IT practitioners get flexible log management with multi-index support per tenant. With this flexible log segregation and archival duration, IT teams can better manage security and costs.

Enhanced BMC Helix Discovery Technology Knowledge Updates content

Now, BMC Helix Discovery Technology Knowledge Updates (TKU) content is expanded with new cloud, software, storage, and network solutions. BMC Helix Discovery continues to lead the industry with comprehensive, out-of-the-box discovery coverage, enabling IT teams to automatically discover and map their IT assets and dependencies with unparalleled accuracy. BMC Helix Discovery provides even more comprehensive visibility into complex IT landscapes, helping IT teams optimize operations, reduce risk, and accelerate digital transformation. For the full list, please see our documentation.

If you wish to check these out, please contact sales.
