Addressing potential issues before they become outages is critical for site reliability engineers (SREs) and DevOps teams. In this blog post, I cover how BMC’s SRE and DevOps teams use artificial intelligence for IT operations (AIOps) to resolve issues before they become problems. I spoke with BMC Senior Director of DevOps Jason Rush and BMC SRE Manager Jason Ferens about it. Their organization supports all BMC SaaS customers, monitoring the reliability of their services and infrastructure. Their charter is to prevent and mitigate customer outages. Since introducing the BMC Helix AIOps solution, this organization has prevented far more outages and resolved incidents faster.
Navigating heavy alert load and short response times
Before deploying BMC Helix AIOps, BMC SREs dealt with high alert noise. Customers’ issues would sometimes auto-close, leaving the engineers with no way to track the issue down. The team needed a more efficient way to proactively handle issues that would otherwise become service quality risks in the long run. Enter BMC Helix.
Service reliability improved with AI and observability
The DevOps and SRE teams implemented the solution about a year ago and use it as part of their daily operations. They built custom dashboards to get detailed insight into the performance of all customer environments and critical trends. The impact was an overall reduction in alert noise, shorter mean time to resolution (MTTR), and better health of customers’ environments.
BMC Helix AI-powered Situations enable engineers to understand what’s happening in customer environments, where the root cause is, and what was impacted. This insight shifts them from being reactive, chasing incident after incident, to being proactive. Now, because the IT teams know the health of the services, they can proactively address emerging issues and prevent outages.
Every second counts when troubleshooting an issue, and the BMC DevOps and SRE teams use automation to pinpoint an incident’s root cause. SREs can then remediate incidents more precisely and improve and optimize customer environments. Here are the results these teams have achieved since they deployed the BMC Helix AIOps solution:
- 76 percent improvement in service health
- 60 percent MTTR reduction (under 30 minutes resolution time)
- 64 percent of outages prevented
- 1,034 successful remediations from three intelligent automations in one month
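For readers who want to track a figure like the MTTR reduction themselves, here is a minimal Python sketch of the arithmetic. It is not how BMC Helix computes its metrics. The function names `mttr` and `percent_reduction` are hypothetical, and the sketch assumes each incident is available as a pair of opened and resolved timestamps.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Tuple


def mttr(incidents: Iterable[Tuple[datetime, datetime]]) -> timedelta:
    """Mean time to resolution over (opened_at, resolved_at) pairs."""
    durations = [(resolved - opened).total_seconds() for opened, resolved in incidents]
    if not durations:
        raise ValueError("need at least one resolved incident")
    return timedelta(seconds=mean(durations))


def percent_reduction(before: timedelta, after: timedelta) -> float:
    """Percentage by which `after` is lower than the `before` baseline."""
    if before <= timedelta(0):
        raise ValueError("baseline must be positive")
    # timedelta * int is a timedelta; timedelta / timedelta is a float.
    return (before - after) * 100 / before
```

Comparing the MTTR of two periods with `percent_reduction` gives a figure in the same form as the one in the list above.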
Running health checks to prevent incidents
BMC DevOps and SRE teams also track how often outages are avoided through what they call health checks. A health check is performed when no alerts are open: based on trend analyses by BMC Helix AIOps, it identifies an issue that will become an incident if the trend continues. As a result, engineers can remediate the customer issue and avert the problem altogether.
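To make the idea of a trend-based check concrete, here is a minimal Python sketch. It is not the analysis BMC Helix AIOps performs. It assumes evenly spaced metric samples and a simple least-squares line, and the names `hours_until_threshold` and `needs_health_check` are hypothetical.

```python
from typing import Optional, Sequence


def hours_until_threshold(samples: Sequence[float], threshold: float,
                          interval_hours: float = 1.0) -> Optional[float]:
    """Estimate hours until a metric reaches `threshold` if its linear trend continues.

    `samples` are evenly spaced readings, oldest first. Returns None when there
    are fewer than two samples or the trend is flat or falling, and 0.0 when
    the latest reading is already at or above the threshold.
    """
    n = len(samples)
    if n < 2:
        return None
    if samples[-1] >= threshold:
        return 0.0
    # Ordinary least-squares slope over sample indexes 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var  # metric units per sample interval
    if slope <= 0:
        return None
    # Project forward from the latest reading along the fitted slope.
    return (threshold - samples[-1]) / slope * interval_hours


def needs_health_check(samples: Sequence[float], threshold: float,
                       horizon_hours: float = 24.0,
                       interval_hours: float = 1.0) -> bool:
    """True when the trend would cross the threshold within the horizon."""
    eta = hours_until_threshold(samples, threshold, interval_hours)
    return eta is not None and eta <= horizon_hours
```

A check like this can flag an environment while no alert is open, which is the point of the health check: the metric is still below its alert threshold, but the slope says it will not stay there.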
Here is an example of how a potential incident was resolved preventatively. Using BMC Helix Service Monitoring and BMC HelixGPT, the configuration item (CI) topology and analysis pointed to a root cause with critical status on the customer’s platform.
The details of the associated incident showed that a third-party tool had increased logging, resulting in high file system utilization. The team implemented a BMC Helix Intelligent Automation that ran when the alert fired and deleted the excess logs before the file system utilization affected the customer.
With BMC Helix AIOps, the team cleared the Situation and continued to monitor the performance of other tools, allowing for self-healing operations, a nirvana for SREs.
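The remediation described above follows a common pattern: when file system utilization passes a threshold, delete the oldest logs until utilization is back under a target. Here is a minimal Python sketch of that pattern. It is not the BMC Helix Intelligent Automation itself. The function names, the thresholds, and the `*.log` pattern are all assumptions for illustration.

```python
import shutil
from pathlib import Path
from typing import Iterable, List, Tuple


def pick_oldest(files: Iterable[Tuple[str, float, int]], bytes_to_free: int) -> List[str]:
    """Choose the oldest files until their combined size covers `bytes_to_free`.

    Each entry in `files` is (path, modified_time, size_in_bytes).
    """
    chosen, freed = [], 0
    for path, _mtime, size in sorted(files, key=lambda f: f[1]):
        if freed >= bytes_to_free:
            break
        chosen.append(path)
        freed += size
    return chosen


def prune_logs(log_dir: str, max_used: float = 0.85, target_used: float = 0.70,
               pattern: str = "*.log", dry_run: bool = True) -> List[str]:
    """Select (and optionally delete) the oldest logs once utilization passes `max_used`.

    Thresholds and the glob pattern are illustrative assumptions. With
    dry_run=True (the default) nothing is deleted; the candidates are returned.
    """
    usage = shutil.disk_usage(log_dir)
    if usage.used / usage.total < max_used:
        return []  # utilization is below the alert threshold, nothing to do
    bytes_to_free = int(usage.used - target_used * usage.total)
    files = [(str(p), p.stat().st_mtime, p.stat().st_size)
             for p in Path(log_dir).glob(pattern) if p.is_file()]
    victims = pick_oldest(files, bytes_to_free)
    if not dry_run:
        for path in victims:
            Path(path).unlink()
    return victims
```

The sketch defaults to a dry run so the candidate list can be reviewed before anything is removed. Automated deletion should only be enabled once the selection has been checked against real data.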
Preventing outages with ServiceOps insights
The SRE team has something in common with BMC customers: many of them also use BMC Helix AIOps and ServiceOps solutions to manage their day-to-day operations, including alerts and incidents. For instance, SREs developed a custom dashboard that pools data from BMC Helix ITSM and BMC Helix AIOps. This combined visibility allows the SRE team to address open incidents in their full context and accelerate decision-making.
“Using BMC Helix AIOps and ServiceOps gives us a much better ability to prevent incidents.” – Jason Rush, Senior Director, DevOps, BMC
Engineers can easily track the number of incidents per customer and address them accordingly. As a result, the SRE team is more proactive, preventing outages with much higher precision.
An SRE’s 360-degree view of customers’ services and operations
As I mentioned previously, engineering teams can use BMC Helix to see the health of a customer’s environment. At BMC, the SRE team uses another BMC Helix dashboard, “BMC 360 Customer View,” to track the health of all our customer environments. SREs can see the health of all applications and infrastructure, resource utilization (such as CPU), historical ITSM tickets, and everything else they need to know about each customer’s environment. Based on the overview dashboard, the SRE team knows where the issues are and can dive into the incidents and services that require their attention.
How SREs remediate Kubernetes issues with BMC Helix
Let’s examine two examples of how AIOps and observability help the BMC SRE team solve common operational issues. The first example is an out-of-memory issue that was restarting a customer’s Kubernetes pods regularly. While the customer was not affected by this issue, it needed to be resolved to prevent it from progressing. Using the BMC Helix AIOps solution, SREs knew that adding more memory to the configuration would resolve all related alerts in the future. This represents an example of a situation that can be further automated with BMC Helix Intelligent Automation.
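As a rough illustration of the sizing decision behind “adding more memory,” here is a minimal Python sketch that turns observed peak usage into a suggested limit. It is not the BMC team’s procedure. The function name `recommend_memory_limit`, the 25 percent headroom, and the 64 MiB rounding step are assumptions. The result is formatted as a Kubernetes memory quantity such as `640Mi`, the form a container’s `resources.limits.memory` field accepts.

```python
import math
from typing import Sequence


def recommend_memory_limit(peak_mib: Sequence[float], headroom: float = 0.25,
                           step_mib: int = 64) -> str:
    """Suggest a memory limit as a Kubernetes quantity string such as "640Mi".

    Takes the highest observed peak, adds `headroom` (0.25 = 25 percent),
    and rounds up to the next multiple of `step_mib`.
    """
    if not peak_mib:
        raise ValueError("need at least one observed peak")
    target = max(peak_mib) * (1 + headroom)
    rounded = math.ceil(target / step_mib) * step_mib
    return f"{rounded}Mi"
```

Rounding up to a fixed step keeps limits tidy across environments. The headroom value is a judgment call and should reflect how spiky the workload actually is.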
The second issue is logs filling up on the user Kubernetes pods, which causes what are known as “pod evictions.” A Kubernetes pod eviction occurs when a pod running on a node is terminated because a resource such as memory or local storage is running out. The pod is then restarted or rescheduled on another node, which causes instability. The solution for this issue is upgrading the environment.
The SREs use BMC Helix AIOps to see when alerts related to this issue arise and remediate the threat before the pod is evicted. This proactive work by the SRE team prevents the degradation of customers’ service health.
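Here is a minimal Python sketch of the kind of check that can surface pods approaching eviction. It is not how BMC Helix AIOps raises its alerts. It assumes per-pod usage and limit figures, in bytes, are already available from whatever monitoring source is in place. The function name `pods_at_risk` and the 80 percent warning ratio are hypothetical.

```python
from typing import Dict, List, Tuple


def pods_at_risk(usage_bytes: Dict[str, int], limit_bytes: Dict[str, int],
                 warn_ratio: float = 0.8) -> List[Tuple[str, float]]:
    """Return (pod, used/limit) pairs at or above `warn_ratio`, worst first.

    Pods without a positive limit are skipped because no ratio can be computed.
    """
    flagged = []
    for pod, used in usage_bytes.items():
        limit = limit_bytes.get(pod, 0)
        if limit <= 0:
            continue
        ratio = used / limit
        if ratio >= warn_ratio:
            flagged.append((pod, ratio))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Sorting worst first means the pod closest to its limit is handled before the others.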
Achieving self-healing operations with ServiceOps, AIOps, and observability
It has been a remarkable journey for this BMC department as they embrace the power of ServiceOps. They’ve evolved from constant firefighting to self-healing operations thanks to the BMC Helix AIOps solution. What’s even more exciting is how the combination of AIOps and observability is ushering in a new era of predictive solutions for DevOps and SRE teams.
“My day starts with AIOps.” – Jason Ferens, Senior SRE Manager, BMC
If you’d like your day to start smoothly, try BMC Helix solutions and follow this link.
These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.