Monica Brink – BMC Software | Blogs

IT Operations Trends and AIOps Adoption – Feedback from the Frontline
https://s7280.pcdn.co/it-operations-trends-and-aiops-adoption-feedback-from-the-frontline/
Tue, 24 Mar 2020 00:00:09 +0000

It’s certainly an interesting time to be in IT Operations, with teams at the forefront of ‘always on’ customer experience expectations. The success of digital transformation initiatives increasingly relies on frictionless IT Operations processes, which puts these teams in a powerful and important role within organizations. With these opportunities come many challenges. In addition to managing performance and availability across complex, hybrid environments, new challenges for monitoring and event management processes are emerging, including:

  • Huge increases in the volumes of operational data now beyond human scale to manage
  • Observability of new cloud native container-based applications and microservices
  • DevOps initiatives driving the need to monitor and manage many more apps on faster release cycles

We recently delivered a webinar with Forrester Consulting, ‘The IT Operations Balancing Act – Deliver Velocity with Quality’, during which we asked webinar attendees to respond to several polls to gauge how IT Operations leaders and teams are dealing with these challenges, how strategies are evolving and the rate of AIOps adoption. The results of these polls serve as an interesting snapshot of aspects of Infrastructure & Operations (I&O) evolution and maturity.

Poll 1: ITOps challenges

In the first poll, we asked attendees what is behind the major challenges they are facing and the opposing demands of velocity and quality in IT Operations.

Poll 1: What factors are challenging IT Ops processes and driving the need to balance velocity with quality in your organization?​ (Choose any answer that is relevant – multiple answers if required.)

Results:

1. Digital transformation initiatives: 44%
2. DevOps processes: 34%
3. Adoption of new technologies (cloud, containers, microservices, cloud native apps): 73%
4. Increased volumes of IT Operations data: 51%

The adoption of new technologies is the biggest challenge according to respondents, which is not surprising given the large increases we are seeing in containerized applications being run in production. According to Gartner, by 2023, more than 70% of global organizations will be running more than two containerized applications in production, up from less than 20% in 2019. These cloud-native apps and containers are creating monitoring visibility gaps that challenge traditional monitoring practices. As discussed in the webinar, a holistic approach to monitoring and event management across traditional on-prem and new cloud and container environments, with all IT operations data consolidated in a single, AI-driven solution, is key to overcoming this challenge.

Poll 2: Siloed vs holistic I&O processes

In the second poll, we asked our webinar audience how they are transitioning from a siloed approach to I&O processes (for example, multiple monitoring tools for different technologies and little integration between ITOM and other IT processes) to a more integrated and holistic approach.

As IT Operations teams are dealing with more complex environments and new technologies like containers and microservices, the disadvantages of a siloed approach to monitoring and event management are becoming ever more apparent. During the webinar, Rich Lane from Forrester pointed out some of the key challenges that arise from this approach:

  • Too many voices looking at disparate sets of siloed data
  • No real way to correlate events, logs, etc. to call volume
  • No clear sense of incident ownership
  • Unnecessarily long MTTR

Poll 2: Do you consider your I&O processes to be siloed or holistic and integrated or in transition between the two?

Results:

1. Siloed: 41%
2. Holistic and integrated: 10%
3. In transition: 48%

The poll results show that this is very much an issue within IT Ops teams today, although it is encouraging that almost 50% of respondents are ‘in transition’ and in the process of breaking down the silos both within IT Ops and across other IT disciplines such as ITSM. Indeed, AIOps can be a key facilitator of this transition by unifying all business data under one umbrella and pairing intelligence with automation to bridge processes and increase visibility.

Poll 3: AIOps adoption

This brings us to our next poll, which focused on AIOps adoption. Having asked this exact same question in an AIOps webinar we delivered back in June 2019, we can compare the evolution of AIOps adoption across 9 months.

Poll 3: At what stage of AIOps adoption is your IT Ops team?​

Results:

Stage of AIOps adoption             Feb 2020   June 2019
Exploring options & use cases       70%        62%
Planning to deploy                  21%        15%
Actively deploying                  9%         19%
Fully implemented AIOps strategy    0%         4%

So, the categories of exploring options and planning for AIOps have gone up between June 2019 and February 2020 which we would expect as awareness of the benefits has grown. Conversely, the categories of actively deploying and having fully implemented AIOps have gone down. There was general agreement from the discussion and comments on the webinar that this reflects a more realistic approach to AIOps today than many IT Ops teams had 9 months ago. There is now broad recognition of the potential immense value of a successful AIOps strategy and many organizations are therefore taking a more planful and considered approach to implementing AIOps use cases.
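
To make the shift between the two polls explicit, here is a minimal Python sketch that computes the percentage-point change for each stage. It uses only the poll figures reported above; the variable names are ours and purely illustrative.

```python
# Poll 3 results as reported above: stage -> (Feb 2020 %, June 2019 %)
poll = {
    "Exploring options & use cases": (70, 62),
    "Planning to deploy": (21, 15),
    "Actively deploying": (9, 19),
    "Fully implemented AIOps strategy": (0, 4),
}

# Percentage-point change from June 2019 to February 2020
change = {stage: feb - jun for stage, (feb, jun) in poll.items()}

for stage, delta in change.items():
    print(f"{stage}: {delta:+d} points")

# Combined share of the two early stages (exploring + planning) in each poll
early_stages = list(poll.values())[:2]
early_feb = sum(feb for feb, _ in early_stages)
early_jun = sum(jun for _, jun in early_stages)
print(f"Exploring or planning: {early_jun}% -> {early_feb}%")
```

Grouping the two early stages together shows the overall movement: the share of respondents who are still exploring or planning rose from 77% to 91%, while the share who are deploying or fully implemented fell correspondingly.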

Poll 4: I&O maturity

Our last poll was focused on transformation and maturity across Infrastructure & Operations organizations and processes. As IT Operations struggles to achieve that balancing act of maintaining performance while moving at the speed required by digital innovation, what are the key ways they are transforming and moving along the I&O maturity curve?

Poll 4: What is the level of I&O maturity in your organization?

Results:

1. Low level maturity – manual processes, legacy systems, no AI/ML, Dev and Ops separated, silos of monitoring: 50%
2. Medium level maturity – some automated processes, applying AI/ML for limited use cases, supporting DevOps initiatives: 47%
3. High level maturity – single pane of glass, consolidated monitoring, fully implemented AIOps use cases, automated event resolution: 3%

Our webinar respondents took a very realistic approach to this poll as well with 50% recognizing they are still at low level maturity and a slightly smaller number at medium level with only 3% at high level. This is reflective of the market although the fact that many realize they are at these low and medium maturity levels will no doubt foreshadow significant change and innovation in IT Operations processes in the coming years.

BMC supports the State of IT Operations

Also covered in the webinar was the recent release of BMC Helix Monitor which has been developed to help IT Ops teams undertake the required transformation along the maturity curve from manual, siloed processes to AI and Machine Learning driven processes that are automated, integrated and provide deep visibility via a single view into performance across hybrid, complex environments.

BMC Helix Monitor is part of the BMC Helix end-to-end service and operations SaaS platform which unites Discovery, Monitoring, Service Management, Optimization and Remediation. BMC Helix Monitor combines broad capabilities across monitoring and event management with a cloud native containerized microservices architecture that enables fast deployment and upgrades, elastic scalability, enterprise grade high availability and performance along with the reduced infrastructure costs that come with a SaaS deployment model. The solution features a modern user experience and automated workflows to streamline monitoring and event management processes and enables large scale ingestion of events and metrics for AIOps use cases.

Learn AIOps webinar series
https://www.bmc.com/blogs/learn-aiops-webinar-series/
Thu, 24 Oct 2019 00:00:31 +0000

According to Gartner, AIOps has not yet reached peak hype! And yet, many IT Operations professionals are researching options and planning to implement an AIOps strategy to manage huge data volumes and complexity and realize significant benefits, including faster MTTR, longer MTBF, improved service levels, and reduced downtime and costs. But there’s a lot to consider before implementing an AIOps strategy – business drivers, prioritization of use cases, data sources, required skills, and more. How do you get started?

That’s why we kicked off the ‘Learn AIOps’ webinar series – to meet the demand from IT Operations professionals looking for tangible ways to get value from AIOps and practical advice for deploying use cases. This webinar series is an opportunity for you to get AIOps education to help with your planning and learn what’s needed to start putting machine learning and analytics to work for your business. Check out the schedule below – you can sign up for upcoming webinars and watch on demand replays of previous webinars. Got an AIOps webinar topic you’d like us to cover? Add it to the comments below. Hope to talk to you soon on a webinar!

In this webinar, we delve into the challenges IT Operations teams face as they strive to optimize performance and speed for the digital business while ensuring service quality. The webinar includes an overview of the new BMC Helix Monitor solution and covers key topics including:

  • How to support the speed required by DevOps initiatives
  • The role of AI, Machine Learning and AIOps
  • The importance of integration across technologies and IT disciplines
  • Practical use cases for increasing agility and reducing risk

This webinar explores practical ways that an AIOps strategy can help IT Operations address the challenges of hybrid & complex environments, rapidly rising data volumes, and increased pressures from line of business teams, including:

  • Why IT needs an AIOps strategy
  • AIOps adoption considerations
  • Key AIOps use cases and tangible benefits
  • Driving value from AIOps now and in the future

The Roadmap to AIOps – Watch the on demand replay

This webinar provides practical advice on developing a roadmap to AIOps, covering the initial key planning steps including:

  • Aligning IT use cases and KPIs with business initiatives
  • Defining AIOps goals and success criteria
  • Assessing systems of measure and data models
  • Establishing systems of record

The next steps on your AIOps Journey – watch the on demand replay

This webinar continues the roadmap discussion with a deeper dive into the next steps needed to fully implement AIOps use cases and start driving business value, including:

  • How to implement analytics workflows and automation
  • Adapting your organization to new skill sets
  • Customizing analytical techniques for optimal results

In this webinar, we explore the challenges of cloud operations management including:

  • Budget management and cost optimization
  • Aligning operations and line-of-business stakeholders
  • Configuration security posture management
  • Role of machine learning and automation in modern #CloudOps

Bookmark this page and re-visit often to keep updated on the latest webinars in the series.

Automated Event Remediation and AIOps – watch the on demand replay

In this webinar, we will explore the value of including automated event remediation as part of your AIOps strategy, including:

  • Emerging challenges driving AIOps adoption
  • Role of automation in the AIOps and digital transformation journey
  • Automation use cases for event remediation
  • Real world client example

Please join us to learn and understand how automated event remediation can enhance your AIOps strategy and help you achieve critical business KPIs.

How Automation Maximizes AIOps Value
https://www.bmc.com/blogs/how-automation-maximizes-aiops-value/
Fri, 23 Aug 2019 00:00:02 +0000

In today’s enterprises, the potential for AIOps is massive. However, the reality is that many organizations have only scratched the surface in terms of what’s possible. In recent weeks, I’ve been writing posts that describe some key AIOps use cases, focusing on those areas that offer organizations some of the most significant near-term potential.

In recent posts, we examined how teams can leverage AIOps to perform intelligent probable cause analysis, reduce event noise, and enable predictive alerts. In this post, we’ll look at how, by employing automation, teams can fully harness the power of AIOps-fueled insights, and so reap maximum rewards from their AIOps investments.

The problem: Why is AI so difficult?

While the move to AI is definitely on, the reality is that the journey is proving to be filled with obstacles for many organizations. A recent IDC report offers some sobering stats as to how widespread these challenges can be. Through their research on more than 2,400 organizations that are employing AI across their operations, they found that only 25% have established an enterprise-wide AI strategy. Further, one-quarter of organizations are seeing a 50% failure rate in their AI projects.

Looking at AI in IT Operations – AIOps – more specifically, it is clear that while AI and machine learning can yield tremendous insights, it can be a challenge for IT operations teams to act on, and fully harness, these insights. One key reason is that teams remain mired in manual tasks; too many efforts are time consuming, costly, and error prone. Exacerbating matters is the siloed nature of many IT organizations. Different teams are often employing distinct tools and workflows, which presents increasing problems as environments continue to grow more complex, dynamic, and interconnected. These disjointed, labor-intensive efforts have profound implications for teams:

  • It takes too long for staff to resolve issues, putting SLA compliance at risk.
  • Struggling to keep pace with existing implementations, teams are ill equipped to support innovation and strategic efforts.

Not only do these manual efforts hinder staff efficiency and organizational agility; they limit the team’s ability to act on insights from machine learning and analytics. Therefore, adopting an AIOps strategy but leaving teams to struggle with these manual efforts will diminish the benefits of AIOps investments or negate the value completely. Ultimately, to fully capitalize on the advantages of AIOps, IT teams need to address these shortcomings.

Key AIOps capabilities

To make AIOps initiatives pay off fully for the organization, it is vital for teams to leverage AIOps’ rich, machine-learning-driven insights to power automation. Automation represents a key way to ensure that the insights delivered by AIOps ultimately are employed to maximum advantage. For example, automated remediation can be a key initiative that delivers significant near-term value. IT Ops teams can automate the triage and remediation of commonly recurring issues and tasks, such as restarting services, cleaning up temporary resources, and provisioning additional capacity. This automation can deliver significant enhancements in staff efficiency, mean-time-to-resolution metrics, and service levels.
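
As a rough illustration of what automating the triage and remediation of recurring issues can look like, here is a minimal Python sketch of a runbook lookup. The event types, field names, and action functions are all hypothetical and invented for this example; they do not represent any BMC product API. A real implementation would call an orchestration or automation tool instead of returning strings.

```python
from typing import Callable, Dict, List

# Hypothetical remediation actions; real ones would invoke an automation tool.
def restart_service(event: dict) -> str:
    return f"restart {event['resource']}"

def clean_temp_files(event: dict) -> str:
    return f"clean temporary files on {event['resource']}"

def add_capacity(event: dict) -> str:
    return f"provision capacity for {event['resource']}"

# Runbook: maps a recurring event type (names are assumptions) to its action.
RUNBOOK: Dict[str, Callable[[dict], str]] = {
    "service_down": restart_service,
    "disk_almost_full": clean_temp_files,
    "capacity_low": add_capacity,
}

def triage(events: List[dict]) -> dict:
    """Split events into automated actions and events that need a human."""
    automated, escalated = [], []
    for event in events:
        action = RUNBOOK.get(event["type"])
        if action is None:
            escalated.append(event)  # no known fix: leave for an operator
        else:
            automated.append(action(event))
    return {"automated": automated, "escalated": escalated}

result = triage([
    {"type": "service_down", "resource": "web-01"},
    {"type": "unknown_error", "resource": "db-02"},
])
print(result["automated"])
print(len(result["escalated"]), "event(s) escalated")
```

The design point is the split itself: known, recurring issues are handled without staff involvement, and only events with no runbook entry reach an operator.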

In our earlier post, we outlined how AIOps can power intelligent probable cause analysis that enables the fast identification of the cause of issues. Automation represents an optimal complement to this capability, ensuring teams can both find and fix issues fast.

In addition, automated, closed-loop processes with the service desk can be established, setting the stage to maximize automation through the entire lifecycle from event identification to ticket generation to remediation, change and incident management and ticket close. This is an excellent step towards breaking down the silos between ITOM and ITSM to drive maximum value from AI for the business.

Benefits of AIOps

Through employing AIOps to establish automation, organizations are realizing a number of compelling benefits:

  • Speeding remediation workflows, enabling up to 75% improvements in mean-time-to-resolution metrics.
  • Reducing costs and risk by minimizing the potential for human errors.
  • Offloading repetitive administrative tasks from skilled IT resources, enabling them to focus on higher value, more strategic efforts.
  • Facilitating the unified, end-to-end workflows that help bridge the gaps that often exist between various groups in the IT organization, including IT operations and IT service management.
  • Automating tens of thousands of event responses, and therefore saving thousands of hours of staff time.

Transamerica Life Insurance Company is an example of one leading enterprise that has done much more than scratch the surface of the potential of automation. By harnessing machine learning and automation, the financial services firm has realized significant benefits from increased productivity related to event management. In the first seven months of its implementation, the organization automatically handled 94,273 events, saving more than 9,000 hours of staff time. Further, event-driven automation has reduced the load on level-2 staff, freeing them to spend more time focusing on strategic activities. (View the Transamerica customer success page to learn more.)
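
To put those numbers in perspective, a quick back-of-the-envelope calculation in Python is shown below. It uses only the two figures quoted above. Since the text says "more than 9,000 hours", the per-event result is a lower bound, and the monthly figure is a simple average over the seven months.

```python
events_handled = 94_273  # events handled automatically in the first seven months
hours_saved = 9_000      # "more than 9,000 hours", so treat this as a lower bound
months = 7

minutes_per_event = hours_saved * 60 / events_handled
events_per_month = events_handled / months

print(f"At least {minutes_per_event:.1f} minutes of staff time saved per event")
print(f"About {events_per_month:.0f} events handled automatically per month")
```

That works out to roughly 5.7 minutes saved per automated event, across an average of about 13,500 events a month.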

AIOps and automation: better together

Within many organizations, automation is the use case that will enable breakthroughs in the realization of AIOps value. Through automation, teams will be able to harness AI-driven insights, and ensure they are leveraged and acted upon in the most comprehensive, efficient manner possible. To learn more about how our solutions are enabling customers to realize breakthrough value from AIOps, be sure to visit the TrueSight AIOps page.

Probable Cause Analysis: A Key Value Driver of AIOps
https://www.bmc.com/blogs/probable-cause-analysis-a-key-value-driver-of-aiops/
Thu, 18 Jul 2019 00:00:38 +0000

In their report entitled “IDC FutureScape: Worldwide CIO Agenda 2019 Predictions,” IDC predicts that, by 2021, 70 percent of CIOs will aggressively apply data and AI to IT operations, tools, and processes. Driven by demands for improving service levels, efficiency, and agility, IT leadership is clearly counting on the promise of artificial intelligence for IT operations (AIOps).

As CIOs set out to pursue their AIOps initiatives, it will be critical to establish the near-term wins that are vital in demonstrating value and fueling longer-term buy-in, support, and investment. Toward that end, many IT leaders are struggling to determine the best near-term use cases to start with.

This is the third in a series of posts I’m publishing on AIOps use cases. In these posts, I’ve been focusing on those use cases that offer organizations some of the most significant near-term potential. In our last post, we examined how AIOps enables teams to establish event noise reduction and predictive alerting. In this post, we’ll look at some of the near-term gains AIOps can provide in the area of probable cause analysis.

The Demand for Effective Probable Cause Analysis—and the Penalty for the Lack Thereof

Today, the ability to deliver reliable, optimally performing digital services plays an increasingly influential role in an organization’s performance and competitiveness. Even small hiccups can delay critical operations, frustrate customers, and erode revenues. Business service downtime has an immediate, significant budgetary impact. For example, one survey found that, for more than 80 percent of businesses, an hour of downtime costs more than $300,000.

Given these realities, the pressure continues to mount on IT organizations, who are tasked with ensuring that required service levels are attained. To deliver reliable, optimized services, IT Ops must be able to identify issues that arise across the environment, quickly determine the cause of that issue and remediate it to maintain performance and service levels. In these efforts, effective capabilities for probable cause analysis are an imperative.

What are Probable Cause Analysis and Root Cause Analysis?

Probable cause analysis is the ability to understand relationships between infrastructure, applications and services and correlate millions of monitoring data points including performance metrics, events, logs, anomalies and baselines to deliver a scored and ranked list of the most likely causes for any problem in the environment. Root cause analysis refers to the next stage of problem identification. Log analytics enable deep analysis of log files to troubleshoot, recognize patterns, detect anomalies and identify the root cause.

Issues can be anything across the environment, from a server being down to a slowdown in application response to network latency problems or CPU capacity levels. In some cases, probable cause analysis and root cause analysis can be a straightforward exercise. For example, a server administrator could receive an alert that a server is down, inspect the server’s performance metrics, see that CPU utilization is maxed out, and take steps to reallocate workload to another server. In other cases, particularly in today’s interrelated, complex environments, probable cause analysis may require the investigations of several domain experts and different types of data from a number of monitoring tools.

At a high level, probable cause analysis requires a combination of effort, expertise, and data. Ultimately, the more complete, current, and targeted the data, the better equipped the administrator will be to assess the probable cause. In other words, the better the data, the less expertise and effort that will be required for probable cause analysis.

The Problem: Data Volumes are Overwhelming, While Insights are in Short Supply

In pursuing probable cause analysis, too many teams currently have the odds stacked against them. First, as outlined in our last post, that’s because environments continue to grow more complex. Whether it’s due to the proliferation of microservices, DevOps, or multi-cloud approaches, the reality is that environments continue to grow more dynamic, interrelated, and composite in nature.

Exacerbating these realities is the fact that teams tend to be battling with very limited visibility. Operators rely on technology- or domain-specific tools that only provide a fragmented view of the environment. With these tools, it’s difficult, if not impossible, to assess how a given system-level issue affects the business services that run on top of the infrastructure. At the same time, if an issue is discovered at the business service level—for example, users are calling to complain about a service being down—it’s very difficult and time consuming to determine where the actual issue is.

These obstacles have been significant, and only continue to be compounded as IT environments keep growing in scale and complexity. Ultimately, these obstacles leave IT teams plagued by staff inefficiency, high costs, and poor service levels.

How can AIOps improve Probable Cause Analysis?

Today, AIOps represents a key approach for teams looking to more quickly, accurately, and efficiently diagnose issues. By harnessing AIOps capabilities, teams can:

  • Employ service modeling to gain critical context around how users and business services are affected by issues.
  • Correlate millions of monitoring data points, including metrics, events, logs, anomalies, and baselines to automatically identify the causes of issues.
  • Score and rank causal metrics to quickly reveal the most likely source of issues.
  • Use analytics on log files to drill down into root cause.
  • Employ event analytics to track event patterns within the context of applications.

How Have Organizations Benefited from AIOps?

By employing AIOps solutions, organizations can gain the advanced probable cause analysis capabilities they need to maximize their speed and efficiency in troubleshooting and remediating issues. We’ve seen organizations across diverse industries achieve 30 – 75% reductions in the time it takes to diagnose issues and similar reductions in mean-time-to-repair (MTTR).

Following are examples of two TrueSight customers that are harnessing AIOps to fuel enhanced probable cause analysis:

  • Park Place Technologies. Park Place Technologies has employed an AIOps solution to power more intelligent probable cause analysis. By doing so, their teams have been able to gain better, more timely insights, which has enabled staff to resolve incidents 31 percent faster and achieve a first time fix rate of 99%.
  • Brazil Ministry of Education. The Brazil Ministry of Education deploys AIOps to understand event service impact across their environment and uses log analytics to identify root cause. Now, when major infrastructure events do occur, the team is able to do root cause analysis 50 percent faster than it could before.

The Potential of AIOps

For today’s IT operations teams, delivering optimized service levels is a vital imperative. However, tracking and managing service levels only seems to get more difficult as environments grow ever more complex and dynamic. This is a key reason why the promise of AIOps is so compelling. With the right AIOps platform, teams can far more quickly and intelligently detect the probable and root causes of issues.
Be sure to keep an eye out for our next blog post in this series, which will provide a detailed look at another use case in the AIOps value chain: automatic remediation, incident, and change management to link IT Operations and the Service Desk. In the meantime, to learn more about our AIOps offerings, be sure to visit the TrueSight AIOps page.

Why Event Noise Reduction and Predictive Alerting are Critical for AIOps
https://www.bmc.com/blogs/why-event-noise-reduction-and-predictive-alerting-are-critical-for-aiops/
Fri, 05 Jul 2019 00:00:34 +0000

In recent months, there’s been a significant amount of press and analyst coverage on AIOps. Is AIOps being over-hyped? While the market is in its early days, that doesn’t mean there’s nothing behind the buzz. In fact, the reality is that organizations are seeing significant benefits from AIOps today.

This is the second in a series of posts I’m publishing on use cases that offer opportunities for harnessing the benefits of AIOps in the near term. In this post, we’ll look in more detail at how AIOps enables teams to reduce event noise and predictively alert.

Before we jump in, some definitions:

What is Event Noise Reduction?

Event noise is the term used to describe the hundreds of hourly and daily notifications and alarms (e.g., CPU utilization, memory utilization, end user response time) delivered by monitoring systems to IT Ops teams to show the health and performance of infrastructure and applications across their IT environment.

Event noise reduction involves applying machine learning to historical and real-time operational data to identify patterns and suppress events that fall within bands of normalcy while surfacing the most critical alarms and events for prioritized triage and remediation.

What is Predictive Alerting?

Predictive alerting refers to the ability to use machine learning, pattern identification and log analytics to identify abnormalities in operational data and predictively alert IT Ops on these abnormalities that could potentially impact an application or service. By highlighting activity that falls out of operational norms, IT Ops can proactively remediate issues before any service impact to meet SLAs, optimize customer experience and increase productivity.

The IT Operations Complexity Challenge

While you could argue the job of IT operations has long been demanding, you could also make a strong case that the job has never been tougher than it is today. IT teams are being tasked with ensuring optimized service levels in a climate that’s increasingly unforgiving of even short amounts of downtime. Further, these teams are tasked with managing environments that are introducing unprecedented complexity, dynamism, and interrelatedness.

Not all that long ago, monolithic, relatively static, on-premises computing stacks were the norm. Now, teams have to manage these legacy environments, plus a lot more. This typically includes implementations in dynamic microservices and container-based deployments and multiple cloud environments. Further, the expanded adoption of agile, continuous integration and continuous delivery, and DevOps approaches continues to fuel a massive acceleration in application release cycles and infrastructure changes.

While these modern environments present a number of challenges, one of the most urgent issues is the deluge of event volumes teams are being forced to sift through on an ongoing basis. Over the course of any given week, operations staff are overwhelmed by hundreds, if not thousands of alarms. Beyond the sheer scale, what compounds matters is that a significant percentage of alarms are redundant or false. Further, a single outage can often be responsible for triggering alarm storms, or massive spikes in alarms.

This overwhelming event noise creates several problems. Most fundamentally, it’s a drain on resources, it saps staff energy, and it erodes morale. Further, the more teams are overwhelmed by event volumes, the more likely it is that significant incidents will be missed or spotted too late, which means service levels are at risk on a constant basis.

How AIOps Can Help

By establishing robust AIOps capabilities, IT operations teams can address the challenges outlined above to reduce event noise in their environments. It starts with ingesting data from diverse sources and technologies, and aggregating a variety of data types, including events, logs, metrics and end user experience monitoring data in a single consolidated data repository.

IT Ops teams can then employ policy-based rules that start to filter and suppress events. To be effective, teams need an engine that can group events, apply rules, enforce policies, and enable filtering by a range of attributes—including location, monitoring source, application tier, severity, and more. This sets the foundation for significant event noise reduction.
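The filtering and suppression step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular AIOps product implements it: the `Event` fields, the 1 to 5 severity scale, the rule format, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; real monitoring tools expose many more attributes.
    source: str
    location: str
    tier: str
    severity: int  # assumed scale: 1 = informational ... 5 = critical
    message: str


def matches_any_rule(event, rules):
    """True if some rule matches the event on every attribute that rule names."""
    return any(
        all(getattr(event, attr) == value for attr, value in rule.items())
        for rule in rules
    )


def reduce_noise(events, suppression_rules, min_severity=3):
    """Drop low-severity, policy-suppressed, and duplicate events."""
    seen = set()
    kept = []
    for event in events:
        if event.severity < min_severity:
            continue  # below the severity floor
        if matches_any_rule(event, suppression_rules):
            continue  # suppressed by policy (e.g. a lab location)
        key = (event.source, event.location, event.tier, event.message)
        if key in seen:
            continue  # redundant repeat of an event already kept
        seen.add(key)
        kept.append(event)
    return kept
```

A rule such as `{"location": "lab"}` would suppress every event from that location, while `{"source": "synthetic", "tier": "web"}` would only suppress events matching both attributes.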

Machine learning is at the heart of an effective event noise reduction strategy and must be applied to historical and real-time data to study behavior and identify patterns. Based on this pattern identification, dynamic baselines can be established, reducing the overhead and inaccuracy associated with static thresholds and the event noise they generate. Ultimately, event suppression is achieved by distinguishing between events arising within bands of normalcy and those arising from true abnormalities that could impact users.
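To make the contrast with static thresholds concrete, here is a deliberately simple sketch of a dynamic baseline. It uses a mean plus or minus k standard deviations band over recent history, which is a stand-in chosen for illustration; production AIOps platforms typically use richer models that account for seasonality. The function names and the default `k=3.0` are assumptions.

```python
from statistics import mean, stdev


def dynamic_band(history, k=3.0):
    """Return the (low, high) band of 'normal' learned from past samples.

    Needs at least two samples, because stdev() is undefined for fewer.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """True only when the value falls outside the learned band."""
    low, high = dynamic_band(history, k)
    return value < low or value > high


# CPU % observed during a nightly batch window that normally runs hot.
batch_window_cpu = [84, 86, 85, 87, 83, 85, 86, 84]

STATIC_THRESHOLD = 70  # a fixed threshold would alert on every sample above
print(is_anomalous(86, batch_window_cpu))  # inside the band for this window
print(is_anomalous(97, batch_window_cpu))  # outside the band
```

With this history, a reading of 86% sits inside the learned band even though it is well above the static 70% line, so no event is raised; a reading of 97% falls outside the band and is flagged.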

Using AIOps for Predictive Alerting

The pattern identification and anomaly detection provided by machine learning enable the valuable AIOps use case of predictive alerting. Machine learning, anomaly detection and log analytics enable IT Ops teams to spot potential issues, prioritize them for triage and diagnosis, and address them before they have any impact on business services. TrueSight customers have been able to receive warnings three hours before a baseline is breached, giving them time to remediate proactively. Predictive alerting is also a valuable use case in capacity management: applying machine learning and analytics to capacity data enables potential capacity constraints to be identified ahead of time and corrective action to be taken. Since capacity outages are some of the most difficult to resolve, this type of insight is hugely valuable to IT Ops and capacity teams.
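One simple way to picture predictive alerting is trend extrapolation: fit a line to recent samples and estimate when it will cross the upper edge of the baseline. The sketch below uses an ordinary least-squares fit as an illustrative stand-in; it is not a description of how TrueSight or any other product forecasts, and the function name and sample data are hypothetical.

```python
def hours_until_breach(samples, upper_bound):
    """Estimate hours from the latest sample until the trend crosses upper_bound.

    samples: list of (hour, value) pairs in time order.
    Returns None when there is no upward trend to extrapolate.
    """
    n = len(samples)
    if n < 2:
        return None
    xs = [hour for hour, _ in samples]
    ys = [value for _, value in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all samples share one timestamp
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    if slope <= 0:
        return None  # flat or falling: no breach predicted
    intercept = mean_y - slope * mean_x
    crossing_hour = (upper_bound - intercept) / slope
    return max(0.0, crossing_hour - xs[-1])


# Disk usage (%) sampled hourly, with an assumed upper baseline of 80%.
usage = [(0, 50), (1, 55), (2, 60), (3, 65)]
print(hours_until_breach(usage, upper_bound=80))  # 3.0
```

Here usage climbs five points per hour, so the trend reaches 80% three hours after the last sample, which is the window in which a team could act before any user impact.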

The Benefits of Event Noise Reduction and Predictive Alerting

When organizations employ AIOps to reduce event noise and establish predictive alerting, they can realize a range of significant benefits:

  • Harness more targeted intelligence. IT Ops teams can better understand how specific issues affect business services, so they can more quickly identify and prioritize the most business-critical issues.
  • Boost staff productivity. Reduce event volume levels by up to 90% to avoid the massive costs and inefficiency associated with managing thousands of redundant and inaccurate alerts.
  • Enhance service levels. Receive warnings up to three hours before baselines are breached to remediate issues before services are affected. As a result, teams can enhance SLA compliance and improve the user experience.
  • Avoid outages and continuously optimize cost. Use machine learning and automation to identify inefficiencies in IT infrastructure usage to prevent resource shortages and identify cost reduction opportunities.

Recently, BMC worked with a large U.S.-based insurer that deployed AIOps to reduce the event noise that its IT operations team had to contend with. Every month, the IT Ops team had to sift through more than 15,000 events to diagnose, prioritize and triage. Now, by leveraging machine learning and establishing dynamic baselines, the team has been able to reduce this to 1,500 events per month, all of which are ‘meaningful’ events that need to be actioned. This is a huge boost to efficiency and cost reduction across IT Ops processes.

Achieving a balance between the competing demands of supporting business innovation and optimizing service levels can be difficult in today’s complex, fast-moving environments. With the right AIOps platform, IT Ops can significantly reduce event noise and gain predictive insights, so they can optimize staff productivity and service levels and avoid downtime across their environment.

Be sure to keep an eye out for our next blog post in this series, which will provide a more detailed look at the AIOps use case of probable cause analysis. In the meantime, to learn more about our AIOps offerings, visit the TrueSight AIOps page.

]]>
Gartner Market Guide for IT Process Automation https://www.bmc.com/blogs/gartner-it-process-automation/ Thu, 07 Feb 2019 00:00:37 +0000 https://www.bmc.com/blogs/?p=13508

The 2018 Gartner Market Guide for IT Process Automation (ITPA) delivers insights about the current state of automation in enterprises and actionable recommendations for ITOps teams seeking to evolve automation initiatives to the next level. In this blog, let’s look at what’s working today and what’s needed to help ITOps teams move along the automation continuum.

Automation is a stated goal for most, if not all, ITOps teams. After all, the benefits of automating across processes, technology stacks and systems can be significant: labor efficiencies, cost savings, increased agility to support the business, better service levels, and consistent governance and compliance practices, to name just a few. In this latest version of the ITPA Market Guide, Gartner paints the picture that very few enterprises are achieving the level of automation that delivers these benefits. In fact, Gartner makes the point that “ITPA tools remain aspirational automation targets for most I&O leaders, who are focused on task-level automation.”

This issue of task automation vs. process automation is an important one in the context of overall automation initiatives. Most ITOps teams are automating tasks specific to one purpose, often in an ad hoc and disjointed way, so the resulting benefits are minimal. Automation projects hit a wall because they are done in isolation without focusing on the overall outcome. While it may deliver some short-term efficiency gains, random task automation is not a long-term strategy. ITOps teams need to focus on how task automation efforts are driving towards a holistic automation approach. Gartner represents this as a transition from IT Task Automation to IT Service Automation to Business Service Automation.

Gartner notes that very few enterprises are at the level of Business Service Automation with most being at the task level. In the current environment of increased complexity, multi-platform, multi-cloud, rapid increases in data quantity and the need to keep up with the pace of DevOps, automation is key to the success of ITOps teams. So, what can be done to progress the automation agenda and avoid getting stuck in low-value task automation?

Gartner recommends that I&O leaders focus on the following to make IT process automation a central tenet in ITOM:

  • Drive skills development
  • Review and change organizational design
  • Invest in automation tools and training
  • Proactively support automation initiatives

Additionally, Gartner recommends a strategic approach to automation, with careful assessment of organizational automation objectives and priorities, and advises against continuing to “opportunistically automate with limited strategic vision.” Existing successes in task-level automation are a good place to start, building on that task automation to orchestrate across more complex services.

Many BMC customers are evolving from task automation to IT process automation by using TrueSight Orchestration to link tasks together and automate workflows to address pain points in ITOps processes. Two key use cases include:

Automated Event Remediation: automating the triage and resolution of events that have standard remediation processes, with orchestration across the monitoring and service management processes. With the amount of data that organizations and IT Ops teams have to manage growing exponentially, this automation use case helps take IT Operations from incident response to automated, proactive problem management.

Closed-Loop Change and Compliance Management: ITOps teams are automating change and configuration processes to address the need to maintain compliance while supporting DevOps initiatives that drive shorter release cycles and rapid change. This enables IT Ops to create audit-ready processes that drive compliance while reducing labor cost, errors and governance risks.
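As a rough illustration of the automated event remediation use case described above, the sketch below maps event types that have a standard fix to a runbook function and escalates everything else to a person. It is a toy model only: the event fields, runbook names, and return values are hypothetical and do not represent TrueSight Orchestration workflows or its API.

```python
def restart_service(event):
    # Placeholder for a real orchestration step (hypothetical).
    return f"restarted {event['service']} on {event['host']}"


def clear_temp_files(event):
    # Placeholder for a real orchestration step (hypothetical).
    return f"cleared temp files on {event['host']}"


# Event types with a standard, pre-approved remediation process.
RUNBOOKS = {
    "service_down": restart_service,
    "disk_full": clear_temp_files,
}


def remediate(event, runbooks=RUNBOOKS):
    """Run the standard fix when one exists; otherwise hand off for triage."""
    action = runbooks.get(event["type"])
    if action is None:
        return {"status": "escalated", "detail": "no standard remediation, ticket opened"}
    return {"status": "remediated", "detail": action(event)}


print(remediate({"type": "disk_full", "host": "app01"}))
print(remediate({"type": "unknown_latency", "host": "app02"}))
```

The design point is the split itself: only events with a well-understood remediation are automated, and the result of each run is recorded so that it can feed service management records.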

IT process automation initiatives like these are being achieved with out-of-the-box content that speeds workflow setup and time to value, which is in line with Gartner’s recommendation to “Maximize automation investments by selecting ITPA tools that can orchestrate workflow execution across functional requirements, groups and management tools.”

Outcomes from these automation initiatives are driving significant change and efficiencies across ITOps processes, including benefits such as:

  • 60% reduction in mean time to repair (MTTR) for operational issues
  • 83% reduction in time spent remediating events
  • 30% decrease in cost of audit preparations

An important takeaway for I&O leaders from this latest Gartner ITPA report is that implementing automation in a meaningful way involves moving beyond task automation to process automation and eventually to service orchestration. This requires a strategic automation approach and an ITPA tool with the out-of-the-box content and flexibility to support business automation imperatives.

]]>
AIOps and the New IT Skill Sets https://www.bmc.com/blogs/aiops-and-the-new-it-skill-sets/ Wed, 17 Oct 2018 00:00:25 +0000 http://www.bmc.com/blogs/?p=11013

This post is about how AIOps will change the way IT Operations personnel (IT Ops) work and the new skill sets they have to adopt in an AIOps world. For a definition of AIOps, refer to the blog post: “What is AIOps?”

How does AIOps work, again?

Gartner explains that an AIOps platform (figure 1) uses machine learning and big data to aggregate observational data (from monitoring systems output, job logs, syslogs, etc.) and engagement data (from ticketing, incident, and event recording system data) to produce a virtuous circle of continuous insights yielding continuous improvements and fixes.

Automation is both an input and an output of AIOps. The results or statuses of automated workloads and jobs can be used like operational data and engagement data for analytic purposes. Manual improvements can take the form of automating tasks, responses, remediations, etc. Machine learning that handles analytics at scale and adjusts algorithms accordingly is a form of automated improvement, as seen in Amazon and eBay online shopping, automated stock trading systems, or Netflix recommendations. In practice, a solid foundation of automation and orchestration across systems, processes and workflows is the ideal starting point for AIOps and ensures a greater likelihood of success.

If it’s automated, what does IT Ops do?

The implications of implementing AIOps are significant not only in terms of technology, but also in terms of process, culture and skills. AIOps will produce a big change in IT Ops’ role in both the Data Center and the business, leading IT organizations to ask this question:

What happens to the traditional IT Ops role when you turn IT Operations tasks over to an AIOps system that can respond to issues, manage applications and infrastructure, and adjust for cost and business value faster than the human beings that oversee it?

The answer is that just as Data Centers evolve using new technologies, IT Ops must also evolve by learning and using new skills to manage these new technologies.

Traditional IT Ops skills versus AIOps skills

Traditional IT Ops work focuses on producing and maintaining consistent, stable environments for service and application delivery. It is also concerned with meeting customer and user expectations and planning for growth and change. Traditional IT Ops tools try to provide useful information for the execution of these tasks. Generally, these tools use human domain knowledge or analytic techniques, or are modeled on them.

AIOps uses big data, algorithms, and machine learning to examine the profile of IT and business data, determine what “normal” looks like, find what factors are causal and correlative when things aren’t normal, and automatically recommend or implement a response. Machines execute these steps at incredibly fast rates on exponentially increasing amounts of data.

With AIOps, IT Ops job skills expand to include auditing AIOps results. IT Ops will need to understand how and why the AIOps platform is producing the outcomes it’s recommending or implementing. In an AIOps environment, IT Ops personnel need an enhanced skill set that helps them oversee the machine’s work, rather than just performing the work themselves. The AI skills gap is very real, as pointed out in a Forbes article which reveals “the AI skills gap is the largest barrier to AI adoption, although data challenges, company culture, hardware and other company resources are also impediments.” So, how can IT Ops teams keep the AIOps skills gap from derailing their AIOps initiatives? Here are four skills IT Ops personnel will need as the world transitions into AIOps and application-centric infrastructures.

Skill #1: Auditing and Adjusting Machine Outcomes

In machine learning, there is a concept of ‘supervised’ and ‘unsupervised’ learning. Supervised learning is where one trains a system using sample (historical) data. When the system outputs expected results, it is considered ‘trained’ and can be applied to new data. Unsupervised learning is where no training data is provided and the system must organize and analyze data with no outside guidance.

AIOps will almost always involve supervised learning. IT Ops personnel will need a good understanding of the algorithms behind AIOps processing in order to train and validate the system. They won’t need to be data scientists or understand complex math to do this, but they’ll need a better understanding of how the machine learning algorithms apply analytics to the data. The goal is to understand the “why” of the machine-produced outcomes so that they can be accepted, rejected or adjusted.

As a simple example, in the traditional IT Ops world, you might set a threshold on a specific metric, such as processor utilization at 70%. When CPU utilization hits 70%, your monitoring software is configured to send you an alert so that you can investigate. You do this because you know from experience that 70% is the point at which something problematic happens or an undesired state of affairs is indicated. 70% may or may not be exactly the right number, but it works well enough for you to get the job done.

In an AIOps world, the machine examining your data will create a baseline of what normal looks like for CPU utilization. Told what the metric is for the problem or undesired state, the machine can look more closely at the relationship between CPU and that metric. It will then determine the right threshold for when to send alerts or to make an automatic adjustment (such as assigning more capacity or adjusting runaway job resources). The machine may discover that a different threshold is more accurate or gives you more lead time, that the issue correlates with another metric you should be monitoring instead, or that it only happens when a series of conditions apply, not just CPU activity.
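The idea of the machine finding a better threshold than the rule-of-thumb 70% can be shown with a small supervised example: given historical CPU readings labeled with whether a problem actually followed, pick the threshold that misclassifies the fewest of them. This is a minimal sketch under stated assumptions; the labeled data is invented, the function name is hypothetical, and real platforms weigh many metrics and conditions rather than one.

```python
def best_threshold(samples):
    """Pick the alert threshold that misclassifies the fewest labeled samples.

    samples: list of (cpu_percent, had_problem) pairs from past incidents.
    Ties go to the lower threshold, which gives operators more lead time.
    """
    candidates = sorted({cpu for cpu, _ in samples})

    def errors(threshold):
        # An alert fires when cpu >= threshold; count alerts that were wrong
        # plus problems that were missed.
        return sum((cpu >= threshold) != had_problem for cpu, had_problem in samples)

    return min(candidates, key=lambda threshold: (errors(threshold), threshold))


# Invented history: (CPU %, did a user-facing problem follow?)
history = [
    (40, False), (55, False), (62, False), (64, False),
    (66, True), (70, True), (81, True),
]
print(best_threshold(history))  # 66
```

In this invented history, problems start at 66% rather than 70%, so the learned threshold alerts earlier than the hand-picked one. Auditing an outcome like this means checking which labeled samples drove the choice, which is the kind of review the following paragraphs describe.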

IT Ops personnel will need a deep enough understanding of how machine learning analytics work so that, when they turn control over to the machine, they can audit how that automated control is evolving and doing its job. With AIOps, IT Ops moves from a totally manual process to an auditing and adjustment process, where you’re fine-tuning the system according to changes in your environment that the machine learning algorithms need to learn. Seasonal events (e.g. Black Friday, Amazon Prime Day) as well as one-off events (marketing campaigns, launches) will introduce new data to which the system will need to adjust, and those adjustments will need to be validated by IT operators.

AIOps auditing and management is a key skill that ITOps will need to develop. It will be informed by the specific working environment (tribal knowledge) and the industry. Some skill training will come from vendors, some will be obtained through self-education, and some will be obtained through certification. AIOps education will be similar to the education staff had to obtain when they learned network skills, and you should expect a similar process for AIOps management.

Skill #2: Understand APIs and other modern-stack application technologies

As I’ve noted before, with application-centric infrastructures, DevOps, and Agile software development, IT Ops are increasingly taking responsibility for resolving application issues that software developers previously handled. Regardless of where your organization is with application delivery, it is undeniable that the application has become king and developers are getting consistently more influence and budget.

IT Ops must now speak the language of developers (APIs, continuous delivery), understand application technologies (microservices, containers) and determine the correct way to measure their impact on the IT ecosystem (and respond when things go wrong). For example, IT Ops needs to be able to answer:

  • Is an application processing data correctly and do we need to correct any data issues?
  • What portions of the code are causing issues?
  • Is code execution or a database call causing slow response time?
  • Is a 3rd party service or API impacting application performance?
  • Is auto-scaling in cloud services (AWS, Azure) delivering performance at the right price?
  • Is engaging multiple APIs or external services introducing latency?

And many other questions besides. In addition to understanding, IT Ops must also open channels of communication with developers to alert and collaborate on application-related issues.

Perhaps the key application technology for today’s enterprise is the on-demand cloud. Application developers have essentially been given carte blanche to use cloud resources as they see fit, while the organizational budget for cloud sits with IT Ops. Developers may not care individually about a $30-$50 a month bill, but across thousands of developers in the organization, costs add up. IT Ops must gain visibility into what is happening with cloud resources and an understanding of workload profiles in order to determine where workloads should be placed for cost/performance optimization.
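A quick back-of-the-envelope calculation shows how fast those small bills add up. The per-developer range comes from the paragraph above; the headcount of 1,000 is an assumed lower bound for "thousands of developers", and the function name is hypothetical.

```python
def monthly_cloud_spend(developers, low_per_dev=30, high_per_dev=50):
    """Return the (low, high) monthly spend in dollars for a given headcount."""
    return developers * low_per_dev, developers * high_per_dev


low, high = monthly_cloud_spend(1000)
print(f"Monthly: ${low:,} to ${high:,}")            # Monthly: $30,000 to $50,000
print(f"Annual:  ${low * 12:,} to ${high * 12:,}")  # Annual:  $360,000 to $600,000
```

Even at the assumed lower bound of 1,000 developers, individually negligible bills become a six-figure annual line item in the IT Ops budget.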

Duties that used to be handled by applications programmers are now shifting to IT Ops. Applications are becoming more function and service specific and are being built as services that talk to each other through APIs. Cloud resources used by developers are still owned by IT Ops. A working familiarity with APIs and other applications technologies (what they do, how to test, how to address, etc.) is becoming a requirement for IT Ops. It will also be needed for AIOps management.

Skill #3: Security, Security, Security

If your IT Ops organization isn’t already responsible for security, understanding what a security event is in an operational context and how to react to it is critical. In many organizations, security functions are siloed away from IT Operations. As AIOps becomes more prevalent, a security event storm such as a denial of service attack or one of the recent ransomware attacks will likely be detected quickly by AIOps machine learning. Knowing how to recognize such an event as a security failure rather than an operational failure, and responding to it as such, will again be critical. In the AIOps environment, awareness of security issues and how IT Ops personnel should react to them will matter more than ever.

Skill #4: Working Closely with Lines of Business

IT has come a long way since the days of being separated from the business and perceived as being simply ‘problem solvers’ as opposed to harbingers of real business value. However, the advent of AI will necessitate IT taking an even more proactive approach to working with their lines of business to ensure success of AIOps initiatives. In their ‘Leadership Vision for 2019: Infrastructure and Operations Leader, September 2018’ report, Gartner states:

“I&O leaders have to be active contributors toward their organizations digital success in delivering growth by helping scale digital initiatives throughout their enterprise. Infrastructure and operations (I&O) leaders have to do so primarily by better aligning their capabilities with the business, by delivering improving quality, and by lowering costs.”

AIOps creates many opportunities for I&O teams to achieve that alignment with lines of business. By applying machine learning and data analytics to the rich data from infrastructure and application monitoring across multiple platforms, including on-prem and cloud, I&O teams can glean invaluable insights for lines of business. For example, I&O teams can increasingly take a leadership role in reducing costs for lines of business by analyzing usage patterns of apps and relating them back to cost reduction initiatives. And application-aware data on end user experience can deliver valuable insights for customer service, marketing and digital transformation initiatives.

AIOps teams will need increasingly savvy cross-domain communication skills to engage the lines of business and a strategic approach that enables them to be able to apply data insights to business objectives and priorities. That’s quite a shift for many IT folks – but an exciting and rewarding one. It also necessitates strong leadership at the CIO level to establish those relationships with the line of business leaders to pave the way for their AIOps teams to engage and deliver valuable insights gleaned from machine learning and data analytics.

It takes a generalist

Digital business innovation happens at the edge of an organization’s IT ecosystem. Once innovation matures into production, be it an infrastructure, application, or security improvement, ownership will pass over to IT Operations.

With the advent of AIOps and other new technologies such as application-centric infrastructures, microservices, and DevOps/Agile, IT Operations personnel will no longer be permitted to remain specialists in the area of IT performance management. They must become generalists in a number of different areas. IT Ops skill sets must evolve to include practical working knowledge of such things as machine learning/algorithm management, applications programming, and security. As our organizations become increasingly digitized, possessing these four skills will become the new normal for IT Ops.

]]>