Seth Paskin – BMC Software | Blogs

Total Economic Impact™ Study Finds 314% ROI with BMC Helix Discovery
https://s7280.pcdn.co/teistudy314roiwithbmchelixdiscovery/ – Fri, 14 Jul 2023

IT teams cannot manage what they cannot see, and that challenge is made more complex when trying to manage IT assets across on-premises, cloud, and hybrid environments, while also supporting DevOps teams and managing containerized and microservices-based application delivery.

BMC Helix Discovery, an industry-leading SaaS-based, cloud-native discovery and dependency mapping system, can solve these issues by delivering instant visibility across your entire IT estate. A recently released Forrester Consulting Total Economic Impact (TEI) study commissioned by BMC found benefits culminating in a significant 314 percent return on investment (ROI).

Before their implementation of BMC Helix Discovery, participants in the study were managing assets with legacy and homegrown solutions and manual processes, which led to security breaches, a lack of visibility, and outdated information, as well as high costs and inefficient processes. By implementing the solution and gaining end-to-end visibility, the composite organization, composed of interviewees with experience using BMC Helix Discovery, realized benefits that include:

  • Improved IT productivity: By gaining better visibility into the IT asset ecosystem and achieving accurate, real-time data, organizations netted $684,000 worth of increased IT team productivity related to managing assets and infrastructures.
  • More efficient IT asset incident resolution and recovery: By leveraging the solution’s better-quality data to identify the root cause of incidents and reduce time spent resolving them, end users avoided downtime worth $2.8 million.
  • Better IT asset optimization: Improved visibility helped organizations avoid the licensing fees and costs associated with managing unused assets, saving $2.5 million.
  • Improved IT asset security: By mitigating risk and improving their overall security posture with better visibility, organizations realized security improvements worth $1.2 million.

In addition to the quantifiable benefits of BMC Helix Discovery, the organizations in the study credited the solution with helping them grow their business while managing their complex IT infrastructures. To learn more about how BMC Helix Discovery can help your business increase IT team productivity, remediate and recover IT assets faster, and improve IT asset optimization and security, download the full study here.

Also consider watching the webinar with BMC experts and guest speaker Forrester analyst Will McKeon-White, in which they discuss how BMC Helix Discovery delivers rapid ROI, automates asset management, and reduces IT burden.

Achieve Compliance for CISA’s Binding Operational Directive 23-01 with BMC
https://www.bmc.com/blogs/achieve-compliance-for-cisas-binding-operational-directive/ – Fri, 09 Dec 2022

The United States Cybersecurity and Infrastructure Security Agency (CISA) released Binding Operational Directive 23-01, a compulsory directive to federal executive branch departments and agencies to safeguard federal information and information systems. Under the directive, agencies must have weekly automated asset discovery and vulnerability enumeration in place by April 3, 2023.

Federal agencies are embracing the challenge of managing and securing hardware and software assets across multi-cloud, on-premises, and mobile environments. This complexity comes with increased cybersecurity risk. One way organizations can manage this risk is through continuous and comprehensive asset visibility. Maintaining an accurate and up-to-date accounting of assets residing on federal networks is also critical for CISA to effectively manage cybersecurity for the Federal Civilian Executive Branch (FCEB) enterprise.

The new requirements

Binding Directive 23-01 focuses on two core areas:

  • Asset discovery as a building block of operational visibility, defined as an activity through which an organization identifies the network-addressable IP assets that reside on its networks and their associated IP addresses (hosts).
  • Vulnerability enumeration identifies and reports suspected vulnerabilities on those assets. It detects host attributes (e.g., operating systems, applications, open ports, etc.) and attempts to identify outdated software versions, missing updates, and misconfigurations. It validates compliance with or deviations from security policies by identifying host attributes and matching them with information on known vulnerabilities.
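
To make the second bullet concrete, here is a minimal, illustrative sketch of the matching step: discovered host attributes (installed packages and versions) are compared against a list of known-vulnerable versions. The package names, versions, and thresholds are invented for the example; this is not CISA’s or BMC’s implementation.

```python
# Illustrative only: match discovered host attributes against known-vulnerable versions.
# The inventory and vulnerability data below are invented for the example.

known_vulnerable = {
    # package name -> highest version still considered vulnerable (hypothetical)
    "openssl": (1, 0, 2),
    "log4j": (2, 16, 0),
}

discovered_hosts = [
    {"host": "web-01", "os": "linux", "packages": {"openssl": "1.0.1", "nginx": "1.18.0"}},
    {"host": "app-02", "os": "linux", "packages": {"log4j": "2.17.1", "openjdk": "11.0.2"}},
]

def parse_version(text):
    """Turn '1.0.2' into a comparable tuple (1, 0, 2)."""
    return tuple(int(part) for part in text.split("."))

def enumerate_vulnerabilities(hosts, vulnerable):
    findings = []
    for host in hosts:
        for package, version in host["packages"].items():
            threshold = vulnerable.get(package)
            if threshold and parse_version(version) <= threshold:
                findings.append((host["host"], package, version))
    return findings

for host, package, version in enumerate_vulnerabilities(discovered_hosts, known_vulnerable):
    print(f"{host}: {package} {version} matches a known vulnerability")
```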

BMC answers the call

You can’t manage what you can’t see. Below are the ways that BMC Helix Discovery, a FedRAMP Moderate-certified, SaaS solution delivered on Amazon Web Services (AWS), can help you meet the Binding Operational Directive 23-01 requirements:

  • Requirement: Maintain an up-to-date inventory of networked assets
    BMC Helix Discovery: Inventories networked hardware and software assets across cloud, hybrid, and on-premises environments, with the added benefit of relationship/dependency mapping and service modeling.
  • Requirement: Perform automated asset discovery every seven days
    BMC Helix Discovery: Agentless discovery of assets with automated scheduling at any interval (hourly, daily, weekly, etc.)
  • Requirement: Initiate vulnerability enumeration across all discovered assets, including all discovered nomadic/roaming devices, every 14 days
    BMC Helix Discovery: Completely catalogs asset configurations and profiles for vulnerability enumeration at every scan
  • Requirement: Develop and maintain the operational capability for on-demand asset discovery and vulnerability enumeration to identify specific assets or subsets of vulnerabilities within 72 hours of a CISA request and provide results within seven days
    BMC Helix Discovery: Can be executed on demand to meet CISA requests and immediately provides results
  • Requirement: Perform the same type of vulnerability enumeration on mobile devices and other devices that reside outside of an agency’s on-premises networks
    BMC Helix Discovery: Treats mobile devices and other offsite devices, including tablets, iOS, and Android devices, the same as on-premises networked assets

BMC Helix Discovery provides real-time visibility into hardware and software assets as well as their relationships and service dependencies across on-premises and cloud environments. It is designed to handle the complexity of managing a wide spectrum of configurations, including physical and logical components. Learn more about what BMC Helix Discovery can do to help your agency meet CISA’s Binding Operational Directive 23-01 requirements. Reach out to federal@bmc.com, speak to your BMC Account Team, or visit www.bmc.com/discovery.

Power of AIOps Visualization: Moving from One Dimension to a Layered Topology
https://www.bmc.com/blogs/aiops-layered-topology/ – Fri, 30 Sep 2022

Much of the attention on artificial intelligence for IT operations (AIOps) focuses on artificial intelligence and machine learning (AI/ML)—and rightly so. The promise of AIOps is having machines analyze data, identify issues, and act automatically to remediate them in real time—which humans simply can’t do. There are many use cases that AIOps in its current state can address, but what about situations where IT operations (ITOps) or site reliability engineers (SREs) need to intervene? What are we doing to improve that experience?

Information presentation is a critical part of ITOps. The space has evolved quite a lot over the last few years, with single-metric views giving way to consolidated dashboards; overlaying events on metrics; timeline views that correlate metrics, events, changes, and incidents; and service health scoring. One area that has seen less innovation is that of topology visualization, or service modeling.

Business service topology mapping

Traditionally, topologies have been represented in a one-dimensional hierarchy, like this:

Figure 1. A representative one-dimensional topology map.

BMC has always been at the forefront of asset and dependency discovery and mapping. Our industry-leading capabilities are the foundation of our BMC Helix service and operations management solutions.

A new way of looking at topologies

We’ve been thinking about ways we can enhance topology representations for IT operators and SREs. Beyond shapes, colors, and images, there is an opportunity to use three-dimensional models to represent different layers of a technology stack (e.g., containers, database, network, compute).

And that’s just what we’ve done. We’ve created a “layered topology” view that separates logical groupings of technologies into layers that can be visualized in relationship to each other or abstracted from the full topology view. To do this, you need complete and accurate asset and dependency mapping, which we have with our dynamic service modeling capability.

Here’s a two-minute video demonstrating layered topology in our BMC Helix Operations Management with AIOps solution.

The three-dimensional visualization is simple, intuitive, and easy to navigate. IT operators or SREs can immediately see both the location and the depth of the issue. They can bring in only the relevant teams to work on resolution, and those teams in turn have access to important contextual information that is presented clearly, making it easy to investigate.

This is just one of many examples showing how BMC is innovating to realize our vision of ServiceOps and help customers transform their organizations. Click here to learn more about BMC Helix Operations Management with AIOps.

What Is AIOps? A Beginner’s Guide
https://www.bmc.com/blogs/what-is-aiops/ – Fri, 17 Apr 2020

When I first wrote about AIOps in 2017, Gartner was predicting that IT operations (ITOps) personnel were in for a major change over the next few years. Traditional IT management techniques were viewed as unable to cope with digital business transformation. Gartner predicted that there would be significant changes in ITOps procedures and a restructuring of how we manage our IT ecosystems. They called the evolving platform on which these changes would take place “AIOps.”

Changes in IT over the intervening years have proven Gartner correct. Interest in and adoption of AIOps have increased exponentially as organizations have sought to:

  • Enable innovation
  • Fend off disruptors
  • Manage the velocity, volume, and variety of digital data that is beyond human scale

This article covers the original and current market drivers of AIOps and its components and benefits. It’s also been updated with the latest release of the Gartner Market Guide to AIOps.

Digital transformation: The road to AIOps

It’s important to understand how digital transformation gave rise to Gartner’s AIOps platform.

Digital transformation encompasses DevOps and the adoption of cloud and new technologies like containers. It represents a shift from centralized IT to applications and developers, an increased pace of innovation and deployment, and the acquisition of new digital users—machine agents, Internet of Things (IoT) devices, Application Program Interfaces (APIs), etc.—that organizations previously didn’t need to service.

All of these new technologies and users are straining traditional performance and service management strategies and tools to the breaking point. AIOps is the ITOps paradigm shift required to handle these digital transformation issues.

What is AIOps?

AIOps is short for artificial intelligence for IT operations. It refers to multi-layered technology platforms that automate and enhance IT operations through analytics and machine learning (ML). AIOps platforms leverage big data, collecting a variety of data from various IT operations tools and devices in order to automatically spot and react to issues in real-time while still providing traditional historical analytics.

Gartner explains how an AIOps platform works by using the diagram in Figure 1. AIOps has two main components: big data and ML. It requires a move away from siloed IT data in order to aggregate observational data (such as that found in monitoring systems and job logs) alongside engagement data (usually found in ticket, incident, and event recording) inside a big data platform.

AIOps then implements a comprehensive analytics and ML strategy against the combined IT data. The desired outcome is automation-driven insights that yield continuous improvements and fixes. AIOps can be thought of as continuous integration and deployment (CI/CD) for core IT functions.

Figure 1: Gartner’s visualization of the AIOps platform
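
As a rough illustration of the aggregation idea described above – observational data and engagement data normalized into one platform so analytics can run across both – the sketch below maps a monitoring event and a service desk incident into a single record format keyed by configuration item and timestamp. The field names and records are hypothetical, not a Gartner or BMC schema.

```python
from datetime import datetime

# Hypothetical source records: one from monitoring ("observe"), one from the service desk ("engage").
monitoring_event = {"ci": "db-prod-01", "time": "2020-04-17T10:02:00Z", "severity": "critical",
                    "message": "CPU utilization above 95%"}
service_desk_incident = {"affected_ci": "db-prod-01", "opened_at": "2020-04-17T10:05:30Z",
                         "priority": "P1", "summary": "Orders page timing out"}

def to_common_record(source, record):
    """Normalize differently shaped records into one schema for a shared data platform."""
    if source == "monitoring":
        return {"ci": record["ci"],
                "timestamp": datetime.fromisoformat(record["time"].replace("Z", "+00:00")),
                "kind": "event", "detail": record["message"]}
    if source == "service_desk":
        return {"ci": record["affected_ci"],
                "timestamp": datetime.fromisoformat(record["opened_at"].replace("Z", "+00:00")),
                "kind": "incident", "detail": record["summary"]}
    raise ValueError(f"unknown source: {source}")

combined = [to_common_record("monitoring", monitoring_event),
            to_common_record("service_desk", service_desk_incident)]
combined.sort(key=lambda r: r["timestamp"])  # a single, time-ordered data set for analytics
for r in combined:
    print(r["timestamp"], r["ci"], r["kind"], "-", r["detail"])
```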

To accomplish the goal of continuous insights and improvements, AIOps bridges three different IT disciplines:

  • Service management (“Engage”)
  • Performance management (“Observe”)
  • Automation (“Act”)

AIOps creates a game plan that recognizes that, within our new accelerated IT environments, there must be a new approach that’s underwritten by advances in big data and ML.

What’s driving AIOps?

AIOps is the evolution of IT operational analytics (ITOA). It grows out of several trends and needs affecting ITOps, including:

  • IT environments exceeding human scale. Traditional approaches to managing IT complexity—offline, manual efforts that require human intervention—don’t work in dynamic, elastic environments. Tracking and managing this complexity through manual, human oversight is no longer possible. ITOps has been exceeding human scale for years and it continues to get worse.
  • The amount of data that ITOps needs to retain is exponentially increasing. Performance monitoring is generating exponentially larger numbers of events and alerts. Service ticket volumes experience step-function increases with the introduction of IoT devices, APIs, mobile applications, and digital or machine users. Again, it is simply becoming too complex for manual reporting and analysis.
  • Infrastructure problems must be addressed at ever-increasing speeds. As organizations digitize their business, IT becomes the business. The “consumerization” of technology has changed user expectations for all industries. Reactions to IT events—whether real or perceived—need to occur immediately, particularly when an issue impacts user experience.
  • More computing power is moving to the edges of the network. The ease with which cloud infrastructure and third-party services can be adopted has empowered line of business (LOB) functions to build their own IT solutions and applications. Control and budget have shifted from the core of IT to the edge. And more computing power (that can be leveraged) is being added from outside core IT.
  • Developers have more power and influence but accountability still sits with core IT. As I talk about in my post on application-centric infrastructure, DevOps and Agile are forcing programmers to take on more monitoring responsibility at the application level, but accountability for the overall health of the IT ecosystem and the interaction between applications, services, and infrastructure still remains the province of core IT. ITOps is taking on more responsibility just as their networks are getting more complex.

Humans aren’t being replaced

Acknowledging that ITOps management is exceeding human scale does not mean that machines are replacing humans. It means we need big data, AI/ML, and automation to deal with the new reality. Humans aren’t replaced, but ITOps personnel will need to develop new skills. New roles will emerge.

The elements of AIOps

I’m going to take a moment here to go through the elements of AIOps as represented in the Gartner diagram above. While I encourage everyone to read the Market Guide, what follows should serve as an adequate grounding in the key pieces of the AIOps puzzle and how they contribute.

  • Extensive and diverse IT data. Enumerated in the black and blue chevrons, AIOps is predicated on bringing together diverse data from both IT operations management (ITOM) (metrics, events, etc.) and IT service management (ITSM) (incidents, changes, etc.). This is often referred to as “breaking down data silos”—bringing data together from disparate tools so they can “speak” to each other and accelerate root cause identification or enable automation.
  • Aggregated big data platform. At the heart of the platform, the center of the above graphic, is big data. As the data is liberated from siloed tools, it needs to be brought together to support next-level analytics. This needs to occur not just offline—as in a forensic investigation using historical data—but also in real-time as data is ingested. See my other post for more on AIOps and big data.
  • Machine learning. Big data enables the application of ML to analyze vast quantities of diverse data. This is not possible prior to bringing the data together nor by manual human effort. ML automates existing, manual analytics and enables new analytics on new data—all at a scale and speed unavailable without AIOps.
  • Observe. This is the evolution of the traditional ITOM domain that integrates development (traces) and other non-ITOM data (topology, business metrics) to enable new modalities of correlation and contextualization. In combination with real-time processing, probable-cause identification becomes simultaneous with issue generation.
  • Engage. The evolution of the traditional ITSM domain includes bi-directional communication with ITOM data to support the above analyses and auto-create documentation for audit and compliance/regulatory requirements. AI/ML expresses itself here in cognitive classification plus routing and intelligence at the user touchpoint, e.g., chatbots.
  • Act. This is the “final mile” of the AIOps value chain. Automating analysis, workflow, and documentation is for naught if responsibility for action is put back in human hands. Act encompasses the codification of human domain knowledge into the automation and orchestration of remediation and response.

The future of AIOps

Understanding what is driving AIOps and how it is a response gets us to the current state of the market. As IT moves beyond human scale, IT tooling needs to adapt. But simply reacting defensively is not enough. The organizations that embrace AIOps will see the challenge it is meant to address as an opportunity to grow, evolve, innovate, and disrupt.

Here are some ways that AIOps-enabled organizations will transform their business in the next five years.

  • Technology becomes more human: Analytics and orchestration enable frictionless experiences, allowing ubiquitous self-service.
  • The automation of technology, and, hence, business processes: Costs lower, speed increases, and errors decrease while freeing up human capital for higher-level achievement.
  • Enterprise ITOps gains DevOps agility: Continuous delivery extends to operations and the business.
  • Data becomes currency: The vast wealth of untapped business data is capitalized, unleashing high-value use cases and monetization opportunities.

At BMC, we call this vision of an AIOps-enabled future the Autonomous Digital Enterprise. Our mission is to enable our customers to innovate and differentiate quickly and continuously to deliver customer-driven value. The successful organizations of tomorrow will be the ones embracing intelligent, tech-enabled systems that allow them to thrive while others falter during times of massive change.

AIOps: seismic change, but not radical

Although AIOps is a seismic change for IT operations, it’s not a radical application of analytics and machine learning. A similar ML approach was implemented when stockbrokers moved from manual trading to machine trading. Analytics and ML are used in social media and in applications like Google Maps, Waze, and Yelp, as well as in online marketplaces like Amazon and eBay. These techniques are used reliably and extensively in environments where real-time responses to dynamically-changing conditions and user customization are required.

AIOps is the application of tried-and-true technology and processes to ITOps. ITOps personnel are typically slow to adopt new technologies because, out of necessity, our jobs have always been more conservative. It’s the job of ITOps to make sure the lights stay on and provide stability for the infrastructure that supports organizational applications.

We’ve passed the tipping point, however, and AIOps adoption is the key indicator for the trajectory of the digital enterprise.

Additional resources

For related reading, explore these resources:

AIOps Machine Learning: Supervised vs Unsupervised
https://www.bmc.com/blogs/supervised-vs-unsupervised-aiops-machine-learning/ – Thu, 16 May 2019

This post is intended to provide a short explanation of the difference between supervised and unsupervised machine learning (ML) and offer some simple examples of how we use them in TrueSight AIOps. I am not suggesting that you must have ML skills in your IT organization; rather that an understanding of how ML functions for IT Operations will help you evaluate AIOps strategy and vendors.

For a deeper discussion of evolving IT skill sets, see my other post.

What is “machine learning”?

Machine learning refers to a process of ‘training’ a machine to execute a task and is differentiated from writing software code to ‘program’ a machine to execute the same task.

In software programming, you tell a machine every specific action to take and in what order. You let it know in advance what outcomes to expect and how to deal with them. A software application is just a set of instructions to a machine about what to do and how to react to what happens (user input, feedback, data, etc.). “Bugs” are cases where the programmers failed to put in an instruction, did it incorrectly, didn’t account for output or user response, etc. That’s why good programming is hard: you have to anticipate every possibility and eventuality.

In machine learning, you are concerned with ‘what’ the program should accomplish, but not ‘how’. You don’t give a specific set of instructions to the machine to execute in order. You don’t tell it what data or input to expect and how to respond. You let the machine figure it out. This is a broad generalization but you get the point.

  • Obviously, you can’t start out from nothing. Machine learning requires the user to make design decisions about what analytics and algorithms the machine should use to learn.
  • Also obviously, the more complex and abstract your task is, the more complicated machine learning becomes. Therefore, machine learning is generally implemented to solve very specific problems.
  • Additionally, there are different ways to train the machine based on the desired task. These different approaches are captured in the terms ‘supervised’ and ‘unsupervised’ machine learning.

Supervised Machine Learning

If a machine needs to learn a task using sample data (“input”) and an expected outcome (“output”), then the learning is supervised. Supervised machine learning gives the machine a starting point – the input – and an end point – the output. The job of the machine is to infer how to get from input to output.

The machine must be told the ‘what’; it has to figure out the ‘how’. In supervised machine learning:

  • The ‘supervisor’ must make decisions about what sample data will best train the machine.
  • The supervisor must determine what learning algorithm should be used.
  • The supervisor must verify the accuracy of the machine output.

Once the machine can accurately give the expected output from the sample data, it can be considered ‘trained’. It can then be applied to input data that has not previously been analyzed. This type of machine learning is best used on data that is labeled (in the IT world = “structured”) to solve classification problems like ‘spam/not spam’ or ‘threat/not threat’ and regression problems like ‘when will metric X hit 90%’.
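
As a toy illustration of supervised learning in an IT context, the sketch below trains a text classifier on a handful of hand-labeled event messages (‘actionable’ vs. ‘noise’) and then applies it to messages it has not seen. It uses scikit-learn for brevity; the messages and labels are invented, and this is only a sketch of the idea, not how any particular product trains its models.

```python
# A toy supervised-learning example: labeled event messages train a classifier.
# Requires scikit-learn; the training data and labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

training_messages = [
    "disk usage at 95 percent on volume /data",
    "failed login attempts exceeded threshold",
    "scheduled backup completed successfully",
    "heartbeat received from agent",
    "database connection pool exhausted",
    "configuration sync finished with no changes",
]
# The "supervisor" provides the expected output for each input.
labels = ["actionable", "actionable", "noise", "noise", "actionable", "noise"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(training_messages, labels)  # the machine infers how to map input to output

for message in ["disk usage at 98 percent on volume /logs",
                "nightly backup completed successfully"]:
    print(message, "->", model.predict([message])[0])
```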

Unsupervised Machine Learning

Machine learning is unsupervised when you have input data but no expected outcome. With no expected outcome, you can’t train the machine, so the input data cannot serve as a training sample. Instead, the machine is tasked to learn from the data itself. There are no correct answers and no supervisor.

Unsupervised machine learning is used to look at the structure of the data or the distribution of elements in the data set. It is used for clustering, to identify inherent groupings like common phrases in logs and events, and for associations, like how frequently failure Y occurs when failure X also occurs.
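
As a rough sketch of the unsupervised case, the example below groups similar log lines with no labels and no expected output, using a simple string-similarity measure from the Python standard library as a stand-in for distance measures such as Levenshtein. The log lines and the similarity threshold are invented for illustration.

```python
# A toy unsupervised example: cluster similar log lines with no labels or expected output.
from difflib import SequenceMatcher

log_lines = [
    "ERROR timeout connecting to db-prod-01",
    "ERROR timeout connecting to db-prod-02",
    "WARN cache miss rate above 40 percent",
    "ERROR timeout connecting to db-prod-03",
    "WARN cache miss rate above 55 percent",
]

def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

clusters = []  # each cluster is a list of similar lines
for line in log_lines:
    for cluster in clusters:
        if similarity(line, cluster[0]) > 0.7:  # threshold chosen arbitrarily for the example
            cluster.append(line)
            break
    else:
        clusters.append([line])

for i, cluster in enumerate(clusters, start=1):
    print(f"cluster {i}: {len(cluster)} lines, e.g. '{cluster[0]}'")
```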

Machine Learning Considerations and BMC Implementations

What type of machine learning should be used depends on the data available and the problem you are trying to solve. No one approach works for everything, and even within the same area, different approaches have tradeoffs. Some considerations for machine learning in AIOps:

  • Whether you pick supervised or unsupervised learning depends on the problem. IT has problems that fit both profiles. There is no one single “correct” approach: different IT problems require different approaches and multiple different approaches can be used to solve a specific problem.
  • Someone must make design decisions about which algorithms are used for machine learning and in the case of supervision, what data is used to train the system and what constitutes “correct”. If not done by a vendor, customers must supply that knowledge themselves.
    • NOTE: If you don’t know what ‘good’ looks like, you can’t supervise machine learning.
  • A lot of enterprise IT data is similarly structured regardless of industry or application. E.g. CPU utilization – as a data input – is highly structured and follows general patterns regardless of what workload is running on the server you are monitoring. This means for many use cases, vendors can build machine learning analytics that will be broadly applicable across different IT environments – which BMC in fact does.

Some examples of machine learning in TrueSight AIOps

Here are some examples of machine learning analytics that BMC has implemented and to which products they apply. For each one I indicate whether we have added proprietary BMC IT domain knowledge (e.g. IT data model output for supervised learning) and what value the analytics provide.

Forecasting

Forecasting is determining when metrics will hit thresholds and performing “what if?” scenarios. (A simplified sketch follows the list below.)

  • Algorithms: Proprietary combination of multiple techniques including Linear regression, Regime change detection, Seasonality decomposition, Box and Jenkins method, and more.
  • BMC IT domain knowledge added? Yes
  • Type of Machine Learning: Supervised
    • The system is already trained by BMC, so you benefit without having to actively supervise, but you can modify parameters.
  • Products: TrueSight Capacity
  • Value:
    • Reduce on-premises cost up to 30% by optimization of IT resources
    • Reduce or eliminate infrastructure related application failures
    • Eliminate surprise infrastructure expenditures and budget over-runs
    • Plan for upcoming resource needs, budgets, and expenses
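
Here is the simplified forecasting sketch referenced above: fit a straight-line trend to a metric’s history and estimate when it will cross a threshold, plus a trivial “what if?” variation. Real forecasting combines many techniques (seasonality, regime changes, and so on); the data points here are generated for illustration.

```python
# A simplified forecasting sketch: fit a straight-line trend to a metric and estimate
# when it will cross a threshold. The data points are invented for illustration.
import numpy as np

days = np.arange(14)                          # two weeks of daily samples
disk_used_pct = 60 + 1.8 * days + np.random.default_rng(0).normal(0, 0.5, 14)

slope, intercept = np.polyfit(days, disk_used_pct, 1)   # simple linear regression
threshold = 90.0
days_to_threshold = (threshold - intercept) / slope

print(f"trend: {slope:.2f}% per day; ~{days_to_threshold:.1f} days until {threshold}% is reached")

# "What if?" scenario: growth accelerates by 50 percent.
print(f"what-if (1.5x growth): ~{(threshold - intercept) / (slope * 1.5):.1f} days")
```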

Dynamic Baselining

Determine future behavior of a metric based on that metric’s past behavior. Dynamic baselining incorporates seasonality. (A minimal sketch follows the list below.)

  • Algorithms: Poisson and normal linear regression
  • BMC domain knowledge added? Yes
  • Type of Machine Learning: Unsupervised
    • Historical data from the metric is used without training or a specific expected output
  • Products: TrueSight Capacity, TrueSight Operations Management
  • Value:
    • Reduce event noise up to 90% and improve productivity
    • Reduce the number of incidents generated from events up to 40%
    • Proactively remediate issues before any service impact to meet SLAs
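
Below is a minimal sketch of the dynamic-baselining idea: derive an expected band for a metric from its own history, per hour of day, and flag new samples that fall outside the band. The algorithm here is a simple mean-and-deviation band rather than the Poisson/regression approach named above, and the data and tolerance are invented.

```python
# A minimal dynamic-baselining sketch: build an expected band per hour of day from history,
# then flag new samples that fall outside it. Data and tolerance are invented for illustration.
from collections import defaultdict
from statistics import mean, pstdev

# history: (hour_of_day, metric_value) pairs, e.g. requests per second
history = [(9, 120), (9, 130), (9, 125), (9, 118),
           (14, 300), (14, 310), (14, 295), (14, 305)]

by_hour = defaultdict(list)
for hour, value in history:
    by_hour[hour].append(value)

# Baseline band = mean +/- 3 standard deviations for each hour (captures daily seasonality).
baseline = {hour: (mean(vals) - 3 * pstdev(vals), mean(vals) + 3 * pstdev(vals))
            for hour, vals in by_hour.items()}

def is_anomalous(hour, value):
    low, high = baseline[hour]
    return not (low <= value <= high)

print(is_anomalous(9, 127))   # within the 9 AM band -> False
print(is_anomalous(14, 450))  # far above the 2 PM band -> True
```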

Clustering

Find similarities and frequency distributions of word pairings in unstructured data (logs, notes, etc.).

  • Algorithms: Levenshtein (logs), Latent Dirichlet Allocation (events)
  • BMC domain knowledge added? Yes
  • Type of Machine Learning: Unsupervised
    • Data from logs or events used without specific outcome known in advance
  • Products: IT Data Analytics
  • Value: Reduce time to identify root cause up to 60%

Some concluding thoughts on machine learning in AIOps

All AIOps platforms use machine learning in some capacity to solve specific IT domain problems on specific data sets. Whether it is clustering on events, pattern matching on logs, modeling and forecasting on metrics or something else – someone has done the hard work of looking at what algorithms are best suited to the data and what approach to machine learning using those algorithms fits the desired outcome. If needed, they have also put in the hours to train the system. The value proposition of an AIOps platform is that IT operators are buying that expertise and research in addition to the monitoring or aggregating functions of the solution.

The ultimate benefit to the customer is removing the need for a user to have the appropriate analytic skill set, build and configure analytics and machine learning technology, execute analysis, modeling and system training and then implement it against their domain data. Ideally, the user can focus on operational tasks leveraging their IT and specific ecosystem domain knowledge, trusting the system to provide desired outcomes for decision or automation.

Organizations implementing AIOps platforms should do due diligence to understand as thoroughly as possible the data sets they need to analyze and the outcomes they want to achieve. They can then use those specific use cases to vet potential vendors through a proof of value. For a broader roadmap to implementing AIOps, please see my other post.

A Roadmap To AIOps
https://www.bmc.com/blogs/a-roadmap-to-aiops/ – Mon, 18 Jun 2018

In my conversations with customers about AIOps, I frequently hear concerns about maturity. Customers may believe, for example, they aren’t mature enough to implement analytics, or that there is a linear progression for AIOps capabilities and they must start from a certain point corresponding to their own maturity self-assessment. Oftentimes they say something like ‘I have to get X in place first before I can even think about Y’. Usually the “X” they are talking about is getting a handle on exploding amounts of events and alerts or unifying dispersed monitoring.

I understand and empathize with their concerns. At the same time, I think that decades of ITIL training, with its rigid and regimented processes – reinforced by analysts and vendors – have made it difficult for all of us to see what is possible or to envision alternative solutions to our long-standing problems. AIOps holds the promise of step-function improvement without the strictures of ITIL but there is very little practical guidance about what that might look like.

In this post I want to propose some concrete steps that I believe are required or highly desirable to build an AIOps practice. I will then offer a ‘roadmap’ for taking these steps in an AIOps implementation, indicating which are prerequisites for others, which can be pursued simultaneously and which have dependencies.

A Quick AIOps refresher

Gartner has identified an emerging IT market trend: traditional IT processes and tools are not suited to dealing with the challenges of modern digital business. (More information here) This has to do with the velocity, variety and volume of digital data; the distribution of responsibility and budget in the broader organization outside of IT; and the need to move from offline, historical analysis to real-time analytics.

Gartner’s response to this trend is AIOps: the merging of IT Service Management (ITSM), IT Operations Management (ITOM) and IT Automation at the data layer. That data must reside in a big data platform that supports the application of real-time analytics as well as deep historical queries. The analytics must be managed by machine learning that supports both supervised and unsupervised processing as the data streams in.

The idea is that tools in the IT silos remain sovereign, e.g. Service Management still handles requests, incidents, etc. and Performance Management still monitors metrics, events and logs, but that their data is joined and subjected to machine-driven analysis for the purposes of enabling a) better, faster decisions and b) process as well as task automation.

Keep the End State in Mind

Remember that the end state is a system where data streams freely from multiple IT data sources into a big data platform; that data is analyzed upon ingestion and post-processed with data from other sources and types; machine learning is used to manage and modify the analytics and algorithms; and automated workflows are triggered, whose output also becomes a data feed into the system. The system adapts and responds as data volumes, types and sources change, automatically adjusting response and informing administrators as needed.

Early stage: Identify your current use cases

In a situation of change, transformation and fluidity, the best place to start is with what you know. Most customers have initiatives around solving for use cases that they can’t currently accommodate, or adapting how they are currently solving for a use case to be more responsive, more scalable, or to accommodate new technologies.

I always encourage customers to enumerate the list of use cases that they currently address or want to address. Having disclosure and transparency around current ‘desired’ state opens the dialogue to:

  • Questioning the ‘why’ of those desired outcomes
  • Assessing the priority of specific use cases
  • Highlighting gaps in capability, tools, skills or process

This is a terrific starting point for developing an AIOps strategy that will be successful. Emphasis on “starting”. We don’t know what we don’t know – new use cases will come up, new desired outcomes will emerge and priorities will shift as your business and technologies change. New AIOps approaches will open new possibilities and pose new challenges.

The important thing is to start down a path with a purpose that bridges where you are to where you want to be. If where you want to be changes, no problem, you can course correct. But if you don’t know where you are, or don’t have a realistic understanding of what is needed to get to the desired state, you will end up unfocused and likely unsuccessful.

Early stage: Assess your data freedom

The foundational element for AIOps is the free flow of data from disparate tools into the big data repository. Accordingly, you must assess the ease and frequency with which you can get data out of your IT systems. The optimal model is streaming – being able to send data continuously in real-time.

Few IT monitoring and service desk tools support streaming of outbound data. They may support programmatic interaction via REST API in more current versions or iterations. However, if they are based on traditional relational databases like Oracle or SQL, even having a programmatic interface doesn’t mean that they will be able to support streaming. The performance impact to production systems using relational databases may be too great, as they are not designed to support the continuous outflow of data.

Getting clear on your data streaming capabilities is an early and high-priority activity in developing an AIOps strategy. Answer these questions for each data source:

  • How do I get data out of my current IT tools?
  • What data can I get?
  • Can I do it programmatically?
  • How frequently can I do it?

The constraints you discover may cause you to change your data consolidation strategy (e.g. start with batch uploads vs streaming) or consider replacing your IT tools with ones that will support real-time data streaming.
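
Where a tool exposes only a REST API and cannot stream, a common stopgap is to poll on a schedule and land the results somewhere the data platform can ingest them. The sketch below shows that pattern; the endpoint, credential, parameters, and field names are hypothetical placeholders, not any real product’s API.

```python
# Hypothetical stopgap: poll a REST API on an interval and append results to a local
# landing file for downstream ingestion. The URL, token, and fields are invented examples.
import json
import time
import requests

API_URL = "https://example.invalid/api/v1/events"   # placeholder endpoint
TOKEN = "replace-me"                                 # placeholder credential
POLL_SECONDS = 300                                   # how often the source can tolerate being queried

def poll_once(since_iso):
    """Fetch events created since a given timestamp; returns a list of dicts."""
    response = requests.get(API_URL,
                            headers={"Authorization": f"Bearer {TOKEN}"},
                            params={"created_after": since_iso},
                            timeout=30)
    response.raise_for_status()
    return response.json()

def run():
    last_poll = "1970-01-01T00:00:00Z"
    while True:
        events = poll_once(last_poll)
        with open("events_landing.jsonl", "a") as handle:
            for event in events:
                handle.write(json.dumps(event) + "\n")
        last_poll = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        time.sleep(POLL_SECONDS)   # batch polling, not true streaming

if __name__ == "__main__":
    run()
```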

Early stage: Agree on a system of record

A second foundational element for AIOps is organizational alignment and communication. Suggesting that IT Operations and IT Service Management come together to review joint data requires that the teams agree on a ‘source of truth’ and establish a regular cadence of interaction with clear roles and responsibilities. The latter is a larger topic that requires a longer conversation I will pursue at a later date. Here I want to focus on making joint decisions based on shared data.

The data I’m speaking of here is not all the data that might flow into the AIOps big data store for analysis. It is the data required for IT leaders and practitioners to understand what is happening in their environment, understand what actions have been or can be taken, make decisions, and ultimately track their effectiveness. With respect to agreement on data, teams must determine:

  • A minimum set of data that is required to overcome the limitations of the status quo
  • Where the data is to reside
  • The joint view/access that teams will share

In many mature IT organizations, that system is the Service Desk because in the traditional ITIL model, the Service Desk is where request, incident and change data was expected to co-exist. This model gets challenged, however, when DevOps teams use Jira to log defects and enhancements, when they use APM tools whose events and telemetry aren’t captured by IT Operations, or when Security teams work independently to identify threats.

Preparing to implement AIOps means identifying all of the effective causes and resultant indicators in your application, service or business value chain and putting a plan in place to bring that data together. You may leverage the big data platform if you can build meaningful dashboards on top of it that filter the mass aggregate of data for the specific uses of different IT audiences. Single data source – multiple views. However, it may make more sense in your environment to select a subset of data (e.g. Jira tickets, APM events, etc.) and feed it into your established system of record.

Early stage: Determine success criteria and begin tracking them

Successful management of any business, and certainly of IT, begins with an understanding of what key performance indicators (KPIs) or metrics best indicate success or failure. It seems facile to say but is worth repeating that:

  • Understanding what to measure
  • Implementing consistent and robust measurement
  • Regularly reporting out or providing visibility to the performance measures and
  • Holding responsible parties accountable

is required for actionable understanding of your business.

Most organizations measure lots of things. Most IT tools come with lots of measurement tools and templates. But frequently, the understanding of the business needed to identify which of those things is important is missing. I have been in many situations where teams report out to me on ‘performance’, but when I ask why such a measure is important or what is driving it, the response is a blank stare or ‘I’ll get back to you’.

Quantity doesn’t trump quality in measurement. It may be that there is only one thing that needs to be measured – assuming you know what drives that measure up or down. Those drivers may need to be measured too, but without understanding the causal relationships, simply throwing graphs on a chart is unhelpful and more often detrimental. Understanding your KPIs is understanding your business.

Also often neglected is a comprehensive process for sharing information, engaging stakeholders, determining actions and holding people accountable. Visibility is primary, but visibility without action or response is empty. When action is required, people and teams need to make commitments with timelines and execute against them. These need to be documented and measured as well to ensure that the business, and hence the KPIs, move in the right direction.

Mid stage: Assess current and future state data models

This is one that is critical, but which few customers understand or feel comfortable addressing. Essentially, you must take stock of the data model for each of the data sources you want to use for your AIOps solution and the data model that is required to realize the AIOps use cases and determine how the data from different sources will interact to deliver the desired results.

The reason this is challenging is that the data model in most IT tools is hidden from the user, few organizations understand how big data platforms (NoSQL) differ from traditional databases (SQL), and fewer still have data analyst or data science expertise. I have written a separate blog post here on big data for AIOps that gives some background and context. Here I want to address the idea of data ‘relationships’ for the purposes of analytics.

The AIOps approach is to join data from different IT (and non-IT sources) in a single big data repository. The idea is then to make that data ‘talk to each other’; to find relationships in the data that will yield insights unattainable when the data sits separately in different silos. But what are those relationships? How can diverse data from different sources with different structures be brought together for analysis? And who can do it?

There are a number of shared data structures that can be processed by an AIOps system without additional modification from AIOps practitioners:

  • Timestamps – events, logs and metrics all have time signatures that can be used to bring them together around a point in time or a time window. Timestamps can be used to correlate events with each other and with time-series data for causal analysis (see the small sketch after this list).
  • Properties – using the term loosely for key-value pairs (key : value) of information associated with an event, log or metric such as ‘status’, ‘source’, ‘submitter’, etc. Properties can be used to create relationship models between different data sets.
  • Historicity – the past performance of time-series or event activity data. This can be used to forecast future performance or predict future threshold achievement (e.g. saturation, degradation, etc.)
  • Seasonality – the shape or regularity of time-series data over a day, week, month, etc. Seasonality can be used to correlate multiple data sets or to anticipate resource requirements, e.g. for scalability.
  • Application, service and business models – if you have a robust and regular discovery and configuration management practice, you can leverage these to inform an AIOps platform with asset relationship information for grouping, correlation, suppression, de-duplication, etc.
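
Here is the small timestamp-correlation sketch referenced in the first bullet: events are joined to metric samples that fall inside a time window around each event. The records and the five-minute window are invented for illustration.

```python
# A small sketch of timestamp-based correlation: attach metric samples that fall within
# a window around each event. Data and the window size are invented for illustration.
from datetime import datetime, timedelta

events = [
    {"time": datetime(2018, 6, 18, 10, 5), "source": "app-01", "text": "checkout latency alarm"},
]
metric_samples = [
    {"time": datetime(2018, 6, 18, 10, 2), "metric": "db_connections", "value": 480},
    {"time": datetime(2018, 6, 18, 10, 4), "metric": "db_connections", "value": 505},
    {"time": datetime(2018, 6, 18, 11, 0), "metric": "db_connections", "value": 210},
]

WINDOW = timedelta(minutes=5)

def correlate(events, samples, window):
    """Return (event, related_samples) pairs where samples fall inside +/- window of the event."""
    results = []
    for event in events:
        related = [s for s in samples if abs(s["time"] - event["time"]) <= window]
        results.append((event, related))
    return results

for event, related in correlate(events, metric_samples, WINDOW):
    print(event["text"], "->", [(s["metric"], s["value"]) for s in related])
```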

In general, IT time-series data is well formed and structured. Correlating, analyzing and forecasting time-series data is a fairly well-established practice in IT Operations monitoring and management tools. What changes for AIOps implementation is the need to bring together IT and non-IT data (e.g. user counts + performance, latency + conversions, etc.); to increase the granularity of data, e.g. from five minutes to sub-one-minute; and to apply analytics on streaming data – in ‘real-time’ or on ingestion – vs. ad-hoc historical queries.

For IT events that have structured, semi-structured or unstructured properties, AIOps represents a paradigm shift. To begin with, most IT event data is not well formed. Human-generated events are inconsistent, with large amounts of missing or unstructured data. Machine-generated events have more consistency, but are often incomplete and have large amounts of repetitive, semi-structured data. They also arrive in volumes an order of magnitude larger than human-generated events. Machine logs, seen as events, are essentially blobs of semi-structured data. For AIOps analysis of events to be effective, AIOps systems must overcome the challenges of poor, missing, incomplete, incorrect and unstructured data.

This is why much of the current activity in the AIOps space is centered on event management, analysis and correlation. Once data begins to flow into an AIOps platform, customers must consider how they will approach data structure and integrity to support machine analysis. One strategy is to perform ‘ETL’ (Extract, Transform, Load) on incoming data. Specifically, normalizing and transforming data as it flows in, to adhere to centralized standards so the data can be correlated and analyzed.

This approach suffers from limitations that will likely make it untenable for many enterprises. First, the amount of processing required to transform the data on ingestion but before analysis will likely either render the system not real-time or be cost prohibitive. Second, any centralized standard that is manually managed will require constant maintenance that will not be able to keep up with changes and will only comprehend the known, not the unknown or new.

A more promising strategy is “tagging”, which is what is employed as a best-practice in most cloud services. Tagging allows the hashing of variable attributes of different types of objects, which can then be referenced, sorted, correlated and analyzed using the tags – regardless of what the object is or how it is tagged. Instead of requiring mapping of pre-defined properties with common values, tags are fluid and can change with the data. Tagging is how NoSQL databases handle attributes and how hyper-scale analytics tools like Elasticsearch are enabled. Additionally, tagging can be done in real-time by machines as data flows in, which overcomes blindness to the unknown and human-scale limitations.
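
As a rough sketch of the tagging strategy described above, incoming records of any shape are annotated with simple key-value tags at ingestion and later queried by tag rather than by a fixed schema. The tag names, derivation rules, and records below are invented for illustration.

```python
# A rough sketch of tag-on-ingest: annotate incoming records of any shape with key-value
# tags, then query by tag instead of a fixed schema. Tags and records are invented examples.
def tag_record(record):
    """Attach simple tags derived from whatever attributes happen to be present."""
    tags = {}
    text = " ".join(str(v) for v in record.values()).lower()
    if "error" in text or "fail" in text:
        tags["severity"] = "error"
    for key in ("host", "ci", "node"):
        if key in record:
            tags["asset"] = record[key]        # normalize differently named asset fields
    tags["source"] = record.get("source", "unknown")
    return {"raw": record, "tags": tags}

incoming = [
    {"source": "syslog", "host": "web-01", "msg": "connection FAILED to upstream"},
    {"source": "apm", "ci": "checkout-svc", "event": "p99 latency 2.3s"},
]

store = [tag_record(r) for r in incoming]

# Query by tag, regardless of each record's original shape.
errors_by_asset = [r["tags"].get("asset") for r in store if r["tags"].get("severity") == "error"]
print(errors_by_asset)   # ['web-01']
```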

For customers looking to adopt an AIOps strategy, understanding current and desired data structures is a critical but secondary consideration. First you need to get the data flowing together. Any big data platform that supports an AIOps practice will have the capability to support the ETL or tagging approach. After data is flowing, you can determine which one works best for your business needs and budget.

Mid stage: Implement Existing Analytics Workflows

It is likely that when you begin your AIOps journey, you will already have certain analytics in place. I do not mean here the analytics that are embedded in your IT tools. I mean offline, mostly manual analytics that you do regularly, irregularly or periodically to identify areas for process improvement, reduce costs, improve performance, etc.

These manual efforts are precisely what your AIOps solution should address and automate in its first iteration. Once the data you use to do these investigations is flowing into your data platform, you should seek to recreate and automate the analyses. The initial value you will generate is reduction of manual effort spent on analysis, but you should also immediately be able to increase the frequency and perhaps the scope (data points, systems, permutations, etc.) of the analysis.

Remember that AIOps is intended to put you into a position of doing real-time analysis on data sets beyond human scale. The easiest way to move in this direction while simultaneously realizing immediate value is to reduce the time/effort and increase the speed/frequency with which you do analyses that are already part of your operational process.

Mid stage: Begin Implementation of Automation

Ah, automation. Everyone knows its value. Everyone knows they need it (or at least could use it). Few organizations put it into practice. Fewer still approach it as a practice with discipline. There used to be a mantra in performance management – ‘Monitor all the things!’ The mantra in the digital era is ‘Automate all the things!’

It should be sufficient to say that in a digital enterprise, data grows and moves at speeds beyond human scale. To address this you need to turn to machines to perform analysis and execute automation. There are, however, other process factors that impact the desperate need for IT operations to automate. Prominent among them is the rise of the developer and DevOps, more specifically “continuous” integration and delivery (CI/CD).

Let’s clarify something first: you automate tasks; you orchestrate processes. Task automation in IT Operations typically has been and remains segregated by tools. Your service desk has some automation, you have automated patching for your servers, you may automate some remediations from your monitoring tools. Orchestration across these tools is more difficult to achieve and rarely fully accomplished.

DevOps is essentially the automation of development tasks and their orchestration – to eliminate the bottlenecks caused by phased review processes in waterfall developments, segregated test and compliance activities and operational, pre-production interlocks. What this means for IT is that DevOps application teams creating the innovative cloud services impacting the business are now moving at lightning speed compared to the traditional application teams of the past.

For IT Operations to keep up, they must not only ‘automate all the things’, they must orchestrate them and also plug into the CI/CD tool chain. If you don’t know when things move from test to staging to production; if you don’t know who owns the code or what impact it has on production; if you can’t measure and identify developer backlog/productivity on business services, you can’t effectively manage your environment.

That is the situation that modern IT Ops finds itself in. They need to match the speed and agility of the DevOps teams spread throughout their organization while simultaneously adding visibility to those teams’ activities into their value chain. This begins by automating and orchestrating the things they already do – across siloed tools – and finding ways to connect, share information and communicate with the DevOps teams in their enterprises.

Late stage: Develop New Analytics Workflows

Above I talked about implementing existing, manual analytics workflows into your AIOps solution to automate and scale them. Once this is accomplished, you should have the bandwidth to:

  • Assess the value of those workflows
  • Modify and improve those workflows
  • Develop new workflows based on existing ones or to address gaps

Part of the problem with the ‘brute-force spreadsheet’ approach to doing analysis with disparate data sets is that the energy and focus it requires oftentimes exhausts the practitioner’s capacity to assess the value of what is being delivered. Reports have been promised, meetings are scheduled and expectations have been set. Unless a leader calls for a re-evaluation of the approach, rarely is the process questioned.

Once the existing process has been automated in the AIOps platform, the practitioner can step back and evaluate whether the necessary information is being analyzed, insights are being gained and results are actionable. Having done so, s/he can make improvements using the AIOps platform – which should be an order of magnitude easier than doing so in the spreadsheet(s) – and evaluate the impact of those changes.

Simultaneously, s/he can determine where information/insight gaps exist and envision higher-levels of analysis that leverage the outcomes of existing workflows. Again, the promise of AIOps is the ability not only to execute what heretofore wasn’t practically feasible; it’s doing it at a scale and speed that makes previously unrealized analytics opportunities possible.

Late stage: Adapt Organization to New Skill Sets

It should be obvious by now that if the AIOps platform is taking analysis and response activities off the plate of the IT Ops practitioner, the role of the IT practitioner will evolve. You will transition from needing someone with domain knowledge to tactically address issues to needing someone who can put that knowledge to use training the system.

This is not a simple semantic distinction. The ability to know when something is wrong, determine how to tell a system to alert about that fact and then fix it is fundamentally different from the ability to understand how systems are operating well or poorly, how the system is reading and reacting, and then adjust the system accordingly (or give appropriate guidance thereto).

IT Ops will move from a ‘practitioner’ to an ‘auditor’ role. This doesn’t require in-depth, data-science level understanding of machine analytics. It does require understanding how systems are processing data and whether the desired business outcomes are being achieved. Of all of the changes AIOps will bring to IT Operations, I believe this will be the most disruptive.

IT Operations has long had a bunker, hero mentality, particularly with monitoring teams. Giving up control to a machine will be one of the most difficult transitions those who have been steeped in the practice for decades will experience. Many will not succeed. This is an inevitable result of market trends as they exist now. The move to business beyond human scale will have significant consequences for the humans who have been used to managing it.

Organizations will have to cultivate this new skill in their existing – reduced – workforce or bring in talent that either has the skill or can adapt to the change. This will be challenging in two ways: the scarcity of such skills and the fact that the market may take a while to respond with the education, certification and practical opportunities necessary to build a robust AIOps labor force. It will take time for these changes to have noticeable impact and it may be that only the highest-performing organizations understand and realize it. But it will happen and will be a tectonic shift in the discipline of IT Operations.

Late stage: Customize Analytic Techniques

The last activity I will discuss is both the most speculative and the most contentious. It is the question of whether IT Operations organizations will need to develop a mature data science practice or not. Some analysts believe they do. I disagree. I believe in the segregation between domain and data science knowledge.

I have two preceding paradigms in mind: the scientist-analyst and the developer-analyst. Scientists have long been executing complex, data-intensive analyses. With the rise of machine computation, scientists had to develop, at least, the ability to craft the mathematical algorithms that they wanted to run on their data sets. At first, when computational resources were shared, scientists built their own analyses to be run on systems maintained by computer experts. The languages, parameters and constraints were dictated by the systems and scientists had to work within them.

In that paradigm, scientists developed specialized knowledge that allowed them to leverage the computational systems. Once computational resources and analytic languages became less expensive, more powerful and more accessible, scientists had to develop not only the domain knowledge in their fields, but also data science and computational knowledge sufficient to execute their desired analyses on contemporary computing platforms.

They were able to do this because:

  • Their programs were research, not commerce and hence weren’t subject to market or business pressures (at least not immediately like IT)
  • They were self-selected for the education, drive and acumen to learn and master both types of knowledge (Ph.D.)
  • They were afforded the time in an academic setting to acquire the skills and knowledge necessary
  • Failure to do so would be fatal to their careers, given the labor competition in academia

Let us contrast this with the developer-analyst. Currently the market stands in critical need of data science practitioners who can also implement their data science knowledge in code. In spite of the ubiquity of data science jobs and data science education (both formal and informal), the market is bereft of people who have M.A.- or Ph.D.-level knowledge of, for example, statistical modeling and are also at least adequate Python or R programmers.

This may change but I do not foresee it happening soon, if ever. It is simply the case that it is too hard for most people to learn the math required and too easy to make very good money with just the coding to incent them to take on more than that. And even if they did, they would still need the domain knowledge required for a particular industry or problem area.

Asking IT Operations practitioners to know math, IT and coding to manage infrastructure, applications and services is, I think, too much. In my vision of the future, IT Operations would be the stewards of semi-intelligent, semi-autonomous systems with deep knowledge of the domain, sufficient knowledge of the math to understand what the system is doing and no knowledge (or no need for knowledge of) coding.

In this paradigm, AIOps vendors provide systems that offer multiple analytics options from which practitioners select combinations that best fit their environments. Ideally this would require less knowledge of the math than of the business outcomes. Also ideally, the AIOps platforms would provide regression analysis that would suggest ‘best-fit’ options from which practitioners could make informed decisions.
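To make this concrete, here is a minimal sketch of what "suggesting best-fit options" could look like in practice: several candidate analytic techniques are scored against the same data and presented to the practitioner ranked, so the choice is about outcomes rather than math. The candidate models and the scoring metric are illustrative assumptions, not a description of any particular AIOps product.

```python
# Minimal sketch: ranking candidate analytic techniques by fit quality so a
# practitioner can pick from "best-fit" suggestions rather than tuning the math.
# The detectors and the scoring metric here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Score each candidate on the same data and present them ranked, best first.
ranked = sorted(
    ((cross_val_score(model, X, y, cv=5).mean(), name) for name, model in candidates.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: mean CV accuracy {score:.3f}")
```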

This is how I see new and customized analytics coming out of AIOps. Some organizations may have the wherewithal and will to staff teams of people with domain, data science and programmatic implementation expertise. For revenue-generating activities, this may make sense. I don't see a future where such an approach will be feasible for IT Operations.

Concluding Thoughts

I have offered 9 steps for an AIOps roadmap.

  1. Identify current use cases
  2. Agree on a system of record
  3. Determine success criteria and begin tracking them
  4. Assess current and future state data models
  5. Implement existing analytics workflows
  6. Begin implementation of automation
  7. Develop new analytics workflows
  8. Adapt organization to new skill sets
  9. Customize analytic techniques

#1 and #3 are table stakes for IT operations in its current state and so certainly for AIOps. If you don’t know what you are currently trying to accomplish and/or you can’t measure it, you can’t hope to manage it even with existing tools. #2 can be done with existing tools or it may be an assessment that current tools are unsatisfactory. If the latter, building out requirements for how different organizations will share a view of what is happening is the logical response. These are all early stage activities on your AIOps roadmap.

#4 is a requirement for any activities that follow. As I mentioned in that section, understanding your current and future data needs is paramount to a successful AIOps implementation. It can be done piecemeal, but it must be done. #5 depends on #4. #6 depends on #5 for the analytics portion of the AIOps process but automation of tasks and orchestration between tools can and should be pursued at whatever stage of maturity an IT organization finds itself.

#s 7, 8 and 9 are more intertwined and likely to evolve organically in tandem, taking different courses in different organizations. It may be impossible to forecast or plan at early- or even mid-stages for their eventualities but the highest performing organizations will comprehend them in their strategic horizons.

To paraphrase Peter Drucker, the future has “already happened”. The only IT organizations that aren’t thinking about how AIOps leveraging machine learning, big data and analytics will radically alter the way they function are those that haven’t realized it yet. And they are the ones likely to miss the almost limitless opportunities that digital transformation presents.

To learn more about BMC’s approach to AIOps and how we can help you on your journey, read more here or contact us.

]]>
Machine Learning, Data Science, Artificial Intelligence, Deep Learning, and Statistics https://www.bmc.com/blogs/machine-learning-data-science-artificial-intelligence-deep-learning-and-statistics/ Fri, 16 Feb 2018 00:00:34 +0000 http://www.bmc.com/blogs/?p=11875 Machine learning. Data science. Artificial intelligence. Deep learning. Statistics. Most organizations, companies and individuals today are using these technologies – whether they know it or not. If your work involves computers, you’re likely familiar with at least some of them – but the terms can be confusing, and their use sometimes conflicting. The 21st century […]]]>

Machine learning. Data science. Artificial intelligence. Deep learning. Statistics. Most organizations, companies and individuals today are using these technologies – whether they know it or not. If your work involves computers, you’re likely familiar with at least some of them – but the terms can be confusing, and their use sometimes conflicting.

The 21st century is the era of big data. Big data refers to data sets that are so large and complex that previous applications of data processing aren’t adequate. Researchers and companies are harnessing and experimenting with various methods of extracting value from big data. The global connected world offers infinite ways to generate, collect, and store data for analysis. Never before have we had access to this much data, and we are only now beginning to find ways to unleash the immense amount of meaning and information contained within.

The relatively recent concepts of data science, machine learning, and deep learning offer a new set of techniques and methods, but also find their way into hype and branding. Companies may adopt these terms without necessarily using their processes for a “cutting-edge” appeal to customers. In this article, we’ll explore the differences between these terms, whether they’re new or a return of the old, and whether they’re just different names for the same thing.

Statistics and artificial intelligence

Let’s begin with statistics, a field that has been around for decades, even centuries, before computers were invented. The study of statistics and the application of statistical modeling are subfields of mathematics. Both the theory and the applications are aimed at identifying and formalizing relationships among data variables, based on mathematical equations. Statistical modeling relies on tools like samples, populations, and hypotheses.

In the latter part of the 20th century, as access to computers became more widely available and computational power was commoditized, people began to do statistics in computational applications. This allowed for the treatment of larger and different data sets, as well as the application of statistical methods that were untenable without computing power.

Artificial Intelligence is ultimately an evolution of this first encounter between math and computer science. [For a fun romp through the history of AI, check this out] Statistical modeling started as a purely mathematical or scientific exercise, but when it became computational, the door opened to using statistics to solve ‘human’ problems. In the post-war era, due to enthusiastic optimism around the promise of computing as well as the belief that human thought processes were essentially computational, the idea that we could build an ‘artificial’ human intelligence gained currency.

In the 1960s, the field of artificial intelligence was formalized into a subset of computer science. New technology and a more expansive understanding of how human minds work changed artificial intelligence, from the original computational statistics paradigm to the modern idea that machines could mimic actual human capabilities, such as decision making and performing more “human” tasks.

Modern artificial intelligence is often broken into two areas: general artificial intelligence and applied artificial intelligence. Applied artificial intelligence is at play when we consider systems like driverless cars or machines that can smartly trade stocks. Much less common in practice is general artificial intelligence, the concept that a system could, in theory, handle any task, such as:

  • Planning
  • Getting around
  • Recognizing objects and sounds
  • Speaking and translating
  • Performing social or business transactions
  • Working creatively

The concept of artificial intelligence grows and shifts as technology advances, and likely will do so for the foreseeable future. Currently the only solid criterion for success or failure is how well it accomplishes applied tasks.

Machine learning

By 1959, the idea of artificial intelligence had gained solid traction in computer science. Arthur Samuel, a leader and expert in the field, imagined that instead of engineers “teaching” or programming computers to have what they need to carry out tasks, computers could perhaps teach themselves – learn something without being explicitly programmed to do so. Samuel called this “machine learning”.

Machine learning is a form of applied artificial intelligence, based on the theory that systems that can change actions and responses as they are exposed to more data will be more efficient, scalable and adaptable for certain applications compared to those explicitly programmed (by humans). There are certainly many current applications proving this point: navigation apps and recommendation engines (shopping, shows, etc.) being two of the obvious examples.

Machine learning is typically categorized as either ‘supervised’ or ‘unsupervised’. Supervised learning involves the machine inferring functions from known inputs to known outputs. Unsupervised machine learning works with the inputs only, transforming or finding patterns in the data itself without a known or expected output. For a more detailed discussion, see my blog about the differences between supervised and unsupervised machine learning.
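Here is a minimal sketch of that distinction using a common Python toolkit. The data set and the particular models are stand-ins chosen purely for illustration: the supervised model learns from labeled examples, while the unsupervised one finds groupings in the inputs alone.

```python
# Minimal sketch of the supervised/unsupervised distinction described above.
# Supervised: learn a mapping from known inputs to known outputs (labels).
# Unsupervised: find structure in the inputs alone, with no labels at all.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: fit on labeled examples, then predict labels for new inputs.
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("supervised accuracy:", clf.score(X_test, y_test))

# Unsupervised: group the same inputs into clusters without seeing any labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("unsupervised cluster sizes:", [int((clusters == c).sum()) for c in range(3)])
```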

Machine learning is a task-oriented application of statistical transformations. Accomplishing the task will require a process or set of steps, rules, etc. The process or set of rules to be followed in calculations or problem-solving operations is called an algorithm. When designing a learning machine, the engineer programs a set of algorithms through which the machine will process data.

As the machine learns – gets feedback – it typically will not change the employed statistical transformations but rather alter the algorithm. For example, if the machine is trained to factor two criteria in evaluating data and it learns that a third criterion correlates highly with the other two and improves the accuracy of the calculation, it could add that third criterion to the analysis. This would be a change to the steps (the algorithm), but not to the underlying math.
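A toy sketch of that idea, under the assumption that the "criteria" are numeric features of the data: the same model is scored with two criteria and then with a third, informative criterion added, and the improvement is what would justify changing the steps.

```python
# Toy illustration of the idea above: a model evaluated on two criteria, then on
# three, where the third criterion carries useful signal. The "criteria" here are
# synthetic features invented purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
f3 = rng.normal(size=n)                     # the additional, informative criterion
y = ((f1 + f2 + 2.0 * f3) > 0).astype(int)  # outcome depends strongly on f3

two_criteria = np.column_stack([f1, f2])
three_criteria = np.column_stack([f1, f2, f3])

model = LogisticRegression(max_iter=1000)
print("2 criteria:", cross_val_score(model, two_criteria, y, cv=5).mean())
print("3 criteria:", cross_val_score(model, three_criteria, y, cv=5).mean())
```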

Ultimately, machine learning is a way to “teach” computers to be adaptable to changes in data. We now have essentially infinite amounts of digital data being created constantly. The volume and diversity of that data increase rapidly and exponentially. Machine analysis has the advantages of speed, accuracy and lack of bias over human analysis, which is why machine learning is critical and has hit a tipping point.

Deep learning

Deep learning goes even further than machine learning as applied artificial intelligence – it could be considered the cutting edge, says industry expert Bernard Marr. Machine learning trains and works on large sets of finite data, e.g. all the cars made in the 2000s. Machine learning does a good job of learning from the ‘known but new’ but does not do well with the ‘unknown and new’.

Where machine learning learns from input data to produce a desired output, deep learning is designed to learn from input data and apply to other data. A paradigmatic case of deep learning is image identification. Suppose you want a machine to look at an image and determine what it represents to the human eye. A face, flower, landscape, truck, building, etc. To do this, the machine would have to learn from thousands or millions of images and then apply that knowledge to each specific new image you want it to identify.

Machine learning is not sufficient for this task because machine learning can only produce an output from a data set – whether according to a known algorithm or based on the inherent structure of the data. You might be able to use machine learning to determine whether an image was of an “X” – a flower, say – and it would learn and get more accurate. But that output is binary (yes/no) and is dependent on the algorithm, not the data. In the image recognition case, the outcome is not binary and not dependent on the algorithm.

This is because deep learning uses neural networks. Neural networks require their own deeper dive in another post but for our purposes here, we just need to understand that neural networks don’t calculate like typical machines. Rather than following an algorithm, neural networks are designed to make many ‘micro’ calculations about data. Which calculations, and in what order, is determined by the data, not an algorithm. Neural networks also support weighting data for ‘confidence’. This results in a system that is probabilistic rather than deterministic, and one that can handle tasks we think of as requiring more ‘human-like’ judgement.

Deep learning neural networks are large and complex, requiring many layers and distributions of micro calculations. The machine still trains on data, but it can perform more nuanced actions than machine learning. Deep learning is appropriate for machine classification tasks like facial, image, or handwriting recognition.
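As a rough illustration (not a production deep learning system), here is a tiny multi-layer network classifying handwritten digits; the point is that its output is a confidence for every possible class rather than a single yes/no answer. The layer sizes and data set are assumptions chosen only to keep the sketch small.

```python
# Minimal sketch of the neural-network idea above: many small weighted
# calculations arranged in layers, producing probabilistic (not binary) outputs.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
net.fit(X_train, y_train)

print("test accuracy:", net.score(X_test, y_test))
# Probabilistic output: a confidence for every possible digit, not a yes/no answer.
print("class probabilities for one image:", net.predict_proba(X_test[:1]).round(3))
```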

Here are interesting examples of current, real-world technology using machine learning and deep learning:

  • Driverless cars use sensors and onboard analytics to better recognize obstacles, so they can react more quickly and appropriately.
  • Software applications are able to recolor black and white images by recognizing objects and predicting the colors that humans see.
  • Machines are able to predict the outcome of legal proceedings when basic case facts are input into the computer.

Data science

Statistics is a field of mathematics. Artificial intelligence, deep learning and machine learning all fit within the realm of computer science. Data science is a separate thing altogether.

Formally defined, data science is an interdisciplinary approach to data mining, which combines statistics, many fields of computer science, and scientific methods and processes in order to mine data in automated ways, without human interaction. Modern data science is increasingly concerned with big data.

Data science has many tools, techniques, and algorithms culled from these fields, plus others – in order to handle big data. The goal of data science, somewhat similar to machine learning, is to make accurate predictions and to automate and perform transactions in real time, such as purchasing internet traffic or automatically generating content.

Data science relies less on math and coding and more on data and building new systems to process the data. Relying on the fields of data integration, distributed architecture, automated machine learning, data visualization, data engineering, and automated data-driven decisions, data science can cover an entire spectrum of data processing, not only the algorithms or statistics related to data.

Terminology branding

These terms are sometimes used interchangeably, and sometimes even incorrectly. A company with a new technology to sell may talk about its innovative data science techniques when, really, it may be using nothing close to them. In this way, companies are simply aligning themselves with what the concepts stand for: innovation, forward-thinking, and newfound uses for technology and our data. This isn’t inherently bad; it’s simply a caution that a company claiming to use these tools in its product design doesn’t mean it actually does. Caveat emptor.

]]>
Concerns and Challenges of IT Leaders Considering AIOps Platforms https://www.bmc.com/blogs/concerns-and-challenges-of-it-leaders-considering-aiops-platforms/ Thu, 11 Jan 2018 18:52:16 +0000 http://www.bmc.com/blogs/?p=11722 In my role as Principal Solutions Marketing Manager for BMC’s TrueSight SaaS analytics and monitoring solutions, I’ve had the opportunity to talk with hundreds of customers this past year about Artificial Intelligence for IT Operations (AIOps). If you haven’t heard about AIOps before, learn more here. In both 1×1 meetings and at events, I’ve spoken […]]]>

In my role as Principal Solutions Marketing Manager for BMC’s TrueSight SaaS analytics and monitoring solutions, I’ve had the opportunity to talk with hundreds of customers this past year about Artificial Intelligence for IT Operations (AIOps). If you haven’t heard about AIOps before, learn more here.

In both 1×1 meetings and at events, I’ve spoken to ITOM and ITSM professionals as well as I&O leaders about what AIOps means to them. In this blog post I’d like to share what I’ve heard (and what I’ve inferred) from IT professionals about what they think about AIOps platforms, vendors, artificial intelligence and machine learning.

There is tremendous excitement about the promise of AIOps

IT leaders are optimistic. They have known for some time that the challenges of digital transformation can’t be met by the traditional IT approach. AIOps shows real promise as a path to success. The idea of a real step-function evolution of IT is energizing and empowering. AIOps also uplevels the conversation from IT silos, bringing practitioners across disciplines together with executive decision makers. It’s a rallying point for the whole IT organization.

But…they feel burned by past unfulfilled promises

IT executives I speak with complain frequently that software vendors have sold them on the promise of a new solution and then not helped them realize the value of that solution. On their side, they acknowledge that an aversion to professional services, which can accelerate time-to-value, and cultural inertia against adopting new technology, workflows and processes all contribute to failed value realization as well.

This often adds up to significant investment in IT that hasn’t delivered as promised. I&O leaders I speak with are understandably skeptical hearing about new solutions from vendors from whom they purchased solutions that went un- or under-utilized. ITOM professionals who haven’t leveraged all capabilities of a solution are right to question why they need additional software.

And…they are skeptical of the analytics and machine learning ‘hype’

IT pros who have been on the front lines for a long time also express (directly and indirectly) skepticism about the efficacy of analytics and machine learning – all while knowing they need them to address digital transformation. Many have already piloted or tried analytics initiatives in-house or with other vendors prior to our speaking. Results vary from ‘failed’ to ‘mixed’.

This failure to see concrete results from their own, or others’, analytics and machine learning initiatives drives their skepticism. IT is not a research project – they want to see demonstrable outcomes, to understand the analytics process, and to see that it connects to real, measurable problems. Part of this is unclear expectations and perhaps an unwillingness to ‘start the longest journey with a single step’, but vendors must take a strong lead while the market for AIOps is immature and guide customers to a crawl-walk-run approach. Not sexy, but necessary.

They think their data quality is too poor

The first step in building analytics is getting data together. For AIOps, this means immense quantities of varied IT data – like events, tickets, metrics, logs, etc. At this point in its evolution, solutions in the AIOps market are pretty good at doing this. See my other blog post on AIOps and big data for more information.

The organizations I’ve spoken with who have succeeded in this first step have almost universally discovered (or, more properly, ‘validated’) that their data quality is poor. Data can be missing, incomplete, unhelpful, garbled, full of noise, inconsistent, etc. During a Proof of Concept with one customer, we found that 70% of their service tickets had no categorization. Another had free text ‘description’, ‘notes’ and ‘resolution’ fields for each event that contained all the relevant details.

What we generally find is: ‘structured’ data is poor and critical information is in ‘unstructured’ data. In many cases, data quality issues exposed in data aggregation have meant that analytics simply fail or return results that are not trusted.
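A rough sketch of the kind of profiling that surfaces these issues, run against a hypothetical ticket export; the column names and values are invented for illustration, and a real export will look different.

```python
# Rough sketch of a data-quality check on a hypothetical ticket export.
# The column names ("category", "description") are assumptions for illustration.
import pandas as pd

tickets = pd.DataFrame({
    "category": ["network", None, "", "other", None],
    "description": [
        "VPN drops every 30 minutes",
        "account lockout after password reset",
        "",
        "printer offline in building 7",
        "access denied to shared drive",
    ],
})

# Tickets with no usable categorization: missing, blank, or the default "other".
uncategorized = (
    tickets["category"].isna()
    | (tickets["category"].str.strip() == "")
    | (tickets["category"] == "other")
)
print(f"{uncategorized.mean():.0%} of tickets have no usable categorization")
print(f"{(tickets['description'].str.strip() == '').mean():.0%} have an empty description")
```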

They are constrained by traditional approaches to IT

This is not a criticism; it’s simply an observation. Think of it as Henry Ford’s ‘faster horse’ syndrome. It manifests in one of two ways: they want to do exactly what they are doing today, just faster and less expensively, or they have vague, unrealistic expectations of what AI, ML and big data are going to do for them.

Implementing AIOps successfully requires an understanding of where you are and where you intend to go. Too many AIOps initiatives revolve around existing problems and do not think strategically about reshaping their approach, processes and organizations to account for the new realities of digital business.

So what does AIOps mean to IT leaders?

We are most certainly in the “peak of inflated expectations” for artificial intelligence, machine learning and advanced analytics for IT Operations. (see fig 1) As vendors and partners, we need to acknowledge where our customers are culturally, as well as the reality of the market, and give them a path to AIOps adoption that addresses their expectations and our shared past – even if we see greater opportunities within reach.

[Fig 1]
(source: https://www.gartner.com/smarterwithgartner/top-trends-in-the-gartner-hype-cycle-for-emerging-technologies-2017/)

The outcomes IT professionals expect from AIOps can be categorized generally as automation and prediction. They see the world in terms of what they are currently doing and experiencing. First and foremost, they need to be doing it faster with fewer resources. Their first expectation from AIOps is that it will allow them to automate what they are currently doing manually and thus increase the speed at which those tasks are performed, and therefore also increase the number of those tasks that can be performed in a given time or with a given set of resources.

Some specific examples I’ve heard include: correlate customer profile information with financial processing applications and infrastructure data to identify transaction duration outliers and highlight performance impacting factors; evaluate unstructured data in service tickets to identify problem automation candidates; categorize workloads for optimal infrastructure placement; and correlate incidents with changes, work logs and app dev activities to measure production impact of infrastructure and application changes.

What all these tasks have in common is that they require reason and domain knowledge – determining root cause, identifying outliers, identifying causal relationships, surfacing hidden problems, etc. Customers want AIOps to take over manual investigation tasks and at the same time, they typically aren’t able to share all of the domain knowledge necessary to ‘coach’ an AIOps solution to do what they expect. Practically, this means:

  1. The desire is for ‘unsupervised’ analysis of their massive digital data sprawl.
  2. Faith is lacking that AIOps tools can generate meaningful outcomes without their specific domain knowledge.
  3. Ability to share that domain knowledge is wildly inconsistent.
  4. Tools rarely address this through usability.

“Prediction” is still very much tied up in troubleshooting culture. When IT managers say they are looking for a system to “predict”, what they are saying is that they want to know in advance when something is going to happen, so they can avoid it rather than having to respond after the fact. At least, this is how it is usually articulated.

What I think they really have in mind about prediction is something more like this: learn what’s normal, learn what abnormal looks like, and when normal is starting to look abnormal, let me know. Pattern matching and trend analysis, broadly construed, extended beyond individual metrics and events defined by system, application or service. Don’t make me define KPIs; make sure that structured data is clean; decide what data to look at or not; decide what analytics to run, etc. Just dump it in a bucket, look for patterns, map it against domain knowledge and let me know what I should focus on.
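A minimal sketch of that "learn normal, flag abnormal" idea, using a rolling baseline over a single metric. The window size and threshold are illustrative assumptions; a real AIOps platform would do this across many metrics and data types, and without asking anyone to define them.

```python
# Minimal sketch of "learn what normal looks like and tell me when it starts to
# look abnormal": a rolling baseline with alerts when the latest value drifts
# several standard deviations away. Window and threshold are assumptions.
import numpy as np

rng = np.random.default_rng(1)
latency_ms = rng.normal(120, 8, size=200)    # "normal" behaviour
latency_ms[180:] += np.linspace(0, 60, 20)   # gradual drift toward abnormal

window, threshold = 50, 3.0
for t in range(window, len(latency_ms)):
    baseline = latency_ms[t - window:t]
    z = (latency_ms[t] - baseline.mean()) / baseline.std()
    if abs(z) > threshold:
        print(f"t={t}: latency {latency_ms[t]:.0f} ms looks abnormal (z={z:.1f})")
```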

Customers are overwhelmed with data, they see that traditional management techniques are not up to the task, and they don’t know where to start. They expect AIOps systems to figure it all out. At the same time, the concerns about the validity of the technology and fears of the unknown are still in play. It will not be enough for AIOps vendors, even if they create a solution like this, to simply start predicting without explanation. Customers will need to understand the ‘why’, if not the ‘how’, in order to buy into the outcomes.

Concluding thoughts

I know we say this about every 3 years, but there’s never been a more exciting time to be an IT solution provider. The pace of technological and global change is accelerating, and seismic shifts seem to happen regularly. Established institutions disappear almost overnight and disrupters are themselves disrupted in ever shortening cycles.

The thing that doesn’t change is people. Ultimately everything is measured against the human yardstick. Human factors continue to be the defining criteria for IT solutions – even where the solutions are meant to let machines automate human tasks. Until data science and statistics are taught to everyone from primary school on, AIOps providers can’t expect customers to be able to implement complex data and analytics solutions on their own. And even if those skills were ubiquitous, it wouldn’t mean that their application could always be connected to real-world business problems.

AIOps solution providers need to acknowledge the very real challenges customers face with tools, knowledge and culture; map that against maturity and expectations; and build solutions that engage with customers in the right way for where they are while providing a path to their desired end state.

]]>
AIOps Leverages Natural Language Processing for Service Tickets https://www.bmc.com/blogs/aiops-leverages-natural-language-processing-for-service-tickets/ Thu, 30 Nov 2017 00:00:41 +0000 http://www.bmc.com/blogs/?p=11528 As IT service delivery becomes more complex and dynamic, forward-thinking IT leaders are turning to Artificial Intelligence for IT Operations (AIOps). By leveraging machine learning and analytics on a big data platform, AIOps can help to resolve IT issues more efficiently and help to identify opportunities for automation. An AIOps platform that ingests large volumes […]]]>

As IT service delivery becomes more complex and dynamic, forward-thinking IT leaders are turning to Artificial Intelligence for IT Operations (AIOps). By leveraging machine learning and analytics on a big data platform, AIOps can help to resolve IT issues more efficiently and help to identify opportunities for automation. An AIOps platform that ingests large volumes of streaming and batch-monitoring data from multi-cloud environments provides visibility and insights for individual teams within an IT organization and for teams across interdependent IT functions.

In a 2017 BMC survey of 1000 CIOs & senior IT professionals, 80% said they need to rethink their IT management approaches, and 78% said they would be looking to Artificial Intelligence to address complexity.

One of the many areas where AIOps can add value is IT service ticket analytics.

AIOps natural language processing for service tickets

Digital enterprises face an ever-increasing volume and velocity of IT service tickets. An AIOps platform uses machine analysis to keep up with the pace and volume of data while enabling correlation between the tickets, incidents, changes, and/or performance data. Using natural language processing (NLP) on the unstructured data in tickets (e.g., “Description” or “Notes”) yields insights into the problems driving the tickets that might otherwise go unnoticed and unaddressed.

Service ticket templates typically have a finite list of choices or categorizations. Users who open tickets can (or sometimes, “must”) choose from a list of options to self-classify the ticket issue or subject. Often, they also have the option to specify profile information such as the department where they work or a user ID.

When the finite list of categorizations doesn’t describe a user’s issue, or the user can’t find a choice with which they are happy, they will typically choose the default bucket of “Other”. When a user selects “Other,” the details of their issue or concern are going to be in the unstructured descriptions that they provide.

NLP algorithms can go through this free-form text and “cluster” issues based on text that appears anywhere in the service ticket. This is particularly useful if the categories have not been kept up to date by the service desk team or by the ticketing system itself. Many of these tickets end up uncategorized, or the information they contain is difficult to extract without a big data platform that can quickly handle the large data volume and apply the NLP algorithm to it.

NLP algorithms for specific languages include ways to remove “stop words” in free-form text fields such as “the”, “a”, “an” and so on. They can cluster details about service tickets by two-word pairs that are included somewhere in the large volume of uncategorized tickets – not necessarily in sequence but anywhere in any one of the text fields. One of the pairs could be something like, “Account lockout” or “Access denied.” If the system includes incident data from log files, the two-word pair could be something like “Incorrect format,” or “Command failed.”
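Here is a hedged sketch of the bigram idea: strip English stop words from some invented ticket descriptions and count the two-word pairs that appear most often. For simplicity it counts contiguous pairs after stop-word removal rather than pairs appearing anywhere in a field, and a real system would go on to cluster tickets by these pairs rather than just count them.

```python
# Sketch: remove English stop words from free-form ticket text and surface the
# most frequent two-word pairs. The descriptions are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

descriptions = [
    "account lockout after the password reset",
    "user reports an account lockout on login",
    "access denied when opening the shared drive",
    "access denied error for a finance report",
    "command failed with an incorrect format message",
]

vectorizer = CountVectorizer(stop_words="english", ngram_range=(2, 2))
counts = vectorizer.fit_transform(descriptions).sum(axis=0).A1
pairs = sorted(zip(counts, vectorizer.get_feature_names_out()), reverse=True)

# Word pairs like "account lockout" and "access denied" float to the top.
for count, pair in pairs[:5]:
    print(count, pair)
```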

An event clustering visual like the one shown in the example below can quickly help you pinpoint the key word pairs that occur far more frequently than others.

 

This visualization and drill-down to the actual incidents associated with each cluster can help you quickly determine where to prioritize problem resolutions related to service tickets.

Beyond NLP for AIOps

Applying natural language processing (NLP) to the volume of service ticket information is a great example of what an AIOps platform makes possible.

With a TrueSight AIOps platform in place, you can also do the following:

  • Predict future performance issues with dynamic baselining
  • Forecast resource utilization, including public cloud cost, with capacity analytics
  • Focus on the most likely source of a problem with probable cause analytics
  • Discover issues captured in logs of a problem with log analytics
  • Identify problems driving incidents with anomaly detection

For more information, see the AIOPs eBook:

https://www.bmc.com/forms/elevate-it-operations-aiops-ebook.html

]]>
Why AIOps Needs Big Data and What That Means for You https://www.bmc.com/blogs/why-aiops-needs-big-data-and-what-that-means-for-you/ Thu, 16 Nov 2017 06:36:26 +0000 http://www.bmc.com/blogs/?p=11472 In a previous blog post about AIOps, we presented the following graphic: In this blog post, we’ll go deeper into the Big Data layer, explain why it is essential for AIOps, provide a brief overview of big data evolution and discuss what this means for your AIOps initiative. Relational Databases and the Pre-Digital Age Prior […]]]>

In a previous blog post about AIOps, we presented the following graphic:

In this blog post, we’ll go deeper into the Big Data layer, explain why it is essential for AIOps, provide a brief overview of big data evolution and discuss what this means for your AIOps initiative.

Relational Databases and the Pre-Digital Age

Prior to about 20 years ago, most business, IT, scientific, health care and other systems didn’t produce enough data to be what we now consider ‘big data’.[i] What data they did produce could be stored in the standard database technology of the time – relational databases. Analytics could be built into tools with a relational database back-end or in standalone analytics platforms that leveraged relational databases.

In a relational database, data is organized into tables that consist of columns and rows. Tables typically define entity types (customer info, sales info, product info, etc.) and columns define attributes or properties. Rows contain data about specific instances of the entity. Entities are ‘related’ by virtue of being in the same table; tables are ‘related’ to each other using keys or indexes in order to associate data across different tables for the purpose of calculation, reporting and analysis. Queries that involve large numbers of tables and/or lookups across tables can severely stress the processing capability of the system. The challenge with relational databases is tuning performance to support the specific uses required.
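For readers who want to see the model rather than read about it, here is a minimal sketch of two related tables queried with a join; the table and column names are invented for illustration.

```python
# Minimal sketch of the relational model described above: two tables related by a
# key, queried with a join. Table and column names are invented for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers(customer_id),
                        amount REAL);
    INSERT INTO customers VALUES (1, 'Acme Corp'), (2, 'Globex');
    INSERT INTO sales VALUES (1, 1, 250.0), (2, 1, 75.0), (3, 2, 410.0);
""")

# The join uses the shared key to associate rows across tables for reporting.
for name, total in db.execute("""
        SELECT c.name, SUM(s.amount)
        FROM customers c JOIN sales s ON s.customer_id = c.customer_id
        GROUP BY c.name"""):
    print(name, total)
```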

Tables in relational databases must be defined in advance based on the data you are going to store in them. Understanding data type, size and expected use in retrieval, reporting, calculations and analytics is critical to relational database design. You must understand the data structure, relationships between data entities and what you intend to do with it to get the expected benefit from querying in a relational database. New data or changes to incoming data structure, relationships or uses necessitate changes to the database design.

Relational databases have a significant cost vs. performance tradeoff as data volumes and uses grow. While relational database reliability has generally been very good, performance is an ongoing challenge as tables proliferate, grow in size and support simultaneous query requests. Commodification of database technology in conjunction with drastic reductions in the cost of storage and processing helped organizations deal with scalability issues, but structurally the technology was never going to be able to support what we have come to call ‘big data’.

Big Data

Beginning around the year 2000, we began to see an explosion in data creation thanks in part to a transition from proprietary, expensive, analog storage to commodified digital storage. Digital tape meant more data could be archived. CD/DVD (and then Blu-ray) meant more data could be distributed and shared. Increasing PC hard drive, processor and memory capacity meant more data was being captured and stored by individuals.

Once these technologies permeated institutions like hospitals, companies and universities, digital data generation and collection exploded. People started to ask how to unlock the business, scientific and knowledge value of these massive data sets. Two technical challenges needed to be overcome: the rigidity of relational data structures and scaling issues for processing queries on relational databases.

The first problem was addressed with the development of ‘data lakes’. A data lake is a single store for structured data such as that in relational databases (tables, rows, columns), as well as semi-structured data (logs, JSON), unstructured data (emails, documents) and other data (images, audio, video).[ii] Data lakes collect all data, regardless of format, and make it available for analysis. Instead of doing extract, transform, load (ETL) tasks or applying a data schema at ingestion, the schema is applied when the data is called upon for analysis.
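A rough sketch of "schema on read": mixed records land in the lake as-is, and a structure is imposed only when a particular analysis asks for it. The record shapes below are invented examples of structured, semi-structured and unstructured data.

```python
# Sketch of schema-on-read: heterogeneous records are stored untouched, and a
# schema is applied only when a specific analysis reads them. Records are invented.
import json

lake = [
    json.dumps({"type": "ticket", "id": 101, "priority": "high"}),   # JSON event
    "2017-11-16T06:36:26Z ERROR payment-api timeout after 3000ms",   # raw log line
    json.dumps({"type": "metric", "host": "web01", "cpu_pct": 93}),  # JSON metric
]

def read_as_metrics(records):
    """Apply a metric-shaped schema at read time; ignore everything else."""
    for raw in records:
        try:
            record = json.loads(raw)
        except ValueError:
            continue  # unstructured data stays in the lake, untouched
        if record.get("type") == "metric":
            yield record["host"], record["cpu_pct"]

print(list(read_as_metrics(lake)))
```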

The second issue was addressed with massively parallel processing (MPP). Relational databases rely on single or shared storage, which can be accessed by many processing nodes. The storage becomes a bottleneck due to either performance or query queuing. Simultaneous queries on the same data may have to be queued – to wait – on other queries to make sure they are using the most updated data.

MPP attempts to segment data across processing nodes, eliminating the single storage bottleneck. Segmentation is done by data type, expected use or even as sub-segments of larger data sets. This permits simultaneous or “parallel” processing that enables significantly increased query performance over traditional relational databases.

Of course, MPP segmentation presents its own challenges as well as the need for segmented data reconciliation. However, for relatively static data of the sort typical for early big data analysis, this approach worked well. Queries could be batched and executed in parallel vs. serially and doing complex analysis on massive data sets became achievable for most organizations.

The implementation of data lakes and MPP is best exemplified in Apache Hadoop – a technology with which most technologists are familiar – and specifically in Hadoop 1.0. Hadoop introduced the ‘Hadoop Distributed File System’ (HDFS) and MapReduce to address the limitations of traditional relational databases for big data analytics.

HDFS is an open-source data lake (accepting almost any data type as-is), supports data distribution across commodity hardware and is optimized for MPP queries that segment data. It made storing and utilizing massive data sets for analytics a technical and economic reality. MapReduce is an MPP engine designed to structure queries for parallel execution on segmented data in HDFS.
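As a toy, single-process illustration of the MapReduce pattern (real MapReduce runs the same phases in parallel across data segments in HDFS), here is a map step that emits key/value pairs from log lines and a reduce step that aggregates them by key.

```python
# Toy, single-process imitation of the MapReduce pattern: a map step emits
# key/value pairs from each record, and a reduce step aggregates by key.
# Real MapReduce distributes both phases across nodes; the log lines are invented.
from collections import defaultdict

log_lines = [
    "ERROR db01 connection refused",
    "WARN web01 slow response",
    "ERROR db01 connection refused",
    "INFO web02 request served",
]

# Map: each line becomes a (severity, 1) pair.
mapped = [(line.split()[0], 1) for line in log_lines]

# Shuffle/reduce: group by key and sum the values.
reduced = defaultdict(int)
for severity, count in mapped:
    reduced[severity] += count

print(dict(reduced))  # {'ERROR': 2, 'WARN': 1, 'INFO': 1}
```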

Apache Hadoop commoditized big data for organizations in every vertical. Scientists, business analysts, health science researchers and others began doing deep analysis on massive data sets to try to identify cures, weather patterns, insights, competitive advantage and more. The big data boom was born. But Hadoop 1.0 had limitations that constrained its utility for certain applications:

  • MapReduce was the only application that could be used with HDFS
  • MapReduce only supported batch processing of structured queries. You couldn’t work with streaming (real-time) data or do interactive analysis.
  • While relatively easy to setup and administer, optimization of data and queries was difficult. Organizations required data scientists to manage investigations to get useful results.

Cue Hadoop 2.0 – the democratization of big data and the enablement of AIOps.

Big Data for AIOps

With Hadoop 2.0, Apache released YARN (“Yet Another Resource Negotiator”). YARN sits alongside MapReduce and complements its scheduling and batch capability with support for streaming data and interactive queries. YARN also opened the door for using HDFS with compatible, non-Apache solutions.

Streaming, interactive big data analytics were now possible. Integrating third-party applications with Hadoop meant that vertical applications could incorporate a new level of analytics – if they were re-architected. Hadoop, however, was still difficult to optimize and use for organizations that needed an analytics practice but didn’t have data science resources.

Enter market influence. Seeing the need for easier-to-use and more purpose-built solutions, technologies like Elasticsearch, Logstash and Kibana emerged (the “ELK” or “Elastic Stack”), offering batch, streaming and interactive big data analytics and ultimately becoming an alternative to Hadoop for some use cases.

Why does this matter for core IT Operations and Service Management? Because both IT disciplines rely on streaming data and interactivity. In ITOM and ITSM applications, analytics had been limited by the database technology used and application architecture. And IT as a cost center couldn’t justify hiring data scientists to find ways to use analytics for monitoring, remediation and service delivery use cases.

On the other side, the digital transformation of enterprises has been simultaneously revolutionizing the role of IT while applying unprecedented pressure on IT to deal with scale, complexity and speed. To support business innovation and keep up with digital transformation, IT needs systems that:

  • Bring together diverse IT data
  • Use machines to analyze massive amounts of streaming data in real-time
  • Generate meaningful information for IT specific use cases (event management, alerting, workload placement, root cause analysis, cloud cost optimization, etc.)
  • Identify additional automation opportunities
  • Integrate with IT workflows and support interactive and historical analysis.

The challenge in transitioning to an analytics-based approach has been the limitations of purpose-built applications and their data silos. IT tools are not easy to replace or upgrade and even if they are re-architected to support big data analytics, their data remains siloed. Enter AIOps.

Fundamental premises of AIOps are:

  • For IT to respond to digital transformation, machines must take over manual analysis
  • Analytics must be real-time on streaming data as well as historical data
  • Datasets must include diverse IT data from different silos
  • Systems should be interactive, both from a technical and usability perspective

AIOps is only possible now with the commodification and evolution of big data technologies. It needs to support diverse data (data lakes); it needs to support simultaneous analytics on massive amounts of data; it must do analytics in real-time on streaming data; and must allow for interaction by human agents.

AIOps does not replace domain-specific ITOM and ITSM tools. It changes the approach to IT management by taking data from domain tools, housing it in a big data backend, applying analytics and machine learning, and democratically distributing insights across the organization.

With this background in mind, here are the key implications for your AIOps initiative:

  • You must choose either to build a big data backend yourself on Hadoop, ELK or some other technology – or rely on a partner delivered solution. Partner solutions may be big data as a service or a big data backend in an AIOps platform. You should not build an AIOps initiative around a traditional relational database.
  • If you build the platform yourself, recognize that you will take on the technical debt associated with ensuring performance of the solution; maintenance of an elastic (public or private cloud) infrastructure; as well as the creation of a robust data science practice. That practice must deal not only with the analytics theory, but also the implementation of that theory in your AIOps platform (e.g. development in Python or R)
  • Ensure that your domain IT tools have APIs that support streaming data to the AIOps big data backend or that you can provide near real-time ETL of critical data (e.g. tickets, events, logs, etc.) to the platform. AIOps analytics, correlation and pattern matching algorithms need synchronized feeds of diverse data. A minimal sketch of such a feed follows this list.
  • Prepare organizationally for the shift. Not only will different, typically siloed, IT groups be required to share data and collaborate, they will also need to agree on review and response processes. Data will need to be visualized in common and what counts as ‘normal’ and ‘abnormal’ may need to be redefined in the context of the new joint approach.
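As promised above, here is a minimal sketch of streaming an event from a domain tool to an AIOps backend over HTTP. The endpoint URL and payload shape are hypothetical; real platforms define their own ingestion APIs, and this is only meant to show the shape of a synchronized feed.

```python
# Hedged sketch of pushing events from a domain tool to an AIOps big data backend
# as they occur. The endpoint URL and payload fields are hypothetical assumptions.
import json
import time
import urllib.request

INGEST_URL = "https://aiops.example.com/api/ingest"   # hypothetical endpoint

def send_event(event: dict) -> None:
    body = json.dumps(event).encode("utf-8")
    request = urllib.request.Request(
        INGEST_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # This call will only succeed against a real ingestion endpoint.
    urllib.request.urlopen(request, timeout=5)

send_event({
    "source": "event_manager",
    "timestamp": time.time(),
    "severity": "critical",
    "message": "filesystem /var 95% full on host db01",
})
```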

AIOps is the path forward for enterprise IT but it will not come overnight nor without serious investment of thought, time and resources from I&O leaders. This article addresses just one aspect of a successful AIOps initiative. Look for future articles on other key elements such as analytics and algorithms.

To help you with your AIOps initiative, BMC offers the TrueSight AIOps platform which leverages machine learning, analytics, and big data technologies to reduce MTTR and drive the digital enterprise. TrueSight is designed for enterprise IT organizations in digitally transforming businesses. To learn more about TrueSight AIOps, click here.

[i] Financial systems did (and do) but most relied (and continue to rely) on a different type of technology.
[ii] Campbell, Chris. “Top Five Differences between Data Warehouses and Data Lakes”. Blue-Granite.com. Retrieved May 19, 2017.

]]>