In an earlier post, Seth Paskin and I answered the question “What is AIOps?” in the following way:
“AIOps refers to multi-layered technology platforms that automate and enhance IT operations by using analytics and machine learning to analyze big data collected from various IT operations tools and devices, in order to automatically spot and react to issues in real time.”
This post is about how AIOps will change the way IT Operations personnel (IT Ops) work and the new skill sets they have to adopt in an AIOps world.
How does AIOps work, again?
Gartner explains that an AIOps platform (figure 1) uses machine learning and big data to aggregate observational data (from monitoring systems output, job logs, syslogs, etc.) and engagement data (from ticketing, incident, and event recording system data) to produce a virtuous circle of continuous insights yielding continuous improvements and fixes.
Automation is both an input and an output of AIOps. The results or statuses of automated workloads and jobs can be used like observational data and engagement data for analytic purposes. Manual improvements can take the form of automating tasks, responses, and remediations. Machine learning that handles analytics at scale and adjusts its algorithms accordingly is a form of automated improvement, as seen in Amazon and eBay online shopping, automated stock trading, or Netflix recommendations.
If it’s automated, what does IT Ops do?
The implications of implementing AIOps are significant not only in terms of technology, but also in terms of process, culture and skills. AIOps will produce a big change in IT Ops’ role in both the Data Center and the business, leading IT organizations to ask this question:
What happens to the traditional IT Ops role when you turn IT Operations tasks over to an AIOps system that can respond to issues, manage applications and infrastructure, and adjust for cost and business value faster than the human beings that oversee it?
The answer is that just as Data Centers evolve using new technologies, IT Ops must also evolve by learning and using new skills to manage these new technologies.
Traditional IT Ops skills versus AIOps skills
Traditional IT Ops work focuses on producing and maintaining consistent, stable environments for service and application delivery. It is also concerned with meeting customer/user expectations and planning for growth and change. Traditional IT Ops tools try to provide useful information for the execution of these tasks. Generally, these tools either use human domain knowledge and analytic techniques directly or are modeled on them.
AIOps uses big data, algorithms, and machine learning to examine the profile of IT and business data, determine what “normal” looks like, find what factors are causal and correlative when things aren’t normal, and automatically recommend or implement a response. Machines execute these steps at incredibly fast rates on exponentially increasing amounts of data.
With AIOps, IT Ops job skills expand to include auditing AIOps results. IT Ops will need to understand how and why the AIOps platform is producing the outcomes it’s recommending or implementing. In an AIOps environment, IT Ops personnel need an enhanced skill set that helps them oversee the machine’s work, rather than just performing the work themselves.
Here are three skills IT Ops personnel will need as the world transitions into AIOps and application-centric infrastructures.
Skill #1: Auditing and Adjusting Machine Outcomes
In machine learning, there is a concept of ‘supervised’ and ‘unsupervised’ learning. Supervised learning is where one trains a system using sample (historical) data. When the system outputs expected results, it is considered ‘trained’ and can be applied to new data. Unsupervised learning is where no training data is provided and the system must organize and analyze data with no outside guidance.
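To make the distinction concrete, here is a minimal sketch in plain Python. It is an illustration only, not how any particular AIOps product works: the function names, the CPU readings, and the incident labels are all hypothetical. The supervised function learns a cutoff from labeled historical readings, while the unsupervised function is given no labels and simply splits the readings at the largest gap between values.

```python
def train_supervised(labeled):
    """Pick the cutoff that misclassifies the fewest labeled examples."""
    candidates = sorted(value for value, _ in labeled)

    def errors(cutoff):
        # Count readings whose predicted label (value >= cutoff) is wrong.
        return sum((value >= cutoff) != is_incident for value, is_incident in labeled)

    return min(candidates, key=errors)


def split_unsupervised(values):
    """Split unlabeled values into two groups at the largest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


# Hypothetical historical CPU readings (percent) labeled by operators.
labeled_history = [
    (35.0, False), (50.0, False), (62.0, False),
    (78.0, True), (85.0, True), (91.0, True),
]
learned_cutoff = train_supervised(labeled_history)

# The same readings with the labels removed.
unlabeled = [35.0, 91.0, 50.0, 78.0, 62.0, 85.0]
low_group, high_group = split_unsupervised(unlabeled)

print(learned_cutoff)         # 78.0
print(low_group, high_group)  # [35.0, 50.0, 62.0] [78.0, 85.0, 91.0]
```

In the supervised case someone had to say which readings were incidents; in the unsupervised case the system found two groups but has no idea which one, if either, is a problem.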
AIOps will almost always involve supervised learning. IT Ops personnel will need a good understanding of the algorithms behind AIOps processing in order to train and validate the system. They won’t need to be data scientists or understand complex math to do this, but they’ll need a better understanding of how the machine learning algorithms apply analytics to the data. The goal is to understand the “why” of the machine-produced outcomes so that they can be accepted, rejected or adjusted.
As a simple example, in the traditional IT Ops world, you might set a threshold on a specific metric, such as processor utilization at 70%. You would configure your monitoring software to send you an alert when CPU utilization hits 70% so that you can investigate. You do this because you know from experience that 70% is the point at which something problematic happens or an undesired state of affairs begins. 70% may or may not be exactly the right number, but it works well enough to get the job done.
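A static rule of this kind can be sketched in a few lines of Python. The constant, function name, and sample readings below are hypothetical and stand in for whatever your monitoring tool actually provides.

```python
# Hypothetical static rule: names and sample values are illustrative only.
CPU_ALERT_THRESHOLD = 70.0  # percent, chosen from operator experience


def should_alert(cpu_percent: float, threshold: float = CPU_ALERT_THRESHOLD) -> bool:
    """Return True when a CPU reading reaches the fixed threshold."""
    return cpu_percent >= threshold


samples = [42.0, 65.5, 70.0, 88.3]
alerts = [s for s in samples if should_alert(s)]
print(alerts)  # [70.0, 88.3]
```

The threshold never changes unless a person edits it, which is exactly the limitation the AIOps approach is meant to address.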
In an AIOps world, the machine examining your data will create a baseline of what normal looks like for CPU utilization. Once it is told which metric represents the problem or undesired state, the machine can look more closely at the relationship between CPU and that metric. It will then determine the right threshold for when to send alerts or to make an automatic adjustment (such as assigning more capacity or adjusting runaway job resources). The machine may discover that a different threshold is more accurate or gives you more lead time, that the issue correlates with another metric you should be monitoring instead, or that it only happens when a series of conditions apply, not just CPU activity.
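As a rough illustration of baselining, the sketch below derives a threshold from past readings instead of hard-coding one. It uses a simple "mean plus k standard deviations" rule, which is an assumption chosen for brevity; real AIOps platforms may use very different models. The function names and the readings are hypothetical.

```python
from statistics import mean, stdev


def learn_threshold(history: list[float], k: float = 3.0) -> float:
    """Derive an alert threshold as mean + k sample standard deviations."""
    return mean(history) + k * stdev(history)


def find_anomalies(history: list[float], new_readings: list[float], k: float = 3.0) -> list[float]:
    """Return the new readings that exceed the learned threshold."""
    threshold = learn_threshold(history, k)
    return [r for r in new_readings if r > threshold]


# Hypothetical "normal" CPU utilization (percent) for one server.
history = [40.0, 42.0, 38.0, 41.0, 39.0, 40.0, 43.0, 37.0]

threshold = learn_threshold(history)
anomalies = find_anomalies(history, [44.0, 47.5, 70.0])

print(threshold)  # 46.0
print(anomalies)  # [47.5, 70.0]
```

For this server the learned threshold is 46%, well below the hand-picked 70%, so a reading of 47.5% is already flagged. That is the kind of extra lead time a learned baseline can provide, and also the kind of result an operator should audit before trusting.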
IT Ops personnel will need a deep enough understanding of how machine learning analytics work so that, when they turn control over to the machine, they can audit how that automated control is evolving and doing its job. With AIOps, IT Ops moves from a totally manual process to an auditing and adjustment process, in which you fine-tune the system according to changes in your environment that the machine learning algorithms need to learn. Recurring seasonal events (e.g. Black Friday, Amazon Prime Day) as well as one-off events (marketing campaigns, launches) will introduce new data to which the system must adjust, and IT operators will need to validate those adjustments.
AIOps auditing and management is a key skill that IT Ops will need to develop. It will be informed by the specific working environment (tribal knowledge) and the industry. Some skill training will come from vendors, some will be obtained through self-education, and some through certification. Expect the education process for AIOps management to resemble the one staff went through when they learned network skills.
Skill #2: Understand APIs and other modern-stack application technologies
As I’ve noted before, with application-centric infrastructures, DevOps, and Agile software development, IT Ops are increasingly taking responsibility for resolving application issues that software developers previously handled. Regardless of where your organization is with application delivery, it is undeniable that the application has become king and developers are getting consistently more influence and budget.
IT Ops must now speak the language of developers (APIs, continuous delivery), understand application technologies (microservices, containers) and determine the correct way to measure their impact on the IT ecosystem (and respond when things go wrong). For example, IT Ops needs to be able to answer:
- Is an application processing data correctly and do we need to correct any data issues?
- What portions of the code are causing issues?
- Is code execution or a database call causing slow response time?
- Is a 3rd party service or API impacting application performance?
- Is auto-scaling in cloud services (AWS, Azure) delivering performance at the right price?
- Is engaging multiple APIs or external services introducing latency?
And many other questions besides. In addition to understanding these technologies, IT Ops must also open channels of communication with developers to raise alerts and collaborate on application-related issues.
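One of the questions above, whether code execution or a database call is causing slow response time, can be approached by timing each segment of a request separately. The following is a minimal sketch using only the Python standard library. The `timed` helper, the segment names, and the simulated database call are all hypothetical; in practice an APM or tracing tool would collect this data.

```python
import time
from contextlib import contextmanager

# Accumulated seconds spent in each named segment of the request.
timings: dict[str, float] = {}


@contextmanager
def timed(segment: str):
    """Add the elapsed wall-clock time of the enclosed block to `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[segment] = timings.get(segment, 0.0) + time.perf_counter() - start


def fake_database_call() -> list[int]:
    """Stand-in for a real query; the sleep simulates network and query time."""
    time.sleep(0.05)
    return [1, 2, 3]


def handle_request() -> int:
    with timed("database"):
        rows = fake_database_call()
    with timed("code"):
        total = sum(r * r for r in rows)
    return total


result = handle_request()
slowest = max(timings, key=timings.get)
print(result, slowest)  # 14 database
```

Breaking response time down by segment like this is what lets IT Ops tell a developer where the time is going, rather than only reporting that the application is slow.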
Perhaps the key application technology for today’s enterprise is the on-demand cloud. Application developers have essentially been given carte blanche to use cloud resources as they see fit, while the organizational budget for cloud sits with IT Ops. Individual developers may not care about a $30-$50 a month bill, but across thousands of developers in the organization, the costs add up. IT Ops must gain visibility into what is happening with cloud resources and an understanding of workload profiles in order to determine where workloads should be placed for cost/performance optimization.
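A quick back-of-the-envelope calculation shows how fast those small bills add up. The developer count of 2,000 below is an assumption made purely for illustration; only the $30-$50 range comes from the text.

```python
# Hypothetical figures: the developer count is an assumption for illustration.
DEVELOPERS = 2000
MONTHLY_BILL_RANGE = (30, 50)  # dollars per developer per month, from the text


def monthly_total(developers: int, bill_per_developer: int) -> int:
    """Total monthly cloud spend if every developer runs up the same bill."""
    return developers * bill_per_developer


low, high = (monthly_total(DEVELOPERS, bill) for bill in MONTHLY_BILL_RANGE)

print(f"Monthly: ${low:,} to ${high:,}")            # Monthly: $60,000 to $100,000
print(f"Yearly:  ${low * 12:,} to ${high * 12:,}")  # Yearly:  $720,000 to $1,200,000
```

Under that assumption, bills that no single developer would notice come to somewhere between $720,000 and $1.2 million a year.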
Duties that used to be handled by applications programmers are now shifting to IT Ops. Applications are becoming more function and service specific and are being built as services that talk to each other through APIs. Cloud resources used by developers are still owned by IT Ops. A working familiarity with APIs and other applications technologies (what they do, how to test, how to address, etc.) is becoming a requirement for IT Ops. It will also be needed for AIOps management.
Skill #3: Security, Security, Security
If your IT Ops organization isn’t already responsible for security, understanding what a security event is in an operational context and how to react to it is critical. In many organizations, security functions are siloed away from IT Operations. As AIOps becomes more prevalent, a security event storm such as a denial of service attack or one of the recent ransomware attacks will likely be detected quickly by AIOps machine learning. Knowing how to recognize such events as security failures rather than operational failures, and responding to them as such, will be essential. In the AIOps environment, awareness of security issues and of how IT Ops personnel should react to them will matter more than ever.
It takes a generalist
Digital business innovation happens at the edge of an organization’s IT ecosystem. Once an innovation matures into production, be it an infrastructure, application, or security improvement, ownership passes to IT Operations.
With the advent of AIOps and other new technologies such as application-centric infrastructures, microservices, and DevOps/Agile, IT Operations personnel will no longer be permitted to remain specialists in the area of IT performance management. They must become generalists in a number of different areas. IT Ops skill sets must evolve to include practical working knowledge of such things as machine learning/algorithm management, applications programming, and security. As our organizations become increasingly digitized, possessing these three skills will become the new normal for IT Ops.
Special thanks to Seth Paskin and Stephen Watts for help and input in writing this post.