Part 2 – 6 Capabilities in the New Generation of Cloud Operations Management
In part 1 of this topic, I talked about a paradigm shift in IT from boxes, applications, and ownership in the classic data center to a new cloud model with pools, services, and sharing. Cloud will drive changes to operations and deliver powerful benefits. In this part, I want to examine six new capabilities that operations management will have to evolve in order to deliver on the promise of the cloud.
1. Operate on the “pools” of compute, storage, memory
Traditionally, operations management solutions have had good coverage for individual servers, storage arrays, or network devices. However, in the cloud, these are all abstracted away, and it becomes imperative to operate at the “pool” level. You have to look beyond what you are monitoring at the individual device level and make sure that you have immediate access to the operational status of the pool. That status could include aggregated workload (current usage) and capacity (past usage and future projection). Perhaps more importantly, it needs to accurately reflect the underlying health of the pool, keeping in mind that the availability of an individual component is not the same as the availability of the pool.
For example, when one ESX host in the pool has a problem, the VMs on that host can be migrated to other hosts in the same pool (through vMotion or Live Migration, etc.) as long as the pool still has the capacity. So despite one host being unavailable, the overall pool is still functioning properly. The operations management solution you are using should understand the behavior of the pool and report the health status based on it.
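To make the host-versus-pool distinction concrete, here is a minimal sketch of how a pool-level health status might be derived. The `Host` fields, the host names, and the 80% headroom threshold are assumptions chosen for illustration, not values from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    capacity_ghz: float      # total CPU capacity of the host
    used_ghz: float          # CPU consumed by the VMs placed on it
    available: bool = True   # False when the host itself has a problem


def pool_health(hosts):
    """Return 'healthy', 'degraded', or 'critical' for the pool as a whole."""
    # All VM workload counts, including VMs that must leave a failed host.
    demand = sum(h.used_ghz for h in hosts)
    # Only hosts that are still up contribute capacity.
    usable = sum(h.capacity_ghz for h in hosts if h.available)
    if usable == 0 or demand > usable:
        return "critical"    # the surviving hosts cannot absorb the workload
    if demand / usable > 0.8:
        return "degraded"    # assumed threshold: less than 20% headroom left
    return "healthy"


pool = [
    Host("esx-01", 40.0, 20.0),
    Host("esx-02", 40.0, 15.0),
    Host("esx-03", 40.0, 10.0, available=False),  # one host is down
]
print(pool_health(pool))
```

In this example one host is unavailable, yet the remaining two can carry the full workload, so the pool is still reported as healthy.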
2. Monitor elastic service
The Cloud is all about elasticity. That means several things. First of all, you will have services that dynamically expand and retract based on demand. Your operations management solution should adapt to this dynamic nature. For example, if you are monitoring the performance of a service, you need to make sure your monitoring coverage expands or retracts with the service. It should do so automatically. This means that you can’t expect a manual process to figure out and deploy your monitoring capabilities to the target. Your operations management solution needs to know the configuration of that service and automatically deploy or remove necessary agents.
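One way to picture automatic coverage is as a reconciliation step that runs whenever the service changes size. The sketch below assumes the solution can list the instances that currently make up a service and the instances that currently have an agent; the function name and instance names are hypothetical.

```python
def reconcile_agents(service_members, monitored):
    """Work out which instances need an agent deployed or removed.

    service_members: instances that currently make up the service
    monitored:       instances that currently have a monitoring agent
    """
    members = set(service_members)
    covered = set(monitored)
    to_deploy = sorted(members - covered)   # new instances with no agent yet
    to_remove = sorted(covered - members)   # agents left behind on retired instances
    return to_deploy, to_remove


# The service scaled out to web-3 and web-4 and retired web-2.
deploy, remove = reconcile_agents(
    service_members=["web-1", "web-3", "web-4"],
    monitored=["web-1", "web-2"],
)
print(deploy, remove)
```

Running a step like this on every scaling event, rather than on a schedule, is what keeps monitoring coverage in line with the service.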
Another important consideration is coverage for both cloud and non-cloud resources. This is critical for enterprises that are building a private cloud. Why? Chances are that not every tier of a multi-tier application can be moved to the cloud. There may be static, legacy pieces, such as a database or persistence layer, that are still deployed on physical boxes. If you are in such a situation, you will want to monitor your service no matter where its resources are located, cloud or not. In addition, a management solution should natively understand the different behavior in each environment.
Furthermore, when resources are located in both private and public clouds, your operations solution should allow you to monitor your services seamlessly in each. It should also support inter-cloud service migration. At the end of the day, you want your service to be monitored no matter where its resources are located. Your operations management solution must know their location and understand their behavior accordingly.
3. Detect the issue before it happens
Compared to workloads in the traditional data center, workloads in the cloud exhibit a wider variety of behavioral issues due to their elastic nature. When service agility is important, relying on reactive alerts or events will not be an option, particularly for service providers, to support a high SLA. You need to detect and resolve issues before they happen.
How do you do that? With a monitoring solution that knows how to learn the behavior of your cloud infrastructure and your cloud services. This is not new technology in the traditional data center. But in the Cloud, device-level behavior evolves more rapidly and less uniformly. Your solution should have the ability to learn the behavior of abstracted resources such as pools, as well as service levels that are based on business KPIs. Based on those metrics, it should give you predictive warnings that allow you to isolate the problem before it impacts your customer.
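As a rough illustration of behavior learning, the sketch below builds a baseline from recent samples of a metric and flags a new sample that falls outside the normal band. The choice of a mean plus-or-minus three standard deviations band, the function names, and the sample values are assumptions; real products use more sophisticated models.

```python
import statistics


def learn_baseline(history):
    """Learn a simple baseline (mean and standard deviation) from past samples."""
    return statistics.mean(history), statistics.stdev(history)


def is_abnormal(value, baseline, k=3.0):
    """Flag a sample that deviates more than k standard deviations from the mean."""
    mean, stdev = baseline
    return abs(value - mean) > k * stdev


# Hypothetical response times (ms) for a service during normal operation.
history = [100, 102, 98, 101, 99]
baseline = learn_baseline(history)

print(is_abnormal(101, baseline))  # within the learned band
print(is_abnormal(110, baseline))  # drifting away from normal behavior
```

The point of the design is that the warning fires on a departure from learned behavior, which can happen well before a fixed SLA threshold is crossed.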
Speaking of problems, when you try to pinpoint them, make sure you have done a proper root cause analysis. This becomes even more critical in the cloud, where large numbers of scattered resources are involved. Amazon’s outage that happened earlier this year is a good example. According to the Amazon service dashboard, a network event caused many EBS nodes in that region to think they had lost their mirrors, and the central management policy kicked in to reconfigure them.
As a service provider, if you are watching your monitoring dashboard, you will probably see a sea of red alerts suddenly appear. Even though that network alert is among them, chances are you are not going to notice it. Your operations management solution should intelligently detect the root cause and highlight that network event in your dashboard or pass it to the remediation process.
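A simple way to think about surfacing the root cause from a sea of alerts is to use a dependency map: an alerting component whose own dependencies are all quiet is a likely origin. The sketch below assumes such a map is available; the topology and component names are invented for illustration and do not describe Amazon's actual architecture.

```python
def root_causes(alerting, depends_on):
    """Return alerting components that have no alerting dependency of their own."""
    alerting = set(alerting)
    roots = []
    for component in alerting:
        upstream = depends_on.get(component, [])
        # If something this component depends on is also alerting,
        # this alert is more likely a symptom than a cause.
        if not any(dep in alerting for dep in upstream):
            roots.append(component)
    return sorted(roots)


# Hypothetical dependency map: component -> things it depends on.
depends_on = {
    "storage-node-1": ["network"],
    "storage-node-2": ["network"],
    "vm-instance-1": ["storage-node-1"],
    "database-1": ["storage-node-2"],
}
alerts = ["vm-instance-1", "database-1", "storage-node-1", "storage-node-2", "network"]
print(root_causes(alerts, depends_on))
```

Here five alerts collapse to a single highlighted event, the network, which is the one an operator needs to see first.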
4. Make holistic operations decisions
In the cloud, you have to manage more types of constructs in your environment. In addition to servers, OSs, and applications, you will have compute pools, storage pools, network containers, services, and, for service providers, tenants. These new constructs are closely related. You can’t view their performance and capacity data in silos; they have to be managed holistically.
Let’s take Amazon’s recent outage as an example again. The root cause was in its network, but it affected the storage pool immediately. That in turn had a huge impact on its EC2 instances, CloudWatch, and RDS services, as well as on many of its customers. If you treat those symptoms separately, you won’t have a solid plan to quickly recover from a similar outage.
You had better know who your critical customers are and which services they use, so you can focus on recovering them in order of priority. You may want to send out alerts to affected customers to proactively let them know there is an issue. Your operations management solution should give you a panoramic view of all these aspects and their relationships. Not only will it let you quickly isolate the problem, but it will also save you money if you know which SLAs cost more to breach.
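The relationships described above can be sketched as a small lookup that goes from a failed construct to the services that depend on it, then to the tenants that use those services, ordered by what a breach would cost. All names, the data layout, and the cost figures are hypothetical.

```python
def recovery_priority(failed, service_deps, tenant_services, breach_cost):
    """List affected tenants, most expensive SLA breach first."""
    # Services that depend on the failed pool or component.
    hit = {svc for svc, deps in service_deps.items() if failed in deps}
    # Tenants that use at least one of those services.
    impacted = [t for t, svcs in tenant_services.items() if hit & set(svcs)]
    return sorted(impacted, key=lambda t: breach_cost[t], reverse=True)


service_deps = {
    "billing-db": {"storage-pool-a"},
    "web-shop": {"compute-pool-1", "storage-pool-a"},
    "wiki": {"storage-pool-b"},
}
tenant_services = {
    "acme": ["web-shop"],
    "globex": ["wiki"],
    "initech": ["billing-db"],
}
breach_cost = {"acme": 500, "globex": 50, "initech": 2000}  # assumed cost per hour

order = recovery_priority("storage-pool-a", service_deps, tenant_services, breach_cost)
print(order)
```

The same list can drive both the recovery sequence and the proactive notifications to affected customers.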
5. Enable self-service for operations
To give your customers a better experience and save on support costs, it’s important to give them constant feedback. Traditionally, performance data has generally not been available to the end user. In the cloud, you have a larger number of users or service requests and relatively fewer administrators to handle them. You want to minimize “false alarms” and routine manual requests. The best way is to let your end users see the performance and capacity data for their own services.
You can also let your users define KPIs to monitor, the threshold levels they want to set, and some routine remediation processes they want to trigger (such as auto-scaling). Your operations management solution should allow you to easily plug this data into your end user portal.
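A user-defined rule of this kind can be modeled as a KPI name, a threshold, and an action to trigger. The sketch below is only an illustration of that shape; the `KpiRule` class, the KPI name, and the `scale_out` action are assumptions, not part of any specific portal or product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class KpiRule:
    kpi: str                     # metric the user chose to watch
    threshold: float             # level at which the user wants action
    action: Callable[[], str]    # routine remediation to trigger


def evaluate(rules, metrics):
    """Run the action of every rule whose KPI is above its threshold."""
    triggered = []
    for rule in rules:
        value = metrics.get(rule.kpi)
        if value is not None and value > rule.threshold:
            triggered.append(rule.action())
    return triggered


# A tenant asks for auto-scaling when response time exceeds 500 ms.
rules = [KpiRule("response_time_ms", 500, lambda: "scale_out")]
print(evaluate(rules, {"response_time_ms": 750}))
```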
6. Make cloud services resilient
Resiliency is the ultimate goal of your cloud operations management. If you have a solution that can understand the behavior of your services and proactively pinpoint potential issues, the natural next step is to have that solution automatically isolate and eliminate the problem. While it sounds simple, it is more complex than it may appear. First, you need to make sure your solution has accurate behavior learning and analytics capabilities. Second, you still need to keep humans in control through well-defined policies, whether enforced by an automated policy engine or by an interactive human process. Lastly, your solution should be able to plug seamlessly into other life cycle management solutions, such as provisioning, change management, service request, etc. Operations management alone can’t make your cloud resilient. You should plan the right architectural design (e.g. designing for failure) to start with and implement a good management process that reflects the paradigm shift to ensure your success.
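Keeping humans in control of automated remediation can be as simple as a policy that separates actions the system may take on its own from actions that wait for an operator. The following is a minimal sketch under that assumption; the action names and the contents of the pre-approved set are hypothetical.

```python
# Hypothetical policy: low-risk actions the solution may run without asking.
AUTO_APPROVED = {"restart_agent", "scale_out"}


def decide(action, approved_by_operator=False):
    """Return 'execute' or 'await_approval' for a proposed remediation."""
    if action in AUTO_APPROVED:
        return "execute"
    # Anything outside the policy needs an explicit human decision.
    return "execute" if approved_by_operator else "await_approval"


print(decide("scale_out"))
print(decide("failover_storage_pool"))
print(decide("failover_storage_pool", approved_by_operator=True))
```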
By no means will these 6 capabilities cover all the aspects of the new generation of cloud operations management. But they are a good start based on what we have heard from our customers and other leading cloud providers. What is the most important shift in your operations management strategy? What are critical capabilities that you think a cloud operations management solution should have? I welcome your thoughts as well.