Have you ever made a decision with only a portion of the information you really needed? Sometimes you’ll get lucky and your hunch will be on target – you’ll take the right action and everything will work out just fine. Other times, things won’t go quite so well, and you’ll regret not taking the extra time to check a few additional facts before responding.
If the situation is as trivial as getting on the wrong train, for example, the consequences will be relatively short-lived and the impact mostly restricted to you! If, however, we’re talking about the failure to understand the true nature of a live or predicted critical system outage, the stakes can be higher. Much higher.
In IT operations, we can be a little quick to react to red warning signals and jump to conclusions about the likely cause. We can be equally guilty of assuming the red lights are simply noise that we can ignore, or that the green lights we’ve chosen to focus on mean that everything is OK.
In this blog, I’ll look at the pitfalls of not integrating and correlating various sources of IT operations data into a consolidated view.
Example #1: Happy systems, angry users
Have you ever witnessed the following scenario? Everything looks fine in the operations center – all status indicators are twinkling with a pleasing shade of green on your monitoring console. Suddenly, the relative calm is interrupted by a call from your IT service desk supervisor. All is not well.
She tells you her team is facing a barrage of calls from less-than-satisfied users: the ERP system’s forms are taking two minutes to load, and this problem couldn’t have come at a worse time for the finance and accounting teams.
Failure to correlate system performance management metrics with data from an end user experience management system can, and frequently does, result in IT operations teams failing to spot and correct critical issues.
In this particular scenario, there are likely many other monitoring capabilities that could have detected the potential for large latencies too. Hopefully the general idea is obvious: assuming all is well simply because the key server and application parameters look healthy may not be the best strategy.
The irony, of course, is that very often all the data sources needed to detect and stay ahead of issues like this are available – they’re just not integrated to form a holistic service management view.
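To make that concrete, here is a minimal Python sketch of what an integrated view changes. Everything in it is an assumption for illustration: the Signal class, the source names and the thresholds are hypothetical and not taken from any particular monitoring product. Only the two-minute form load time comes from the scenario above.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """One reading from a monitoring source (all names here are hypothetical)."""
    source: str       # e.g. "server-monitor" or "end-user-experience"
    metric: str
    value: float
    threshold: float  # a value above this counts as a breach

    @property
    def breached(self) -> bool:
        return self.value > self.threshold


def service_status(signals):
    """Return 'red' if any integrated signal breaches its threshold, else 'green'."""
    return "red" if any(s.breached for s in signals) else "green"


# The operations center view: only server metrics, all within limits.
server_only = [
    Signal("server-monitor", "cpu_percent", 35.0, 85.0),
    Signal("server-monitor", "memory_percent", 60.0, 90.0),
]

# The same view with end-user experience data integrated:
# ERP forms taking two minutes (120 s) against an assumed 5 s target.
with_users = server_only + [
    Signal("end-user-experience", "form_load_seconds", 120.0, 5.0),
]

print(service_status(server_only))  # the console the operators were watching
print(service_status(with_users))   # the view the service desk was living with
```

The point of the sketch is the design choice rather than the arithmetic: the status is computed per service across every source, so a healthy server cannot mask an unhappy user.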
Example #2: Picking the wrong fight
Picture the scene: the monitoring console lights up like a Christmas tree, although this time it’s a menacing shade of red. “Email is down” come the shouts from various operators, and sure enough the MS Exchange server can be seen protesting loudly and issuing all kinds of alerts.
The problem investigations team jumps on the case and prepares to spend a good few hours elbow deep in the server and application. Meanwhile, in a building nearby, a storage management system is quietly detecting various problems with a SAN. The storage team responds (eventually) and starts an investigation at their own pace.
And yes, you guessed it: the alert that most of the team saw as an application failure was in fact caused by a problem in the SAN that supported the application. Trivial, easy to correct, and a depressingly common occurrence – all caused by not modeling (and monitoring) email as a service.
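Modeling email as a service can be as simple as recording what depends on what. The Python sketch below is a hedged illustration, not a description of how any real tool works: the component names, the DEPENDS_ON map and the probable_root_causes helper are all hypothetical. It treats an alerting component as a likely root cause only if nothing it depends on is also alerting.

```python
# Hypothetical dependency map: each component lists what it relies on.
DEPENDS_ON = {
    "email-service": ["exchange-server"],
    "exchange-server": ["san-volume"],
    "san-volume": [],
}


def probable_root_causes(alerting, depends_on):
    """Return alerting components with no alerting dependency beneath them."""

    def has_alerting_dependency(node, seen):
        # Walk direct and indirect dependencies, skipping ones already visited.
        for dep in depends_on.get(node, []):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting or has_alerting_dependency(dep, seen):
                return True
        return False

    return sorted(c for c in alerting if not has_alerting_dependency(c, set()))


# Both the Exchange server and the SAN are raising alerts at the same time.
alerts = {"exchange-server", "san-volume"}
print(probable_root_causes(alerts, DEPENDS_ON))
```

With both alerts in one model, the SAN surfaces as the place to start, and the application team is spared a few hours elbow deep in the wrong server.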
Making sense of noise
Ok, so the two examples I gave above were pretty simple, largely for the sake of clarity and illustration. But the point, I hope, is clear and valid: there is tremendous value in building a service view of your key IT business services.
This is especially true in today’s complex and dynamic technology environments, where services are deployed across many different platforms. It’s not uncommon for organizations to invest in separate systems-management tools for:
• Physical servers
• Virtual servers
• Physical storage
• Virtual storage
• Private cloud systems
• Public cloud systems
Some of these tools play together better than others and will readily integrate into other systems, but watch out! Some do not, especially those tied to a specific infrastructure vendor.
The need to make sense of the noise these systems can generate is self-evident. You can’t possibly optimize your approach to systems performance without aggregating and making sense of these data sources.
Organizing information by business service
Grouping technology infrastructure in terms of the business services it supports is a very powerful organizing principle. Instead of monitoring multiple systems management tools, each with its own categorization and naming conventions, you can build a service model called, for example: “Email service: Houston”.
You can then integrate data from various systems to give you a complete view of the health of the service – from end user latency, right through to the health of the associated storage.
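As a rough sketch of the idea, the Python below groups component statuses under a single named service and rolls them up to the worst one. The ServiceModel class, the component names and the green/amber/red scale are assumptions made for illustration; a real platform will have its own model and terminology.

```python
from dataclasses import dataclass, field

# Assumed status scale, ordered from healthy to failing.
SEVERITY = {"green": 0, "amber": 1, "red": 2}


@dataclass
class ServiceModel:
    """A business service and the latest status of each component behind it."""
    name: str
    components: dict = field(default_factory=dict)  # component name -> status

    def report(self, component, status):
        """Record the latest status reported by any monitoring tool."""
        if status not in SEVERITY:
            raise ValueError(f"unknown status: {status}")
        self.components[component] = status

    def health(self):
        """The service is only as healthy as its worst component."""
        if not self.components:
            return "unknown"
        return max(self.components.values(), key=SEVERITY.__getitem__)

    def failing(self):
        """Components that are not green, sorted by name."""
        return sorted(c for c, s in self.components.items() if s != "green")


email = ServiceModel("Email service: Houston")
email.report("End user latency", "green")
email.report("Exchange server", "green")
email.report("SAN storage", "red")

print(email.name, email.health(), email.failing())
```

Each tool keeps its own naming conventions; the service model is simply the one place where their results meet under a name the business recognizes.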
It’s pretty evident that adopting this approach is going to give you huge advantages in terms of isolating failures quickly and accurately. You’re also going to make much better decisions about which problems to prioritize when you can instantly correlate a failure to the business service and the users of the service it impacts.
Integrating the right data sources
The first step is to select an effective systems management platform that will allow you to integrate multiple monitoring tools without restriction. It should also have sufficient analytic power to allow you to quickly understand the complex relationships between detected issues or system abnormalities.
Ideally, the technology should be proactive, and be looking for and suggesting abnormalities and correlations itself. The technology in this domain has advanced considerably in recent times and is very capable of unearthing hidden relationships and patterns of failure.
You must then decide which tools consistently supply accurate and valuable data. Don’t waste time and effort integrating data from tools that have a reputation for poor reliability or misleading data.
Finally, consider which systems will combine to give you a complete picture of the service. We saw the risks of not correlating one kind of performance data with another in the two scenarios above!
If you’d like to understand more about how BMC can help you build an integrated and intelligent IT operations management platform, start exploring here.
Until next time,