In over 20 years of working on IT infrastructure, I’ve been involved in any number of troubleshooting exercises. Some have been minor—the sort of thing where the network link in the conference room won’t come up, or perhaps someone’s printer is being fussy. Other issues have been more serious, such as a virus raging through the e-mail system (ILOVEYOU did not love me that week) or a broadcast storm that completely knocks out a datacenter. Yet others have been made of subtle evil, coming and going at will, and hiding whenever someone like me is trying to track them down. Problem? What problem?
A fairly consistent thread running through troubleshooting in all these environments has been the silo point of view—what I call “silo myopia.” Most IT shops I’ve been part of have been organized along technology lines. The storage folks are a team. The voice folks are a team. The server folks are a team. The network folks are a team (kept in the basement and avoided at all costs). This creates problems. IT teams organized this way tend not to work together. Rather, they work at avoiding, shifting, and otherwise escaping blame during times of crisis. If a specific team can point at a collection of green lights and normal metrics and say, “Look—see? It’s not our issue. Go away,” then they will often do exactly that.
Why This Becomes Everyone’s Problem
The problem with this type of blame-shifting culture is intuitively obvious. IT teams working independently of one another can’t possibly operate efficiently. Fair enough, but the problem is far deeper than that. The way I see it, silo myopia is a business problem. An IT professional who only sees the business from the point of view of the infrastructure he or she manages misses the bigger picture. Yet, all too often, that’s exactly the case. Taken as a whole, if IT as a collection of infrastructure silos isn’t monitoring the right sorts of things, then it is blind to business-impacting problems.
Let’s consider the case of monitoring network infrastructure, a topic near and dear to my heart. For years, it was acceptable to monitor “red light, green light” status. Is the router, switch, firewall, or interface up (green light)? Excellent! All must be well. Today, we laugh and shake our heads at this simplistic view of network infrastructure.
Nearly all Network Management Stations (NMSs) have evolved well beyond “red light, green light” to share many more useful details with network operators, such as historical interface bandwidth utilization, basic device inventory, configuration history, logged events, and environmental conditions. Furthermore, it’s not uncommon for NMSs to discover information about endpoints connected to the network, track routing table updates, and otherwise pay attention to topological information.
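The gap between a “red light, green light” check and the richer, metric-aware view described above can be sketched in a few lines of Python. This is purely illustrative: the device names, thresholds, and polled numbers are all hypothetical stand-ins for what an NMS might collect.

```python
# A minimal sketch contrasting "red light, green light" monitoring with a
# metric-aware check. All device names, numbers, and thresholds are
# hypothetical examples, not output from any real NMS.

# Simulated poll results for two interfaces on an imaginary switch.
polled = {
    "core-sw1:Gi0/1": {"oper_up": True, "util_pct": 12.0, "errors_per_min": 0},
    "core-sw1:Gi0/2": {"oper_up": True, "util_pct": 97.5, "errors_per_min": 340},
}

def green_light(stats):
    """Classic check: if the interface is up, all must be well."""
    return "green" if stats["oper_up"] else "red"

def metric_aware(stats, util_warn=90.0, err_warn=100):
    """Richer check: an 'up' interface can still be in trouble."""
    if not stats["oper_up"]:
        return "red"
    if stats["util_pct"] >= util_warn or stats["errors_per_min"] >= err_warn:
        return "yellow"
    return "green"

for iface, stats in polled.items():
    print(f"{iface}: simple={green_light(stats)}, detailed={metric_aware(stats)}")
```

Both interfaces pass the simple check, but the second one—saturated and throwing errors—only gets flagged by the metric-aware version. That is exactly the difference between watching link state and watching what the link is actually doing.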
I’ll go so far as to say that a modern NMS has the ability to overwhelm network operators with data. Drilling into an event is like adding progressively higher magnification to a microscope. After a while, the detail being examined is so minute that it becomes easy to forget the big picture.
If you extrapolate this problem across all IT silos, a more serious challenge is exposed. Loads of data are coming in from management consoles all over the department. The question is, what is being done with all that data? All too often, it seems that—despite all of the information found in logs, events flagged by alerts, and colorful charts graphing fascinating metrics—there are no ready explanations when applications fail. The network team has all green lights. The storage team doesn’t see any failed drives or offline interfaces. The DBAs see a SQL engine that’s happily servicing user requests.
What is the root cause of this dilemma all too many IT departments face? I believe the biggest issue is simply that IT silos are monitoring their infrastructures in a vacuum—isolated from one another, and outside of the context of business application delivery.
In part two of this series, we’ll dive into the notion of what a datacenter really does. In part three, we’ll take the big ideas from parts one and two and use them to contemplate a better IT monitoring philosophy.