IT Operations (IT Ops) pros play three critical roles in an organization. They’re architects, builders, and heroes who save the day when things go wrong. They envision and help plan digital environments, build the infrastructure those environments run on, and fix aberrations before and after problems turn into crises.
As they say in the Geico commercials, it’s what they do.
Today, I’d like to focus on the break/fix nature of an IT Ops job, specifically the messy business of preventing IT network crises and dealing with them when they occur. Based on dealing with IT Ops changes over the last 15 years, here are (IMHO) some of the most important things IT pros can do to avoid network crises before they happen and solve crises when they occur.
- What changed? Many (most?) crises happen because of a change in the environment. When diagnosing an issue, it’s helpful to know about other recent changes to the environment. If you can’t find an obvious direct cause for an issue, take a minute and ask: what recently changed that might have caused my problem? This is particularly helpful when resolving an issue in a remote location, where you might not have visibility into everything that happens.
If a server stops communicating, for example, the first steps are always to check the server itself: make sure it’s not hung or down, that its hard drives aren’t full, that it’s connected to the network, etc. If you don’t find your solution on the server itself, it’s time to broaden your search and look at anything else that has recently been changed.
Connections reveal themselves during a failure. Check your project management system or change logs to see what changes have recently occurred on the network. It could be you can’t reach the server because it’s behind a router, switch, or firewall that’s been incorrectly configured. Someone may have accidentally deleted the DNS record for the server or changed a routing path. The problem may have occurred somewhere else and you’re seeing the symptoms, not the cause.
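If your change log can be exported as a list of records, the "what changed?" question can be asked in a few lines of Python. This is only a minimal sketch: the `ChangeRecord` shape, the device names, and the 72-hour window are my own assumptions for illustration, not the format of any particular change management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeRecord:
    """One change log entry (hypothetical shape, not a real tool's schema)."""
    when: datetime
    target: str    # device or service that was changed
    summary: str


def recent_changes(log, incident_time, window_hours=72):
    """Return changes made in the window before the incident, newest first."""
    start = incident_time - timedelta(hours=window_hours)
    hits = [c for c in log if start <= c.when <= incident_time]
    return sorted(hits, key=lambda c: c.when, reverse=True)


if __name__ == "__main__":
    # Made-up entries, purely for illustration.
    log = [
        ChangeRecord(datetime(2024, 3, 1, 9, 0), "core-switch-2", "VLAN cleanup"),
        ChangeRecord(datetime(2024, 3, 9, 22, 0), "edge-firewall", "Rule set update"),
        ChangeRecord(datetime(2024, 3, 10, 1, 30), "dns", "Removed stale A records"),
    ]
    incident = datetime(2024, 3, 10, 6, 0)
    for change in recent_changes(log, incident):
        print(change.when.isoformat(), change.target, change.summary)
```

Listing the newest changes first puts the most likely suspects at the top, which is usually where you want to start looking.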
- Avoid collateral damage through planning – There’s nothing like the sinking feeling caused by an unexpected problem that occurs while you’re making a change in another area. An example of collateral damage might be swapping out a server only to discover it knocks out a nightly transfer, because the transfer’s security is keyed to the machine’s hardware identity and changing the hardware changed the hardware key. The key to battling collateral damage is to do your homework before the change occurs: go deep, identify as many related functions as you can, and add any necessary adjustment steps to your change plan.
- Use a checklist for changes – In his book The Checklist Manifesto: How to Get Things Right, Atul Gawande talks about how to use checklists to increase our ability to deliver information correctly, safely, and reliably. Too often, IT Ops pros walk into a situation and perform critical work using only memory, training, and instinct. Problems occur when they perform steps out of sequence or skip steps. I’m a big proponent of using checklists during network changes as an aid to ensure success and avoid crises. A good checklist helps you plan and correctly implement these steps in the change process:
- Preparatory steps – What needs to be done before the change? What servers or equipment need to be downed or adjusted? Who needs to be notified?
- In-process steps – What steps must be performed during the change? What configurations need to be modified?
- Verifying the change worked – How do you determine whether the change worked? What items should you check? What data should be used for verification?
- Emergency procedures – What mitigation strategies should you use if things go bad? What’s your game plan for a crisis?
- Restoration steps – How do you reverse the preparatory steps you made to implement the change? Paying attention to this step can avoid triggering a crisis in another area.
Checklists don’t have to be long. They just have to be thorough, accurate, and used. IMHO, using a checklist is crucial for a successful network change. For more information, check out my post on eight reasons you need a checklist for implementing IT projects.
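For those who keep their checklists in a script instead of on paper, here is a rough sketch of how the five sections above might be modeled in Python. The class and method names are hypothetical, and skipping emergency steps in the normal run order is my own design assumption, since they only apply when a change goes bad.

```python
from dataclasses import dataclass, field

# Phase names follow the five checklist sections described above.
PHASES = ["preparatory", "in-process", "verification", "emergency", "restoration"]


@dataclass
class Step:
    phase: str
    description: str
    done: bool = False


@dataclass
class ChangeChecklist:
    title: str
    steps: list = field(default_factory=list)

    def add(self, phase, description):
        """Add a step, rejecting phase names that are not in PHASES."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        self.steps.append(Step(phase, description))

    def next_step(self):
        """Return the first unfinished step in phase order, or None if all are done.

        Emergency steps are skipped because they only apply if things go bad.
        """
        for phase in PHASES:
            if phase == "emergency":
                continue
            for step in self.steps:
                if step.phase == phase and not step.done:
                    return step
        return None

    def missing_phases(self):
        """List phases with no steps at all, so gaps are caught while planning."""
        used = {s.phase for s in self.steps}
        return [p for p in PHASES if p not in used]
```

Calling `missing_phases()` before the change window is a cheap way to notice that nobody wrote down the emergency or restoration steps, and `next_step()` keeps the work in sequence even if steps were entered out of order.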
- “The One Thing at a Time” rule – My personal rule is: only perform one major network change at a time. It’s one thing if a single change goes wrong and you have one crisis. It’s another if two or more changes go bad at the same time, creating multiple crises. It’s tempting to perform multiple changes while you have part of the network down, but don’t do it. It’s not worth the risk.
- Know where you’re at: location awareness – The most horrifying self-inflicted wounds occur when an IT pro wipes out a production system while thinking he’s working on a test system. The perfect example is the IT Manager who, while refreshing a QA database, accidentally wipes out the production database because he’s on the wrong machine. These mistakes often occur when using Remote Desktop programs, where it’s easy to connect to the wrong machine by accident. Make sure to put in steps to ensure you’re on the right machine before you start working, even if it’s something as simple as running the hostname command. You’ll thank yourself the first time this stops you from performing work on the wrong machine.
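If the risky work is scripted, the same hostname check can be built into the script itself. The sketch below uses Python's standard `socket.gethostname()`; the function names and the `qa-db01` hostname are made up for illustration, and comparing only the short name (ignoring case and domain suffix) is an assumption that may not suit every environment.

```python
import socket


def require_host(expected, actual=None):
    """Raise if this machine is not the one the work was planned for.

    `actual` defaults to the local hostname; it is a parameter only so the
    check can be exercised without being on a particular machine.
    """
    actual = socket.gethostname() if actual is None else actual
    # Compare short names case-insensitively, so "QA-DB01.corp.example"
    # is treated as a match for "qa-db01".
    if actual.split(".")[0].lower() != expected.split(".")[0].lower():
        raise RuntimeError(f"refusing to continue: on {actual!r}, expected {expected!r}")
    return actual


def refresh_qa_database(expected_host="qa-db01", actual_host=None):
    """Hypothetical destructive task that checks its location first."""
    require_host(expected_host, actual_host)
    # ... the destructive refresh steps would go here ...
    return "refresh started"
```

Putting the check at the very top of the task means the script fails loudly before anything is touched, which is exactly the moment you want to find out you're on the wrong box.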
These items are practical steps that are either not covered or only touched upon in change management guides. Performing simple steps like these can help you deal with an unexpected IT Ops crisis or prevent a crisis from occurring.