SLA Number Five is Alive: A strange and cautionary tale in ITSM

In a very ordinary industrial park, in an unremarkable town in the south of England, for a few short months there “lived” a strange and malevolent force. It inflicted misery and delight on bemused humans in equal amounts, and was eventually cornered and later extinguished.

 

One day, while it was still “alive”, I was invited to meet the culprit, who by this time had been isolated and disarmed to a degree, although it was still managing to cause a few problems thanks to a widespread legacy of destruction.

 

“You need to check this out! You’ll probably never see anything as freaky as this again in your entire career” was an invitation I couldn’t refuse from the service desk manager.

 

What I was introduced to that afternoon a few years ago was a bunch of homespun service level agreement (SLA) logic gone rogue. “Freaky”, it turned out, had been an understatement. This thing was literally out of control and exhibiting a strange kind of intelligence.

 

It had come to be known affectionately as “The Count” to the team, because of its appetite for sending emails simply containing “Count:” to seemingly random recipients in the organization. But that was just the tip of the weirdness iceberg…

 

Other tricks up The Count’s digital sleeve included:

 

  • Texting people with empty SMS alerts whenever he “felt” like it, and I later came to suspect that The Count might have indeed “felt” – just a tiny bit.

 

  • He would frequently open new tickets and close existing tickets (often important tickets) on a whim. The new tickets were sometimes empty and at other times filled with jumbled text fragments.

 

  • Thanks to an integrated alarm escalation system, The Count was able to initiate phone calls to the home landlines of key supervisors and specify a pre-recorded message to be played.

 

The recording he chose most often was message 0, which simply said: “This is a test”. I tried to imagine what it must be like to hear that, bleary-eyed, at 3am…

 

  • The Count also remained reasonably good at his day job. Most of the time the right actions were applied at the right time to the right tickets… but sometimes The Count would wreak havoc and suddenly begin all manner of escalations on a very low priority incident ticket. On other occasions, he would flatly refuse to do anything at all!

 

 

 

Please don’t switch me off etc.…

 

 

As a computer science grad, I experienced a great deal of inner conflict at meeting The Count. On the one hand, it was obvious that underlying this strange behavior there were likely to be some horrific acts of configuration. Like a petulant child, The Count did pretty much what it “liked” (and yes, I do like to think it took pleasure in its work).

 

But my overwhelming reaction was “This is AWESOME”. Here, live, in front of me, was the beguiling property of emergence. By virtue of sheer complexity, The Count was exhibiting strong emergent behavior that nobody understood.

 

The Count, it turns out, was actually the sum of the collective intelligence of more than 3,000 pieces of SLA logic – much of it tightly entwined and perilously complex.

 

Disabling the SLA module they’d built did help somewhat, but didn’t totally alleviate the problem. The Count had created so much workflow itself (as some of his components were designed to do) that whole chunks of undocumented logic were lurking in the depths.

 

Fortunately, the organization in question was about to move to a brand-new software solution, so it was able to start again. But the fault wasn’t really with their homebuilt SLA system at all, which was actually very well designed and extremely easy to use. In fact, it was this ease of use that ultimately led to the system’s downfall.

 

 

Unpicking the morass

 

The real problem was a relatively recent and deeply overambitious SLA program. The following facts should give you a feel for the extent of the problem:

 

  • There were more than 3,000 SLA components. Enough said.

 

  • They had SLAs about SLAs, and then they had SLAs about those SLAs…and then…well, you get the picture. In places the nesting was 7 layers deep. Given the complexity of each individual SLA, this put the system way beyond any human capacity to understand all the potential states it could get into.

 

  • There was complex interdependence and tight coupling between SLAs. Very often they would swap and set triggers for each other, getting into weird and unpredictable looping conditions. Then, along would come another SLA that was also watching that shared trigger and…boom. Welcome to emergence.

 

  • SLAs could also be used to trigger workflow, which in turn created new SLAs, which could then set triggers for their parent SLAs. Most of this machine-generated logic was automatically cleaned up by design, but not all of it. I know, I know…

 

  • Nothing was documented and there was no control on adding new SLAs. Yikes.

 

 

Result? Carnage.
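
To make the looping a little more concrete, here is a toy Python sketch of the shared-trigger problem described above. It is only an illustration, not the real system's logic: the rule names, the trigger names and the `run` function are all invented, and the model assumes each SLA simply watches one trigger and sets others when it fires.

```python
from collections import deque

# Hypothetical toy rules (none of these names come from the real system).
# Each "SLA" watches one trigger and sets other triggers when it fires.
RULES = {
    "sla_a": {"watches": "t1", "sets": ["t2"]},
    "sla_b": {"watches": "t2", "sets": ["t1"]},               # swaps back with sla_a
    "sla_c": {"watches": "t1", "sets": ["page_supervisor"]},  # bystander on the shared trigger
}


def run(rules, first_trigger, max_steps=10):
    """Fire triggers in order and return a log of (sla, trigger) firings.

    max_steps is a safety cap so that a looping rule set still terminates.
    """
    queue = deque([first_trigger])
    log = []
    while queue and len(log) < max_steps:
        trigger = queue.popleft()
        for name, rule in rules.items():
            if rule["watches"] == trigger and len(log) < max_steps:
                log.append((name, trigger))
                queue.extend(rule["sets"])
    return log


if __name__ == "__main__":
    for name, trigger in run(RULES, "t1"):
        print(f"{name} fired on {trigger}")
```

In this sketch, `sla_a` and `sla_b` keep re-arming each other, so the run only stops because of the `max_steps` cap, and `sla_c` pages the supervisor on every lap. Take `sla_b` out and the same starting trigger produces two firings and then silence.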

 

 

Lessons learned

 

I’ll readily admit this is an extreme example, but think about this simple fact for a second: In reality there were only 10 valid and useful business SLAs that mattered.

 

That’s right, just 10. When they took the time to really understand what was needed and to formally agree a sensible scheme with the business, they could only identify 10 separate agreements.

 

This, by the way, is a very common problem. It is easy to build SLAs, and often with the best of intentions, a handful of useful automated agreements can quickly grow into a complex mess. This can be distracting and degrade the service you provide, the very opposite of what SLAs are trying to achieve.

 

Here are a few useful pointers:

 

     1. Think carefully about what you’ve actually agreed with the business and don’t over-engineer it. Did they really want SLAs that measure SLAs that measure SLAs that also measure SLAs? I suspect not.

 

Take note: Just because something is possible does not represent an open invitation for you to build it!

 

     2. Create an SLA review process and board and get every addition endorsed. Give these things the respect they deserve or repent at leisure!

 

     3. If you’re not technical but you are responsible for building and managing SLAs, you might feel a little wary about the unintended consequences of what you do. If so, find a friendly developer to give your logic a once-over…

 

     4. Avoid building aggregate SLAs deeper than 2-3 layers if you can. SLAs about SLAs are fine, and often very useful from a business perspective, but proceed with caution thereafter…

 

 

     5. Minimize the interdependence between SLAs: i.e. sharing triggers and initiating each other. It can be a requirement, but usually has very little business value and is just showing off.
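
For pointers 4 and 5, the kind of once-over a friendly developer might give could be as simple as the Python sketch below. Everything in it is an assumption made for illustration: the `measures` and `triggers` fields, the example SLA names, and the default limit of 3 layers are hypothetical and not taken from any particular ITSM product.

```python
# Hypothetical SLA definitions. "measures" lists the SLAs an aggregate SLA is
# built on; "triggers" lists the SLAs it can initiate. All names are invented.
SLAS = {
    "p1_response": {"measures": [], "triggers": []},
    "p1_resolution": {"measures": [], "triggers": []},
    "monthly_p1": {"measures": ["p1_response", "p1_resolution"], "triggers": []},
    "quarterly": {"measures": ["monthly_p1"], "triggers": []},
}


def depth(slas, name, seen=()):
    """Return how many layers deep `name` is; a plain SLA has depth 1."""
    if name in seen:
        chain = " -> ".join(seen + (name,))
        raise ValueError(f"circular definition: {chain}")
    children = slas[name]["measures"]
    return 1 + max((depth(slas, c, seen + (name,)) for c in children), default=0)


def review(slas, max_depth=3):
    """Return a list of findings for a review board to look at."""
    findings = []
    for name, sla in slas.items():
        layers = depth(slas, name)
        if layers > max_depth:
            findings.append(f"{name}: nested {layers} layers deep (limit {max_depth})")
        for target in sla["triggers"]:
            if name in slas[target]["triggers"]:
                findings.append(f"{name}: triggers {target}, which triggers it back")
    return findings


if __name__ == "__main__":
    for finding in review(SLAS) or ["nothing to report"]:
        print(finding)
```

A check like this only catches direct ping-pong between two SLAs and over-deep nesting, so it complements a human review rather than replacing one.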

 

 

 

Summing up

 

SLAs are an important and immensely powerful tool for cementing the relationship between ITSM teams and the business. But restraint is the watchword for using them effectively.

 

Man, I miss The Count.

 

Cheers

 

Chris

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.


Chris Rixon

Chris has worked in IT Operations Management technology since 1990, in roles spanning: IT helpdesk, software engineering, consulting, architecture, sales engineering and marketing. Chris joined the Remedy Corporation in 2000 and came to BMC during the acquisition in late 2002.