Most enterprises are undergoing rapid digital transformation to keep up with the competition, embracing innovation and new ways to deliver value at greater speed and lower cost. For example, retail stores and banks are trading brick-and-mortar locations for online digital experiences that delight customers with convenient, customized services delivered faster.
Behind the scenes, IT operations supports the infrastructure and apps in these fast-paced digital environments. IT at its core is more relevant now than ever, and it is likely to define the success of the business in today’s digital economy.
Delivering performance and availability for the latest digital experience initiatives requires IT operations to take a strategic approach that helps find and fix issues, and prevent costly downtime and slowdowns, better than ever.
Operational trenches for IT
Day to day, IT deals with hundreds or thousands of incident tickets, often associated with alerts about the performance of apps and infrastructure in the environment. If a ticket were generated for each alert, and IT operations had to address everything related to each alert, the volume of alerting information would easily overwhelm both IT operations and IT service management.
Here are some common issues that occur in the operational trenches:
False positive alerts from static thresholds: Static thresholds for key metrics are usually set from best practices, but they are hard to tune accurately for every scenario. For example, a memory consumption threshold applied across a set of servers could trigger an alert when memory spikes temporarily, or on a server that can handle higher memory consumption without issue.
Unprioritized and duplicate events from the same source: Many enterprises run multiple monitoring solutions, or receive multiple related alerts for the same issue. Either can create an event storm that ties up resources when a single notification and incident ticket would be more practical.
Alerting information not routed to the teams who can fix issues: Once an issue is determined to be critical, getting it to the team responsible for the affected infrastructure, app, or service can be another nightmare if the alert and incident ticket describe the issue but give no indication of which team can address it most effectively.
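To make the first issue concrete, here is a minimal sketch of a single static memory threshold applied to three servers. The 80% threshold, the server names, and the readings are all made up for illustration; they do not come from any particular monitoring tool.

```python
# Hypothetical static threshold: percent of memory in use.
STATIC_THRESHOLD = 80.0


def breaches(samples, threshold=STATIC_THRESHOLD):
    """Return the indices of samples that exceed a fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Hypothetical memory readings (percent used), one per minute.
memory = {
    "app01": [52, 55, 91, 54, 53, 56],    # one transient spike
    "cache01": [84, 85, 86, 85, 84, 85],  # sized to run hot; healthy at about 85%
    "app02": [60, 72, 83, 88, 93, 97],    # steadily climbing: the real problem
}

alerts = {server: breaches(samples) for server, samples in memory.items()}
total_alerts = sum(len(indices) for indices in alerts.values())
```

With these assumed readings, the one threshold fires on all three servers, yet only the alerts from `app02` point at a genuine problem. The rest are the kind of false positives described above.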
Top three best practices for operational success
Keep the number of tickets from alerting and event management systems from getting out of hand by reducing the irrelevant information that ties up resources unnecessarily.
Here are some ways to achieve this goal:
- Aggregate storms of identical events: when an alarm is triggered, create one event and then update a counter instead of letting each repeated alarm create another event in the system
- Use identifiers to correlate related events into a single event that provides all the relevant information needed to address the issue
- Plan “blackout” periods for maintenance windows, and suppress events that are no longer relevant
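The aggregation and blackout ideas above can be sketched in a few lines. This is an illustrative sketch, not how any specific event management product implements them. The `Event` and `EventAggregator` names, the `(source, check)` identifier, and the blackout format are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    """One aggregated event; `count` tracks how many alarms it absorbed."""
    source: str
    check: str
    first_seen: datetime
    last_seen: datetime
    count: int = 1


class EventAggregator:
    """Hypothetical aggregator: one event per (source, check) identifier."""

    def __init__(self, blackouts=None):
        # Maps a source name to a list of (start, end) maintenance windows.
        self.blackouts = blackouts or {}
        self.events = {}

    def in_blackout(self, source, when):
        windows = self.blackouts.get(source, [])
        return any(start <= when < end for start, end in windows)

    def ingest(self, source, check, when):
        """Return the aggregated Event, or None if the alarm was suppressed."""
        if self.in_blackout(source, when):
            return None
        key = (source, check)  # the identifier used to correlate alarms
        event = self.events.get(key)
        if event is None:
            # First alarm for this identifier: create the event.
            event = Event(source, check, first_seen=when, last_seen=when)
            self.events[key] = event
        else:
            # Repeat alarm: update the counter instead of creating a new event.
            event.count += 1
            event.last_seen = when
        return event
```

Feeding five `cpu_high` alarms from the same server into `ingest` leaves one event with `count == 5`. An alarm that arrives for a source inside its maintenance window returns `None` and never reaches the event list.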
Most enterprises have a range of vendor tools to manage their IT, accumulated from legacy monitoring, acquisitions, and their own innovation. As a result, IT operations might have a slew of event management options that address issues differently and integrate differently with the environment.
Here are some ways to simplify workflows in this environment:
- Have visibility of events and data from multiple sources in a single console instead of having to log in to multiple consoles from different vendor tools
- Associate service models from configuration management database (CMDB) information with your event management system to add service impact to events associated with configuration items (CIs), which helps prioritize events and builds more intelligence into them
- Use dynamic grouping of events for specific roles so that the teams responsible for addressing events have access to relevant views instead of having to sort through a bunch of events unrelated to their responsibilities
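As a rough illustration of the last two points, the sketch below enriches raw events with service impact from a CMDB lookup and then groups them into per-team views. The `CMDB` dictionary, its field names, and the impact levels are hypothetical stand-ins for whatever your CMDB actually exposes.

```python
from collections import defaultdict

# Hypothetical CMDB extract: configuration item -> service, impact, owning team.
CMDB = {
    "web01": {"service": "online-banking", "impact": "high", "team": "web-ops"},
    "web02": {"service": "staging-site", "impact": "low", "team": "web-ops"},
    "db01": {"service": "online-banking", "impact": "high", "team": "dba"},
}

IMPACT_RANK = {"high": 0, "medium": 1, "low": 2}


def enrich(event, cmdb=CMDB):
    """Attach service impact and owning team to an event via its CI."""
    ci = cmdb.get(event["ci"], {})
    return {
        **event,
        "service": ci.get("service", "unknown"),
        "impact": ci.get("impact", "low"),
        "team": ci.get("team", "unassigned"),
    }


def views_by_team(events, cmdb=CMDB):
    """Group enriched events per team, highest service impact first."""
    views = defaultdict(list)
    for event in events:
        enriched = enrich(event, cmdb)
        views[enriched["team"]].append(enriched)
    for team_events in views.values():
        team_events.sort(
            key=lambda e: IMPACT_RANK.get(e["impact"], len(IMPACT_RANK))
        )
    return dict(views)
```

Each team then sees only its own events, ordered by the impact of the affected service. Events for CIs missing from the CMDB land in an `unassigned` view instead of disappearing.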
Integrate with ticketing
Integrating IT operations with IT service management for the generation and assignment of tickets can greatly reduce the mean time to repair (MTTR) for issues in any digital environment.
Here are some examples of integration:
- Convert known actionable events for device availability and performance to service desk incidents
- Route specific event and incident data to the responsible teams so they can address issues quickly, and identify patterns over time so that automated remediation can be configured in place of manual fixes
- Automatically update both related events and service desk incidents with remediation status so that when these events occur the end-to-end flow from event to closure is more efficient
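The flow from event to ticket to closure might look something like the following sketch. Everything here is an assumption made for illustration: the `TicketBridge` class, the set of actionable event types, and the domain-to-team routing table are hypothetical and do not reflect a real service desk API.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical configuration: which event types become incidents, and who owns them.
ACTIONABLE_TYPES = {"device_down", "latency_high"}
TEAM_BY_DOMAIN = {"network": "network-ops", "database": "dba"}
DEFAULT_TEAM = "service-desk"


@dataclass
class Incident:
    incident_id: int
    event_id: str
    team: str
    status: str = "open"


class TicketBridge:
    """Hypothetical glue between event management and the service desk."""

    def __init__(self):
        self._next_id = count(1)
        self.incidents = {}      # event_id -> Incident
        self.event_status = {}   # event_id -> status on the event side

    def handle_event(self, event_id, event_type, domain):
        """Create and route an incident for a known actionable event."""
        if event_id in self.incidents:
            return self.incidents[event_id]  # already ticketed
        if event_type not in ACTIONABLE_TYPES:
            self.event_status[event_id] = "logged"
            return None
        team = TEAM_BY_DOMAIN.get(domain, DEFAULT_TEAM)
        incident = Incident(next(self._next_id), event_id, team)
        self.incidents[event_id] = incident
        self.event_status[event_id] = "ticketed"
        return incident

    def remediate(self, event_id):
        """Record remediation on both the incident and the related event."""
        incident = self.incidents[event_id]
        incident.status = "resolved"
        self.event_status[event_id] = "closed"
        return incident
```

Keeping the event and incident status in one place is what lets both sides be updated together. In a real deployment, the two updates would be calls to the event management and service desk systems.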
Going beyond operational event management to machine-assisted learning with AIOps
Whenever possible, you’ll want to correlate and prioritize events from all areas of your on-premises and public cloud infrastructure, automatically generate incident tickets, notify the service desk before users become aware of the problem, and integrate and analyze events from third-party monitoring solutions.
Operational alerting and event management must now be coupled with advanced machine learning and analytics from an Artificial Intelligence for IT Operations (AIOps) platform. Enterprises need advanced ways to triage and automate IT to deliver business value competitively.
With TrueSight from BMC, you can go beyond the basics of event management to make intelligent decisions based on the volume and velocity of service ticket and event management data using AIOps approaches. You can leverage machine learning on a big data platform to find root cause issues and business impact issues, and address them to further reduce the noise of event management.
For example, with this approach, you can correlate the time to resolution for specific events with business value to determine whether your organization needs more training on some technology issues, or to justify converting some manual activities to automation.
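A very simple version of that analysis, far short of machine learning, is to summarize resolution time per event type and rank the types by total effort. The ticket fields below (`event_type`, `minutes_to_resolve`, `manual`) are assumed for the sketch, and total minutes is used only as a crude proxy for business value.

```python
from collections import defaultdict
from statistics import mean


def resolution_report(tickets):
    """Summarize resolution effort per event type, costliest first.

    Each ticket is a dict with hypothetical keys: 'event_type',
    'minutes_to_resolve', and 'manual' (True if fixed by hand).
    """
    by_type = defaultdict(list)
    for ticket in tickets:
        by_type[ticket["event_type"]].append(ticket)

    report = []
    for event_type, group in by_type.items():
        minutes = [t["minutes_to_resolve"] for t in group]
        report.append({
            "event_type": event_type,
            "tickets": len(group),
            "mean_minutes": mean(minutes),
            "total_minutes": sum(minutes),
            "manual_share": sum(1 for t in group if t["manual"]) / len(group),
        })
    # Types that consume the most total time are candidates for training
    # or for converting manual fixes to automation.
    return sorted(report, key=lambda row: row["total_minutes"], reverse=True)
```

Event types near the top of the report with a high `manual_share` are the ones where training or automation is most likely to pay off.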