Where do incidents come from? Every service has errors, flaws, or vulnerabilities that may cause incidents, which can originate from any of the four dimensions of service management. For example, a piece of software may have a bug, some equipment may have corrupted memory, or a vendor may not have capacity to address service issues in line with agreed targets. Some errors remain unidentified or unresolved during service design, development and deployment, and may be a risk to live services. In ITIL, we define a problem as a cause, or potential cause, of one or more incidents. A known error is defined as a problem that has been analysed but has not been resolved.
The purpose of problem management is to reduce the likelihood and impact of incidents by identifying actual and potential causes of incidents, and managing workarounds and known errors.
Problems are related to incidents, but it is important to differentiate them in the way they are managed:
- Incidents have an impact on users or business processes, and must be resolved so that normal business activity can take place.
- Problems are the causes of incidents; they therefore require investigation and analysis to identify those causes, develop workarounds, and recommend longer-term resolution. This reduces the number and impact of future incidents.
Problem management involves three distinct phases:
1. Problem Identification
Problem identification activities identify and log problems by:
- Performing trend analysis of incident records;
- Detecting duplicate and recurring issues;
- Identifying, during major incident management, a risk that an incident could recur;
- Analyzing information received from suppliers and partners;
- Analyzing information received from internal software developers, test teams, and project teams.
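As an illustration of the first two activities, trend analysis can start as a simple count of incidents that share a service and symptom. The sketch below is a minimal example under stated assumptions: the record fields (`id`, `service`, `symptom`), the sample data, and the threshold of three occurrences are all hypothetical and are not prescribed by ITIL.

```python
from collections import Counter

# Hypothetical incident records; the field names are illustrative only.
incidents = [
    {"id": "INC001", "service": "email", "symptom": "login timeout"},
    {"id": "INC002", "service": "email", "symptom": "login timeout"},
    {"id": "INC003", "service": "payroll", "symptom": "report fails"},
    {"id": "INC004", "service": "email", "symptom": "login timeout"},
    {"id": "INC005", "service": "payroll", "symptom": "slow response"},
]


def find_recurring(records, threshold=3):
    """Return (service, symptom) pairs seen at least `threshold` times."""
    counts = Counter((r["service"], r["symptom"]) for r in records)
    return {key: n for key, n in counts.items() if n >= threshold}


# Each recurring pair is a candidate for logging as a problem record.
candidates = find_recurring(incidents)
print(candidates)
```

A real service desk tool would normally group on richer attributes (configuration item, error code, time window), but the principle is the same: recurrence in incident data is a signal that an underlying problem may exist.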
2. Problem Control
Problem control activities include problem analysis and documenting workarounds and known errors. Just like incidents, problems are prioritized based on the risk they pose to services in terms of probability and impact. Focus should be given to the problems that pose the highest risk to services and service management.
When analysing incidents, it is important to remember that they may have interrelated causes with complex relationships. Problem analysis should therefore take a holistic approach and consider all contributory causes, including those that caused the incident to happen, made it worse, or prolonged it.
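One simple way to make risk-based prioritization concrete is to score each problem as probability multiplied by impact and sort by that score. This is only a sketch: ITIL does not mandate a particular scoring formula, and the `Problem` class, the 0 to 1 probability scale, the 1 to 5 impact scale, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    ref: str
    probability: float  # assumed scale: likelihood of further incidents, 0.0 to 1.0
    impact: int         # assumed scale: impact on services, 1 (low) to 5 (high)

    @property
    def risk(self) -> float:
        # Illustrative scoring rule: risk = probability x impact.
        return self.probability * self.impact


def prioritize(problems):
    """Return problems ordered from highest to lowest risk score."""
    return sorted(problems, key=lambda p: p.risk, reverse=True)


problems = [
    Problem("PRB-1", 0.2, 5),
    Problem("PRB-2", 0.9, 3),
    Problem("PRB-3", 0.5, 4),
]

for p in prioritize(problems):
    print(p.ref, round(p.risk, 2))
```

With these sample values, the frequent medium-impact problem (PRB-2) ranks above the rare high-impact one (PRB-1), which shows why probability and impact need to be considered together.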
When a problem cannot be resolved quickly, it is often useful to find and document a workaround for future incidents, based on an understanding of the problem. A workaround is defined as a solution that reduces or eliminates the impact or probability of an incident or problem for which a full resolution is not yet available. An example of a workaround could be restarting services in an application, or failover to secondary equipment. Workarounds are documented in problem records, and this can be done at any stage without necessarily having to wait for analysis to be complete. However, if a workaround has been documented early in problem control, then this should be reviewed and improved after problem analysis has been completed.
An effective incident workaround can become a permanent way of dealing with some problems, where resolution of the problem is not viable or cost-effective. If this is the case, then the problem remains in the known error status, and the documented workaround is applied when related incidents occur. Every documented workaround should include a clear definition of the symptoms and context to which it applies. Workarounds may be automated for greater efficiency and faster application.
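The requirement that every workaround states the symptoms and context it applies to can be sketched as a small data structure with a lookup. The class names, fields, and sample known error below are hypothetical; a real tool would store these in problem records or a knowledge base.

```python
from dataclasses import dataclass


@dataclass
class Workaround:
    description: str
    symptoms: frozenset  # symptoms the workaround is known to address
    context: frozenset   # services or environments where it applies


@dataclass
class KnownError:
    ref: str
    workaround: Workaround


def matching_known_errors(known_errors, symptom, service):
    """Return known errors whose workaround covers this symptom and service."""
    return [
        ke for ke in known_errors
        if symptom in ke.workaround.symptoms and service in ke.workaround.context
    ]


# Hypothetical known error with a documented workaround.
known_errors = [
    KnownError(
        ref="KE-101",
        workaround=Workaround(
            description="Restart the application services",
            symptoms=frozenset({"login timeout"}),
            context=frozenset({"email"}),
        ),
    ),
]

hits = matching_known_errors(known_errors, "login timeout", "email")
print([ke.ref for ke in hits])
```

Keeping symptoms and context explicit is what makes it safe to automate a workaround: the lookup returns nothing when an incident falls outside the documented scope, instead of applying a fix where it was never validated.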
3. Error Control
Error control activities manage known errors, and may enable the identification of potential permanent solutions. Where a permanent solution requires change control, this has to be analysed from the perspective of cost, risk and benefits.
Error control also regularly re-assesses the status of known errors that have not been resolved, taking account of the overall impact on customers and/or service availability, the cost of permanent resolutions, and the effectiveness of workarounds. The effectiveness of a workaround should be evaluated each time it is used, as the workaround may be improved based on the assessment.
Interfaces of Problem Management with Other Practices
| Practice | Interface with problem management |
|---|---|
| Incident Management | Activities from these two practices are closely related and may complement each other (e.g. identifying the causes of an incident is a problem management activity that may lead to incident resolution), but they may also conflict (e.g. investigating the cause of an incident may delay actions needed to restore service). |
| Risk Management | Problem management activities aim to identify, assess, and control risks in any of the four dimensions of service management. Therefore, it may be useful to adopt risk management tools and techniques. |
| Change Control | Problem management typically initiates resolution via change control and participates in the post-implementation review. However, approval and implementation are outside the scope of problem management. |
| Knowledge Management | Output from problem management includes information and documentation concerning workarounds and known errors. Problem management may also use information in a knowledge management system to investigate, diagnose, and resolve problems. |
| Continual Improvement | Problem management activities can identify improvement opportunities in all four dimensions of service management. Solutions to problems may be documented in a continual improvement register or added to a product backlog. |
People Aspects of Problem Management
Many problem management activities rely on the knowledge and experience of staff, rather than on detailed, documented procedures. Skills and capabilities in problem management include the ability to understand complex systems and to think about how different failures might have occurred. Developing this combination of analytical and creative ability requires mentoring and time, as well as suitable training in techniques such as Cynefin, Kepner-Tregoe, 5 Whys, Ishikawa diagrams, and Pareto analysis, among others.
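Of the techniques listed, Pareto analysis is the easiest to show in a few lines: rank causes by how many incidents they account for, then keep the smallest set that covers a chosen share of the total. The sketch below assumes an 80 percent cutoff and uses made-up cause names and counts.

```python
def pareto(cause_counts, cutoff_percent=80):
    """Return the top causes that together cover `cutoff_percent` of incidents."""
    total = sum(cause_counts.values())
    ranked = sorted(cause_counts.items(), key=lambda kv: kv[1], reverse=True)

    vital_few = []
    cumulative = 0
    for cause, count in ranked:
        vital_few.append(cause)
        cumulative += count
        # Integer comparison avoids floating-point rounding at the cutoff.
        if cumulative * 100 >= cutoff_percent * total:
            break
    return vital_few


# Hypothetical incident counts per identified cause.
counts = {
    "config drift": 50,
    "expired certificate": 30,
    "disk full": 10,
    "network": 6,
    "other": 4,
}

print(pareto(counts))
```

In this sample, two of the five causes account for 80 of 100 incidents, so analysis effort would go to those two first.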
Contribution of Problem Management to the Service Value Chain
As problem management deals with errors in the operational environment, it is involved mainly in the "improve" and "deliver and support" activities of the service value chain, as shown below:

| Value chain activity | Contribution of problem management |
|---|---|
| Engage | Customers may wish to be involved in problem prioritization, and the status and plans for managing problems should be communicated. |
| Design and Transition | Problem management provides information that helps to improve testing and knowledge transfer. |
| Obtain/Build | Product defects may be identified by problem management and be managed during this activity. |
| Deliver and Support | Problem management makes a significant contribution by preventing incident repetition and supporting timely incident resolution. |
| Improve | Effective problem management provides the understanding needed to reduce the number of incidents and the impact of incidents that can’t be prevented. |