What is problem management?
Problem management is one aspect of ITIL implementation that gives many organizations headaches. The difficulty lies in the similarity between incident management and problem management. The two processes are so closely aligned that differentiating their activities can be difficult for ITIL novices. At what point does one turn into the other? In some organizations, the two processes are so closely related that they are combined altogether. The differences are important, however, because the two processes have different objectives.
The term “problem” refers to the unknown cause of one or more incidents. A useful metaphor for understanding the relationship between problems and incidents is to think of the relationship between a disease and its symptoms. In this metaphor, the disease is the problem and the symptoms are the incidents. Just as a doctor uses the symptoms to diagnose the disease, so problem management uses the incidents to diagnose the problem.
When incidents occur, the role of incident management is to restore service as rapidly as possible, without necessarily identifying or resolving the underlying cause of the incidents. If incidents occur rarely or have little impact, assigning resources to perform root-cause analysis may not be justified. However, if an individual incident or a series of repeated incidents causes significant impact, problem management is tasked with diagnosing the underlying cause of the incidents and, ultimately, identifying a means to remove that cause.
Problem management’s first activity is to diagnose the problem and validate any workarounds. Problem management uses a problem database to track problems and to associate any identified workarounds with them. Once the problem has been diagnosed and a workaround identified, the problem is referred to as a “known error.” These are documented in the known error database (KEDB), which may be the same physical database as the problem database. The KEDB is a significant tool for incident management in resolving incidents caused by known errors.
After the known error has been identified, the next step is to determine how to fix it. This will typically involve a change to one or more CIs, so the output of the problem management process would be a request for change, which would then be evaluated by the change management process, or included in the CSI register.
Problem management is often thought of as a reactive process, in that it is invoked after incidents have occurred. It is actually proactive as well, since its goal is to ensure that incidents do not recur in the future or, if they do, to minimize their impact.
Problem Management 101
Problem management is a step beyond incident management in the ITIL service operation lifecycle. Incident management handles any unplanned interruption to or quality reduction of an IT service, whereas problem management handles the root causes of incidents. Or in clearer terms, incident management restores service whereas problem management eliminates the cause of failed services.
A problem is defined by ITIL as the cause of one or more incidents. Some incidents, such as a malfunctioning mouse at a user’s workstation, are not indicative of a problem. Other incidents, such as repeated network outages, trigger a problem investigation because of their frequency. In this case, problem management is reactive. Proactive problem management involves assessing the state of hardware, software, and processes, and preemptively addressing issues before they cause excessive incidents. Neither incident management nor request management has the ability to be proactive like problem management.
The purpose of problem management
When users continue to face the same incidents without resolution, they lose trust in the service desk’s ability to resolve any problem. Hence the primary objective of problem management is to identify, troubleshoot, document, and resolve the root causes of repeated incidents. Incident information filters up to problem management and problem management, in turn, provides the service desk with the known error and workaround information necessary to mitigate problems in the short term.
Problems include issues such as failing hardware or an inadequately configured database query. Problem management reduces incidents over the long term. Incident reduction decreases the load on the service desk, improves end-user satisfaction, and decreases the long-term costs associated with user and service downtime. When problems cannot be resolved, problem management works with the service desk to mitigate the impact of the related incidents. The end goal of problem management should always be to reduce the overall quantity of preventable incidents and thereby increase the quality of service provided.
The scope of problem management
Problem management has a very limited scope and includes the following activities:
- Problem detection
- Problem logging
- Problem categorization
- Problem prioritization
- Problem investigation and diagnosis
- Creating a known error record
- Problem resolution and closure
- Major problem review
The main function of problem management
While problem management involves several functions, the most important is the service desk. While it is also known as a help desk, this is not the ITIL-preferred term and should be avoided. In ITIL, this function acts as the single point of contact for service customers to report incidents and submit service requests. Without a single point of contact, users may contact staff and expect immediate service without prioritization limitations. Unfortunately, this means that urgent incidents could be ignored while incidents that don’t impact the business get handled first. Another common scenario is that important but low-priority incidents are not handled for weeks while the IT support staff take care of the most pressing issues on their desks, leaving no time for smaller issues. The service desk allows the service provider to address everyone’s issues promptly and sequentially. It also encourages knowledge transfer between departments, collects data on IT trends, and feeds problem management.
This function can be divided into separate support levels called tiers. The first tier is for basic issues. This includes low-priority issues such as basic computer troubleshooting. Tier-one incidents are the most likely to be turned into incident models, since they are easy to solve and recur often. Tier-one incidents do not impact the business or other users. They can always be worked around until the service desk resolves them. For example, a Microsoft® Outlook® error can be worked around by using the web-based email application instead.
Then there’s tier two. The second-tier support level handles issues that have some impact on the user but not on the business as a whole. Usually these incidents require more skill or access to resolve. Tier-two incidents are medium priority, and require a more immediate response and higher level of access or training than tier-one incidents.
Tier-three incidents affect the entire organization and many users. Sometimes, a VIP may fall into a tier-two or tier-three categorization to provide a faster response time for these users. Often, these incidents fall into the Major Incident Response (MIR) process. These incidents are defined by ITIL as those that cause significant disruption to the business. These are always high priority. Incidents that require MIR are good candidates as potential problems, since they affect the business and likely have a different root cause than regular incidents.
You’ll know that you’ve accurately assessed tiers and priorities when most incidents fall into tier one/low priority, fewer fall into tier two, and only a few require escalation to tier three.
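As a rough sanity check on that distribution, the tier assigned to each incident can be tallied and compared. The sketch below is only an illustration: the `tiers` sample and the `tier_distribution_looks_healthy` helper are hypothetical, and "most, fewer, only a few" is interpreted here as strictly decreasing counts from tier one to tier three.

```python
from collections import Counter

# Hypothetical sample: the tier assigned to each incident in a reporting period.
tiers = [1, 1, 1, 1, 1, 2, 2, 3]


def tier_distribution_looks_healthy(tiers):
    """Return True when tier-one > tier-two > tier-three incident counts."""
    counts = Counter(tiers)
    return counts[1] > counts[2] > counts[3]


healthy = tier_distribution_looks_healthy(tiers)
print(healthy)
```

A result of False does not prove that tiers are misassigned, but it is a reasonable prompt to review how incidents are being categorized.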
The service desk interfaces with the problem management team in several ways. The first interaction is when a potential problem is raised. This often happens when an incident is deemed unresolvable at the service desk and must be escalated. This also happens when an incident occurs repeatedly despite normal troubleshooting and resolution steps. Finally, when the problem management or continual service improvement team identifies problems proactively, they may contact the service desk for more information or incident statistics.
The problem management process
The ITIL problem management process has many steps, and each is vitally important to the success of the process and the quality of service delivered.
The first step is to detect the problem. A problem is raised either through escalation from the service desk, or through proactive evaluation of incident patterns and alerts from event management or continual service improvement processes. Signs of a problem include incidents that occur across the organization with similar conditions, incidents that repeat despite otherwise successful troubleshooting, and incidents that are unresolvable at the service desk.
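Two of these signs, repetition and incidents the service desk could not resolve, can be spotted with a simple pass over incident data. The following is a minimal sketch rather than a feature of any particular ITSM tool: the incident tuples, the `problem_candidates` helper, and the `repeat_threshold` of three are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical incident records: (incident_id, category, resolved_at_service_desk)
incidents = [
    ("INC-1", "network/outage", True),
    ("INC-2", "network/outage", True),
    ("INC-3", "hardware/mouse", True),
    ("INC-4", "network/outage", True),
    ("INC-5", "database/slow-query", False),
]


def problem_candidates(incidents, repeat_threshold=3):
    """Return categories that repeat often or could not be resolved at the desk."""
    counts = Counter(category for _, category, _ in incidents)
    repeated = {c for c, n in counts.items() if n >= repeat_threshold}
    unresolved = {c for _, c, resolved in incidents if not resolved}
    return sorted(repeated | unresolved)


candidates = problem_candidates(incidents)
print(candidates)
```

With this sample data, the repeated network outages and the unresolved database incident are flagged, while the one-off mouse failure is not.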
The second step is to log the problem. In an ITIL framework, each problem is logged in a problem record, and together these records form a log of every problem in the organization. This can be accomplished via a ticketing system that allows for problem ticket types. Pertinent problem data, such as the time and date of occurrence, the related incident(s), the symptoms, previous troubleshooting steps, and the problem category, all help the problem management team research the root cause.
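To make the contents of a problem record concrete, here is one possible shape for it. This is a sketch under stated assumptions: the `ProblemRecord` class and its field names are hypothetical and simply mirror the data listed above. They are not an ITIL-defined schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ProblemRecord:
    """Illustrative problem record; field names are assumptions, not an ITIL schema."""
    problem_id: str
    occurred_at: datetime
    category: str
    symptoms: str
    related_incidents: List[str] = field(default_factory=list)
    troubleshooting_steps: List[str] = field(default_factory=list)


record = ProblemRecord(
    problem_id="PRB-1",
    occurred_at=datetime(2024, 1, 15, 9, 30),
    category="network/outage",
    symptoms="Intermittent loss of connectivity",
    related_incidents=["INC-1", "INC-2", "INC-4"],
    troubleshooting_steps=["Restarted switch", "Replaced patch cable"],
)
print(record.problem_id, len(record.related_incidents))
```

Keeping the related incidents and earlier troubleshooting steps on the record itself means the investigation does not have to start from scratch.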
The third step is to categorize the problem. Problem categorization should match incident categorization. Incident and problem categorization involves assigning a main and secondary category to the issue. This step is beneficial in several ways. One benefit is that it allows the service desk to sort and model incidents that occur regularly. A second is that modeling allows for automatic assignment of priority. The third and most important benefit is the ability to gather and report on service desk data. This data allows the organization not only to track problem trends, but also to assess their effect on service demand and service provider capacity.
The fourth step is to prioritize the problem. A problem’s priority is determined by its impact on users and on the business and its urgency. Urgency is how quickly the organization requires a resolution to the problem. The impact is a measure of the extent of potential damage the problem can cause the organization. Prioritizing the problem allows an organization to utilize investigative resources most effectively. It also allows organizations to mitigate damage to the service level agreement (SLA) by reallocating resources as soon as the issue is known.
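One common way to combine the two factors is an impact and urgency matrix. The sketch below assumes a three-level scale and multiplies the two scores. The `LEVELS` scale, the thresholds, and the sample problems are illustrative assumptions, and each organization will define its own.

```python
# Assumed three-level scale; real organizations define their own levels.
LEVELS = {"low": 1, "medium": 2, "high": 3}
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def problem_priority(impact, urgency):
    """Combine impact and urgency into a priority (thresholds are illustrative)."""
    score = LEVELS[impact] * LEVELS[urgency]
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


# Hypothetical open problems: (problem_id, impact, urgency)
problems = [
    ("PRB-1", "high", "medium"),
    ("PRB-2", "low", "low"),
    ("PRB-3", "medium", "medium"),
]

# Work the queue from highest to lowest priority.
queue = sorted(problems, key=lambda p: PRIORITY_ORDER[problem_priority(p[1], p[2])])
for problem_id, impact, urgency in queue:
    print(problem_id, problem_priority(impact, urgency))
```

Sorting the open problems this way gives the investigation team an ordered work queue, which is the practical payoff of the prioritization step.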
The fifth step is a two-part process, which involves investigating and diagnosing the problem. The speed at which a problem is investigated and diagnosed depends on its assigned priority. High-priority issues should always be addressed first, as their impact on services is the greatest. Correct categorization helps here, since identifying trends is easier when problem categories correlate to incident categories. Diagnosis usually involves analyzing the incidents that led to the problem report as well as further testing that may not be possible at the service desk level, such as advanced log analysis.
The sixth step is to identify a workaround for the problem. A workaround should always be indicated, because problems are not resolved at the incident level. A workaround enables the service desk to restore services to users while the problem is being resolved. A problem can take anywhere from an hour to months to resolve, so a workaround is vital. A problem is considered open until resolved, so a workaround should only be considered a temporary measure.
Step seven is to raise a known error record. Once the workaround has been identified, it should be communicated to staff within the organization as a known error. It’s good practice to record a known error in both an incident knowledge base and what ITIL calls a known error database (KEDB). Documenting the workaround allows the service desk to resolve incidents quickly and avoid further problems being raised on the same issue.
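The value of the KEDB to the service desk is easiest to see as a lookup. The following sketch assumes a KEDB keyed by incident category and borrows the Outlook workaround mentioned earlier. The `kedb` dictionary, the `lookup_workaround` helper, and the identifiers in it are hypothetical.

```python
# Hypothetical KEDB keyed by incident category; entries are illustrative.
kedb = {
    "email/outlook-error": {
        "problem_id": "PRB-7",
        "workaround": "Use the web-based email application",
    },
}


def lookup_workaround(kedb, incident_category):
    """Return the documented workaround, or None if no known error matches."""
    entry = kedb.get(incident_category)
    return entry["workaround"] if entry else None


print(lookup_workaround(kedb, "email/outlook-error"))
print(lookup_workaround(kedb, "network/outage"))
```

When the lookup returns nothing, the incident is not covered by a known error and follows the normal troubleshooting and escalation path.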
Step eight is to resolve the problem. Problems should be resolved whenever possible. Resolution removes the underlying cause of a set of incidents and prevents those incidents from recurring. Some resolutions may require review by the change management board, as they may affect service levels. For example, a database switchover may cause slowness during the switchover period. All risks should be evaluated and accounted for before implementing the resolution. Document the steps taken to resolve the problem in the organization’s knowledge base.
The ninth step is to close the problem. This step should only occur after the problem has been raised, categorized, prioritized, identified, diagnosed, and resolved. While many organizations stop at this step, it isn’t the last according to ITIL.
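That precondition can be enforced as a simple gate before a problem is closed. This is a minimal sketch: the step names come from the list above, while the `missing_steps` and `can_close` helpers and the way completed steps are tracked are assumptions for illustration.

```python
# Steps that must be complete before closure, as listed in the text.
REQUIRED_STEPS = [
    "raised", "categorized", "prioritized",
    "identified", "diagnosed", "resolved",
]


def missing_steps(completed):
    """Return the required steps that have not been completed, in order."""
    return [step for step in REQUIRED_STEPS if step not in completed]


def can_close(completed):
    """A problem may be closed only when no required step is missing."""
    return not missing_steps(completed)


# Hypothetical problem that has been diagnosed but not yet resolved.
completed = {"raised", "categorized", "prioritized", "identified", "diagnosed"}
print(can_close(completed), missing_steps(completed))
```

Reporting which steps are missing, rather than only refusing closure, tells the team exactly what remains to be done.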
The final step is to review the problem. This is also known as a major problem review. The major problem review is an organizational activity that helps prevent future problems. During the review, the problem management team evaluates the problem documentation and identifies what happened and why. Lessons learned, such as process bottlenecks, what went wrong, and what helped, should be discussed. This is where having a complete problem log will help. A completed log will work much better than trying to pull the details from memory. This problem review should result in improved processes, staff training, or more complete documentation.
Problem management process flow diagram
How problem management fits into ITIL
Problem management is only one component of the ITIL service management lifecycle. Within ITIL, it exists in the service operation main process. As a process, it interfaces with many other parts of ITIL. Due to its relationship with the service desk, it is directly affected by and affects incident management. It also interfaces with financial management, since the financial impact of a problem is considered during the prioritization and resolution stages. It interfaces with service design when past and potential problems are considered during the IT design process. It interfaces with knowledge management when known issues are recorded. Finally, it interfaces with continual service improvement when problem management is proactive, since both have the goal of improving the quality of service delivered to internal and external customers.
This process is one that is integral to long-term service delivery success and therefore should not be ignored when designing a robust IT service, whether it’s internally or externally facing. Read on to discover more about the ITIL lifecycle.
BMC Helix: Next Generation ITSM
BMC Helix ITSM combines the latest in digital and cognitive automation technologies to enable best-practice ITSM principles, helping you to provide intelligent and predictive service management across any environment. Learn more about BMC Helix ITSM
- Optimized for ITIL® 4
- Predictive service management through auto-classification, assignment, and routing of incidents
- Integrations with leading agile DevOps tools such as Jira
- Delivered in containers to enable operational and cloud deployment efficiencies