Incidents, defined by ITIL (IT Infrastructure Library) as any unplanned interruption or reduction of quality in an IT service, can cause companies to lose millions of dollars. In fact, according to research by Gartner, major incidents (emergency-level service outages) cost businesses up to $300,000 per hour.
So, managing such incidents and restoring services quickly is essential for any business. Having an incident management process ready to implement is a huge help here. Incident management focuses on the handling and escalation of incidents to restore services to the levels defined by your service level agreement (SLA). It does not deal with root cause analysis or the resolution of deeper issues — instead, its objective is to bring normal service back as quickly as possible after an incident.
Incident management must be largely proactive: it means having processes and systems in place to restore functions rapidly when incidents occur. These processes cover functions such as optimizing facility management and automating emergency response.
Any incident management process includes a set of defined steps that help resolve incidents quickly. ITIL, a framework of best practices for IT service management, lays out the following five steps for resolving a major incident quickly and effectively.
The first step in the incident management lifecycle is to identify the incident. Incidents can be reported by employees or customers through channels such as walk-ups, self-service, phone calls, email, SMS, and live chat, or they can be detected by network monitoring software and automated system scanning. The service desk team usually receives the report and decides whether the issue is an incident or a request. This distinction matters because requests and incidents are categorized and handled in different ways.
A service request is a request from a user for something to be provided. These requests include creating a new account, changing a password, making hardware or software upgrades, or even requesting information. They are typically minor and less urgent than incidents (which include system-wide service outages, server breakdowns, etc.) and are handled with a predefined request fulfillment process, instead of with your incident management process.
Once the incident is identified, it is logged in the service desk. The incident log (called a ticket) should include information such as:
The more detailed this log is, the better. A detailed log improves your knowledge base, helps your problem management team analyze the root cause, and lets you streamline the resolution of similar incidents with templates and guidelines.
Categorization is a crucial step in every effective incident management process. It involves assigning a logical category and, where needed, a subcategory to the incident. This allows the service desk to analyze incidents and look for patterns, which could be instrumental in preventing future incidents.
Done well, categorization can streamline incident logging, reduce redundancy, and speed up resolution by making it quick to identify whether an incident is easily resolvable or requires escalation. Sometimes, categorization can even allow you to automatically prioritize incidents. For example, an incident in the "system outage" subcategory would automatically be of high priority.
Categorization uses multiple levels of classification and can get quite complicated, especially in large organizations. The actual categories are unique to the specific business, but ITIL has provided some guidelines to help simplify category assignments.
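The automatic prioritization described above can be sketched as a simple lookup. This is a minimal illustration, not a prescribed ITIL scheme: the category names, the priority labels, and the `default_priority` helper are all hypothetical, and real categories will differ for every business.

```python
from typing import Optional

# Hypothetical mapping from (category, subcategory) to a preset priority.
# Real categories are specific to each business; these are placeholders.
DEFAULT_PRIORITY = {
    ("infrastructure", "system outage"): "high",
    ("software", "application error"): "medium",
    ("hardware", "peripheral fault"): "low",
}


def default_priority(category: str, subcategory: str) -> Optional[str]:
    """Return the preset priority for a category pair.

    Returns None when no preset exists, meaning the service desk
    has to prioritize the incident manually.
    """
    key = (category.strip().lower(), subcategory.strip().lower())
    return DEFAULT_PRIORITY.get(key)
```

Returning `None` for unknown pairs, rather than guessing a default, keeps uncategorized incidents visible so that someone assigns a priority by hand.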
The next step is to assign a priority to the incident. Start by assessing how much impact the incident has on your business and how quickly it needs to be resolved. To do this, you need to consider the financial impact the incident will have on your business, the number of people who would be affected, and the security and compliance implications. Define your priority levels before the incident happens so that your service desk teams don't have to waste time on prioritization.
The priorities are typically as follows:
Because your help desk has limited resources, all open incidents must be addressed in order of priority. This ensures that your IT team focuses on what's crucial rather than spending time on low-level problems while major incidents wreak havoc on your customers or employees. Set clear service agreements around each priority level and communicate them to customers so that they know how quickly they can expect a resolution to their problem.
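One way to picture "in order of priority" is a priority queue fed by an impact and urgency matrix. The sketch below is illustrative only: the three-level scales, the matrix values, and the names `PRIORITY_MATRIX`, `enqueue`, and `next_incident` are assumptions, not levels defined by ITIL or by this article.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Assumed scales: impact and urgency run from 1 (most severe) to 3.
# The resulting priority 1 is handled first. Values are illustrative.
PRIORITY_MATRIX = {
    (1, 1): 1, (1, 2): 2, (1, 3): 3,
    (2, 1): 2, (2, 2): 3, (2, 3): 4,
    (3, 1): 3, (3, 2): 4, (3, 3): 5,
}

_sequence = count()  # tie-breaker: equal priorities stay first in, first out


@dataclass(order=True)
class QueuedIncident:
    priority: int
    sequence: int
    summary: str = field(compare=False)


def enqueue(queue: list, summary: str, impact: int, urgency: int) -> None:
    """Look up the priority and push the incident onto the heap."""
    priority = PRIORITY_MATRIX[(impact, urgency)]
    heapq.heappush(queue, QueuedIncident(priority, next(_sequence), summary))


def next_incident(queue: list) -> QueuedIncident:
    """Pop the open incident with the highest priority (lowest number)."""
    return heapq.heappop(queue)
```

For example, if a printer jam (impact 3, urgency 3) is queued before a payment outage (impact 1, urgency 1), `next_incident` still returns the payment outage first.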
The final step in the incident management process is incident response. This is broken down further into five parts:
The service desk employee tries to quickly diagnose the problem on a surface level so that it can be redirected to the relevant team. They ask the customer or employee who reported the incident some troubleshooting questions to get a general idea of the problem. Based on the answers, they form a quick hypothesis about the likely cause so that they can either fix it themselves or escalate it to the relevant team.
Predefined troubleshooting templates, knowledge bases, diagnostic manuals, and flowcharts can help streamline this process for the service desk team.
While most incidents should be resolved by service desk employees and never reach this step, some incidents are more difficult to resolve. In such situations, service desk agents escalate the incident so it can be resolved by advanced technicians or certified support staff. Depending on the situation, the incident will be forwarded either directly to the technical team or to upper management.
The aim here is to make the process smooth for the technical support staff by gathering and logging the right information in detail. This helps them get up to speed quickly and makes resolving the incident faster and more efficient.
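The escalation decision can be sketched as a small routing function. The rule used here (unresolved major incidents go to upper management, everything else unresolved goes to the technical team) is an assumption made for illustration; the article only says the destination depends on the situation. The `Ticket` fields and the `escalation_target` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    summary: str
    priority: int           # assumed scale: 1 = most severe
    resolved_by_desk: bool  # True if the service desk already fixed it


def escalation_target(ticket: Ticket, major_threshold: int = 1) -> str:
    """Decide where an incident goes after initial diagnosis.

    Assumed rule: incidents at or above the major threshold are
    escalated to upper management; other unresolved incidents go
    to the technical team.
    """
    if ticket.resolved_by_desk:
        return "none"
    if ticket.priority <= major_threshold:
        return "upper management"
    return "technical team"
```

Whatever rule you adopt, the ticket passed along should already carry the details gathered during initial diagnosis so the receiving team does not have to ask again.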
Both investigation into and diagnosis of the incident happen throughout the incident's lifecycle. But the focus of this step is the investigation that takes place after it's escalated. The support staff first try to confirm that the initial diagnosis is correct, then begin looking into the deeper causes (where necessary) and possible solutions to the incident. After the problem is diagnosed, your team can determine the appropriate steps to resolve the issue.
This step involves not only working on the solution (e.g., patching software, replacing hardware, changing software settings), but also notifying end users (employees or customers) and authorities (management, the security team, or in some cases, law enforcement) about the incident, any disruption of services, and when to expect a resolution.
Resolution involves taking the steps to fix the issue and restore the systems to normal functioning. Depending on the severity of the incident, the resolution may go deeper, investigating root causes and taking steps to ensure that it doesn't occur again. For example, if the incident was caused by malware, deleting the malicious files may not be enough; you may need to fully replace systems to ensure that the malware doesn't spread.
Recovery refers to the time it takes to fully restore normal services. Some fixes, such as bug patches, may require further testing after resolution to confirm that the issue is gone, while others may be quick. The expected recovery time must therefore be communicated to users and authorities so they know when they can start using the system again.
At the end of this stage, the service desk confirms that the service has been restored and documents all the details related to the incident as part of their incident reporting.
Once resolved, the ticket is passed back to the service desk to be closed. The service desk employees should ideally check with the person who reported the incident to confirm that the resolution is satisfactory before actually closing the incident.
Incident closure usually involves finalizing documentation and evaluating all the steps taken to respond to the incident. This evaluation helps teams come up with proactive measures to prevent recurrences, and it identifies areas of improvement to streamline future incident response. Sometimes, an incident report will be provided to administrative teams and board members.
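The lifecycle described in this article can be sketched as a small state machine. This is a simplified model under stated assumptions: the `Stage` names are paraphrased from the steps above, and the rule that a ticket cannot close until the reporter confirms the fix is stricter than the article's "should ideally check". The `IncidentLifecycle` class is hypothetical.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFIED = 1
    LOGGED = 2
    CATEGORIZED = 3
    PRIORITIZED = 4
    IN_RESPONSE = 5
    CLOSED = 6


class IncidentLifecycle:
    """Tracks one incident through the stages in order."""

    def __init__(self) -> None:
        self.stage = Stage.IDENTIFIED
        self.reporter_confirmed = False  # set once the reporter accepts the fix

    def can_advance(self) -> bool:
        """Closed tickets are final; closing needs reporter confirmation."""
        if self.stage is Stage.CLOSED:
            return False
        if self.stage is Stage.IN_RESPONSE:
            return self.reporter_confirmed
        return True

    def advance(self) -> Stage:
        """Move to the next stage, or raise if the move is not allowed."""
        if not self.can_advance():
            raise ValueError(f"cannot advance from {self.stage.name}")
        self.stage = Stage(self.stage.value + 1)
        return self.stage
```

Modeling the stages explicitly makes it harder to skip a step, such as closing a ticket that was never categorized.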
Depending on the severity of the incident, managing it can sometimes become complicated. It involves coordination between the service desk, technical support, and management, and regular communication with the user.
The various systems used to manage the steps — including a logging system, communication platform, and more — can also be overwhelming. A single platform solution to manage your whole incident management process can help.
Pulpstream's incident management platform automates incident management, allowing you to store data, create incident reports, and automate communications with any number of stakeholders.
Make your service desk employees' work easier with Pulpstream's incident management solution. Book a free demo today!