Logging is not just for lumberjacks


Johannesburg, 01 Sep 2015

Arguably the most important function in the ITIL incident process, and thus also in the major incident process, is information gathering. It primarily feeds the first three of the process steps listed below:

ITIL Incident Management

* Identification and logging;
* Classification and prioritisation;
* Investigation and diagnosis;
* Resolution and recovery; and
* Incident closure.

Information gathering is the most important function because information is the primary input. No matter how well an enterprise has perfected IT crisis management, and in particular the major incident process, or any process for that matter, if the inputs are wrong, the output is more than likely to be wrong too. As the saying goes, 'Garbage in, garbage out'.

So what type of information is needed at each of the steps? Let's take a look:

Identification and logging

As with typical incidents, the primary sources of information for identification and logging should be event monitoring systems, if these have been properly deployed, and customers who are experiencing the symptoms of the failure or outage related to the incident.

It is critical to understand exactly how important information gathering is at this first point of entry, the IT service desk, since this information will be used in classification and prioritisation, in the routing of the incident, and in investigation and diagnosis. Typically, IT organisations have scripts defined for their service desk agents to follow if a customer calls with a specific issue or request. The problem with a script is that it assumes that the interaction is not flawed. Blindly relying on a person's word as fact at this point can often lead to an incident being incorrectly classified. If an incident is not classified as major when it should be, resource time is unnecessarily spent on addressing a symptom rather than investigating a cause. This, of course, increases the total downtime and incurs an unnecessary resource cost.

Best practice at this point would be to use open questions to gather as much information as possible. This information should be clarified in such a way that it can feed classification and prioritisation as well as investigation and diagnosis. Kepner-Tregoe (KT) teaches an interesting questioning technique called Questioning to the Void, which involves asking open questions until you can no longer acquire a deeper or broader understanding from a particular information source.

Classification and prioritisation

An IT organisation needs to have agreed with the business on exactly what constitutes a major incident, which is usually an incident with severe negative business consequences. An enterprise should consider its critical business functions before making this determination. The definition will therefore differ from one enterprise to the next.

Once this has been determined, the information required to classify a major incident is the following:

* Is the affected service a critical business function?
* To what extent is the service affected, i.e. have we lost the total ability to perform this function, or is the service degraded?
* To what extent is this the case within the rest of the enterprise, i.e. how many customers are affected?

With the right tools, these questions can be answered, but this is not always the case. And although these are simple questions to ask on a service desk, the big problem is that people's understanding is most often limited to their own perception of what is important to them.
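As a rough illustration, the three classification questions above can be expressed as a small decision function. This is only a sketch under stated assumptions: the `IncidentReport` fields, the `CRITICAL_SERVICES` set, the 25% threshold, and the rule that a total outage of a critical service is always major are all hypothetical, and every enterprise would need to agree its own values with the business.

```python
from dataclasses import dataclass


@dataclass
class IncidentReport:
    service: str
    total_outage: bool       # True if the function is lost entirely, False if degraded
    customers_affected: int
    customers_total: int


# Hypothetical list of critical business functions agreed with the business.
CRITICAL_SERVICES = {"payments", "e-mail"}

# Hypothetical threshold: the share of customers affected at which a
# degraded critical service is treated as a major incident.
AFFECTED_SHARE_THRESHOLD = 0.25


def is_major_incident(report: IncidentReport) -> bool:
    """Apply the three classification questions in order."""
    # 1. Is the affected service a critical business function?
    if report.service not in CRITICAL_SERVICES:
        return False
    # 2. To what extent is the service affected?
    #    Assumption: a total outage of a critical service is always major.
    if report.total_outage:
        return True
    # 3. To what extent is this the case across the enterprise?
    if report.customers_total <= 0:
        return False  # no customer base recorded, so the share cannot be computed
    share = report.customers_affected / report.customers_total
    return share >= AFFECTED_SHARE_THRESHOLD
```

The function is only as good as the answers fed into it, which is why the quality of the information gathered at the service desk matters so much.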

It must be understood, however, that service desk agents deal with large numbers of calls and that, most of the time, they are there to do that and that alone. A very important part of the classification of major incidents, and one which is often missed, is done by an individual or a team of people responsible for watching call types. It is their function to monitor tickets for excessive incidents or those of the same type. When they spot this, for example a larger than normal volume of tickets for complaints about e-mail, it is their job to notify the Major Incident Manager that there may be a major incident. They are responsible for classifying major incidents where the regular mechanisms have missed them.
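The ticket-watching role described above could be partly supported by a simple volume check. The sketch below is an assumption, not a description of any particular service desk tool: the category names, the hourly baselines in `BASELINE_PER_HOUR`, and the `SPIKE_FACTOR` multiplier are all hypothetical.

```python
from collections import Counter

# Hypothetical baseline: typical number of tickets per category per hour.
BASELINE_PER_HOUR = {"e-mail": 4, "printing": 6, "vpn": 3}

# Hypothetical rule: flag a category once its volume reaches this
# multiple of its baseline.
SPIKE_FACTOR = 3


def find_spikes(ticket_categories, baseline=BASELINE_PER_HOUR, factor=SPIKE_FACTOR):
    """Return the categories whose ticket count in the last hour is unusually high."""
    counts = Counter(ticket_categories)
    spikes = {}
    for category, count in counts.items():
        expected = baseline.get(category)
        if expected is None:
            continue  # no baseline for this category, so nothing to compare against
        if count >= expected * factor:
            spikes[category] = count
    return spikes


# Categories of the tickets logged in the last hour (illustrative data).
last_hour = ["e-mail"] * 14 + ["printing"] * 5 + ["vpn"] * 2
print(find_spikes(last_hour))  # prints {'e-mail': 14}
```

A flagged category is only a prompt to notify the Major Incident Manager; the decision to declare a major incident still rests with people.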

Investigation and diagnosis

At this point, one should have information about the scale of the issue, i.e. how many people it has affected and how severely the service is impacted. This understanding can be further increased by finding out where the affected people are located geographically and when the issue was first seen.

A person's perception of when an issue first appeared is often misleading, especially when the service is degraded rather than totally unavailable. The best way of getting clear information is through the use of tools. Please understand, however, that if your tools are not set up with the actual perspective of customers in mind, they are essentially unproductive. In future articles we will discuss these aspects of tool usage.

What normally happens at this point is that every IT engineer runs off and checks their own equipment. They might run a test to verify that it is up and running, or even run health checks to see if everything is OK. It is at this point that information gathering stops and, most often, the major incident process breaks down.

Why is this so often the case? Well, checking devices is not troubleshooting. The fact that engineers are checking all of their devices means that they have not pinpointed the issue. One of the best models to use for troubleshooting is KT's Problem Analysis, whose primary purpose is to pinpoint where the issue resides. It requires the gathering of information around the location of the issue, including its timing and extent. It also requires determining the inverse: those devices, users, locations, etc. that are not affected by the issue. Comparing the two groups and finding their differences pinpoints the investigation area.
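The comparison of affected and unaffected items described above can be sketched in a few lines. This is a simplified illustration of the idea, not KT's own method or tooling: the function name, the attribute names, and the sample data are all hypothetical, and it only looks for single attributes that every affected item shares and no unaffected item has.

```python
def distinguishing_attributes(affected, unaffected):
    """Return attribute values shared by every affected item and by no unaffected item."""
    if not affected:
        return {}
    candidates = {}
    # Use the first affected item as the reference set of attribute values.
    for key, value in affected[0].items():
        shared_by_affected = all(item.get(key) == value for item in affected)
        absent_from_unaffected = all(item.get(key) != value for item in unaffected)
        if shared_by_affected and absent_from_unaffected:
            candidates[key] = value
    return candidates


# Illustrative data: users who report the issue and users who do not.
affected = [
    {"site": "Johannesburg", "switch": "sw-02", "os": "Windows"},
    {"site": "Johannesburg", "switch": "sw-02", "os": "macOS"},
]
unaffected = [
    {"site": "Johannesburg", "switch": "sw-01", "os": "Windows"},
    {"site": "Cape Town", "switch": "sw-07", "os": "macOS"},
]

print(distinguishing_attributes(affected, unaffected))  # prints {'switch': 'sw-02'}
```

In this example the site and operating system are ruled out because unaffected users share them, which narrows the investigation to a single switch instead of every device.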

The importance of logging

When it comes to major incidents, the automated logging of events and the resultant logging of processes and activities by the service desk and IT resources are of crucial importance. If this is overlooked, recovery and resolution of the incident take an undesirable length of time, which increases the negative business impact. IT resources therefore need to embrace logging, as it is an activity that is not just limited to lumberjacks!

Editorial contacts

Jonathan Hill
Dee Smith and Associates
info@deesmith.co.za