The first one, obviously, was walking into my office at eight o’clock in the morning on Wednesday, and being told there was a telephone call saying that there was an incident at Three Mile Island, and that it had shut down and that beyond that we didn’t know.
Definition
Incident Management is the process that aims to restore normal service operation as quickly as possible so as to minimise the impact on business services. An Incident Manager manages the process, maintains records of incidents and events, and provides management information. Incident Management has a close relationship to Problem Management, and in small organisations it is not uncommon for one person to fill both roles. However, the problem management process can conflict with incident management, for example where the problem manager wants logs that may be lost if the service is restored immediately.
An incident is an "unplanned interruption to an IT service, or a reduction in the quality of an IT service". The overall focus is therefore on maintaining the quality of service, including maintaining high availability of those services.
It can also be the failure of a configuration item that has not yet impacted the service but will. The definition is therefore, in effect, any event that has disrupted or could disrupt a service. Incidents can be reported by users of the service, or detected by monitoring tools or technical staff. The service desk is the
The response to an incident will depend to some extent on its pre-determined priority. A disruption to a tier one system impacting many users is likely to have a service level requirement that means significant resources will be put towards restoring service; other incidents may be less critical, and the response will be scaled accordingly. Service Level Management therefore has an impact on the nature of the response, and organisations without service level management tend to overreact, making all incidents a high priority.
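As a rough illustration of scaling the response to priority, here is a minimal sketch in Python. It assumes priority is derived from impact and urgency using a simple additive matrix, which is a common convention but not one prescribed here. The names `Level`, `priority`, `restore_target` and the target times are all hypothetical; real targets would come from your own service level agreements.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-point scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Return a priority from 1 (most critical) to 5 (least critical).

    Assumed convention: add the two ratings and subtract one, so
    HIGH/HIGH gives 1 and LOW/LOW gives 5.
    """
    return int(impact) + int(urgency) - 1


# Hypothetical restoration targets in minutes, keyed by priority.
# These numbers are placeholders, not recommendations.
RESTORE_TARGET_MINUTES = {1: 60, 2: 240, 3: 480, 4: 1440, 5: 2880}


def restore_target(impact: Level, urgency: Level) -> int:
    """Look up the agreed restoration target for an incident."""
    return RESTORE_TARGET_MINUTES[priority(impact, urgency)]


if __name__ == "__main__":
    # A tier one outage hitting many users: high impact, high urgency.
    print(priority(Level.HIGH, Level.HIGH), restore_target(Level.HIGH, Level.HIGH))
    # A minor fault with a workaround: low impact, medium urgency.
    print(priority(Level.LOW, Level.MEDIUM), restore_target(Level.LOW, Level.MEDIUM))
```

The point of the table is that it is agreed before the incident. Without something like it, every incident tends to be treated as priority 1.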
Incident Management is the process aimed at restoring service: getting the right people to work out what has gone wrong and restore service as quickly as possible. But it is also largely to do with communication: what you tell to whom, and when.
Whilst you can't plan for every specific circumstance, there is a lot that can be organised about incident response and communications before the incident happens.
Some of the key activities are
- Incident Detection and recording
- Classification (Assess Impact/Urgency, Match to known problems/errors)
- Initial Diagnosis and Escalation
- Investigation and diagnosis
- Resolution and recovery
- Closure (work around or solution)
- Communication (customers, managers, technical teams)
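The activities above can be sketched as a simple lifecycle, with communication running alongside every step rather than being a step of its own. This is only an illustration in Python: the `Stage` and `Incident` names are hypothetical, and a real tool would allow loops (for example, back from investigation to escalation) that this strictly linear sketch leaves out.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages, in the order listed above."""
    DETECTED = "detection and recording"
    CLASSIFIED = "classification"
    INITIAL_DIAGNOSIS = "initial diagnosis and escalation"
    INVESTIGATION = "investigation and diagnosis"
    RESOLVED = "resolution and recovery"
    CLOSED = "closure"


# Enum members iterate in definition order, which gives the sequence.
ORDER = list(Stage)


class Incident:
    """Hypothetical incident record that logs an update at every step."""

    def __init__(self, summary: str) -> None:
        self.summary = summary
        self.stage = Stage.DETECTED
        self.updates = []  # communication log: (stage, note) pairs

    def advance(self, note: str) -> Stage:
        """Move to the next stage and record what was communicated."""
        if self.stage is Stage.CLOSED:
            raise ValueError("incident is already closed")
        self.stage = ORDER[ORDER.index(self.stage) + 1]
        self.updates.append((self.stage, note))
        return self.stage


if __name__ == "__main__":
    incident = Incident("Main system unavailable")
    incident.advance("Impact and urgency assessed; no matching known error")
    incident.advance("Escalated to second line")
    for stage, note in incident.updates:
        print(f"{stage.value}: {note}")
```

Forcing a note on every transition is a deliberate choice in this sketch: it makes the communication log a by-product of the work rather than something extra to remember.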
Terminology
The Business perception
I think it is fair to say that no business anywhere wants to have an incident, ever, and if you do have one then you want the service restored in under five minutes. Whilst a well-run business will have activities that people can do while a service is down, in a computer age it is increasingly likely that if your main system isn't working you'll have a massive loss of productivity.
Now everyone who ever went to fix a quick problem for a friend on a PC, only to find hours later that it wasn't as easy as it first looked, knows that it can be difficult to know what you're dealing with at the beginning. So given that you can't always predict the speed of the restore, and occasionally you will have your own version of Three Mile Island, a really important part of major incident management is good communication. Often messages are given to senior management or the business, but not to the wider IT technical group, who may just have the answer to the current technical issue if they only knew about it.
No matter what the SLA or agreement on service, you often see a business manager jumping up and down during an incident because that shows they are serious about service. This doesn't help, and is even worse if an escalation culture exists, as every senior manager ends up responding to the incident, ironically making the communication channel confusing and leading to instructions coming to the technical people from all over the place. You know you don't handle incidents well when you have a number of managers standing next to the poor technical guys asking questions about what they are doing. As one technical guy said to me after one incident, "I'd have sorted the incident ages ago if I hadn't had all the top brass asking for ten-second updates."
The way to avoid this is to have an Incident Manager in control of communications, well-understood processes for communicating to senior managers, technical staff and the business, and some planning around when technical folk get to work on the incident and when they have to report on what they are doing. War rooms are a popular way of finding out what is happening, but even they can tie up technical teams in talking about it rather than doing it.
Good planning on the process helps immensely, and a good, experienced Incident Manager can navigate the difficulties of the parts you couldn't plan for. The Service Desk has an important role in the communications, both as the source of information from users and as the point of communication to the various parties involved.
Being the problem manager in a large organisation can present its, err . . . own problems.
The IT perception
The Process
Other Information
I would rate the Kepner-Tregoe problem solving technique as one of the best I have come across. The beauty of KT is that the techniques are useful for dealing with technical issues, management issues, or basically any problem you want to resolve. I’ve been on a number of courses over the years; on many you pick up a few good points here and there, while others change your perception of what you thought before you went. Kepner-Tregoe is one of the latter.
