IT at the speed of business, part three

Service operations management shows the true SLA of a service and the actual impact on business.

By Andrea Lodolo, CTO at CA Southern Africa.
Johannesburg, 23 Jan 2013

In my first two Industry Insights, I outlined how portfolio management is key to keeping the most important projects in focus and under control, thereby maximising business ROI, and introduced service virtualisation. The latter is a relatively new concept, but one that will become more entrenched in the minds of IT and business alike as it is used extensively to deliver robust enterprise applications more quickly and at lower cost.

In this third Industry Insight, I introduce the concept of service operations management. How does one align the services IT provides to the business while measuring their availability against the expected results? In other words, how does one measure whether service level agreements (SLAs) are being met?

Let's take this one step at a time: understanding the service levels expected or desired is probably a good place to start. Business, or at least the consumers of a business service delivered by IT, will have certain expectations. The service could be a single application or a combination of applications, and the business refers to it as a service because that is what is delivered to internal or external customers and users.

Examples could include a banking application, an ERP application, a help desk or a call centre system. What are the expected response times for this application? What is an acceptable minimum? This is what needs to be determined and accepted by the business as the measurement. Alternatively, the actual response times can be measured and, if deemed acceptable, used as a baseline. It then becomes a matter of ensuring the application (or service) performs within the accepted parameters.

The five nines

So now it is understood what has to be achieved for each service on a consistent basis. However, the traditional methods IT has used to define service levels, and to determine whether systems are available or not, have been problematic. Let me explain: IT coined the phrase 'five nines', meaning 99.999% uptime. Achieving this almost impossible figure would mean a particular component was down for no more than five-and-a-quarter minutes in a year. There are variants of the five nines, such as four nines (almost an hour of downtime a year) or even three nines (almost nine hours). None of these, taken in isolation, appears too bad. Having a database down for an hour in a whole year doesn't sound bad at all, if seen in isolation from the rest of the environment.
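To put those figures in perspective, here is a minimal sketch in Python that converts an availability percentage into the downtime it allows per year; the helper name and the three example figures are purely illustrative.

```python
# Convert an availability percentage into the maximum downtime it permits per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # about 525,960 minutes

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Downtime per year (in minutes) permitted by a given availability figure."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for label, availability in [("five nines", 99.999),
                            ("four nines", 99.99),
                            ("three nines", 99.9)]:
    minutes = allowed_downtime_minutes(availability)
    print(f"{label} ({availability}%): {minutes:.1f} minutes/year (~{minutes / 60:.1f} hours)")
```

Running this reproduces the figures above: roughly five minutes a year for five nines, just under an hour for four nines, and close to nine hours for three nines.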
Due to the scale and size of enterprise IT environments, the different components of IT are generally segmented into areas such as networks, databases, operating systems, middleware and so on. In some cases the segmentation is even more granular, and each of these areas is trying to achieve its own predefined SLA.

But what if each area can show it meets its SLA over the course of the measurement period? Wouldn't that mean the business service is always available, or always meeting its SLA? In almost all cases, probably not. The reason is simply that each service comprises a combination of all of the above, and most probably spans more than one company's IT environment. So, to determine the real SLA of a service, the downtime or failed SLAs of every component that affects the service must be accumulated. The accumulated result has a much greater impact on the service than is initially apparent, and it is usually hugely detrimental.
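As a rough illustration of how this accumulation works, assume (purely for the sake of the sketch) that a service depends on five components, each meeting a three-nines SLA, and that they fail independently of one another. The end-to-end availability is then the product of the component availabilities, which works out to roughly 99.5%, or almost 44 hours of potential downtime a year, even though every silo met its own SLA. The component names and figures below are assumptions, not measurements.

```python
# Rough sketch: end-to-end availability of a service that needs every component,
# assuming (simplistically) that the components fail independently of one another.
HOURS_PER_YEAR = 365.25 * 24

components = {
    "network": 99.9,            # each figure is an assumed SLA in percent
    "database": 99.9,
    "operating system": 99.9,
    "middleware": 99.9,
    "application": 99.9,
}

service_availability = 100.0
for availability in components.values():
    service_availability *= availability / 100

downtime_hours = HOURS_PER_YEAR * (1 - service_availability / 100)
print(f"Service availability: {service_availability:.2f}%")
print(f"Potential downtime: {downtime_hours:.1f} hours/year")
```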

Wasn't me

So now, not only are there poor or non-performing services, but determining where the problem really lies can be a frustrating and time-consuming exercise. The explanation is quite simple: each silo of responsibility, such as networks, databases, systems and so on, believes it is achieving its respective SLA. So when a service has a problem, the norm is for each of these silos to report that its area is clean and performing as expected. The finger-pointing starts, each one blaming the other because each can prove it is in the clear, yet the service is still having issues.

To fix this, the entire supporting structure of a service must be measured, with all the components considered together. Traditionally, the components have been measured and monitored at a low level, ie, system components, database health, network health and performance, and so on. Now all of these must be put together to define which pieces make up a service, or could affect one.

Once this is clear, it becomes possible to understand how each of these components will impact the service in any given situation, and what effect a failed or merely ailing component will have. In this way, the risk factors to the service can be determined, as well as how drastic the remediation needs to be.

This form of IT management is referred to as service operations management, and of course, one needs appropriate tools that can bring all of these individual metrics and measurements together in real time, so they can be rolled up to determine and indicate the health of each service. This method eliminates finger-pointing and shows the true SLA of a service and its actual impact on the business.
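The rollup itself can be thought of along the lines of the following sketch, which maps each business service to the components it depends on and derives the service's health from its worst supporting component. The service names, component names and statuses are entirely hypothetical; real service operations tooling is far richer, but the principle is the same.

```python
# Minimal sketch of a service-health rollup: map each business service to the
# components it depends on, then derive the service's state from its worst component.
from enum import IntEnum

class Health(IntEnum):
    OK = 0
    DEGRADED = 1
    DOWN = 2

# Which low-level components each service depends on (an assumed, simplified model).
service_model = {
    "internet banking": ["core network", "customer database", "web middleware"],
    "call centre": ["core network", "crm application", "telephony system"],
}

# Latest status reported by each silo's monitoring (illustrative values only).
component_status = {
    "core network": Health.OK,
    "customer database": Health.DEGRADED,
    "web middleware": Health.OK,
    "crm application": Health.OK,
    "telephony system": Health.DOWN,
}

def service_health(service: str) -> Health:
    """A service is only as healthy as its worst supporting component."""
    return max(component_status[c] for c in service_model[service])

for service in service_model:
    print(f"{service}: {service_health(service).name}")
```

In this toy model, every silo except telephony could claim to be "green", yet the rollup immediately shows that internet banking is degraded and the call centre is down.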

In the next part of this Industry Insight series, I will go into more detail on this concept and drill down into what IT must put in place, such as a CMDB and mapped service dependencies, to ensure service operations management is effective.