How to avoid costly IT outages

Johannesburg, 29 Mar 2022

Most businesses these days rely heavily on their IT infrastructure. Business, and the related income, comes to a halt when a key application goes down or access to the internet is disrupted.

These outages can lead to millions of rands in losses and a decline in customer satisfaction.

Let’s look at how we can keep these outages from becoming expensive and costing us clients and our reputation.

First prize, of course, would be to avoid the downtime altogether, and the only way to do this is to be proactive. Some options are listed below:

One option is to proactively monitor performance metrics in the IT environment and spot when things are behaving abnormally (for instance, CPU usage is unusually high on one of the virtual machines). You can then investigate the cause of the abnormalities and resolve them before a crash happens. The best option is to have a machine learning (ML) system that learns what is normal for each time of day and day of the week, and alerts you if something needs investigation.
Another way to avoid downtime is to look at any events that occur regularly. If something happens every Tuesday at 10am, it is worth investigating and fixing it once and for all, so that it never happens again. So, a system that allows you to see seasonal (regularly occurring) events is useful for this.
Another proactive task is to look at which devices, servers or apps give the most problems. Often there are fewer than 10 that cause most of the problems. Taking time to properly fix or even replace the faulty component avoids downtime in the long run. Having access to historical data and seeing the list of the top 10 items giving problems can help identify where to focus your attention.
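
To make the three ideas above concrete, here is a minimal Python sketch. It is not how IBM Cloud Pak for Watson AIOps works internally; a simple mean and standard deviation per weekday-and-hour slot stands in for the ML system, and every function name, the threshold of three standard deviations and the minimum of three weeks are assumptions chosen purely for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean, stdev


def build_baseline(samples):
    """Learn what is 'normal' per (weekday, hour) from (timestamp, value) samples."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    # Keep only slots with enough history to estimate a spread.
    return {k: (mean(v), stdev(v)) for k, v in buckets.items() if len(v) >= 2}


def is_abnormal(baseline, ts, value, threshold=3.0):
    """True if value is more than `threshold` standard deviations from the slot's norm."""
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        return False  # no history for this slot, so nothing to compare against
    avg, spread = baseline[key]
    if spread == 0:
        return value != avg
    return abs(value - avg) / spread > threshold


def recurring_slots(event_times, min_weeks=3):
    """Return (weekday, hour) slots that saw events in at least `min_weeks` distinct weeks."""
    weeks = defaultdict(set)
    for ts in event_times:
        iso = ts.isocalendar()
        weeks[(ts.weekday(), ts.hour)].add((iso[0], iso[1]))
    return sorted(slot for slot, seen in weeks.items() if len(seen) >= min_weeks)


def top_offenders(events, n=10):
    """Return the n sources that raised the most events, as (source, count) pairs."""
    return Counter(source for source, _ in events).most_common(n)
```

With a few weeks of CPU readings loaded into build_baseline, is_abnormal flags a reading that is far outside the usual range for that hour on that day, recurring_slots surfaces the "every Tuesday at 10am" pattern, and top_offenders gives the short list of components worth fixing properly.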

All this is something IBM Cloud Pak for Watson AIOps can do for you. AIOps solutions not only help resolve issues in a reactive mode, but also help prevent issues from happening in the first place.

If the outage can’t be avoided, then we need to limit the duration of the downtime. Again, there are a few ways to do this:

Sometimes there are solutions that your subject matter experts (SMEs) know about and apply every time a specific fault occurs. This could be something like restarting a process, interface or server. This institutional knowledge can be automated. Automated solutions run quickly and reliably. Once the error is detected, the solution is kicked off immediately without human intervention.
Imagine being able to look at all the chats and information in your service desk related to a problem, to see what you did the last time it happened. This can be done automatically using artificial intelligence (AI), which can then suggest a recommended solution to you. This dramatically reduces the time to resolution and also makes use of institutional knowledge.
If a whole lot of events are caused by one root problem, then grouping the events and identifying the root cause saves a lot of time in resolving the issue. The best solution is to automate this using AI and data analytics, which use historical data to create the groups and recommend the root cause. Having the root (or probable) cause identified helps you to quickly look in the right place to solve the problem, and the other grouped events give you context that can aid in creating the solution.
Being notified of a problem in something like Microsoft Teams helps get the right people looking into it as soon as possible. In these days of working remotely, people spend a lot of time on Teams and will notice a message there before noticing an event on a scrolling event list or an error appearing on a dashboard.
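
Two of the ideas above, automating a known fix and grouping related events, can be sketched in a few lines of Python. This is an illustrative simplification, not the product's method: grouping by a fixed five-minute time window and picking the earliest event as the probable cause are naive stand-ins for the AI and data analytics described above, and the names RUNBOOK, restart_process and the event fields are hypothetical.

```python
from datetime import datetime, timedelta


def restart_process(event):
    # Placeholder action: a real one would call your own tooling here.
    return f"restarted {event['source']}"


# Hypothetical mapping from fault type to the fix your SMEs already apply by hand.
RUNBOOK = {"process_hung": restart_process}


def auto_remediate(event, runbook=RUNBOOK):
    """Run the known fix for this fault type, or return None if there is none."""
    action = runbook.get(event["type"])
    return action(event) if action else None


def group_events(events, window=timedelta(minutes=5)):
    """Group events that occur within `window` of the previous event."""
    groups = []
    for event in sorted(events, key=lambda e: e["time"]):
        if groups and event["time"] - groups[-1][-1]["time"] <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def probable_cause(group):
    """Naive heuristic: treat the earliest event in a group as the probable cause."""
    return min(group, key=lambda e: e["time"])
```

A fault whose type appears in the runbook is handled immediately without human intervention, while anything unknown returns None and is left for a person to investigate, with the grouped events supplying the context.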

All of the above is available to you in IBM’s Cloud Pak for Watson AIOps. Please contact me to find out more about this solution and how it can help you and your company avoid downtime.

https://envisage.solutions
