Effective planning, simplifying the environment can do much to decrease cost of downtime

Companies should ensure their operating systems are up to date and that all applications are simplified, says Pieter van der Merwe, availability solutions architect, Africa & Middle East at Stratus Technologies.


Johannesburg, 08 Jul 2014
Pieter van der Merwe

While IT managers dread system downtime, the harsh reality is that even the best plans and preparation cannot cover every circumstance. Pieter van der Merwe, availability solutions architect, Africa & Middle East at Stratus Technologies, says it is often the simplest oversights that quickly escalate into serious events that are difficult and costly to remedy.

"In many instances, downtime is the result of some type of human error. Here it is often the environment that is overlooked as opposed to under-investment in software or hardware availability solutions," he says.

Van der Merwe cites the example of the air-conditioning unit in the server room: something as simple as a backup for it often does not feature in business continuity plans.

"While protecting your organisation's hardware and software in every way possible is of utmost importance and cannot be over-emphasised, it is often the unrelated or seemingly insignificant environmental factors that come back to bite you.

"In the case of an air-conditioning unit in a server room, should it malfunction or shut down, this will inevitably cause your server to overheat and shut down, resulting in critical downtime that could have a material or reputational impact on an organisation," he comments.

In addition to environmental factors, Van der Merwe says complexity and a lack of understanding of the set-up as a whole can also contribute to downtime. "For instance, substantial investment may have been made in a fault-tolerant server, but a lack of planning or oversight, such as an illogical data centre layout, sees even the best-trained people making mistakes," he adds.

"Almost all servers today are 'dual corded', meaning they are connected to two different power supplies. In this scenario, it is possible for an electrician to mix up the power supply feeds, resulting in a situation where the power supplies connected to a server and communication devices are not synchronised, making your applications vulnerable to downtime," he explains.

In addition to complexity, Van der Merwe says bad planning is another contributing factor to downtime. "A good example here is the case of an organisation in Nigeria that was conducting routine maintenance when the diesel generator was mistakenly switched to manual mode instead of automatic.

"The problem was compounded by the fact that the next day was voting day and movement was restricted. As a result, staff were unable to get to the site to rectify the situation, and the UPS battery eventually ran flat, causing a power failure and outage," he comments.

To prevent downtime, Van der Merwe stresses the importance of simplifying the environment. "The more complex the environment, the longer it will take to rectify or recover from downtime. Here it is advisable to troubleshoot and determine core functionality and where implementing fault-tolerant solutions may be necessary. Ideally, organisations should look to implement active/active functionality, ensuring a high-availability cluster is put in place.

"In addition, organisations should ensure that their operating systems are up to date, that all applications are simplified and that multiple versions are not being run. Patches should also be applied as required to fix errors or vulnerabilities in software. Being consistent in your labelling and use of colour codes may seem obvious, but very often this is lacking, and confusion and mistakes start to creep in," he adds.

Van der Merwe says it is also important to remember that people and process work hand in hand. "It is critical to test any new system in a test environment prior to implementing it in production. This ensures that any instability which may impact performance is picked up and can be rectified. Here, not investing in rectifying the weakest link will inevitably cause the system to fail, resulting in downtime. In today's always-on world, this is a risk most organisations will not be prepared to take," he concludes.

Editorial contacts

Craig Atherfold
Change the Conversation
(+27) 11 100 2250
craig@changetc.co.za