Operational resilience is essential for business continuity

Johannesburg, 20 Nov 2007

Christelle Larkins, Area Manager for South Africa, East Africa & Indian Ocean at MGE Office Protection Systems, says business continuity plans must increasingly rely on secure power systems with adequate operational resilience.

As the relentless advance of technology continues, it has become vital for us to create an arena in which we can protect these technological advancements. Our reliance on the 24/7/365 availability of IT means the success of any disaster recovery centre is measured in milliseconds. The key to disaster recovery and business continuity today is operational resilience.

In many companies, disaster recovery centres rely on backup centres for their IT business continuity. The key to disaster recovery is, however, not just redundancy; the resilience of the system or the networks is critical.

Resilience must be incorporated into the technology at the design stage to ensure continuity, even during extended power surges, which may be caused either by power supply inconsistencies or by latent design defects in the electrical and mechanical support systems of the data centre.

We are all aware of our country's vulnerability in terms of blackouts and energy supply shortages. There can be different reasons for a blackout - human error, a shortage of electricity, and so on. Since resilience in uninterruptible power supplies (UPS) is at the heart of any data centre, their design has become critical.

A few steps are essential in ensuring the resilience of any IT system.

Improving MTBF, MTTR

In principle, electricity utilities can achieve 99.9% availability. Percentage availability is calculated from the mean time to repair (MTTR) and the mean time between failures (MTBF), ie, availability = (1 - MTTR/MTBF) x 100. MTBF can be increased through enhanced equipment reliability and by using products that are independently certified by recognised bodies, eg, TUV, KEMA and Veritas.
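The availability formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the MTTR and MTBF figures are hypothetical, chosen to show how halving repair time or doubling the time between failures moves the result.

```python
def availability_percent(mttr_hours: float, mtbf_hours: float) -> float:
    """Availability as given in the article: (1 - MTTR/MTBF) x 100."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    if mttr_hours < 0:
        raise ValueError("MTTR must not be negative")
    return (1.0 - mttr_hours / mtbf_hours) * 100.0


# Hypothetical figures, used only to illustrate the formula.
baseline = availability_percent(mttr_hours=8.0, mtbf_hours=8000.0)
faster_repair = availability_percent(mttr_hours=4.0, mtbf_hours=8000.0)
more_reliable = availability_percent(mttr_hours=8.0, mtbf_hours=16000.0)

print(f"baseline:       {baseline:.3f}%")
print(f"halved MTTR:    {faster_repair:.3f}%")
print(f"doubled MTBF:   {more_reliable:.3f}%")
```

Under this formula, halving MTTR and doubling MTBF have the same effect on availability, which is why both levers are worth pursuing.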

It is important to make sure that both the equipment and the installation as a whole are fault tolerant. In view of this, it is imperative to carry out system integrity (SI) testing that includes all the key components of the data centre prior to handover to the client.

During the SI testing, one needs to simulate faults at various levels within the electrical circuits and the mechanical items within the data centre. This type of simulated testing can prove the strength of the system design, equipment and the installation. SI testing additionally brings any issues related to non-compatibility between different packages or equipment to the fore.

MTTR can be reduced by using remote online diagnostics and getting skilled experts to carry out repairs within a short space of time. A comprehensive range of spares needs to be available on a 24/7 basis. Normal utility availability of 99.9% can result in about nine hours of blackout a year, or several short blackouts and some brown-outs. As the utility companies are not required to guarantee continuity, organisations have to protect themselves against these power losses.
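The nine-hour figure follows directly from the availability percentage. A minimal sketch, assuming a 365-day year; the function and constant names are illustrative only:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours per year without supply at a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


utility_downtime = annual_downtime_hours(99.9)
print(f"99.9% availability: about {utility_downtime:.2f} hours a year")
```

At 99.9% this gives roughly 8.76 hours a year, which the article rounds to nine.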

In time-critical industries such as finance, institutions cannot accept this poor level of availability, so it is essential to have data centres backed up with UPS systems and standby generators to protect their core IT infrastructure. It has been shown that the cost of a one-hour financial trading system failure in Europe can reach up to six million euros (Source: IBID).

Protecting the system with UPS

In order to achieve a high level of resilience with a UPS system, units with dual conversion design must be used. In this way, the critical IT load is protected against any power quality issues at the input of the UPS, whether voltage or frequency related. It is important to note that some rotary UPSes, and a small percentage of static UPSes, may not be of dual conversion design.

Static type UPSes utilise battery banks to ensure adequate backup time to provide cover during a mains loss or poor quality from the utility supply. This allows enough time for the standby generators to fire up and support the UPSes.

For large data centres it is worth using 10-year design life batteries. It is also necessary to install a battery monitoring system that is based on impedance-check technology. However, it is very likely that in a few years' time, battery banks will be replaced with fuel cell technology.

Battery autonomy can be based on requirements set by the client, but 10 to 15 minutes is the common figure found in this type of industry. Standby generation is also essential, both for long outages and to support non-essential loads such as air conditioning and lighting.

It is good practice to have n+1 redundancy even at the standby generation level, whereas it is critical for UPSes to have an n+n redundant design. (If n is the minimum requirement to support the critical load, for example, 2 x 500kVA UPSes, then n+1 redundancy would mean 3 x 500kVA UPSes and n+n would mean 4 x 500kVA UPSes.) It is important to utilise an external centralised static bypass (CSB) for each parallel redundant UPS system.
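The unit counts in the example above can be reproduced with a short calculation. This sketch assumes identical units and a hypothetical 1,000kVA critical load; the function name and scheme labels are illustrative, not taken from any vendor tool.

```python
import math


def units_required(load_kva: float, unit_kva: float, scheme: str) -> int:
    """Number of identical units needed for a given redundancy scheme."""
    if load_kva <= 0 or unit_kva <= 0:
        raise ValueError("load and unit rating must be positive")
    n = math.ceil(load_kva / unit_kva)  # minimum units to carry the load
    if scheme == "n":
        return n
    if scheme == "n+1":
        return n + 1  # one spare unit
    if scheme == "n+n":
        return 2 * n  # a fully duplicated set
    raise ValueError(f"unknown redundancy scheme: {scheme}")


# Assumed critical load of 1,000kVA served by 500kVA units, so n = 2.
for scheme in ("n", "n+1", "n+n"):
    count = units_required(1000, 500, scheme)
    print(f"{scheme}: {count} x 500kVA")
```

With these inputs the counts are 2, 3 and 4 units, matching the figures in the text.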

A CSB provides a very high degree of resilience when compared to simple modular parallel UPS systems. The reason is that the CSB static switch is rated for the full system load, while a modular UPS system depends on static switches that are rated only for the individual module, ie, typically only 20% of the load.

With the growth of blade servers, it is advisable to have generator sets that are highly compatible with the leading power factors imposed by these servers. This will provide added resilience in the event of the entire UPS system going into bypass mode during a mains loss situation. Since most IT loads generate harmonics, it is good practice to limit propagation of such pollution by using active harmonic filters.

This helps to reduce the size of the neutral conductor, ie, it saves copper costs, and eliminates fire risk and nuisance tripping of circuit breakers. UPSes need to be provided with suitable active harmonic filters to ensure the current distortion (THDI) level is held at 5% or lower, regardless of loading on the UPS system. This would help to meet the anti-pollution recommendation G5/4.
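For readers who want to check a measured set of harmonic currents against the 5% target, the sketch below uses the standard definition of THDI: the RMS sum of the harmonic currents divided by the fundamental current. The definition is not spelled out in this article, and the current values and names are hypothetical.

```python
import math

THDI_LIMIT_PERCENT = 5.0  # target level quoted in the text


def thdi_percent(fundamental_amps: float, harmonic_amps: list[float]) -> float:
    """Total harmonic current distortion relative to the fundamental."""
    if fundamental_amps <= 0:
        raise ValueError("fundamental current must be positive")
    harmonic_rms = math.sqrt(sum(i * i for i in harmonic_amps))
    return 100.0 * harmonic_rms / fundamental_amps


# Hypothetical RMS readings in amps: a 100 A fundamental plus two harmonics.
filtered = thdi_percent(100.0, [3.0, 4.0])
unfiltered = thdi_percent(100.0, [30.0, 40.0])

print(f"filtered:   {filtered:.1f}% within limit: {filtered <= THDI_LIMIT_PERCENT}")
print(f"unfiltered: {unfiltered:.1f}% within limit: {unfiltered <= THDI_LIMIT_PERCENT}")
```

A real assessment would take the harmonic currents from a power quality analyser across the full loading range, since the text requires the limit to hold regardless of loading.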

In order to limit the damage caused by a faulty source, static load transfer switches (STS) need to be used at power distribution unit (PDU) level so that any fault is limited to that part of the circuit and system resilience is not affected. These STS units act extremely fast (2 - 10 millisecond switching time) and can hence switch the critical load from one source to the other without jeopardising functionality of the servers.

Checking redundancy, resilience

Once the design is checked for any dormant or hidden points of failure, it is good practice to carry out factory witness tests for each individual item, ie, UPS systems, switch gear, standby generator sets and so on.

Even during factory witness testing it is good practice to simulate short circuits on the load side to help measure the fault tolerance level of the UPS system and its components, and to evaluate the 100% load step performance of the UPS system and standby generators.

It is recommended that suitable critical component monitoring systems, such as UPS battery bank monitoring, are deployed, as this type of monitoring will help the facilities management team to take the proactive steps necessary to avoid an internal disaster in a disaster recovery centre.

In closing, there is no substitute for planned and regular maintenance, which should also include thermal imaging of critical components, for example, UPS, batteries during discharge, PDUs and switch gear.
