Towards 100% uptime

By Raul Garbini, Sales director at Edgetec.

Johannesburg, 29 Oct 2008

In the previous Industry Insight in this series, I touched on the cost of system downtime, focusing on the IBM System i server because it has one of the highest levels of uptime in the industry.

However, even uptime of 99.95% is not high enough for some companies, and they need to move as close to 100% availability as is possible for commercial customers. If we accept that all systems will, on occasion, go down, there has to be a cascading approach to business availability and disaster recovery. At the most basic level, many companies believe tape backups are sufficient to protect them in the event of a system failure.

This approach embraces periodic saves of the entire system, daily incremental tape saves of altered or critical data, and then ensuring the tapes are stored safely off-site, in line with disaster recovery best practice.
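To make the tape routine concrete, here is a minimal sketch of working out which tapes a restore would need. It assumes each incremental save holds only the changes made since the previous save, so a restore needs the latest full save plus every incremental taken after it. The function name, the tuple layout and the dates are illustrative assumptions, not part of any backup product.

```python
from datetime import date

def restore_chain(saves, failure_day):
    """Return the saves needed for a restore, oldest first.

    `saves` is a list of (date, kind) tuples, where kind is "full" or
    "incremental". Hypothetical layout, used only for this illustration.
    """
    # Only saves taken on or before the day of the failure are usable.
    usable = sorted((s for s in saves if s[0] <= failure_day), key=lambda s: s[0])
    # Find the most recent full save among them.
    last_full = max((i for i, s in enumerate(usable) if s[1] == "full"), default=None)
    if last_full is None:
        raise ValueError("no full save available before the failure")
    # The latest full save, then every incremental that followed it.
    return usable[last_full:]

# Illustrative schedule: one full save, then two daily incrementals.
saves = [
    (date(2008, 10, 26), "full"),
    (date(2008, 10, 27), "incremental"),
    (date(2008, 10, 28), "incremental"),
]
chain = restore_chain(saves, date(2008, 10, 29))
```

Every tape in the chain has to be readable for the restore to succeed, which is why a single media error can be so costly.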

However, in the event of a failure, entire applications may need to be reloaded from tape, which can take up to two days. It may be necessary to manually recreate all transactions from the time the last good tape was saved, and finally, it is not at all unusual for tape to return media errors.

Clearly, such an approach is not sufficient for a company that cannot tolerate downtime, so those companies that have calculated the cost of downtime have added layers of protection to slash data recovery time to the absolute minimum. There are a number of options to reduce the recovery window, among them:

* Disk protection: Installing disk drives (DASD) that perform parity protection or disk mirroring to help reduce the chance of data loss in the event of a disk drive failure.
* Journaling: This is a process that efficiently records every change made to data and objects. In the event of a system failure, data can often be recreated without the need to manually re-key much of what was lost.
* Recovery services: Protected third-party recovery sites, available on a subscription basis, to which data changes made between tape saves are transmitted (data vaulting), and/or where backup tapes are restored on a comparably configured system, which acts as a backup system after the loss or failure of a production system.
* High availability: True high availability consists of designating a second machine as a backup system, enabling communication between this second machine and the production machine, then implementing programs that replicate all changes to critical objects on the production system to the backup system. If a failure or system maintenance event occurs, users are moved to this second, mirror-image machine, where they can resume business without the loss of data. In general, high availability provides the most efficient way to mitigate most planned and unplanned downtime events.
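The journaling option above can be illustrated with a minimal sketch. It assumes a journal is an ordered list of change entries and that recovery means restoring the last good save and then reapplying every entry recorded after it. The entry layout (sequence number, operation, key, value) and the account names are hypothetical and do not reflect the actual System i journal format.

```python
def replay_journal(restored, journal, last_saved_seq):
    """Reapply journal entries recorded after the last good save.

    `restored` is the data as it stood at the save; `journal` is a list of
    (seq, op, key, value) tuples. Both layouts are illustrative assumptions.
    """
    data = dict(restored)  # work on a copy of the restored data
    for seq, op, key, value in sorted(journal, key=lambda entry: entry[0]):
        if seq <= last_saved_seq:
            continue  # this change is already contained in the save
        if op == "put":
            data[key] = value
        elif op == "delete":
            data.pop(key, None)
    return data

# State restored from tape, taken when journal entry 2 had been written.
restored = {"acct-1": 100, "acct-2": 50}
journal = [
    (1, "put", "acct-1", 100),
    (2, "put", "acct-2", 50),
    (3, "put", "acct-1", 75),
    (4, "delete", "acct-2", None),
    (5, "put", "acct-3", 20),
]
recovered = replay_journal(restored, journal, last_saved_seq=2)
```

In this sketch, the three changes made after the save are recovered from the journal rather than being re-keyed by hand.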

High-availability software


The four standard components of any high-availability solution are:

* System-to-system communications;
* Data replication processes;
* System monitoring functions; and
* Role-swapping capabilities (moving from the production system to the backup system).

The first step of the high-availability process is to establish communications between production and backup machines. Typically, TCP/IP is the best way for two machines to communicate with each other, especially when moving large amounts of data between the machines.
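As a rough illustration of this first step, the sketch below checks whether one machine can open a TCP connection to another. It uses only Python's standard `socket` module; the function name is an assumption, and the host and port would be whatever the replication software actually listens on, which is not specified here.

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable: treat as not reachable.
        return False
```

A check of this kind confirms only that the two machines can talk to each other; it says nothing about whether the link is fast enough for replication.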

Setting up TCP/IP communications is simple, but the challenge lies in determining the bandwidth required between the two systems to handle the volume of data to be regularly replicated.

Two significant factors in deciding how much to spend on bandwidth are the volume of transactions that need to be sent to the backup site and how far from the production machine a company intends to locate the second machine; the further the machines are from each other, the more bandwidth is needed and the more it costs.
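A back-of-the-envelope calculation shows how the volume of changed data translates into a bandwidth figure. This is a minimal sketch that assumes changes are spread evenly over the replication window and adds a headroom factor for peaks and protocol overhead. The function name, the 1.5 headroom default and the example volumes are illustrative assumptions, not measurements.

```python
def required_mbps(changed_gb_per_day, replication_hours, headroom=1.5):
    """Estimate the link speed, in megabits per second, needed to replicate
    `changed_gb_per_day` gigabytes within `replication_hours` hours."""
    bits = changed_gb_per_day * 8 * 1000**3   # decimal gigabytes to bits
    seconds = replication_hours * 3600
    return bits / seconds / 1_000_000 * headroom

# Hypothetical workload: 20 GB of changes replicated over an 8-hour business day.
estimate = required_mbps(20, 8)
```

With these assumed figures the estimate comes to roughly 8.3 Mbps. Real workloads tend to be bursty, so peak transaction rates deserve as much attention as the daily average.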

Certainly, a company can have the second machine in the same computer room as the first and then directly connect the two machines; however, it may be decided to use a second machine located on another floor or in another building, which adds a significant disaster recovery advantage.

If a company buys a second machine just for the purpose of high availability, there is a real advantage in locating it in another building across town or across the country. By doing this, if a site disaster occurs in the building where the primary machine is located, the second machine will not be vulnerable (hopefully) to the same disaster.

* In the next Industry Insight in this series, I'll look in detail at data replication, system monitoring and role swapping, and how these are addressed by software.

* Raul Garbini is director of Edgetec.
