Tarsus explains cluster vs fault tolerance

By Louis Helmbold for Tarsus Technologies

Johannesburg, 17 Aug 2001

In today's world of globalisation, electronic commerce and mobile computing, company performance and competitiveness are increasingly dependent on the availability of information systems.

"Companies implement clusters without considering all options, and end up using these servers as standalone servers, while similar uptime can be maintained by managing their own risk," says Louis Helmbold, HP network and storage consultant at Tarsus Technologies.

A cluster is defined as a group of independent network server nodes that present themselves to a network as a single system.

"The goal of high available systems is to keep the amount of unplanned downtime to a minimal time frame," Helmbold points out.

He explains that a single point of failure (SPOF) is any hardware or software component whose failure would take the entire application or system down.

"Fault-tolerant systems typically have redundant hardware and/or software components that allow for the failure of any one component and the ability to automatically recover from that failure," Helmbold explains.

"Microsoft`s cluster offers 99.99 % availability, but a break of up to 20 seconds during a fail-over can cause information storage corruption in Exchange version 5. This could lead to a restore and prolonged downtime," Helmbold continues.

Helmbold says the Linux cluster software is freeware and offers 99.99% availability, though it is a beta release. Third-party hardware and duplicate machines are required.

"NT 4 and Marathon present 99.999% availability, which is ideal for Exchange 5.x, no break in service, and every transaction is duplicated with no loss of RAM contents. On the other hand, four duplicated machines and two duplicated data stores are needed. Additionally, it is third-party hardware and limited to NT4.

"Fault tolerance offers lower costs, but no guaranteed uptime, which means companies have to manage their own risk," Helmbold points out.

"When planning a system uptime and downtime has to be defined," he continues. "The maximum amount of downtime and the impact also have to be defined, and in addition, data restore time must be taken in consideration. Ultimately, companies have to decide on system fault tolerant or cluster solutions."

Helmbold briefly explains how to minimise downtime in a fault-tolerant system:

Hardware:

a. RAID/mirrored hard disks

b. Dual power supplies

c. ECC Memory

d. Dual network cards

e. UPS (the network must also be on UPS)

f. Plan network uptime

g. Have standby components, for example, motherboards on site

h. Load balancing via hardware (investigate)

Software:

a. Implement SNMP warnings (SMS/Pager)

b. HP Toptools

c. Load balancing via software (investigate)

Environment:

a. Cooling

b. Access Control

c. Stable power

Support:

a. Ensure reliable, qualified support (on site and off site)

b. On-site service contract (HP support packs)

"If high-availability is required, do a proper investigation and remember that the more uptime that is required, the higher the budget," Helmbold concludes.
