Keeping IT alive when disaster strikes

Johannesburg, 20 Jun 2011

From the earthquakes and tsunamis that pounded Japan to the floods in Australia, the headlines this year have provided many examples of just how fragile IT infrastructures are in the face of disaster. And it's not just natural catastrophes that companies need to be aware of, but human-caused calamities as well.

In Egypt, the government pulled the plug on the country's entire Internet infrastructure in an attempt to shut down protests organised through social media. One of the most significant and prolonged infrastructure outages this year - the total collapse of Sony's PlayStation Network online gaming service in April - was caused by hackers.

Even though the natural disasters and social unrest that have hit many other countries seem like remote threats to local businesses, board-level execs and IT managers in SA cannot afford to be complacent, warn business continuity experts. Yet many local companies lag international best practice in their business continuity planning.

Although a few large South African companies, notably the banks and mining houses, take business continuity seriously, many others regard it as a grudge purchase, says Dean Horner, CEO at 42 Consulting.

Many companies see business continuity as an insurance against things they believe will never happen to them. But the big lesson from world events this year is that disaster can strike any company at any time, no matter where it is located, Horner says.

South African companies tend to be relatively strong in the field of DR planning - the discipline of getting the technical infrastructure up and running again after some sort of disaster, says Glen Khan, an executive at the infrastructure division of UCS Solutions.

They are not as good at business continuity planning, which is all about addressing the risks that would prevent business processes and services from running smoothly throughout an outage. The approach is reactive rather than proactive, he adds.

One possible reason local organisations lag international best practice in business continuity is that they are not as heavily regulated as companies in many other countries. The London Stock Exchange, for example, will not allow a company to list without showing that it has a rigorously tested continuity plan, says Jorgen Nielsen, director at ContinuitySA. The same is not true of the Johannesburg Stock Exchange.

Frameworks such as King III are helping to drive awareness of ICT governance issues such as business continuity in the boardroom, but they don't have the force of law, Horner says.

“Frameworks such as King III are changing industry, but slowly,” Nielsen agrees. “Some industries such as banking are more heavily regulated than others.”

Fires, floods and killer bees

Another reason South African companies may be complacent is that they simply don't face as high a risk of earthquakes or extreme weather as do certain parts of North and South America or some Pacific Rim countries.

But the list of risks to business continuity local companies need to manage is long, warns Nielsen.

There are so many trucks on SA's roads transporting hazardous materials that it is just a matter of time before a major business centre needs to be evacuated because of a toxic spill, he adds.

The rising water table in Johannesburg is another concern - it's not inconceivable that underground power and telecom cables could be damaged by flooding. Gauteng even suffers from occasional tornadoes that could cause serious damage to critical infrastructure at some point in the future, Nielsen comments.

The reasons clients need to invoke ContinuitySA's services are usually more mundane, however. Failures of IT equipment, theft of electrical and telecom cables, and infrastructure failures are among the most common reasons for outages in SA, says Nielsen. Fires also cost South African companies billions of rand a year.

Building a truly resilient infrastructure geared up for business continuity in the face of these threats is not cheap. Tasima, the IT company responsible for running the National Traffic Information System (eNatis) on behalf of the Department of Transport, is one example of a local organisation that has sunk a substantial investment into keeping a mission-critical system running at all times.

eNatis processes hundreds of thousands of transactions a day, most of which impact on government service delivery. If eNatis were to fall over, traffic departments countrywide would come to a standstill, unable to issue drivers' and vehicle licences. External parties that link into the system would also be affected.

Banks would not be able to process vehicle financing applications while the police would be unable to check vehicle registrations at roadblocks to detect stolen cars. Unplanned downtime is not an option and planned maintenance or upgrades can only take place during quiet hours of the day, says Tebogo Mphuti, CEO of Tasima.

The high-availability system is configured to offer availability of more than 99.99% during business hours and more than 99.5% during all hours. Tasima runs a data centre in Midrand as well as a DR centre nearly 50km away from its main site. The DR site runs independently of the primary site, ready to take over if a disaster takes the data centre completely offline.
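To put those availability targets in perspective, they can be converted into a yearly downtime allowance. The sketch below is illustrative only: the definition of business hours (nine hours a day on 250 working days) is an assumption made for the example, not a figure from Tasima, and because the targets are stated as "more than", the results are upper limits.

```python
HOURS_PER_YEAR = 365 * 24  # 8 760 hours, ignoring leap years


def downtime_budget_hours(availability_pct, hours_in_scope):
    """Maximum downtime, in hours, that an availability target allows."""
    return hours_in_scope * (1 - availability_pct / 100)


# Assumption for illustration: 08:00-17:00 on 250 working days a year.
business_hours = 9 * 250

business_budget = downtime_budget_hours(99.99, business_hours)
all_hours_budget = downtime_budget_hours(99.5, HOURS_PER_YEAR)

print(f"Business hours: at most {business_budget * 60:.1f} minutes a year")
print(f"All hours: at most {all_hours_budget:.1f} hours a year")
```

Under these assumptions the business-hours target leaves room for less than a quarter of an hour of downtime a year, while the all-hours target allows just under two days.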

No point of failure

Every element of the system at the primary and DR site is designed for redundancy and resilience. The system architecture at each site comprises clustered application and data servers as well as a storage area network with a RAID configuration. If one server fails, another will simply shoulder its load.

Tasima runs dual network switches, network cables, and power distribution boards at each site. Diesel generators stand ready to kick in if there's a power outage, and the primary Telkom data link is backed up by Broadlink radio links and a Neotel connection.

Redundancy is built into each piece of hardware in the infrastructure as well. Servers, for example, all have dual network cards and power supplies as well as multiple memory cards and processors.

Storage arrays are configured with multiple disc drives. If a component in a server or storage system fails, the system should keep going until the component is fixed or replaced.

“There is no single point of failure in the data centre or the DR centre,” says Gert van Eeden, eNatis project director at Tasima.
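The effect of duplicating components can be estimated with basic reliability arithmetic. The sketch below is not Tasima's model. It assumes that redundant components fail independently of one another, and the availability figures in it are hypothetical values chosen for illustration.

```python
def parallel_availability(availabilities):
    """Availability of a redundant group that works if any member works.

    Assumes the members fail independently of one another.
    """
    all_fail = 1.0
    for a in availabilities:
        all_fail *= 1 - a
    return 1 - all_fail


def series_availability(availabilities):
    """Availability of a chain in which every element must work."""
    combined = 1.0
    for a in availabilities:
        combined *= a
    return combined


# Hypothetical figures: each element alone is available 99% of the time.
links = parallel_availability([0.99, 0.99, 0.99])   # three data links
switches = parallel_availability([0.99, 0.99])      # dual network switches
servers = parallel_availability([0.99, 0.99])       # two-node cluster

# The site works only if links, switches and servers all work.
site = series_availability([links, switches, servers])
print(f"Estimated site availability: {site:.6f}")
```

The independence assumption is the weak point of any such estimate. Two components that share a power feed or a cable duct can fail together, which is why a DR site on the same utilities grid as the production site adds less protection than the arithmetic suggests.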

The total cost of the offsite DR centre is probably less than 3% of the cost of the eNatis system - a small price to pay for the peace of mind it delivers, Van Eeden says.

The costs that companies sink into their DR infrastructure needn't be wasted, even if they never face a disaster. Organisations can also use these sites for quality assurance and testing to improve their return on investment, says George Sithole, pre-sales manager for Dimension Data's data centre solutions business unit.

Another piece of good news is that much of the technology companies need to drive their business continuity strategies is more affordable than ever before, according to Sithole.

Bandwidth is cheaper and more plentiful, making it more viable to mirror data between a production and a DR site. Tools such as virtualisation software also ease DR planning by automatically balancing and shifting computing workloads when a disc or server processor fails, Sithole adds.

Testing, testing, testing

The process of rolling out a business continuity plan starts out with a business impact assessment. Organisations must identify the critical outputs in every department in the business in a logical manner, says Nielsen. These will be areas of the business where a failure will have a major impact on the organisation's finances, operations and reputation.

From there, a company can determine what it needs to do to keep those processes running through an outage and how long it has to get those processes back up after a failure or a disaster. It needs to look not only at its own processes, but also at its dependencies on other companies in its supply chain.
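One way to record the results of a business impact assessment is as a simple list of processes that can be sorted into a recovery order. The sketch below is a minimal illustration, not a method prescribed by any of the experts quoted here. The process names, downtime limits and impact scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    max_tolerable_downtime_hours: float  # how long the business can cope
    impact: int                          # 1 (minor) to 5 (severe)
    external_dependencies: tuple = ()    # suppliers the process relies on


def recovery_order(processes):
    """Shortest tolerable downtime first; higher impact breaks ties."""
    return sorted(
        processes,
        key=lambda p: (p.max_tolerable_downtime_hours, -p.impact),
    )


# Hypothetical entries for illustration only.
assessment = [
    Process("Payroll", 72, 3),
    Process("Online payments", 1, 5, ("bank gateway",)),
    Process("Customer call centre", 4, 4, ("telecom provider",)),
    Process("Internal e-mail", 4, 2),
]

for process in recovery_order(assessment):
    print(process.name, process.max_tolerable_downtime_hours,
          process.external_dependencies)
```

Listing external dependencies next to each process makes it easier to see where a supplier's outage, rather than the company's own, would stop a critical process.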

The continuity plan needs to be formally documented and signed off at board level, Nielsen says. The plan should be a simple set of instructions that tells managers and employees what to do if there is a disaster. It should be updated at least once a year.

The business continuity plan should not only be about keeping the lights burning in the data centre, says Fred Mitchell, manager of the Symantec Storage and Security Division at Drive Control Corporation.

Companies need to consider issues such as where their employees will meet up and work if they can't access the office. Communications plans must be put in place for employees, customers and other stakeholders, Mitchell says.

Business continuity professionals also emphasise the importance of testing the continuity plan on a regular basis to make sure it will work when a genuine crisis takes place. Nielsen says that most continuity plans fail on their first test because it is so easy to overlook a vital process or system during the business impact assessment.

“Your business continuity plan is like a parachute,” he adds. “You don't need it often. But when you do, you really need it and it must work.”

The test exercises need to be realistic enough to inspire some of the same panic that a genuine crisis would, says Mitchell. One could, for example, switch a key server off on a quiet day to see how the business reacts. The performance of the continuity plan should be evaluated so it can be improved, he adds.

The high cost of business continuity in a world of shrinking IT budgets is one of the major challenges the industry faces. It is a major reason why companies turn to continuity service providers and data centre outsourcing firms for help with their business continuity planning.

“You get access to a range of specialised skills when you work with a continuity service provider,” says Nielsen. “Cost is a major driver as well. You share the costs of infrastructure such as data centre facilities and workspace with the service providers' other clients, bringing your total cost of ownership down.”

When choosing a service provider, companies should look for a partner with a depth of specialist technical skills in the continuity field as well as risk management consulting skills, says Khan of UCS.

According to 42 Consulting's Horner, organisations should also look for partners that understand the nuances of their industries.

“A cookie-cutter approach won't work for continuity and DR planning - your partner needs to understand your business,” he says.

Companies should ensure that their DR site - whether they run it themselves or outsource it to a service provider - is not located on the same utilities grid as their production data centre, says Khan. One of the major lessons from Japan's disasters is that a back-up site in the same building or around the block is not good enough when a real catastrophe strikes, says Nielsen.

Clouding continuity planning

When clouds of volcanic ash grounded flights across Europe in 2010, thousands of executives stranded in airports and hotels scattered across the world could keep working thanks to online services such as Dropbox and Gmail, and to corporate virtual desktop infrastructures (VDIs).

More recently, Amazon's Elastic Compute Cloud (EC2) data centre crashed, causing data losses and outages of several hours for Web sites that depended on the service, such as Evite, Quora, Reddit and Foursquare. These two incidents illustrate the potential opportunities and risks cloud computing brings to the world of business continuity.

On the one hand, it gives organisations access to highly resilient infrastructures that have been built at great cost by service providers with massive economies of scale. SMEs could use cloud services for cheap offsite back-ups. But on the other, it puts control of vital business infrastructure and applications in the hands of an external service provider.

“When there's a tsunami or a political uprising or a power outage or even just a traffic jam, people still need to work,” says Sean Wainer, country manager for Citrix Systems SA. “Solutions such as VDIs allow employees to remotely access their applications and data anywhere where there is a cellular or WiFi connection and keep working the same way as they would in their offices.”

One of the primary benefits of cloud computing is that it can make high-end DR and business continuity solutions affordable to companies that could never afford to install the software and hardware themselves, says Richard Vester, executive head of Hosted Services & Solutions, Vodacom Business Services.

Running a real-time data replication solution across two data centre sites, for example, is prohibitively expensive for all but the largest companies. Public cloud solutions can make this level of redundancy and data protection available to smaller organisations for their mission-critical apps, Vester says.

He adds that the recent Amazon cloud outage is unlikely to dent uptake of cloud computing. Larger organisations are likely to run private cloud infrastructures for their most mission-critical applications rather than entrusting them to an external service provider, he says.

Smaller companies will understand that an external service provider will probably be able to offer higher levels of uptime and resilience than they could achieve by building their own infrastructures.

Companies that host applications in the cloud must ensure that the cloud provider can prove it offers the same or better DR capabilities they would expect from their internal IT departments, says Vishal Mothie, technical specialist at Novell SA.

Grant Hodgkinson, business development director at Mimecast SA, says companies should not assume that all cloud providers are thorough about backing up data and hosting it in a redundant way. They should quiz their service providers carefully about their disaster contingency plans and their service level agreements, he adds.

* Article first published on brainstorm.itweb.co.za