Expecting the unexpected

Johannesburg, 12 May 2020

The increasingly complex and sophisticated threat landscape is challenging businesses in unprecedented ways, leaving them battling to prepare for an ever-widening range of scenarios.

Sure, traditional threats such as natural disasters and even the recent coronavirus scourge remain top of mind, but business continuity (BC) and disaster recovery (DR) practitioners must now also plan for sophisticated cyber attacks, misinformation campaigns, data manipulation and many other events.

They need to think ahead, and get a grip on how they can rapidly notify their workforces in the event of an incident to lower operational and business risk as much as possible. Luckily, there is a slew of tools available today – some cloud-based, some on-premise – that businesses can look at to help streamline their BC and DR efforts. However, before an organisation can select a suitable solution, it needs a DR plan that prioritises its mission-critical data and applications.

Craigh Stuart, systems engineering manager at Nutanix SADC, says that when deciding on the best DR solution for a business, it is essential to understand what is required. “Firstly, DR needs to look after your data, namely provide a view of where data is destined for, and how much each line of business has. Knowing the nature of your data will help determine where it needs to be stored, as types of data vary, which means the costs to store them are also different.”

Secondly, he says, a business needs to understand the ‘demand’ placed on data: with DR you need to recover data, so robust SLAs must be in place no matter where the data resides. “Lastly, time to data recovery is critical, and in order to make an effective decision about where you want your data to be, you must have an understanding of how long it will take to recover it with minimal service disruption.”

The road to recovery

According to Andrew Cruise, MD of Routed, the first step in DR is identifying the recovery infrastructure. “Without a recovery environment there is no recovery. The second step is to record the workloads and applications, with dependencies, if necessary. This should then allow the organisation to group data and applications in terms of priority. Thirdly, the organisation must decide, with input from application owners, on the level of criticality of applications and assess the risk and impact of downtime for each group of applications: this will define the DR schedules. Finally, software needs to be chosen to execute the replication of workloads to the recovery environment.”

There are also certain steps needed to pin down recovery point objectives (RPO) and recovery time objectives (RTO) when the business has many different data sets of varying importance, all of which may have very different protection and recovery needs.

Stuart says that when IT teams take decisions on RPO and RTO requirements for their DR, the following criteria need to be clearly understood. “If DR is to be invoked, what workloads need to be running to sustain the business and enable it to respond to customers? Each application needs to have a clearly defined RTO to ensure governance SLAs are met, especially if the business is governed by regulators. Notably, RPOs are dependent on what SLAs are being adhered to. When this is understood, IT can determine whether block-based replication is required or, in the event that the application itself needs to replicate, the type and nature of the data are taken into consideration. If a client is using the public cloud domain for DR, RPO considerations around RTOs need to be investigated versus on-premise solutions, as latency and bandwidth play a major role in the time to recovery.”
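To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch in Python of how long it might take to pull data back from a remote recovery site. The function names, the decimal unit conversion and the 70% link-efficiency default are illustrative assumptions rather than figures from any of the vendors quoted, and real restores also depend on latency, protocol overhead and storage speed.

```python
def transfer_hours(data_gb: float, bandwidth_mbps: float,
                   efficiency: float = 0.7) -> float:
    """Estimate hours needed to move data_gb over a link of bandwidth_mbps.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually usable (hypothetical default of 0.7).
    """
    if data_gb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, bandwidth or efficiency")
    megabits = data_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


def meets_rto(data_gb: float, bandwidth_mbps: float, rto_hours: float,
              efficiency: float = 0.7) -> bool:
    """True if the estimated transfer time alone fits within the RTO."""
    return transfer_hours(data_gb, bandwidth_mbps, efficiency) <= rto_hours
```

Under these assumptions, a terabyte restored over a fully usable 1 Gbps link needs a little over two hours of transfer time, while the same restore over 100 Mbps takes most of a day, which is why the choice of link can decide whether a given RTO is achievable at all.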

“Applications need to be grouped and an assessment of risk and impact of downtime is required in order to ascertain the RTO and RPO for DR. Generally, organisations can fairly easily identify which applications need DR (rather than just backup – which, in essence, is an archive copy of data to be restored in production, not to recovery infrastructure), and often this is sufficient as it’s easier to apply one set of RPOs and RTOs across the board. If an organisation is going to separate groups of workloads for DR it’s wise not to overcomplicate by going too granular – two or maybe three alternatives for replication schedules should be sufficient,” adds Cruise.
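Cruise's advice to settle on two or three replication schedules can be sketched as a simple tiering exercise. The Python below is a minimal illustration only: the tier names, RPO and RTO values, cost thresholds and the `assign_tier` rule are all hypothetical, and in practice they would be agreed with application owners.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    rpo_minutes: int  # maximum tolerable data loss
    rto_minutes: int  # maximum tolerable time to recover


# Hypothetical tiers: three schedules, no finer-grained than that.
TIERS = {
    "critical": Tier("critical", rpo_minutes=15, rto_minutes=60),
    "standard": Tier("standard", rpo_minutes=240, rto_minutes=480),
    "deferred": Tier("deferred", rpo_minutes=1440, rto_minutes=2880),
}


def assign_tier(hourly_downtime_cost: float, customer_facing: bool) -> Tier:
    """Pick a tier from two assumed inputs supplied by application owners."""
    if customer_facing or hourly_downtime_cost >= 50_000:
        return TIERS["critical"]
    if hourly_downtime_cost >= 5_000:
        return TIERS["standard"]
    return TIERS["deferred"]


def group_by_tier(apps: Dict[str, Tuple[float, bool]]) -> Dict[str, List[str]]:
    """Group application names under the tier each one is assigned to."""
    groups: Dict[str, List[str]] = {name: [] for name in TIERS}
    for app, (cost, customer_facing) in apps.items():
        groups[assign_tier(cost, customer_facing).name].append(app)
    return {name: sorted(members) for name, members in groups.items()}
```

Keeping the rule this coarse is deliberate: every application lands in one of a handful of schedules, which is easier to operate and test than a bespoke RPO and RTO per workload.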

For Marius Burger, CIO at Seacom, the biggest challenge lies in understanding your data landscape and building the governance to maintain it, establishing and keeping visibility and transparency of data across the organisation. “Frameworks should then be established for data classification and the calculation of risk per system or data source within the organisation. For determining RTOs, the business needs to understand what an outage costs the organisation over time; understand the importance of each system in the landscape; understand mitigation and recovery requirements; and understand recovery cost and benefit. When it comes to determining RPOs, the business needs to understand the maximum amount of data loss the organisation can tolerate, and understand the recovery cost benefit, including the cost of data lost.”
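The arithmetic behind Burger's point can be sketched in a few lines of Python. This is a simplified illustration, not a method attributed to anyone quoted here: it assumes outage cost grows linearly with downtime and that worst-case data loss under periodic replication is roughly one replication interval, and all function names are hypothetical.

```python
def outage_cost(hours_down: float, cost_per_hour: float) -> float:
    """Cost of an outage, assuming cost accrues linearly over time."""
    return hours_down * cost_per_hour


def max_rto_hours(loss_budget: float, cost_per_hour: float) -> float:
    """Longest downtime that stays within the loss the business will accept."""
    if cost_per_hour <= 0:
        raise ValueError("cost_per_hour must be positive")
    return loss_budget / cost_per_hour


def meets_rpo(replication_interval_minutes: float, rpo_minutes: float) -> bool:
    """Assumed worst case: the newest copy is one full interval old."""
    return replication_interval_minutes <= rpo_minutes
```

For example, a system that costs 25,000 per hour of downtime, in a business willing to absorb 100,000 per incident, points to an RTO of no more than four hours; a 15-minute replication schedule satisfies a one-hour RPO, whereas a two-hourly one does not.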

Home or away?

On whether cloud or on-premise solutions are more suitable for BC and DR efforts, Stuart says an on-premise DR solution takes care of the data without the need for additional governance, as you know where your data is. “It just requires a physical infrastructure to host your data, and that is usually at a cost you are aware of and can control.”

Conversely, Stuart says cloud or public DR requires no additional physical infrastructure and is built on a consumption model where you buy what you need to instantiate workloads when required. “The costs here also vary drastically based on where the data is going, how much of it needs to be transported or snapshotted, and where it needs to be stored. Leaping into a cloud DR solution isn’t always as easy as the brochure says it is, as you need to know whether where it’s stored adheres to associated governance requirements. Technically, your teams also need to be cognisant of the latency and performance that go into ‘getting your data back up’ in the event of a failover.”

Using on-premise infrastructure for DR allows the organisation to retain a high level of ownership and control, says Cruise. “This could be important because of administrative reasons such as compliance or security, or for technical reasons, including granular access to the back-end of the infrastructure, which may be required. With cloud there is always a suspicion, sometimes baseless, that infrastructure is less secure and less compliant than on-premise; and cloud DR products aren’t as flexible as on-premise, as back-end access is usually not permitted.”

On the other hand, Cruise says there is a greater risk in on-premise DR, as many organisations trickle down hardware to be used in recovery environments, and this older hardware doesn’t perform as well and isn’t as reliable as newer production hardware. DR simply has to work; there is no point in having a recovery environment that is not fit for purpose. “The flipside of this is if the recovery environment is similar in performance and age to production, it will be very expensive in terms of both CAPEX (paying double) and OPEX (keeping it powered on, updated and in good working order) – and inefficiently utilised. There is also the usual risk of disaster (fire, flood, theft and others) to the recovery environment. Cloud DR is typically procured on a usage basis and therefore is more cost effective, and less risky.”

The benefits of on-premise DR are many, adds Burger. It can function independently of breakout connectivity and, in terms of performance, the proximity of the DR host to primary sources and the LAN generally offers throughput and latency benefits. If the appropriate compliance metrics have been met, regulatory matters such as data sovereignty are generally less of a concern. “With cloud, businesses reap the benefits of pay-as-you-grow or pay-as-you-use cost savings. Moreover, it isn’t necessary to host and maintain additional infrastructure, further reducing cost and effort. Cloud providers that offer DRaaS or BCaaS as one of their primary functions are likely to be better versed on the matter, and can assist in informing data and DR/BC strategies.”

The next questions are: where does data discovery fit in? And how can it be used to help an organisation understand and regain control over its data? Cruise says that data discovery tools built into replication software can be used to identify all workloads in the environment that may need to be replicated and failed over elsewhere. This can be useful as a spot check in case anything has been missed.

Secondly, backup and replication software providers have been moving towards treating all backed up or replicated data as a broad ‘secondary data’ pool (together with old, stale, or archived data), which can be accessed and interrogated. There is a strong element of discovery of data on the network in order to identify all secondary data for this purpose.

Getting a clear picture

Data discovery for BI enablement, while not necessarily pertaining directly to DR matters, also strives to achieve visibility and transparency of data across the organisation, says Burger. “Many of the factors that drive DR RPO metrics, such as loss, retention and suchlike, are considered when defining data discovery strategies. Thus, data discovery strategies should be considered in close relationship with those of DR, and defined data discovery metrics can often inform the DR planning process.”

Artificial intelligence (AI) and machine learning (ML) are also playing a role in today’s DR and BC. “These tools are becoming increasingly relevant when determining how a business executes on a DR strategy,” explains Stuart. “AI has the potential to examine specific types of data with the ability to run risk and business impact analysis to help businesses determine the right strategy to execute DR plans. ML is then fed this information to help construct recovery strategies on a consistent basis. This new approach allows the business to continuously evolve its DR plans and outcomes in a more predictable fashion. Ultimately, the majority of data that exists in business today is unstructured and is dotted across various platforms. To perform effective DR you need a tool that files, automatically analyses, and categorises data, so that the business can access patterns, anomalies, and any sudden changes that could put the data at risk.”

The possibilities are endless when it comes to AI and ML, says Burger. “In the case of their place when it comes to DR and BC planning and management, things are no different, and they can be used to tackle various facets of the problem landscape.

“They can be used for predicting potential outcomes to ensure potential threats are not missed from a planning perspective, and ML with AI through predictive analytics could be used to determine in advance when systems are likely to fail, allowing preventative measures to kick in to avoid downtime completely. In addition, AI can monitor causes of downtime, which, fed into ML, can provide insight into typical causes of downtime for better preventative planning. Finally, AI can potentially be used for automation of recovery itself, improving incident response timing and reducing recovery turnaround.”

Cruise believes there may be a place for AI and ML in automatically backing up or replicating applications as they are provisioned, but that’s not where we are yet. “However, these tools are being used in conjunction with the ‘secondary data’ concept – this data can be interrogated and used to extract pertinent information, rather than accessing and subjecting live production data to same, which may affect performance.”

* Article first published on brainstorm.itweb.co.za