The great de-duplication debate

Should de-duplication occur before the data is sent to the backup target device, or at the device itself?

By John Hope-Bailie, technical director of Demand Data
Johannesburg, 02 Sept 2009

De-duplication is a method of reducing data storage needs by eliminating redundant data from a device. The technology is also known as 'intelligent compression' or 'single-instance storage'. Under the 'de-duplication' umbrella is a range of data-reducing options, allowing users to pick 'best-of-breed' solutions depending on their applications.
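To make the idea concrete, the sketch below (illustrative only, not any vendor's implementation) shows the basic mechanism most de-duplication products share: data is split into chunks, each chunk is fingerprinted, and only chunks with a previously unseen fingerprint are actually stored.

import hashlib

class DedupStore:
    """Toy single-instance store: each unique chunk is kept once and
    duplicates are replaced by a reference to the stored copy."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint -> chunk bytes, stored once

    def write(self, data):
        """Split data into fixed-size chunks, store only new ones and
        return the list of fingerprints needed to rebuild the data."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:  # redundant chunks are not stored again
                self.chunks[fp] = chunk
            recipe.append(fp)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from the stored unique chunks."""
        return b"".join(self.chunks[fp] for fp in recipe)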

Two key technology options are 'source-based de-duplication' and 'target-based de-duplication'. Essentially, the debate surrounding which option is best is linked to arguments about where the de-duplication process should occur. Should it be before the data is sent to the backup target device, or at the device itself?

Those in the 'source de-duplication' camp maintain that this option reduces network traffic and decreases the backup 'window' - the time needed for backup - because less data is transmitted.

Source-side de-duplication supporters say this option can aggregate the processing power of multiple client systems. They also maintain that, as the newer of the two technologies, it offers more benefits and is more compatible with modern backup and storage infrastructures.
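The bandwidth argument can be illustrated with a small sketch (a simplification built on assumptions, not a description of any particular product): the client fingerprints its own chunks and sends only those the backup target does not already hold, so duplicate data never crosses the network.

import hashlib

def source_side_send(data, target_index, chunk_size=4096):
    """Return the (fingerprint, payload) pairs a client would transmit;
    payload is None when the target already holds the chunk.
    target_index stands in for the backup target's fingerprint catalogue."""
    wire = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp in target_index:
            wire.append((fp, None))     # reference only, no payload sent
        else:
            wire.append((fp, chunk))    # new chunk, payload must be sent
            target_index.add(fp)
    return wire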

Target-side proponents, on the other hand, say de-duplication should be performed at the backup target, because all backups can then be more easily aggregated into a single storage system.

Target-side advocates point out that client systems are unaffected by this approach, which also integrates more easily into existing data storage environments, effectively hiding the de-duplication process behind industry-standard protocols.

Third option

From this background has emerged increasing support for a third opinion - that source de-duplication and target de-duplication technologies are complementary and that both are required in today's business environment.

Supporters of this position say it's no longer necessary to choose between the two technologies - it's possible to have the best of both worlds, from both business and technology perspectives.

In this light, it is becoming obvious that source-side de-duplication is able to offer some advantages that target-side de-duplication cannot. For example, it reduces the amount of bandwidth required at the source. In some applications, such as remote office backup, this is vitally important. In this situation, no amount of target de-duplication is able to save the time or the bandwidth that source de-duplication can.
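Some illustrative arithmetic makes the point; the link speed, data volume and duplicate ratio below are assumptions chosen only to show the scale of the saving.

backup_bytes = 200 * 10**9      # assumed 200GB nightly backup at a remote office
wan_bits_per_s = 10 * 10**6     # assumed 10Mbps WAN link
duplicate_fraction = 0.95       # assumed share of data the target already holds

hours_without = backup_bytes * 8 / wan_bits_per_s / 3600
hours_with = backup_bytes * (1 - duplicate_fraction) * 8 / wan_bits_per_s / 3600

print(f"Full transfer:              {hours_without:.0f} hours on the wire")
print(f"With source de-duplication: {hours_with:.1f} hours on the wire")

Under these assumptions, the nightly transfer drops from roughly 44 hours of WAN time to about two - time that no target-side scheme can recover, because the duplicate data would still have to cross the link.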

At the other end of the scale, target-side de-duplication is able to offer better performance than the source-side equivalent in terms of throughput to a backup device. This is because bandwidth requirements from the source are of little concern.

As a result, in scenarios where backups and restores are required at the high end of the Gigabit Ethernet performance curve, or at fibre channel speeds - which are expected to reach 16Gbps by 2011 - target-side de-duplication is an attractive option. Many industry commentators say it is the most flexible, highest performing and cost-effective way to get the benefits of de-duplication in the backup, storage and recovery process.

Role of de-duplication

With the increasing entrenchment of de-duplication in the business environment has come a refinement of its role within the organisation.

For instance, within target-side de-duplication solutions, there are those that excel in inline de-duplication. This is the 'de-duping' of data on the fly as it enters the backup or storage device.

There are other solutions best suited to post-process de-duplication - de-duping data in the background as an ongoing process after it has entered the backup or storage device.

Expanding the concept further are virtual tape libraries (VTLs) that perform de-duping processes on the backup data once it is stored within the device. In some cases, data is replicated offsite to another de-duping VTL, while in others it is written out to physical tape for an offsite copy (in non-de-duplicated format).

It's important to note that de-duplicating on the fly will slow down the backup data rate, but once the data has arrived, the other processes can start immediately. Post-processing will provide a maximum backup data rate, but the subsequent processes may need to be delayed until the background de-duplication process has been completed.
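The trade-off can be sketched as two tiny ingest routines (simplified assumptions, not vendor code): inline hashing pays its cost while the data arrives, whereas the post-process variant lands raw data at full speed and leaves the reduction - and anything that depends on the reduced copy, such as replication - for a later pass.

import hashlib

def inline_ingest(chunks):
    """De-duplicate as each chunk arrives: ingest is slower, but the store
    is already reduced the moment the backup completes."""
    store = {}
    for chunk in chunks:
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store

def post_process_ingest(chunks):
    """Land raw data first at full speed, then reduce it afterwards;
    replication of the reduced copy must wait for this second pass."""
    landing_area = list(chunks)                 # fast: no hashing during ingest
    store = {}
    for chunk in landing_area:                  # background step in a real system
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store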

In conclusion, today's end-users have an increasing range of options when selecting a de-duplication solution. They include inline de-duplication for those who want replication to begin as soon as possible, and post-process de-duplication for those who need the backup itself to complete as fast as possible.

From both business and technology points of view, the technologies are complementary and entirely consistent with the most time-proven, long-term backup, recovery and archive strategies.

* John Hope-Bailie is technical director of Demand Data.
