SnapMirror: The director's cut

Johannesburg, 04 Jun 2007

The famous story about Abe Lincoln writing the Gettysburg Address on the back of an envelope while riding a train is just that: a story. He had actually written the speech painstakingly days before, and used the envelope only to help him focus on the key points, says Stephen Manley, technical director of data protection at NetApp.

NetApp SnapMirror demonstrates how years of hard work can change an industry.

When you think of SnapMirror, three things should come to mind: low cost, simplicity, and reliability. SnapMirror enables disaster recovery for lower tiers of data with options for simple, low-cost mirrors.

It offers unique configuration choices to minimise new hardware purchases and optimise existing primary storage. SnapMirror copies can become part of daily administrator and business activity, while improving the reliability of disaster recovery (DR).

SnapMirror's low-cost, simple, reliable mirroring delivers access to your data, anytime, anywhere, for any use.

SnapMirror for disaster recovery

Lower the cost of DR and protect more data. SnapMirror is designed to be inexpensive, simple, and reliable, so that customers can deploy mirroring for all of their business-critical data.

Let's face it. Disasters don't strike only the most critical, most protected data, and they often reveal the importance of (now unavailable) lower-tier applications and data. Just as IT organisations tier storage, some now offer mirroring tiers to meet varying DR SLAs.

The choice should not be either "mirror over an expensive network to an expensive system" or "use tape". With asynchronous mirroring to ATA storage, companies can abandon their "load the tape and pray" DR plans.

Over five years ago, SnapMirror pioneered mirroring from FC-based primary storage systems to ATA NearStore systems. Today, our largest customers are rapidly expanding their ATA mirror infrastructure. Furthermore, periodic (for example, hourly or nightly) asynchronous mirrors significantly reduce network traffic: because most new data is either repeatedly modified or rapidly removed, only the state of the data at each update has to cross the network, not every intermediate write.

Because SnapMirror is designed to transfer only "net-new" data, most customers see a reduction in network load of 25% to 90%. Lower-cost DR means that you can now afford to protect more of your data. Can you afford not to?
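The effect of sending only net-new data can be sketched in a few lines of Python. This is a deliberately simplified model, not SnapMirror's actual algorithm: a volume image is a hypothetical map from block number to contents, and an update sends only the blocks whose contents differ from the previous image (plus a list of freed blocks), so intermediate overwrites and short-lived temporary data never cross the network.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical model: a point-in-time image maps block number -> block contents.
Image = Dict[int, bytes]
# A write of None means the block was freed (for example, a file was deleted).
Write = Tuple[int, Optional[bytes]]


def apply_writes(base: Image, writes: List[Write]) -> Image:
    """Replay client activity on top of `base` and return the new image."""
    image = dict(base)
    for blk, data in writes:
        if data is None:
            image.pop(blk, None)
        else:
            image[blk] = data
    return image


def net_new_blocks(base: Image, current: Image) -> Image:
    """Blocks that are new or changed since `base`; only these are sent."""
    return {blk: data for blk, data in current.items() if base.get(blk) != data}


def freed_blocks(base: Image, current: Image) -> Set[int]:
    """Blocks present in `base` but gone from `current` (sent as a list only)."""
    return set(base) - set(current)


base = {0: b"A", 1: b"B", 2: b"C"}
# Between two updates, block 1 is overwritten three times and a temporary
# block 3 is written and then deleted.
writes: List[Write] = [(1, b"B1"), (3, b"tmp"), (1, b"B2"), (3, None), (1, b"B3")]
current = apply_writes(base, writes)

bytes_written = sum(len(data) for _, data in writes if data is not None)
bytes_sent = sum(len(data) for data in net_new_blocks(base, current).values())
print(bytes_written, bytes_sent)  # 9 2
```

In this toy trace, 9 bytes were written but only 2 need to be sent; real savings depend entirely on how much of the workload is overwritten or deleted between updates.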

For DR mirrors to expand and protect new tiers of data, mirroring must be simplified. Sadly, managing mirrors today is such a high-stress job that many admins "draw straws" to pick who will suffer. Like snapshots, though, mirroring should be painless; there are no moving parts, no impossible backup windows, and no processes to manage.

To ensure its simplicity, NetApp built SnapMirror on the foundation of Snapshot copies. To set up SnapMirror, an admin initialises the mirror (over the network or from tape), sets a cron-like schedule, and... that's it. In two steps, he or she can protect applications and files, SAN and NAS, remote offices and the data centre.
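The scheduling step can be illustrated with a small matcher that decides whether an update is due. This sketch assumes a four-field minute, hour, day-of-month, day-of-week layout in the style of cron and supports only `*`, single numbers and comma lists; the function names are hypothetical, and this is not Data ONTAP's own schedule parser.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Match one schedule field: '*', a single number, or a comma list."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def update_due(schedule: str, when: datetime) -> bool:
    """True if a 'minute hour day-of-month day-of-week' schedule fires at `when`.

    Simplification: all four fields must match. Classic cron ORs the two
    day fields when both are restricted; that case is not handled here.
    """
    minute, hour, day_of_month, day_of_week = schedule.split()
    cron_weekday = when.isoweekday() % 7  # cron convention: Sunday = 0
    return (field_matches(minute, when.minute)
            and field_matches(hour, when.hour)
            and field_matches(day_of_month, when.day)
            and field_matches(day_of_week, cron_weekday))


weeknights = "0 23 * 1,2,3,4,5"  # 23:00, Monday to Friday
print(update_due(weeknights, datetime(2007, 6, 4, 23, 0)))  # Monday: True
print(update_due(weeknights, datetime(2007, 6, 9, 23, 0)))  # Saturday: False
```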

DR should always be so easy that one administrator can manage petabytes of mirrors. To further simplify SnapMirror management, NetApp recently introduced Protection Manager. You can learn more about it in "Three Common Backup and Replication Challenges Solved", which includes a demo of the product in action.

Of course, for the new, simple tiered DR solutions to have value, mirrors must reliably protect against both physical and logical disasters. Unfortunately, the rule has historically been that "mirrors mirror everything, including corruption". If a rogue process deletes half your files, that data will also be destroyed on your mirror. Because application and user disasters occur more frequently than physical outages, traditional mirrors leave companies exposed. SnapMirror builds on NetApp Snapshot technology to close this gap.

SnapMirror protects the current data while also preserving the historical Snapshot copies, so administrators can recover from both physical and logical disasters. Your mirror should protect against all disasters, otherwise it's doing only half its job.
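The difference between a mirror that keeps only the latest image and one that also carries point-in-time history can be shown with a toy model. The class below is hypothetical and heavily simplified: it copies whole file sets on each update, whereas real Snapshot copies share unchanged blocks.

```python
import copy
from typing import Dict, List

Files = Dict[str, str]


class ToyMirror:
    """Destination copy that optionally keeps a point-in-time copy per update."""

    def __init__(self, keep_history: bool) -> None:
        self.keep_history = keep_history
        self.active: Files = {}
        self.history: List[Files] = []

    def update(self, source: Files) -> None:
        self.active = copy.deepcopy(source)  # mirrors everything, damage included
        if self.keep_history:
            # Real Snapshot copies share unchanged blocks; copying is for clarity.
            self.history.append(copy.deepcopy(source))

    def can_recover(self, name: str) -> bool:
        return any(name in image for image in [self.active, *self.history])


source = {"a": "1", "b": "2", "c": "3", "d": "4"}
plain, with_snapshots = ToyMirror(False), ToyMirror(True)
for mirror in (plain, with_snapshots):
    mirror.update(source)

del source["c"], source["d"]  # a rogue process deletes half the files
for mirror in (plain, with_snapshots):
    mirror.update(source)

print(plain.can_recover("c"), with_snapshots.can_recover("c"))  # False True
```

Both mirrors faithfully replicate the deletion; only the one that also holds earlier point-in-time copies can bring the files back.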

One of my favourite customers once said: "I'm glad you like talking about SnapMirror, but I already bought it. Just help me set it up." In his honour, some practical Q&A is included at the end of this article.

Questions for this section include:

* How do I decide whether to mirror to ATA or FC?
* I think my network is my SnapMirror bottleneck. How can I be sure?
* SnapMirror can't meet its schedule because of the network. Can I change my SnapMirror configuration rather than buying a faster pipe or a compression device?
* How can I optimise my SnapMirror performance on ATA?

SnapMirror for end-to-end data protection

If you can't trust your mirror, what can you trust? By focusing on cost, simplicity and reliability, SnapMirror improves the performance of primary storage systems while simplifying end-to-end data protection.

Not only can mirrors be cheaper to deploy and manage, but they should also reduce the overall cost of tape backups. With explosive data growth and 24x7 operations, administrators now have to run backups during business hours.

To compensate for the constant backup load, companies overspend on overprovisioned servers and storage. Meanwhile, they maintain an idle mirror as an insurance policy. Why shouldn't the mirror handle the backup, so the primary storage can serve applications and users? Because SnapMirror replicates the active data and Snapshot copies, administrators can back up any application-consistent Snapshot copy while continuing to update the mirror. To reduce cost, tape backups should be run on the mirrors, leaving primary servers and storage for applications and users.

Because power and cooling challenges seem to be growing as quickly as data protection struggles, mirrors should not require extra, dedicated hardware. IT departments consolidate to reduce complexity and power consumption. Therefore, mirrors should not double (or worse) the hardware footprint.

SnapMirror uses the NetApp Flex family of technologies to minimise its footprint. Firstly, a customer can use SnapMirror to replicate multiple source systems to one destination system. Secondly, SnapMirror can replicate System A's data to System B, while also replicating System B's data to System A.

To minimise the impact of SnapMirror, administrators can prioritise user operations over SnapMirror and other system activities using NetApp FlexShare software, a standard feature of the NetApp operating system starting with Data ONTAP 7.2.

To improve performance, administrators frequently mix primary FlexVol volumes and mirrored FlexVol volumes on the same aggregate. Without a dedicated "mirror" aggregate, primary data spreads across all disk drives. More I/O resources are available to primary applications, while SnapMirror runs in the background protecting the data. Mirrors should simplify and optimise your environment, not sell more hardware.

Of course, performance means nothing if, when a disaster strikes, your mirrors are not instantly ready to serve data. Even if a disaster occurs while a mirror is in mid-transfer, the mirror must be available for recovery. Mirrors must also remain consistent while recovering. We've all heard campfire horror stories about an admin incorrectly executing step 23 on page four of the DR plan, "And... thwack! The mirror became worthless and it was time to look for a new job".

A mirror must always reliably handle a disaster. SnapMirror uses Snapshot copies to always maintain consistency. Regardless of network status, SnapMirror can fall back to the last Snapshot copy for instant DR.
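The "fall back to the last Snapshot copy" behaviour can be modelled as follows. This is a toy sketch, not how Data ONTAP implements it internally: incoming blocks accumulate in a hypothetical in-progress image, and only a fully received update becomes a new consistent image, so an interrupted transfer never exposes half-written data.

```python
from typing import Dict, List, Optional

Blocks = Dict[int, bytes]


class ToyDestination:
    """Readers only ever see the most recent *complete* point-in-time image."""

    def __init__(self) -> None:
        self.images: List[Blocks] = [{}]        # last entry is served for DR
        self._incoming: Optional[Blocks] = None

    def begin_update(self) -> None:
        self._incoming = dict(self.images[-1])  # start from the last good image

    def receive(self, blk: int, data: bytes) -> None:
        if self._incoming is None:
            raise RuntimeError("no update in progress")
        self._incoming[blk] = data

    def commit(self) -> None:
        # Only a fully received update becomes a new consistent image.
        if self._incoming is None:
            raise RuntimeError("no update in progress")
        self.images.append(self._incoming)
        self._incoming = None

    def serve(self) -> Blocks:
        # Partially received data is never exposed, whatever the link state.
        return self.images[-1]


dst = ToyDestination()
dst.begin_update()
dst.receive(0, b"v1")
dst.receive(1, b"v1")
dst.commit()

dst.begin_update()
dst.receive(0, b"v2")   # disaster: the link drops before block 1 arrives
print(dst.serve())      # {0: b'v1', 1: b'v1'}
```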

SnapMirror DR itself consists of a handful of commands that cannot compromise data integrity. Mirrors must be reliable - if you can't trust your mirror, what can you trust?

Again, I've answered some common questions in the Q&A including:

* Speaking of backup... where does SnapVault fit into your DR strategy?
* Sometimes my CPU hits 100% when I run SnapMirror. Is that bad?
* You talk about disk I/O a lot. Why are you so obsessed?

SnapMirror - beyond DR

Put your mirrors to work. Mirrors are evolving from expensive DR insurance policies to cost-effective core business tools that deliver DR and more. A single SnapMirror copy can improve DR reliability, optimise test and development, and simplify day-to-day administrator activities.

To reduce the cost of DR, don't let your mirrors idly await a disaster. Put them to work. Offloading tape backup to the mirror is just the beginning; mirrors can meet other challenges.

Administrators cannot easily test application upgrades, evaluate new database schemas, or run data mining because they fear extra load or data corruption on their primary storage. "Don't touch it, you'll break it." Because mirrors are separate data copies with cycles to spare, they can help solve these problems.

SnapMirror uses Snapshot copies and can be used with NetApp FlexClone technology to safely unify test and development with data protection. To mine data, clients can access the read-only SnapMirror destination and its Snapshot copies. To run tests, administrators can instantly create a FlexClone copy of their SnapMirror destination: a separate, space-efficient, read-write copy. Either way, the mirror maintains complete DR readiness. With SnapMirror, secondary storage delivers primary business value.
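The idea behind a space-efficient, read-write clone of a read-only mirror can be sketched as copy-on-write. The class below is a hypothetical illustration of the principle, not the FlexClone implementation: the clone shares every block with the parent image and stores only the blocks written to it, so the mirror itself is never modified.

```python
from types import MappingProxyType
from typing import Dict, Mapping


class ToyClone:
    """Copy-on-write view over a read-only point-in-time image."""

    def __init__(self, parent: Mapping[int, bytes]) -> None:
        self._parent = parent              # shared with the mirror, never modified
        self._own: Dict[int, bytes] = {}   # blocks written to the clone only

    def read(self, blk: int) -> bytes:
        if blk in self._own:
            return self._own[blk]
        return self._parent[blk]

    def write(self, blk: int, data: bytes) -> None:
        self._own[blk] = data              # the parent image stays untouched

    def private_blocks(self) -> int:
        return len(self._own)


# Read-only stand-in for a Snapshot copy on the SnapMirror destination.
mirror_image = MappingProxyType({0: b"schema-v1", 1: b"rows", 2: b"index"})

test_db = ToyClone(mirror_image)
print(test_db.private_blocks())            # 0: creating the clone copies nothing
test_db.write(0, b"schema-v2")             # try a new schema on the clone
print(test_db.read(0), mirror_image[0])    # b'schema-v2' b'schema-v1'
print(test_db.private_blocks())            # 1
```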

Not only should mirrors deliver business value every day, they should simplify administrators' everyday lives. Data migration has become common in today's dynamic storage environments: applications need more space or performance, data must move to newer arrays, and users or projects change sites. If a mirror can truly handle DR, data migration should be easy.

SnapMirror simplifies data migration because it reduces storage downtime to minutes. Firstly, the IT team sets up a SnapMirror relationship to the migration target. Then they disable client access to the old array, update the mirror a final time, split the mirror, and direct clients to the new array. Simple. Efficient. Fast.
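The four migration steps can be walked through with a toy model. The `ToyArray` class and `mirror_update` helper are hypothetical stand-ins, not NetApp commands; the point is that clients are locked out only between step 2 and step 4, which covers a single incremental update.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyArray:
    name: str
    data: Dict[str, str] = field(default_factory=dict)
    accepts_writes: bool = True

    def client_write(self, key: str, value: str) -> None:
        if not self.accepts_writes:
            raise PermissionError(f"{self.name} is not accepting client writes")
        self.data[key] = value


def mirror_update(src: ToyArray, dst: ToyArray) -> None:
    # Toy transfer: a real mirror would send only changed blocks.
    dst.data = dict(src.data)


old = ToyArray("old", {"f1": "v1"})
new = ToyArray("new", accepts_writes=False)  # a mirror destination is read-only

mirror_update(old, new)           # 1. set up the mirror (baseline transfer)
old.client_write("f2", "v1")      #    clients keep working during the baseline
old.accepts_writes = False        # 2. disable client access: downtime starts
mirror_update(old, new)           # 3. final update catches the late writes
new.accepts_writes = True         # 4. split the mirror...
new.client_write("f3", "v1")      #    ...and direct clients to the new array
print(sorted(new.data))           # ['f1', 'f2', 'f3']
```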

Customers initially worry about using their mirrors for more than DR, but constant use actually improves DR success. Data recovery typically fails for one of two reasons: either the data was protected incorrectly, or the recovery process was run incorrectly.

Using your mirrors for test and development or data migration helps eliminate those failures. Firstly, there is no better validation of data correctness than having clients or applications use it. It's like running a fire drill every time. Secondly, there is no better way to make the IT team comfortable with using the mirrors than to simply use them. SnapMirror customers recover from disasters quickly and reliably because they trust the data and themselves.

NetApp didn't build SnapMirror four score and seven years ago, but over the past decade SnapMirror has protected more tiers of data than any other product. It was not created in a moment of genius, but emerged from years of dedication to three design principles: low cost, simplicity, and reliability. SnapMirror was built to ensure that business-critical data shall always be available anytime, anywhere, for any use.

Questions addressed in the Q&A include:

* I'm moving to VMware - how do you work with them?
* After I migrate my data, how do I reconfigure my mirrors?
* Why is NetApp so excited about FlexClone? Doesn't everybody support clones?


SnapMirror Q&A

1. SnapMirror for disaster recovery

Lower the cost of DR and protect more data. Because SnapMirror is designed to transfer only "net-new" data, most customers see a reduction in network load of 25% to 90%. Lower-cost DR means that you can now protect more of your data. Additionally, SnapMirror reliably protects against both physical and logical disasters.

Common questions include:
Q. How do I decide whether to mirror to ATA or FC?
A. Ask yourself three questions. Firstly, during a disaster, what will be the bottleneck for your performance? ATA is slower than FC, but if you connect to your DR system over a high-latency network, disk type won't matter. Secondly, what RPO do you want? Don't run a synchronous mirror from FC to ATA; I recommend update intervals of at least 15 minutes. Thirdly, which is more flexible: your answers to the first two questions, or your budget?

Q. I think my network is my SnapMirror bottleneck. How can I be sure?
A. We encounter three SnapMirror bottlenecks: CPU, disk I/O, and network. Although the network throughput or latency can slow you down, don't discount the impact of system load. A system running at 90% CPU and/or disk utilisation leaves few resources for SnapMirror, especially because SnapMirror gives priority to user load. If the numbers don't point to the culprit, you can run SnapMirror to tape (SnapMirror store). By eliminating the network, you may be able to isolate the bottleneck.
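One way to make the "check all three" advice concrete is a small triage helper that flags whichever resource looks saturated. This is a hypothetical sketch: the 90% threshold comes from the answer above, the readings are made up, and how you collect CPU, disk and network figures on your own systems is left open.

```python
from typing import List


def likely_bottlenecks(cpu_pct: float, disk_pct: float,
                       net_mbps: float, link_mbps: float,
                       busy_pct: float = 90.0) -> List[str]:
    """Return the resources that look saturated at the given threshold."""
    suspects: List[str] = []
    if cpu_pct >= busy_pct:
        suspects.append("cpu")
    if disk_pct >= busy_pct:
        suspects.append("disk")
    if net_mbps >= link_mbps * busy_pct / 100:
        suspects.append("network")
    return suspects


# Hypothetical readings from a busy source system during a transfer.
print(likely_bottlenecks(cpu_pct=45, disk_pct=93, net_mbps=20, link_mbps=100))
# ['disk']: the link is mostly idle, so a faster pipe alone would not help
```

If nothing is flagged, running SnapMirror to tape, as suggested above, takes the network out of the picture entirely.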

Q. SnapMirror can't meet its schedule because of the network. Can I change my SnapMirror configuration rather than buying a faster pipe or a compression device?
A. Maybe. There are two opposite solutions to this problem. Customers with frequently overwritten data run SnapMirror less often, which eliminates redundant network traffic. Customers with infrequently overwritten data run SnapMirror more often: by constantly sending new data to the mirror, they make better use of their network bandwidth (no bursts). There are two easy ways to figure out your workload. Firstly, use the snap delta command to monitor the amount of change between Snapshot copies. Secondly, modify the SnapMirror schedule and examine the SnapMirror logs. Either way, determine how much more data you would transfer with four hourly transfers than with a single transfer every four hours. In the end, though, sometimes you do just need to buy a faster network.
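The comparison between four hourly transfers and one transfer every four hours can be sketched against a write trace. The traces below are hypothetical; the model assumes each update sends every block changed since the previous update exactly once, however many times it was overwritten in between.

```python
from typing import List, Set


def transferred(hourly_writes: List[Set[int]], every_n_hours: int) -> int:
    """Blocks sent if the mirror updates every `every_n_hours` hours."""
    total = 0
    for start in range(0, len(hourly_writes), every_n_hours):
        # Each block changed in this window is sent once per update.
        window = set().union(*hourly_writes[start:start + every_n_hours])
        total += len(window)
    return total


# Hypothetical traces: block numbers written in each of four hours.
hot_log = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3}]      # same blocks rewritten
cold_log = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}]  # new blocks each hour

for name, log in (("hot", hot_log), ("cold", cold_log)):
    print(name, transferred(log, 1), transferred(log, 4))
# hot 12 3   -> updating less often avoids resending overwritten blocks
# cold 12 12 -> same total, but hourly updates spread the load (no bursts)
```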

Q. How can I optimise my SnapMirror performance on ATA?
A. Configure your mirrors to use all the ATA drives at once by using FlexVol volumes. Because the ATA drives are slower than the FC drives on the primary storage, you need to compensate by keeping them consistently busy. Fortunately, FlexVol volumes make it possible to spread your mirrors across the maximum number of spindles, while SnapMirror's simple scheduling makes it easy to keep one or more transfers running to each destination aggregate at all times.
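A simple way to keep every destination aggregate busy is to spread mirror volumes round-robin and then start transfers in waves of one per aggregate. The sketch below is a hypothetical planning aid; the volume and aggregate names are made up, and it does not call any NetApp interface.

```python
from itertools import cycle, zip_longest
from typing import Dict, List


def place_mirrors(volumes: List[str], aggregates: List[str]) -> Dict[str, List[str]]:
    """Spread mirror volumes round-robin so every aggregate gets some."""
    placement: Dict[str, List[str]] = {aggr: [] for aggr in aggregates}
    for vol, aggr in zip(volumes, cycle(aggregates)):
        placement[aggr].append(vol)
    return placement


def transfer_waves(placement: Dict[str, List[str]]) -> List[List[str]]:
    """Group transfers so each wave runs one transfer per aggregate."""
    return [[vol for vol in wave if vol is not None]
            for wave in zip_longest(*placement.values())]


placement = place_mirrors(["db", "home", "eng", "mail", "web"], ["aggr1", "aggr2"])
print(placement)                  # {'aggr1': ['db', 'eng', 'web'], 'aggr2': ['home', 'mail']}
print(transfer_waves(placement))  # [['db', 'home'], ['eng', 'mail'], ['web']]
```

The uneven last wave shows why the volume count per aggregate matters: once an aggregate runs out of transfers, its spindles sit idle.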

2. SnapMirror for end-to-end data protection
If you can't trust your mirror, what can you trust? Because SnapMirror replicates the active data and Snapshot copies, administrators can back up any application-consistent Snapshot copy while continuing to update the mirror. To reduce cost, tape backups should be run on the mirrors, offloading the primary servers and storage. To improve performance, mix primary and mirrored FlexVol volumes on the same aggregate.

Common questions include:
Q. Speaking of backup... where does SnapVault fit into your DR strategy?
A. SnapVault, the NetApp D2D backup solution for the past five years, adds two DR tiers. SnapVault alone offers a tier between periodic mirroring and traditional D2D and tape backup. SnapVault can run CDP-style, with hourly backups and 20:1 to 50:1 deduplication, from which users can recover their own data. However, you cannot make SnapVault writable. Therefore, although recovery will be faster than any other D2D or tape backup, it will be much slower than a DR mirror.

SnapVault plus SnapMirror delivers a unified DR and backup tier. SnapMirror can make any of the SnapVault backups writable for near-instant recovery. One caveat: SnapVault plus SnapMirror can deliver an RPO of one hour at best, while SnapMirror alone can deliver anything from zero RPO upward, depending on your schedule. Backup management and deduplication are not computationally free, after all.

Q. Sometimes my CPU hits 100% when I run SnapMirror. Is that bad?
A. Not necessarily. If a CPU runs at 90%, and you have work to do, you're not getting the most out of your system. If it runs at 100%, you worry that the load will slow high-priority operations. Ideally, a system should run as hard as possible, but let high-priority operations take precedence. That's how we designed SnapMirror. It drives the CPU as aggressively as possible, but constantly pre-empts itself to minimise its impact on user load. If you find that SnapMirror is hurting your application performance, it's more likely due to disk I/O load than to CPU overload. Reduce the frequency of SnapMirror updates, look into FlexShare, and monitor your overall disk I/O loads.
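The "run hard but yield" behaviour can be illustrated with a one-CPU simulation in which user requests always win and background mirror work fills every otherwise idle tick. This is a hypothetical toy scheduler, not the Data ONTAP scheduler: utilisation stays at 100%, yet no user request waits behind a mirror chunk.

```python
from collections import deque
from typing import Deque, Dict, List


def run(ticks: int, user_arrivals: Dict[int, List[str]], mirror_chunks: int) -> List[str]:
    """Simulate one CPU: user work always wins; mirror work fills idle ticks."""
    user_queue: Deque[str] = deque()
    timeline: List[str] = []
    for t in range(ticks):
        user_queue.extend(user_arrivals.get(t, []))
        if user_queue:
            timeline.append(user_queue.popleft())  # pre-empt background work
        elif mirror_chunks > 0:
            mirror_chunks -= 1
            timeline.append("mirror")
        else:
            timeline.append("idle")
    return timeline


tl = run(ticks=6, user_arrivals={1: ["read"], 2: ["write", "read"]}, mirror_chunks=10)
print(tl)  # ['mirror', 'read', 'write', 'read', 'mirror', 'mirror']
```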

Q. You talk about disk I/O a lot. Why are you so obsessed?
A. Processor performance continues to increase rapidly. Memory steadily becomes cheaper. Bus bandwidth keeps growing. Disk drive capacity explodes, but disk drive performance doesn't. With the vast amount of data and compute power available, disk I/O becomes a precious commodity. NetApp solutions minimise, optimise, and offload I/O for data protection, so your primary storage resources are focused on serving your applications.

3. SnapMirror - beyond DR
Putting your mirrors to work. Mirrors are evolving from expensive DR insurance policies to cost-effective core business tools that deliver DR and more. A single SnapMirror copy can improve DR reliability, optimise test and development, and simplify day-to-day administrator activities.

Common questions include:
Q. I'm moving to VMware - how do you work with them?
A. We love VMware and all virtual server technology. I'll discuss three great opportunities with VMware and NetApp. Firstly, VMware completes our tiered DR story because you can deploy DR tiers for both servers and storage. With virtual machines, the DR site can have fewer servers. We reduce the DR storage cost; they reduce the server cost. Secondly, VMware works with SnapMirror to further simplify test and development. FlexClone your SnapMirror, create a virtual server with your application, and start running. That's it. No load on the primary storage and servers; no extra data copies; no complex system/application configuration. Thirdly, VMware highlights the value of SnapMirror for offloading backup. As your server utilisation increases, you cannot afford to run heavyweight backups - there's no more excess CPU available. In VMware environments, SnapMirror offloads the backup from both the server and the storage. It's the only way to scale.

(Ed. note: You can find out more on this topic in "Five Ways to Use NetApp Snapshot Copies with VMware VI3".)
Q. After I migrate my data, how do I reconfigure my mirrors?
A. If you migrate a volume, simply update the destination's snapmirror.conf to match the new source system/volume/qtree name, and configure the new source to allow transfers to the destination (using the snapmirror.access option or the /etc/snapmirror.allow file). That's it. There's no need to break and/or resynchronise the mirrors. SnapMirror picks up where it left off.
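Because the change is a one-field edit on the destination, it is easy to script. The helper below is a hypothetical sketch: it assumes whitespace-separated "source destination arguments schedule" entries, the host and volume names are made up, and it does not handle the access-control change on the new source.

```python
from typing import List


def repoint_source(conf_lines: List[str], old_src: str, new_src: str) -> List[str]:
    """Swap the source field on entries that mirror from `old_src`.

    Assumes whitespace-separated 'source destination arguments schedule'
    entries; comments, blank lines and other entries pass through unchanged.
    """
    updated: List[str] = []
    for line in conf_lines:
        fields = line.split()
        if fields and fields[0] == old_src:
            line = " ".join([new_src] + fields[1:])
        updated.append(line)
    return updated


conf = [
    "# source destination arguments schedule",
    "oldfiler:vol_db drfiler:vol_db - 0 * * *",
    "otherfiler:home drfiler:home - 30 23 * *",
]
for line in repoint_source(conf, "oldfiler:vol_db", "newfiler:vol_db"):
    print(line)
```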

Q. Why is NetApp so excited about FlexClone? Doesn't everybody support clones?
A. NetApp has revolutionised cloning the way it did Snapshot copies, because the two are built on the same principles. NetApp can store hundreds of Snapshot copies per volume without incurring a performance or storage penalty. Our Snapshot copies don't generate extra I/O when data is modified, and they consume space only for changed blocks.

NetApp can now create hundreds of FlexClone copies without incurring a performance or storage penalty. A FlexClone copy can be instantly created from any Snapshot copy. At that moment, the clone shares all of its blocks with the Snapshot copy; it uses no extra space and generates no extra load. As the application modifies the clone, our clones consume space and I/O only for the changed blocks.

FlexClone copies transform our industry-leading read-only Snapshot copies into industry-leading read-write copies. It's easy to get excited about a technology that, combined with Snapshot and SnapMirror, makes data available anytime, anywhere, for any use.

Stephen Manley is the Technical Director of Data Protection, NetApp. He joined NetApp Engineering in 1997 after graduating from Harvard University with a BA in computer science. He has helped define and develop NetApp tape and disk-to-disk backup, mirroring, and compliance technologies, including SnapMirror. Stephen has travelled the world over, working with customers and backup partners to provide open, reliable, and simple data protection solutions, one data centre at a time.