2009-09-30

The Worst Case Scenario

What do you do when the unthinkable happens, and you lose your entire primary site of operations, including your local cloud storage? It could be a flood, a hurricane, or something as commonplace as a building fire, but it's happened — and now all of your infrastructure and storage media associated with your archive is gone.

But life continues on, and so does your business. Assuming your archive is a critical part of your workflow and corporate assets, how can you protect against these major disruptions?

[Figure: a simplified flowchart for determining when you should protect against site-loss scenarios, and what options are available to prevent loss of archived data and ensure continued access.]

Single-Site Archives

If all of the equipment and storage associated with your archive is located at a single site, and that site is lost, then all data stored in the archive is lost unless you have some form of off-site storage.

While a multi-site archive is the best way to avoid data loss in this scenario, it may be more expensive than is warranted for cost-sensitive data where rapid restoration of access and operations is not required. In this case, two common options are to create two tape copies and have them vaulted off-site, or to store a copy of the data with a public cloud provider, such as Amazon, Iron Mountain, or Diomede Storage.

Vaulting to Tape

Storing data to tape and vaulting the tapes off-site is the least expensive option for protecting archived data. In this scenario, a node is added to the single-site archive that creates two identical tapes containing recently archived data. These tapes are then sent off-site.

The ability to restore an archive from the "bare metal", using only the archived data objects themselves, is a very important feature of an archival system. This ensures that even if the control databases are lost, the archived data can still be accessed and the archive rebuilt.
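To make that concrete, here is a minimal sketch of a self-describing layout (in Python, and purely illustrative; this is not Bycast's actual tape format): each object is written alongside its own metadata record, so the catalog can be rebuilt by scanning the media alone.

```python
import io
import json
import tarfile

def write_self_describing_object(tar, object_id, payload, metadata):
    """Write an object and its metadata record side by side on the medium."""
    # The metadata travels with the data, so no external control database
    # is needed to interpret the tape during a bare-metal restore.
    meta_bytes = json.dumps({"object_id": object_id, **metadata}).encode()
    for name, data in ((object_id + ".meta", meta_bytes),
                       (object_id + ".data", payload)):
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

def rebuild_catalog(tar_path):
    """Reconstruct the catalog by scanning the media for metadata records."""
    catalog = {}
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if member.name.endswith(".meta"):
                record = json.load(tar.extractfile(member))
                catalog[record["object_id"]] = record
    return catalog

# Usage: write one object, then rebuild the catalog from the media alone.
with tarfile.open("vault-0001.tar", "w") as tar:
    write_self_describing_object(tar, "obj-42", b"...payload...",
                                 {"created": "2009-09-30", "type": "email"})
print(rebuild_catalog("vault-0001.tar"))  # -> {'obj-42': {...}}
```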

When planning a tape vaulting approach, the frequency at which these tapes are created determines how much data is at risk in the event of the loss of the site. For example, if tapes are generated weekly, data written just after one run waits up to seven days for the next; add the creation day and a two-business-day transit that can straddle a weekend, and the business has an exposure window of up to twelve calendar days.
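That arithmetic is easy to reproduce. Here is a back-of-the-envelope sketch; how the creation day and weekend are counted is an assumption about the worst case, not a fixed rule:

```python
def exposure_days(cycle_days, transit_business_days):
    """Worst-case calendar days an object is at risk before its tape
    copy leaves the site."""
    # An object written just after a tape run waits a full cycle for the
    # next run; counting the creation day plus a weekend that the
    # business-day transit can straddle gives the calendar worst case.
    worst_transit = 1 + transit_business_days + 2
    return cycle_days + worst_transit

print(exposure_days(7, 2))  # weekly runs, two-day courier -> 12 days at risk
```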

In the event of the catastrophic loss of the primary site, these tapes would have to be recalled from the vaulting provider, which can take some time, and hardware would have to be re-acquired to rebuild the archive. Don't underestimate the time required to re-order hardware: often the original equipment is no longer available, so a new archive will need to be specified and ordered, and it can take weeks for the servers to be assembled and shipped.

Once the tapes have arrived and the hardware has been set up, the archive is rebuilt from the data stored on the tapes; when the last tape has been processed, the archive is ready for use. This is known as a "bare-metal restore".

Of course, depending on the size of the archive, this could take a very long time. A 1 PB media archive would take 115 days to restore when running at a restore load of 800 Mbits/s, and a 10 billion object e-mail archive would take 115 days to restore when running at a restore load of 1000 objects per second. Rebuild times must be taken into account when planning for archive restoration, and often the cost of downtime associated with such a restore is high enough that cloud or multi-site options are considered instead.
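Those figures are easy to sanity-check. A quick sketch, assuming 1 PB = 10^15 bytes and sustained restore rates:

```python
def restore_days_by_bandwidth(archive_bytes, mbit_per_s):
    """Days to stream the whole archive back at a sustained rate."""
    return archive_bytes * 8 / (mbit_per_s * 1e6) / 86400

def restore_days_by_object_rate(object_count, objects_per_s):
    """Days to re-ingest the archive when per-object handling is the bottleneck."""
    return object_count / objects_per_s / 86400

print(restore_days_by_bandwidth(10**15, 800))     # 1 PB at 800 Mbit/s   -> ~115.7 days
print(restore_days_by_object_rate(10**10, 1000))  # 10B objects at 1000/s -> ~115.7 days
```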

Storing to the Cloud (Hybrid Cloud)

Another option for single-site archives is to store a copy of the archived data with a cloud storage provider. This eliminates the headaches associated with tape management, but introduces the requirement for network connectivity to the provider. In this scenario, each archived object that is to be protected off-site is also stored with the cloud provider, which retains the data in case a restore is needed.
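As a sketch of the mechanics, this is roughly what the off-site copy path might look like against Amazon S3 using the boto3 SDK (the bucket name and key scheme here are made-up illustrations; other providers expose similar object APIs):

```python
import boto3  # AWS SDK for Python; other providers offer similar object APIs

s3 = boto3.client("s3")

def protect_offsite(object_id, local_path, bucket="archive-dr-copies"):
    """Push a second copy of a newly archived object to the cloud,
    keyed so it can be located again during a rebuild."""
    s3.upload_file(local_path, bucket, "archive/" + object_id)

def restore_from_cloud(object_id, local_path, bucket="archive-dr-copies"):
    """Pull the off-site copy back when rebuilding the on-site archive."""
    s3.download_file(bucket, "archive/" + object_id, local_path)
```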

Unlike with tape vaulting, data is stored to the cloud immediately, limited only by WAN bandwidth. That limitation can be substantial, however: when bandwidth is insufficient, data remains at risk until the backlog clears. If data is being stored to the archive at 100 Mbits/s, an OC-3 class Internet connection would be required, which can be far more expensive than sending twenty tapes out each week.
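The comparison is straightforward to work out. This sketch assumes LTO-4-class 800 GB tapes (the common capacity at the time) and two copies of each:

```python
import math

def weekly_terabytes(store_mbit_per_s):
    """Data archived in one week at a sustained store rate."""
    return store_mbit_per_s * 1e6 / 8 * 7 * 86400 / 1e12

def tapes_per_week(store_mbit_per_s, tape_capacity_tb=0.8, copies=2):
    """Tapes needed to carry the same weekly load off-site."""
    return math.ceil(weekly_terabytes(store_mbit_per_s) / tape_capacity_tb) * copies

print(weekly_terabytes(100))  # ~7.6 TB/week; streaming it live needs an
                              # OC-3 (155 Mbit/s) class link for headroom
print(tapes_per_week(100))    # ~20 tapes out the door each week
```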

In the event of the catastrophic loss of the primary site, hardware would have to be re-acquired to rebuild the archive, along with network connectivity to the cloud provider. Once both are operational, the archive is reconnected to the cloud, restoring access to the archived data, albeit limited by the network bandwidth. Over time, the on-site copy of the archived data can then be restored back over the WAN.

The primary disadvantage of this approach is the time required to acquire the hardware and network access needed to restore the on-site component of the archive; the second is cost. Fears about unauthorized disclosure of data and loss of control over data are also common, though they can be mitigated with the appropriate use of encryption.

And often, for less than the price charged by most public cloud providers, one can afford to create a multi-site archive, either across multiple premises owned by the business, or into a second site hosted by a third party.

Why not just use the Cloud?

Some public cloud providers encourage an architecture that has a minimal on-site presence and stores all data off-site in the cloud. For some scenarios this approach works very well, as it minimizes capital costs and the time required to restore hardware and access in the event of a disaster. However, one must have sufficient WAN bandwidth for the expected store and retrieve loads (as opposed to just the store-only traffic when using the cloud as a storage target), and in the event of a network connectivity failure, access to most or all of the archive can be disrupted.

This is contrasted with the hybrid cloud model, where the private cloud on-site allows continued access to the data even during WAN failures, and the public cloud is used as a low-cost data storage target.

Multi-Site Archives

When continuance of business operations is important, or archival data must be accessible across multiple sites, the archive can be extended to span multiple sites. This involves several considerations, including:
  1. What data is created in one site and accessed in another?
  2. What data should be replicated to other sites for protection?
  3. What data should be replicated to other sites for performance?
  4. In the event of a site loss scenario, what additional load will be placed on the remaining sites? (A rough sizing sketch follows below.)

Such multi-site archives are very flexible, and allow seamless continuance of operations even in the event of major and catastrophic failures affecting one or more sites. While this is obviously the best solution from a business continuance standpoint, it is also the most expensive, as you must duplicate your entire archive infrastructure across multiple sites and provide sufficient WAN bandwidth for cross-site data replication.
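That fourth consideration is often the one missed in sizing. The following sketch assumes the lost site's traffic redistributes evenly across the survivors; the site names and request rates are hypothetical:

```python
def failover_load(site_loads, lost_site):
    """Per-site load after a site loss, assuming the lost site's traffic
    redistributes evenly across the surviving sites."""
    survivors = {s: load for s, load in site_loads.items() if s != lost_site}
    extra = site_loads[lost_site] / len(survivors)
    return {s: load + extra for s, load in survivors.items()}

# Hypothetical retrieval loads in requests/s; each surviving site must be
# provisioned to absorb its share of the lost site's traffic.
print(failover_load({"site-a": 400, "site-b": 250, "site-c": 150}, "site-a"))
# -> {'site-b': 450.0, 'site-c': 350.0}
```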

Of course, one can also deploy systems like Bycast's StorageGRID to provide a mixture of the approaches described above, using policies to determine which archived content is stored locally, vaulted to tape, stored in a public cloud, and replicated across multiple sites. This flexibility allows the value of the data to be mapped to the cost of the storage, and leverages a common infrastructure for all levels of protection required.
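Purely as an illustration of the idea (this is not StorageGRID's actual policy language), such policy-driven placement amounts to a mapping from content classes to the protection levels described above:

```python
# Illustrative only -- the class names and placement targets are made up.
PLACEMENT_POLICIES = {
    "finished-masters": ["local", "replicate-to-site-b", "tape-vault"],
    "working-files":    ["local", "tape-vault"],
    "compliance-email": ["local", "public-cloud"],
    "everything-else":  ["local"],
}

def placements_for(content_class):
    """Resolve where copies of an object should live, so that storage
    cost tracks the value of the data."""
    return PLACEMENT_POLICIES.get(content_class,
                                  PLACEMENT_POLICIES["everything-else"])

print(placements_for("finished-masters"))
# -> ['local', 'replicate-to-site-b', 'tape-vault']
```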
