2009-09-30

The Worst Case Scenario

What do you do when the unthinkable happens, and you lose your entire primary site of operations, including your local cloud storage? It could be a flood, a hurricane, or something as commonplace as a building fire, but it's happened — and now all of your infrastructure and storage media associated with your archive is gone.

But life continues on, and so does your business. Assuming your archive is a critical part of your workflow and corporate assets, how can you protect against these major disruptions?

The simplified flowchart accompanying this post helps determine when you should protect against site-loss scenarios, and what options are available to prevent loss of archived data and ensure continued access.

Single-Site Archives

If all of the equipment and storage associated with your archive is located at a single site and that site is lost, then unless you have some form of off-site storage, all data stored in the archive is lost with it.

While a multi-site archive is the best way to avoid data loss in this scenario, for cost-sensitive data where rapid restoration of access and operations is not required, it may be more expensive than is warranted. In this case, two common options are to create two tape copies and have them vaulted off-site, or to store a copy of the data with a public cloud provider, such as Amazon, Iron Mountain, or Diomede Storage.

Vaulting to Tape

Storing to tape and vaulting the tapes off-site is the least expensive option for protecting archived data. In this scenario, a node is added to the single-site archive that creates two identical tapes containing recently archived data, and these tapes are then sent off-site.

The ability for an archive to be restored from "bare metal", using nothing more than the archived data objects themselves, is a very important feature of an archival system. It ensures that even if the control databases are lost, the archived data can still be accessed and the archive can be rebuilt.

When planning a tape vaulting approach, the frequency with which these tapes are created determines how much data is at risk in the event of a site loss. For example, if tapes are generated every week and take, at worst, two business days to be taken off-site, the business can have an exposure window of up to twelve calendar days.
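
As a quick check of that arithmetic: the exposure window is the tape generation interval plus the worst-case calendar time for the tapes to reach the vault. The sketch below is a back-of-the-envelope illustration (the function name is mine, and the five-calendar-day transit figure is an assumption, since two business days can bracket a weekend):

    # Rough worst-case exposure window for tape vaulting.
    # Data archived just after a tape run waits a full cycle for the next run,
    # then remains on-site until those tapes reach the vault.

    def exposure_window_days(tape_cycle_days, transit_calendar_days):
        """Worst-case number of calendar days that newly archived data
        exists only at the primary site."""
        return tape_cycle_days + transit_calendar_days

    # Weekly tapes, two business days to the vault (up to ~5 calendar days
    # when the pickup brackets a weekend) -> 12 days of exposure.
    print(exposure_window_days(7, 5))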

In the event of the catastrophic loss of the primary site, these tapes would have to be recalled from the vaulting provider, which can take some time, and hardware would have to be re-acquired to rebuild the archive. Don't underestimate the time required to re-order hardware: often the original equipment is no longer available, so a new archive will need to be specified and ordered, and it can take weeks for the servers to be assembled and shipped.

Once the tapes have arrived and the hardware has been set up, the archive is rebuilt from the data stored on the tapes, and once the last tape is processed, the archive is ready to be used. This is known as a "bare-metal restore".

Of course, depending on the size of the archive, this could take a very long time. A 1 PB media archive would take about 115 days to restore at a sustained restore rate of 800 Mbit/s, and a 10-billion-object e-mail archive would take about 115 days to restore at a sustained rate of 1,000 objects per second. Rebuild times must be taken into account when planning for archive restoration, and often the cost of downtime associated with such a restore is high enough that cloud or multi-site options are considered instead.
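
Those figures fall out of straightforward arithmetic, as the sketch below shows (the function names are illustrative, not part of any product):

    # Restore-time arithmetic for the two examples above.

    SECONDS_PER_DAY = 86_400

    def days_by_bandwidth(archive_bytes, restore_mbit_per_s):
        """Days to stream an archive at a sustained restore bandwidth."""
        seconds = archive_bytes * 8 / (restore_mbit_per_s * 1_000_000)
        return seconds / SECONDS_PER_DAY

    def days_by_object_rate(object_count, objects_per_s):
        """Days to re-ingest a given number of objects at a sustained rate."""
        return object_count / objects_per_s / SECONDS_PER_DAY

    print(days_by_bandwidth(1e15, 800))     # 1 PB at 800 Mbit/s -> ~115.7 days
    print(days_by_object_rate(10e9, 1000))  # 10 billion objects at 1,000/s -> ~115.7 days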

Storing to the Cloud (Hybrid Cloud)

Another option for single-site archives is to store a copy of the archived data to a cloud storage provider. This eliminates the headaches associated with tape management, but introduces the requirement for network connectivity to the provider. In this scenario, for each archived object to be protected off-site, the object is also stored to a cloud provider, which retains the data in the event that a restore is needed.

Unlike with tape vaulting, data is stored to the cloud immediately, limited only by WAN bandwidth. This limitation can be substantial, however, and when bandwidth is insufficient, data remains at risk until the backlog clears. If data is being stored to the archive at 100 Mbit/s, an OC-3 class Internet connection would be required, which can be far more expensive than sending twenty tapes out each week.
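
To put numbers behind that comparison: at a sustained 100 Mbit/s store rate, roughly 7.5 TB of new data lands in the archive each week. The sketch below makes the arithmetic explicit; the 800 GB per-tape figure is an assumption (roughly the native capacity of a current LTO cartridge), so the exact tape count will vary:

    # Weekly ingest volume and resulting tape count for the example above.

    def weekly_terabytes(store_mbit_per_s):
        """Terabytes archived per week at a sustained store rate."""
        return store_mbit_per_s * 1_000_000 / 8 * 604_800 / 1e12

    def tapes_per_week(store_mbit_per_s, tape_capacity_gb=800, copies=2):
        """Vaulted tapes generated per week, given assumed capacity and copy count."""
        return weekly_terabytes(store_mbit_per_s) * 1_000 / tape_capacity_gb * copies

    print(weekly_terabytes(100))  # ~7.6 TB ingested per week at 100 Mbit/s
    print(tapes_per_week(100))    # ~19 tapes per week for two vaulted copies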

In the event of the catastrophic loss of the primary site, hardware would be re-acquired to rebuild the archive, and network connectivity to the cloud provider would need to be re-established. Once both are operational, the archive would be reconnected to the cloud, restoring access to the archived data, albeit limited by the network bandwidth. Over time, the on-site copy of the archived data can then be restored back over the WAN.

The primary disadvantage of this approach is the time required to obtain the hardware and network access needed to restore the on-site component of the archive; the second is cost. Fears about unauthorized disclosure of data and loss of control over data are also common, though they can be mitigated with the appropriate use of encryption.

And often, for less than the price charged by most public cloud providers, one can afford to create a multi-site archive, either across multiple premises owned by the business, or at a second site hosted by a third party.

Why not just use the Cloud?

Some public cloud providers encourage an architecture that has a minimal on-site presence and stores all data off-site in the cloud. For some scenarios this approach works very well, as it minimizes capital costs and the time required to restore hardware and access in the event of a disaster. However, one must have sufficient WAN bandwidth for the expected store and retrieve loads (as opposed to the store-only traffic seen when using the cloud purely as a storage target), and in the event of a network connectivity failure, access to most or all of the archive can be disrupted.

This is contrasted with the hybrid cloud model, where the private cloud on-site allows continued access to the data even during WAN failures, and the public cloud is used as a low-cost data storage target.

Multi-Site Archives

When continuance of business operations is important, or archival data must be accessible across multiple sites, the archive can be extended to span multiple sites. This involves several considerations, including:
  1. What data is created in one site and accessed in another?
  2. What data should be replicated to other sites for protection?
  3. What data should be replicated to other sites for performance?
  4. In the event of a site loss scenario, what will be the additional load placed on other sites?
Such multi-site archives are very flexible, and allow seamless continuance of operations even in the event of major and catastrophic failures that affect one or more sites. While this is obviously the best solution from a business continuance standpoint, it is also the most expensive, as you must duplicate your entire archive infrastructure across multiple sites and provide sufficient WAN bandwidth for cross-site data replication.

Of course, one can also deploy systems like Bycast's StorageGRID to provide a mixture of the approaches described above, using policies to determine which archived content is stored locally, vaulted to tape, stored in a public cloud, and replicated across multiple sites. This flexibility allows the value of the data to be mapped to the cost of the storage, and leverages a common infrastructure for all levels of protection required.

2009-09-14

Introducing CDMI

Today, the Storage Networking Industry Association (SNIA) publicly released the first draft of the Cloud Data Management Interface (CDMI). The draft standard can be downloaded from the address below:

http://www.snia.org/tech_activities/publicreview

I'm very pleased to have been a significant contributor to this standard since the inception of the working group earlier this year. Over the last nine months, we've come a long way towards defining a working standard for cloud storage management, and Bycast is proud to have contributed many best-of-breed capabilities first pioneered in Bycast's StorageGRID HTTP API, which is used by hundreds of customers worldwide to store and access many dozens of petabytes of data in cloud environments, both public and private.

Why CDMI?

CDMI provides a standardized method by which data objects and metadata can be stored, accessed and managed within a cloud environment. It is intended to provide a consistent method of access for applications and end-user systems, and a consistent interface for providers of cloud storage.

Currently, almost all cloud storage providers and vendors use significantly different APIs, which forces cloud application and gateway software vendors to code and test against each one, and to architect their applications around the lowest common denominator. CDMI significantly reduces the complexity of development, testing and integration for the application vendor, and is specifically designed to be easy to adopt for both cloud providers and application vendors. CDMI can run alongside existing cloud protocols; as an example, a customer could run a CDMI gateway in an EC2 instance to gain access to their existing Amazon S3 bucket without Amazon having to do any work — a great example of the power of cloud!

Much like SCSI, Fibre Channel and TCP/IP, such industry-wide standards provide many advantages. These range from simple but essential efficiencies, such as standardized interface documentation, conformance and performance testing tools, and the creation of a market for value-added tools such as protocol analyzers, to developer awareness, libraries and code examples.

Industry standards also jump-start the network effect, where more applications encourage providers to support the standard, and more providers supporting the standard encourage application vendors to support the standard. Finally, and most excitingly, CDMI increases inter-cloud interoperability, and is a fundamental enabler for advanced emerging cloud models such as federation, peering and delegation, and the emergence of specialized clouds for content delivery, processing and preservation.

A Whirlwind Tour of CDMI

CDMI stores objects (data and metadata) in named containers that group the objects together. All stored objects are accessed by web addresses that contain either a path (e.g. http://cloud.example.com/myfiles/cdmi.txt) or an object identifier (e.g. http://cloud.example.com/objectid/AABwbQAQvmAJSJWUHU3awAAA==).

CDMI provides a series of RESTful HTTP operations that can be used to access and manipulate a cloud storage system. PUT is used to create and update objects, GET is used to retrieve objects, HEAD is used to retrieve metadata about objects, and DELETE is used to remove objects.
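
To make that concrete, here is a minimal sketch that creates and then retrieves a small data object using Python's standard library. The endpoint is fictitious, and the media type and version header values are my reading of the draft, so treat them as placeholders rather than normative values:

    # Minimal sketch of CDMI-style PUT and GET over HTTP.
    # cloud.example.com is a placeholder endpoint; header values are illustrative.

    import http.client
    import json

    conn = http.client.HTTPConnection("cloud.example.com")

    # Create (or update) a data object by path, wrapping the value and
    # user metadata in the CDMI JSON envelope.
    body = json.dumps({
        "mimetype": "text/plain",
        "metadata": {"department": "engineering"},  # arbitrary user metadata
        "value": "Hello, CDMI!",
    })
    conn.request("PUT", "/myfiles/cdmi.txt", body, headers={
        "Content-Type": "application/cdmi-object",
        "X-CDMI-Specification-Version": "1.0",
    })
    resp = conn.getresponse()
    resp.read()           # drain the response before reusing the connection
    print(resp.status)    # success would be reported here (e.g. 201)

    # Retrieve the same object; metadata and value come back together.
    conn.request("GET", "/myfiles/cdmi.txt", headers={
        "Accept": "application/cdmi-object",
        "X-CDMI-Specification-Version": "1.0",
    })
    print(conn.getresponse().read().decode())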

Data stored in CDMI can be referenced between clouds (where one cloud points to another), copied and moved between clouds, and can be serialized into an export format that can be used to facilitate cloud-to-cloud transfers and customer bulk data transfers. All data-metadata relationships are preserved, and standard metadata is defined to allow a client to specify how the cloud storage system should manage the data. Examples of this "Data System Metadata" include the acceptable levels of latency and the degree of protection through replication.
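
As an illustration of data system metadata, a client could ask the cloud to keep a certain number of replicas and to serve the object within a latency bound simply by setting metadata on the object. The metadata names below (cdmi_data_redundancy, cdmi_latency) are my reading of the draft and may change before the standard is finalized:

    # Sketch of a CDMI data object body carrying data system metadata.
    # Names and value formats are illustrative and may differ in the final spec.

    import json

    data_object = {
        "mimetype": "application/octet-stream",
        "metadata": {
            "cdmi_data_redundancy": "2",  # requested number of replicas
            "cdmi_latency": "100",        # acceptable retrieval latency, in ms
        },
        "value": "example payload",
    }

    print(json.dumps(data_object, indent=2))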

In addition to basic objects and containers (similar to files and folders in a file system), CDMI also supports the concept of capabilities, which allow a client to discover what a cloud storage system is capable of doing. CDMI also supports accounts, which provide control over and statistics on account security, usage and billing. Finally, CDMI supports queue data storage objects, which enable many exciting new possibilities for cloud storage.
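
Capability discovery is itself just another GET. A hedged sketch, using a fictitious endpoint and the capabilities URI and media type as I understand them from the draft:

    # Sketch of capability discovery: ask the cloud what it supports before using it.

    import http.client

    conn = http.client.HTTPConnection("cloud.example.com")
    conn.request("GET", "/cdmi_capabilities/", headers={
        "Accept": "application/cdmi-capability",
        "X-CDMI-Specification-Version": "1.0",
    })
    resp = conn.getresponse()
    print(resp.status)
    print(resp.read().decode())  # JSON map of capability names to supported values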

In fact, queues are important and significant enough that I'll be writing more about them and what they enable in a subsequent blog entry.

The Next Steps

With CDMI now "out in the wild", this is the point where the standards effort starts to get really interesting. Up to this point, it has been a relatively small group that has been working on the standard, and we've had to make some controversial decisions (such as eliminating locking and versioning from the first release). There's still a lot of work to be done, and as CDMI gets more visibility, we look forward to increased involvement from other players in the industry. Together, we can make this standard even better, and help shape the future of cloud storage.

So, if you are interested in cloud storage and cloud storage APIs, I would strongly encourage you to take the time to read the CDMI draft documentation, and contribute your thoughts and suggestions.

We're proud of what we've achieved, and together, we can make it even better.

2009-09-10

Cloud Computing and Cloud Storage Standards

All of our work at the Cloud Storage technical working group in the Storage Networking Industry Association (SNIA) has been coming together, and we are nearing a public release of the Cloud Data Management Interface (CDMI).

We've also been working with the Open Grid Forum (OGF) to ensure that CDMI can be used in conjunction with the OCCI standard to manage data storage in cloud computing environments, and there are some exciting possibilities for combining CDMI with the VMware vCloud API, which was recently submitted to the DMTF as a potential industry standard.

You can read our joint whitepaper on CDMI and OCCI at the URL below:

http://ogf.org/Resources/documents/CloudStorageForCloudComputing.pdf

If any readers have any questions about CDMI and how it facilitates cloud computing, please don't hesitate to comment!