2009-02-24

Object Storage, Part 2 - Metadata

A key aspect of object-based storage is the storage of metadata as an intrinsic part of each stored object. Allowing applications to define and include arbitrary metadata as part of a stored object provides the foundation for enabling many enhanced capabilities for both the application and the storage system.

As part two of the object storage series, this entry covers the importance of metadata to object storage. It builds on last week's entry, On Object-Based Storage, which introduces the concept.

What is Metadata, Anyways?

Metadata is one of those unfortunate terms that the computer industry has abused to the point where it now has a wide range of meanings. By strict definition, metadata is "data about data", but in more general use it refers to any descriptive data.

For example, a file name is considered to be metadata about the file, as are the file creation date, the access list of who is allowed to access the file, and a thumbnail view of how the file would look when printed.

Figure 1 - Objects Including Data and Metadata

Metadata is often blurred together with data, and can itself be considered data depending on the context. A compound document, as in Figure 2, illustrates this. An indication of which typeface to use for a paragraph of text is often referred to as metadata associated with the text, as are the location offset and scaling factor describing how an image should be rendered. While this information is indeed metadata to the paragraph and the image, respectively, from the context of the document it is an intrinsic part that must be present for the document to be rendered as the user intended.

Figure 2 - Composite Objects Including Metadata as Data

Thus, while in principle data can stand alone without the associated metadata, in reality the metadata often provides the context that makes the data usable. (After all, think about how hard it would be to find a file without file names or directories.)

Metadata for Applications

For most applications, being able to attach metadata to stored data is a fundamental requirement for structured storage. By storing metadata alongside, or intermixed with, the data, applications are able to ensure that the data is sufficiently described for manipulation or display to the end user.

Since there is no universal file format, each application vendor has had to choose between living with the limits of standard file formats and creating a proprietary one. Standardized file formats have emerged over the years, and can often be extended with application-defined tags or properties, but more often than not application developers resort to creating their own format. And even with general formats such as XML, the files are not self-describing without additional descriptive information such as a schema (more metadata).

While object storage does not create a universal file format, it does provide a consistent and standard way for applications to store metadata along with data, in a format that is independent of any specific application. Giving applications a consistent way to package up all of the data and metadata into a single storage object, then commit it atomically to storage, provides many advantages over ad-hoc solutions.

For example, consider someone who is writing a web-based blogging system. Each blog entry has the body text in HTML (the data), and a series of metadata items associated with the post, such as a title, creation date, posting date, posting status (draft, posted), and an author. A typical design would be to use a database to store the metadata, and store the HTML posts as files. In this implementation, even if the data is stored in the database (always a temptation, but rarely a good idea), databases are intrinsically loosely coupled data stores, and much complexity and fragility ensue.

Contrast this with an application designed around an object store. For each post, a "Blog Post Object" is created, which includes the post data and the metadata. The object is committed to storage as an atomic element, where it persists. Each committed object is self-describing, so no schema is needed, and the complexity facing the application writer is vastly reduced.
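As a rough illustration of the "Blog Post Object" idea, here is a minimal Python sketch. The `InMemoryObjectStore` class, its `commit` and `get` methods, and the metadata values are all hypothetical names made up for this example; they do not correspond to any particular object storage product or API.

```python
import copy
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredObject:
    """Data and metadata packaged together as one unit."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store (not a real product API)."""

    def __init__(self):
        self._objects = {}

    def commit(self, data: bytes, metadata: dict) -> str:
        # Build the complete object first, then publish it with a single
        # assignment, so a reader never sees data without its metadata.
        obj = StoredObject(data=data, metadata=copy.deepcopy(metadata))
        object_id = str(uuid.uuid4())
        self._objects[object_id] = obj
        return object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]


store = InMemoryObjectStore()
post_id = store.commit(
    data=b"<p>Metadata is data about data.</p>",
    metadata={
        "title": "Object Storage, Part 2 - Metadata",
        "author": "example-author",  # illustrative value
        "status": "posted",          # draft or posted
        "created": "2009-02-24",
    },
)
post = store.get(post_id)
```

The point of the sketch is that the application makes one call and gets back one identifier; there is no separate database row to keep in step with a file on disk.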

Now one might question how an object store is different from a database, and that is a good question. In fact, one could consider a database a specialized form of an object store, and an object store a specialized form of a database. Ultimately, both of these perspectives are correct. The key aspect to keep in mind is where services are being provided. In a database, services are provided to the application by a middle layer that runs on top of a non-intelligent storage system, whereas with object-based storage, services are provided by the storage system itself. This is a key difference, which will be the basis of much of the remainder of this series of articles.

Metadata for Storage Systems

Today's storage systems are like illiterate librarians managing a building full of loose pages. People hand them a page, and they put it in a location; people ask for the page at a given location, and they hand it back. While this works, it's not very intelligent.

But what if storage could be more? What if you could ask for a book? What if you could ask for all the books about a given topic? Or by a given author? Object storage is our literate librarian, who understands the metadata associated with stored objects.

When the storage system understands what is being stored, this enables all sorts of capabilities and optimizations that would otherwise not be possible. Now the storage system can provide the ability to search content. Now the storage system can intelligently optimize storage and retrieval performance and latency. And most importantly, now there is a way for the application and the storage system to communicate with each other.
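To make the "literate librarian" concrete, the sketch below shows what a metadata query might look like. The `find_objects` function, the `library` dictionary and the author and topic values are assumptions invented for illustration; real object stores expose search through their own interfaces.

```python
def find_objects(objects, **criteria):
    """Return the IDs of objects whose metadata matches every criterion."""
    return sorted(
        object_id
        for object_id, (data, metadata) in objects.items()
        if all(metadata.get(key) == value for key, value in criteria.items())
    )


# Each entry maps an object ID to (data, metadata); values are illustrative.
library = {
    "book-1": (b"...", {"author": "Melville", "topic": "whaling"}),
    "book-2": (b"...", {"author": "Melville", "topic": "law"}),
    "book-3": (b"...", {"author": "Darwin", "topic": "biology"}),
}

by_melville = find_objects(library, author="Melville")
about_whaling = find_objects(library, author="Melville", topic="whaling")
```

A block device could not answer either question, because it only knows about numbered blocks; the query is possible only because the metadata travels with each object.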

There are many exciting capabilities that emerge from having richer communication between the application and the storage system, and these are worth describing in more depth. The subsequent entries in this series will discuss these emergent capabilities, including query, placement, protection, permissions, representation, policies, compression and versioning.

2009-02-17

Watch for Goats in the Cloud

George Crump of Storage Switzerland posted an article titled Cloud Storage Reality, in which he discusses the emerging class of "Cloud Storage" solutions. His conclusion: Cloud Storage is a reality and ready for prime time.

But is it? Or more specifically, is all that is called "Cloud Storage" ready for prime time?

George's list of the key advantages of cloud storage over traditional enterprise storage systems (dispersion, nodes, scale, granular, ease and self-upgrading) is dead-on.

"Say, What's that mountain goat doing clear up here in this cloud bank?"

Similarly, we agree with his taxonomy of the three different deployment models, Service Only, Software Only and Pre-packaged Cloud.

But there is a key distinction between different ways that clouds can be deployed that can make the difference between a high-risk failure and a low-risk success:

Storage Just in the Cloud?

While cloud storage is a proven architecture, pure Internet-based storage remains risky. Before enterprises are willing to trust their data and their business to a provider, they first look for industry maturity, stability and reliability. After all, the pure Internet-based storage industry is still in the early stages of adoption, and one can argue that it already failed once, during the "Storage Utility Provider" craze at the beginning of the decade. Heck, even Enron was getting into that business.

And provider uptime is only half the QoS battle. Even if the remote storage service provider has 100% uptime, access to the provider is limited by the reliability of the Internet networks in between, and throughput is restricted by the customer's bandwidth to the Internet.

After all, despite all the talk of bandwidth being free, the cost of an OC-3 to the Internet still makes most CFOs reach for their chests.
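To put the bandwidth limit in perspective, here is a back-of-the-envelope calculation in Python. It assumes the full OC-3 line rate of 155.52 Mbit/s is available and ignores protocol overhead and contention, so real transfers would be slower; the one-terabyte data set is an arbitrary example size.

```python
OC3_BITS_PER_SECOND = 155.52e6  # OC-3 line rate; usable payload is lower


def transfer_hours(size_bytes, link_bits_per_second=OC3_BITS_PER_SECOND):
    """Idealized time to move size_bytes over the link, in hours."""
    return size_bytes * 8 / link_bits_per_second / 3600


one_terabyte = 10**12  # decimal terabyte, chosen only as an example
hours = transfer_hours(one_terabyte)
print(f"1 TB over an OC-3: about {hours:.1f} hours")
```

Even under these generous assumptions, moving a single terabyte takes more than half a day.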

Then, if you really want to kill pure Internet-based storage, get the lawyers involved...

What Really Works

Cloud Storage is production ready and widely deployed, but only in configurations that extend into the customer's data centre. I would wager that virtually all enterprise-class cloud storage deployments include data being stored in the customer's data centre. You see this with profiles of Amazon's S3 customers, and we see this with our customers. This is to be expected, of course, since all private cloud deployments exist primarily within the customer's data centre.

So, to summarize, where does internet-resident cloud storage work?

Cloud Storage providing off-site protection copies for data that is also held on-site.

Cloud Storage providing lower-cost storage for data where high levels of QoS are not required.

Cloud Storage facilitating data sharing across sites.

2009-02-15

Object Storage, Part 1 - Introduction

Object-Based Storage is an alternate approach to specifying an interface between higher level application programs and storage devices for the purposes of storing digital data. While not commonly used, the many advantages of object-based storage are resulting in increasing adoption, and over the next decade it is expected to become widespread.

As the first part of a series of blog posts talking about object-based storage, this post introduces object-based storage in the context of other widely used storage interface technologies, and briefly covers the advantages inherent in object-based storage.

Stream-Based Storage

In early computing systems, data was stored as a series of bits or bytes that could be written or read over time. Examples included ticker tape for bit-streamed data, and punch cards for byte-streamed data. Within a computing system, some of the first persistent storage systems were based around writing data as sequences of magnetic signals on tape or on a rotating drum. These stored values could then be read back in the same order they were written.

While this approach is still used in storage devices such as tape, for low-latency storage purposes the main disadvantage of stream-based storage was the time required to access a given piece of information. In order to access a given piece of information, a program had to specify where in the stream the information was located, which required keeping track of many bits of addressing information. Thus, in order to save bits, the stream was divided into equally sized "blocks" that could be used to refer to locations within it.

Block-Based Storage

From an interface standpoint, today's hard disks are still conceptually modelled as a long stream of bytes divided into equally sized blocks. All accesses to the storage devices are performed by reading or writing blocks over industry standard protocols such as SCSI and Fibre Channel, which typically specify that each block contains 512 bytes of user-accessible information.
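The block model boils down to simple integer arithmetic. The sketch below assumes the 512-byte block size mentioned above; the function names `locate` and `blocks_for_range` are made up for this illustration.

```python
BLOCK_SIZE = 512  # bytes of user-accessible data per block


def locate(byte_offset, block_size=BLOCK_SIZE):
    """Map a byte offset in the stream to (block number, offset in block)."""
    return divmod(byte_offset, block_size)


def blocks_for_range(byte_offset, length, block_size=BLOCK_SIZE):
    """Return the block numbers that must be read to cover a byte range."""
    if length <= 0:
        return []
    first = byte_offset // block_size
    last = (byte_offset + length - 1) // block_size
    return list(range(first, last + 1))


position = locate(1300)
needed = blocks_for_range(500, 100)
```

Here byte 1300 falls in block 2 at offset 276, and a 100-byte read starting at byte 500 straddles blocks 0 and 1, so both whole blocks have to be transferred.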

Under the covers, the hard disk controllers understand that the data isn't actually stored in one long sequential stream of bits, but is spread across multiple platters of spinning discs, and the physical location of a given block is found by translating the block address into a distance and angle on the surface of a disc. For solid-state storage devices, block addresses are mapped to physical groups of semiconductor devices arranged in two- or three-dimensional structures.
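One simplified way to picture this translation is the classic cylinder/head/sector (CHS) scheme, sketched below. The cylinder roughly corresponds to the distance from the centre, the head to the platter surface, and the sector to the angle. The geometry values are assumptions chosen for the example, and real drives layer zoned recording and internal remapping on top, so this is an idealization rather than what any given controller does.

```python
HEADS_PER_CYLINDER = 16  # assumed geometry, for illustration only
SECTORS_PER_TRACK = 63   # assumed geometry, for illustration only


def lba_to_chs(lba, heads=HEADS_PER_CYLINDER, sectors=SECTORS_PER_TRACK):
    """Translate a logical block address to (cylinder, head, sector)."""
    cylinder = lba // (heads * sectors)
    head = (lba // sectors) % heads
    sector = lba % sectors + 1  # sectors are traditionally numbered from 1
    return cylinder, head, sector


first_block = lba_to_chs(0)
next_track = lba_to_chs(63)
next_cylinder = lba_to_chs(16 * 63)
```

With this geometry, block 0 is at cylinder 0, head 0, sector 1; block 63 moves to the next head; and block 1008 is the first block of cylinder 1.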

File-based Storage

Given that application data, be it documents or images or sound data, is often larger than a block and may not fill up blocks completely, a higher-level software layer maps these documents, typically called files, onto the blocks on disk. This software is typically called a file system, and provides a directory of files, metadata about the files, and information known as "extents", which point to the list of blocks that contain the contents of the file.
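The role of extents can be sketched in a few lines of Python. The `Extent` class, the `block_for_offset` function and the block numbers are hypothetical, and real file systems use more elaborate on-disk structures, but the lookup follows the same idea.

```python
from dataclasses import dataclass

BLOCK_SIZE = 512


@dataclass(frozen=True)
class Extent:
    start_block: int  # first device block of this contiguous run
    block_count: int  # number of blocks in the run


def block_for_offset(extents, file_offset, block_size=BLOCK_SIZE):
    """Translate a byte offset in a file to (device block, offset in block)."""
    file_block, offset_in_block = divmod(file_offset, block_size)
    for extent in extents:
        if file_block < extent.block_count:
            return extent.start_block + file_block, offset_in_block
        file_block -= extent.block_count
    raise ValueError("offset is past the end of the file")


# A six-block file stored as two separate runs on the device.
extents = [Extent(start_block=100, block_count=4),
           Extent(start_block=900, block_count=2)]

start = block_for_offset(extents, 0)
later = block_for_offset(extents, 2600)
```

Byte 0 of the file lives in device block 100, while byte 2600 falls in the second run, at device block 901, offset 40.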

When a file system is layered on top of a block storage device, the user is already being presented with a form of object storage. In this case, there are two types of objects, files and directories. However, this is where the similarities end.

Object-Based Storage

In an object-based storage interface, instead of writing a stream of data, or writing blocks of data, or writing a file, an application program writes an object. In interfaces such as XAM (eXtensible Access Method) and OSD (Object Storage Device), instead of manipulating files, applications manipulate objects, and can perform a much wider variety of operations on these objects.

There are several aspects of objects that differentiate them from files:
  • Objects are compound. While files can also contain multiple different types of data, as is commonly found in a .zip or .tar archive file, objects intrinsically bundle multiple different pieces of information together into one package. Metadata and data are all stored with an internal directory that allows each sub-component of the object to be accessed.
  • Objects are self-describing. Instead of just having the metadata that is associated with a file, such as creation time and file name, objects have arbitrary metadata specified by an application that is specific to the application and data's nature. For example, a sound object can have sampling frequency, gain levels and other sound-specific metadata, while a document may have metadata describing the application that created it, and may even contain a low-resolution preview image of the document.
  • Objects are self-contained. When a file is stored on disk or even copied from one location to another, the metadata is not stored as part of the file contents, and may not be transferred along with the file data. With objects, the metadata is an intrinsic part of the object, and cannot be separated without creating a different object.

How is This Different?

This isn't new, and these three concepts have been widely used for organizing the contents of files stored on block devices in most software systems. Every Microsoft Word document, Macintosh Resource Fork and Windows Executable is already an object based on this definition. So, why is object storage needed when these concepts are already widely implemented?

The key to answering this question is looking at the entire storage stack, instead of looking at each part in isolation. In a file-based storage system, if an application wants to load a preview image from a file, the application must first open the file, know how to understand the file format (which is typically application-specific), and read the data. At the next layer down, the file system must translate these read requests (which view the file as a stream of data) into block addresses using the file system extents metadata, then issue block requests to the storage device, which in turn needs to look up the location of the data in the blocks to return it back up to the application.

With object-based storage, the application tells the storage system directly that it wants data from a specific object, and the request can be sent all the way down to the storage device. Because the device understands the structure of objects, it is able to handle these application requests more intelligently and more efficiently.
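The preview-image case can be sketched as follows. The `ObjectDevice` class, its methods and the component names are hypothetical and only model the idea; they are not the XAM or OSD interfaces.

```python
class ObjectDevice:
    """Hypothetical storage device that understands object structure."""

    def __init__(self):
        self._objects = {}
        self.bytes_returned = 0  # tracks how much data left the device

    def write_object(self, object_id, components):
        # The dict of named components plays the role of the object's
        # internal directory.
        self._objects[object_id] = dict(components)

    def read_component(self, object_id, name):
        # The device locates the named part itself and returns only that.
        part = self._objects[object_id][name]
        self.bytes_returned += len(part)
        return part


device = ObjectDevice()
device.write_object("doc-1", {
    "body": b"x" * 100_000,       # the full document
    "preview": b"p" * 2_000,      # low-resolution preview image
    "created_by": b"word processor",
})
preview = device.read_component("doc-1", "preview")
```

In this toy model only the 2,000 preview bytes are returned by the device, and the application never has to parse the 100,000-byte body to find them.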

This may seem like a subtle distinction, but there are numerous advantages resulting from standardizing object structures and pushing down awareness of object contents to the storage layer, and this additional awareness enables many new capabilities and increased efficiencies. These will be the subject of the entries to follow.