2009-02-24

Object Storage, Part 2 - Metadata

A key aspect of object-based storage is the storage of metadata as an intrinsic part of each stored object. Allowing applications to define and include arbitrary metadata as part of a stored object provides the foundation for enabling many enhanced capabilities for both the application and the storage system.

As part two of the object storage series, this entry covers the importance of metadata to object storage. It builds on last week's entry, On Object-Based Storage, which introduces the topic.

What is Metadata, Anyways?

Metadata is one of those unfortunate terms that the computer industry has abused to the point where it now has a wide range of meanings. Strictly defined, metadata is "data about data", but in more general use, it refers to any descriptive data.

For example, a file name is considered to be metadata about the file, as are its creation date, the access list of who is allowed to access it, and a thumbnail view of how the file would look when printed.

Figure 1 - Objects Including Data and Metadata

Metadata is often blurred together with data, and can be considered to be data depending on the context. The compound document in the example below illustrates this. An indication of which typeface to use for a paragraph of text is often referred to as metadata associated with the text, as are the location offset and scaling factor describing how an image should be rendered. While this information is indeed metadata to the paragraph and image, respectively, from the context of the document it is an intrinsic part of the document that must be present for it to be rendered as the user intended.

Figure 2 - Composite Objects Including Metadata as Data

Thus, while in principle data can stand alone without its associated metadata, in reality the metadata often provides the context that makes the data usable. (After all, think about how hard it would be to find a file without file names or directories.)

Metadata for Applications

For most applications, being able to attach metadata to stored data is a fundamental requirement for structured storage. By storing metadata alongside, or intermixed with, the data, applications are able to ensure that the data is sufficiently described for manipulation or display to the end user.

Since there is no universal file format, each application vendor has had to choose between living with the limits of standard file formats and creating its own proprietary format. Standardized file formats have emerged over the years, and many can be extended with application-defined tags or properties, but more often than not application developers resort to creating their own format. Even with general formats such as XML, the files are not self-describing without additional descriptive information, such as a schema (more metadata).

While object storage does not create a universal file format, it does provide a consistent and standard way for applications to store metadata along with data, in a format that is independent from any specific application. Giving applications a consistent way to package up all of the data and metadata into a single storage object, then commit it atomically to storage, provides many advantages over ad-hoc solutions.

For example, consider someone who is writing a web-based blogging system. Each blog entry has the body text in HTML (the data), and a series of metadata items associated with the post, such as a title, creation date, posting date, posting status (draft, posted), and an author. A typical design would be to use a database to store the metadata, and store the HTML posts as files. In this implementation, even if the data is also stored in the database (always a temptation, but rarely a good idea), databases are intrinsically loosely coupled data stores, and much complexity and fragility ensue.

Contrast this with an application designed around an object store. For each post, a "Blog Post Object" is created, which includes the post data and the metadata. The object is committed to storage as an atomic element, where it persists. Each committed object is self-describing, so no schema is needed, and the complexity facing the application writer is vastly reduced.
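To make the "Blog Post Object" idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular object store works: the function names `commit_object` and `read_object`, the JSON layout, and the use of a local directory as the "store" are all hypothetical. The point is simply that data and metadata travel together and become visible in a single atomic step.

```python
import json
import os
import tempfile
import uuid


def commit_object(store_dir, data, metadata):
    """Commit data and metadata together as one object; return its ID.

    Hypothetical helper: a real object store would provide this service itself.
    """
    object_id = uuid.uuid4().hex
    record = {"metadata": metadata, "data": data}
    # Write to a temporary file, then rename it into place. os.replace is
    # atomic on a single filesystem, so a reader sees the whole object or nothing.
    fd, tmp_path = tempfile.mkstemp(dir=store_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(record, f)
    os.replace(tmp_path, os.path.join(store_dir, object_id + ".json"))
    return object_id


def read_object(store_dir, object_id):
    """Return (data, metadata) for a previously committed object."""
    path = os.path.join(store_dir, object_id + ".json")
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    return record["data"], record["metadata"]


if __name__ == "__main__":
    store = tempfile.mkdtemp()
    post_id = commit_object(
        store,
        "<p>Hello, world.</p>",
        {"title": "First post", "status": "draft", "author": "alice"},
    )
    print(read_object(store, post_id))
```

Note that the application never maintains a separate table of titles and authors that could drift out of sync with the post files; whatever describes the post is stored in the same object as the post.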

Now one might question how an object store is different from a database, and that is a good question. In fact, one could consider a database a specialized form of an object store, and an object store a specialized form of a database. Ultimately, both of these perspectives are correct. The key aspect to keep in mind is where services are being provided: in a database, services are provided to the application by a middle layer that runs on top of a non-intelligent storage system, whereas with object-based storage, services are provided by the storage system itself. This is a key difference, which will be the basis of much of the remainder of this series of articles.

Metadata for Storage Systems

Today's storage systems are like illiterate librarians managing a building full of pages of paper. People give them a page, and they put it in a location; people ask for the page at a given location, and they hand it back. While this works, it's not very intelligent.

But what if storage could be more? What if you could ask for a book? What if you could ask for all the books about a given topic? Or by a given author? Object storage is our literate librarian, who understands the metadata associated with stored objects.

When the storage system understands what is being stored, this enables all sorts of capabilities and optimizations that would otherwise not be possible. Now the storage system can provide the ability to search content. Now the storage system can intelligently optimize storage and retrieval performance and latency. And most importantly, now there is a way for the application and the storage system to communicate with each other.
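The "literate librarian" can be sketched as a toy in-memory store that answers questions about metadata rather than only about locations. This is a hedged illustration only: the `ObjectStore` class, its `put`, `get`, and `find` methods, and the metadata keys used below are assumptions made for the example, not the interface of any real storage system.

```python
class ObjectStore:
    """Toy in-memory object store that understands per-object metadata."""

    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata)
        self._next_id = 0

    def put(self, data, metadata):
        """Store data with its metadata and return the new object ID."""
        object_id = self._next_id
        self._next_id += 1
        self._objects[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id):
        """The 'illiterate librarian' operation: fetch by location only."""
        return self._objects[object_id]

    def find(self, **criteria):
        """Return IDs of objects whose metadata matches every criterion."""
        return [
            oid
            for oid, (_, meta) in self._objects.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        ]


if __name__ == "__main__":
    library = ObjectStore()
    library.put(b"...", {"author": "Melville", "topic": "whaling"})
    library.put(b"...", {"author": "Melville", "topic": "law"})
    library.put(b"...", {"author": "Darwin", "topic": "biology"})
    print(library.find(author="Melville"))
    print(library.find(author="Melville", topic="law"))
```

Because the query runs inside the store, the application does not need to keep its own index of which object lives where; it asks by author or topic and the store does the looking.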

There are many exciting capabilities that emerge from having richer communication between the application and the storage system, and these are worth describing in more depth. The subsequent entries in this series will discuss these emergent capabilities, including query, placement, protection, permissions, representation, policies, compression and versioning.
