2009-03-25

Object Storage, Part 4 - Query

Being able to store data is of limited use unless there are efficient mechanisms by which data can be located and retrieved. In fact, one can argue that file systems are just special purpose data allocation and query systems. Most users and applications locate and access files through a directory of one sort or another, be it a file system, relational database or a custom index within a proprietary file structure, highlighting the importance of query in data storage.

As part four of the object storage series of posts, this entry covers the use of metadata to provide rich query capabilities, and how these capabilities enable implicit policies as discussed in the last entry, Object Storage - Explicit and Implicit Policies.

Where Has That File Gone?

Most of us have experienced the frustration of searching for a file and not being able to find it. But we have it easy compared to how data was stored before the widespread adoption of the file system.

In block-based storage systems, data is accessed by an address that defines where the data starts, and a length, which determines how much data needs to be read (or written). While this approach is simple and efficient, it is difficult to manage, as you need to keep an external catalogue of where each item is stored.
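
To make that bookkeeping burden concrete, here is a minimal sketch in Python that uses an in-memory byte array as a stand-in for a block device. The `catalogue` dictionary, the function names and the sample items are all hypothetical; the point is only that the raw bytes say nothing about where an item lives, so the application has to track it.

```python
# A toy "block device": a flat array of bytes with no notion of files.
device = bytearray(64)

# The external catalogue the application must maintain itself:
# item name -> (start address, length). Names are illustrative only.
catalogue = {}

def write_item(name, data, address):
    """Write data at a raw address and record where it went."""
    device[address:address + len(data)] = data
    catalogue[name] = (address, len(data))

def read_item(name):
    """Look up the address and length, then read the raw bytes."""
    address, length = catalogue[name]
    return bytes(device[address:address + length])

write_item("letter", b"Dear Bob", 0)
write_item("invoice", b"Total: 42", 16)
```

Lose the catalogue and the data is still on the device, but there is no longer any way to tell where one item ends and the next begins.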

File systems simplified the problem for the user and the developer by creating a standard directory that could be used to organize files into hierarchical structures and associate metadata, such as file names and creation dates, with each file. With the introduction of the file system, search systems were able to look at the directory and build an index of information such as file names, dates and other metadata.

So, we progressed from walking the disk to walking the file system directories. We then moved on to reading an index and filtering its entries down to the ones that match our query. With full-text indexing of file contents, in addition to file metadata, now integrated into the operating system, a user can even narrow their results to just the files that contain specific words and phrases.
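
The difference between walking the directories on every search and consulting an index can be sketched as follows. This is a simplified illustration in Python, assuming an index keyed only on file name; the function names are made up for this example, and real desktop search indexes are far more elaborate.

```python
import os

def walk_search(root, name):
    """Walk every directory under root on each search (slow for large trees)."""
    return sorted(
        os.path.join(dirpath, f)
        for dirpath, _dirs, files in os.walk(root)
        for f in files
        if f == name
    )

def build_index(root):
    """Walk once and record file name -> list of paths."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for f in files:
            index.setdefault(f, []).append(os.path.join(dirpath, f))
    return index

def index_search(index, name):
    """Answer the same question from the index without touching the disk."""
    return sorted(index.get(name, []))
```

Both searches return the same paths, but the index has to be kept up to date as files change, which is exactly the kind of duplicated bookkeeping the rest of this entry is about.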

Of course, these impressive improvements in technology have been largely offset by an explosion in the number of files: hundreds of thousands to millions of files are now quite common in home and small office settings, and large enterprises can have tens to hundreds of billions of files on their storage systems.

But What About the Developer?

Despite these impressive achievements, for software developers, the facilities offered by a file system have not changed significantly since early file systems were created in the 1960s and 1970s. While search has improved life for users, developers are often forced to create their own application-specific index of files for searching purposes.

A good example is to look at two popular applications offered by Apple on the Macintosh platform: iTunes and iPhoto. Both of these applications store each song and photo, respectively, as a file. But as a user, you never see these files — you see a custom user interface that is designed for tasks associated with managing and playing music, and managing and organizing photos.

When you open these applications, they do not access every single photo or song. They access indexes, which allow queries to be performed quickly to get results to the user interface. Thus, if you want to hear those hits of the 1600s, or view a slideshow of photos tagged "Tafoni", iTunes and iPhoto are actually performing a query, much like the following SQL:

SELECT * FROM songs WHERE century = '1600'

SELECT * FROM photos WHERE tag = 'tafoni'

Specifically, the applications are querying metadata: the century in which a song was composed and the tags on a photo are both examples of metadata.

Unfortunately, a file system only has limited fixed metadata, and doesn't understand or have a way to be extended to include application-specific metadata. But an object storage system does understand metadata, and thus offers developers powerful query features that can dramatically reduce the complexity of development while also increasing the value of metadata interoperability across applications.

What Can Object Query Do?

Simply put, an object storage system can do everything that a basic relational database can do, but without needing a schema. Every piece of metadata associated with objects can be queried, and the metadata is arbitrary, defined by applications and end users.
  • Storing an object with metadata is analogous to an INSERT
  • Changing metadata in an object is analogous to an UPDATE
  • Deleting metadata or an object is analogous to a DELETE
  • And an object storage query is analogous to a SELECT
Because the metadata is an intrinsic part of each stored object, you never have to worry about transactional consistency, or inconsistencies between an index and the actual metadata of the object.
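
The analogy in the list above can be sketched with a small in-memory store. This is an illustrative toy in Python, not the API of any real object storage product; the `ObjectStore` class, its method names and the sample metadata are assumptions made for this example.

```python
import itertools

class ObjectStore:
    """Toy schemaless store: each object is data plus arbitrary metadata."""

    def __init__(self):
        self._objects = {}              # object id -> (data, metadata dict)
        self._ids = itertools.count(1)

    def put(self, data, **metadata):
        """Analogous to INSERT: store data together with its metadata."""
        oid = next(self._ids)
        self._objects[oid] = (data, dict(metadata))
        return oid

    def update_metadata(self, oid, **changes):
        """Analogous to UPDATE: change metadata in place."""
        self._objects[oid][1].update(changes)

    def delete(self, oid):
        """Analogous to DELETE: remove the object and its metadata."""
        del self._objects[oid]

    def query(self, **constraints):
        """Analogous to SELECT: ids of objects matching every constraint."""
        return [
            oid for oid, (_data, meta) in self._objects.items()
            if all(meta.get(k) == v for k, v in constraints.items())
        ]

store = ObjectStore()
song = store.put(b"...audio...", kind="song", century="1600")
photo = store.put(b"...pixels...", kind="photo", tag="tafoni")
```

No schema was declared anywhere: the song and the photo carry different metadata keys, and `store.query(tag="tafoni")` still finds the photo. The toy scans every object on each query; a real system would presumably index the metadata to stay fast at scale.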

And, one can visualize implicit policies as just policies performed against a query result. Specify the metadata constraints, perform a query, and apply the policy to the results. (In reality, it is a little more complex, but we'll get to notifications later in the series.)
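
As a rough sketch of that idea in Python, with objects represented as plain dictionaries of metadata: the `apply_policy` helper and the `keep_two_copies` action are hypothetical, chosen only to show the query-then-apply shape, and they ignore the notification machinery mentioned above.

```python
# Objects are represented here as plain dicts of metadata (illustrative).
objects = [
    {"id": 1, "kind": "photo", "tag": "tafoni", "replicas": 1},
    {"id": 2, "kind": "song", "century": "1600", "replicas": 1},
    {"id": 3, "kind": "photo", "tag": "beach", "replicas": 1},
]

def query(objs, **constraints):
    """Return the objects whose metadata matches every constraint."""
    return [o for o in objs
            if all(o.get(k) == v for k, v in constraints.items())]

def apply_policy(objs, constraints, action):
    """Implicit policy: run the query, then apply the action to each result."""
    matched = query(objs, **constraints)
    for obj in matched:
        action(obj)
    return [o["id"] for o in matched]

def keep_two_copies(obj):
    """Hypothetical policy action: raise the replica count to at least 2."""
    obj["replicas"] = max(obj["replicas"], 2)

affected = apply_policy(objects, {"kind": "photo"}, keep_two_copies)
```

The policy never names individual objects. Any photo stored later would be picked up the next time the query runs, which is what makes the policy implicit.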

If iTunes Used Object Storage...

To illustrate how query in the storage system is of significant value to developers, let's imagine that Apple included an object storage system as part of the Mac OS, and had written iTunes to use object storage instead of using SQLite.

When you first start up iTunes, it remembers the last view you were looking at. A view is the result of a query, so iTunes would issue that query to the storage system for the metadata of all song objects that match the query parameters. This would return the list of metadata that is used to display the list of songs. When a user double-clicks to play a song, iTunes would open the song data from the corresponding object, and start playing.
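
The flow just described, metadata first and song data only on demand, might look something like this. It is a sketch in Python under obvious simplifying assumptions: `SongLibrary`, its methods and the sample songs are invented for illustration and say nothing about how iTunes is actually built.

```python
class SongLibrary:
    """Toy store that separates metadata queries from reading song data."""

    def __init__(self):
        self._meta = {}       # object id -> metadata dict
        self._data = {}       # object id -> song bytes
        self.data_reads = 0   # counts how often song data is actually opened

    def add(self, oid, data, **metadata):
        self._meta[oid] = metadata
        self._data[oid] = data

    def view(self, **constraints):
        """A view is a query result: metadata only, no song data is read."""
        return [
            dict(meta, id=oid) for oid, meta in self._meta.items()
            if all(meta.get(k) == v for k, v in constraints.items())
        ]

    def open(self, oid):
        """Called when the user double-clicks a row: read the song data."""
        self.data_reads += 1
        return self._data[oid]

library = SongLibrary()
library.add("a", b"lute-audio", title="Lute Song", century="1600")
library.add("b", b"synth-audio", title="Synth Song", century="1900")

last_view = {"century": "1600"}      # the query remembered from last session
rows = library.view(**last_view)
```

Building `rows` touches no song data at all; `library.open(rows[0]["id"])` is the only call that does, and it runs only when the user picks a song.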

There is no need to incorporate a database, no need to create a fixed schema (just add metadata and go!), no need to worry about consistency or corruption, and way less code to manage.

Most importantly, because the query is done by the object storage system, your query performance scales as your storage infrastructure scales. So a query over a million objects on a home PC runs just as fast as a query over a billion objects within an enterprise. And as the object storage system is improved over time, all the applications get faster, for free.

But easing the load on the software developer isn't the only value. Because the objects are stored in a common storage system, iTunes can, if it chooses, allow any application to query for the music it manages. So if another application developer wants to search for music, they can construct a query that will return results from iTunes' repository. Controlled access across applications opens up all sorts of opportunities to create systems built around loosely coupled agents accessing and manipulating a shared repository. For example, a format conversion tool could convert MP3 files into AAC files in the background, transparently, and fully interoperate with iTunes.
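
A minimal sketch of such a background agent, in Python: the shared `repository`, the `query` helper and especially `fake_transcode` are stand-ins invented for this example. No real audio encoding happens here, and permissions are left out entirely.

```python
# A shared repository: each object is data plus metadata (illustrative only).
repository = [
    {"id": 1, "kind": "song", "format": "mp3", "data": b"mp3:one"},
    {"id": 2, "kind": "song", "format": "aac", "data": b"aac:two"},
    {"id": 3, "kind": "photo", "format": "jpeg", "data": b"jpeg:three"},
]

def query(**constraints):
    """Return objects whose metadata matches every constraint."""
    return [o for o in repository
            if all(o.get(k) == v for k, v in constraints.items())]

def fake_transcode(data):
    """Placeholder for a real MP3-to-AAC encoder, which is out of scope."""
    return b"aac:" + data.split(b":", 1)[1]

def conversion_agent():
    """Background agent: find MP3 songs, convert them, update the metadata."""
    converted = []
    for obj in query(kind="song", format="mp3"):
        obj["data"] = fake_transcode(obj["data"])
        obj["format"] = "aac"
        converted.append(obj["id"])
    return converted

converted_ids = conversion_agent()
player_view = [o["id"] for o in query(kind="song")]
```

The agent and the player never talk to each other. They are coupled only through the metadata they both query, so the player's view of its songs is unchanged after the conversion.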

Of course, if an application wanted to keep its objects private, that's just a permissions setting. Which is an excellent segue into the next entry: security in an object storage system.
