2009-02-15

Object Storage, Part 1 - Introduction

Object-Based Storage is an alternative approach to specifying an interface between higher level application programs and storage devices for the purposes of storing digital data. While it is not yet commonly used, its many advantages are driving increasing adoption, and over the next decade it is expected to become widespread.

As the first part of a series of blog posts talking about object-based storage, this post introduces object-based storage in the context of other widely used storage interface technologies, and briefly covers the advantages inherent in object-based storage.

Stream-Based Storage

In early computing systems, data was stored as a series of bits or bytes that could be written or read over time. Examples included ticker tape for bit-streamed data, and punch cards for byte-streamed data. Within a computing system, some of the first persistent storage systems were based around writing data as sequences of magnetic signals on tape or on a rotating drum. These stored values could then be read back in the same order they were written.

While this approach is still used in storage devices such as tape, the main disadvantage of stream-based storage for low-latency purposes was the time required to access a given piece of information. To access it, a program had to specify where within the stream the information was located, which required keeping track of many bits of addressing information. Thus, in order to save bits, the stream was divided into equally sized "blocks" that could be used to refer to locations within it.

Block-Based Storage

From an interface standpoint, today's hard disks are still conceptually modelled as a long stream of bytes divided into equally sized blocks. All accesses to the storage devices are performed by reading or writing blocks over industry standard protocols such as SCSI and Fibre Channel, which typically specify that each block contains 512 bytes of user-accessible information.
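
To make the block model concrete, here is a minimal sketch of how a byte position in the long stream maps onto block numbers. The function names are made up for illustration, and the 512-byte block size is simply the typical value mentioned above; real devices may report a different size.

```python
BLOCK_SIZE = 512  # bytes per block (typical value; an assumption for this sketch)


def locate(byte_offset: int, block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Return (block number, offset within that block) for a byte offset."""
    if byte_offset < 0:
        raise ValueError("byte_offset must be non-negative")
    return divmod(byte_offset, block_size)


def blocks_spanned(byte_offset: int, length: int,
                   block_size: int = BLOCK_SIZE) -> range:
    """Return the block numbers touched by `length` bytes starting at `byte_offset`."""
    if length <= 0:
        return range(0)
    first = byte_offset // block_size
    last = (byte_offset + length - 1) // block_size
    return range(first, last + 1)
```

Note that a 100-byte read starting at byte 500 touches two blocks, which is why a device that only speaks in whole blocks may transfer more data than the application asked for.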

Under the covers, the hard disk controllers understand that the data isn't actually stored in one long sequential stream of bits, but is spread across multiple platters of spinning disks, and the physical location of a given block is found by translating the block address into a distance and angle on the surface of a disk. For solid-state storage devices, block addresses are mapped to physical groups of semiconductor devices arranged in two or three-dimensional structures.
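
As a rough illustration of that translation, the classic cylinder/head/sector (CHS) formula turns a linear block address into a cylinder (the distance from the center), a head (which platter surface) and a sector (the angular position). This is only a sketch: the geometry values below are assumed for the example, and modern drives use zoned recording and internal remapping, so the real mapping is more involved.

```python
HEADS_PER_CYLINDER = 16   # assumed geometry, for illustration only
SECTORS_PER_TRACK = 63    # assumed geometry, for illustration only


def lba_to_chs(lba: int,
               heads: int = HEADS_PER_CYLINDER,
               sectors: int = SECTORS_PER_TRACK) -> tuple[int, int, int]:
    """Translate a linear block address into (cylinder, head, sector).

    Sectors are numbered from 1 by convention; cylinders and heads from 0.
    """
    if lba < 0:
        raise ValueError("lba must be non-negative")
    cylinder = lba // (heads * sectors)
    head = (lba // sectors) % heads
    sector = (lba % sectors) + 1
    return cylinder, head, sector
```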

File-based Storage

Given that application data, be it documents or images or sound data, are often larger than a block and may not fill up blocks completely, a higher-level layer of software maps these documents, typically called files, onto the blocks on disk. This software is typically called a file system, and provides a directory of files, metadata about the files, and information known as "extents", which point to the list of blocks that contain the contents of the file.
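
The extent lookup can be sketched in a few lines. The `Extent` layout and the function name here are hypothetical and far simpler than any real file system's on-disk format, but they show the translation from a position in a file to a block on the device.

```python
from dataclasses import dataclass

BLOCK_SIZE = 512  # bytes per block (assumed for this sketch)


@dataclass(frozen=True)
class Extent:
    file_block: int  # first logical block of the file covered by this extent
    disk_block: int  # first device block where this run is stored
    count: int       # number of contiguous blocks in the run


def file_offset_to_disk(extents, byte_offset, block_size=BLOCK_SIZE):
    """Return (device block, offset within block) for a byte offset in a file."""
    logical_block, within = divmod(byte_offset, block_size)
    for extent in extents:
        if extent.file_block <= logical_block < extent.file_block + extent.count:
            return extent.disk_block + (logical_block - extent.file_block), within
    raise ValueError("offset is beyond the blocks described by the extents")
```

For example, a six-block file stored as `[Extent(0, 100, 4), Extent(4, 250, 2)]` has its fifth block (logical block 4) at device block 250, even though the first four blocks sit at 100 through 103.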

When a file system is layered on top of a block storage device, the user is already being presented with a form of object storage. In this case, there are two types of objects, files and directories. However, this is where the similarities end.

Object-Based Storage

In an object-based storage interface, instead of writing a stream of data, or writing blocks of data, or writing a file, an application program writes an object. In interfaces such as XAM (eXtensible Access Method) and OSD (Object Storage Device), instead of manipulating files, applications manipulate objects, and can perform a much wider variety of operations on these objects.

There are several aspects of objects that differentiate them from files:
  • Objects are compound. While files can also contain multiple different types of data, as is commonly found in a .zip or .tar archive file, objects intrinsically bundle multiple different pieces of information into one package. Metadata and data are all stored with an internal directory that allows each sub-component of the object to be accessed.
  • Objects are self-describing. Instead of just having the metadata that is associated with a file, such as creation time and file name, objects have arbitrary metadata specified by an application that is specific to the application and data's nature. For example, a sound object can have sampling frequency, gain levels and other sound-specific metadata, while a document may have metadata describing the application that created it, and may even contain a low-resolution preview image of the document.
  • Objects are self-contained. Unlike a file, the metadata associated with an object is an intrinsic part of the object. When a file is stored on disk or even copied from one location to another, the metadata is not stored as part of the file contents, and may not be transferred along with the file data. With objects, the metadata is an intrinsic part of the object, and cannot be separated without creating a different object.
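
The three properties above can be illustrated with a toy data structure. This is a sketch only: the class and method names are invented for the example, and deriving the identifier from a hash of the metadata and data is an assumption used to show that changing the metadata yields a different object. It is not how XAM or OSD define object identity.

```python
import hashlib
import json


class StoredObject:
    """Toy object: named components plus application metadata in one package."""

    def __init__(self, metadata: dict, components: dict):
        self.metadata = dict(metadata)      # self-describing: arbitrary app metadata
        self.components = dict(components)  # compound: component name -> bytes

    def directory(self) -> list:
        """Internal directory listing the sub-components of the object."""
        return sorted(self.components)

    def get(self, name: str) -> bytes:
        """Access one sub-component by name."""
        return self.components[name]

    def object_id(self) -> str:
        """Identifier covering metadata and data together (an assumption, see above)."""
        digest = hashlib.sha256()
        meta = json.dumps(self.metadata, sort_keys=True).encode("utf-8")
        parts = [meta]
        for name in sorted(self.components):
            parts.append(name.encode("utf-8"))
            parts.append(self.components[name])
        for part in parts:
            # Length-prefix each part so that part boundaries are unambiguous.
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
        return digest.hexdigest()
```

A sound object might be built as `StoredObject({"sampling_hz": 44100, "gain_db": -3}, {"samples": b"...", "preview": b"..."})`; altering `gain_db` produces a different `object_id()` even though the sample data is unchanged.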

How is This Different?

This isn't new, and these three concepts have been widely used for organizing the contents of files stored on block devices in most software systems. Every Microsoft Word document, Macintosh Resource Fork and Windows Executable is already an object based on this definition. So, why is object storage needed when these concepts are already widely implemented?

The key to answering this question is looking at the entire storage stack, instead of looking at each part in isolation. In a file-based storage system, if an application wants to load a preview image from a file, the application must first open the file, know how to interpret the file's format (which is typically application-specific), and read the data. At the next layer down, the file system must translate these read requests (which view the file as a stream of data) into block addresses using the file system's extent metadata and issue block requests to the storage device. The storage device, in turn, needs to look up the physical location of those blocks before the data can be returned back up to the application.

With object-based storage, the application tells the storage system directly that it wants data from a specific object, and the request can be sent all the way down to the storage device. Because the device understands the structure of objects, it can handle the request more intelligently and more efficiently.
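
The contrast between the two request paths can be sketched as follows. Everything here is hypothetical: the toy file layout (a length-prefixed JSON index followed by the payloads) and the `ToyObjectDevice` class are invented for illustration and do not correspond to any real file format or to the XAM or OSD interfaces.

```python
import io
import json
import struct


def pack_file(components: dict) -> bytes:
    """Build a toy application-specific file: index length, JSON index, payloads."""
    index, body, pos = {}, b"", 0
    for name, data in components.items():
        index[name] = [pos, len(data)]
        body += data
        pos += len(data)
    raw_index = json.dumps(index).encode("utf-8")
    return struct.pack(">I", len(raw_index)) + raw_index + body


def read_preview_from_file(stream) -> bytes:
    """File path: the application must know the format and do the lookups itself."""
    (index_len,) = struct.unpack(">I", stream.read(4))
    index = json.loads(stream.read(index_len))
    start, length = index["preview"]
    stream.seek(4 + index_len + start)
    return stream.read(length)


class ToyObjectDevice:
    """Object path: the device keeps the component directory and serves by name."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id: str, components: dict) -> None:
        self._objects[object_id] = dict(components)

    def read(self, object_id: str, component: str) -> bytes:
        return self._objects[object_id][component]
```

In the file case the application issues several reads and a seek and has to understand the index format; in the object case a single `device.read("doc-1", "preview")` carries the intent down to the layer that holds the data.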

This may seem like a subtle distinction, but there are numerous advantages resulting from standardizing object structures and pushing down awareness of object contents to the storage layer, and this additional awareness enables many new capabilities and increased efficiencies. These will be the subject of the entries to follow.

2 comments:

Larry Calihan said...

David,

Found your blog very intriguing. Do you have other white papers or articles you can refer me to?

Larry

David Slik said...

Larry,

This is the first blog entry in a series of entries that I will be posting about object-based storage and the advantages and disadvantages of this approach, so stay tuned!

Also, here are a couple of excellent articles from the Internet that cover this area. There are quite a few interesting academic papers covering the subject going back over the last three decades.

http://developers.sun.com/solaris/articles/osd.html
http://weblog.infoworld.com/storageadviser/archives/001288.html