2009-03-27

The Fight over Cloud

As far as buzzwords go, "Cloud" is a pretty good one. I've complained about this one in the past, and cloud is a particularly annoying term to me precisely because it lacks a firm definition, is vapid, and means different things to different people.

Now that the much-debated "Cloud Manifesto" has been leaked to the web, there's a little more to chew on. The war over the definition of what cloud means has begun.

Below is my analysis of what annoys me about this document, ignoring for now the politics associated with its creation and distribution.

The author(s?) of this "manifesto" say that the document "does not intend to define a final taxonomy of cloud computing", yet that is exactly what they have ended up doing. And given that their definition reads more like an advertisement for outsourcing VMs, it is at odds with my personal view of cloud as a general architecture for building distributed systems.

To me, none of their listed criteria, either independently or in combination, make something "cloud", nor does being "cloud" imply the existence of any of these proposed criteria.

So, with that said, let's take a more detailed look at these "key characteristics of the cloud":

Scalability On Demand

While this is a value that a cloud can offer, there are lots of non-cloud systems that provide exactly this, and you can have a cloud that does not provide scalability on demand.

For example, IBM has been offering mainframe systems with extra processors that you can pay to use "on demand". I wouldn't classify a zSeries as a cloud.

Streamlining the Data Centre 

As "streamlining" is an ambiguous word, we'll assume that the authors mean outsourcing or cost reductions. However, not all uses of cloud will result in the reduction of cost (capital or infrastructure) or moving work outside the data centre.

Improving Business Processes

Technological systems are more often than not orthogonal to business process improvement. One can deploy cloud systems and end up with a worse business process, and one can improve business processes without deploying cloud technology.

Minimizing Startup Costs

Fractional allocation, which reduces the minimal quanta that must be purchased to do useful work, is the closest to an acceptable criterion for cloud. But is a VM server then a cloud? Also, one can deploy a cloud system that does not support fractional allocation.

All Together Now?

Depending on your definition of what cloud is, you could create a cloud system that provides fixed capacity processing, increases data centre costs and brings additional work into the data centre, makes no changes to business processes, and requires large startup costs.

Conversely, you could create a system that has variable on-demand capacity, reduces data centre costs, outsources data centre work, improves business processes, and minimizes startup costs, all without it being a cloud.

Of course, the linchpin of this entire argument is just what a cloud is, and this is why this manifesto matters. It is the first major attempt to plant a stake in the sand and say that Cloud is X, Y and Z.

And that is why it has generated such a storm of controversy. At stake is who gets the first-mover advantage in the struggle to define what exactly a cloud is.

2009-03-25

Object Storage, Part 4 - Query

Being able to store data is of limited use unless there are efficient mechanisms by which data can be located and retrieved. In fact, one can argue that file systems are just special-purpose data allocation and query systems. Most users and applications locate and access files through a directory of one sort or another, be it a file system, a relational database or a custom index within a proprietary file structure, which highlights the importance of query in data storage.

As part four of the object storage series of posts, this entry covers the use of metadata to provide rich query capabilities, and how these capabilities enable implicit policies as discussed in the last entry, Object Storage - Explicit and Implicit Policies.

Where Has That File Gone?

Most of us have experienced the frustration of searching for a file and not being able to find it. But we have it easy compared to how data was stored before the widespread adoption of the file system.

In block-based storage systems, data is accessed by an address that defines where the data starts, and a length, which determines how much data needs to be read (or written). While this approach is simple and efficient, it is difficult to manage, as you need to keep an external catalogue of where each item is stored.
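To make the contrast concrete, here is a minimal sketch of block-style access in Python. The device path, offset and length are all hypothetical; in a real system they would come from that external catalogue:

    # Block-style access: the application must know the address and length.
    # The device path and numbers here are invented for illustration.
    OFFSET = 1048576   # where the record starts, per the external catalogue
    LENGTH = 4096      # how many bytes to read

    with open("/dev/sdb", "rb") as device:  # hypothetical block device
        device.seek(OFFSET)                 # jump to the start address
        data = device.read(LENGTH)          # read exactly LENGTH bytes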

File systems simplified the problem for the user and the developer by creating a standard directory that could be used to organize files into hierarchical structures and associate metadata, such as file names and creation dates, with each file. With the introduction of the file system, search systems were able to look at the directory and build an index of information such as file names, dates and other metadata.

So, we progressed from walking the disk to walking the file system directories. We then moved on to reading an index, reducing the index results by filtering out all entries except the ones that matched our query. With full-text indexing of the contents of the files, in addition to the file metadata, now integrated into the operating system, a user can even filter their results to just the files that contain specific words and phrases.

Of course, these impressive improvements in technology have been largely offset by an explosion of files: hundreds of thousands to millions of files are now quite common in home and small office settings, and in large enterprises, tens to hundreds of billions of files can reside on enterprise storage systems.

But What About the Developer?

Despite these impressive achievements, for software developers, the facilities offered by a file system have not changed significantly since the early file systems were created in the 1960s and 70s. While search has improved life for users, developers are often forced to create their own application-specific index of files for searching purposes.

A good example can be found in two popular applications offered by Apple on the Macintosh platform: iTunes and iPhoto. Both of these applications store each song and photo, respectively, as a file. But as a user, you never see these files — you see a custom user interface that is designed for tasks associated with managing and playing music, and managing and organizing photos.

When you open these applications, they do not access every single photo or song. They access indexes, which allow queries to be performed quickly to get results to the user interface. Thus, if you want to hear those hits of the 1600's, or view a slideshow of photos tagged "Tafoni", iTunes or iPhoto is actually performing a query, much like the following SQL (table names invented for illustration):

SELECT * FROM songs WHERE century = '1600';

SELECT * FROM photos WHERE tag = 'tafoni';

Specifically, the applications are performing queries against metadata: the century in which a song was composed and the tags attached to a photo are both examples of metadata.

Unfortunately, a file system has only a limited, fixed set of metadata, and has no way to be extended to include application-specific metadata. But an object storage system does understand metadata, and thus offers developers powerful query features that can dramatically reduce the complexity of development while also increasing the value of metadata interoperability across applications.

What can Object Query Do?

Simply put, an object storage system can do everything that a basic relational database can do, but without needing a schema. Every piece of metadata associated with objects can be queried, and the metadata is arbitrary, defined by applications and end users.
  • Storing an object with metadata is analogous to an INSERT
  • Changing metadata on an object is analogous to an UPDATE
  • Deleting metadata or an object is analogous to a DELETE
  • An object storage query is analogous to a SELECT
Because the metadata is an intrinsic part of each stored object, you never have to worry about transactional consistency, or inconsistencies between an index and the actual metadata of the object.
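To make the analogy concrete, here is a rough sketch in Python against a hypothetical object storage client. Every API name here is invented for illustration, not taken from any real product:

    # Hypothetical client; put/update_metadata/query/delete are invented names.
    store = ObjectStore("my-repository")
    song_bytes = open("madrigal.mp3", "rb").read()

    # INSERT: store data together with arbitrary, application-defined metadata
    oid = store.put(data=song_bytes,
                    metadata={"artist": "Monteverdi", "century": "1600"})

    # UPDATE: change the metadata in place; no external index to keep in sync
    store.update_metadata(oid, {"rating": 5})

    # SELECT: query directly against the metadata
    results = store.query('century == "1600"')

    # DELETE: remove the object, and its metadata goes with it
    store.delete(oid)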

One can visualize implicit policies as simply policies performed against a query result: specify the metadata constraints, perform a query, and apply the policy to the results. (In reality, it is a little more complex, but we'll get to notifications later in the series.)
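Continuing with the hypothetical client above, an implicit policy engine could be sketched as nothing more than a query plus an action:

    # Implicit policy as "query, then act"; the API remains hypothetical.
    def apply_policy(store, constraint, action):
        for obj in store.query(constraint):  # select objects by metadata
            action(obj)                      # apply the policy to each match

    # e.g. keep two copies, one remote, of every financial document
    apply_policy(store, 'doctype == "financial"',
                 lambda obj: store.replicate(obj, copies=2, remote=True))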

If iTunes used Object Storage...

To illustrate how query in the storage system is of significant value to developers, let's imagine that Apple included an object storage system as part of the Mac OS, and had written iTunes to use object storage instead of using SQLite.

When you first start up iTunes, it remembers the last view you were looking at. A view is the result of a query, so iTunes would issue that query to the storage system for the metadata of all song objects that match the query parameters. This would return the list of metadata that is used to display the list of songs. When a user double-clicks to play a song, iTunes would open the song data from the corresponding object and start playing.
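Sticking with the same hypothetical client, the start-up flow might look something like this, where display_row and play stand in for iTunes' user interface and playback code:

    # Hypothetical sketch of iTunes starting up against an object store.
    last_view = 'genre == "madrigal" and century == "1600"'  # the saved view's query

    songs = store.query(last_view)   # metadata for every matching song object
    for song in songs:
        display_row(song.metadata)   # populate the list of songs

    def on_double_click(song):
        audio = store.get(song.oid)  # fetch the song's data on demand
        play(audio)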

There is no need to incorporate a database, no need to create a fixed schema (just add metadata and go!), no need to worry about consistency or corruption, and way less code to manage.

Most importantly, because the query is done by the object storage system, your query performance scales as your storage infrastructure scales. So a query across a million objects on a home PC runs just as fast as a query across a billion objects within an enterprise. And as the object storage system is improved over time, all the applications get faster, for free.

But easing the load on the software developer isn't the only value. Because the objects are stored in a common storage system, iTunes can, if it desires, allow any application to query for the music it manages. So if another application developer wants to search for music, they can construct a query that will return results from iTunes' repository. Controlled access across applications opens up all sorts of opportunities to create systems built around loosely coupled agents accessing and manipulating a shared repository. For example, a format conversion tool could convert MP3 files into AAC files in the background, transparently, and fully interoperate with iTunes.
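A sketch of such a background conversion agent, with the transcoding step left as a hypothetical helper:

    # Loosely coupled agent sharing iTunes' repository; API still hypothetical.
    def convert_mp3_library(store):
        for obj in store.query('format == "mp3"'):            # find every MP3 song
            aac_bytes = transcode_to_aac(store.get(obj.oid))  # hypothetical codec call
            store.put(data=aac_bytes,
                      metadata={**obj.metadata, "format": "aac"})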

Of course, if an application wanted to keep its objects private, that's just a permissions setting. Which is an excellent segue into the next entry — Security in an object storage system.

2009-03-12

On Ten Year Trends - Parallel Goes Mainstream

Several bloggers have been discussing the current "ten year trends" that are transforming the computing environment, including Cloud Computing, Virtualization, Flash, and Mobility.

However, there is a deeper trend that, while well underway, is so significant that it is often overlooked. That is the forced transition to loosely coupled distributed computing models, caused by the knee in the curve of increasing compute performance by making individual processors faster.

Over the next five years, I believe we are going to see a full-fledged transition in hardware systems from tightly coupled single- and multi-processor shared memory systems to loosely coupled many-processor systems that communicate via message passing. This trend is already evident in the high-performance computing space, where, having hit the limits of individual processor performance and then the limits of shared memory, virtually all new architectures have adopted this approach.
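As a toy illustration of the message-passing model, here is a minimal Python sketch using local processes; a real loosely coupled system would pass its messages over a network fabric rather than between processes on one machine:

    from multiprocessing import Process, Queue

    def worker(inbox, outbox):
        # No shared memory: all state arrives and leaves as messages.
        for item in iter(inbox.get, None):  # stop on the None sentinel
            outbox.put(item * item)

    if __name__ == "__main__":
        inbox, outbox = Queue(), Queue()
        p = Process(target=worker, args=(inbox, outbox))
        p.start()
        for n in range(4):
            inbox.put(n)                    # send work as messages
        inbox.put(None)                     # tell the worker to shut down
        print([outbox.get() for _ in range(4)])
        p.join()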

One of the fundamental drivers of this transition is a new economic model that measures computing power per dollar, or computing power per watt. This model favours many smaller, less powerful processors over fewer high-power processors. After all, Atom and ARM processors cost only a few dollars per core in volume, and perform around as well as a 1 GHz Pentium did just a few years ago, with a fraction of the power and heat dissipation. I predict we will soon see a transition to inexpensive servers where one or two rack units are crammed full of hundreds to thousands of processors of this class, all connected together by what looks like an Ethernet fabric.

The timing is right — Microsoft's talking about it, IBM's already done it, and the mobile market has brought the component volumes up to the point where the economics are right.

But five years does not make a ten-year trend: the larger change is to programming models, languages and mindsets. Very few people in the industry can work comfortably and productively in massively parallel environments, as we have seen in the difficulty of building scalable web-based systems like Twitter. Computer Science still teaches programming and computer architecture around the old model of the "one true core", with parallel and distributed programming taught as a speciality, often at the graduate level.

It is going to take five or more years for this new massively parallel hardware and these new ways of thinking to filter down to the next generation of programmers, who can think parallel, create new tools, libraries and stacks, and then play tens to hundreds of thousands of nodes like an instrument.

Imagine having a hundred thousand cores on your desktop. Imagine having ten million of them in your data centre. The problems are significant, and exciting.

2009-03-09

Object Storage, Part 3 - Explicit and Implicit Policies

Once metadata becomes an intrinsic part of each stored object, application-specified metadata provides a rich vocabulary through which applications can communicate with the underlying storage system and administrators can manage the data stored within it.

As part three of the object storage series of posts, this entry covers the ability to specify explicit and implicit policies, which gives both the application and the administrator control over how data is managed, and drives additional value into the storage subsystem. This entry builds on top of the last entry, Object Storage - Metadata, which introduced the importance of metadata and why it applies to storage, and to object storage in particular.

More Than Just a Bit Bucket

When people first think about storage, they think about bits. And that's fundamentally what storage systems do. They take your bits, keep them, and give them back to you. But if that's all a storage system does, it's pretty dumb, as there is a lot more to what applications need than just storing bits.

Storage is also more than the storage system and applications — The storage administrator is also an important player in enterprise storage, and is often charged with goals that may or may not agree with the desires of the application.

So, if storing bits is "Dumb Storage", what is "Intelligent Storage"? Well, an easy example is the list below of many of the things that applications and administrators want their storage system to do for them:
  • Index
  • Protect
  • Share
  • Compress
  • Replicate
  • Distribute
  • Archive
  • Cache
  • Tier
  • Version
And this list is just the beginning.

Explicit Policies

In order for these higher-level behaviours desired by administrators and applications to be fulfilled, they first need to be communicated to the storage system. And metadata fulfils this role perfectly.

If an application wishes for a given piece of stored data to be protected such that only that application can access it, it needs only to attach metadata to the object that indicates this intent, and trust that the storage system will honour its request. This agreement between the application and the storage system is the contract of functionality.

Want multiple copies? Add metadata. Want it shredded on delete? Add metadata. Want index keywords? Add metadata. Etc.

Thus, through a vocabulary of well-defined metadata that it will honour, a storage system can advertise its capabilities to an application. And the sum of this metadata forms an explicit policy, specified by the application to the storage system, as an atomic part of the stored object.
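As a sketch of what such a vocabulary might look like, with a hypothetical client and metadata names invented for illustration:

    # Explicit policy: the intent travels with the object as metadata.
    # The vocabulary shown here is invented, not from any real product.
    report_bytes = open("q3-report.pdf", "rb").read()
    store.put(data=report_bytes,
              metadata={
                  "copies": 3,               # want multiple copies
                  "shred-on-delete": True,   # want secure deletion
                  "keywords": "q3,finance",  # want index keywords
              })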

Implicit Policies

But this isn't the only way that policies can work. While the application knows best from its perspective, it is only one small and limited part of an enterprise. There are larger forces at work — desires to ensure that data is not lost in a disaster, desires to reduce costs, desires to meet legal obligations, and to manage information over time and space.

Enter the storage administrator.

Like explicit policies, implicit policies are also built around metadata. But instead of an intent being explicitly stated as metadata directives, implicit policies map an intent onto a collection of objects with common characteristics.

Let's imagine that the storage administrator wishes to ensure that critical financial documents are protected against a site disaster, and are retained for a minimum of ten years. The administrator can create an implicit policy that says:

For all objects with metadata that indicates it is a financial document, make two copies, one remote, and keep them for ten years.
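As a sketch, using an invented rule syntax, the administrator's policy might be written as:

    # Implicit policy: a rule defined once by the administrator, mapping a
    # metadata query to an action. The syntax is invented for illustration.
    policy = {
        "name":   "Finance",
        "match":  'doctype == "financial"',  # which objects it governs
        "action": {"copies": 2, "remote-copies": 1, "retain-years": 10},
    }
    store.install_policy(policy)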

Once again, the metadata is key. The metadata might be a path in a file system, or a document type, or the division within an organization. Regardless, the administrator now has a tool to take subsets of stored data, limited only by their imagination and the available metadata, and make things happen.

And unlike explicit policies, implicit policies can be changed without having to change the metadata of the stored objects. By combining both types of policies, an application can specify metadata (an explicit policy) that selects which implicit policy is to be used. And all of these approaches can be combined: An application may say that this object "Must be high performance", that the retention is "governed by implicit policy named Finance", and say nothing about replication.
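As a final sketch, again with an invented vocabulary, the metadata on such an object might read:

    # Combined policies on one object: explicit performance intent, retention
    # delegated to a named implicit policy, and replication left unspecified.
    metadata = {
        "performance":      "high",     # explicit: must be high performance
        "retention-policy": "Finance",  # selects the administrator's implicit policy
        # replication intentionally unstated; system defaults apply
    }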

As can be imagined, this approach to storage management is very powerful, and allows the value of storage to be expressed in terms of business value to an organization. In our next entry, we will look at some concrete examples of explicit and implicit policies, and see how these are often implemented in object storage systems.