Microsoft Azure is Cloud Storage

Microsoft today announced their Cloud Computing initiative, the Azure Services Platform. While many of the commentators have been comparing it with Amazon's EC2 offerings, and with VMWare based hosting, Azure is as much about cloud storage as it is about computing.

Quoting from Microsoft's Port 25 Blog:

Both Windows Azure applications and on-premises applications can access the Windows Azure storage service, and both do it in the same way: using a RESTful approach. The underlying data store is not Microsoft SQL Server, however. In fact, Windows Azure storage isn't a relational system, and its query language isn't SQL. Because it's primarily designed to support applications built on Windows Azure, it provides simpler, more scalable kinds of storage. Accordingly, it allows storing binary large objects (blobs), provides queues for communication between components of Windows Azure applications, and even offers a form of tables with a straightforward query language, Chappell says.

This is worth emphasizing. Azure provides global cloud-based object storage. And it takes it one step further, by providing active objects such as queues and tables. The presence of queues, message busses and other persistent data structures are a real game-changer, as they form the location-independent "glue" by which to hold together large-scale loosely-coupled applications that are best suited for cloud-based hosting. This directly competes with Amazon's S3, and by creating a platform that runs .Net and other interpreted languages directly without the weight of a full OS running in a VM, it should be able to scale much more elegantly.

For storage object access, the API Microsoft has adopted is very similar to the HTTP API used by Bycast for the StorageGRID platform, so this continues the trend of standardization around HTTP for object storage and access.

Their API documentation can be found at the below link:


Also of interest from the announcement are Microsoft's .Net Services for Ruby, which provide first class access to the Azure services for ruby applications.

Now all we need is a XAM VIM that talks to Azure. Their containers map closely to XSets, blobs to XStreams, and stype data can be placed in tables and thus be queried. They've even provided methods to handle cache coherency for multiple simultaneous writers. Very interesting...


Discussions on the Web

XAM and content-based search

A post about the advantages of including support for full-text search in XAM implementations, and my response regarding the pros and cons of providing this functionality at the storage layer instead of the application layer.

The Universal Design Pattern

A discussion of a commonly used design pattern where objects have properties that are both inherited and overridden, and my response regarding the similarities of the XAM objects storage model, and it's use for persistent storage of serialized objects created when using this design pattern.


Web 2.0 and Archiving

A few days ago, Facebook announced that they had reached the milestone of having 10 billion images stored in their servers. This translates to over 40 billion files under management, as each image is stored at four different resolutions. In total, all of these files requires 1 PB of storage space, and most likely consumes 2 PB of physical storage, assuming their data is mirrored at a minimum of two sites for availability purposes.

If we take those 40 billion files and 1 PB of storage consumed, we can figure out that the average file size is around 25 Kbytes. All of these files need to be stored, managed, distributed and replicated — exactly the solution space for an archive. And this is why I believe that the Web 2.0 problem space has been often viewed as an archiving problem.

But the Web 2.0 problem is larger. It is not the number of objects stored that makes this an engineering challenge, but the number of objects accessed: Facebook serves 15 billion images per day, thus, serving more files each day then the number of unique images they have stored! At peak load, they are serving over 300,000 images per second. Even though a majority of the images accessed each day fall in a small (<0.01%) subset of the total image set, the images that are popular change daily, and the storage system must handle these "hot files" efficiently in order for the system not to collapse under the load.

Thus, while it is true to say that Web 2.0 sites require an archive (for all those infrequently or never again accessed objects), due to the retrieval requirements, an archive is just one part of the infrastructure required to serve a Web 2.0 workload.


Extraordinary Concurrency as the Only Game in Town

DARPA released a new report today providing a detailed analysis of the challenges involved in scaling high-performance computing up to the exascale levels, three orders of magnitude higher then today's petascale supercomputers.

The report is available for download at the below URL:

TR2008-13: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems

This report covers many fascinating issues, ranging from advancements in semiconductor technologies to interconnects to packaging and all the way up the stack to the software that would run on such a computing system.

What jumped out to me is the degree to which concurrent programming models have become so critical to scaling computing, and the degree to which this is still an open problem. Single threaded programming has reached the end of the performance road, and the only way to improve is to parallelize. This is directly in agreement with the research in concurrent programming and runtime systems that I have been running for the past decade, and I'll be sharing more thoughts related to my findings and future research over the coming months.

Resilient Exascale Systems

A computing system with millions of loosely coupled processing elements running billions of concurrent processes will be continuously encountering failures at the hardware, interconnect and software levels. A successful software environment must automatically and transparently recover from such faults without passing the burden of complexity on to the programmer.

Ultimately, the only successful approach will be a dispersal-driven replicated checkpoint-and-vote architecture which allows a trade-off to be made between reliability and efficiency. This approach offloads the complexity of error detection and recovery from the programmer, and allows the system to automatically optimize system efficiency (a higher voting degree trades compute efficiency for time efficiency, and a lower voting degree trades time efficiency for compute efficiency).

This presence of such a resiliency layer also allows the use of fabrication techniques that have far higher defect densities and error rates then traditional system fabrication, as near-perfection is no longer required. This is a key to enable feature size and voltage to continue to be reduced, to provide high-density arrays of compute elements, and to support larger dies and stacked wafers.

Storage for Exascale Systems

One of the most fundamental changes we need to make is to stop separating storage resources from computing resources. CPU power is continuing to increase, and storage density is continuing to increase, but the bandwidth and latency between the two is not.

Instead of moving the data to the compute element, we need to move the computation to the data. The first step is to replace the traditional demand-paged memory hierarchy with generic compute elements that can act as programmable data movers, create vastly larger register stores, and consider such data movement as a subset of a more general message passing approach. Instead of just sending data to fixed computational processes, we need to turn processes into messages themselves, thus allowing processes to also be sent to the location(s) where data is stored.

Ultimately, persistent storage (for checkpoints/state) and archival storage needs to be automated and under the control of dedicated system software, not specified as part of the software. Checkpoint information needs to be streamed to higher-density storage, as does parallel I/O used to read and write datasets being processed. This allows us to move to a model where storage is just process state (and thus, ultimately, just messages), and the files as we know today are just views into the computing system, much like reports are views into a database.


Blades for Storage

Over the last decade, I've had a significant interest in the benefits of blade server architectures. In early 2000, I worked on a development project that used CompactPCI cards from ZiaTech (later acquired by Intel) where we used the CompactPCI bus as a high-performance low-latency system-to-system interconnect. Several years later, we also evaluated blade systems from IBM and HP to see if we could leverage some of the benefits of blade systems for StorageGRID deployments. But, at every point, there were major downsides, namely, cost, bandwidth and storage interconnectivity.

Thus, until now, I've never found a product that provides the storage connectivity that allows a blade system to complete economically with DAS attached shelves linked to 1U servers. That changed this week, with IBM announcing the new BladeCenter S Chassis.

Enabling Features of the BladeCenter S

There are three features in this product that allow it to be a perfect foundation for software-based storage systems:

Feature #1 - SAS/SATA RAID modules, which allow you to hang up to eight SAS/SATA shelves off each BladeCenter chassis. With 12 TB of SATA storage per shelf, that allows you to have 96 TB of raw addressable storage per BladeCenter. This feature is unique to the IBM offering at this time, and is the key enabling feature of the product for storage software solutions.

Feature #2 - Internal SAS/SATA storage, with two bays, each holding 6 3.5" drives (up to 6 TB SATA, 3 TB SAS). This is perfect for system and database disks, and allows all of the control storage to be co-located with the compute blades.

Feature #3 - 10GigE switch modules and blade NICs give you the bandwidth you need to provide high-performance storage services to external clients without bottlenecks.

Scalable Hardware for Storage Software

With the BladeCenter chassis starting at less than $2,500, this allows a common architecture and hardware platform that extends from entry-level systems consisting of two HA blades with internal storage all the way up to multi-chassis systems managing hundreds of terabytes of storage. This also reduces support costs, as you can leverage a common pool of replacement parts.

The blade architecture allows the compute hardware to be field replaceable, which reduces MTTR. And as the attached storage is assignable across blades, there is the possibility to design a system that has a hot standby blade that dynamically reassigns the storage associated with the failed blade, and resumes providing services.

If I were to design a new software storage product, this would be the foundation hardware I would choose. The software would run in VMs, granted, but this hardware finally provides the cost/performance/supportability tradeoffs that I view are critical for such a product to succeed, and provides sufficient built-in storage to allow a common hardware platform to extend down to entry-level systems.