2008-12-23

Concurrency and Complexity

HP's Russ Daniels understands a critical issue facing the next generation of software development:

"If you think about whatever those skills are that are necessary to be able to program a computer, we don't understand what they are very well, but we know that they're not very widely distributed. So you end up with a relatively small portion of the population who can write a program successfully. Then you start introducing certain kinds of abstractions: if you introduce recursion, you lose some people; if you introduce pointers, you lose some people; if you introduce event-driven models, you lose some people; when you introduce concurrency, you lose virtually everybody.

So we create these programming languages that support concurrency and then you find that there's a lock and an unlock around every method, because otherwise people hurt themselves. It presents an interesting dilemma—we need to be able to get more stuff done, we need to figure out how to do stuff in parallel, but we're pretty bad at doing that."

One of the keys to solving this complexity problem is to make the mechanics of parallelism safe, automatic, and transparent to the programmer. Ultimately, this approach needs to extend far beyond just parallelism, as we see the same failings in the areas of security, fault tolerance, fault recovery, dynamic scaling, and resource economics.

An ADE Example

In order to see how this could work, let's take a quick look at an example of how ADE transparently provides concurrency and protection of resources that must be accessed serially:

Let's imagine that we're tasked with building a simple LAN switch, and we want to provide IP address resolution for connected Ethernet hosts. To send an IP packet to a given computer, we first need to find the MAC address of the computer holding that IP address. This is performed using the Address Resolution Protocol (ARP), originally documented in IETF RFC 826.

At a simplified level, the protocol works as follows:
  1. A host that wants to talk with a specific IP address sends an ARP broadcast request asking for the MAC address to which traffic should be sent.
  2. The switch intercepts the broadcast and looks up the IP address in its local cache. If the IP-to-MAC address mapping is found, it responds to the originally requesting host with the ARP response.
  3. If the mapping is not found in the local cache, the switch broadcasts the ARP request to all other ports.
  4. If the switch receives an ARP response, it adds the IP-to-MAC mapping to its local cache, and forwards the response to the originally requesting host.
This is equivalent to a simple database insert/lookup problem, and in traditional concurrent programming models, errors can easily be introduced if care is not taken to lock the local cache while performing modifications.

In contrast, because ADE handles concurrency by forcing all component interactions to be atomic transactions, no locks are required and the system can be proven to be correct and deadlock-free.

In this example, we are going to consider three actors: The ARP Lookup Process, the ARP Cache Process and the ARP Discovery Process. These actors have the following roles:
  • The ARP Lookup Process receives ARP lookup requests from the network hosts, asks the ARP Cache Process if the mapping is known, asks the ARP Discovery Process for the mapping if not known, and responds to the requesting host.
  • The ARP Cache Process looks up mappings in the local cache, and adds mappings when asked.
  • The ARP Discovery Process sends ARP lookup requests to the network, and when a mapping is discovered, it asks the ARP Cache Process to remember them.
In this model, there would be one instance of the ARP Lookup Process per ARP lookup packet received on the network, one instance of the ARP Cache Process, and one instance of the ARP Discovery Process per cache miss.

In a busy network, there could be dozens of ARP requests being processed concurrently, many of which may require changes to the ARP cache. Even if ARP discovery is adding new mappings at the same time that a new request is looking up a mapping, there is no possibility that the cache can become corrupt, because all access to the cache is automatically serialized by virtue of the actors and the messages they exchange, as the sketch below illustrates.
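To make the serialization property concrete, here is a minimal sketch of the ARP Cache Process in Java. This is not the ADE API (in ADE, the process and its code travel as a message); it simply uses a thread and a blocking queue to show how a single inbox serializes all access to the table without locks:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.SynchronousQueue;

    public class ArpCacheActor {

        // A message to the cache actor: either a mapping to add (mac != null)
        // or a lookup request with a queue to reply on (replyTo != null).
        static final class Message {
            final String ip;
            final String mac;
            final BlockingQueue<String> replyTo;
            Message(String ip, String mac, BlockingQueue<String> replyTo) {
                this.ip = ip; this.mac = mac; this.replyTo = replyTo;
            }
        }

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Message> inbox = new LinkedBlockingQueue<>();

            // The ARP Cache Process: one thread owns the table and drains its
            // inbox in arrival order, so no explicit locking is required.
            Thread cache = new Thread(() -> {
                Map<String, String> table = new HashMap<>();
                try {
                    while (true) {
                        Message m = inbox.take();
                        if (m.mac != null) {
                            table.put(m.ip, m.mac);   // from the Discovery Process
                        } else {
                            m.replyTo.put(table.getOrDefault(m.ip, "MISS"));
                        }
                    }
                } catch (InterruptedException e) {
                    // shut down
                }
            });
            cache.setDaemon(true);
            cache.start();

            // Discovery adds a mapping while a Lookup Process reads it: both
            // interactions are just messages, serialized by the cache's inbox.
            inbox.put(new Message("10.0.0.5", "00:1a:2b:3c:4d:5e", null));
            BlockingQueue<String> reply = new SynchronousQueue<>();
            inbox.put(new Message("10.0.0.5", null, reply));
            System.out.println("10.0.0.5 -> " + reply.take());
        }
    }

The add and the lookup are processed strictly in arrival order, so a reader can never observe a half-written entry.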

This is the key aspect of ADE that simplifies the creation of distributed systems. Because concurrency is automatic, when problems are decomposed into collections of interacting actors, any parallelism inherent in the problem will be automatically utilized.

2008-12-15

XAM Active Objects

The SNIA XAM standard provides a comprehensive object model and interface for the storage of dynamic data objects. The standard provides methods by which collections of data (known as XSets) can be created and deleted, data can be stored, retrieved and manipulated inside these XSets, and queries can be performed to locate XSets that match specified characteristics.

While the first version of the standard was under development, I worked with Jared Floyd of Permabit to ensure that the XAM Job model was architected in such a way that it would include all of the components required to support active objects, like those supported in ADE.

What are XAM Active Objects?

This is best illustrated by an example:

Let us create a hypothetical XSet that includes an XStream (binary data) containing the bytecodes defining a Java program. When this XSet is committed, it can automatically start executing. Execution of these active objects can either be performed by a sidecar system that attaches to the XAM Storage System (XSS) and becomes aware of new active objects via the standard XAM query facilities, or can be an intrinsic part of the XSS itself.

The Java program would be executed in the security context of the XSet it is contained within, and thus it can access local data stored within its own XSet, and optionally perform XAM operations based on its security credentials. This allows it to read other XSets, create new XSets and perform queries to discover XSets.

For example, an active XAM object could remain resident within the storage system, performing queries for specific types of XSets (for example, PDF objects) and converting them into a newer version of the PDF format. Such a model would allow dynamic format conversion of archived content, just by loading in a new XSet. This model is also useful for analysis, where a large number of XSets need to be analysed or data-mined in the background.
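As a sketch of what such a resident converter could look like, the following uses invented XamSession and XamSet interfaces; the real XAM API and query syntax are not reproduced here, and the query expression is purely illustrative:

    // Hypothetical stand-ins for the XAM bindings, for illustration only.
    interface XamSession {
        Iterable<XamSet> query(String expression);
        XamSet createXSet();
    }

    interface XamSet {
        byte[] readXStream(String name);
        void writeXStream(String name, byte[] data);
        void setField(String name, String value);
        void commit();
    }

    // The active object: runs in the security context of its own XSet,
    // queries for PDF XSets, and writes converted copies back as new XSets.
    public class PdfConverterJob {
        public static void run(XamSession session) {
            for (XamSet pdf : session.query("\"content.type\" = 'application/pdf'")) {
                byte[] original = pdf.readXStream("content");
                byte[] upgraded = convertToNewFormat(original);
                XamSet out = session.createXSet();
                out.setField("content.type", "application/pdf");
                out.writeXStream("content", upgraded);
                out.commit();
            }
        }

        private static byte[] convertToNewFormat(byte[] in) {
            return in;   // placeholder for the actual PDF format conversion
        }
    }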

Because XAM Storage Systems will often be distributed, multiple active objects can easily be parallelized across the compute infrastructure that makes up the system, and parallel computing patterns are easy to implement, as one active object can create XSets that act as child active objects (the equivalent of performing a fork in UNIX). As code can be bundled with data, this model will also enable the creation of data-driven MIMD and SIMD parallel data processing systems.

XSets become process contexts, complete with inputs, code, state and outputs, with the full ability to discover and access data within the storage system, create new data, spawn child processes and report status information to external systems. Because XSets belong to user contexts, processing quotas, limits on computing resources, and other policies can be applied using the same methods as other XAM policies.

Part of the Standard?

All this could be implemented today as a vendor-specific extension to the standard. However, adding this capability to the XAM standard would require work to be done in the following areas:
  • Active object language profiles would have to be defined, since multiple languages including Java, .NET and interpreted languages such as Ruby could all be supported.
  • Language bindings would need to be standardized to allow the programs embedded in active objects to perform XAM operations on the XSS in which they reside.
  • Standard job-style status reporting would be beneficial to allow standardized active object execution status monitoring.
  • Policies for computing-related aspects, such as resource usage and quotas would need to be defined.
  • Requirements for security isolation would need to be defined.

Given that none of these would require changes to the core XAM standard, this could easily be added as an optional part of the standard, much like Level 2 query, which provides full-text search within XStreams.

In summary, the XAM standard provides a foundation upon which a rich data-driven distributed computing system can easily be created. This opens up many intriguing possibilities, and would be relatively easy to formally add to the standard.

2008-11-23

The ADE Process Model

To illustrate how ADE-based systems operate, it is best to use an example. In this entry, we will examine at a high level how a simple ADE program is run, from bootstrapping to shutdown.

ADE Versions

Over the last ten years, there have been three major revisions of ADE:
  • ADE V1 was developed in 1998, and was a minimal system designed to facilitate experiments with message-based computing. In this version, ADE was hosted as a library that ran on Mac OS.
  • ADE V2 was a result of continued improvements over the next two years, and included a pure message-based execution model and the ability to run directly on a processor without requiring a host operating system.
  • ADE V3 was created specifically for the Bycast StorageGRID platform from 2000 through 2004, and has been tuned to provide high levels of performance while retaining most of the advantages of the ADE approach to distributed computing. In this version, ADE was updated to use an operating system abstraction layer, allowing it to run on a wide variety of UNIX-like and embedded operating systems.
In order to demonstrate the pure message-based execution model, this example is based on ADE V2.

Bootstrapping

When ADE first starts, there are no processes present and no messages being sent. When in this state, there are no processes to send a message to, and there are no processes that could send a message. Thus, in order to transition the system into an executing state, we must "bootstrap" the system to the point where there is at least one process executing.

Given that ADE processes are just messages, when I refer to a message as a process, I'm referring to a message that contains executable code as part of the message contents. And since messages are just self-describing data structures (known as "containers"), a message can be serialized to disk. As messages are created, manipulated and exchanged in serialized form, one of the key design criteria of the container data format was highly efficient serialization.

We use this property to bootstrap the system. Once the ADE kernel is running, we load an (initially hand-created) message into memory. This message contains executable code for our bootstrap process, as well as the state for the process.

Figure 1: An executable message (Capability ID 100)

Loading this message into memory involves the following steps:
  1. An entry is added to an in-memory database, keyed by the message capability. The message capability is a random 128-bit identifier, assigned by the system, that uniquely identifies the message.
  2. The kernel creates a protected memory space for the message. This memory space provides the execution context.
  3. The serialized message is loaded into this memory space.
At this point, we have a process in memory, ready to do work, but no messages to trigger any work to be done. In order to kick off execution, we need to load a second message into the system. This message is far simpler than the first, as it just triggers the executable code in the first message. It consists of a message capability, a source CID (set to zero for bootstrapping), a destination CID (set to the capability identifier of the first message), and a message type, a string that identifies which executable code in the destination should be invoked.

Figure 2: A message destined to CID 100

This is read as, "Message CID 200, from CID 0, to CID 100, invokes /create".
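A rough sketch of these two containers as data, using illustrative field names (the real container format is a self-describing serialized structure, and capabilities are 128-bit values rather than small integers):

    import java.util.Map;

    public class BootstrapMessages {
        record Container(long cid, long sourceCid, long destCid,
                         String messageType, Map<String, String> contents) {}

        public static void main(String[] args) {
            // Figure 1: the executable process message, carrying code and state.
            Container process = new Container(100, 0, 0, null,
                    Map.of("code", "/create /destroy", "state", ""));

            // Figure 2: "Message CID 200, from CID 0, to CID 100, invokes /create".
            Container trigger = new Container(200, 0, 100, "/create", Map.of());

            System.out.println(process);
            System.out.println(trigger);
        }
    }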

Message Processing

When this second message is loaded into memory, the presence of the destination capability identifier field triggers the message to be added to a scheduler queue for delivery to the specified destination capability. The ADE kernel sees this scheduler entry, and moves the triggering message into the memory space of the destination message, where it becomes part of the destination message. The scheduler then switches to the context of the destination message, and executes the code indicated by the triggering message. As this executable code has access to the message contents, it can now inspect the contents of the triggering message, and manipulate its own message state.

Figure 3: Message CID 100, after receiving message CID 200.

In our example, message CID 100 then executes the "create" method, and the executable code in the create method creates and sends a new message to itself (which is the only possible destination, and the only destination known to the process), invoking the "destroy" method. The final step in the "create" method is to clean up the contents of the triggering message from the process message state.

Figure 4: Message CID 100, with message CID 300 about to be sent.

This newly sent message in turn gets enqueued by the scheduler, then processed, causing the "destroy" method to be invoked by the destination process. The executable code for the "destroy" method removes the entire state of the message, which then results in the message itself being removed (and its entry being removed from the in-memory database), returning us to our original pre-bootstrap state.
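The whole cycle can be condensed into a toy dispatch loop. This sketch (with invented names) omits serialization, memory protection and context switching, but follows the same sequence: load the process message, enqueue the trigger, and dispatch until the scheduler queue drains:

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    public class ToyAdeKernel {
        record Msg(long from, long to, String type) {}
        interface Method { void invoke(ToyAdeKernel kernel, long self, Msg trigger); }

        // The in-memory database of loaded process messages, keyed by capability.
        private final Map<Long, Map<String, Method>> db = new HashMap<>();
        // The scheduler queue of messages awaiting delivery.
        private final Queue<Msg> scheduler = new ArrayDeque<>();

        void load(long cid, Map<String, Method> code) { db.put(cid, code); }
        void send(long from, long to, String type) { scheduler.add(new Msg(from, to, type)); }
        void destroy(long cid) { db.remove(cid); }

        // Deliver each queued message by running the named method in the
        // context of the destination process message.
        void run() {
            while (!scheduler.isEmpty()) {
                Msg m = scheduler.remove();
                Map<String, Method> code = db.get(m.to());
                if (code != null) code.get(m.type()).invoke(this, m.to(), m);
            }
        }

        public static void main(String[] args) {
            ToyAdeKernel kernel = new ToyAdeKernel();
            // CID 100: the bootstrap process, whose /create method sends itself
            // a /destroy message, and whose /destroy method removes its state.
            kernel.load(100, Map.of(
                    "/create", (k, self, m) -> k.send(self, self, "/destroy"),
                    "/destroy", (k, self, m) -> k.destroy(self)));
            // CID 200: "from CID 0, to CID 100, invokes /create".
            kernel.send(0, 100, "/create");
            kernel.run();
            System.out.println("processes loaded: " + kernel.db.size());   // 0
        }
    }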

In Summary

This process model, of sending messages to other messages, executing code in destination messages, manipulating the state of the destination messages, and creating new messages, allows full computational capabilities, and provides a rich foundation for the creation of many advanced distributed computing models.

In our next ADE post, we'll examine some of the design patterns and capabilities that emerge as a result of this process model.

2008-11-19

SAS for Blades

Blade-based computing form factors offer many potential advantages for constructing storage subsystems using standard hardware, but until recently, the lack of cost-effective storage connectivity has hampered blades when compared to 1U and 2U servers with directly connected SCSI or SATA/SAS shelves.

As I mentioned in an earlier post about IBM's BladeCenter S storage attachment capabilities, this is starting to change as low-cost SAS storage connectivity switching is integrated into blade solutions.

And now I am pleased to see that HP has introduced a new product, the HP StorageWorks 3Gb SAS BL Switch. This product provides eight external-facing 3Gb SAS ports, and if you install two of these switches in a c3000 chassis, you can attach up to 16 external SAS/SATA shelves. Assuming 12 TB of SATA storage per shelf, that allows you to have 192 TB of raw addressable storage per c3000.

This is perfect for high-density storage systems, as you end up with 6U for the c3000, 32U for the storage shelves, and 4U remaining for a KVM and networking equipment. As the c3000 series also allows you to install "Storage Blades", you can host databases locally to the blades. This makes an excellent platform for a scalable high-density StorageGRID deployment, as each site can start with two blades providing basic services, and as storage capacity is expanded, additional shelves and blades can be added as needed. And building out to 200 TB per rack is a pretty respectable density. One could easily use this hardware to build a 1 PB storage system based around five racks running StorageGRID, connected together using 10GigE network uplinks.

When comparing this to the IBM offering, there are both advantages and disadvantages. In order to allow an HP blade to access the SAS switch, you have to install a Smart Array P700m controller card, which represents an additional cost. With the IBM offering, the RAID controller is part of the SAS switch module itself. But given that the HP switch allows twice the SAS attachment density, and that having the controller card as part of the blade provides a greater degree of failure isolation, I'm inclined to prefer the HP solution.

But regardless of the minor differences, the bottom line is that both systems are now far more viable for use as lower-cost storage infrastructure due to the elimination of the need for fibre-channel connectivity to the storage shelves. This is a huge improvement in terms of cost, and one that we won't see again until FCoE (or SAS over Ethernet) becomes widely deployed.

2008-11-18

ADE: A Decade of Distributed Computing

The Asynchronous Distributed Environment (ADE) is a message-based framework for the creation of distributed grid computing systems, including the Bycast StorageGRID. With over 40,000 node-years of production runtime, this framework represents one of the more mature distributed computing environments.

I originally developed ADE as a testbed for distributed computing concepts, and published a paper describing the system, titled "The ADE Environment: A Macroscopic Object Architecture for Software Componentization and Distributed Computing", at the MacHack conference back in 1998. As an environment for experimentation, it was very successful, allowing the rapid exploration of many patterns for creating massively concurrent distributed systems, and the refinement of the environment to improve the programming model and associated infrastructure.

The ADE Process Model

ADE takes a rather unusual approach to the traditional CSP process model: processes are messages. Thus, instantiating a process is as easy as sending a message that includes executable code. Each message has a unique "capability" identifier that allows other messages to be sent to it. When a message is received by a process, the executable code corresponding to the received message is run, which allows the process to manipulate its state (the process message) and to optionally generate additional messages.



An example of a more complex process trace can be found as part of this Graphviz .dot file.

This approach has several major advantages:
  1. Parallelism is implicit and automatic, as dependencies are expressed as message exchange relationships. All forms of serial and parallel computing can thus be expressed as directed acyclic graphs.
  2. Processes can easily be migrated from one system to another, as they are just messages. Lightweight and automated migration permits experimentation with alternate approaches to distributed processing, such as migrating processes to a data source or resource.
  3. As processes are just messages, the entire execution state of a system can be captured by checkpointing all messages across a given cut line, and by retaining historical message states, execution can be run backwards, and "anti-messages" can undo processing.
  4. Execution state can easily be logged and inspected for debugging and visualization. By inspecting the real-time message graph, deadlock and livelock can be automatically detected, as can the critical path for performance optimization.
And as one can imagine, this is just the tip of the iceberg.

A Maturing Environment

Much has changed over the last decade, and ADE has transformed from a research system into an industry-proven technology. As ADE was originally based on ATM networking, we rewrote the networking and node-to-node messaging to use TCP/IP, and the process model was changed to allow statically linked code to be associated with processes, instead of including it with the message. As storage systems require extremely high execution performance, we optimized for message processing speed and efficiency, and some of the less-used features, such as process migration, were retired from our production builds.

The result has been a high-performance, yet flexible system that facilitates the rapid development and testing of distributed systems. By focusing on allowing distributed software to be easily developed, visualized and tested, we've been able to rapidly innovate and add functionality to our system. And ultimately, this has been the most significant advantage of ADE.

2008-11-10

Welcome, EMC

Today, EMC announced their Atmos product, previously code-named "Maui". It is an interesting offering, and one that we are quite familiar with... given that object-based storage is what Bycast has been developing since 2000, with systems deployed and operational at customer sites since 2002.

Let's do a quick rundown of the core features that Bycast StorageGRID offers that are also offered by Atmos:
  • Object based storage? Check
  • Metadata with objects? Check
  • Multi-site? Check
  • Multi-tenancy? Check
  • File-system interface? Check
  • Web-services API? Check
  • Policy-driven replication? Check
  • Compression? Check
  • Object-based De-dupe? Check
  • Object versioning? Check
  • MAID-style spin-down? Check
  • Web-based admin? Check
There are also a couple of significant features found in StorageGRID that are not present in Atmos:
  • Encryption? Unknown
  • Object integrity protection? Unknown
  • Clustered gateways? Unknown
  • Storage hardware independent? Nope
  • Storage vendor independent? Nope
  • Support for tape and other high-latency media? Nope
  • Six years of operation in the field? Nope
  • Mature, customer-proven product? Nope
Over the last five years of deploying large-scale distributed object-based storage systems, we've learned a lot about what works well, and where things go wrong. Based on the initial feature set, we expect that Atmos will be a successful offering for EMC. There will, of course, be the usual teething pains, not unlike those that happened with Centera, but that is the nature of all 1.0 software.

So, to our friends at EMC, I say, Welcome. We're glad to see object-based storage becoming mainstream.

2008-11-03

On Graphing Data

Over the years, I've had to do a fair bit of data analysis for network monitoring and software instrumentation. Computers, especially when monitoring themselves, can easily generate vast amounts of data, and when you are looking for the one anomaly or correlation, the only way to find it quickly is to represent the information visually. And for data that changes over time, that means generating graphs.

A Little History

Back in 2000, when we started building our web-based management and monitoring tool at Bycast, we spent a lot of time looking at different third-party graphing toolkits. This was long before AJAX and mature client-side JavaScript engines, and we simply weren't able to find any that met even the following basic requirements:
  1. Low-latency chart generation
  2. Anti-aliased rendering
  3. Data binning
  4. Display of minimum, maximum and average values
Having looked at many poorly rendered charts, I set very high visual standards for our chart rendering classes. And since many of our graphs would be displaying data collected over weeks, months and years, a single graph could often require the processing of hundreds of thousands to millions of data points. As one might imagine, this was a non-trivial problem.

Min-Max-Average

Attempting to plot these values directly would never be efficient or responsive for the end user, so we grouped data points into time-based bins, calculated the minimum, maximum and average value of each bin, and rendered the bin values. Since each chart had the same number of bins, this allowed us to optimize our data processing and storage, while still allowing high resolution display of data, and allowing users to easily zoom into areas of interest.
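A minimal sketch of the binning step, assuming evenly spaced samples and at least as many samples as bins; a production implementation would bin by timestamp and update the bins incrementally as new data arrives:

    public class MinMaxAvgBinning {
        record Bin(double min, double max, double avg) {}

        static Bin[] bin(double[] samples, int binCount) {
            Bin[] bins = new Bin[binCount];
            for (int b = 0; b < binCount; b++) {
                // Partition the samples so every bin gets at least one value.
                int start = (int) ((long) b * samples.length / binCount);
                int end = (int) ((long) (b + 1) * samples.length / binCount);
                double min = samples[start], max = samples[start], sum = 0;
                for (int i = start; i < end; i++) {
                    min = Math.min(min, samples[i]);
                    max = Math.max(max, samples[i]);
                    sum += samples[i];
                }
                bins[b] = new Bin(min, max, sum / (end - start));
            }
            return bins;
        }

        public static void main(String[] args) {
            double[] samples = { 3, 9, 4, 8, 2, 7, 5, 6, 1, 9 };
            for (Bin b : bin(samples, 2)) System.out.println(b);
        }
    }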

This is best illustrated with an example:


This graph, which shows system memory usage over time, has a dark green line indicating the average memory usage, while the light shaded region shows the minimum-to-maximum range within each bin.

This allows deviations from the average to be easily identified visually. For example, one can see that memory usage dropped significantly on the evening of September 28th, and at one point, fell to almost 210 MBytes. This would not be visible with a chart that only displayed the average value, as most charts tend to do.

We've found that displaying this information is very important. Often it is the deviations from the average that are the most important to focus on, and this method of displaying them allows one to quickly identify and zoom in on the areas of interest.

Graphing Guidelines

I have long been looking for a well-documented set of guidelines for how to render graphs, but until now, I had not found anything that fit the bill.

In August 2008, Microsoft's Health Common User Interface group released a new guidance document, titled Displaying Graphs and Tables. This document is excellent, and I would encourage everyone involved with graphing-related information visualization to read it. It is exactly what I had been looking for, and is very well written.

Some Additional Graphing References

Here are some references that I have read and found useful when designing graph-based information visualizations:

Beautiful Evidence, by Edward R. Tufte
The Elements of Graphing Data, by William S. Cleveland
The Grammar of Graphics, by Leland Wilkinson

2008-10-27

Microsoft Azure is Cloud Storage

Microsoft today announced their Cloud Computing initiative, the Azure Services Platform. While many commentators have been comparing it with Amazon's EC2 offerings and with VMware-based hosting, Azure is as much about cloud storage as it is about computing.

Quoting from Microsoft's Port 25 Blog:

Both Windows Azure applications and on-premises applications can access the Windows Azure storage service, and both do it in the same way: using a RESTful approach. The underlying data store is not Microsoft SQL Server, however. In fact, Windows Azure storage isn't a relational system, and its query language isn't SQL. Because it's primarily designed to support applications built on Windows Azure, it provides simpler, more scalable kinds of storage. Accordingly, it allows storing binary large objects (blobs), provides queues for communication between components of Windows Azure applications, and even offers a form of tables with a straightforward query language, Chappell says.

This is worth emphasizing. Azure provides global cloud-based object storage. And it takes it one step further by providing active objects such as queues and tables. The presence of queues, message buses and other persistent data structures is a real game-changer, as they form the location-independent "glue" that holds together the large-scale loosely-coupled applications best suited for cloud-based hosting. This directly competes with Amazon's S3, and by creating a platform that runs .NET and other interpreted languages directly, without the weight of a full OS running in a VM, it should be able to scale much more elegantly.

For storage object access, the API Microsoft has adopted is very similar to the HTTP API used by Bycast for the StorageGRID platform, so this continues the trend of standardization around HTTP for object storage and access.

Their API documentation can be found at the link below:

http://go.microsoft.com/fwlink/?LinkId=131258

Also of interest from the announcement are Microsoft's .NET Services for Ruby, which provide first-class access to the Azure services for Ruby applications.

Now all we need is a XAM VIM that talks to Azure. Their containers map closely to XSets, blobs to XStreams, and typed data can be placed in tables and thus be queried. They've even provided methods to handle cache coherency for multiple simultaneous writers. Very interesting...

2008-10-23

Discussions on the Web

XAM and content-based search

A post about the advantages of including support for full-text search in XAM implementations, and my response regarding the pros and cons of providing this functionality at the storage layer instead of the application layer.

The Universal Design Pattern

A discussion of a commonly used design pattern where objects have properties that are both inherited and overridden, and my response regarding the similarities of the XAM object storage model, and its use for persistent storage of serialized objects created when using this design pattern.

2008-10-19

Web 2.0 and Archiving

A few days ago, Facebook announced that they had reached the milestone of 10 billion images stored on their servers. This translates to over 40 billion files under management, as each image is stored at four different resolutions. In total, all of these files require 1 PB of storage space, and most likely consume 2 PB of physical storage, assuming their data is mirrored at a minimum of two sites for availability purposes.

If we take those 40 billion files and 1 PB of storage consumed, we can work out that the average file size is around 25 Kbytes. All of these files need to be stored, managed, distributed and replicated — exactly the solution space for an archive. And this is why the Web 2.0 problem space has so often been viewed as an archiving problem.

But the Web 2.0 problem is larger. It is not the number of objects stored that makes this an engineering challenge, but the number of objects accessed: Facebook serves 15 billion images per day, thus serving more files each day than the number of unique images they have stored! At peak load, they are serving over 300,000 images per second. Even though the majority of the images accessed each day falls within a small (<0.01%) subset of the total image set, the images that are popular change daily, and the storage system must handle these "hot files" efficiently in order for the system not to collapse under the load.
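As a quick sanity check, the figures above work out as follows:

    1 PB ÷ 40×10^9 files ≈ 25 KB per file (the average size)
    15×10^9 images/day ÷ 86,400 seconds/day ≈ 174,000 images/s on average

So the quoted peak of 300,000 images per second is less than twice the daily average rate.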

Thus, while it is true to say that Web 2.0 sites require an archive (for all those infrequently or never again accessed objects), due to the retrieval requirements, an archive is just one part of the infrastructure required to serve a Web 2.0 workload.

2008-10-07

Extraordinary Concurrency as the Only Game in Town

DARPA released a new report today providing a detailed analysis of the challenges involved in scaling high-performance computing up to exascale levels, three orders of magnitude higher than today's petascale supercomputers.

The report is available for download at the URL below:

TR2008-13: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems

This report covers many fascinating issues, ranging from advancements in semiconductor technologies to interconnects to packaging and all the way up the stack to the software that would run on such a computing system.

What jumped out at me is the degree to which concurrent programming models have become critical to scaling computing, and the degree to which this is still an open problem. Single-threaded programming has reached the end of the performance road, and the only way to improve is to parallelize. This agrees directly with the research in concurrent programming and runtime systems that I have been conducting for the past decade, and I'll be sharing more thoughts on my findings and future research over the coming months.

Resilient Exascale Systems

A computing system with millions of loosely coupled processing elements running billions of concurrent processes will be continuously encountering failures at the hardware, interconnect and software levels. A successful software environment must automatically and transparently recover from such faults without passing the burden of complexity on to the programmer.

Ultimately, the only successful approach will be a dispersal-driven replicated checkpoint-and-vote architecture which allows a trade-off to be made between reliability and efficiency. This approach offloads the complexity of error detection and recovery from the programmer, and allows the system to automatically optimize system efficiency (a higher voting degree trades compute efficiency for time efficiency, and a lower voting degree trades time efficiency for compute efficiency).
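As a toy illustration of the voting half of this idea (the dispersal, checkpointing and recovery machinery is omitted), the sketch below runs the same task on n replicas and accepts the plurality answer; n is the voting-degree knob described above:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ReplicatedVote {
        // Run the same task on n replicas and accept the plurality answer.
        // A higher n burns more compute but masks more failures without
        // having to re-run; a lower n is cheaper but detects less.
        static <T> T runAndVote(Callable<T> task, int n) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(n);
            try {
                List<Future<T>> runs = pool.invokeAll(Collections.nCopies(n, task));
                Map<T, Integer> votes = new HashMap<>();
                T best = null;
                for (Future<T> f : runs) {
                    T result = f.get();
                    votes.merge(result, 1, Integer::sum);
                    if (best == null || votes.get(result) > votes.get(best)) best = result;
                }
                return best;
            } finally {
                pool.shutdown();
            }
        }

        public static void main(String[] args) throws Exception {
            // A flaky task that occasionally returns a corrupted answer.
            Callable<Integer> task = () -> Math.random() < 0.2 ? 41 : 42;
            System.out.println(runAndVote(task, 5));   // almost always 42
        }
    }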

The presence of such a resiliency layer also allows the use of fabrication techniques that have far higher defect densities and error rates than traditional system fabrication, as near-perfection is no longer required. This is key to enabling feature sizes and voltages to continue to be reduced, to providing high-density arrays of compute elements, and to supporting larger dies and stacked wafers.

Storage for Exascale Systems

One of the most fundamental changes we need to make is to stop separating storage resources from computing resources. CPU power is continuing to increase, and storage density is continuing to increase, but the bandwidth and latency between the two is not.

Instead of moving the data to the compute element, we need to move the computation to the data. The first step is to replace the traditional demand-paged memory hierarchy with generic compute elements that can act as programmable data movers, create vastly larger register stores, and consider such data movement as a subset of a more general message passing approach. Instead of just sending data to fixed computational processes, we need to turn processes into messages themselves, thus allowing processes to also be sent to the location(s) where data is stored.

Ultimately, persistent storage (for checkpoints/state) and archival storage needs to be automated and placed under the control of dedicated system software, not specified as part of the application. Checkpoint information needs to be streamed to higher-density storage, as does the parallel I/O used to read and write the datasets being processed. This allows us to move to a model where storage is just process state (and thus, ultimately, just messages), and the files as we know them today are just views into the computing system, much like reports are views into a database.

2008-10-02

Blades for Storage

Over the last decade, I've had a significant interest in the benefits of blade server architectures. In early 2000, I worked on a development project that used CompactPCI cards from ZiaTech (later acquired by Intel) where we used the CompactPCI bus as a high-performance low-latency system-to-system interconnect. Several years later, we also evaluated blade systems from IBM and HP to see if we could leverage some of the benefits of blade systems for StorageGRID deployments. But, at every point, there were major downsides, namely, cost, bandwidth and storage interconnectivity.

Thus, until now, I've never found a product that provides the storage connectivity to allow a blade system to compete economically with DAS-attached shelves linked to 1U servers. That changed this week, with IBM announcing the new BladeCenter S chassis.

Enabling Features of the BladeCenter S

There are three features in this product that allow it to be a perfect foundation for software-based storage systems:

Feature #1 - SAS/SATA RAID modules, which allow you to hang up to eight SAS/SATA shelves off each BladeCenter chassis. With 12 TB of SATA storage per shelf, that allows you to have 96 TB of raw addressable storage per BladeCenter. This feature is unique to the IBM offering at this time, and is the key enabling feature of the product for storage software solutions.

Feature #2 - Internal SAS/SATA storage, with two bays, each holding six 3.5" drives (up to 6 TB SATA, 3 TB SAS). This is perfect for system and database disks, and allows all of the control storage to be co-located with the compute blades.

Feature #3 - 10GigE switch modules and blade NICs give you the bandwidth you need to provide high-performance storage services to external clients without bottlenecks.

Scalable Hardware for Storage Software

With the BladeCenter chassis starting at less than $2,500, this allows a common architecture and hardware platform that extends from entry-level systems consisting of two HA blades with internal storage all the way up to multi-chassis systems managing hundreds of terabytes of storage. This also reduces support costs, as you can leverage a common pool of replacement parts.

The blade architecture allows the compute hardware to be field-replaceable, which reduces MTTR. And as the attached storage is assignable across blades, it is possible to design a system with a hot-standby blade that dynamically takes over the storage associated with a failed blade and resumes providing services.

If I were to design a new software storage product, this would be the foundation hardware I would choose. The software would run in VMs, granted, but this hardware finally provides the cost/performance/supportability tradeoffs that I view as critical for such a product to succeed, and provides sufficient built-in storage to allow a common hardware platform to extend down to entry-level systems.

2008-09-23

Google Trust

Search providers such as Google, Yahoo and Microsoft are in a unique position to provide indicators of the trustworthiness of a given web site, as they act as a trusted intermediary between end-users and their desired destinations. Currently, there are no clear methods by which an end-user can assess whether a given site in a search result is indeed controlled by the organization it claims to represent. If there were a way for a user to identify which sites had verified identities, this would result in vast improvements in the trust relationships between end-users and the organizations behind any given web site.

The User Experience

Let us consider a hypothetical user experience: A user navigates to Google and enters a company name. They are presented with a listing of search results, often including the official company web site, web sites for companies with a similar name, sites reviewing products by the company, and sometimes sites masquerading as the official site.

In the search results listing, each site where the identity of the organization that controls the site has been verified includes a special icon known as a "trust mark". This icon indicates that Google has established a chain of trust that allows the identity of the organization responsible for the content on that site to be verified.

Figure 1: An example UI from Safari indicating the validity of a certificate.
The green check icon is a good example of a visual representation of a trust mark.

The presence of the trust mark may be sufficient for the user to navigate to the site in confidence, or they may click on the trust mark, showing a page containing legal information about the entity, including their location (using Google Maps, of course). This information would especially be useful for disambiguating different companies with similar names.

The Technology

Standard web certificates are already used for secure transactions and providing information about the authenticity of a secured web site. But these are limited to the secure sections of web sites, such as pages for authentication and payment processing. Most web sites do not use SSL/TLS for the bulk of their web site due to the computational cost of processing HTTPS transactions when compared to standard HTTP.

However, the same certificates used to provide HTTPS could also be used to indicate a degree of trust. By placing the certificate as a file in the root path of the web site, the Google crawler could retrieve a "certificates.txt" file, much like the current "robots.txt" file. As most certificates contain the site's domain name, Google would be able to verify the chain of trust of the certificate, check that the URL it was crawling matched the URL in the certificate, and then display the trust mark and associated information.

As this approach leverages existing infrastructure, does not require any new protocols, and allows web sites operators with existing certificates to immediately use them for this purpose, this would facilitate rapid adoption of this technique.
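A rough sketch of the crawler-side check, using the standard Java certificate classes. The "certificates.txt" convention is this post's proposal rather than an existing standard, and real code would validate the full chain to a trusted root and check the subjectAltName entries rather than just matching the CN:

    import java.io.InputStream;
    import java.net.URL;
    import java.security.cert.CertificateFactory;
    import java.security.cert.X509Certificate;

    public class TrustMarkCheck {
        static boolean candidateForTrustMark(String siteHost) throws Exception {
            // Fetch the certificate the site publishes at its root path.
            URL url = new URL("http://" + siteHost + "/certificates.txt");
            try (InputStream in = url.openStream()) {
                X509Certificate cert = (X509Certificate) CertificateFactory
                        .getInstance("X.509").generateCertificate(in);
                cert.checkValidity();   // reject expired certificates
                // Compare the crawled host against the certificate subject.
                String subject = cert.getSubjectX500Principal().getName();
                return subject.contains("CN=" + siteHost);
            }
        }

        public static void main(String[] args) throws Exception {
            // Would print true only for a site actually publishing its certificate.
            System.out.println(candidateForTrustMark("www.example.com"));
        }
    }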

2008-09-15

Lightning Between the Clouds

Today's VMware vCloud announcement is the first concrete announcement of a product that enables migrating VM sessions across corporate boundaries. With the success of VMotion at a technical level, outsourcing your DR centre was an obvious next step, and vCloud formalizes the ability to define these relationships in code.

However, there are many significant technical and non-technical risks associated with such a service offering, including:
  • Storage Synchronization - In order to migrate VM sessions, the storage backing the VM session must also be synchronized across the two organizational entities. This plays well into EMC's hands to bundle location-independent storage along with VM functionality.
  • Security - VM sessions contain sensitive in-memory data, such as encryption keys, that (in a well-designed system) never makes it to disk. This is on top of the security issues associated with the storage associated with the sessions.
  • Multi-tenancy - An outsourcing provider will most likely be running VM sessions for multiple customers on a common infrastructure. Thus, network isolation, in addition to storage isolation and partitioning, becomes a major issue that is not present when all resources are within a single enterprise.
  • Management - Resources need to be billed, QoS monitored, SLAs tracked and enforced, and loads predicted and managed. All this will have to work seamlessly across both the customer's and the outsourcer's infrastructure. The back-office part of such a service is always underestimated, and is critical to get right in order for a service to succeed.
Solving these problems is a pretty tall order, and I have to commend VMware for their vision. This is version 2.0 of the VM revolution, and things are starting to get really interesting.

2008-09-12

Time for Compliance

Many aspects of compliance storage rely on trusted time. These include timestamps indicating when an object or file was stored, retention durations that indicate when files must not be deleted or modified, and audit records indicating when operations were performed against the storage system. All of these timestamps must be accurate, and, more importantly, must be resistant against attack in order to satisfy the multitude of compliance regulations, such as Sarbanes-Oxley and HIPAA.

A Question of Time

When evaluating such a storage system, here are ten good questions to ask your vendor:
  1. How and when is the clock set?
  2. Who can set or adjust the clock?
  3. Are changes to the clock audited?
  4. How much can the clock drift over time?
  5. If the clock is synchronized, is the synchronization chain trustworthy?
  6. Is clock synchronization traceable to NIST?
  7. If clock synchronization is no longer possible, how does the system react?
  8. When clock synchronization is regained, how does the system react?
  9. What protections are present to prevent tampering with the clock at the system level?
  10. What protections are present to prevent tampering with the clock at the network level?

Two Architectures

Generally, two architectures have emerged, one that involves a completely sealed system that is capable of maintaining accurate time with drift less than one minute per year for the life of the system, and one that involves network-based transactions that cryptographically prove that a given event happened at a given time.

The advantages of the first architecture include strong resistance to tampering and low maintenance requirements. However, the downside to such an architecture is the requirement for custom hardware, both to keep accurate time (the clocks in typical servers range from largely inaccurate to downright embarrassing) and to provide the means to physically secure the hardware from prying eyes and screwdrivers. Because this requires custom enclosures and maintenance contracts (who do you trust to have the keys to the rack?), it typically lends itself to solutions from larger storage hardware vendors. And, after all, if you are spending hundreds of thousands to millions of dollars on something, it had better be able to keep accurate time.

The second architecture is unfortunately far more complex and difficult to design and implement correctly. In a software-only solution, very little can be relied upon to be trusted. After all, a standard x86 server is only one boot-disk away from unfettered tampering, and it's difficult to detect if you are running under a hypervisor. Thus, such systems must rely on complex network transactions to determine accurate times of events, often resulting in increased transactional latency. Unless these time transactions are designed and tested to ensure that a malicious time source or compromised node is unable to alter the timestamps and compliance durations, this can be a significant point of weakness.

Beware NTP

One protocol to watch out for is NTP. A malicious NTP server, combined with a poisoned DNS cache and the quick throw of a circuit breaker, might result in all your compliance data being unprotected ahead of schedule, or even worse, automatically erased from the system. And given that NTP security is rarely used and not well regarded, it is almost a certainty that it forms a weak link in the chain of trust.

Many systems that use NTP just use it to set the server and operating system clock, which they then trust blindly. For a given server, this clock can be easily altered; to obtain trusted timestamps, information from multiple sources that cannot all be easily compromised must be used, as sketched below.
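One simple form of such a multi-source check: collect readings from several independent sources, reject them if they disagree beyond a threshold (five minutes here, purely as a placeholder), and otherwise use the median, so that a single bad source cannot skew the result:

    import java.util.Arrays;

    public class MedianTimestamp {
        static long trustedTimeMillis(long[] sourceTimesMillis) {
            long[] t = sourceTimesMillis.clone();
            Arrays.sort(t);
            // If the sources disagree too much, refuse to issue a timestamp
            // rather than silently trusting a possibly compromised reading.
            if (t[t.length - 1] - t[0] > 5 * 60 * 1000L) {
                throw new IllegalStateException("time sources disagree");
            }
            return t[t.length / 2];   // the median reading
        }

        public static void main(String[] args) {
            long[] readings = { 1221523200000L, 1221523201500L, 1221523199000L };
            System.out.println(trustedTimeMillis(readings));
        }
    }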

Time is of the Essence

Time is often overlooked when evaluating compliance storage, but is a fundamental aspect of the compliance process. After all, in a court of law, if the timestamps of events cannot be proven to be accurate, and retention durations cannot be shown to be enforced, that expensive compliance system may end up being even more expensive.

2008-09-09

Raining on the Cloud

A thorn in my side, as of late, has been the Wikipedia article on Cloud Computing. Describing yet another newly coined buzzword for distributed computing, this article contains many examples of the worst of Wikipedia, and reminds me of some of the articles I have been subjected to by SOA fundamentalists (and before them, the CORBA cultists, etc.).

Cloud, as in Network Cloud

The concept of Cloud Computing originated as an analogy to the network cloud, a mainstay of whiteboard and Visio diagrams everywhere. Thus, in order to understand what it means, one must consider what a network cloud means. Fortunately, this is simple to answer and relatively uncontentious: all a network cloud means is "the stuff we don't have to worry about". It's infrastructure. It's the stuff that we can let the network and/or the networking people figure out how to make work, and by ignoring those details, we are free to focus on the problem at hand.

If we take this concept of the network cloud and apply it to computing, we end up with "The practice of using unknown resources to provide computational services as a component of solving a larger problem." Just like that network cloud in the diagram indicates that we don't care how the packets get from site A to B, cloud computing allows us to not worry about how and where computation is performed.

Distributed Computing, Renamed, Yet Again

When viewed from this perspective, cloud computing is just another flavour of distributed computing, one where computational services are provided over a network, typically the Internet. The fact that the users of these services do not have to own, control, manage or even be aware of how the service is provided is important, but not ground-breaking. The key difference is that the service contract and information hiding resulting from a well-defined and managed service allow application complexity to be built on top of the services without having to worry about their implementation or operation.

When it all works, that is...

2008-09-03

Vanishing into the Infrastructure

Throughout our history, it is our infrastructure that we have built our civilization upon. From roads to electrical power grids, from the telephone network to the soon-to-be-ubiquitous Internet, it is the technologies that we no longer think about that enable our way of life. For it is when technology fades into the infrastructure that things become interesting:
  • This is when the next generation of technology emerges.
  • This is where the majority of the dollars under the adoption curve are found.
  • And most importantly, this is when the benefits to human existence become widespread and far-reaching.
Fading into the infrastructure is not easy, nor predictable. After all, it involves technical and business challenges that far exceed the complexities of any original innovation. But for those who thrive on these challenges, it offers opportunities to make a far-reaching difference.

Welcome.