2008-11-23

The ADE Process Model

To illustrate how ADE-based systems operate, it is best to use an example. In this entry, we will examine at a high level how a simple ADE program is run, from bootstrapping to shutdown.

ADE Versions

Over the last ten years, there have been three major revisions of ADE:
  • ADE V1 was developed in 1998, and was a minimal system designed to facilitate experiments with message-based computing. In this version, ADE was hosted as a library that ran on Mac OS.
  • ADE V2 was a result of continued improvements over the next two years, and included a pure message-based execution model and the ability to run directly on a processor without requiring a host operating system.
  • ADE V3 was created specifically for the Bycast StorageGRID platform during 2000 through 2004, and has been tuned to provide high levels of performance while retaining most of the advantages of the ADE approach to distributed computing. In this version, ADE was updated to use an Operating System abstraction layer, allowing it to run on a wide variety of UNIX-like and embedded operating systems.
In order to demonstrate the pure message-based execution model, this example is based on ADE V2.

Bootstrapping

When ADE first starts, there are no processes present and no messages being sent. When in this state, there are no processes to send a message to, and there are no processes that could send a message. Thus, in order to transition the system into an executing state, we must "bootstrap" the system to the point where there is at least one process executing.

Given that ADE processes are just messages, when I refer to a message as a process, I'm referring to a message that contains executable code as part of the message contents. And since messages are just self-describing data structures (known as "containers"), a message can be serialized to disk. As messages are created, manipulated and exchanged in serialized form, a key design criterion of the container data format was highly efficient serialization.
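As a rough illustration of the container idea, here is a minimal sketch of a self-describing structure that round-trips through a serialized form. ADE's actual container format is binary and optimized for serialization speed; JSON is used here purely to keep the sketch readable, and all field names are illustrative.

```python
import json

# A hypothetical container: a self-describing key/value structure that can
# round-trip losslessly through a serialized (on-disk or on-wire) form.
def serialize(container: dict) -> bytes:
    return json.dumps(container, sort_keys=True).encode("utf-8")

def deserialize(blob: bytes) -> dict:
    return json.loads(blob.decode("utf-8"))

# An executable message: code plus state, all in one container.
message = {
    "capability": 100,
    "code": {"/create": "...", "/destroy": "..."},  # placeholder code
    "state": {},
}

blob = serialize(message)            # could now be written to disk
assert deserialize(blob) == message  # lossless round trip
```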

We use this property to bootstrap the system. Once the ADE kernel is running, we load an (initially hand-created) message into memory. This message contains executable code for our bootstrap process, as well as the state for the process.

Figure 1: An executable message (Capability ID 100)

Loading this message into memory involves the following steps:
  1. An entry is added to an in-memory database, keyed by the message capability. The message capability is a randomly generated 128-bit identifier, assigned by the system, that uniquely identifies the message.
  2. The kernel creates a protected memory space for the message. This memory space provides the execution context.
  3. The serialized message is loaded into this memory space.
At this point, we have a process in memory, ready to do work, but no messages to trigger any work to be done. In order to kick off execution, we need to load a second message into the system. This message is far simpler than the first, as it just triggers the executable code in the first message. It consists of a message capability, a source CID (set to zero for bootstrapping), a destination CID (set to the capability identifier of the first message), and a message type: a string that identifies which executable code in the destination should be invoked.

Figure 2: A message destined to CID 100

This is read as, "Message CID 200, from CID 0, to CID 100, invokes /create".
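The loading steps and the trigger message can be sketched in a few lines. This is an illustrative model (a Python dict standing in for the in-memory database, and per-message dicts standing in for the protected memory spaces), not ADE's actual data structures:

```python
import secrets

# Illustrative model: the in-memory database maps capability IDs to
# execution contexts ("protected memory spaces" modeled as dicts).
database = {}

def load_message(serialized: dict) -> int:
    # Step 1: add an entry keyed by the message capability, a randomly
    # generated 128-bit identifier (assigned here if not already present).
    cid = serialized.setdefault("capability", secrets.randbits(128))
    # Steps 2 and 3: create a memory space and load the message into it.
    database[cid] = {"space": dict(serialized)}
    return cid

# The bootstrap process: a message carrying executable code and state.
load_message({"capability": 100, "code": {"/create": None}, "state": {}})

# The trigger: "Message CID 200, from CID 0, to CID 100, invokes /create".
load_message({"capability": 200, "from": 0, "to": 100, "type": "/create"})

assert sorted(database) == [100, 200]  # both messages now resident
```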

Message Processing

When this second message is loaded into memory, the presence of the destination capability identifier field causes the message to be added to a scheduler queue for delivery to the specified destination capability. The ADE kernel sees this scheduler entry and moves the triggering message into the memory space of the destination message, where it becomes part of the destination message. The scheduler then switches to the context of the destination message and executes the code indicated by the triggering message. As this executable code has access to the message contents, it can inspect the contents of the triggering message and manipulate its own message state.

Figure 3: Message CID 100, after receiving message CID 200.

In our example, message CID 100 then executes the "create" method, and the executable code in the create method creates and sends a new message to itself (which is the only possible destination, and the only destination known to the process), invoking the "destroy" method. The final step in the "create" method is to clean up the contents of the triggering message from the process message state.

Figure 4: Message CID 100, with message CID 300 about to be sent.

This newly sent message in turn gets enqueued by the scheduler, then processed, causing the "destroy" method to be invoked by the destination process. The executable code for the "destroy" method removes the entire state of the message, which then results in the message itself being removed (and the entry being removed from the in-memory database), returning us to our original pre-bootstrap state.
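The whole walkthrough, from the trigger message through "create" and "destroy", can be modeled as a toy dispatch loop. This is a sketch of the behaviour described above, not the ADE kernel: memory spaces are dicts, executable code is a plain Python callable, and the scheduler is a FIFO queue.

```python
from collections import deque

processes = {}   # capability ID -> process message (its memory space)
queue = deque()  # scheduler queue of messages awaiting delivery

def send(msg):
    # A message carrying a destination CID gets queued for delivery.
    queue.append(msg)

def run_scheduler():
    while queue:
        msg = queue.popleft()
        proc = processes[msg["to"]]
        proc["inbox"] = msg                   # message joins the destination's space
        proc["code"][msg["type"]](proc, msg)  # run the indicated handler

def create(proc, msg):
    # Send a new message to ourselves invoking "/destroy"...
    send({"cid": 300, "from": proc["cid"], "to": proc["cid"], "type": "/destroy"})
    del proc["inbox"]                         # ...then clean up the trigger

def destroy(proc, msg):
    del processes[proc["cid"]]                # remove the process entirely

processes[100] = {"cid": 100, "code": {"/create": create, "/destroy": destroy}}
send({"cid": 200, "from": 0, "to": 100, "type": "/create"})  # the trigger
run_scheduler()
assert not processes and not queue            # back to the pre-bootstrap state
```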

In Summary

This process model, of sending messages to other messages, executing code in destination messages, manipulating the state of the destination messages, and creating new messages, allows full computational capabilities, and provides a rich foundation for the creation of many advanced distributed computing models.

In our next ADE post, we'll examine some of the design patterns and capabilities that emerge as a result of this process model.

2008-11-19

SAS for Blades

Blade-based computing form factors offer many potential advantages for constructing storage subsystems using standard hardware, but until recently, the lack of cost-effective storage connectivity has hampered blades when compared to 1U and 2U servers with directly connected SCSI or SATA/SAS shelves.

As I mentioned in an earlier post about IBM's BladeCenter S storage attachment capabilities, this is starting to change as low-cost SAS storage connectivity switching is integrated into blade solutions.

And now I am pleased to see that HP has introduced a new product, the HP StorageWorks 3Gb SAS BL Switch. This product provides eight external-facing 3Gb SAS ports, and if you install two of these switches in a c3000 chassis, you can attach up to 16 external SAS/SATA shelves. Assuming 12 TB of SATA storage per shelf, that allows you to have 192 TB of raw addressable storage per c3000.

This is perfect for high-density storage systems, as you end up with 6U for the c3000, 32U for the storage shelves, and 4U remaining for a KVM and networking equipment. As the c3000 series also allows you to install "Storage Blades", you can host databases locally on the blade. This makes an excellent platform for a scalable high-density StorageGRID deployment, as each site can start with two blades providing basic services, and as storage capacity is expanded, additional shelves and blades can be added as needed. And building out to nearly 200 TB per rack is a pretty respectable density. One could easily use this hardware to build a 1 PB storage system based around five racks running StorageGRID, connected together using 10 Gb network uplinks.
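A quick sanity check of these figures (the 2U-per-shelf assumption is mine; the other numbers come from the post):

```python
# Capacity and rack-unit arithmetic for a fully built-out c3000.
switches_per_chassis = 2
ports_per_switch = 8
tb_per_shelf = 12

shelves = switches_per_chassis * ports_per_switch  # 16 external shelves
raw_tb = shelves * tb_per_shelf                    # 192 TB raw per c3000

# Rack budget: chassis + shelves (assumed 2U each) + KVM/networking.
rack_units = 6 + shelves * 2 + 4                   # 42U: a full rack

print(raw_tb, rack_units, 5 * raw_tb)              # 192 42 960
```

Five such racks yield 960 TB raw, which rounds comfortably to the 1 PB figure.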

When comparing this to the IBM offering, there are both advantages and disadvantages. In order to allow an HP blade to access the SAS switch, you have to install a Smart Array P700m controller card, which represents an additional cost. With the IBM offering, the RAID controller is part of the SAS switch module itself. But given that the HP switch allows twice the SAS attachment density, and that having the controller card as part of the blade provides a greater degree of failure isolation, I'm inclined to prefer the HP solution.

But regardless of the minor differences, the bottom line is that both systems are now far more viable for use as lower-cost storage infrastructure due to the elimination of the need for fibre-channel based connectivity to the storage shelves. This is a huge improvement in terms of costs, and one that we won't see again until FCoE (or SAS over Ethernet) becomes widely deployed.

2008-11-18

ADE: A Decade of Distributed Computing

The Asynchronous Distributed Environment (ADE) is a message-based framework for the creation of distributed grid computing systems, including the Bycast StorageGRID. With over 40,000 node-years of production runtime, this framework represents one of the more mature distributed computing environments.

I originally developed ADE as a testbed for distributed computing concepts, and published a paper describing the system, titled "The ADE Environment: A Macroscopic Object Architecture for Software Componentization and Distributed Computing", at the MacHack conference back in 1998. As an environment for experimentation, it was very successful, allowing the rapid exploration of many patterns for creating massively concurrent distributed systems, and the refinement of the environment to improve the programming model and associated infrastructure.

The ADE Process Model

ADE takes a unique approach compared to the traditional CSP process model: processes are messages. Thus, instantiating a process is as easy as sending a message that includes executable code. Each message has a unique "capability" identifier that allows other messages to be sent to it. When a message is received by a process, the executable code corresponding to the received message is run, which allows the process to manipulate its state (the process message) and to optionally generate additional messages.



An example of a more complex process trace can be found as part of this Graphviz .dot file.

This approach has several major advantages:
  1. Parallelism is implicit and automatic, as dependencies are expressed as message exchange relationships. All forms of serial and parallel computing can thus be expressed as directed acyclic graphs.
  2. Processes can easily be migrated from one system to another, as they are just messages. Lightweight and automated migration permits experimentation with alternate approaches to distributed processing, such as migrating processes to a data source or resource.
  3. As processes are just messages, the entire execution state of a system can be captured by checkpointing all messages across a given cut line, and by retaining historical message states, execution can be run backwards, and "anti-messages" can undo processing.
  4. Execution state can easily be logged and inspected for debugging and visualization. By inspecting the real-time message graph, deadlock and livelock can be automatically detected, as can the critical path for performance optimization.
And as one can imagine, this is just the tip of the iceberg.
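The checkpointing advantage is worth a tiny illustration: if all execution state lives in messages, a consistent checkpoint across a cut line reduces to a snapshot of every message. The structures below are illustrative, not ADE's actual formats.

```python
import json

# Toy system state: every message (processes included) in one table.
messages = {
    "100": {"cid": 100, "state": {"step": 4}},
    "200": {"cid": 200, "from": 0, "to": 100, "type": "/create"},
}

checkpoint = json.dumps(messages, sort_keys=True)  # capture the cut line

messages["100"]["state"]["step"] = 5               # execution continues...

restored = json.loads(checkpoint)
assert restored["100"]["state"]["step"] == 4       # ...and can be rewound
```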

A Maturing Environment

Much has changed over the last decade, and ADE has transformed from a research system into an industry-proven technology. As ADE was originally based on ATM networking, we re-wrote the networking and node-to-node messaging to use TCP/IP, and the process model was changed to allow statically linked code to be associated with processes, instead of including it with the message. As storage systems require extremely high execution performance, we optimized for message processing speed and efficiency, and some of the less-used features, such as process migration, were retired from our production builds.

The result has been a high-performance, yet flexible system that facilitates the rapid development and testing of distributed systems. By focusing on allowing distributed software to be easily developed, visualized and tested, we've been able to rapidly innovate and add functionality to our system. And ultimately, this has been the most significant advantage of ADE.

2008-11-10

Welcome, EMC

Today, EMC announced their Atmos product, previously code-named "Maui". It is an interesting offering, and one that we are quite familiar with... given that object-based storage is what Bycast has been developing since 2000, and has had deployed and operational in customer sites since 2002.

Let's do a quick rundown of the core features that Bycast StorageGRID offers that are also offered by Atmos:
  • Object based storage? Check
  • Metadata with objects? Check
  • Multi-site? Check
  • Multi-tenancy? Check
  • File-system interface? Check
  • Web-services API? Check
  • Policy-driven replication? Check
  • Compression? Check
  • Object-based De-dupe? Check
  • Object versioning? Check
  • MAID-style spin-down? Check
  • Web-based admin? Check
There are also a couple of significant features found in StorageGRID that are not present in Atmos:
  • Encryption? Unknown
  • Object integrity protection? Unknown
  • Clustered gateways? Unknown
  • Storage hardware independent? Nope
  • Storage vendor independent? Nope
  • Support for tape and other high-latency media? Nope
  • Six years of operation in the field? Nope
  • Mature, customer-proven product? Nope
Over the last five years of deploying large-scale distributed object-based storage systems, we've learned a lot about what things work well, and where things go wrong. Based on the initial feature set, we know that Atmos will be a successful offering for EMC. There will, of course, be the usual teething pains, not unlike those that happened with Centera, but that is the nature of all 1.0 software.

So, to our friends at EMC, I say, Welcome. We're glad to see object-based storage becoming mainstream.

2008-11-03

On Graphing Data

Over the years, I've had to do a fair bit of data analysis for network monitoring and software instrumentation. Computers, especially when monitoring themselves, can easily generate vast amounts of data, and when you are looking for the one anomaly or correlation, the only way to quickly find them is to represent the information visually. And for data that changes over time, that means generating graphs of the data.

A Little History

Back in 2000, when we started building our web-based management and monitoring tool at Bycast, we spent a lot of time looking at different third-party graphing toolkits. This was long before AJAX and mature client-side JavaScript engines, and we simply weren't able to find any that met even the following basic requirements:
  1. Low-latency chart generation
  2. Anti-aliased rendering
  3. Data binning
  4. Display of minimum, maximum and average values
Having looked at many poorly rendered charts, I set very high visual standards for our chart rendering classes. And since many of our graphs would be displaying data collected over weeks, months and years, a single graph could often require the processing of hundreds of thousands to millions of data points. As one might imagine, this was a non-trivial problem.

Min-Max-Average

Attempting to plot these values directly would never be efficient or responsive for the end user, so we grouped data points into time-based bins, calculated the minimum, maximum and average value of each bin, and rendered the bin values. Since each chart had the same number of bins, this allowed us to optimize our data processing and storage, while still allowing high resolution display of data, and allowing users to easily zoom into areas of interest.
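The binning approach can be sketched as follows. The function and names here are illustrative, not Bycast's actual chart rendering classes; it assumes samples arrive sorted by timestamp.

```python
# Min-max-average binning: group raw samples into a fixed number of
# time-based bins, then keep only (min, max, average) per bin.
def bin_series(samples, num_bins):
    """samples: list of (timestamp, value) pairs, sorted by timestamp."""
    t0, t1 = samples[0][0], samples[-1][0]
    width = (t1 - t0) / num_bins or 1          # guard against a zero span
    bins = [[] for _ in range(num_bins)]
    for t, v in samples:
        i = min(int((t - t0) / width), num_bins - 1)
        bins[i].append(v)
    return [
        (min(b), max(b), sum(b) / len(b)) if b else None
        for b in bins
    ]

# 1,000 raw points reduce to 10 (min, max, avg) triples to render.
data = [(t, t % 100) for t in range(1000)]
summary = bin_series(data, 10)
assert len(summary) == 10
assert summary[0] == (0, 99, 49.5)
```

Because the number of bins is fixed regardless of the time span, rendering cost stays constant even as the underlying series grows to millions of points.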

This is best illustrated with an example:


This graph, which shows system memory usage over time, has a dark green line indicating the average memory usage, while the light shaded region shows the minimum-to-maximum range within each bin.

This allows deviations from the average to be easily identified visually. For example, one can see that memory usage dropped significantly on the evening of September 28th, and at one point, fell to almost 210 MBytes. This would not be visible with a chart that only displayed the average value, as most charts tend to do.

We've found that displaying this information is very important. Often it is the deviations from the average that are the most important to focus on, and this method of displaying them allows one to quickly identify and zoom in on the areas of interest.

Graphing Guidelines

I have long been looking for a well-documented set of guidelines for how to render graphs, but until now, I had not found anything that fit the bill.

In August 2008, Microsoft's Health Common User Interface Group released a new guidance document, titled Displaying Graphs and Tables. This document is excellent, and I would encourage everyone involved with graph-based information visualization to read it. It is exactly what I had been looking for, and is very well written.

Some Additional Graphing References

Here are some references that I have read and found useful when designing graph-based information visualizations:

Beautiful Evidence, by Edward R. Tufte
The Elements of Graphing Data, by William S. Cleveland
The Grammar of Graphics by Leland Wilkinson