Extraordinary Concurrency as the Only Game in Town

DARPA released a new report today providing a detailed analysis of the challenges involved in scaling high-performance computing up to exascale levels, three orders of magnitude higher than today's petascale supercomputers.

The report is available for download at the URL below:

TR2008-13: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems

This report covers many fascinating issues, ranging from advancements in semiconductor technologies to interconnects to packaging and all the way up the stack to the software that would run on such a computing system.

What jumped out at me is the degree to which concurrent programming models have become critical to scaling computing, and the degree to which this is still an open problem. Single-threaded programming has reached the end of the performance road, and the only way to improve is to parallelize. This agrees directly with the research in concurrent programming and runtime systems that I have been conducting for the past decade, and I'll be sharing more thoughts related to my findings and future research over the coming months.

Resilient Exascale Systems

A computing system with millions of loosely coupled processing elements running billions of concurrent processes will be continuously encountering failures at the hardware, interconnect and software levels. A successful software environment must automatically and transparently recover from such faults without passing the burden of complexity on to the programmer.

Ultimately, the only successful approach will be a dispersal-driven replicated checkpoint-and-vote architecture which allows a trade-off to be made between reliability and efficiency. This approach offloads the complexity of error detection and recovery from the programmer, and allows the system to automatically optimize system efficiency (a higher voting degree trades compute efficiency for time efficiency, and a lower voting degree trades time efficiency for compute efficiency).
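To make the voting idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not a design from the report: the names `vote`, `run_replicated`, and `step` are hypothetical, replicas run in a simple loop rather than on separate hardware, and faults are injected by hand through the `faulty` argument. Each step starts from the last agreed checkpoint, runs on `degree` replicas, and the majority result becomes the next checkpoint.

```python
from collections import Counter

def vote(results):
    """Return the strict-majority result, or signal that a rollback is needed."""
    winner, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no majority: roll back to the last checkpoint")
    return winner

def run_replicated(task, state, degree, faulty=()):
    """Run `task` on `degree` replicas, all starting from the same checkpoint.

    `faulty` is a hypothetical fault-injection hook: replicas whose index
    appears in it return a corrupted value instead of the real result.
    """
    results = []
    for replica in range(degree):
        value = task(state)
        if replica in faulty:
            value = ("corrupt", replica)  # simulated hardware or software fault
        results.append(value)
    return vote(results)

def step(state):
    # Stand-in for one unit of real work between checkpoints.
    return state + 1

checkpoint = 41
# One of three replicas fails; the other two outvote it.
checkpoint = run_replicated(step, checkpoint, degree=3, faulty={1})
print(checkpoint)
```

The programmer writes only `step`. The voting degree is a tuning knob for the system: more replicas cost more compute per step but make a rollback less likely.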

The presence of such a resiliency layer also allows the use of fabrication techniques that have far higher defect densities and error rates than traditional system fabrication, as near-perfection is no longer required. This is key to enabling feature size and voltage to continue to be reduced, to providing high-density arrays of compute elements, and to supporting larger dies and stacked wafers.

Storage for Exascale Systems

One of the most fundamental changes we need to make is to stop separating storage resources from computing resources. CPU power is continuing to increase, and storage density is continuing to increase, but the bandwidth and latency between the two are not improving at the same pace.

Instead of moving the data to the compute element, we need to move the computation to the data. The first step is to replace the traditional demand-paged memory hierarchy with generic compute elements that can act as programmable data movers, create vastly larger register stores, and consider such data movement as a subset of a more general message passing approach. Instead of just sending data to fixed computational processes, we need to turn processes into messages themselves, thus allowing processes to also be sent to the location(s) where data is stored.
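As a rough sketch of what "processes as messages" could look like, the Python below routes a small computation to whichever node holds the data it needs, so that only the result travels back. Everything here is an assumption made for illustration: `ProcessMessage`, `Node`, and `Fabric` are hypothetical names, the nodes live in a single process, and routing is a linear search rather than a real interconnect.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class ProcessMessage:
    key: str                       # names the data the computation needs
    compute: Callable[[Any], Any]  # the "process" being shipped to the data

@dataclass
class Node:
    name: str
    store: Dict[str, Any] = field(default_factory=dict)

    def receive(self, msg: ProcessMessage) -> Any:
        # Run the shipped computation next to the locally stored data.
        return msg.compute(self.store[msg.key])

class Fabric:
    """Hypothetical message fabric that routes a process to the data's owner."""

    def __init__(self, nodes: List[Node]):
        self.nodes = nodes

    def send(self, msg: ProcessMessage) -> Any:
        for node in self.nodes:
            if msg.key in node.store:
                return node.receive(msg)
        raise KeyError(msg.key)

nodes = [
    Node("n0", {"a": list(range(1000))}),
    Node("n1", {"b": [2.0, 4.0, 6.0]}),
]
fabric = Fabric(nodes)

# The 1000-element dataset never moves; only the single total comes back.
total = fabric.send(ProcessMessage("a", sum))
print(total)
```

In a real system the shipped process would be serialized state rather than a Python callable, but the routing decision is the same: the message goes to where the data lives.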

Ultimately, persistent storage (for checkpoints/state) and archival storage need to be automated and under the control of dedicated system software, not specified as part of the application software. Checkpoint information needs to be streamed to higher-density storage, as does the parallel I/O used to read and write the datasets being processed. This allows us to move to a model where storage is just process state (and thus, ultimately, just messages), and files as we know them today are just views into the computing system, much like reports are views into a database.
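One way to picture "files as views" is the small Python sketch below. It assumes, purely for illustration, that system-managed storage is an append-only log of state-update messages; `StateLog` and `file_view` are hypothetical names, and JSON is just a convenient rendering. The "file" is not stored anywhere: it is generated on demand from the current process state.

```python
import json

class StateLog:
    """Append-only log of state-update messages (stand-in for system-managed storage)."""

    def __init__(self):
        self._messages = []

    def append(self, key, value):
        self._messages.append((key, value))

    def replay(self):
        # Rebuild the current state by applying every message in order.
        state = {}
        for key, value in self._messages:
            state[key] = value
        return state

def file_view(log, keys):
    """Render a 'file' on demand as a view over the current state."""
    state = log.replay()
    return json.dumps({k: state[k] for k in keys if k in state}, sort_keys=True)

log = StateLog()
log.append("temperature", 280.5)
log.append("pressure", 101.3)
log.append("temperature", 281.0)  # a later message supersedes the earlier one

report = file_view(log, ["temperature"])
print(report)
```

Different callers can ask for different views of the same log, in the same way that several reports can be drawn from one database.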
