2008-11-03

On Graphing Data

Over the years, I've had to do a fair bit of data analysis for network monitoring and software instrumentation. Computers, especially when monitoring themselves, can easily generate vast amounts of data, and when you are looking for a single anomaly or correlation, the only way to find it quickly is to represent the information visually. For data that changes over time, that means generating graphs.

A Little History

Back in 2000, when we started building our web-based management and monitoring tool at Bycast, we spent a lot of time looking at different third-party graphing toolkits. This was long before AJAX and mature client-side JavaScript engines, and we simply weren't able to find any that met even the following basic requirements:
  1. Low-latency chart generation
  2. Anti-aliased rendering
  3. Data binning
  4. Display of minimum, maximum and average values
Having looked at many poorly rendered charts, I set very high visual standards for our chart rendering classes. And since many of our graphs would be displaying data collected over weeks, months and years, a single graph could often require the processing of hundreds of thousands to millions of data points. As one might imagine, this was a non-trivial problem.

Min-Max-Average

Attempting to plot these values directly would never be efficient or responsive for the end user, so we grouped data points into time-based bins, calculated the minimum, maximum and average value of each bin, and rendered the bin values. Since each chart had the same number of bins, this allowed us to optimize our data processing and storage, while still allowing high resolution display of data, and allowing users to easily zoom into areas of interest.
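As a rough illustration of the binning described above, here is a minimal sketch in Python. It is not the code we used at Bycast; the names Bin and bin_samples, and the choice of half-open time ranges, are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Bin:
    """Running minimum, maximum and average for one time-based bin."""
    count: int = 0
    minimum: float = float("inf")
    maximum: float = float("-inf")
    total: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.total += value

    @property
    def average(self) -> Optional[float]:
        # An empty bin has no average; the renderer can leave a gap.
        return self.total / self.count if self.count else None


def bin_samples(
    samples: Iterable[Tuple[float, float]],
    start: float,
    end: float,
    num_bins: int,
) -> List[Bin]:
    """Group (timestamp, value) samples in [start, end) into num_bins bins."""
    if end <= start or num_bins <= 0:
        raise ValueError("need end > start and num_bins > 0")
    bins = [Bin() for _ in range(num_bins)]
    width = (end - start) / num_bins
    for timestamp, value in samples:
        if start <= timestamp < end:
            # Clamp to guard against floating-point rounding at the upper edge.
            index = min(int((timestamp - start) / width), num_bins - 1)
            bins[index].add(value)
    return bins
```

Because the number of bins is fixed, zooming in this sketch is just a second call with a narrower start and end and the same num_bins: the chart keeps the same number of plotted points, and each bin covers a shorter span of time.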

This is best illustrated with an example:


In this graph, which shows system memory usage over time, the dark green line indicates the average memory usage, and the light shaded region shows the range between the minimum and maximum values within each bin.

This allows deviations from the average to be easily identified visually. For example, one can see that memory usage dropped significantly on the evening of September 28th, and at one point, fell to almost 210 MBytes. This would not be visible with a chart that only displayed the average value, as most charts tend to do.

We've found that displaying this information is very important. Often it is the deviations from the average that are the most important to focus on, and this method of displaying them allows one to quickly identify and zoom in on the areas of interest.
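On a chart, the eye does this work, but the same idea can be sketched in code: given the minimum, maximum and average of each bin, flag the bins where an extreme strays far from the average. The function name bins_of_interest and the fixed threshold are hypothetical choices for this example, not something our tool did.

```python
from typing import List, Sequence, Tuple

# Each bin is summarized as (minimum, maximum, average).
BinSummary = Tuple[float, float, float]


def bins_of_interest(bins: Sequence[BinSummary], threshold: float) -> List[int]:
    """Return indexes of bins whose min or max is more than
    `threshold` away from the bin's average."""
    flagged = []
    for index, (minimum, maximum, average) in enumerate(bins):
        # Largest one-sided deviation from the average within this bin.
        deviation = max(average - minimum, maximum - average)
        if deviation > threshold:
            flagged.append(index)
    return flagged
```

For instance, with made-up bins (400, 420, 410), (210, 415, 400) and (405, 412, 408) and a threshold of 50, only the second bin is flagged: its average looks unremarkable, but its minimum does not.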

Graphing Guidelines

I have long been looking for a well-documented set of guidelines for how to render graphs, but until now, I had not found anything that fit the bill.

In August 2008, the Microsoft Health Common User Interface group released a new guidance document titled Displaying Graphs and Tables. The document is excellent and very well written, and I would encourage everyone involved in graph-based information visualization to read it. It is exactly what I had been looking for.

Some Additional Graphing References

Here are some references that I have read and found useful when designing graph-based information visualizations:

Beautiful Evidence, by Edward R. Tufte
The Elements of Graphing Data, by William S. Cleveland
The Grammar of Graphics, by Leland Wilkinson
