Web 2.0 and Archiving

A few days ago, Facebook announced that they had reached the milestone of having 10 billion images stored on their servers. This translates to over 40 billion files under management, as each image is stored at four different resolutions. In total, all of these files require 1 PB of storage space, and most likely consume 2 PB of physical storage, assuming their data is mirrored at a minimum of two sites for availability purposes.

If we take those 40 billion files and 1 PB of storage consumed, we can figure out that the average file size is around 25 Kbytes. All of these files need to be stored, managed, distributed and replicated: exactly the solution space for an archive. And this is why I believe that the Web 2.0 problem space has often been viewed as an archiving problem.
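As a quick sanity check on those numbers, here is a minimal Python sketch of the arithmetic. It assumes decimal units (1 PB = 10**15 bytes) and the two-site mirroring guessed at above. The variable names are purely illustrative.

```python
# Figures quoted in the post. Decimal units (1 PB = 10**15 bytes) are an assumption.
IMAGES = 10 * 10**9            # unique images stored
RESOLUTIONS = 4                # each image is kept at four resolutions
LOGICAL_BYTES = 1 * 10**15     # 1 PB of storage space
SITES = 2                      # assumed mirroring across two sites

files = IMAGES * RESOLUTIONS
avg_file_bytes = LOGICAL_BYTES / files
physical_bytes = LOGICAL_BYTES * SITES

print(f"files under management: {files:,}")
print(f"average file size: {avg_file_bytes / 1000:.0f} KB")
print(f"physical storage: {physical_bytes / 10**15:.0f} PB")
```

With binary units (1 PB = 2**50 bytes) the average comes out slightly higher, at roughly 28,000 bytes per file, so "around 25 Kbytes" holds either way.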

But the Web 2.0 problem is larger. It is not the number of objects stored that makes this an engineering challenge, but the number of objects accessed: Facebook serves 15 billion images per day, which means they serve more files each day than the number of unique images they have stored! At peak load, they are serving over 300,000 images per second. Even though a majority of the images accessed each day fall in a small (<0.01%) subset of the total image set, the images that are popular change daily, and the storage system must handle these "hot files" efficiently in order for the system not to collapse under the load.
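To put the access numbers in perspective, here is a rough Python sketch of the same kind of back-of-the-envelope math. It assumes the "total image set" means the 10 billion unique images, and it treats 0.01% as an upper bound on the hot subset. The names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

IMAGES_STORED = 10 * 10**9          # unique images (assumed to be the "total image set")
IMAGES_SERVED_PER_DAY = 15 * 10**9  # images served per day
PEAK_PER_SECOND = 300_000           # quoted peak load
HOT_DENOMINATOR = 10_000            # 0.01% = 1 in 10,000 (upper bound on the hot subset)

avg_per_second = IMAGES_SERVED_PER_DAY / SECONDS_PER_DAY
peak_to_avg = PEAK_PER_SECOND / avg_per_second
hot_images = IMAGES_STORED // HOT_DENOMINATOR

print(f"average serving rate: {avg_per_second:,.0f} images/s")
print(f"peak-to-average ratio: {peak_to_avg:.2f}")
print(f"hot subset: fewer than {hot_images:,} images")
```

Under these assumptions the average rate is a bit under 175,000 images per second, so the quoted peak is less than twice the average, and the hot subset is under a million images on any given day.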

Thus, while it is true that Web 2.0 sites require an archive (for all those objects that are accessed infrequently or never again), the retrieval requirements mean that an archive is just one part of the infrastructure required to serve a Web 2.0 workload.


damien said...

Hi David,

I actually had some very similar thoughts when I read that Facebook blog entry last month. Do you have any idea what they're using for archiving? S3 like some other web sites, or a homegrown solution? Just curious. I know they're not using Evercore. ;)

-- Damien

David Slik said...

I suspect that they have not yet reached the point where they are managing their "cooler" data in a separate system.

Given how new their photo hosting service is, they most likely have been focusing on scalability and performance, rather than cost optimization.

As Web 2.0 companies come out of the "crazy growth" phase, intelligent tiering and archival technologies will become increasingly important for them. At that point, they will make the build vs. buy decision.

Having said that, given the magnitude of many of these companies' storage problems and the internal technical resources available to them, I suspect that most of the larger players (Google, etc.) will opt to build their own systems, at least until the market matures.

After all, it is always easier to build a purpose-specific system tightly coupled to your business, as opposed to creating a generic storage and archiving product that will meet the needs of many different types of customers.