A few days ago, Facebook announced that they had reached the milestone of 10 billion images stored on their servers. This translates to over 40 billion files under management, as each image is stored at four different resolutions. In total, all of these files require 1 PB of storage space, and most likely consume 2 PB of physical storage, assuming the data is mirrored at a minimum of two sites for availability.
If we take those 40 billion files and 1 PB of storage consumed, we can work out that the average file size is around 25 KB. All of these files need to be stored, managed, distributed and replicated — exactly the solution space for an archive. And this is why I believe the Web 2.0 problem space has often been viewed as an archiving problem.
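As a quick sanity check, the average file size follows directly from the two figures already quoted (40 billion files, 1 PB of logical storage):

```python
# Back-of-the-envelope check of the average file size claimed above.
# Inputs are the article's figures, not independently verified data.
total_bytes = 1e15   # 1 PB of logical storage
total_files = 40e9   # 40 billion files (10B images x 4 resolutions)

avg_kb = total_bytes / total_files / 1e3
print(f"average file size: {avg_kb:.0f} KB")  # -> 25 KB
```

So the 25 KB figure is exact given the rounded inputs, and it confirms this is a workload dominated by very small objects.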
But the Web 2.0 problem is larger. It is not the number of objects stored that makes this an engineering challenge, but the number of objects accessed: Facebook serves 15 billion images per day, meaning they serve more files each day than the number of unique images they have stored! At peak load, they are serving over 300,000 images per second. Even though the majority of images accessed each day fall within a small (<0.01%) subset of the total image set, the images that are popular change daily, and the storage system must handle these "hot files" efficiently so that the system does not collapse under the load.
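The serving rates above are consistent with each other. A rough check, using only the article's own figure of 15 billion requests per day:

```python
# Rough consistency check of the serving rates quoted above.
requests_per_day = 15e9
seconds_per_day = 24 * 60 * 60  # 86,400

avg_rps = requests_per_day / seconds_per_day
print(f"average rate: {avg_rps:,.0f} images/s")  # ~173,611 images/s

# The quoted 300,000 images/s peak is about 1.7x the daily average,
# a plausible peak-to-mean ratio for a site with global diurnal traffic.
print(f"peak-to-average ratio: {300_000 / avg_rps:.1f}x")
```

In other words, even the *average* load is on the order of 175,000 requests per second, which is why hot-file handling, not raw capacity, dominates the design.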
Thus, while it is true that Web 2.0 sites require an archive (for all those infrequently or never-again-accessed objects), the retrieval requirements mean an archive is just one part of the infrastructure needed to serve a Web 2.0 workload.