2009-01-29

Software Development and Evolutionary Biology

I've finally figured out why, over time, open source will become the dominant form of software development:
  • Open source is, well, open. Developers can freely share knowledge, experiences, mistakes and achievements without restrictions.
  • Open source projects cross-pollinate, exchange code and rapidly change over time.
  • Open source projects tend to have significant diversity, with multiple competing projects that do the same or similar things.
From an evolutionary biology standpoint, we have more frequent sharing of genetic code, higher mutation rates, and broader diversity. Thus, these projects have a better chance of survival.

Contrast this to closed source development projects:
  • Closed source is still open, in that people take their experiences, skills and knowledge with them when they move from company to company. But instead of being shared on an hourly basis, information tends to be shared on a time scale of years, and to a far lesser degree.
  • Closed source projects resist "contamination" of their code base, and try to prevent any leakage of their code to the outside world. They also try to make the minimum number of changes necessary in order to maximize return on investment.
  • Closed source projects tend to avoid markets where there is significant diversity. Interestingly, when there is competition, the products involved tend to be far better than in areas that are dominated and controlled by one or two vendors.
As a consequence, closed source projects tend to evolve slowly, become rigid and inflexible, and are unable to adapt to rapidly changing environments.

When you mix open source and closed source together in an ecosystem, interesting things happen. One is that the competitive pressure from open source pushes closed source projects to act more like open source projects. Another is that closed source projects initially tend to be parasitic, using open source without contributing much back.

There are exceptions to all of these generalizations, but this feels about right to me. It would be fascinating to see the results of research that looks at software development from the viewpoint of evolutionary biology.

Jump Starting Off-Site Storage

During a discussion about Cloud Storage on the StorageMojo blog, Pete Steege asked about the challenges of the initial load of data into a cloud storage provider.

How are storage cloud companies handling the “first backup” issue? Multiple terabytes or petabytes that need to be migrated to the cloud initially?

The incremental part of the process is a no-brainer.

One solution that we use at Bycast is to deploy two or more edge servers with attached storage at the customer's premises, and allow them to perform bulk ingest over the REST API (or via CIFS/NFS). When the ingest is complete or nearly complete, the object storage repository on disk can be physically shipped to be integrated into the "cloud", and subsequent transactions can be performed over the network.
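As a rough illustration of what that bulk ingest step can look like from the customer's side, here's a minimal sketch that walks a directory tree and PUTs each file to an on-premises edge server over HTTP. The host name, container name, port and checksum header are hypothetical placeholders, not the actual StorageGRID API.

```python
import base64
import hashlib
import os

import requests  # third-party HTTP client

EDGE_SERVER = "https://edge01.example.local:8443"  # hypothetical on-premises edge server
CONTAINER = "bulk-ingest"                          # hypothetical container/bucket name


def ingest_directory(root):
    """Walk a local directory tree and PUT each file to the edge server over REST."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                body = f.read()
            key = os.path.relpath(path, root).replace(os.sep, "/")
            resp = requests.put(
                f"{EDGE_SERVER}/{CONTAINER}/{key}",
                data=body,
                # Content-MD5 lets the server verify the payload on arrival.
                headers={"Content-MD5": base64.b64encode(hashlib.md5(body).digest()).decode()},
                verify=False,  # self-signed certificates are common on a LAN edge server
            )
            resp.raise_for_status()
            print(f"ingested {key} ({len(body)} bytes)")


if __name__ == "__main__":
    ingest_directory("/data/to_migrate")
```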

Another approach, taken by EMC's Mozy backup service, is what they call "data seeding". Here, you purchase a 2 TB USB drive from EMC, store your data onto the drive, then ship it back to EMC to get the process going. I couldn't find any current references to this capability on their web site, so this may no longer be a supported feature.

With such hybrid models, you need the software intelligence to ensure that the data is always accessible via the cloud API, always protected from loss, corruption and unauthorized disclosure, and audited so that you know that everything you ingested locally actually made it into the cloud.
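That audit step can be as simple as reconciling a manifest written during local ingest against a listing of what the cloud side actually holds. A sketch, where both the manifest format and the cloud-side index are assumptions for illustration rather than any vendor's API:

```python
import json


def audit(manifest_path, cloud_index):
    """Flag objects that are missing or have mismatched checksums in the cloud.

    manifest_path: JSON file of {object_key: checksum} written during bulk ingest.
    cloud_index:   dict of {object_key: checksum} built by listing the cloud repository.
    """
    with open(manifest_path) as f:
        local = json.load(f)

    missing = [k for k in local if k not in cloud_index]
    mismatched = [k for k, digest in local.items()
                  if k in cloud_index and cloud_index[k] != digest]
    return missing, mismatched
```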

Despite the associated complexity, this is a very powerful approach, as one should never underestimate the bandwidth of a 747 full of disks or tapes.
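To put some hedged numbers behind the adage (the payload size, transit time and link speed below are all assumptions, not measurements):

```python
# Back-of-the-envelope comparison: shipping disks vs. pushing bits over a WAN.
payload_tb = 500                  # assume 500 TB of disks in the shipment
transit_hours = 48                # assume two days door to door, including handling
wan_mbps = 100                    # assume a dedicated 100 Mb/s link

# Effective bandwidth of the shipment, in Gb/s.
shipment_gbps = payload_tb * 8 * 1000 / (transit_hours * 3600)
# Time to move the same payload over the WAN, in days.
wan_days = payload_tb * 8 * 1e6 / wan_mbps / 86400

print(f"effective shipment bandwidth: {shipment_gbps:.1f} Gb/s")   # ~23 Gb/s
print(f"same payload over the WAN:    {wan_days:.0f} days")        # ~460 days
```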

2009-01-23

When to use S3

In response to my previous post, jeredfloyd of Permabit asked about when S3 would be useful as storage for our customers.

Do you feel S3 has the reliability and availability for your customers today? I love the concept, but I've so far been scared off by horror stories of downtime. Also, what about security concerns?

These are good questions, and I'm going to elaborate on these concerns and on where we see S3 providing value to our customers.

The Bottom Line

I wouldn't use or recommend S3 for anything other than a low-grade secondary replica location for redundancy purposes. Having said that, the levels of reliability and accessibility that I've seen are already higher than what my experiences have been with tape libraries.

Bring Your Own Security

From a security standpoint, I wouldn't put anything on S3 that hasn't been encrypted and wrapped with an integrity verification layer, as we do in StorageGRID. And if the data is encrypted, there is less of a concern about deleting it if you can't get to it any more. Just throw away the keys.
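As a generic sketch of that kind of wrapping (not the StorageGRID wire format), an authenticated cipher such as AES-GCM gives you both confidentiality and tamper detection in one pass; shown here with the third-party cryptography package, with the object ID bound in as associated data:

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography


def wrap_object(key: bytes, object_id: bytes, plaintext: bytes) -> bytes:
    """Encrypt and integrity-protect an object before it leaves the premises.

    AES-GCM provides confidentiality plus an authentication tag, so tampering
    or corruption on the provider's side is detected on read-back. The key
    must be 16, 24 or 32 bytes and never leaves your side.
    """
    nonce = os.urandom(12)                       # unique per stored object
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, object_id)
    return nonce + ciphertext                    # keep the nonce with the blob


def unwrap_object(key: bytes, object_id: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    # Raises cryptography.exceptions.InvalidTag if anything was altered.
    return AESGCM(key).decrypt(nonce, ciphertext, object_id)
```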

Since you can't implement a secure wipe using their API (overwriting an object gives no guarantee that the old bits are actually gone), you would also want to be sure that you're not storing truly sensitive information there, even with today's standard encryption algorithms and key strengths.

S3 Isn't Inexpensive

One of the things that I want to emphasize is that based on our analysis of their economics, if you are storing data for long periods of time, it's far cheaper to just add storage nodes with SATA shelves.

Tape isn't cheaper until you're looking at 50+ TB libraries. For infrequently accessed data and redundancy copies (you need to make more of them on tape, since it's not as reliable as disk), tape quickly becomes very economical for large-capacity deployments.
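For a feel of the long-term economics, here's a back-of-the-envelope comparison. The SATA figure is a placeholder, the S3 figure is roughly the 2009 list price, and it ignores bandwidth, request, power and administration costs on both sides:

```python
# Rough long-term cost comparison, using assumed (not quoted) prices.
s3_per_gb_month = 0.15       # assumed S3 storage price, USD per GB per month
sata_per_tb_capex = 1000.0   # assumed all-in cost per usable TB of SATA shelf capacity, USD
years = 3                    # typical amortization period for in-house hardware
tb_stored = 50

s3_cost = tb_stored * 1000 * s3_per_gb_month * 12 * years
sata_cost = tb_stored * sata_per_tb_capex

print(f"S3 for {tb_stored} TB over {years} years:  ${s3_cost:,.0f}")    # ~$270,000
print(f"Owned SATA for {tb_stored} TB (capex):     ${sata_cost:,.0f}")  # ~$50,000
```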

Despite this, S3 Still Has Value

Having said this, even with these concerns, I see several situations where S3 support brings real value for our customers:
  1. If you're really small (less than 50 TB), adding storage capacity is still pretty expensive as a percentage of your yearly budget, because our customers typically add capacity in 10 TB or larger increments. Using S3 as an overflow pool (keeping one or two copies locally on disk, and using S3 as your second or third copy; see the placement sketch below) lets you defer that purchase for a little while, and when you do make that purchase, you can automatically migrate all the data off S3 onto your new storage resource.
  2. If purchasing hardware takes too long, your budgetary cycle for capital purchases is too slow, or an unexpected load leaves no time to provision more storage, you can shift second or third copies off onto S3 to free up space, and expense it to the business as an operating or project cost.
  3. If you have a short-term storage need, and don't want to invest in hardware yet, just put it off onto S3. It will cost a little more per TB, but since you wouldn't be able to amortize the cost of in-house hardware across its typical three-year lifespan, it works out to be cheaper in the end.
  4. If you're almost full, you've ignored the alarms telling you that you don't have enough space on other nodes to repair your storage redundancy if you lose a node, and you don't have any storage ready to replace a failed node, S3 would be a good "last resort" option for creating new replicas to restore your desired level of redundancy.
So, to summarize, I'd use it as a short-term storage resource to defer capital costs, as a short-term emergency resource to keep you going, and for storage of short-term data. And in all cases, I wouldn't have the only copy in the grid on S3.
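Here's the placement sketch referenced in use case 1: a toy decision function for when to spill a second copy onto S3 instead of local disk. The tier names and threshold are invented for illustration; a real grid would drive this from its ILM placement rules rather than a hard-coded check.

```python
def place_replicas(object_size, local_free, local_capacity, overflow_threshold=0.90):
    """Decide where an object's replicas go when S3 is available as an overflow tier."""
    placements = ["local-disk"]              # always keep at least one copy in-house
    used_after = (local_capacity - local_free + object_size) / local_capacity
    if used_after < overflow_threshold:
        placements.append("local-disk-2")    # second copy still fits on site
    else:
        placements.append("s3-overflow")     # spill over and defer the hardware purchase
    return placements
```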

Based on these use-cases, it would be of most value to smaller IT shops with smaller systems. As you get into larger archives and storage systems (200+ TB), many of these situations will never come up.

Regardless of your size, having S3 as a choice of storage tier gives administrators another tool to handle different situations, and that flexibility can be quite useful. Ultimately, it's up to them to decide whether the costs (and bandwidth usage) make sense for them.

Some Notes on Amazon S3

During our recent meetings, there were a fair number of questions and discussions about the economics of public cloud storage providers, such as Amazon's S3 service.

This YCombinator discussion has lots of good information about pricing, usage and experiences of some of S3's supporters and detractors. It's well worth reading.

Interestingly, thanks to a new user-space S3 FUSE file-system module, Bycast has pretty much everything we need to provide an S3 tier of storage to our StorageGRID customers. Of course, such a capability would need to be productized so that an administrator has a place to configure the tier and securely enter and store their S3 credentials through our administrative interface, but with the filesystem virtualizing the S3 API, all the hard work is already done.
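The reason the FUSE approach does most of the work is that the S3 bucket just looks like a directory. A tiny sketch, assuming an s3fs-style module is already mounted at a hypothetical mount point:

```python
import os
import shutil

S3_MOUNT = "/mnt/s3-tier"   # hypothetical mount point for an s3fs-style FUSE module


def copy_to_s3_tier(local_path, object_key):
    """Copy an object to the S3-backed tier as if it were a local directory.

    The FUSE module translates the file operations into S3 requests, so the
    existing file-based tiering code needs no S3-specific logic.
    """
    dest = os.path.join(S3_MOUNT, object_key)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copyfile(local_path, dest)
    return dest
```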

Cloud Storage Protocol Standardization

During one of the panel discussions at the SNIA Cloud Storage Summit, the topic of why standards for data exchange and system management would be beneficial for cloud storage came up. While there are many different advantages, one of the areas I spent a few minutes talking about was the development efficiency that is realized as a result of standards.

When a protocol is standardized and adopted by multiple vendors as a way to connect systems or subsystems, the following things start to emerge:
  1. A formal protocol specification
  2. Web pages describing the protocol
  3. Books about or with chapters about the protocol
  4. Example open-source implementations
  5. Standard interface libraries
  6. Conformance test suites
  7. Benchmarking suites
  8. Protocol analysers and recorders
In essence, an ecosystem starts to emerge around the protocol, and many small companies and individuals build expertise and tools that enable the rapid uptake of the protocol.

It's been my observation that software developers and architects tend to have a major say in the selection of protocols, especially for subsystem interconnects, and that they tend to choose the protocol that makes their lives easiest. Thus, protocols that have all of these resources widely and inexpensively available quickly become the protocols of choice, resulting in a continued upward spiral of adoption, experience, tools and systems.

We're starting to see this with XAM, with #1, #4 and #5 already available, and #2 and #6 in progress. And I'm sure that somewhere out there, someone's writing a book about XAM, or at least a chapter about it.

In the cloud storage arena, Amazon's S3 service has such a strong lead with its S3 HTTP protocol that many of these resources have already been built, despite it being a proprietary protocol. Most other cloud storage service providers have built similar HTTP protocols, but with the IP ownership questions around Amazon's protocol still up in the air, there is a fair bit of uncertainty about whether their protocol will ever be usable with anything other than S3.
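For context, the heart of Amazon's protocol is plain HTTP plus an HMAC-SHA1-signed Authorization header. A simplified sketch of the 2009-era signing scheme (omitting x-amz- headers, sub-resources and query-string authentication; see Amazon's documentation for the full canonicalization rules):

```python
import base64
import hashlib
import hmac
from email.utils import formatdate


def sign_s3_request(secret_key, access_key, verb, bucket, key,
                    content_type="", content_md5=""):
    """Build the Date and Authorization headers for an S3 REST request."""
    date = formatdate(usegmt=True)
    # StringToSign: verb, MD5, type, date, then the canonicalized resource path.
    string_to_sign = "\n".join([verb, content_md5, content_type, date,
                                f"/{bucket}/{key}"])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    return {"Date": date, "Authorization": f"AWS {access_key}:{signature}"}


# e.g. headers = sign_s3_request(SECRET, ACCESS, "GET", "my-bucket", "photos/cat.jpg")
```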

Which leads us back to the need for standardized protocols.

At the SNIA Winter Symposium

I've been busy attending the SNIA Winter Symposium this week. In addition to the usual XAM workgroup meetings, I've also been participating in the Cloud Storage Summit.

On Wednesday, the organizers were gracious enough to let me present a quick talk on Private Storage Clouds, where I covered the economic drivers behind cloud storage and the differences between public and private clouds, described how Bycast's StorageGRID software allows the creation of private clouds, and discussed some examples of customers where we are deployed and in production.

While I didn't have as much time as I would have liked, I was able to cover a majority of the points that I had planned to discuss, and the session was well received.

It's interesting to see how much surprise there was regarding how our system is being used by customers. Robin Harris of the Data Mobility Group called us the "most surprising company" based on what we have been doing. In contrast to most of the cloud storage deployments, many of our customers are placing mission critical data on our storage system, either for archive or primary storage, and they have no backups outside of the redundancy provided as an intrinsic part of StorageGRID.

During some of the other sessions, the data being stored on many other cloud deployments was described as being "data that could be lost", or "the garbage dumpster".

Thanks to an innovative use of WebEx, you can view all of the talks online at the cloud storage sessions page. I'd encourage you to have a look at these, as there is quite a lot of interesting material here, especially the views of the analysts from the Thursday sessions.