Archive for March, 2011

Netflix: fail constantly

[Sorry for the sporadic posting. I’ve had more travel in the past 7 weeks than the last 2 years.  I should be back to a more regular schedule soon.]

The “cloud” is still a new and curious beast for a lot of us, especially people who grew up in a more traditional hosting model.  We have several generations of IT workers who have learned everything about hosting on our own hardware and networks.  The flexibility of the cloud is a game-changer, and I’m continually learning new places where “conventional wisdom” will lead you down a difficult path.

Netflix has been kind enough to post their five key lessons from their cloud experiences on their tech blog.  While these lessons may look simple and perhaps obvious in retrospect, there are two that really hit home with me:

1. Prepare to unlearn everything you know about running applications in your own datacenters.

3. The best way to avoid failure is to fail constantly.

First, an entire generation (or maybe two or three) of system and network administrators learned all of what we know about scale and reliability by running our own applications on our own servers in our own datacenters using our own networks.  There are thousands of person-centuries of of experience that have created best (or at least “good”) practice on how to be successful in this model, but this has done very little to prepare us to be successful using cloud resources.  In fact, it might even be working against us.

We’ve all got a lot to un-learn.

Second, in the olden days, uptime was king, and a high time between reboots (or crashes) was considered a mark of a capable system administrator.  Failure was to be avoided at all costs, and testing failover (or disaster recovery) was done infrequently, if at all, due to the high impact and high costs.  We did all get used to a more frequent reboot cycle, if only to be able to install all the needed security patches, but that was just a small change in focus, not a complete sea change.

In computing clouds, it is a given and an expectation that instances will fail at random, and the solution is to have an agile application, not to focus on high availability or increasing hardware reliability.  Just as there is continuous development, testing and deployment, there needs to be continuous failover testing.  Netflix created a tool (Chaos Money) specifically to force random failures in their production systems! That’s right, they are constantly creating failures, just to continuously test their failover methods, in the live production system.

That’s a) really hardcore, b) really scary and c) really cool.

That’s one way to put your reputation on the line.  And it points out just how you need to do some very non-intuitive things, and unlearn decades of good practice to be successful in the cloud.

, ,

1 Comment

%d bloggers like this: