Netflix: fail constantly

[Sorry for the sporadic posting. I’ve had more travel in the past 7 weeks than the last 2 years.  I should be back to a more regular schedule soon.]

The “cloud” is still a new and curious beast for a lot of us, especially people who grew up in a more traditional hosting model.  We have several generations of IT workers who have learned everything about hosting on our own hardware and networks.  The flexibility of the cloud is a game-changer, and I’m continually learning new places where “conventional wisdom” will lead you down a difficult path.

Netflix has been kind enough to post their five key lessons from their cloud experiences on their tech blog.  While these lessons may look simple and perhaps obvious in retrospect, there are two that really hit home with me:

1. Prepare to unlearn everything you know about running applications in your own datacenters.

3. The best way to avoid failure is to fail constantly.

First, an entire generation (or maybe two or three) of system and network administrators learned all of what we know about scale and reliability by running our own applications on our own servers in our own datacenters using our own networks.  There are thousands of person-centuries of experience that have created best (or at least “good”) practice on how to be successful in this model, but this has done very little to prepare us to be successful using cloud resources.  In fact, it might even be working against us.

We’ve all got a lot to un-learn.

Second, in the olden days, uptime was king, and a high time between reboots (or crashes) was considered a mark of a capable system administrator.  Failure was to be avoided at all costs, and testing failover (or disaster recovery) was done infrequently, if at all, due to the high impact and high costs.  We did all get used to a more frequent reboot cycle, if only to be able to install all the needed security patches, but that was just a small change in focus, not a complete sea change.

In computing clouds, it is a given and an expectation that instances will fail at random, and the solution is to have an agile application, not to focus on high availability or increasing hardware reliability.  Just as there is continuous development, testing and deployment, there needs to be continuous failover testing.  Netflix created a tool (Chaos Monkey) specifically to force random failures in their production systems! That’s right, they are constantly creating failures, just to continuously test their failover methods, in the live production system.
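
To make the idea concrete, here is a toy sketch of continuous failover testing in Python. It is not Netflix's tool and does not touch any real cloud API; the `Instance`, `Pool`, and `chaos_kill` names are hypothetical, and the "fleet" is just a list in memory. The point is only the shape of the loop: kill something at random, confirm requests are still served, then restore capacity.

```python
import random


class Instance:
    """One interchangeable server in the toy fleet (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


class Pool:
    """A fixed-size fleet that routes around failed instances."""

    def __init__(self, size):
        self._counter = 0
        self.instances = [self._launch() for _ in range(size)]

    def _launch(self):
        # Stand-in for provisioning a fresh instance.
        self._counter += 1
        return Instance(f"i-{self._counter:04d}")

    def handle_request(self, rng):
        # Route only to healthy instances; fail only if none are left.
        healthy = [i for i in self.instances if i.healthy]
        if not healthy:
            raise RuntimeError("no healthy instances left")
        return rng.choice(healthy).name

    def heal(self):
        # Replace every failed instance to restore the original capacity.
        self.instances = [
            i if i.healthy else self._launch() for i in self.instances
        ]


def chaos_kill(pool, rng):
    """Mark one randomly chosen healthy instance as failed.

    Assumes at least one healthy instance exists.
    """
    victim = rng.choice([i for i in pool.instances if i.healthy])
    victim.healthy = False
    return victim.name


if __name__ == "__main__":
    rng = random.Random()
    pool = Pool(size=5)
    for _ in range(10):
        killed = chaos_kill(pool, rng)
        served_by = pool.handle_request(rng)
        # The failed instance must never be the one serving traffic.
        assert served_by != killed
        pool.heal()
    print("service stayed up through 10 forced failures")
```

In a real deployment the interesting part is everything this sketch fakes: detecting the failure, draining traffic, and launching a replacement. Forcing the failure on a schedule is what keeps those paths exercised.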

That’s a) really hardcore, b) really scary and c) really cool.

That’s one way to put your reputation on the line.  And it points out just how you need to do some very non-intuitive things, and unlearn decades of good practice to be successful in the cloud.


  1. #1 by Devin on March 22, 2011 - 9:30 am

    I think uptime is still king, but the uptime that matters has become
    decoupled from the uptime of a single piece of hardware. This trend is
    actually independent of the rise of AWS-like IaaS clouds. It stems from
    a combination of reaching the limits of single system scalability and
    the drastic cost savings enabled by using commodity hardware. If you
    run enough systems, then even a small failure rate becomes a whole number
    of systems that fail per day.

    In addition, I think the definition of uptime has changed, with
    companies like Google, Facebook, Amazon and such leading the way. Uptime
    no longer means merely the absence of “unplanned” downtime, but instead
    means no downtime at all. This shift also requires planning for failures
    and rolling upgrades.

    Netflix is in an interesting middle ground, where their fleet is large
    enough that failure is not rare, but not so large that it’s common.
    Hence the “chaos monkey” (which is a brilliant idea for any service in a
    similar place on the failure curve).

    That being said, you are correct that “cloud” forces this change on an
    IT shop. Because it takes advantage of the less reliable commodity
    hardware, and because the IT team now has less control over change (and
    more to the point, the timing of such) in the environment. The mere act of
    moving your service to the cloud forces you to plan for failure more.

    -devin
