Why Amazon's cloud Titanic went down

By David Goldman, staff writer


NEW YORK (CNNMoney) -- This was never supposed to happen.

Amazon Web Services is the Titanic of cloud hosting, designed with backups to the backups' backups that prevent hosted websites and applications from failing.

Yet, like the famous ocean liner, Amazon's cloud crashed this week, taking with it Reddit, Quora, FourSquare, Hootsuite, parts of the New York Times, ProPublica and about 70 other sites. The massive outage raised questions about the reliability of AWS and the cloud itself.

It was supposed to work like this: Thousands of companies use AWS to run their websites through a service called Elastic Compute Cloud, or EC2. Rather than hosting their sites on their own servers, these customers turn to Amazon, which essentially rents out its unused -- and highly intricate -- server capacity.

EC2 is hosted in five regions across the globe: Northern Virginia, Northern California, Ireland, Tokyo and Singapore. Within each region are multiple "availability zones," and within each availability zone are multiple "locations" or data centers.

In its AWS marketing pitch, Amazon touts the way it links together many different data centers to protect customers from isolated failures. It promises to keep customers' sites up and running 99.95% of the year, or it will shave 10% off customers' monthly bills.

That allows for downtime of just 4.4 hours a year. Some sites have been down for nearly 36 hours now.
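The arithmetic behind that figure can be sketched in a few lines of Python. This is a rough check only: the function names are illustrative, and the sketch assumes the 99.95% commitment is measured over a 365-day year, which may not match how Amazon's agreement defines the measurement period.

```python
# Rough check of the downtime budget implied by an uptime commitment.
# Assumption: availability is measured over a 365-day (non-leap) year.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Hours of downtime per year permitted by an uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def achieved_uptime_percent(downtime_hours: float) -> float:
    """Yearly uptime percentage after a given number of hours offline."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


if __name__ == "__main__":
    # Downtime budget for the 99.95% commitment cited in the article.
    print(f"Allowed downtime: {allowed_downtime_hours(99.95):.2f} hours")
    # Yearly uptime for a site that was offline for 36 hours.
    print(f"Uptime after a 36-hour outage: {achieved_uptime_percent(36):.2f}%")
```

Under these assumptions, 99.95% works out to roughly 4.4 hours a year, and a single 36-hour outage would pull a site's yearly uptime below that threshold.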

So what went wrong exactly?

Amazon (AMZN, Fortune 500) has been tight-lipped about the incident, and the company said it won't be able to fully comment on the situation until it does a "post-mortem." So it's not clear yet exactly how the problem occurred.

But bits and pieces of information from Amazon, its customers and cloud experts help to explain what happened.

Thursday's crash happened at Amazon's northern Virginia data center, located in one of its East Coast availability zones. In its status log, Amazon said that a "networking event" caused a domino effect across other availability zones in that region, in which many of its storage volumes created new backups of themselves. That filled up Amazon's available storage capacity and prevented some sites from accessing their data.

Amazon didn't say what that "networking event" was.

Doug Willoughby, director of cloud strategy at Compuware, theorized that it could be a wiring problem or a connectivity issue that brought down AWS' so-called "Elastic Block Store." EBS is essentially a network-based hard drive that allows customers to store between 1 gigabyte and 1 terabyte of data per volume.

Reddit, one of the better-known sites to go down due to the error, said it has 700 EBS volumes with Amazon. Both Reddit and Amazon are working to "re-mirror" or copy the volumes to a data center in another availability zone. But the painstaking process of moving the data and the sheer number of volumes are making the fix very lengthy.

"We always store data in multiple zones to avoid this problem," said Jeremy Edberg, senior product developer at Reddit. "The reason it went down is that it failed in multiple zones."

Sites like Quora and Reddit were able to come back online in "read-only" mode, but users couldn't post new content for many hours. Reddit only recently began inviting handfuls of random users to create new posts again.

Many experts blamed the sites themselves for crashing, saying they should have been spread out among multiple geographical regions to take full advantage of Amazon's backup systems.

"Amazon's products are only as good as the people putting the architecture up," said Michael Kirven, co-founder of cloud services provider Bluewolf. "If you put all of your eggs in one basket, you put yourself at risk."

EC2 is so simple to use -- a credit card and a few keystrokes literally get your business into the cloud -- that some experts say it can give a false sense of security. They see in Amazon customers a naive belief that nothing could possibly go wrong.

Of course, things go wrong and systems fail. Other cloud-hosted products like Google's (GOOG, Fortune 500) Gmail have gone down from time to time.

But sites like Reddit and others that crashed were simply following the instructions Amazon laid out in its service agreement, which says hosting in one region should be sufficient. Some smaller sites simply can't afford the engineering and financial resources it takes to duplicate their infrastructure in data centers all over the world.

Some sites like FourSquare took the outage in stride. The check-in service blogged that its "usually-amazing data center hosts, Amazon EC2, are having a few hiccups."

But others weren't as forgiving. BigDoor CEO Keith Smith wrote in a blog post: "If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner."

Reddit encountered a similar hiccup for about six hours last month, prompting the company to begin migrating away from the particular product that went down on Thursday. Reddit's Edberg said the company is sticking with Amazon for now, but "we always have our eyes open for something that's superior."
