Why Amazon's cloud Titanic went down

By David Goldman, staff writer


NEW YORK (CNNMoney) -- This was never supposed to happen.

Amazon Web Services is the Titanic of cloud hosting, designed with backups to the backups' backups that prevent hosted websites and applications from failing.

Yet, like the famous ocean liner, Amazon's cloud crashed this week, taking with it Reddit, Quora, FourSquare, Hootsuite, parts of the New York Times, ProPublica and about 70 other sites. The massive outage raised questions about the reliability of AWS and the cloud itself.

It was supposed to work like this: Thousands of companies use AWS to run their websites through a service called Elastic Compute Cloud, or EC2. Rather than hosting their sites on their own servers, these customers turn to Amazon, which essentially rents out its unused -- and highly intricate -- server capacity.

EC2 is hosted in five regions across the globe: Northern Virginia, Northern California, Ireland, Tokyo and Singapore. Within each region are multiple "availability zones," and within each availability zone are multiple "locations" or data centers.

In its AWS marketing pitch, Amazon touts the way it links together many different data centers to protect customers from isolated failures. It promises to keep customers' sites up and running 99.95% of the year, or it will shave 10% off customers' monthly bills.

That allows for downtime of just 4.4 hours a year. Some sites have been down for nearly 36 hours now.
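As a rough check on that figure, the short sketch below converts an uptime percentage into the hours of downtime it permits. It assumes a 365-day year and that the 99.95% commitment is measured across the whole year; the function name `allowed_downtime_hours` is purely illustrative and is not taken from Amazon's service agreement.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted by an uptime commitment."""
    return period_hours * (1 - uptime_percent / 100)


# 99.95% uptime leaves 0.05% of the year, or about 4.4 hours.
budget = allowed_downtime_hours(99.95)
print(round(budget, 2))

# The roughly 36 hours of downtime reported for some sites, as a
# multiple of that yearly budget.
print(round(36 / budget, 1))
```

Under those assumptions, a 36-hour outage is about eight times the downtime the commitment allows for an entire year.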

So what went wrong exactly?

Amazon (AMZN, Fortune 500) has been tight-lipped about the incident, and the company said it won't be able to fully comment on the situation until it does a "post-mortem." So it's not clear yet exactly how the problem occurred.

But bits and pieces of information from Amazon, its customers and cloud experts help to explain what happened.

Thursday's crash happened at Amazon's northern Virginia data center, located in one of its East Coast availability zones. In its status log, Amazon said that a "networking event" caused a domino effect across other availability zones in that region, in which many of its storage volumes created new backups of themselves. That filled up Amazon's available storage capacity and prevented some sites from accessing their data.

Amazon didn't say what that "networking event" was.

Doug Willoughby, director of cloud strategy at Compuware, theorized that it could be a wiring problem or a connectivity issue that brought down AWS' so-called "Elastic Block Store." EBS is essentially a network-based hard drive that allows customers to store between 1 gigabyte and 1 terabyte of data per volume.

Reddit, one of the better-known sites to go down due to the error, said it has 700 EBS volumes with Amazon. Both Reddit and Amazon are working to "re-mirror" or copy the volumes to a data center in another availability zone. But the painstaking process of moving the data and the sheer number of volumes are making the fix very lengthy.

"We always store data in multiple zones to avoid this problem," said Jeremy Edberg, senior product developer at Reddit. "The reason it went down is that it failed in multiple zones."

Sites like Quora and Reddit were able to come back online in "read-only" mode, but users couldn't post new content for many hours. Reddit only recently began inviting handfuls of random users to create new posts again.

Many experts blamed the sites themselves for crashing, saying they should have been spread out among multiple geographical regions to take full advantage of Amazon's backup systems.

"Amazon's products are only as good as the people putting the architecture up," said Michael Kirven, co-founder of cloud services provider Bluewolf. "If you put all of your eggs in one basket, you put yourself at risk."

EC2 is so simple to use -- a credit card and a few keystrokes literally get your business into the cloud -- that some experts say it can give a false sense of security. They see in Amazon customers a naive belief that nothing could possibly go wrong.

Of course, things go wrong and systems fail. Other cloud-hosted products like Google's (GOOG, Fortune 500) Gmail have gone down from time to time.

But sites like Reddit and others that crashed were simply following the instructions Amazon laid out in its service agreement, which says hosting in one region should be sufficient. Some smaller sites simply can't afford the engineering and financial resources it takes to duplicate their infrastructure in data centers all over the world.

Some sites like FourSquare took the outage in stride. The check-in service blogged that its "usually-amazing data center hosts, Amazon EC2, are having a few hiccups."

But others weren't as forgiving. BigDoor CEO Keith Smith wrote in a blog post: "If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner."

Reddit encountered a similar hiccup for about six hours last month, which prompted the company to begin migrating away from the particular product that went down on Thursday. Reddit's Edberg said the company is sticking with Amazon for now, but "we always have our eyes open for something that's superior."
