Why Amazon's cloud Titanic went down

By David Goldman, staff writer
April 22, 2011: 5:37 PM ET

NEW YORK (CNNMoney) -- This was never supposed to happen.

Amazon Web Services is the Titanic of cloud hosting, designed with backups to the backups' backups that prevent hosted websites and applications from failing.

Yet, like the famous ocean liner, Amazon's cloud crashed this week, taking with it Reddit, Quora, FourSquare, Hootsuite, parts of the New York Times, ProPublica and about 70 other sites. The massive outage raised questions about the reliability of AWS and the cloud itself.

It was supposed to work like this: Thousands of companies use AWS to run their websites through a service called Elastic Compute Cloud, or EC2. Rather than hosting their sites on their own servers, these customers turn to Amazon, which essentially rents out its unused -- and highly intricate -- server capacity.

EC2 is hosted in five regions across the globe: Northern Virginia, Northern California, Ireland, Tokyo and Singapore. Within each region are multiple "availability zones," and within each availability zone are multiple "locations" or data centers.

In its AWS marketing pitch, Amazon touts the way it links together many different data centers to protect customers from isolated failures. It promises to keep customers' sites up and running 99.95% of the year, or it will shave 10% off customers' monthly bills.

That allows for downtime of just 4.4 hours a year. Some sites have been down for nearly 36 hours now.
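The 4.4-hour figure follows from simple arithmetic on the 99.95% commitment. Below is a minimal sketch in Python, assuming a 365-day year; the function name is illustrative and not part of any Amazon tooling.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year, 8,760 hours


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period at the given uptime percentage."""
    return period_hours * (100.0 - uptime_percent) / 100.0


budget = allowed_downtime_hours(99.95)
print(f"Allowed downtime per year: {budget:.2f} hours")
print(f"A 36-hour outage is {36 / budget:.1f}x that allowance")
```

Under these assumptions the yearly allowance works out to roughly 4.4 hours, so an outage of nearly 36 hours uses up about eight years' worth of permitted downtime.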

So what went wrong exactly?

Amazon (AMZN, Fortune 500) has been tight-lipped about the incident, and the company said it won't be able to fully comment on the situation until it does a "post-mortem." So it's not clear yet exactly how the problem occurred.

But bits and pieces of information from Amazon, its customers and cloud experts help to explain what happened.

Thursday's crash happened at Amazon's northern Virginia data center, located in one of its East Coast availability zones. In its status log, Amazon said that a "networking event" caused a domino effect across other availability zones in that region, in which many of its storage volumes created new backups of themselves. That filled up Amazon's available storage capacity and prevented some sites from accessing their data.

Amazon didn't say what that "networking event" was.

Doug Willoughby, director of cloud strategy at Compuware, theorized that it could be a wiring problem or a connectivity issue that brought down AWS' so-called "Elastic Block Store." EBS is essentially a network-based hard drive that allows customers to store between 1 gigabyte and 1 terabyte of data per volume.

Reddit, one of the better-known sites to go down due to the error, said it has 700 EBS volumes with Amazon. Both Reddit and Amazon are working to "re-mirror" or copy the volumes to a data center in another availability zone. But both the painstaking process of moving the data and the sheer number of volumes are making the fix very lengthy.

"We always store data in multiple zones to avoid this problem," said Jeremy Edberg, senior product developer at Reddit. "The reason it went down is that it failed in multiple zones."

Sites like Quora and Reddit were able to come back online in "read-only" mode, but users couldn't post new content for many hours. Reddit only recently began inviting handfuls of random users to create new posts again.

Many experts blamed the sites themselves for crashing, saying they should have been spread out among multiple geographical regions to take full advantage of Amazon's backup systems.

"Amazon's products are only as good as the people putting the architecture up," said Michael Kirven, co-founder of cloud services provider Bluewolf. "If you put all of your eggs in one basket, you put yourself at risk."

EC2 is so simple to use -- a credit card and a few keystrokes literally get your business into the cloud -- that some experts say it can give a false sense of security. They see in Amazon customers a certain naivety, a belief that nothing could possibly go wrong.

Of course, things go wrong and systems fail. Other cloud-hosted products like Google's (GOOG, Fortune 500) Gmail have gone down from time to time.

But sites like Reddit and others that crashed were simply following the instructions Amazon laid out in its service agreement, which says hosting in one region should be sufficient. Some smaller sites simply can't afford the engineering and financial resources it takes to duplicate their infrastructure in data centers all over the world.

Some sites like FourSquare took the outage in stride. The check-in service blogged that its "usually-amazing data center hosts, Amazon EC2, are having a few hiccups."

But others weren't as forgiving. BigDoor CEO Keith Smith wrote in a blog post: "If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner."

Reddit encountered a similar hiccup for about six hours last month, prompting the company to begin migrating away from the particular product that went down on Thursday. Reddit's Edberg said the company is sticking with Amazon for now, but "we always have our eyes open for something that's superior."


First Published: April 22, 2011: 3:12 PM ET
