Amazon explains its cloud disaster

By David Goldman, staff writer


NEW YORK (CNNMoney) -- Amazon on Friday issued a detailed analysis and apology on last week's massive crash of its cloud service, an event that brought down dozens of websites.

The disruption to Amazon (AMZN, Fortune 500) Web Service's Elastic Compute Cloud, or EC2, limited customers' access to much of the information that was stored in the company's East Coast regional data centers. About 75 sites crashed because of the outage.

Until now, Amazon had stayed relatively silent about the cause. But after completing a post-mortem assessment of the mess, the company issued a technically detailed, 5,700-word explanation of what went wrong.

The event -- the first prolonged, widespread outage EC2 has suffered since launching five years ago -- was a technical perfect storm. A mistake made by Amazon's engineers triggered a cascade of other bugs and glitches.

"As with any complicated operational issue, this one was caused by several root causes interacting with one another," Amazon wrote.

On April 21, AWS tried to upgrade capacity in one storage section of its regional network in Northern Virginia. That section is called an "availability zone." There are multiple availability zones in each region, with information spread across several zones in order to protect against data loss or downtime.

The upgrade required some traffic to be rerouted. Instead of redirecting the traffic within its primary network, Amazon accidentally sent it to a backup network. That secondary network isn't designed to handle that massive traffic flood. It got overwhelmed and clogged up, cutting a bunch of storage nodes off from the network.

When Amazon fixed the traffic flow, a failsafe triggered: The storage volumes essentially freaked out and began searching for a place to back up their data. That kicked off a "re-mirroring storm," filling up all the available storage space. When storage volumes couldn't find any way to back themselves up, they got "stuck." At the problem's peak, about 13% of the availability zone's volumes were stuck.

But why did a problem in one availability zone ripple out to affect a whole region? That's precisely the kind of glitch Amazon's infrastructure is supposed to prevent.

Turns out EC2 had a few bugs. Amazon describes them in detail in its analysis, but the gist is that the master system that coordinates all communication within the region had design flaws. It got overwhelmed, suffered a "brown out," and turned an isolated problem into a widespread one.

Interestingly, those bugs and design flaws have always been in place -- but they wouldn't have been discovered if Amazon hadn't goofed up and set off a domino chain.

Amazon says that knowing about and repairing those weaknesses will make EC2 even stronger. The company has already made several fixes and adjustments, and plans to deploy additional ones over the next few weeks. The mistake presented "many opportunities to protect the service against any similar event reoccurring," Amazon said.

Of course, Amazon's customers aren't so thrilled to have been guinea pigs in this cloud-crash learning experience. Amazon offered a mea culpa, and said it would give all customers in the affected availability zone a credit for 10 days of free service.

"We want to apologize," the company said in a prepared statement. "We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services." To top of page

Frontline troops push for solar energy
The U.S. Marines are testing renewable energy technologies like solar to reduce costs and casualties associated with fossil fuels. Play
25 Best Places to find rich singles
Looking for Mr. or Ms. Moneybags? Hunt down the perfect mate in these wealthy cities, which are brimming with unattached professionals. More
Fun festivals: Twins to mustard to pirates!
You'll see double in Twinsburg, Ohio, and Ketchup lovers should beware in Middleton, WI. Here's some of the best and strangest town festivals. Play
Index Last Change % Change
Dow 17,804.80 26.65 0.15%
Nasdaq 4,765.38 16.98 0.36%
S&P 500 2,070.65 9.42 0.46%
Treasuries 2.18 -0.03 -1.27%
Data as of 8:41am ET
Company Price Change % Change
Bank of America Corp... 17.62 0.09 0.51%
Apple Inc 111.78 -0.87 -0.77%
General Electric Co 25.62 0.48 1.91%
Intel Corp 36.37 -0.65 -1.76%
Microsoft Corp 47.66 0.14 0.29%
Data as of Dec 19

Sections

New York Magazine reporter Jessica Pressler, who has been caught up in controversy this past week, will not be moving on to a new job at Bloomberg News. More

Investors beware: These 5 global crises are likely to rattle the stock market and world economy. More

Forums in dark corners of the web sell the kinds of hacks that befell Sony. More

Unilever sued Hampton Creek over its egg-free mayonnaise spread Just Mayo. But the company behind Best Foods and Hellman's mayonnaise has now dropped the lawsuit. More

The income of the top 1% jumped significantly in 2012, far outpacing inflation. Not only did this group make a larger share of the country's income, their share of total taxes also jumped from 35% to 38%. More

Most stock quote data provided by BATS. Market indices are shown in real time, except for the DJIA, which is delayed by two minutes. All times are ET. Disclaimer.

Morningstar: © 2014 Morningstar, Inc. All Rights Reserved.

Factset: FactSet Research Systems Inc. 2014. All rights reserved.

Chicago Mercantile Association: Certain market data is the property of Chicago Mercantile Exchange Inc. and its licensors. All rights reserved.

Dow Jones: The Dow Jones branded indices are proprietary to and are calculated, distributed and marketed by DJI Opco, a subsidiary of S&P Dow Jones Indices LLC and have been licensed for use to S&P Opco, LLC and CNN. Standard & Poor's and S&P are registered trademarks of Standard & Poor’s Financial Services LLC and Dow Jones is a registered trademark of Dow Jones Trademark Holdings LLC. All content of the Dow Jones branded indices © S&P Dow Jones Indices LLC 2014 and/or its affiliates.