by Ben Fox Rubin March 2, 2017 10:44 AM PST @benfoxrubin
Amazon offered up more answers Thursday about what caused a number of websites to fail two days earlier.
According to a postmortem by the company's cloud services business, around 9:37 a.m. PT Tuesday an Amazon worker incorrectly punched in a command while trying to debug an issue. That command shut down a large set of servers at Amazon Web Services' Northern Virginia site, causing a domino effect of problems.
Other services that relied on those S3 cloud storage servers were disrupted. Also, removing so much server capacity required a full system restart, which then took longer than expected, AWS said. The sites affected included Quora, Imgur, IFTTT, Giphy and Slack.
Amazon was able to fix the issue by about 2 p.m. PT.
The problem highlighted just how much of the internet now depends on AWS, the leading cloud service provider, and the major repercussions of even a small human error at AWS.
"We want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses," the postmortem stated. "We will do everything we can to learn from this event and use it to improve our availability even further."
Looking to avoid a similar snafu, AWS said Thursday it's adding safety checks and working on ways to improve recovery times. A tool used to remove servers from the system will be modified to prevent someone from accidentally removing too much capacity at once.
"This will prevent an incorrect input from triggering a similar event in the future," AWS said in the post.
In other words, we may be saved from another typo mishap.
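For readers curious what such a safety check might look like, here is a minimal sketch in Python. It is not AWS's actual tool. The `Fleet` type, the `check_removal` function, the 5 percent per-command limit and the minimum-capacity floor are all hypothetical, chosen only to illustrate the idea of rejecting a command that asks to remove too many servers at once.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a removal request would take out too much capacity."""


@dataclass
class Fleet:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor the fleet must never drop below


def check_removal(fleet: Fleet, requested: int, max_fraction: float = 0.05) -> int:
    """Validate a removal request and return the server count left afterward.

    Both limits are assumptions made for illustration, not AWS's real rules.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Guard 1: cap how much a single command may remove.
    per_command_limit = max(1, int(fleet.active_servers * max_fraction))
    if requested > per_command_limit:
        raise CapacityRemovalError(
            f"{requested} servers exceeds the per-command limit of "
            f"{per_command_limit} for fleet {fleet.name!r}"
        )

    # Guard 2: never let the fleet fall below its minimum capacity.
    remaining = fleet.active_servers - requested
    if remaining < fleet.min_required:
        raise CapacityRemovalError(
            f"removal would leave {remaining} servers, below the minimum "
            f"of {fleet.min_required} for fleet {fleet.name!r}"
        )

    return remaining
```

With a check like this, a mistyped number fails loudly instead of taking servers offline: a request to remove 500 of 1,000 servers is rejected, while a request to remove 10 goes through.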