Why One Bug Caused Widespread Internet Chaos

Why One Bug Caused Widespread Internet Chaos

Anybody familiar with the concept of the butterfly effect is likely to have some sympathy for the Amazon Web Services engineer who inadvertently caused an Internet meltdown on February 28, 2017.

The cloud-based infrastructure provider suffered an hours-long outage that crippled a significant proportion of the Internet, with a plethora of high-profile websites and apps stopped in their tracks. The outage was centered in the Northern Virginia region and ensured that AWS was unable to service requests for around five hours.

Although there was no evidence that this was linked to a malicious attack, people (unsurprisingly) wanted to know what was going on.

And it turns out that human error was to blame. To be more specific, a member of the Simple Storage Service (S3) team typed in the wrong command as part of a debugging process intended to speed up the S3 billing process.

Yes, a typo brought parts of the Internet to its knees.

According to Amazon Web Services, the debugging was only supposed to affect a limited number of servers. The incorrect command not only took out a larger set of servers than intended in the US-EAST-1 datacenter—a huge location that is also Amazon’s oldest, ZDNet reported—but also removed support for two other S3 subsystems—the index subsystem and the placement subsystem. Both of these subsystems required a full restart and safety checks that took “longer than expected.”

“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” said AWS, in a press release. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

The full AWS explanation for the outage can be read here.

Never Hurts To Plan For An Unexpected Event

The outage only lasted for a few hours but it highlights the ripples that can be created when the human factor comes into play.

AWS said that it builds its systems with the assumption that things might fail but admitted that it had not restarted either of these subsystems in its larger regions for years. S3 has experienced massive growth, which means that any outage or bug is always going to have a significant effect on the digital world.

To be fair to AWS, it has already made several changes to its operational practices as a result of this unforeseen event.

These changes include modifying the tools used to remove capacity and adding safeguards to ensure that other subsystems are not affected by an incorrect input. AWS has also begun auditing its other operational tools to speed up recovery time, splitting services into small partitions or cells and reducing the dependency that the AWS Service Health Dashboard—which was also taken out by the incorrect command—has on S3.

AWS has apologized for the impact that this event had for its thousands of customers, but it acknowledged that one small bug caused a whole heap of problems. With that in mind, this widespread outage through human error should be a wakeup call for any companies that test for bugs on an infrequent basis.

Want to see more like this?
David Bolton
David Bolton
Former ARC Writer
Reading time: 5 min

Digital Quality Matters More Than Ever: Do Your Experiences Keep Customers Coming Back?

Take a deep dive into common flaws in digital experiences and learn how to overcome them to set your business apart.

4 Ways to Get Maximum Value from Exploratory Testing

Well-planned exploratory testing can uncover critical issues and help dramatically improve the customer experience. See how to guide testers to where exploration can yield the greatest returns.

3 Keys to an Effective QA Organization

Get your internal, external and crowdsourced testers on the same page

What is the Metaverse? And What Isn’t It?

It’s not far-flung sci-fi anymore — the metaverse is here, and it requires companies to rethink their approach to UX and testing

Why Machine Learning Projects Fail

Read this article to learn the 5 key reasons why machine learning projects fail and how businesses can build successful AI experiences.

How Localization Supports New-Market Launches

Success or failure in a new market is all about how you resonate with customers — don’t skimp on prep work