On December 7, 2021 at 15:18 UTC (10:18 EST), several AWS services started malfunctioning, leading to a major cloud outage that lasted nearly seven hours and brought down thousands of websites, including major services like Amazon, Tinder, Roku, Coinbase, Happy Games, Epic Games, Ring, Disney Plus and more.
Seven hours after our monitors first showed availability issues, most services had recovered and availability was back at pre-outage levels. Parametrix was in close contact with our customers throughout, providing insights, assistance and support during the event.
The Notorious US-EAST-1
The main fault was at a data center in the US-EAST-1 region, demonstrating how heavily businesses have come to depend on the cloud in general and on this region in particular, and how cloud outages are a risk that every business must address.
According to our analysts, the main affected services were EventBridge, Auto Scaling, EC2, DynamoDB, Elastic Load Balancer and API Gateway (though other services were impacted or down as well). The scope of the problem, and the fact that some services were up and down intermittently, indicate there was a networking problem that prevented servers from communicating.
While the damages are clear and widely reported, only Amazon’s own investigation will yield the full details of the error. End-users flooded the web with reports and complaints of interrupted service.
EventBridge
EventBridge is a serverless event bus (pipeline) that makes it easier to build event-driven applications. It can also be used to trigger applications on a schedule.
This service was down for more than 12 hours and had a domino effect on many other services. Many companies launch applications through EventBridge; with the service down, those applications were never triggered and so could not run.
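The domino effect is easiest to see in miniature. The sketch below is a toy in-process analogue of an event bus with rules and handlers, not the AWS API (the names `EventBus`, `put_rule` and `put_event` are illustrative only): when the bus itself is down, every application waiting to be triggered simply never runs.

```python
class EventBus:
    """Toy analogue of an event bus: applications register handlers and
    only run when the bus delivers a matching event."""

    def __init__(self):
        self.rules = []        # (event_type, handler) pairs
        self.available = True  # simulate the bus being up or down

    def put_rule(self, event_type, handler):
        """Register a handler to fire when a matching event arrives."""
        self.rules.append((event_type, handler))

    def put_event(self, event):
        """Deliver an event; returns how many handlers fired.

        If the bus is down, downstream handlers are simply never
        invoked -- the applications behind them silently do nothing.
        """
        if not self.available:
            return 0
        fired = 0
        for event_type, handler in self.rules:
            if event_type == event["type"]:
                handler(event)
                fired += 1
        return fired
```

In this toy model, an application scheduled via the bus has no fallback path: if `available` is False, its handler count stays at zero, which mirrors the "did not trigger them" failure described above.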
Auto Scaling
Auto Scaling has a simple role: it monitors applications and automatically adjusts their compute capacity to maintain steady, predictable performance at the lowest possible cost.
The service was down for roughly 6.5 hours, leading to widespread errors.
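The adjustment logic can be pictured as a simple control loop: compare a load metric against thresholds and move the fleet size accordingly, clamped to configured bounds. A minimal sketch, assuming illustrative thresholds and names (this is not AWS's actual algorithm):

```python
def desired_capacity(current, cpu_pct, min_size, max_size,
                     scale_out_at=70.0, scale_in_at=30.0):
    """Return the new instance count for a fleet, given average CPU load.

    Scale out by one instance when load exceeds scale_out_at, scale in
    by one when it drops below scale_in_at, and always stay within the
    configured [min_size, max_size] bounds.
    """
    if cpu_pct > scale_out_at:
        current += 1
    elif cpu_pct < scale_in_at:
        current -= 1
    return max(min_size, min(max_size, current))
```

With this loop frozen, as it was during the outage, a fleet stays at whatever size it last had: overloaded applications cannot grow, and idle ones keep paying for capacity they do not need.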
EC2 (Elastic Compute Cloud)
AWS’ Elastic Compute Cloud allots compute power to customers to run their processes at scale. It appears that all of the API operations were hurt, which made managing compute resources impossible. Virtual machines that were already running could continue working, but customers likely could not launch new ones or change the configuration of running processes. Also, without API access, virtual machines could not be restarted if anything went wrong.
The service was degraded for roughly 7.5 hours.
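When a control plane is degraded rather than fully down, API calls tend to fail intermittently. A common defensive pattern on the customer side is to wrap management calls in bounded retries with exponential backoff, so transient failures are absorbed while a persistent outage still surfaces quickly as an explicit error. A generic sketch (the function and parameter names are ours, not an AWS SDK API):

```python
import time

def call_with_backoff(api_call, max_attempts=5, base_delay=0.5):
    """Invoke api_call(); on failure, retry with exponentially growing
    pauses, re-raising the last error once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return api_call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # outage persists: fail loudly instead of hanging
            time.sleep(base_delay * 2 ** attempt)
```

Backoff only helps with transient errors; during a seven-hour degradation like this one, callers eventually exhaust their attempts, which is exactly when an explicit failure is preferable to silent retrying.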
DynamoDB
DynamoDB is AWS’ proprietary database, where customers store their information. As with EC2, the service’s API operations could not be managed: customers could still write to the database, but could not execute critical functions like backup, restore or rollback.
The service was degraded for 6 hours and 45 minutes.
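The failure mode described above is a data plane that keeps working while the control plane does not. A toy model of that split (the class and method names are illustrative, not the DynamoDB API):

```python
class DegradedTable:
    """Toy model of a database during a partial outage: item writes
    (data plane) keep working while management operations
    (control plane) fail."""

    def __init__(self):
        self.items = {}
        self.control_plane_up = True

    def put_item(self, key, value):
        """Data-plane write: still available during the outage."""
        self.items[key] = value

    def create_backup(self):
        """Control-plane operation: unavailable during the outage."""
        if not self.control_plane_up:
            raise RuntimeError("management API unavailable")
        return dict(self.items)
```

The asymmetry matters for risk: applications appear healthy because writes succeed, yet the safety nets (backup, restore, rollback) are gone for the duration of the outage.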
Elastic Load Balancer
ELB is the equivalent of traffic control for websites. It automatically distributes incoming application traffic across servers so that no single machine is inundated and overloaded. Management operations appear to have been affected, meaning that ELB kept functioning but could not be reconfigured.
The service was interrupted for roughly 6 hours.
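"Functioning but not configurable" can be sketched the same way: traffic keeps rotating across the servers the balancer already knows about, while any attempt to change that set fails. A toy round-robin sketch (illustrative names, not the ELB API):

```python
class DegradedBalancer:
    """Toy round-robin balancer: requests keep flowing to the existing
    servers, but configuration changes fail while management is down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.i = 0
        self.management_up = True

    def route(self):
        """Data path: pick the next server in rotation."""
        server = self.servers[self.i % len(self.servers)]
        self.i += 1
        return server

    def register(self, server):
        """Management path: add a server; fails during the outage."""
        if not self.management_up:
            raise RuntimeError("configuration API unavailable")
        self.servers.append(server)
```

In practice this meant existing fleets kept serving traffic, but operators could not add capacity behind a balancer or drain an unhealthy server until management operations returned.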
API Gateway
API Gateway is an API management tool that lets developers create, publish, maintain, monitor, and secure APIs that interface with other applications.
The service was down for roughly 8 hours.
With the outage event over, affected companies all over the world will need to assess the damages. Some were sidelined by work interruptions; others discovered their websites were inaccessible – particularly detrimental to online shops and entertainment services. Most will have to contend with customer churn and a tarnished reputation.
Any company impacted and covered by a Parametrix Insurance policy will be able to collect on their policy and gain the necessary cash flow to start repairing the damages quickly.