The biggest outages of 2021

Date      Duration (hours)
11 Feb    5
12 Feb    4
15 Mar    3
17 Mar    4.5
1 Apr     1
20 May    10
8 Jun     7.5
14 Jun    11
17 Jun    1
22 Jul    1
22 Jul    2
3 Aug     0.5
24 Aug    1.5
1 Sep     6.0
27 Sep    1.3
4 Oct     6.5
12 Oct    0.5
19 Oct    1.5
16 Nov    1.5
24 Nov    1.0
7 Dec     7.0
22 Dec    2.1

Affected region(s): West US

Issues with Azure Cosmos DB connectivity affected multiple downstream services.

The cause was a code regression triggered during Cosmos DB deployment.
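
Downstream services typically absorb this kind of transient connectivity failure with retries and exponential backoff. Below is a minimal, generic sketch of that pattern; the call_with_backoff helper, the retry limits, and the commented usage are illustrative assumptions rather than anything from the incident report or an official SDK.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a flaky call with exponential backoff and jitter.

    `operation` is any zero-argument callable that raises on transient
    failure (for example, a wrapper around a Cosmos DB read).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:  # substitute the SDK's transient error type here
            if attempt == max_attempts:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds

# Hypothetical usage: wrap a read against the flaky dependency.
# call_with_backoff(lambda: client.read_item("id-123", partition_key="tenant-a"))
```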

Affected region(s): Multi Region

During routine handling of the networking control plane, incoming network programming operations experienced timeouts. This resulted in networking issues and elevated packet loss.

Affected region(s): Global

Errors performing authentication operations for Microsoft services and for third-party applications that depend on Azure Active Directory (Azure AD) for authentication.
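
For applications that authenticate through Azure AD, an outage like this surfaces as failed token requests. The sketch below shows a minimal client-credentials flow with the MSAL for Python library, checking the token response for failure instead of assuming success; the tenant ID, client ID, secret, and Graph scope are placeholder assumptions.

```python
import msal

# Placeholder values; a real app would load these from configuration.
TENANT_ID = "<tenant-id>"
CLIENT_ID = "<client-id>"
CLIENT_SECRET = "<client-secret>"

app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)

# MSAL checks its in-memory token cache before calling Azure AD.
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])

if "access_token" in result:
    token = result["access_token"]
else:
    # During an Azure AD outage, requests fail here; log and degrade gracefully
    # instead of retrying in a tight loop.
    print("Token acquisition failed:", result.get("error"), result.get("error_description"))
```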

Affected region(s): Multi Region

Increased latency, packet loss, and "service unavailable" errors for traffic between regions and from Google to external endpoints.

Affected region(s): Global

Azure DNS experienced a service availability issue. Its name servers saw an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure.
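
From a client's perspective, an event like this shows up as DNS lookups that time out or fail outright. A minimal sketch using the dnspython (2.x) library with a short time budget and explicit failure handling might look like the following; the hostname and fallback resolver address are placeholders.

```python
import dns.exception
import dns.resolver

def resolve_a_record(hostname, nameserver="8.8.8.8", timeout=2.0):
    """Resolve an A record with a short time budget, returning [] on failure."""
    resolver = dns.resolver.Resolver()
    resolver.nameservers = [nameserver]
    resolver.lifetime = timeout  # total time allowed for the query
    try:
        answer = resolver.resolve(hostname, "A")
        return [rr.to_text() for rr in answer]
    except (dns.exception.Timeout, dns.resolver.NXDOMAIN,
            dns.resolver.NoAnswer, dns.resolver.NoNameservers):
        # During a DNS availability incident, callers land here; an application
        # might fall back to a cached address or surface a clear error.
        return []

print(resolve_a_record("example.com"))
```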

Affected region(s): europe-west2, us-west2

Google Cloud products experienced elevated latency and/or errors due to an issue with Access Control Lists (ACLs). The affected ACLs in this instance were internal ones used to define permissions for Google’s internal production resources. This prevented some internal service accounts from accessing various production jobs, which led to the downstream service impact.

The incident was triggered by a latent concurrency issue in a component of the production ACL system combined with a missing safety check.

On May 12, Fastly deployed a software version that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.

Early on June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, causing 85% of Fastly's network to return errors.

Specifically, sites like Amazon, Twitter, Reddit, Google, CNN, the Guardian, and The New York Times all went down at once in early June. Fastly restored "95%" of its services within 49 minutes, making this a broad but relatively short-lived outage compared to the rest.

Affected region(s): West US 2

Azure experienced a severe issue with its compute (VM) management APIs.

This led to failures when scaling up autoscaling groups (scale sets).

Akamai saw service disruptions for its hosting platform, which helps defend against Distributed Denial-of-Service (DDoS) attacks.

The disruption affected several large companies around the globe, including Southwest Airlines, United Airlines, Commonwealth Bank of Australia, Westpac Bank, and Australia and New Zealand Banking Group, as well as the Hong Kong Stock Exchange’s website. (APAC)

A bug in Akamai's domain name system (DNS) service, which allows web addresses to take users to their destinations, was triggered during a software update.

Websites of Delta Air Lines (DAL.N), Costco Wholesale Corp (COST.O), American Express (AXP.N) and Home Depot (HD.N) were initially down, displaying DNS service errors.

Affected region(s): eu-south-1 (Milan)

Errors with EC2 management operations and autoscaling.

Affected region(s): eu-south-1 (Milan)

Issues with Lambda led to increased invoke error rates.
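
Callers typically observe an event like this as throttling or server errors from the Invoke API. The sketch below is one way to invoke a function with boto3 using the SDK's built-in adaptive retry mode, which backs off automatically on transient errors; the function name, payload, and retry limits are placeholder assumptions.

```python
import json

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Adaptive retry mode adds client-side rate limiting on top of exponential backoff.
lambda_client = boto3.client(
    "lambda",
    region_name="eu-south-1",
    config=Config(retries={"max_attempts": 5, "mode": "adaptive"}),
)

try:
    response = lambda_client.invoke(
        FunctionName="my-function",           # placeholder function name
        InvocationType="RequestResponse",
        Payload=json.dumps({"ping": True}),
    )
    print(json.load(response["Payload"]))
except ClientError as err:
    # If the region-wide invoke error rate is elevated, retries can still be
    # exhausted; the caller should degrade gracefully rather than loop forever.
    print("Invoke failed:", err.response["Error"]["Code"])
```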

Affected region(s): australia-southeast2

Google Cloud infrastructure components faced issues. Google Cloud Networking experienced intermittent connectivity issues.

Any service that uses Cloud Networking may have been impacted.

Affected region(s): ap-northeast-1 (Tokyo)

Customers connecting to AWS services within the region through Direct Connect experienced elevated packet loss.

The cause was a loss of several core networking devices used to connect Direct Connect network traffic to all Availability Zones in the region.

Affected region(s): us-east-1

Kinesis experienced downtime, causing errors in other services that depend on it. Customers also experienced degraded performance for some EBS volumes in a single Availability Zone in the region.
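
Producers that write to Kinesis often buffer records locally when the stream is unreachable and replay them once it recovers, rather than dropping data. The boto3 sketch below illustrates that idea; the stream name and the in-memory buffer are illustrative assumptions, not a recommendation from the incident report.

```python
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis", region_name="us-east-1")
pending = []  # in-memory spill buffer; a real producer might persist this to disk

def publish(record: bytes, partition_key: str, stream="example-stream"):
    """Try to publish to Kinesis; buffer the record locally if the call fails."""
    try:
        kinesis.put_record(StreamName=stream, Data=record, PartitionKey=partition_key)
    except ClientError:
        pending.append((record, partition_key))

def flush(stream="example-stream"):
    """Replay buffered records once the stream is healthy again."""
    while pending:
        record, partition_key = pending.pop(0)
        kinesis.put_record(StreamName=stream, Data=record, PartitionKey=partition_key)
```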

Facebook accidentally shut down its own services due to "configuration changes on the backbone routers that coordinate network traffic between our data centers," which cascaded and brought down all of its online services. No one worldwide, including Facebook’s own employees, could access any Meta services.

Affected region(s): Global

Customers experienced increased error rates and latency for the AWS Management Console.

This issue did not affect the AWS service APIs, which were operating correctly during this time.
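
Because the service APIs stayed healthy, affected users could keep working through the SDKs or CLI rather than the web console. The boto3 sketch below shows that kind of direct API check; it is purely illustrative and not specific to this incident.

```python
import boto3

# Confirm which identity the credentials resolve to, bypassing the console entirely.
identity = boto3.client("sts").get_caller_identity()
print("Acting as:", identity["Arn"])

# Basic read-only call against EC2 to verify the API path is healthy.
ec2 = boto3.client("ec2", region_name="us-east-1")
reservations = ec2.describe_instances(MaxResults=5)["Reservations"]
print("Sample reservations returned:", len(reservations))
```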

Affected region(s): us-central1

Cloud Monitoring experienced elevated errors when requesting underlying monitoring data.

Affected region(s): us-central1, europe-west1 (Belgium)

A latent bug in a network configuration service was triggered during a leader election change.

Customers impacted by the issue encountered 404 errors when accessing web pages served by the Google External Proxy Load Balancer.
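
An external black-box probe is a common way to catch this class of failure, because a load balancer that returns 404s for valid pages still looks "up" to basic TCP checks. The sketch below uses the requests library and treats unexpected status codes as failures; the URL and expected status are placeholder assumptions.

```python
import requests

def probe(url="https://www.example.com/healthz", expected_status=200, timeout=5):
    """Return True only if the page answers with the expected status code."""
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return False  # connection errors and timeouts count as failures
    # A 404 from a load balancer for a known-good path is a failure signal,
    # even though the endpoint is technically reachable.
    return resp.status_code == expected_status

if not probe():
    print("Probe failed: page unreachable or returned an unexpected status")
```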

Affected region(s): us-east-2

Customers experienced increased invoke error rates for Lambda in the region.

Other AWS services, including the AWS Management Console, API Gateway, and Batch, experienced elevated error rates as a result of this issue.

Affected region(s): us-east-1

Increased API error rates and latencies for multiple AWS services after networking devices between the internal AWS network and the main AWS network in the region became overwhelmed.
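
When a shared network path is saturated, aggressive client retries can add to the congestion. A common client-side mitigation is a simple circuit breaker that stops calling a dependency for a cooling-off period after repeated failures; the sketch below is a generic illustration of that pattern and is not taken from the incident report.

```python
import time

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures, then pause."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: skipping call to protect the dependency")
            self.opened_at = None  # cooldown elapsed, allow a trial request
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result

# Hypothetical usage: breaker.call(lambda: client.describe_instances())
```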

Affected region(s): us-east-1 (Availability Zone 4)

Increased EC2 launch failures and networking connectivity / instance availability issues.