By Yonatan Hatzor
This year has certainly been unpredictable, putting us all on a rollercoaster ride of uncertainty. The coronavirus has tested us in many ways. We’ve seen people come together to support one another. Medical companies have rushed to provide a vaccine in record time. And, of course, technology companies have stepped up and strengthened their solutions that keep teams connected and companies running in the face of stay-at-home orders.
To their credit, cloud technology providers have delivered in spades, expanding service capabilities to meet new demands and surges in usage as more companies adopted digital technologies. However, there have been a few bumps in the road, some pandemic-related, some not.
We can breathe a sigh of relief that there were no massive catastrophes this year: even the worst outages were resolved in a matter of hours. But that doesn’t mean businesses weren’t hurt by unexpected downtime.
Let’s take a look at some of the most significant downtime events of this year – a year that brought unprecedented dependence on cloud technologies and changed the way businesses work.
March 3rd: Microsoft Azure
Microsoft Azure’s U.S. East data center suffered a six-hour outage, limiting the availability of its cloud services to customers relying on that data center. Microsoft reported that a cooling system failure caused the issue. As temperatures rose, network devices couldn’t perform properly, making compute and storage instances inaccessible. When temperatures returned to normal, engineers power-cycled the hardware, and services were restored.
March 26th: Google Cloud
Elevated error rates with Google Cloud IAM disrupted many services across multiple regions for three and a half hours. Google attributed the downtime to a “bulk update of group memberships that expanded to an unexpectedly high number of modified permissions, which generated a large backlog of queued mutations to be applied in real-time.” Cache servers ran out of memory, causing IAM requests to time out. Engineers restarted the cache servers with additional memory to mitigate the impact as they tried to fix the stale data and load the offline backfill of IAM data into the servers.
April: GitHub
GitHub suffered two outages in April. On April 2nd, a misconfiguration of software load balancers led to an outage lasting almost two hours, causing a problem when developers deployed load balancers to sites. On April 21st, all GitHub users and services were affected when a misconfiguration of database connections was unexpectedly deployed to production.
June 9th: IBM Cloud
IBM suffered a three-hour-plus outage that impacted 80 data centers across the world. With IBM’s status page also down, customers started to speculate about what could have caused such a global outage, with some assuming that it might be a BGP hijacking. Two days later, IBM released a statement saying an “external network provider flooded the IBM Cloud network with incorrect routing, resulting in severe congestion of traffic and impacting IBM Cloud services and our data centers. Mitigation steps have been taken to prevent a recurrence. Root cause analysis has not identified any data loss or cybersecurity issues.”
July 17th: Cloudflare
A short 27-minute outage from Cloudflare caused a 50% drop in traffic across its network. To alleviate congestion on a router in Atlanta, the engineering team updated the configuration, but an error sent all traffic across Cloudflare’s backbone to Atlanta, overwhelming the router. Some locations in the U.S., Europe, Russia, and South America were affected while others continued to operate normally. After the incident was resolved, the company announced it had “already made a global change to the backbone configuration that will prevent it from being able to occur again.”
August 11th: Salesforce
For nearly four hours, some Salesforce customers lost access to the service due to a power outage. Salesforce reported that some users hosted on its NA89 instance, which runs in Phoenix and Washington, D.C. data centers, were affected. To remedy the situation, the company rerouted traffic and executed an emergency site switch. This action caused an issue with Salesforce’s Live Agent tool, but that was fixed within minutes.
August 20th: Google
G Suite users found themselves unable to send emails, share files, post messages in Google Chat, use Google Voice, or do anything else that required Google’s business apps. After six hours of downtime, services were fully restored.
August 24th: Zoom
U.S. and U.K. Zoom users found themselves unable to access the website (zoom.us) or start and join any Zoom meetings or webinars. The issue began as the workday kicked off on the East Coast, and customers reported different levels of disruption: some said that only the web interface wasn’t functioning, while others seemed completely unable to access the service. About five hours later, Zoom announced that services were fully operational but didn’t disclose what triggered the event.
September 28th: Microsoft Azure & Microsoft 365
Users across the Americas were locked out of Azure, Microsoft 365, Dynamics 365, and custom applications using the Azure Active Directory single sign-on service for five hours. Those already signed in had no issues, though. Microsoft reported that three separate and unrelated issues caused the outage: a service update with a code defect, a tooling error in Azure AD’s safe deployment system, and a code defect in Azure AD’s rollback mechanism.
November 25th: AWS
In the early morning hours, businesses that rely on Amazon’s US-East-1 region and use Kinesis were confronted by a significant outage that, according to Amazon, was due to a “relatively small addition of capacity” to its front-end fleet. That addition “caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration.” This outage, which started with Amazon’s Kinesis service, quickly impacted other services, including CloudWatch and Amazon Cognito.
With regard to Cognito, Amazon reported in its post-mortem that “the prolonged issue with Kinesis Data Streams triggered a latent bug in this buffering code that caused the Cognito webservers to begin to block on the backlogged Kinesis Data Stream buffers. As a result, Cognito customers experienced elevated API failures and increased latencies for Cognito User Pools and Identity Pools, which prevented external users from authenticating or obtaining temporary AWS credentials.”
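The failure mode Amazon describes, request handlers stalling because a side-channel buffer is backed up, is a general one. Below is a minimal Python sketch of one common way to guard against it: a bounded buffer that drops records when it is full rather than making the caller wait. This is not Amazon's code, and the names TelemetryBuffer and handle_request are hypothetical, chosen only for illustration.

```python
import queue


class TelemetryBuffer:
    """Bounded buffer that sheds records instead of blocking the caller.

    Hypothetical illustration; not taken from any AWS service.
    """

    def __init__(self, max_items):
        self._queue = queue.Queue(maxsize=max_items)
        self.dropped = 0  # count of records shed because the buffer was full

    def offer(self, record):
        """Try to enqueue a record; return False immediately if full."""
        try:
            self._queue.put_nowait(record)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self):
        """Remove and return everything currently buffered."""
        items = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items


def handle_request(buffer, request_id):
    """Serve a request; logging to the buffer is best-effort only."""
    buffer.offer({"request": request_id})
    return f"ok:{request_id}"


# Simulate a stalled downstream: nothing drains the buffer while
# five requests arrive, yet every request still completes.
buffer = TelemetryBuffer(max_items=2)
results = [handle_request(buffer, i) for i in range(5)]
```

Here the first two records are buffered and the remaining three are counted as dropped, while all five requests return normally. The trade-off is lost telemetry during a downstream outage in exchange for keeping the primary service responsive.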
December 14th: Google Cloud Platform & Google Workspace
For about an hour, many people experienced issues when trying to sign in to Google’s cloud platform and Workspace services, which include Gmail, Google Drive, Google Classroom, and YouTube; however, those already signed in didn’t have trouble accessing the applications. The company worked quickly to resolve issues with Gmail. In its post-mortem, the company explained that the error occurred because its internal tools did not allocate enough storage space to the services that handle authentication.
As we say at Parametrix Insurance: downtime happens. It’s disruptive and damaging to businesses in many ways, but we can’t control third-party services. And, while companies have to accept that downtime will happen, it doesn’t mean they have to roll over and do nothing. There are ways to mitigate downtime risks, ranging from creating redundancy to transferring that risk through insurance.
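As a rough illustration of what "creating redundancy" can look like at the application level, here is a minimal Python sketch of a client that tries a list of providers in order and falls back to the next one when a call fails. The function call_with_failover and the two stand-in providers are hypothetical; a real setup would also need health checks, timeouts, and data replication between providers.

```python
def call_with_failover(providers, request):
    """Try each (name, callable) pair in order; return the first success.

    Raises RuntimeError listing every failure if all providers fail.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(request)
        except Exception as exc:  # any failure triggers the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in providers for the example (assumptions, not real services).
def primary(request):
    raise ConnectionError("region unavailable")


def secondary(request):
    return f"handled {request}"


served_by, response = call_with_failover(
    [("primary", primary), ("secondary", secondary)], "login"
)
```

In this run the primary raises an error, so the request is served by the secondary. Redundancy of this kind only helps if the fallback does not share the same underlying dependency as the primary, which several of the outages above show is not always obvious.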