By Neta Rozy
Microsoft Azure experienced authentication errors for more than three hours starting at 18:40 UTC on March 15, when our monitoring system began detecting high error rates on the Microsoft service. The incident was widely reported online because widely used Microsoft services such as Microsoft 365 and Microsoft Teams were down for several hours.
The incident appears to have been caused by reduced capacity in the Azure Active Directory authentication service, which returned errors and was unable to authenticate users. It affected managed Microsoft applications including Teams, Microsoft 365, Exchange and Xbox, all of which depend on Azure Active Directory, Microsoft's enterprise identity service that provides single sign-on and multi-factor authentication.
Customers that use Azure Active Directory directly in their production systems felt a partial impact on some services, such as Azure VMs and Azure Storage. However, because the impact was limited to management features such as the creation of new processes, it did not affect existing applications, processes or operations already running on Azure.
The Parametrix Monitoring System identified the outage as soon as it began. It tracked Azure’s error rate throughout the incident and recorded a peak error rate of over 75%, meaning that more than 75% of system management requests were failing, primarily because of authentication problems.
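To make the "peak error rate" figure concrete, here is a minimal sketch of how such a number can be derived from per-interval request counts. It is not the Parametrix implementation: the `Sample` type, the function names and the numbers are illustrative assumptions, chosen only so that the peak lands above 75%.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Request counts for one monitoring interval (hypothetical shape)."""
    timestamp: str  # interval start, e.g. "19:00" (UTC)
    total: int      # management requests attempted in the interval
    failed: int     # how many of those returned an error


def error_rate(sample: Sample) -> float:
    """Fraction of requests that failed; 0.0 when nothing was attempted."""
    return sample.failed / sample.total if sample.total else 0.0


def peak_error_rate(samples: list[Sample]) -> tuple[str, float]:
    """Return (timestamp, rate) for the interval with the highest error rate."""
    worst = max(samples, key=error_rate)
    return worst.timestamp, error_rate(worst)


# Illustrative numbers only, not measurements from the incident.
samples = [
    Sample("18:30", total=200, failed=2),
    Sample("18:40", total=200, failed=90),
    Sample("19:00", total=200, failed=152),
    Sample("22:00", total=200, failed=4),
]

when, rate = peak_error_rate(samples)
print(f"peak error rate {rate:.0%} at {when} UTC")  # peak error rate 76% at 19:00 UTC
```

With these made-up counts, 152 of 200 requests fail in the worst interval, which is the sense in which a peak of "over 75%" means more than three out of four management requests were failing.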
Below is an actual graph from our Monitoring System showing the outage period, with errors occurring in management operations (i.e., not those that cause a Service Unavailable status) of Virtual Machines (blue), SQL (purple) and Storage (yellow). This is only one of the hundreds of data points that our system curates, based on algorithms trained to spot abnormalities just like this one. As the image shows, the storage service was far less affected than the compute services (SQL and Virtual Machines).
Our system identified only a slight degradation of service in Azure Storage, since the service continued to function. As you can see, there was a 10% peak error rate per region lasting only a few minutes. The operations that failed were non-critical management operations, such as Azure Storage Account operations.
The graph below displays the crucial Azure Storage Application operations. These were not affected at all: their success rate did not decrease in any region.
Note: The difference between Azure Storage Account and Azure Storage Application operations is that the former is used for account management while the latter is required for the continuous operations of a running system.
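The split between account and application operations can be sketched as a per-region success rate computed separately for each kind of operation. This is a simplified illustration, not our production code: the operation names in `OPERATION_KIND`, the region names and the event counts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from operation name to kind; names are illustrative.
OPERATION_KIND = {
    "create_storage_account": "account",  # account management
    "list_account_keys": "account",
    "put_blob": "application",            # needed by a running system
    "get_blob": "application",
}


def success_rates(events):
    """Compute success rate per (region, kind).

    events: iterable of (region, operation, succeeded) tuples.
    Returns a dict mapping (region, kind) to a rate between 0.0 and 1.0.
    """
    counts = defaultdict(lambda: [0, 0])  # (region, kind) -> [ok, total]
    for region, operation, succeeded in events:
        key = (region, OPERATION_KIND[operation])
        counts[key][0] += int(succeeded)
        counts[key][1] += 1
    return {key: ok / total for key, (ok, total) in counts.items()}


# Illustrative events only: one failed account operation in ten,
# and no failed application operations.
events = (
    [("westeurope", "create_storage_account", False)]
    + [("westeurope", "list_account_keys", True)] * 9
    + [("westeurope", "put_blob", True)] * 10
    + [("eastus", "get_blob", True)] * 10
)

rates = success_rates(events)
for (region, kind), value in sorted(rates.items()):
    print(f"{region:<12} {kind:<12} {value:.0%}")
```

Keeping the two kinds apart is what lets a dip in account operations show up without masking the fact that application operations kept succeeding.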
Although the SQL and Virtual Machines services showed high error rates, the errors were confined to the management aspect of those services, so instances that were already running did not suffer an outage.
Displayed below is the aggregated uptime graph of every running service instance monitored by our system. It shows that during the downtime, no running instances were affected in any region.
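One simple way to think about an aggregated uptime series is as the fraction of monitored instances that respond in each interval. The sketch below assumes one boolean health-probe result per instance per interval; the function name, the instance identifiers and the data are hypothetical, and the aggregation our system actually uses may differ.

```python
def aggregated_uptime(probes):
    """Fraction of instances that were up in each interval.

    probes: dict mapping instance id to a list of booleans, one health-probe
    result per interval. All lists are assumed to have the same length.
    """
    series = list(probes.values())
    if not series:
        return []
    count = len(series)
    # zip(*series) groups the results of all instances interval by interval.
    return [sum(step) / count for step in zip(*series)]


# Illustrative data: every running instance keeps answering its probes.
during_incident = {
    "vm-westeurope-1": [True, True, True],
    "sql-eastus-1": [True, True, True],
    "vm-eastus-2": [True, True, True],
}
print(aggregated_uptime(during_incident))  # [1.0, 1.0, 1.0]

# For contrast, a case where one of two instances goes down in the second interval.
contrast = {"a": [True, False], "b": [True, True]}
print(aggregated_uptime(contrast))  # [1.0, 0.5]
```

A flat line at 1.0 throughout the incident window is what "no running instances were affected" looks like in this representation.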
The graphs above show the fine granularity and accuracy of our monitoring system. It captured exactly which services were down at the moment the downtime occurred, and exactly what did or did not work on each service.
Parametrix further uses the data it collects to analyse future events, inform our modeling processes, and produce data-driven insights that influence strategy at the company and at the market level.
This incident and all the publicity surrounding it demonstrate the market’s ever-growing reliance on the cloud and the need for insurance policies that cover downtime caused by third-party IT providers of services such as cloud, ecommerce, payments and communications.