The role of the cloud in the global CrowdStrike outage

What was the role of the cloud in the recent CrowdStrike outage, and what does it say about vulnerabilities in cybersecurity infrastructure?

In the wake of the recent widespread outage affecting enterprises globally, the incident involving CrowdStrike has illuminated a critical vulnerability in modern cybersecurity infrastructure: the pivotal role of cloud technology in software updates. 

How the incident unfolded

The global CrowdStrike outage stemmed from a routine security update distributed to the company's flagship product, Falcon. The software is installed on machines to protect them against cyber threats and receives automatic updates through the cloud. The cloud is essential for cybersecurity solutions like CrowdStrike because it enables them to quickly identify threats and push protections before attackers can exploit new vectors. During the outage, the software was automatically updated with an incorrect configuration that triggered system failures, causing machines around the world running both Windows and CrowdStrike to become stuck on the BSOD (Blue Screen of Death).

Cloud as the distributing force

Cloud technology enables cybersecurity companies to seamlessly update and deploy solutions on a global scale. CrowdStrike controls its own deployment behavior on the cloud and leverages its cloud infrastructure to push updates quickly through interconnected systems worldwide. This ensures rapid protection for its users against evolving cyber threats. While the cloud itself encountered no technical issues during the outage, it paradoxically became the channel through which the outage had such a far-reaching effect, impacting not only primary users but also downstream entities reliant on their services.

Quick to destroy, slow to repair

Even though the outage spread rapidly through the cloud, the mitigation could not, because affected machines were stuck and could not be reached through the cloud. The only mitigation was to manually delete the file matching C-00000291*.sys from the drivers\CrowdStrike directory under the Windows System32 folder on each machine. Relying on a manual fix made the outage considerably harder to resolve. CrowdStrike released the manual mitigation two hours into the disaster, but IT teams around the world could not keep up with the vast number of machines that needed to be mended. Take, for example, a large enterprise, which can easily have over 200,000 machines. The average number of employees for a large enterprise is 62,000, and with one full-time IT expert for every 100 employees, the IT team is roughly 620 people. Even if all 620 IT staff started fixing these machines immediately and each machine took only five minutes, it would still take over 26 hours of uninterrupted work per person (200,000 machines × 5 minutes ÷ 620 staff ≈ 1,613 minutes) to resolve the issue at the company level. Even this is unrealistic, as it assumes each IT member works around the clock with no food or bathroom breaks. Realistically, it would take between three and seven days to have the entire organization up and running again.
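
The back-of-the-envelope estimate above can be reproduced with a short Python sketch. The machine count, headcount, staffing ratio and five-minute fix time come from the example; the function names and the eight-hour productive shift are illustrative assumptions, not measured figures.

```python
import math


def remediation_hours(machines: int, it_staff: int, minutes_per_machine: float) -> float:
    """Hours of uninterrupted work per technician if machines are split evenly."""
    total_minutes = machines * minutes_per_machine
    return total_minutes / it_staff / 60


def remediation_days(hours_per_person: float, productive_hours_per_day: float) -> int:
    """Whole working days needed, given an assumed productive shift length."""
    return math.ceil(hours_per_person / productive_hours_per_day)


employees = 62_000           # average headcount of a large enterprise (from the example)
it_staff = employees // 100  # one IT expert per 100 employees
machines = 200_000           # fleet size used in the example

hours = remediation_hours(machines, it_staff, minutes_per_machine=5)
print(f"IT staff: {it_staff}")
print(f"Uninterrupted hours per person: {hours:.1f}")

# Assumption: roughly 8 productive hours per technician per day.
print(f"Working days at 8 h/day: {remediation_days(hours, 8)}")
```

Shorter productive shifts, travel between sites, or machines that need extra steps before the file can be deleted would push the result toward the upper end of the three-to-seven-day range.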


Systemic cyber risks exposed

While the cloud has many ways to mitigate the systemic risk of its own potential failure, services dependent on the cloud can still fail and, as the CrowdStrike outage demonstrated, can have a far-reaching impact. The transition to cloud services has undoubtedly brought numerous benefits, but it has also introduced significant changes to the landscape of systemic cyber risk.


Cloud-based services as a key driver of systemic risk

Previously, analysis of catastrophic cyber insurance risk focused on either malicious events of some kind or cloud platform instability and unavailability. The concern about malicious cyber activity put much of the focus on defensive security measures (one of which, ironically, caused this event). Concern about the cloud as a platform did lead to analysis of the aggregation of exposure to the cloud, which is relevant to this event, but it did not address all vulnerabilities.

Here we see the cloud acting as a platform for contagion. There were no malicious actors and the cloud functioned normally. The connectivity that the cloud enables is a key risk in and of itself, whether the trigger is malicious or not. In this case the cause was human error, but the cloud could just as easily be a platform for the propagation of malicious acts.

While the stability of the cloud itself remains a risk to be managed, it is quite distinct from the cloud's ability to propagate bugs and viruses, and this is likely to require additional analysis.

Additional lessons learned include:

1. Not just cyber - This event will be a case study of how a systemic event of this kind impacts multiple lines of insurance. We have already seen travel insurers impacted by the flight delays. We have yet to evaluate the impact on other lines: coverage for the supply chains that were disrupted, liability coverage triggered by critical medical procedures that were delayed or not performed, and directors and officers coverage where leaders did not take the necessary steps to protect their firms against such an event.

2. Cascading impacts - The complexity of the cascading impacts is immense. While some airlines were not directly affected, airports which they flew to or from were. Online travel agents that were not directly affected were flooded with requests from customers to rebook their canceled flights. What were the consequences for payments that were not executed, or for trading on the exchanges when stock prices fell? It will take time for this to play out.

3. Preservation of logs - To truly understand events and evaluate claims, the preservation of system logs becomes even more important. Their analysis could become a key factor in resolving insurance claims and determining preventative measures in the future.

Looking ahead

In conclusion, while this incident serves as a sobering reminder of the vulnerabilities inherent in our digital infrastructure, it also presents an opportunity for innovation and resilience-building. By learning from these challenges, the industry can evolve towards more secure, reliable, and interconnected digital ecosystems.

--

Parametrix monitors third-party IT services across the globe and collects granular data on service interruptions. If you have any questions regarding the outage, please reach out at info@parametrixinsurance.com.

The Parametrix Team
Published
July 22, 2024