Cloud outages: Why do they happen, and how can you safeguard your business?

The benefits of the cloud are undeniable and include improved functionality, security, efficiency, and stability. In today's always-on, instant-everything world, it's hard to argue against using the cloud. But the cloud, and cloud providers, are not infallible.

Every few months, the headlines carry a story about the latest cloud outage, and those are only the ones you hear about. In fact, downtime happens quite often, even though top cloud providers, including Amazon’s AWS, Microsoft’s Azure, and Google’s GCP, offer 99.9% uptime as part of their SLAs.

When cloud services stop working, everyone panics, and with good reason. Many businesses depend on constant access to their cloud-based mission-critical products and services, so cloud downtime can stop work in its tracks. In the best-case scenario, downtime lasts only a few minutes and affects a few minor services; in the worst case, businesses are paralyzed for half a day or even longer, and the effects can be highly detrimental, costing companies $9,000 or more per minute.
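To put the two figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It converts a 99.9% uptime commitment into the downtime it still permits, and prices a half-day outage at the $9,000-per-minute figure. The function names and the 30-day month are assumptions made for illustration; real SLAs define their own measurement periods and exclusions.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime a given uptime percentage still permits in a period."""
    return (1 - uptime_percent / 100) * period_days * MINUTES_PER_DAY


def outage_cost(minutes: float, cost_per_minute: float = 9_000) -> float:
    """Estimated cost of an outage at a flat per-minute rate (illustrative only)."""
    return minutes * cost_per_minute


# 99.9% uptime over an assumed 30-day month and a 365-day year.
monthly_budget = allowed_downtime_minutes(99.9, 30)
yearly_budget = allowed_downtime_minutes(99.9, 365)

# A half-day (12-hour) outage at $9,000 per minute.
half_day_cost = outage_cost(12 * 60)

print(f"Allowed downtime per month: {monthly_budget:.1f} minutes")  # 43.2 minutes
print(f"Allowed downtime per year:  {yearly_budget:.1f} minutes")   # 525.6 minutes
print(f"Half-day outage cost:       ${half_day_cost:,.0f}")         # $6,480,000
```

In other words, a provider can be down for roughly 43 minutes a month and still meet a 99.9% commitment, while a single half-day outage at the quoted rate would run into the millions.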

Each outage tends to unfold the same way. Users who can't access critical cloud services take to social media to ask whether others are experiencing downtime too. The cloud service provider usually issues a statement within an hour, acknowledging the downtime and letting the public know it is working to restore services. Sometimes a post-mortem follows, detailing the incident and its causes. Other times, no report or statement is released, leaving many to wonder what happened.

So, what are the most common causes of cloud outages? Below, we'll look at what an outage actually is and what kinds of incidents trigger downtime events. In a follow-up post, we'll explore what kind of financial impact downtime events have on the most cloud-dependent industries.

What are cloud outages and why do they happen?

A cloud outage occurs when the cloud infrastructure or a particular cloud service becomes unavailable or performs below the levels set out in the cloud provider's SLA. Outages triggered by a single failure can have a cascading effect that spreads across multiple services or systems.

Cloud outages have a range of physical and software causes, some of which are within the control of service providers while others are not. They include:

Human Error - This is the only cause on our list that falls into both the physical and software categories. To provide best-in-class infrastructure and services, cloud providers continually update their systems, and humans create those updates. To reduce the risk of mistakes, cloud service providers follow stringent protocols and deploy updates region by region so that a problem does not spread everywhere at once. However, even with these precautions in place, a single incorrect command can bring down an entire IT service.

Networking Issues - Cloud service vendors partner with government organizations and telecommunication providers in order to use their communication networks. Networking problems, especially connectivity issues, are usually out of the cloud provider's control, leaving them (and their customers) dependent on their local partners. As the largest cloud vendors operate globally and balance workloads across geographically separated data centers, they are typically able to continue providing service to end-users while their partners resolve the networking outage.

Power Outage - Loss of power is a common physical cause of outages. Large data centers draw around 100 megawatts (MW) of power, roughly equivalent to the consumption of 80,000 homes. That puts significant demands on the supply coming from the national grid or independent power plants, both of which have mostly held steady for cloud vendors. However, damage to the grid or the plant itself affects everything that relies on it, including data centers. Fortunately, data centers also have backup generators that can cover at least part of the load if the primary source fails.

Cybersecurity - Contrary to what many believe, cyberattacks are among the rarest causes of unavailable cloud services for the top three cloud providers, since their distributed infrastructure reduces the chance of a global cyberattack. Their extensive security measures make them highly resistant but not impenetrable. A cyberattack like a Distributed Denial of Service (DDoS) is rare these days, but just last year Amazon mitigated the largest such attack ever. However, as demonstrated by the recent SolarWinds attack, other cloud-based businesses are susceptible, and their clients could experience outages as a result.

Environmental Causes - Natural disasters and weather-related events are the one thing data centers can do little to control. Hurricanes, lightning storms, tsunamis, and earthquakes have triggered cloud downtime events, either directly at data centers or through the region’s power grid. Since environmental causes are localized and physical in nature, the chance that multiple global regions would be crippled at once is remote, short of a catastrophic event spanning several continents.

Maintenance - While end-users only pay for the services they use, cloud providers need to maintain, manage, and operate their entire complex IT infrastructure. They improve and upgrade their systems on a scheduled basis, which leads to planned service interruptions or complete system restarts. Customers are notified in advance, but there's a risk of updates going wrong, and scheduled downtime is still downtime.

Take a proactive stance.

Outages are a reality that technology-dependent companies should plan for in advance. In general, you can trust that reputable cloud service providers have invested heavily in ensuring uptime, but you should also accept that outages can and will happen.

To limit the impact cloud outages can have on your operations and finances, you can place your cloud services in one of the more stable regions, use redundancy measures, transfer your risk by purchasing insurance coverage, or combine all three methods. Whatever you decide to do, don’t neglect this issue, because the damage to both your business continuity and your reputation can be significant.
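As a rough illustration of why redundancy helps, the sketch below estimates the combined availability of a service deployed to several regions. It rests on two simplifying assumptions that rarely hold perfectly in practice: that regions fail independently of one another, and that failover between them is instant and always works. The function name and the 99.9% per-region figure are used purely for illustration.

```python
def combined_availability(single_availability: float, replicas: int) -> float:
    """Availability of a service that stays up as long as at least one of
    `replicas` identical, independently failing deployments is up."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only if every replica is down at the same time.
    return 1 - (1 - single_availability) ** replicas


for regions in (1, 2, 3):
    availability = combined_availability(0.999, regions)
    print(f"{regions} region(s): {availability:.7%} available")
```

Under these assumptions, a second region takes the service from 99.9% to about 99.9999% availability. Real-world gains are smaller, because failures such as a bad software update or a shared networking dependency can hit several regions at once.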
