Beyond the SLA: The Cloud Never Fails… Until it Does

Mark Kujawski — Wed, 05 Nov 2025 13:56:05 +0000

One of the core selling points of cloud computing has always been that the Big Three hyperscalers (AWS, Azure, Google Cloud) offer virtually limitless redundancy. Their massive global networks, we are told, make outages a thing of the past and eliminate the need for expensive disaster recovery strategies.

But recent events have proven otherwise. The AWS outage of October 20th disrupted thousands of businesses worldwide due to a single DNS resolution failure in the DynamoDB API for the US-EAST-1 region. That single point of failure rippled through global operations and its effects are still being felt.

The very next week, a global outage triggered by an “inadvertent configuration change” crippled Azure Front Door and associated platforms, including Microsoft 365, Minecraft, and Xbox Network. And in June of this year, Google Cloud went down for hours, taking Spotify, Snapchat, and Fitbit with it.

Collectively, these interruptions are estimated to have cost billions.

The most coveted availability target is Five Nines, meaning a system or application is available and operational 99.999% of the time. That’s about five minutes and fifteen seconds of downtime per year or, if you like, roughly 26 seconds per month. But the hyperscalers never guaranteed this level of availability.
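
The arithmetic behind an availability target is easy to check. The following is a minimal sketch, not anything published by a cloud provider: the function name `downtime_budget_seconds` is made up for illustration, and the calculation assumes an average year of 365.25 days and a month equal to one twelfth of that year.

```python
# Illustrative sketch: downtime allowed by an "N nines" availability target.
# Assumes an average year of 365.25 days; a month is taken as 1/12 of a year.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_budget_seconds(nines: int, period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime (in seconds) permitted over the period for N nines."""
    unavailability = 10.0 ** (-nines)  # e.g. five nines -> 0.00001
    return period_seconds * unavailability


if __name__ == "__main__":
    for nines in (3, 4, 5):
        per_year = downtime_budget_seconds(nines)
        per_month = per_year / 12
        print(f"{nines} nines: {per_year / 60:.1f} min/year, {per_month:.1f} s/month")
```

Under these assumptions, Five Nines allows roughly 316 seconds of downtime per year, while Four Nines allows a little under 53 minutes.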

Recent events show that even Four Nines (less than an hour of downtime per year) may be out of reach for hyperscalers. Cloud infrastructure, after all, is rented, not owned. Customers have no control over the availability of the underlying systems; they can only trust that their cloud providers will uphold the promise of resilience.

For many years, continuous-availability architectures (zero planned downtime plus highly resilient failover) have been the domain of mission-critical, on-prem systems like stock exchanges, telecom networks, and 911 emergency services.

Cloud computing hasn’t yet been able to reach that bar.

As more organizations move mission-critical workloads off-prem, CIOs are being forced to reevaluate risk. The assumption that cloud equals continuity has eroded. Now, the question isn’t whether to move, but how to do it safely.

The path forward starts with visibility.

Enterprises should run an assessment of their cloud environments to uncover weaknesses in reliability, cost efficiency, and security posture. A well-executed architectural review identifies single points of failure, quantifies exposure, and helps balance performance with cost and resilience. The goal: Restore confidence in cloud operations by designing for availability, not just assuming it.
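
One way to quantify exposure during such a review is to estimate composite availability. The sketch below is a simplified illustration, not a description of any particular assessment method: it assumes component failures are independent, and the component names and availability figures are hypothetical. A chain of dependencies is only as available as the product of its parts, while redundant replicas fail only if every replica fails at once.

```python
# Illustrative sketch: composite availability under an independence assumption.
# All availability figures below are hypothetical, chosen only for illustration.
from math import prod


def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)


def redundant_availability(replicas):
    """Availability of a group that is up if at least one replica is up."""
    return 1.0 - prod(1.0 - a for a in replicas)


if __name__ == "__main__":
    dns, api, database = 0.9999, 0.9995, 0.9999  # hypothetical per-component figures
    single_region = serial_availability([dns, api, database])
    two_regions = redundant_availability([single_region, single_region])
    print(f"single region: {single_region:.6f}")
    print(f"two independent regions: {two_regions:.8f}")
```

The serial result is always lower than the weakest component in the chain, which is why a single shared dependency can cap the availability of everything built on top of it. In practice regional failures are not fully independent, so the redundant figure should be read as an upper bound.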

If your organization depends on continuous uptime, now is the time to take a closer look at what “resilient” really means in the cloud era. Start by assessing where your risk lives and what’s within your control.

About the Author

Mark Kujawski is a Principal Director at apiphani. He leads the company’s Advisory Strategy Practice.

Discover a New Approach for Mission Critical with Deep Automation™ by Apiphani

Apiphani Team — Wed, 06 Dec 2023 10:35:00 +0000

This video presents Apiphani’s Deep Automation™ approach to managed services, positioning it as a response to legacy providers constrained by outdated technologies and manual support models. The solution is built natively on AI and machine learning, with a focus on incident avoidance rather than reactive ticket handling.

Deep Automation™ leverages machine learning to detect anomalous behavior, propose remediation actions to engineers, and execute approved solutions automatically, reducing dependence on traditional L1 and L2 support layers and significantly accelerating resolution times. The model combines automation with senior technical expertise, aiming to improve reliability, reduce human error, and enhance operational performance for mission-critical systems.

FAQ

What is Deep Automation™?

How does Deep Automation™ differ from traditional support models?

How are incidents handled in this model?

What operational impact is claimed?

Does automation replace human expertise?
