Plannedscape Postings


AWS Outage of October 2025
Lessons For Them And You

Posted by Charlie Recksieck on 2025-11-06
In October of 2025 there was a widespread failure of Amazon Web Services' cloud platform that affected a huge number of mainstay customers.

Then Microsoft Azure had its own problems a week later. We’ll focus on the AWS outage in this article, but the same general advice applies to relying on Azure.


What Happened

On October 20, a major outage occurred in the Amazon Web Services (AWS) US-EAST-1 region (Northern Virginia), severely disrupting cloud services globally.

The incident began early in the morning UTC when AWS reported increased error rates and latencies in the US-EAST-1 region.

The root cause was identified as a DNS resolution failure for the Amazon DynamoDB API endpoints in US-EAST-1; queries for the endpoint failed, which prevented many dependent services from functioning.

Long story short, countless other websites rely on AWS for database queries and file storage. During the outage, AWS was internally missing the web addresses for many of its own essential components, causing errors in everything that touched those parts of AWS.
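If you want to see what that failure mode looks like from the outside, here's a minimal Python sketch (standard library only) that simply checks whether the regional DynamoDB endpoint resolves at all. The hostname is the public regional endpoint; everything else is illustrative, not anything AWS ran internally.

```python
import socket

# The public regional DynamoDB endpoint; roughly the kind of name whose
# resolution was failing during the outage.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def endpoint_resolves(hostname: str) -> bool:
    """Return True if DNS resolution for the hostname succeeds."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        # Name resolution failed -- every API call to this endpoint
        # will fail before a single packet ever reaches DynamoDB.
        return False

if __name__ == "__main__":
    print(f"{ENDPOINT} resolvable: {endpoint_resolves(ENDPOINT)}")
```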


What Went Down

Venmo, Netflix, Amazon Alexa and many other useful-to-essential services went down. Sure, Netflix isn't essential - but try making the same case to me about Alexa at my house. But when you start messing with Venmo and banks, things are getting serious. (BTW, there are estimates out there that about 1/3 of Gen Z uses Venmo as their principal checking account.)

The disruption affected thousands of applications and services around the world: gaming platforms (e.g., Fortnite, Roblox), social apps (Snapchat, Signal), developer toolchains, streaming services, financial apps and more. AI/chat tools such as ChatGPT and Perplexity also experienced authentication or API issues. In the finance sector, platforms like Venmo and Robinhood showed disruptions in payments and trading. Streaming and retail services, from Amazon (including Alexa/Ring) to Disney+, also reported slowdowns or downtime.

For many, service availability degraded for several hours; even once the main DNS issue was mitigated, back-logs and cascading dependencies meant full stability took longer to achieve.


How Long Til A Fix That Day

Around 06:49 UTC the first signs of packet loss appeared at AWS edge nodes in the region. By about 07:26 UTC AWS engineers pinpointed the DNS issue affecting DynamoDB. Primary DNS recovery occurred around 09:24 UTC, but downstream systems remained impacted for hours. According to reports, AWS confirmed all services were returned to normal operations by the evening of October 20.


Problems Of Dependency On AWS / Azure / etc

Don't Rely on a Single Region

Many companies use AWS's US-East-1 as their main region. When that region goes down, so does everything linked to it. Businesses should always plan for multi-region backups.
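As a rough illustration of what that can look like in code, here's a hedged Python sketch using boto3. It assumes a hypothetical table named "orders" that is replicated to a second region (for example, via DynamoDB Global Tables); the region list and table name are placeholders, and real failover logic would be more careful than this.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Illustrative only: assumes "orders" exists in both regions
# (e.g., as a DynamoDB Global Table). Region order is a design choice.
REGIONS = ["us-east-1", "us-west-2"]

def get_item_with_fallback(key: dict):
    """Try the primary region first, then fail over to the secondary."""
    for region in REGIONS:
        try:
            table = boto3.resource("dynamodb", region_name=region).Table("orders")
            return table.get_item(Key=key).get("Item")
        except (ClientError, EndpointConnectionError) as exc:
            print(f"{region} unavailable ({exc}); trying next region")
    return None  # every region failed -- surface this to your own error handling
```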


DNS Is a Single Point of Failure

DNS may seem simple, but when it fails, it can bring down huge systems. Having multiple DNS failovers is critical for resilience.
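One client-side angle on that is to fall back across multiple resolvers instead of trusting a single one. The sketch below uses the third-party dnspython package, and the resolver addresses are just examples; it guards against a dead resolver rather than a missing record (which is what bit AWS), but it illustrates the "more than one path to an answer" principle.

```python
import dns.resolver  # third-party package: dnspython

# Example fallback order: an internal resolver first, then public resolvers.
# These addresses are placeholders / well-known public services.
RESOLVERS = ["10.0.0.2", "1.1.1.1", "8.8.8.8"]

def resolve_with_failover(hostname: str) -> list[str]:
    """Ask each resolver in turn until one answers."""
    for ip in RESOLVERS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        resolver.lifetime = 2.0  # don't hang on a dead resolver
        try:
            return [rr.to_text() for rr in resolver.resolve(hostname, "A")]
        except Exception:
            continue  # try the next resolver in the list
    raise RuntimeError(f"all resolvers failed for {hostname}")
```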


Health Monitoring Needs Redundancy

A monitoring system should never be the cause of an outage. This event showed that even internal health-check tools need strong safeguards.
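A simple safeguard is to make sure the probe itself can never hang or throw an error into the system it's watching. This is a generic sketch, not anything AWS-specific; the URL and timeout are placeholders.

```python
import urllib.request

def safe_health_check(url: str, timeout: float = 2.0) -> bool:
    """A health probe that cannot hang indefinitely or raise into its caller."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # A broken monitor reports "unhealthy" rather than crashing
        # the system it is supposed to protect.
        return False
```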


Prepare for "Retry Storms"

When servers go down, automatic retries can overwhelm them during recovery. Smart throttling and rate limiting can help prevent this.
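Exponential backoff with jitter is the usual defense. Here's a minimal, generic Python sketch; the operation, attempt count, and sleep cap are placeholders you'd tune for your own stack.

```python
import random
import time

def call_with_backoff(operation, max_attempts: int = 5):
    """Retry with exponential backoff and full jitter to avoid retry storms."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up instead of hammering a recovering service
            # Full jitter: sleep a random amount up to 2^attempt seconds (capped),
            # so thousands of clients don't all retry in lock-step.
            time.sleep(random.uniform(0, min(30, 2 ** attempt)))
```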


Cloud Is Powerful - but Not Perfect

Cloud platforms offer speed and scalability, but true reliability comes from good architecture and disaster planning, not just trust in the provider.



What Can You Do On Your Side

Companies large and small rely on heavy hitters like Azure, AWS, Google Cloud Platform, etc. Perhaps this type of outage makes you question relying on cloud services to this extent, even though it's hard, expensive, time-consuming, and resource-draining to reinvent the wheel and do all of this yourself.

As long as you are sticking with cloud infrastructure services, here are some bullet-point areas to start with in your planning.

* Spread workloads across multiple regions

* Regularly test disaster recovery plans

* Understand your hidden cloud dependencies

* Monitor third-party integrations that could break when AWS fails (a starter sketch follows this list)
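To get started on that last point, here's a hedged sketch of a tiny dependency probe. The hostnames are examples of the kinds of third-party endpoints an app might quietly depend on; swap in your own list.

```python
import socket

# Illustrative starting point: replace these with the external endpoints
# your own stack actually depends on.
DEPENDENCIES = {
    "DynamoDB (us-east-1)": "dynamodb.us-east-1.amazonaws.com",
    "S3 (us-east-1)": "s3.us-east-1.amazonaws.com",
    "Stripe API": "api.stripe.com",
}

def check_dependencies() -> dict[str, bool]:
    """Report which external dependencies are currently reachable on port 443."""
    results = {}
    for name, host in DEPENDENCIES.items():
        try:
            with socket.create_connection((host, 443), timeout=3):
                results[name] = True
        except OSError:
            results[name] = False
    return results

if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{'OK  ' if ok else 'DOWN'} {name}")
```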


The Takeaway

The event highlighted how a single regional fault (especially in a heavily centralised region like US-EAST-1) can ripple globally due to shared dependencies and common infrastructure. Many organisations pointed to this incident as a wake-up call about vendor concentration and the need for multi-region or multi-cloud resilience.

You should run some proactive tests of how such an outage might affect your technology and come up with contingency plans. But at some point we're all a little too dependent on cloud infrastructure, and there's only so much you can do.