@devopslinks · Nov 01, 2025

Amazon apologized for the major AWS outage in the Northern Virginia (US-EAST-1) Region, caused by a latent race condition in the DynamoDB DNS management system.
The outage impacted several AWS services, including Amazon DynamoDB, Network Load Balancer, and EC2, causing significant operational challenges for customers.
AWS implemented several recovery measures, and by 3:53 PM PDT on October 20 all AWS services had returned to normal operations.
The apology signals responsibility, transparency, and accountability, though for some customers the inconvenience of the outage may overshadow it.
AWS manages the infrastructure and services affected by the outage, and it issued an apology and explanation for the incident.
DynamoDB customers saw increased API error rates caused by the DNS management system failure.
Network Load Balancer customers faced connection errors during the outage, degrading application performance.
EC2 customers encountered failed instance launches, delaying access to new compute capacity.
A significant outage affected multiple AWS services, prompting Amazon to issue an apology.
The outage began with increased API error rates in Amazon DynamoDB caused by DNS resolution failures.
The cause was identified as DNS resolution failures for the regional DynamoDB service endpoints (a basic resolution check is sketched after this timeline).
The DNS issue affecting DynamoDB was resolved, and services began recovering.
Network Manager began propagating updated network configurations to newly launched instances.
Monitoring systems detected increased health check failures in the Network Load Balancer (NLB) service.
Engineers disabled automatic health check failovers for NLB, resolving increased connection errors.
Network Load Balancer health checks recovered.
Full recovery of EC2 APIs and new EC2 instance launches was achieved.
Automatic DNS health check failover was re-enabled for NLB.
All AWS services returned to normal operations.
The final update confirmed the resolution of increased error rates and latencies for AWS Services in the US-EAST-1 Region.
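For context on what "endpoint resolution failures" look like from the outside, here is a minimal monitoring-style sketch in Python. The endpoint name comes from the incident description; the retry-and-backoff check itself is a hypothetical illustration, not an AWS tool.

```python
import socket
import time

# Hypothetical monitoring-style check (not an AWS tool): verify that the
# regional DynamoDB endpoint resolves, backing off and retrying on failure.
# During the incident, lookups like this failed for the regional endpoint.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def endpoint_resolves(host: str, attempts: int = 5, base_delay: float = 1.0) -> bool:
    """Return True once the host resolves, retrying with exponential backoff."""
    for attempt in range(attempts):
        try:
            records = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
            print(f"{host} resolved to {len(records)} address records")
            return True
        except socket.gaierror as err:
            delay = base_delay * (2 ** attempt)
            print(f"resolution failed ({err}); retrying in {delay:.0f}s")
            time.sleep(delay)
    return False

if __name__ == "__main__":
    endpoint_resolves(ENDPOINT)
```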
Amazon's been in the spotlight lately, and not for the reasons they'd like. They had to issue an apology after a significant AWS outage rocked the Northern Virginia (US-EAST-1) Region. This wasn't just a minor blip on the radar. Between October 19 and 20, 2025, users faced increased API error rates in Amazon DynamoDB, connection errors with the Network Load Balancer (NLB), and failed EC2 instance launches. The root of all this chaos? A latent race condition in the DynamoDB DNS management system, which led to endpoint resolution failures. In simpler terms, a technical glitch that caused quite a few headaches.
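To make that failure mode concrete, here is a deliberately simplified sketch of this kind of race. The names and logic are hypothetical, not AWS's actual DNS management code: two redundant plan appliers write to a shared record without checking plan generations, so a delayed writer can overwrite a newer plan and a later cleanup can leave the record empty.

```python
import threading

# Hypothetical illustration of a last-writer-wins race in a DNS "plan"
# applier -- simplified, not AWS's actual system. The shared record starts
# out reflecting the newest plan (plan 2).
dns_record = {"plan": 2, "ips": ["10.0.0.2"]}
lock = threading.Lock()

def apply_plan(plan_id: int, ips: list) -> None:
    """Apply a DNS plan to the shared record."""
    with lock:
        # BUG: no check that plan_id is newer than the current plan,
        # so a delayed applier holding an older plan wins the write.
        dns_record.update(plan=plan_id, ips=ips)

def cleanup_stale_plans(active_plan: int) -> None:
    """Remove data belonging to plans older than the active one."""
    with lock:
        # Cleanup assumes the record reflects the newest plan; if a stale
        # write landed after the newest one, this empties the record.
        if dns_record["plan"] < active_plan:
            dns_record["ips"] = []

# A slow applier finishes late and re-applies the older plan 1 ...
slow = threading.Thread(target=apply_plan, args=(1, ["10.0.0.1"]))
slow.start()
slow.join()

# ... then routine cleanup of anything older than plan 2 wipes the record.
cleanup_stale_plans(active_plan=2)
print(dns_record)  # {'plan': 1, 'ips': []} -- an effectively empty DNS record
```

In the real system, redundant DNS Enactors apply these plans across Availability Zones in the N. Virginia region, which is exactly the kind of concurrency that makes such an ordering hazard possible; the sketch only captures the shape of the bug.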
The trouble kicked off at 11:48 PM PDT on October 19, when users started noticing those pesky API error rates in DynamoDB. Things went downhill fast, with connection errors in NLB and issues launching EC2 instances. By 2:25 AM PDT on October 20, Amazon had managed to restore DNS information and was deep into recovery efforts. But, as these things often go, the problems didn't just disappear overnight. EC2 instance launches and NLB health checks continued to struggle, causing more network connectivity woes.
Amazon's team was in overdrive throughout the day, trying to fix the mess. They throttled operations and applied various fixes, a bit like trying to patch a leaky boat while still at sea. By 3:01 PM PDT, most AWS services were back to normal, though some were still dealing with backlogs. Amazon's promised to learn from this incident and make changes to prevent it from happening again. It's a classic case of learning the hard way, but hopefully, it means smoother sailing in the future.
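There is not much customers can do about the outage itself, but incidents like this are a reminder to make clients resilient to elevated error rates and throttling. A minimal sketch, assuming the standard boto3 SDK (the table name is made up):

```python
import boto3
from botocore.config import Config

# Client-side sketch: lean on the SDK's built-in retry modes so transient
# API errors and throttling during an incident are retried with backoff
# instead of surfacing immediately to the application.
retry_config = Config(
    retries={
        "max_attempts": 10,   # raise the attempt budget for degraded periods
        "mode": "adaptive",   # client-side rate limiting on top of backoff
    },
    connect_timeout=5,
    read_timeout=10,
)

dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=retry_config)

# Example call; the table name is hypothetical.
response = dynamodb.describe_table(TableName="example-table")
print(response["Table"]["TableStatus"])
```

The "adaptive" retry mode adds client-side rate limiting on top of the standard exponential backoff, which helps when a service is shedding load while it recovers.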
Subscribe to our weekly newsletter DevOpsLinks to receive similar updates for free!
Join other developers and claim your FAUN.dev() account now!
FAUN.dev() is a developer-first platform built with a simple goal: help engineers stay sharp without wasting their time.
