idk how you guys sanitize your internet consumption for your own sanity, but I use AI to amuse me through snark, and I thought this one was too good not to share: ''' *SassGPT, can you summarize this word-vomit for me and tell me how it could have been avoided?* https://aws.amazon.com/message/101925/
🌀 *“How AWS Took Down Half the Internet Because of One DNS Race Condition”*
On October 19, 2025, AWS's biggest region (*us-east-1*, a.k.a. the "single point of failure for the internet") face-planted when *DynamoDB's automated DNS manager tripped over its own shoelaces*. Two automation bots tried to update DNS at the same time, got confused, and—like dueling Roombas—*deleted the service's own DNS records*. DynamoDB promptly vanished from the network, and every AWS service that depends on it (which is most of them) started screaming.

That single typo-by-automation triggered a *cascade of chaos*:

- *EC2* couldn't launch new instances because its management system couldn't talk to DynamoDB.
- *Load balancers* began thinking healthy servers were dying and yanked them from DNS.
- *Lambda, Redshift, Connect, IAM, and STS* all stumbled because they rely (directly or indirectly) on DynamoDB.
- By the time engineers manually resurrected the DNS record, they still had to untangle a backlog of leases, health checks, and throttles.

After about *14 hours*, everything limped back to normal. AWS promised to fix the race condition, add guardrails so automation can't nuke live DNS, and maybe stop using one region as everyone's dependency hub.

------------------------------

*In short:* A concurrency bug in a DNS script turned into a region-wide outage because of excessive automation trust and cross-service dependencies. The lesson? *Even hyperscale clouds can still be taken down by one bad variable and too many dominoes.* '''

On Thu, Oct 23, 2025 at 9:19 PM Jay R. Ashworth via Outages-discussion <outages-discussion@outages.org> wrote:
----- Original Message -----
From: "Chapman, Brad (NBCUniversal) via Outages-discussion" <outages-discussion@outages.org>
It looks like the "Post-Event Summary" for this outage has been published. I've excerpted the one-sentence root cause analysis (emphasis mine):
https://aws.amazon.com/message/101925/
The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.
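The race they describe can be boiled down to a toy sketch (hypothetical names, heavily simplified; not AWS's actual code): two DNS "enactors" apply plans to a shared record store with no ordering check, so a stalled worker carrying an old plan lands last and wins, and a cleanup step that deletes "superseded" plan data then leaves the record empty:

```python
import threading
import time

records = {}                     # the live DNS "zone" (toy stand-in)
lock = threading.Lock()

def enactor(plan_id, addresses, delay):
    """A DNS automation worker applying one plan, blindly."""
    time.sleep(delay)            # simulate a stalled automation worker
    with lock:                   # each write is atomic, but ordering is not
        records["dynamodb.example"] = {"plan": plan_id, "addrs": addresses}

# Plan 2 is newer, but the enactor carrying plan 1 is delayed and lands last.
t1 = threading.Thread(target=enactor, args=(1, ["10.0.0.1"], 0.2))
t2 = threading.Thread(target=enactor, args=(2, ["10.0.0.2"], 0.0))
t1.start(); t2.start(); t1.join(); t2.join()

stale = records["dynamodb.example"]["plan"] == 1   # the old plan won the race

# Cleanup then removes data for the plan it believes is superseded --
# which is the one actually live -- leaving an empty record.
if stale:
    del records["dynamodb.example"]

print(records)   # {} : the endpoint has vanished
```

The point of the sketch: the lock makes each write atomic, but nothing enforces that writes land in plan order, and the cleanup pass compounds the damage.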
I disagree with their description of the root cause, based on the AAR.
It sounds like there should be a timestamp on every plan the Planner makes, and that should be carried all the way through to Route53 -- the *actual* root cause was that the installation of those delayed (and now stale) records was permitted to happen even though newer data was already installed, IME.
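The guard described above amounts to a conditional write: a minimal sketch (hypothetical API, not Route53's actual interface) where every plan carries the Planner's timestamp and the store refuses any write older than what is already installed, so a delayed enactor cannot clobber newer data:

```python
record = {"ts": 0, "addrs": []}   # current installed record + its plan timestamp

def conditional_apply(plan_ts, addrs):
    """Install a plan only if its timestamp is newer than the current record."""
    if plan_ts <= record["ts"]:
        return False              # stale plan: reject instead of overwrite
    record["ts"] = plan_ts
    record["addrs"] = addrs
    return True

assert conditional_apply(2, ["10.0.0.2"]) is True    # newer plan lands
assert conditional_apply(1, ["10.0.0.1"]) is False   # delayed older plan rejected
assert record["addrs"] == ["10.0.0.2"]               # newest data survives
```

With this check at the point of installation, the delayed enactor's write becomes a harmless no-op instead of an outage.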
Cheers,
-- jra
--
Jay R. Ashworth                  Baylink                       jra@baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates       http://www.bcp38.info          2000 Land Rover DII
St Petersburg FL USA      BCP38: Ask For It By Name!           +1 727 647 1274

______________________________________________
Outages-discussion mailing list
outages-discussion@outages.org
Sign up for an account: https://lists.outages.org/accounts/signup/
To subscribe send an email to outages-discussion-join@outages.org
To unsubscribe send an email to outages-discussion-leave@outages.org
To contact the list owners: outages-owner@outages.org
Archives: https://lists.outages.org/archives/list/outages-discussion@outages.org/