idk how you guys sanitize your internet consumption for your own sanity, but I use AI to amuse me through snark, and I thought this one was too good not to share: ''' *SassGPT, can you summarize this word-vomit for me and tell me how it could have been avoided?* https://aws.amazon.com/message/101925/
🌀 *“How AWS Took Down Half the Internet Because of One DNS Race Condition”*
On October 19, 2025, AWS's biggest region (*us-east-1*, a.k.a. the "single point of failure for the internet") face-planted when *DynamoDB's automated DNS manager tripped over its own shoelaces*. Two automation bots tried to update DNS at the same time, got confused, and—like dueling Roombas—*deleted the service's own DNS records*. DynamoDB promptly vanished from the network, and every AWS service that depends on it (which is most of them) started screaming.

That single typo-by-automation triggered a *cascade of chaos*:

- *EC2* couldn't launch new instances because its management system couldn't talk to DynamoDB.
- *Load balancers* began thinking healthy servers were dying and yanked them from DNS.
- *Lambda, Redshift, Connect, IAM, and STS* all stumbled because they rely (directly or indirectly) on DynamoDB.
- By the time engineers manually resurrected the DNS record, they still had to untangle a backlog of leases, health checks, and throttles.

After about *14 hours*, everything limped back to normal. AWS promised to fix the race condition, add guardrails so automation can't nuke live DNS, and maybe stop using one region as everyone's dependency hub.

------------------------------

*In short:* A concurrency bug in a DNS script turned into a region-wide outage because of excessive automation trust and cross-service dependencies. The lesson? *Even hyperscale clouds can still be taken down by one bad variable and too many dominoes.* '''

On Thu, Oct 23, 2025 at 9:19 PM Jay R. Ashworth via Outages-discussion <outages-discussion@outages.org> wrote:
----- Original Message -----
From: "Chapman, Brad (NBCUniversal) via Outages-discussion" <outages-discussion@outages.org>
It looks like the "Post-Event Summary" for this outage has been published. I've excerpted the one-sentence root cause analysis (emphasis mine):
https://aws.amazon.com/message/101925/
The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.
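The race they describe can be boiled down to a toy sketch (hypothetical names, heavily simplified; not AWS's actual code): two DNS "enactors" apply plans to a shared record store with no ordering check, so a stalled worker carrying an old plan lands last and wins, and a cleanup step that deletes "superseded" plan data then leaves the record empty:

```python
import threading
import time

records = {}                     # the live DNS "zone" (toy stand-in)
lock = threading.Lock()

def enactor(plan_id, addresses, delay):
    """A DNS automation worker applying one plan, blindly."""
    time.sleep(delay)            # simulate a stalled automation worker
    with lock:                   # each write is atomic, but ordering is not
        records["dynamodb.example"] = {"plan": plan_id, "addrs": addresses}

# Plan 2 is newer, but the enactor carrying plan 1 is delayed and lands last.
t1 = threading.Thread(target=enactor, args=(1, ["10.0.0.1"], 0.2))
t2 = threading.Thread(target=enactor, args=(2, ["10.0.0.2"], 0.0))
t1.start(); t2.start(); t1.join(); t2.join()

stale = records["dynamodb.example"]["plan"] == 1   # the old plan won the race

# Cleanup then removes data for the plan it believes is superseded --
# which is the one actually live -- leaving an empty record.
if stale:
    del records["dynamodb.example"]

print(records)   # {} : the endpoint has vanished
```

The point of the sketch: the lock makes each write atomic, but nothing enforces that writes land in plan order, and the cleanup pass compounds the damage.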
I disagree with their description of the root cause, based on the AAR.
It sounds like there should be a timestamp on every plan the Planner makes, and that should be carried all the way through to Route53 -- the *actual* root cause was that the installation of those delayed (and now stale) records was permitted to happen even though newer data was already installed, IME.
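The guard described above amounts to a conditional write: a minimal sketch (hypothetical API, not Route53's actual interface) where every plan carries the Planner's timestamp and the store refuses any write older than what is already installed, so a delayed enactor cannot clobber newer data:

```python
record = {"ts": 0, "addrs": []}   # current installed record + its plan timestamp

def conditional_apply(plan_ts, addrs):
    """Install a plan only if its timestamp is newer than the current record."""
    if plan_ts <= record["ts"]:
        return False              # stale plan: reject instead of overwrite
    record["ts"] = plan_ts
    record["addrs"] = addrs
    return True

assert conditional_apply(2, ["10.0.0.2"]) is True    # newer plan lands
assert conditional_apply(1, ["10.0.0.1"]) is False   # delayed older plan rejected
assert record["addrs"] == ["10.0.0.2"]               # newest data survives
```

With this check at the point of installation, the delayed enactor's write becomes a harmless no-op instead of an outage.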
Cheers,
-- jra
--
Jay R. Ashworth                  Baylink                       jra@baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates       http://www.bcp38.info          2000 Land Rover DII
St Petersburg FL USA      BCP38: Ask For It By Name!           +1 727 647 1274

______________________________________________
Outages-discussion mailing list
outages-discussion@outages.org
Sign up for an account: https://lists.outages.org/accounts/signup/
To subscribe send an email to outages-discussion-join@outages.org
To unsubscribe send an email to outages-discussion-leave@outages.org
To contact the list owners: outages-owner@outages.org
Archives: https://lists.outages.org/archives/list/outages-discussion@outages.org/