[Outages-discussion] Re: [EXTERNAL] Re: [Outages] AWS US-EAST-1

23 Oct 2025

It looks like the "Post-Event Summary" for this outage has been published.  I've excerpted the one-sentence root cause analysis (emphasis mine):

https://aws.amazon.com/message/101925/

The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.

-Brad Chapman

—Sent from my iPhone

On Oct 23, 2025, at 6:17 AM, Lee via Outages-discussion <outages-discussion@outages.org> wrote:

On Wed, Oct 22, 2025 at 5:03 PM Peter Beckman wrote:

Granted, I was there from 2010-2013. I'm sure things have changed.

I'm not sure I'd believe an opinion piece based on a loud employee
departure, extrapolating that AWS has few knowledgeable employees left.

Did you look at the engadget link?
 https://urldefense.com/v3/__https://www.engadget.com/amazon-attrition-leader...

 "An investigation from the New York Times found that, among hourly
employees, Amazon’s turnover was approximately 150 percent annually,
while work from the Wall Street Journal and National Employment Law
Project have both found turnover to be around 100 percent in
warehouses — double the industry average. The rate at which Amazon has
burned through the American working-age populace led to another piece
of internal research, obtained this summer by Recode, which cautioned
that the company might “deplete the available labor supply in the US”
in certain metro regions within a few years."

I'm guessing that hourly employees means warehouse workers - yes?  But
if Amazon is such a great place for "knowledge workers", your 4-year
stint at Amazon is hardly a ringing endorsement of their ability to
retain skilled workers.

AWS increases complexity of their systems regularly. They are able to, or
used to be able to, recruit excellent people. Outages will ALWAYS happen,
even if you have "the best" engineers ever. We're human. We will make
mistakes.

So yes, based on what I know of AWS, and who I STILL know at AWS, yes, even
still.

I don't know anybody that works for Amazon but their reputation as an
employer sure seems to be in the toilet.  But even so, maybe they can
recruit excellent people ... then the question is how long do they
retain those excellent people?

Regards,
Lee

On Wed, 22 Oct 2025, Lee wrote:

But it is also why AWS can recover from issues quickly, because they built
everything, and have the on-call and on-site staff to resolve it quickly,

Even still?
ref:  https://urldefense.com/v3/__https://news.ycombinator.com/item?id=45649178__;...

Regards
Lee

On Tue, Oct 21, 2025 at 11:33 PM Peter Beckman via Outages-discussion
<outages-discussion@outages.org> wrote:

Having seen the code and infrastructure within AWS, describing any cloud
service as "being held together with spit and baling wire" is uninformed.
Your lack of direct control of it does not make it fragile.

Unless you believe the Internet at large is in the same situation. /s

There is (was) a weekly meeting of the heads of all AWS services that
focuses on reliability, redundancy, and avoiding mistakes that cause ANY
customer-impacting outages, and on communicating those lessons to every
OTHER team to ensure they didn't make the same mistakes. Charlie Bell was a
powerhouse in that meeting.

There are interdependencies between AWS services, yes, though when I was
there they tried to make decisions to keep service dependencies separate
when they were deemed critical. Other services are just more easily built
and maintained when they depend on AWS core services like EC2, EBS, S3, RDS.

Which is why you sometimes see cascading outages -- e.g. AWS Transcribe is
unavailable because S3 is having issues.

But it is also why AWS can recover from issues quickly, because they built
everything, and have the on-call and on-site staff to resolve it quickly,
or as quickly as a huge system can be. 9-15 hours seemed painful, but for a
large-scale system that is pretty remarkable, given the last outage of more
than an hour was over a year ago.

AWS Outage History
    https://urldefense.com/v3/__https://aws.amazon.com/premiumsupport/technology...

There will be a beautiful and detailed writeup on how this outage occurred
and specifically what they have already done and will do to ensure that
such a situation won't happen again, and they really do mean it.

Just because YOU, the customer or end user, are not in control of resolving
the outage does not mean that you could have recovered from it any faster
had you been in control.

Being in control requires skilled staff to design, build, maintain, and
repair the system/infrastructure, on top of the staff that built the
application it runs on.

On Tue, 21 Oct 2025, David Eddleman via Outages-discussion wrote:

Part of the issue, though, is that AWS (and Azure, and GCP, and ...) is
at least partially a black box. There's a lot going on that makes those
things work that you don't or can't see. You're depending upon that Secret
Sauce™ working fine for you all the time. If any part of it is held
together with spit and baling wire (and you damn well better believe a
not-insignificant part is!), when it fails over it's going to take a lot
with it. When it does, you may have the ideal vendor-specified redundancy
in place but you're still going to be hurting. Whereas if you self-host,
you can get that redundancy, but you gotta both pay for it (rather than it
coming out of a "nebulous" bill) and have the skilled staff to keep it up
and make it work.

We saw that with this outage. Seemingly unrelated parts caused others to
fail due to internal dependencies. We even saw that with the CrowdStrike
fiasco, where EC2 storage latency was skyrocketing as people were manually
migrating and attaching volumes to other machines to try and recover them,
as there's no console access.

On Tue, Oct 21, 2025 at 11:51 AM Peter Beckman via Outages-discussion <
outages-discussion@outages.org> wrote:

On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:

Truly fault tolerant is not budget friendly.

Having worked for AWS, and having run multi-region fault-tolerant systems
for many years, it *can* be budget-friendly, if you are willing to put in
the effort and planning.

I can find two different hosting companies that offer bare-metal hosting,
and confirm that both are using a different mix of connectivity on
different ASNs, and are in geographically different areas.

I can deploy my workload across those systems, reducing risk, but also have
the workload spread across those disparate systems, so I don't need to
double my infrastructure costs.

This is even possible in AWS -- they provide multiple tools for
multi-region and multi-AZ deployments. When I worked for AWS my team built
AMI Copy in 2012/2013, so you could move AMIs between regions with an API
call, making it easier to start up new EC2 instances with your existing
images.
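
A minimal sketch of what such a cross-region copy might look like today,
assuming the boto3 EC2 client and its copy_image call. The helper name
copy_ami, the regions, and the AMI ID in the usage note are illustrative
and not from this thread. The client is passed in as an argument so the
function can be exercised without AWS credentials.

```python
from typing import Any


def copy_ami(dest_client: Any, source_region: str,
             source_image_id: str, name: str) -> str:
    """Copy an AMI into the region that dest_client is bound to.

    dest_client is assumed to be a boto3 EC2 client created for the
    destination region, e.g. boto3.client("ec2", region_name="us-west-2").
    """
    response = dest_client.copy_image(
        Name=name,
        SourceImageId=source_image_id,
        SourceRegion=source_region,
    )
    # The call returns before the copy completes; the new AMI is not
    # usable until its state becomes "available".
    return response["ImageId"]
```

With a real client this would be called along the lines of
copy_ami(boto3.client("ec2", region_name="us-west-2"), "us-east-1",
"ami-0123456789abcdef0", "web-server-copy"), where the AMI ID is a
placeholder.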

RDS has cross-region read-replicas. DynamoDB was built with multi-region in
mind.

You DO need to assume and plan that a whole AZ or Region will go dark, and
if your systems just immediately fail when that happens, then you've done a
less-than-ideal job of building your systems to be fault-tolerant.
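
One way to picture "not failing immediately" is a client that tries an
ordered list of regions and moves on when one is unreachable. This is only
a sketch under assumptions: the names call_with_failover and
AllRegionsFailed are hypothetical, and a real system would also need
health checks, timeouts, and data replicated to the fallback region.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when no region could serve the request."""


def call_with_failover(regions: Sequence[str],
                       request: Callable[[str], T]) -> T:
    """Call request(region) for each region in order; return the first success."""
    errors: List[str] = []
    for region in regions:
        try:
            return request(region)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{region}: {exc}")
    # Every region failed (or the list was empty): surface all errors at once.
    raise AllRegionsFailed("; ".join(errors))
```

The caller supplies the per-region request, so the same wrapper works
whether the regions are AWS regions or two unrelated hosting companies.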

Yes, it adds complexity and you have to test regularly, but it does NOT
need to add huge amounts of additional costs. You just need to know what
you're doing.

Beckman
---------------------------------------------------------------------------
Peter Beckman                                                  Internet Guy
beckman@angryox.com
https://urldefense.com/v3/__https://www.angryox.com/__;!!PIZeeW5wscynRQ!vEqb...
---------------------------------------------------------------------------
______________________________________________
Outages-discussion mailing list outages-discussion@outages.org
Sign up for an account https://urldefense.com/v3/__https://lists.outages.org/accounts/signup/__;!!P...
To subscribe send an email to outages-discussion-join@outages.org
To unsubscribe send an email to outages-discussion-leave@outages.org
To contact the list owners outages-owner@outages.org
Archives
https://urldefense.com/v3/__https://lists.outages.org/archives/list/outages-...

Thank you for using outages-discussion Lists!


Chapman, Brad (NBCUniversal)