[Outages-discussion] Re: [Outages] AWS US-EAST-1

23 Oct 2025

      On Mon, Oct 20, 2025 at 09:50:54PM -0400, Robert Webb via Outages-discussion wrote:
...
> Isn't this pretty much the same thing that happened a year or so ago in US
> East that took down almost all of AWS?

https://aws.amazon.com/premiumsupport/technology/pes/ is always worth
saving and sifting through.

The Kinesis outage in 2020 taught me that AWS internally suffers from
(what I consider to be) the same problem that AWS customers suffer from:
interwoven AWS service dependencies that effectively act as SPoFs.

As a reminder, from that outage we learned Cognito depends on Kinesis,
CloudWatch relies on Kinesis, EventBridge relies on Kinesis, etc...  The
same situation repeated itself (in a different fashion) in 2024.  These
two incidents are documented in AWS's post-mortems above.
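
To make the "effectively SPoFs" point concrete, here is a minimal Python
sketch that walks a dependency map and lists everything affected when one
service fails.  The map holds only the three relationships named above; the
names DEPENDS_ON and blast_radius are made up for illustration, and none of
this is a complete or authoritative picture of AWS internals.

```python
from collections import defaultdict, deque

# Direct dependencies mentioned above (illustrative only, not exhaustive).
DEPENDS_ON = {
    "Cognito": ["Kinesis"],
    "CloudWatch": ["Kinesis"],
    "EventBridge": ["Kinesis"],
}


def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can ask "who depends on this service?"
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    # Breadth-first walk outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(blast_radius(DEPENDS_ON, "Kinesis")))
# prints ['CloudWatch', 'Cognito', 'EventBridge']
```

Feeding it your own service map (including the non-AWS pieces) is one cheap
way to see which single failure takes out the most.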

There are plenty of AWS services that suffer from this phenomenon from
the "customer use" perspective: you want to use service X, but you end
up having to configure pieces of service Y and service Z, which depends
on service C.  By the time you're done, you've got 6 AWS services
involved for something that, had you built it yourself, quite likely
would have needed 2 at most.  Two added bonuses: someone -- usually
your SAs or SREs -- now has to remember this whole interwoven mess, all
while you spend even more money with AWS.  (I can't speak for other
cloud providers as I have no experience with them.)

I'm highly biased because I'm a sysadmin by profession.  In general, my
30 years of experience have taught me to trust nothing and minimise
dependencies as much as possible; KISS principle above all else, and
assume everything can (will) break.  Think about those ramifications,
test those scenarios.  If you can't solve them (reason doesn't matter!),
OK, just document the ones you're aware of.  It's usually infeasible to
eliminate every single SPoF, but documenting (and publishing info
about) the ones you know about is good practice.

Remember: fewer things involved = less stuff to have to worry about =
fewer things that can go wrong = less time spent during outages = fewer
outages = happier customers + happier engineers.

-- 
| Jeremy Chadwick                                 jdc@koitsu.org |
| UNIX Systems Administrator                      PGP 0x2A389531 |
| Making life hard for others since 1977.                        |