On Mon, Oct 20, 2025 at 09:50:54PM -0400, Robert Webb via Outages-discussion wrote:
> Isn't this pretty much the same thing that happened a year or so ago in US East that took down almost all of AWS?

https://aws.amazon.com/premiumsupport/technology/pes/ is always worth saving and sifting through.

The Kinesis outage in 2020 taught me that AWS internally suffers from (what I consider to be) the same problem that AWS customers suffer from: interwoven AWS service dependencies, which effectively become SPoFs. As a reminder, from that outage we learned Cognito depends on Kinesis, CloudWatch relies on Kinesis, EventBridge relies on Kinesis, etc. The same situation repeated itself (in a different fashion) in 2024. These two incidents are documented in AWS's post-mortems above.

There are plenty of AWS services that suffer from this phenomenon from the "customer use" perspective: you want to use service X, but you end up having to configure pieces for service Y and service Z, which depends on service C. By the time you're done, you've got 6 AWS services involved for something that, had you home-grown it yourself, might have needed 2 at most. Two added bonuses: now someone -- usually your SAs or SREs -- has to remember this whole interwoven mess, all while spending even more money with AWS. (I can't speak of other cloud providers as I have no experience with them.)

I'm highly biased because I'm a sysadmin by profession. In general, my 30 years of experience have taught me to trust nothing and minimise dependencies as much as possible; KISS principle above all else, and assume everything can (will) break. Think about those ramifications, test those scenarios. If you can't solve them (reason doesn't matter!), OK, just document the ones you're aware of. It's usually infeasible to try and relieve every single SPoF, but documenting (and publishing info about) the ones you know about is good practice.

Remember: fewer things involved = less stuff to have to worry about = fewer things that can go wrong = less time spent during outages = fewer outages = happier customers + happier engineers.

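For what it's worth, "document the SPoFs you know about" doesn't need fancy tooling to get started. Below is a rough Python sketch, assuming you keep a hand-written map of who depends on whom. The three Kinesis edges are the ones mentioned above; "my-app", its edges, and the function names are made up purely for illustration. It simply walks the map and reports what breaks if a given service goes down.

```python
from collections import deque

# Direct dependencies. The three Kinesis edges are the ones named above;
# "my-app" and its edges are hypothetical, for illustration only.
DEPENDS_ON = {
    "my-app": ["Cognito", "CloudWatch", "EventBridge"],
    "Cognito": ["Kinesis"],
    "CloudWatch": ["Kinesis"],
    "EventBridge": ["Kinesis"],
    "Kinesis": [],
}


def transitive_deps(service, graph):
    """Return every service reachable from `service`, excluding itself."""
    seen = set()
    queue = deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also guards against cycles
        seen.add(dep)
        queue.extend(graph.get(dep, []))
    seen.discard(service)
    return seen


def blast_radius(failed, graph):
    """Return every service that directly or indirectly depends on `failed`."""
    return {s for s in graph if failed in transitive_deps(s, graph)}


if __name__ == "__main__":
    for svc in sorted(DEPENDS_ON):
        print(f"{svc}: breaks {sorted(blast_radius(svc, DEPENDS_ON))}")
```

Anything whose blast radius covers most of the map is a SPoF candidate worth writing down, even if you can't fix it.
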
-- 
| Jeremy Chadwick                                     jdc@koitsu.org |
| UNIX Systems Administrator                          PGP 0x2A389531 |
| Making life hard for others since 1977.                            |