On Oct 20, 2025, at 05:32, Bill Woodcock via Outages <outages@outages.org> wrote:
Yet another reminder not to find yourself among the Eloi outsourcing their jobs to “cloud services.”
Yeah, not only do I think that ship has sailed, I also am reasonably certain this is not such a binary choice.

Perhaps, like all things in network operations, we should avoid SPOFs? A single “Availability Zone” (or whatever your provider calls it) is most certainly a SPOF. One could consider a single cloud provider a SPOF. I do.

Network operations - like life itself - is about tradeoffs. No one has infinite bandwidth, storage, time, money, resources, etc. Choose what you optimize for wisely.

Finally one interesting side effect of the current industry and the fact “hyperscalers” exist: Customers, and bosses, are more tolerant of outages when Amazon, Google, Microsoft going down is front page news. (Showing my age - not sure that phrase means anything today?) If your service is hosted on Billy-Joe-Bob’s Bait, Tackle, and Cloud service, not so much. Which makes it that much harder to break the hold AWS has.

--
TTFN,
patrick
I always enjoy the armchair "haha that's why you don't use <x>" engineers.

It's ironic that I was having a discussion with colleagues this morning about not putting all of our eggs in one basket, and this happened. This shouldn't be an "I told you so" moment, it should be an Aha! moment.

Nothing is bulletproof. Design with the assumption that you're actively fighting off failure from every direction, and just decide how much failure is acceptable.

On Mon, Oct 20, 2025 at 8:39 AM Patrick W. Gilmore via Outages-discussion <outages-discussion@outages.org> wrote:
On Oct 20, 2025, at 05:32, Bill Woodcock via Outages <outages@outages.org> wrote:
Yet another reminder not to find yourself among the Eloi outsourcing their jobs to “cloud services.”
Yeah, not only do I think that ship has sailed, I also am reasonably certain this is not such a binary choice.
Perhaps, like all things in network operations, we should avoid SPOFs? A single “Availability Zone” (or whatever your provider calls it) is most certainly a SPOF. One could consider a single cloud provider a SPOF. I do.
Network operations - like life itself - is about tradeoffs. No one has infinite bandwidth, storage, time, money, resources, etc. Choose what you optimize for wisely.
Finally one interesting side effect of the current industry and the fact “hyperscalers” exist: Customers, and bosses, are more tolerant of outages when Amazon, Google, Microsoft going down is front page news. (Showing my age - not sure that phrase means anything today?) If your service is hosted on Billy-Joe-Bob’s Bait, Tackle, and Cloud service, not so much. Which makes it that much harder to break the hold AWS has.
-- TTFN, patrick
On Mon, Oct 20, 2025 at 1:39 PM Shaun Potts via Outages-discussion < outages-discussion@outages.org> wrote:
I always enjoy the armchair "haha that's why you don't use <x>" engineers.
I always enjoy it when the next generation of engineers with fresh and exciting new ideas are forced to re-learn what "single point of failure" means.

This is usually followed a few years later by realizing that SPOF includes companies (like AWS today), the various definitions of layer 8 on the OSI stack, and that one time I fired up 'cssh' with the wrong target and happily restarted a service for all customers instead of a much smaller subset.

-A
I think we're both saying the same thing in a different way.

Bad design is bad design is bad design. I think the metric is an acceptable level of failure.

Remember a few years ago it was some random data center in San Antonio or something like that resulting in a multi-day outage for Microsoft because some core service lived/routed through that.

The only guy we should be laughing at is the one who thinks he can design away all of these issues next year when his datacenter gets struck by lightning.

On Mon, Oct 20, 2025 at 5:10 PM Aaron C. de Bruyn <aaron@heyaaron.com> wrote:
On Mon, Oct 20, 2025 at 1:39 PM Shaun Potts via Outages-discussion < outages-discussion@outages.org> wrote:
I always enjoy the armchair "haha that's why you don't use <x>" engineers.
I always enjoy it when the next generation of engineers with fresh and exciting new ideas are forced to re-learn what "single point of failure" means. This is usually followed a few years later by realizing that SPOF includes companies (like AWS today), the various definitions of layer 8 on the OSI stack, and that one time I fired up 'cssh' with the wrong target and happily restarted a service for all customers instead of a much smaller subset.
-A
Truly fault tolerant is not budget friendly.

Guess which wins in the C-Suite/Boardroom?

On Mon, Oct 20, 2025 at 6:26 PM Shaun Potts via Outages-discussion <outages-discussion@outages.org> wrote:
I think we're both saying the same thing in a different way.
Bad design is bad design is bad design. I think the metric is an acceptable level of failure.
Remember a few years ago it was some random data center in San Antonio or something like that resulting in a multi-day outage for Microsoft because some core service lived/routed through that.
The only guy we should be laughing at is the one who thinks he can design away all of these issues next year when his datacenter gets struck by lightning.
On 10/21/25 11:05, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Guess which wins in the C-Suite/Boardroom?
So true. Every time you add a 9 to the SLA you move the budget decimal place to the right.

--
inoc.net!rblayzor
PGP: https://pgp.inoc.net/rblayzor/
----- Original Message -----
From: "Robert Blayzor via Outages-discussion" <outages-discussion@outages.org>
On 10/21/25 11:05, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Guess which wins in the C-Suite/Boardroom?
So true. Every time you add a 9 to the SLA you move the budget decimal place to the right.
At least one. Once you get to 6-nines, it's probably 2 orders of magnitude for that step.

7-nines? You ain't got that money.

Cheers,
-- jra
--
Jay R. Ashworth                  Baylink                       jra@baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates       http://www.bcp38.info          2000 Land Rover DII
St Petersburg FL USA      BCP38: Ask For It By Name!           +1 727 647 1274
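[To put rough numbers on "adding a nine": a quick, purely illustrative calculation of the downtime budget each nine allows per year. This is plain arithmetic, not anything from a particular SLA, and real SLAs define their own measurement windows and exclusions.]

# Allowed downtime per year for each "nine" of availability (illustrative only).
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in range(2, 8):
    unavailability = 10 ** -nines          # e.g. 3 nines -> 0.001
    downtime_min = MINUTES_PER_YEAR * unavailability
    print(f"{nines} nines ({1 - unavailability:.7f}): "
          f"{downtime_min:10.3f} min/year ({downtime_min / 60:9.4f} h/year)")

[Three nines is roughly 8.8 hours a year of allowed downtime; six nines is about 31 seconds, which is where Jay's "orders of magnitude" cost estimate starts to bite.]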
On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Having worked for AWS, and having run multi-region fault-tolerant systems for many years, it *can* be budget-friendly, if you are willing to put in the effort and planning.

I can find two different hosting companies that offer bare-metal hosting, and confirm that both are using a different mix of connectivity on different ASNs, and are in geographically different areas.

I can deploy my workload across those systems, reducing risk, but also have the workload spread across those disparate systems, so I don't need to double my infrastructure costs.

This is even possible in AWS -- they provide multiple tools for multi-region and multi-AZ deployments. When I worked for AWS my team built AMI Copy in 2012/2013, so you could move AMIs between regions with an API call, making it easier to start up new EC2 instances with your existing images.

RDS has cross-region read-replicas. DynamoDB was built with multi-region in mind.

You DO need to assume and plan that a whole AZ or Region will go dark, and if your systems just immediately fail when that happens, then you've done a less-than-ideal job of building your systems to be fault-tolerant.

Yes, it adds complexity and you have to test regularly, but it does NOT need to add huge amounts of additional costs. You just need to know what you're doing.

Beckman
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
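[For reference, a hedged sketch of the cross-region AMI copy Peter describes, using boto3. The AMI ID, image name, and region pair are placeholders, and error handling is omitted; it is a sketch of the API shape, not anyone's production tooling.]

import boto3

SOURCE_REGION = "us-east-1"
TARGET_REGION = "us-west-2"
SOURCE_AMI_ID = "ami-0123456789abcdef0"   # placeholder image ID

# CopyImage is called against the *destination* region; it copies the image
# (and its backing snapshots) over asynchronously.
ec2_target = boto3.client("ec2", region_name=TARGET_REGION)

resp = ec2_target.copy_image(
    Name="my-app-image-dr-copy",
    SourceImageId=SOURCE_AMI_ID,
    SourceRegion=SOURCE_REGION,
)
new_ami_id = resp["ImageId"]
print("New AMI in", TARGET_REGION, ":", new_ami_id)

# Wait until the copy is actually usable before counting on it for failover.
waiter = ec2_target.get_waiter("image_available")
waiter.wait(ImageIds=[new_ami_id])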
this guy gets it

multi cloud isn't i need to deploy my 8 servers in 2 locations, it's i need to deploy 4 servers in 2 locations and make my software be able to use either if the other is unavailable

it's not rocket science but c levels sure make it seem like it is

On Tue, Oct 21, 2025, 12:47 PM Peter Beckman via Outages-discussion <outages-discussion@outages.org> wrote:
On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Having worked for AWS, and having run multi-region fault-tolerant systems for many years, it *can* be budget-friendly, if you are willing to put in the effort and planning.
I can find two different hosting companies that offer bare-metal hosting, and confirm that both are using a different mix of connectivity on different ASNs, and are in geographically different areas.
I can deploy my workload across those systems, reducing risk, but also have the workload spread across those disparate systems, so I don't need to double my infrastructure costs.
This is even possible in AWS -- they provide multiple tools for multi-region and multi-AZ deployments. When I worked for AWS my team built AMI Copy in 2012/2013, so you could move AMIs between regions with an API call, making it easier to start up new EC2 instances with your existing images.
RDS has cross-region read-replicas. DynamoDB was built with multi-region in mind.
You DO need to assume and plan that a whole AZ or Region will go dark, and if your systems just immediately fail when that happens, then you've done a less-than-ideal job of building your systems to be fault-tolerant.
Yes, it adds complexity and you have to test regularly, but it does NOT need to add huge amounts of additional costs. You just need to know what you're doing.
Beckman
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
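[A minimal sketch of Shaun's point above -- split the capacity across two sites and let the software fail over between them rather than doubling everything. The endpoints and the use of Python requests here are illustrative assumptions, not anyone's actual setup.]

import requests

ENDPOINTS = [
    "https://app.site-a.example.net",   # 4 servers behind this
    "https://app.site-b.example.net",   # 4 servers behind this
]

def call_api(path, timeout=2.0):
    # Try each site in turn; a site that is down or unreachable is skipped.
    last_error = None
    for base in ENDPOINTS:
        try:
            resp = requests.get(base + path, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_error = exc
    raise RuntimeError(f"all sites unavailable: {last_error}")

# Usage: call_api("/healthz") returns from whichever site answers.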
So "cloud hopping" is the idea that you can dynamically relocate your infrastructure on the fly and utilize software like https://libcloud.readthedocs.io/en/stable/compute/pricing.html to make those choices. Relying on a particular vendor's flavor of database can be dangerous.

I am old and grumpy, so when I say the what-ifs need to be in the README of any architecture, I share it out of memory and pain. </rant>

Now I expect spam emails from this reply. :(

On Tue, Oct 21, 2025 at 11:14 AM Shaun Potts via Outages-discussion <outages-discussion@outages.org> wrote:
this guy gets it
multi cloud isn't i need to deploy my 8 servers in 2 locations, it's i need to deploy 4 servers in 2 locations and make my software be able to use either if the other is unavailable
it's not rocket science but c levels sure make it seem like it is
On Tue, Oct 21, 2025, 12:47 PM Peter Beckman via Outages-discussion < outages-discussion@outages.org> wrote:
On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Having worked for AWS, and having run multi-region fault-tolerant systems for many years, it *can* be budget-friendly, if you are willing to put in the effort and planning.
I can find two different hosting companies that offer bare-metal hosting, and confirm that both are using a different mix of connectivity on different ASNs, and are in geographically different areas.
I can deploy my workload across those systems, reducing risk, but also have the workload spread across those disparate systems, so I don't need to double my infrastructure costs.
This is even possible in AWS -- they provide multiple tools for multi-region and multi-AZ deployments. When I worked for AWS my team built AMI Copy in 2012/2013, so you could move AMIs between regions with an API call, making it easier to start up new EC2 instances with your existing images.
RDS has cross-region read-replicas. DynamoDB was built with multi-region in mind.
You DO need to assume and plan that a whole AZ or Region will go dark, and if your systems just immediately fail when that happens, then you've done a less-than-ideal job of building your systems to be fault-tolerant.
Yes, it adds complexity and you have to test regularly, but it does NOT need to add huge amounts of additional costs. You just need to know what you're doing.
Beckman
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
-- - Andrew "lathama" Latham -
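[Following up on Andrew's libcloud pointer, a hedged sketch of talking to two providers through one API so a workload can land on whichever is reachable and cheapest. The provider choices, credentials, and the assumption that both drivers report prices via list_sizes() are illustrative, not a recommendation.]

from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

def get_drivers():
    # Placeholder credentials; swap in whatever providers you actually use.
    ec2 = get_driver(Provider.EC2)("AWS_KEY", "AWS_SECRET", region="us-east-1")
    do = get_driver(Provider.DIGITAL_OCEAN)("DO_TOKEN", api_version="v2")
    return {"aws": ec2, "digitalocean": do}

def cheapest_reachable_size(drivers, min_ram_mb=2048):
    # Ask each provider for its sizes; a provider that errors out
    # (e.g. mid-outage) is skipped instead of failing the whole run.
    candidates = []
    for name, driver in drivers.items():
        try:
            for size in driver.list_sizes():
                if size.ram >= min_ram_mb and size.price:
                    candidates.append((size.price, name, size.id))
        except Exception as exc:
            print(f"skipping {name}: {exc}")
    return min(candidates, default=None)

print(cheapest_reachable_size(get_drivers()))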
Part of the issue is, though, is that AWS (and Azure, and GCP, and ...) is at least partially a black box. There's a lot going on that makes those things work that you don't or can't see. You're depending upon that Secret Sauce™ working fine for you all the time. If any part of it is held together with spit and bailing wire (and you damn well better believe a not-insignificant part is!), when it fails over it's going to take a lot with it. When it does, you may have the ideal vendor-specified redundancy in place but you're still going to be hurting. Whereas if you self-host, you can get that redundancy, but you both gotta pay for it (rather than it coming out of a "nebulous" bill) and have the skilled staff to keep it up and make it work.

We saw that with this outage. Seemingly unrelated parts caused others to fail due to internal dependencies. We even saw that with the Crowdstrike fiasco, where EC2 storage latency was skyrocketing as people were manually migrating and attaching volumes to other machines to try and recover them, as there's no console access.

On Tue, Oct 21, 2025 at 11:51 AM Peter Beckman via Outages-discussion <outages-discussion@outages.org> wrote:
On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Having worked for AWS, and having run multi-region fault-tolerant systems for many years, it *can* be budget-friendly, if you are willing to put in the effort and planning.
I can find two different hosting companies that offer bare-metal hosting, and confirm that both are using a different mix of connectivity on different ASNs, and are in geographically different areas.
I can deploy my workload across those systems, reducing risk, but also have the workload spread across those disparate systems, so I don't need to double my infrastructure costs.
This is even possible in AWS -- they provide multiple tools for multi-region and multi-AZ deployments. When I worked for AWS my team built AMI Copy in 2012/2013, so you could move AMIs between regions with an API call, making it easier to start up new EC2 instances with your existing images.
RDS has cross-region read-replicas. DynamoDB was built with multi-region in mind.
You DO need to assume and plan that a whole AZ or Region will go dark, and if your systems just immediately fail when that happens, then you've done a less-than-ideal job of building your systems to be fault-tolerant.
Yes, it adds complexity and you have to test regularly, but it does NOT need to add huge amounts of additional costs. You just need to know what you're doing.
Beckman
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
Having seen the code and infrastructure within AWS, calling any cloud service as "being held together with spit and bailing wire" is uninformed. Your lack of direct control of it does not make it fragile.

Unless you believe the Internet at large is in the same situation. /s

There is (was) a weekly meeting of all heads of all AWS services that focus on reliability, redundancy, avoiding mistakes that cause ANY customer impacting outages, and communicating that to every OTHER team to ensure they didn't make the same mistakes. Charlie Bell was a powerhouse in that meeting.

There are interdependencies on AWS services, yes, though when I was there they tried to make decisions to keep service dependencies separate when they were deemed critical. Other services are just more easily built and maintained when they depend on AWS core services like EC2, EBS, S3, RDS.

Which is why you see sometimes cascading outages -- e.g. AWS Transcribe is unavailable because S3 is having issues.

But is also why AWS can recover from issues quickly, because they built everything, and have the on-call and on-site staff to resolve it quickly, or as quickly as a huge system can be. 9-15 hours seemed painful, but for a large scale system is pretty remarkable, given the last hour+ outage was over a year ago.

AWS Outage History
https://aws.amazon.com/premiumsupport/technology/pes/

There will be a beautiful and detailed writeup on how this outage occurred and specifically what they have already done and will do to ensure that such a situation won't happen again, and they really do mean it.

Just because YOU, the customer or end user, are not in control of resolving the outage, does not mean that you could have recovered from the outage any faster.

Because that being in control requires skilled staff to design, build, maintain, and repair the system/infrastructure, on top of the staff that built the application it runs on.

On Tue, 21 Oct 2025, David Eddleman via Outages-discussion wrote:
Part of the issue is, though, is that AWS (and Azure, and GCP, and ...) is at least partially a black box. There's a lot going on that makes those things work that you don't or can't see. You're depending upon that Secret Sauce™ working fine for you all the time. If any part of it is held together with spit and bailing wire (and you damn well better believe a not-insignificant part is!), when it fails over it's going to take a lot with it. When it does, you may have the ideal vendor-specified redundancy in place but you're still going to be hurting. Whereas if you self-host, you can get that redundancy, but you both gotta pay for it (rather than it coming out of a "nebulous" bill) and have the skilled staff to keep it up and make it work.
We saw that with this outage. Seemingly unrelated parts caused others to fail due to internal dependencies. We even saw that with the Crowdstrike fiasco, where EC2 storage latency was skyrocketing as people were manually migrating and attaching volumes to other machines to try and recover them, as there's no console access.
On Tue, Oct 21, 2025 at 11:51 AM Peter Beckman via Outages-discussion < outages-discussion@outages.org> wrote:
On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Having worked for AWS, and having run multi-region fault-tolerant systems for many years, it *can* be budget-friendly, if you are willing to put in the effort and planning.
I can find two different hosting companies that offer bare-metal hosting, and confirm that both are using a different mix of connectivity on different ASNs, and are in geographically different areas.
I can deploy my workload across those systems, reducing risk, but also have the workload spread across those disparate systems, so I don't need to double my infrastructure costs.
This is even possible in AWS -- they provide multiple tools for multi-region and multi-AZ deployments. When I worked for AWS my team built AMI Copy in 2012/2013, so you could move AMIs between regions with an API call, making it easier to start up new EC2 instances with your existing images.
RDS has cross-region read-replicas. DynamoDB was built with multi-region in mind.
You DO need to assume and plan that a whole AZ or Region will go dark, and if your systems just immediately fail when that happens, then you've done a less-than-ideal job of building your systems to be fault-tolerant.
Yes, it adds complexity and you have to test regularly, but it does NOT need to add huge amounts of additional costs. You just need to know what you're doing.
Beckman
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
But is also why AWS can recover from issues quickly, because they built everything, and have the on-call and on-site staff to resolve it quickly,
Even still? ref: https://news.ycombinator.com/item?id=45649178

Regards
Lee

On Tue, Oct 21, 2025 at 11:33 PM Peter Beckman via Outages-discussion <outages-discussion@outages.org> wrote:
Having seen the code and infrastructure within AWS, calling any cloud service as "being held together with spit and bailing wire" is uninformed. Your lack of direct control of it does not make it fragile.
Unless you believe the Internet at large is in the same situation. /s
There is (was) a weekly meeting of all heads of all AWS services that focus on reliability, redundancy, avoiding mistakes that cause ANY customer impacting outages, and communicating that to every OTHER team to ensure they didn't make the same mistakes. Charlie Bell was a powerhouse in that meeting.
There are interdependencies on AWS services, yes, though when I was there they tried to make decisions to keep service dependencies separate when they were deemed critical. Other services are just more easily built and maintained when they depend on AWS core services like EC2, EBS, S3, RDS.
Which is why you see sometimes cascading outages -- e.g. AWS Transcribe is unavailable because S3 is having issues.
But is also why AWS can recover from issues quickly, because they built everything, and have the on-call and on-site staff to resolve it quickly, or as quickly as a huge system can be. 9-15 hours seemed painful, but for a large scale system is pretty remarkable, given the last hour+ outage was over a year ago.
AWS Outage History https://aws.amazon.com/premiumsupport/technology/pes/
There will be a beautiful and detailed writeup on how this outage occurred and specifically what they have already done and will do to ensure that such a situation won't happen again, and they really do mean it.
Just because YOU, the customer or end user, are not in control of resolving the outage, does not mean that you could have recovered from the outage any faster.
Because that being in control requires skilled staff to design, build, maintain, and repair the system/infrastructure, on top of the staff that built the application it runs on.
On Tue, 21 Oct 2025, David Eddleman via Outages-discussion wrote:
Part of the issue is, though, is that AWS (and Azure, and GCP, and ...) is at least partially a black box. There's a lot going on that makes those things work that you don't or can't see. You're depending upon that Secret Sauce™ working fine for you all the time. If any part of it is held together with spit and bailing wire (and you damn well better believe a not-insignificant part is!), when it fails over it's going to take a lot with it. When it does, you may have the ideal vendor-specified redundancy in place but you're still going to be hurting. Whereas if you self-host, you can get that redundancy, but you both gotta pay for it (rather than it coming out of a "nebulous" bill) and have the skilled staff to keep it up and make it work.
We saw that with this outage. Seemingly unrelated parts caused others to fail due to internal dependencies. We even saw that with the Crowdstrike fiasco, where EC2 storage latency was skyrocketing as people were manually migrating and attaching volumes to other machines to try and recover them, as there's no console access.
On Tue, Oct 21, 2025 at 11:51 AM Peter Beckman via Outages-discussion < outages-discussion@outages.org> wrote:
On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Having worked for AWS, and having run multi-region fault-tolerant systems for many years, it *can* be budget-friendly, if you are willing to put in the effort and planning.
I can find two different hosting companies that offer bare-metal hosting, and confirm that both are using a different mix of connectivity on different ASNs, and are in geographically different areas.
I can deploy my workload across those systems, reducing risk, but also have the workload spread across those disparate systems, so I don't need to double my infrastructure costs.
This is even possible in AWS -- they provide multiple tools for multi-region and multi-AZ deployments. When I worked for AWS my team built AMI Copy in 2012/2013, so you could move AMIs between regions with an API call, making it easier to start up new EC2 instances with your existing images.
RDS has cross-region read-replicas. DynamoDB was built with multi-region in mind.
You DO need to assume and plan that a whole AZ or Region will go dark, and if your systems just immediately fail when that happens, then you've done a less-than-ideal job of building your systems to be fault-tolerant.
Yes, it adds complexity and you have to test regularly, but it does NOT need to add huge amounts of additional costs. You just need to know what you're doing.
Beckman
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
Granted, I was there from 2010-2013. I'm sure things have changed.

I'm not sure I'd believe an opinion piece based on a loud employee departure, extrapolating that AWS has few knowledgable employees left.

AWS increases complexity of their systems regularly. They are able to, or used to be able to, recruit excellent people. Outages will ALWAYS happen, even if you have "the best" engineers ever. We're human. We will make mistakes.

So yes, based on what I know of AWS, and who I STILL know at AWS, yes, even still.

On Wed, 22 Oct 2025, Lee wrote:
But is also why AWS can recover from issues quickly, because they built everything, and have the on-call and on-site staff to resolve it quickly,
Even still? ref: https://news.ycombinator.com/item?id=45649178
Regards Lee
On Tue, Oct 21, 2025 at 11:33 PM Peter Beckman via Outages-discussion <outages-discussion@outages.org> wrote:
Having seen the code and infrastructure within AWS, calling any cloud service as "being held together with spit and bailing wire" is uninformed. Your lack of direct control of it does not make it fragile.
Unless you believe the Internet at large is in the same situation. /s
There is (was) a weekly meeting of all heads of all AWS services that focus on reliability, redundancy, avoiding mistakes that cause ANY customer impacting outages, and communicating that to every OTHER team to ensure they didn't make the same mistakes. Charlie Bell was a powerhouse in that meeting.
There are interdependencies on AWS services, yes, though when I was there they tried to make decisions to keep service dependencies separate when they were deemed critical. Other services are just more easily built and maintained when they depend on AWS core services like EC2, EBS, S3, RDS.
Which is why you see sometimes cascading outages -- e.g. AWS Transcribe is unavailable because S3 is having issues.
But is also why AWS can recover from issues quickly, because they built everything, and have the on-call and on-site staff to resolve it quickly, or as quickly as a huge system can be. 9-15 hours seemed painful, but for a large scale system is pretty remarkable, given the last hour+ outage was over a year ago.
AWS Outage History https://aws.amazon.com/premiumsupport/technology/pes/
There will be a beautiful and detailed writeup on how this outage occurred and specifically what they have already done and will do to ensure that such a situation won't happen again, and they really do mean it.
Just because YOU, the customer or end user, are not in control of resolving the outage, does not mean that you could have recovered from the outage any faster.
Because that being in control requires skilled staff to design, build, maintain, and repair the system/infrastructure, on top of the staff that built the application it runs on.
On Tue, 21 Oct 2025, David Eddleman via Outages-discussion wrote:
Part of the issue is, though, is that AWS (and Azure, and GCP, and ...) is at least partially a black box. There's a lot going on that makes those things work that you don't or can't see. You're depending upon that Secret Sauce™ working fine for you all the time. If any part of it is held together with spit and bailing wire (and you damn well better believe a not-insignificant part is!), when it fails over it's going to take a lot with it. When it does, you may have the ideal vendor-specified redundancy in place but you're still going to be hurting. Whereas if you self-host, you can get that redundancy, but you both gotta pay for it (rather than it coming out of a "nebulous" bill) and have the skilled staff to keep it up and make it work.
We saw that with this outage. Seemingly unrelated parts caused others to fail due to internal dependencies. We even saw that with the Crowdstrike fiasco, where EC2 storage latency was skyrocketing as people were manually migrating and attaching volumes to other machines to try and recover them, as there's no console access.
On Tue, Oct 21, 2025 at 11:51 AM Peter Beckman via Outages-discussion < outages-discussion@outages.org> wrote:
On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Having worked for AWS, and having run multi-region fault-tolerant systems for many years, it *can* be budget-friendly, if you are willing to put in the effort and planning.
I can find two different hosting companies that offer bare-metal hosting, and confirm that both are using a different mix of connectivity on different ASNs, and are in geographically different areas.
I can deploy my workload across those systems, reducing risk, but also have the workload spread across those disparate systems, so I don't need to double my infrastructure costs.
This is even possible in AWS -- they provide multiple tools for multi-region and multi-AZ deployments. When I worked for AWS my team built AMI Copy in 2012/2013, so you could move AMIs between regions with an API call, making it easier to start up new EC2 instances with your existing images.
RDS has cross-region read-replicas. DynamoDB was built with multi-region in mind.
You DO need to assume and plan that a whole AZ or Region will go dark, and if your systems just immediately fail when that happens, then you've done a less-than-ideal job of building your systems to be fault-tolerant.
Yes, it adds complexity and you have to test regularly, but it does NOT need to add huge amounts of additional costs. You just need to know what you're doing.
Beckman
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
On Wed, Oct 22, 2025 at 5:03 PM Peter Beckman wrote:
Granted, I was there from 2010-2013. I'm sure things have changed.
I'm not sure I'd believe an opinion piece based on a loud employee departure, extrapolating that AWS has few knowledgable employees left.
Did you look at the engadget link?
https://www.engadget.com/amazon-attrition-leadership-ctsmd-201800110-2018001...

"An investigation from the New York Times found that, among hourly employees, Amazon’s turnover was approximately 150 percent annually, while work from the Wall Street Journal and National Employment Law Project have both found turnover to be around 100 percent in warehouses — double the industry average. The rate at which Amazon has burned through the American working-age populace led to another piece of internal research, obtained this summer by Recode, which cautioned that the company might “deplete the available labor supply in the US” in certain metro regions within a few years."

I'm guessing that hourly employees means warehouse workers - yes? But if Amazon is such a great place for "knowledge workers" your 4 year stint at amazon is hardly a ringing endorsement for their ability to retain skilled workers.
AWS increases complexity of their systems regularly. They are able to, or used to be able to, recruit excellent people. Outages will ALWAYS happen, even if you have "the best" engineers ever. We're human. We will make mistakes.
So yes, based on what I know of AWS, and who I STILL know at AWS, yes, even still.
I don't know anybody that works for Amazon but their reputation as an employer sure seems to be in the toilet. But even so, maybe they can recruit excellent people ... then the question is how long do they retain those excellent people?

Regards,
Lee
On Wed, 22 Oct 2025, Lee wrote:
But is also why AWS can recover from issues quickly, because they built everything, and have the on-call and on-site staff to resolve it quickly,
Even still? ref: https://news.ycombinator.com/item?id=45649178
Regards Lee
On Tue, Oct 21, 2025 at 11:33 PM Peter Beckman via Outages-discussion <outages-discussion@outages.org> wrote:
Having seen the code and infrastructure within AWS, calling any cloud service as "being held together with spit and bailing wire" is uninformed. Your lack of direct control of it does not make it fragile.
Unless you believe the Internet at large is in the same situation. /s
There is (was) a weekly meeting of all heads of all AWS services that focus on reliability, redundancy, avoiding mistakes that cause ANY customer impacting outages, and communicating that to every OTHER team to ensure they didn't make the same mistakes. Charlie Bell was a powerhouse in that meeting.
There are interdependencies on AWS services, yes, though when I was there they tried to make decisions to keep service dependencies separate when they were deemed critical. Other services are just more easily built and maintained when they depend on AWS core services like EC2, EBS, S3, RDS.
Which is why you see sometimes cascading outages -- e.g. AWS Transcribe is unavailable because S3 is having issues.
But is also why AWS can recover from issues quickly, because they built everything, and have the on-call and on-site staff to resolve it quickly, or as quickly as a huge system can be. 9-15 hours seemed painful, but for a large scale system is pretty remarkable, given the last hour+ outage was over a year ago.
AWS Outage History https://aws.amazon.com/premiumsupport/technology/pes/
There will be a beautiful and detailed writeup on how this outage occurred and specifically what they have already done and will do to ensure that such a situation won't happen again, and they really do mean it.
Just because YOU, the customer or end user, are not in control of resolving the outage, does not mean that you could have recovered from the outage any faster.
Because that being in control requires skilled staff to design, build, maintain, and repair the system/infrastructure, on top of the staff that built the application it runs on.
On Tue, 21 Oct 2025, David Eddleman via Outages-discussion wrote:
Part of the issue is, though, is that AWS (and Azure, and GCP, and ...) is at least partially a black box. There's a lot going on that makes those things work that you don't or can't see. You're depending upon that Secret Sauce™ working fine for you all the time. If any part of it is held together with spit and bailing wire (and you damn well better believe a not-insignificant part is!), when it fails over it's going to take a lot with it. When it does, you may have the ideal vendor-specified redundancy in place but you're still going to be hurting. Whereas if you self-host, you can get that redundancy, but you both gotta pay for it (rather than it coming out of a "nebulous" bill) and have the skilled staff to keep it up and make it work.
We saw that with this outage. Seemingly unrelated parts caused others to fail due to internal dependencies. We even saw that with the Crowdstrike fiasco, where EC2 storage latency was skyrocketing as people were manually migrating and attaching volumes to other machines to try and recover them, as there's no console access.
On Tue, Oct 21, 2025 at 11:51 AM Peter Beckman via Outages-discussion < outages-discussion@outages.org> wrote:
On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Having worked for AWS, and having run multi-region fault-tolerant systems for many years, it *can* be budget-friendly, if you are willing to put in the effort and planning.
I can find two different hosting companies that offer bare-metal hosting, and confirm that both are using a different mix of connectivity on different ASNs, and are in geographically different areas.
I can deploy my workload across those systems, reducing risk, but also have the workload spread across those disparate systems, so I don't need to double my infrastructure costs.
This is even possible in AWS -- they provide multiple tools for multi-region and multi-AZ deployments. When I worked for AWS my team built AMI Copy in 2012/2013, so you could move AMIs between regions with an API call, making it easier to start up new EC2 instances with your existing images.
RDS has cross-region read-replicas. DynamoDB was built with multi-region in mind.
You DO need to assume and plan that a whole AZ or Region will go dark, and if your systems just immediately fail when that happens, then you've done a less-than-ideal job of building your systems to be fault-tolerant.
Yes, it adds complexity and you have to test regularly, but it does NOT need to add huge amounts of additional costs. You just need to know what you're doing.
Beckman
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
---------------------------------------------------------------------------
 Peter Beckman                                                  Internet Guy
 beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
It looks like the "Post-Event Summary" for this outage has been published. I've excerpted the one-sentence root cause analysis (emphasis mine):

https://aws.amazon.com/message/101925/

The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.

-Brad Chapman

—Sent from my iPhone
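[For illustration only, a toy model of the failure class the summary describes: a check-then-act race between two DNS "planners" in which a slow, stale, empty plan wins the final write. This is NOT AWS's actual DNS automation, just a sketch of the pattern and one possible guard.]

import threading
import time

# Shared "DNS" state for one endpoint name (placeholder, not a real zone).
records = {"dynamodb.example-region.test": ["10.0.0.1", "10.0.0.2"]}
write_lock = threading.Lock()

def dns_planner(name, new_rrset, plan_delay):
    # Read current state, spend time building a plan, then write it back.
    # The read and the write are not atomic, so a stale planner can
    # overwrite a newer plan: last writer wins.
    _current = records[name]
    time.sleep(plan_delay)            # simulate slow plan generation
    with write_lock:
        records[name] = new_rrset

def apply_guarded(name, new_rrset):
    # One of several possible mitigations: refuse to install an empty set.
    if not new_rrset:
        raise ValueError("refusing to apply an empty record set")
    with write_lock:
        records[name] = new_rrset

fast = threading.Thread(target=dns_planner,
                        args=("dynamodb.example-region.test", ["10.0.0.3"], 0.1))
stale = threading.Thread(target=dns_planner,
                         args=("dynamodb.example-region.test", [], 0.3))
fast.start(); stale.start(); fast.join(); stale.join()
print(records)  # the stale, empty plan won: the endpoint no longer resolves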
I've quoted you on the main list, with a pointer back to -discuss; thanks for the excerpt. -- jra ----- Original Message -----
From: "Chapman, Brad (NBCUniversal) via Outages-discussion" <outages-discussion@outages.org> To: "outages meta-discussion Post" <outages-discussion@outages.org> Cc: "Chapman, Brad (NBCUniversal)" <Brad.Chapman@nbcuni.com> Sent: Thursday, October 23, 2025 6:09:09 PM Subject: [Outages-discussion] Re: [EXTERNAL] Re: [Outages] AWS US-EAST-1
It looks like the "Post-Event Summary" for this outage has been published. I've excerpted the one-sentence root cause analysis (emphasis mine):
https://aws.amazon.com/message/101925/
The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.
-Brad Chapman
—Sent from my iPhone
On Oct 23, 2025, at 6:17 AM, Lee via Outages-discussion <outages-discussion@outages.org> wrote:
On Wed, Oct 22, 2025 at 5:03 PM Peter Beckman wrote:
Granted, I was there from 2010-2013. I'm sure things have changed.
I'm not sure I'd believe an opinion piece based on a loud employee departure, extrapolating that AWS has few knowledgeable employees left.
Did you look at the engadget link? https://www.engadget.com/amazon-attrition-leader...
"An investigation from the New York Times found that, among hourly employees, Amazon’s turnover was approximately 150 percent annually, while work from the Wall Street Journal and National Employment Law Project have both found turnover to be around 100 percent in warehouses — double the industry average. The rate at which Amazon has burned through the American working-age populace led to another piece of internal research, obtained this summer by Recode, which cautioned that the company might “deplete the available labor supply in the US” in certain metro regions within a few years."
I'm guessing that hourly employees means warehouse workers - yes? But if Amazon is such a great place for "knowledge workers," your 4-year stint at Amazon is hardly a ringing endorsement of their ability to retain skilled workers.
AWS increases complexity of their systems regularly. They are able to, or used to be able to, recruit excellent people. Outages will ALWAYS happen, even if you have "the best" engineers ever. We're human. We will make mistakes.
So yes, based on what I know of AWS, and who I STILL know at AWS, yes, even still.
I don't know anybody that works for Amazon but their reputation as an employer sure seems to be in the toilet. But even so, maybe they can recruit excellent people ... then the question is how long do they retain those excellent people?
Regards, Lee
On Wed, 22 Oct 2025, Lee wrote:
But it is also why AWS can recover from issues quickly, because they built everything, and have the on-call and on-site staff to resolve it quickly,
Even still? ref: https://news.ycombinator.com/item?id=45649178
Regards Lee
On Tue, Oct 21, 2025 at 11:33 PM Peter Beckman via Outages-discussion <outages-discussion@outages.org> wrote:
Having seen the code and infrastructure within AWS, I can say that describing any cloud service as "being held together with spit and bailing wire" is uninformed. Your lack of direct control over it does not make it fragile.
Unless you believe the Internet at large is in the same situation. /s
There is (was) a weekly meeting of the heads of all AWS services that focuses on reliability, redundancy, avoiding mistakes that cause ANY customer-impacting outages, and communicating those lessons to every OTHER team to ensure they didn't make the same mistakes. Charlie Bell was a powerhouse in that meeting.
There are interdependencies among AWS services, yes, though when I was there they tried to keep service dependencies separate when they were deemed critical. Other services are just more easily built and maintained when they depend on AWS core services like EC2, EBS, S3, and RDS.
Which is why you sometimes see cascading outages -- e.g., AWS Transcribe is unavailable because S3 is having issues.
But it is also why AWS can recover from issues quickly, because they built everything, and have the on-call and on-site staff to resolve it quickly, or as quickly as a huge system can be recovered. 9-15 hours seemed painful, but for a system of that scale it is pretty remarkable, given the last hour-plus outage was over a year ago.
AWS Outage History https://aws.amazon.com/premiumsupport/technology/pes/
There will be a beautiful and detailed writeup on how this outage occurred and specifically what they have already done and will do to ensure that such a situation won't happen again, and they really do mean it.
Just because YOU, the customer or end user, are not in control of resolving the outage, does not mean that you could have recovered from the outage any faster.
Being in control requires skilled staff to design, build, maintain, and repair the system/infrastructure, on top of the staff that built the application running on it.
On Tue, 21 Oct 2025, David Eddleman via Outages-discussion wrote:
Part of the issue, though, is that AWS (and Azure, and GCP, and ...) is at least partially a black box. There's a lot going on that makes those things work that you don't or can't see. You're depending upon that Secret Sauce™ working fine for you all the time. If any part of it is held together with spit and bailing wire (and you damn well better believe a not-insignificant part is!), when it fails over it's going to take a lot with it. When it does, you may have the ideal vendor-specified redundancy in place but you're still going to be hurting. Whereas if you self-host, you can get that redundancy, but you both gotta pay for it (rather than it coming out of a "nebulous" bill) and have the skilled staff to keep it up and make it work.
We saw that with this outage. Seemingly unrelated parts caused others to fail due to internal dependencies. We even saw that with the CrowdStrike fiasco, where EC2 storage latency was skyrocketing as people were manually migrating and attaching volumes to other machines to try to recover them, since there's no console access.
On Tue, Oct 21, 2025 at 11:51 AM Peter Beckman via Outages-discussion < outages-discussion@outages.org> wrote:
On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Having worked for AWS, and having run multi-region fault-tolerant systems for many years, I can say it *can* be budget-friendly, if you are willing to put in the effort and planning.
I can find two different hosting companies that offer bare-metal hosting, and confirm that both are using a different mix of connectivity on different ASNs, and are in geographically different areas.
I can deploy my workload across those systems, reducing risk, and because the workload is spread across those disparate systems I don't need to double my infrastructure costs.
This is even possible in AWS -- they provide multiple tools for multi-region and multi-AZ deployments. When I worked for AWS my team built AMI Copy in 2012/2013, so you could move AMIs between regions with an API call, making it easier to start up new EC2 instances with your existing images.
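For anyone who hasn't used it, here is a minimal sketch of that kind of cross-region AMI copy with boto3. This is my illustration, not the original AMI Copy implementation; the region names and AMI ID below are placeholders.

  # Minimal sketch (assumptions: placeholder regions and AMI ID; not AWS's own code).
  import boto3

  SOURCE_REGION = "us-east-1"
  TARGET_REGION = "us-west-2"
  SOURCE_AMI_ID = "ami-0123456789abcdef0"   # hypothetical image ID

  def copy_ami(source_ami_id: str, source_region: str, target_region: str) -> str:
      """Copy an AMI into target_region and return the new image ID."""
      # CopyImage is called against the *destination* region.
      ec2 = boto3.client("ec2", region_name=target_region)
      resp = ec2.copy_image(
          Name=f"dr-copy-of-{source_ami_id}",
          SourceImageId=source_ami_id,
          SourceRegion=source_region,
      )
      new_ami_id = resp["ImageId"]
      # Wait until the copy is usable before relying on it in a DR plan.
      ec2.get_waiter("image_available").wait(ImageIds=[new_ami_id])
      return new_ami_id

  if __name__ == "__main__":
      print(copy_ami(SOURCE_AMI_ID, SOURCE_REGION, TARGET_REGION))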
RDS has cross-region read-replicas. DynamoDB was built with multi-region in mind.
You DO need to assume and plan that a whole AZ or Region will go dark, and if your systems just immediately fail when that happens, then you've done a less-than-ideal job of building your systems to be fault-tolerant.
Yes, it adds complexity and you have to test regularly, but it does NOT need to add huge amounts of additional costs. You just need to know what you're doing.
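As one concrete (hypothetical) piece of that planning: DNS-level failover between two sites, so traffic shifts to the standby when the primary's health check fails. A rough boto3 sketch, assuming you already have a Route 53 hosted zone and a health check; every ID and hostname here is a placeholder, not anything from this thread.

  # Rough sketch (assumed hosted zone ID, record name, and health check ID).
  import boto3

  route53 = boto3.client("route53")

  HOSTED_ZONE_ID = "Z0000000EXAMPLE"
  RECORD_NAME = "app.example.com."

  def upsert_failover_pair(primary_ip: str, secondary_ip: str, health_check_id: str) -> None:
      """Create PRIMARY/SECONDARY failover A records for the same name."""
      changes = [
          {
              "Action": "UPSERT",
              "ResourceRecordSet": {
                  "Name": RECORD_NAME,
                  "Type": "A",
                  "SetIdentifier": "primary-site",
                  "Failover": "PRIMARY",
                  "TTL": 60,
                  "ResourceRecords": [{"Value": primary_ip}],
                  "HealthCheckId": health_check_id,   # primary is withdrawn if this fails
              },
          },
          {
              "Action": "UPSERT",
              "ResourceRecordSet": {
                  "Name": RECORD_NAME,
                  "Type": "A",
                  "SetIdentifier": "secondary-site",
                  "Failover": "SECONDARY",
                  "TTL": 60,
                  "ResourceRecords": [{"Value": secondary_ip}],
              },
          },
      ]
      route53.change_resource_record_sets(
          HostedZoneId=HOSTED_ZONE_ID,
          ChangeBatch={"Comment": "multi-site failover", "Changes": changes},
      )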
Beckman
---------------------------------------------------------------------------
Peter Beckman                                                   Internet Guy
beckman@angryox.com                                 https://www.angryox.com/
---------------------------------------------------------------------------
-- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://www.bcp38.info 2000 Land Rover DII St Petersburg FL USA BCP38: Ask For It By Name! +1 727 647 1274
----- Original Message -----
From: "Chapman, Brad (NBCUniversal) via Outages-discussion" <outages-discussion@outages.org>
It looks like the "Post-Event Summary" for this outage has been published. I've excerpted the one-sentence root cause analysis (emphasis mine):
https://aws.amazon.com/message/101925/
The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.
I disagree with their description of the root cause, based on the AAR. It sounds like there should be a timestamp on every plan the Planner makes, and that should be carried all the way through to Route53 -- the *actual* root cause was that the actual installation of those delayed (and now older) records was permitted to happen, even though newer data was already installed, IME. Cheers, -- jra -- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://www.bcp38.info 2000 Land Rover DII St Petersburg FL USA BCP38: Ask For It By Name! +1 727 647 1274
idk how you guys sanitize your internet consumption for your own sanity but I use AI to amuse me through snark, and I thought this one was too good not to share: '''' *SassGPT, can you summarize this word-vomit for me and tell me how it could have been avoided?* https://aws.amazon.com/message/101925/
🌀 *“How AWS Took Down Half the Internet Because of One DNS Race Condition”*
On October 19, 2025, AWS’s biggest region (*us-east-1*, a.k.a. the “single point of failure for the internet”) face-planted when *DynamoDB’s automated DNS manager tripped over its own shoelaces*. Two automation bots tried to update DNS at the same time, got confused, and—like dueling Roombas—*deleted the service’s own DNS records*. DynamoDB promptly vanished from the network, and every AWS service that depends on it (which is most of them) started screaming. That single typo-by-automation triggered a *cascade of chaos*:

- *EC2* couldn’t launch new instances because its management system couldn’t talk to DynamoDB.
- *Load balancers* began thinking healthy servers were dying and yanked them from DNS.
- *Lambda, Redshift, Connect, IAM, and STS* all stumbled because they rely (directly or indirectly) on DynamoDB.
- By the time engineers manually resurrected the DNS record, they still had to untangle a backlog of leases, health checks, and throttles.

After about *14 hours*, everything limped back to normal. AWS promised to fix the race condition, add guardrails so automation can’t nuke live DNS, and maybe stop using one region as everyone’s dependency hub.

------------------------------

*In short:* A concurrency bug in a DNS script turned into a region-wide outage because of excessive automation trust and cross-service dependencies. The lesson? *Even hyperscale clouds can still be taken down by one bad variable and too many dominoes.*
'''

On Thu, Oct 23, 2025 at 9:19 PM Jay R. Ashworth via Outages-discussion < outages-discussion@outages.org> wrote:
----- Original Message -----
From: "Chapman, Brad (NBCUniversal) via Outages-discussion" < outages-discussion@outages.org>
It looks like the "Post-Event Summary" for this outage has been published. I've excerpted the one-sentence root cause analysis (emphasis mine):
https://aws.amazon.com/message/101925/
The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.
I disagree with their description of the root cause, based on the AAR.
It sounds like there should be a timestamp on every plan the Planner makes, and that should be carried all the way through to Route53 -- the *actual* root cause was that the actual installation of those delayed (and now older) records was permitted to happen, even though newer data was already installed, IME.
Cheers, -- jra -- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://www.bcp38.info 2000 Land Rover DII St Petersburg FL USA BCP38: Ask For It By Name! +1 727 647 1274 ______________________________________________ Outages-discussion mailing list outages-discussion@outages.org Sign up for an account https://lists.outages.org/accounts/signup/ To subscribe send an email to outages-discussion-join@outages.org To unsubscribe send an email to outages-discussion-leave@outages.org To contact the list owners outages-owner@outages.org Archives https://lists.outages.org/archives/list/outages-discussion@outages.org/
Thank you for using outages-discussion Lists!
Once upon a time, Jay R. Ashworth <jra@baylink.com> said:
It sounds like there should be a timestamp on every plan the Planner makes, and that should be carried all the way through to Route53 -- the *actual* root cause was that the actual installation of those delayed (and now older) records was permitted to happen, even though newer data was already installed, IME.
It sounds more like a lack of locking than a race condition. If 123 has to be processed before 456, then you need to take a lock to ensure that processing of 456 cannot begin before processing of 123 has completed. Out-of-order processing just because something took longer than expected is not the definition of a race condition. -- Chris Adams <cma@cmadams.net>
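To make the locking point concrete, here's a toy sketch (mine, not anything from the AWS writeup): plans are applied strictly in sequence order, so a slow plan 123 can never be clobbered by a faster plan 456. The install_dns_records helper is a hypothetical stand-in for whatever actually writes the records.

  # Toy illustration of strict ordering under a lock/condition (assumed helper below).
  import threading

  class OrderedApplier:
      def __init__(self):
          self._cond = threading.Condition()
          self._next_seq = 1

      def apply(self, seq: int, plan: dict) -> None:
          """Block until every lower-numbered plan has been applied, then apply this one."""
          with self._cond:
              while seq != self._next_seq:
                  self._cond.wait()          # plan 456 waits until plan 123 is done
              install_dns_records(plan)      # hypothetical stand-in for the real installer
              self._next_seq += 1
              self._cond.notify_all()

  def install_dns_records(plan: dict) -> None:
      # Placeholder for whatever actually writes the records.
      print("applying plan", plan)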
----- Original Message -----
From: "Chris Adams via Outages-discussion" <outages-discussion@outages.org>
Once upon a time, Jay R. Ashworth <jra@baylink.com> said:
It sounds like there should be a timestamp on every plan the Planner makes, and that should be carried all the way through to Route53 -- the *actual* root cause was that the actual installation of those delayed (and now older) records was permitted to happen, even though newer data was already installed, IME.
It sounds more like a lack of locking than a race condition. If 123 has to be processed before 456, then you need to take a lock to ensure that processing of 456 cannot begin before processing of 123 has completed. Out-of-order processing just because something took longer than expected is not the definition of a race condition.
Well, they said race, not me. But I think it's *kinda* a race -- this isn't locking per se. As I understand their writeup, you don't really know how *many* bad trx might be in flight; you just know you should trash (probably with a warning) any that come in out-of-order, right? Cheers, -- jra -- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://www.bcp38.info 2000 Land Rover DII St Petersburg FL USA BCP38: Ask For It By Name! +1 727 647 1274
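A sketch of that alternative, for contrast with the locking version above (again mine, not the AAR's): carry a timestamp on every plan and simply drop, with a warning, anything older than what's already installed -- no lock, no waiting. The install_dns_records helper is again a hypothetical stand-in.

  # Toy illustration of "discard stale plans by timestamp" (assumed helper below).
  import logging

  logging.basicConfig(level=logging.WARNING)
  log = logging.getLogger("dns-planner")

  class LastWriterWins:
      def __init__(self):
          self._installed_ts = 0.0   # timestamp of the newest plan applied so far

      def apply(self, plan_ts: float, plan: dict) -> bool:
          """Apply the plan only if it is newer than what is already installed."""
          if plan_ts <= self._installed_ts:
              log.warning("discarding stale plan ts=%s (installed ts=%s)",
                          plan_ts, self._installed_ts)
              return False
          install_dns_records(plan)   # hypothetical stand-in for the real installer
          self._installed_ts = plan_ts
          return True

  def install_dns_records(plan: dict) -> None:
      # Placeholder for whatever actually writes the records.
      print("applying plan", plan)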
Isn't this pretty much the same thing that happened a year or so ago in US East that took down almost all of AWS? On Mon, Oct 20, 2025, 17:54 Aaron C. de Bruyn via Outages-discussion < outages-discussion@outages.org> wrote:
On Mon, Oct 20, 2025 at 1:39 PM Shaun Potts via Outages-discussion < outages-discussion@outages.org> wrote:
I always enjoy the armchair "haha that's why you don't use <x>" engineers.
I always enjoy it when the next generation of engineers with fresh and exciting new ideas are forced to re-learn what "single point of failure" means. This is usually followed a few years later by realizing that SPOF includes companies (like AWS today), the various definitions of layer 8 on the OSI stack, and that one time I fired up 'cssh' with the wrong target and happily restarted a service for all customers instead of a much smaller subset.
-A ______________________________________________ Outages-discussion mailing list outages-discussion@outages.org Sign up for an account https://lists.outages.org/accounts/signup/ To subscribe send an email to outages-discussion-join@outages.org To unsubscribe send an email to outages-discussion-leave@outages.org To contact the list owners outages-owner@outages.org Archives https://lists.outages.org/archives/list/outages-discussion@outages.org/
Thank you for using outages-discussion Lists!
On Mon, Oct 20, 2025 at 09:50:54PM -0400, Robert Webb via Outages-discussion wrote:
Isn't this pretty much the same thing that happened a year or so ago in US East that took down almost all of AWS?
https://aws.amazon.com/premiumsupport/technology/pes/ is always worth saving and sifting through.

The Kinesis outage in 2020 taught me that AWS internally suffers from (what I consider to be) the same problem that AWS customers suffer from: interwoven AWS service dependencies, resulting in effective SPoFs. As a reminder, from that outage we learned Cognito depends on Kinesis, CloudWatch relies on Kinesis, EventBridge relies on Kinesis, etc... The same situation repeated itself (in a different fashion) in 2024. These two incidents are documented in AWS's post-mortems above.

There are plenty of AWS services that suffer from this phenomenon from the "customer use" perspective: you want to use service X, but you end up having to configure pieces for service Y and service Z, which depends on service C. By the time you're done, you've now got 6 AWS services involved for something that, had you home-grown it yourself, quite likely would have involved 2 at most. Two added bonuses: now someone -- usually your SAs or SREs -- has to remember this whole interwoven mess, all while spending even more money with AWS. (I can't speak of other cloud providers as I have no experience with them.)

I'm highly biased because I'm a sysadmin by profession. In general, my 30 years of experience has taught me to trust nothing and minimise dependencies as much as possible; KISS principle above all else, and assume everything can (will) break. Think about those ramifications, test those scenarios. If you can't solve them (reason doesn't matter!), OK, just document the ones you're aware of. It's usually infeasible to try and relieve every single SPoF, but documenting (and publishing info about) the ones you know about is good practise.

Remember: fewer things involved = less stuff to have to worry about = fewer things that can go wrong = less time spent during outages = fewer outages = happier customers + happier engineers.

-- 
| Jeremy Chadwick                                      jdc@koitsu.org |
| UNIX Systems Administrator                            PGP 0x2A389531 |
| Making life hard for others since 1977.                             |
participants (14)
- Aaron C. de Bruyn
- Andrew Latham
- Chapman, Brad (NBCUniversal)
- Chris Adams
- David Eddleman
- Jay R. Ashworth
- Jeff Shultz
- Jeremy Chadwick
- Lee
- Patrick W. Gilmore
- Peter Beckman
- Robert Blayzor
- Robert Webb
- Shaun Potts