Part of the issue, though, is that AWS (and Azure, and GCP, and ...) is at least partially a black box. There's a lot going on that makes those things work that you don't or can't see. You're depending upon that Secret Sauce™ working fine for you all the time. If any part of it is held together with spit and baling wire (and you damn well better believe a not-insignificant part is!), when it falls over it's going to take a lot with it. When it does, you may have the ideal vendor-specified redundancy in place but you're still going to be hurting.

Whereas if you self-host, you can get that redundancy, but you gotta both pay for it (rather than it coming out of a "nebulous" bill) and have the skilled staff to keep it up and make it work.

We saw that with this outage. Seemingly unrelated parts caused others to fail due to internal dependencies. We even saw that with the CrowdStrike fiasco, where EC2 storage latency was skyrocketing as people were manually migrating and attaching volumes to other machines to try and recover them, as there's no console access.

On Tue, Oct 21, 2025 at 11:51 AM Peter Beckman via Outages-discussion <outages-discussion@outages.org> wrote:
On Tue, 21 Oct 2025, Jeff Shultz via Outages-discussion wrote:
Truly fault tolerant is not budget friendly.
Having worked for AWS, and having run multi-region fault-tolerant systems for many years, it *can* be budget-friendly, if you are willing to put in the effort and planning.
I can find two different hosting companies that offer bare-metal hosting, and confirm that both are using a different mix of connectivity on different ASNs, and are in geographically different areas.
I can deploy my workload across those systems, reducing risk, but also have the workload spread across those disparate systems, so I don't need to double my infrastructure costs.
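A rough way to put numbers on the "don't need to double" point is below. This is only back-of-the-envelope arithmetic, not anything from the providers themselves: the function names, the even spread of load across sites, and the `surviving_fraction` knob (how much of peak load you insist on carrying after a failure) are all assumptions made for illustration.

```python
def per_site_capacity(peak_load: float, sites: int, sites_lost: int = 1,
                      surviving_fraction: float = 1.0) -> float:
    """Capacity each site needs so the remaining sites can carry
    surviving_fraction of peak_load after sites_lost sites go dark.

    Assumes load is spread evenly across identical sites (hypothetical model).
    """
    remaining = sites - sites_lost
    if remaining < 1:
        raise ValueError("at least one site must survive")
    return peak_load * surviving_fraction / remaining


def overhead_factor(sites: int, sites_lost: int = 1,
                    surviving_fraction: float = 1.0) -> float:
    """Total provisioned capacity divided by peak load (1.0 = no spare)."""
    return sites * per_site_capacity(1.0, sites, sites_lost, surviving_fraction)


if __name__ == "__main__":
    # Compare a few site counts, each surviving the loss of one site.
    for n in (2, 3, 4):
        print(f"{n} sites, full capacity after loss: {overhead_factor(n):.2f}x")
    # Two sites, but accepting 60% of peak capacity while one is down.
    print(f"2 sites, 60% after loss: {overhead_factor(2, surviving_fraction=0.6):.2f}x")
```

Under this model, two sites that must each carry the full peak alone do cost 2x, so the savings come from either adding a third site (1.5x) or deciding in advance how much degradation is acceptable during a failure.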
This is even possible in AWS -- they provide multiple tools for multi-region and multi-AZ deployments. When I worked for AWS my team built AMI Copy in 2012/2013, so you could move AMIs between regions with an API call, making it easier to start up new EC2 instances with your existing images.
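For anyone who hasn't used it, a minimal sketch of that API call from Python follows. It assumes you pass in a boto3 EC2 client created for the destination region (for example `boto3.client("ec2", region_name="us-west-2")`); the helper name `copy_ami_to_region` and the example IDs are made up for illustration.

```python
def copy_ami_to_region(dest_ec2_client, source_image_id: str,
                       source_region: str, name: str) -> str:
    """Copy an AMI into the region that dest_ec2_client is bound to.

    CopyImage is issued against the destination region and names the
    source region explicitly. Returns the ID of the new AMI.
    """
    response = dest_ec2_client.copy_image(
        Name=name,                        # name for the new AMI
        SourceImageId=source_image_id,    # e.g. "ami-0123456789abcdef0" (placeholder)
        SourceRegion=source_region,       # e.g. "us-east-1"
    )
    return response["ImageId"]
```

The copy is asynchronous, so the returned AMI is not launchable until it reaches the `available` state; doing this on every image build, rather than during an outage, is what makes it useful.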
RDS has cross-region read-replicas. DynamoDB was built with multi-region in mind.
You DO need to assume and plan that a whole AZ or Region will go dark, and if your systems just immediately fail when that happens, then you've done a less-than-ideal job of building your systems to be fault-tolerant.
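As a toy illustration of not "just immediately failing", here is a sketch of picking the first healthy region from a preference list. The `pick_region` name and the injected `is_healthy` callback are assumptions for the example; a real health check would probe your own service, not the provider's status page.

```python
from typing import Callable, Iterable, Optional


def pick_region(regions: Iterable[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region, in preference order, that passes is_healthy.

    Returns None when every region fails, so the caller must decide what
    a total outage means (queue, degrade, or error out).
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A check that raises (timeout, DNS failure) counts as unhealthy.
            continue
    return None


if __name__ == "__main__":
    down = {"us-east-1"}  # pretend this region has gone dark
    print(pick_region(["us-east-1", "us-west-2"], lambda r: r not in down))
```

The logic is trivial on purpose; the hard part is that the health check and the failover path themselves must not depend on the region that just went dark, which is why this needs regular testing.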
Yes, it adds complexity and you have to test regularly, but it does NOT need to add huge amounts of additional costs. You just need to know what you're doing.
Beckman
---------------------------------------------------------------------------
Peter Beckman                                                  Internet Guy
beckman@angryox.com                                https://www.angryox.com/
---------------------------------------------------------------------------