I think we're both saying the same thing in different ways. Bad design is bad design is bad design. I think the metric is an acceptable level of failure. Remember a few years ago when some random data center in San Antonio (or somewhere like that) caused a multi-day outage for Microsoft because some core service lived in, or routed through, that one facility? The only guy we should be laughing at is the one who thinks he can design away all of these issues, right up until next year when his datacenter gets struck by lightning.

On Mon, Oct 20, 2025 at 5:10 PM Aaron C. de Bruyn <aaron@heyaaron.com> wrote:
On Mon, Oct 20, 2025 at 1:39 PM Shaun Potts via Outages-discussion <outages-discussion@outages.org> wrote:
I always enjoy the armchair "haha that's why you don't use <x>" engineers.
I always enjoy it when the next generation of engineers, full of fresh and exciting new ideas, is forced to re-learn what "single point of failure" means. This is usually followed a few years later by the realization that SPOF includes entire companies (like AWS today), the various definitions of layer 8 on the OSI stack, and that one time I fired up 'cssh' against the wrong target and happily restarted a service for all customers instead of a much smaller subset.
-A