
This is the purpose of learning from your mistakes in the past. Create a maintenance plan so it doesn't happen again! Fool me once...

Josh Luthman
Office: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373

"When you have eliminated the impossible, that which remains, however improbable, must be the truth."
--- Sir Arthur Conan Doyle

On Thu, Oct 22, 2009 at 10:43 PM, George Herbert <george.herbert@gmail.com> wrote:
On Thu, Oct 22, 2009 at 7:03 PM, Jay R. Ashworth <jra@baylink.com> wrote:
----- "Jeremy Chadwick" <outages@jdc.parodius.com> wrote:
On Tue, Oct 20, 2009 at 09:28:21AM -0700, Scott Howard wrote:
Looks like it's all back up as of about 30 mins ago.
Apparently either a core switch or router failed, which took down much of their network in Atlanta, as well as Memphis and Nashville.
Level 3 has a single router or switch handling packets at a major POP? I doubt this, but the outage is confirmation something bad happened. That said: where's the redundancy, and why didn't it kick in?
Oh; you're *always* asking that.
:-)
The Internet Backbone<tm> has been a commercial, rather than an engineering, construct for over 15 years now.
The RFO that went out somewhat after he asked that was more useful... N=2 redundancy was in place. However, when primary had hardware failure, secondary had (unknown / unstated) software, config, or hardware failure that hadn't been detected or checked, and it didn't work either.
It's hard to test clusters of things well when they have near-100% uptime requirements. The dependability of the untested failover unit is low, as you're not testing it well.
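A rough way to put numbers on that point, as a sketch only: assume the standby picks up silent faults at a constant rate, and assume every failover test leaves it in a known-good state. Neither assumption comes from the RFO, and the function name, the fault rate, and the test intervals below are made up for illustration.

```python
import math


def standby_dead_probability(fault_rate_per_day: float, test_interval_days: float) -> float:
    """Average chance the standby is silently broken when the primary fails.

    Assumptions (illustrative, not from the RFO): latent faults arrive at a
    constant rate, and each failover test restores the standby to known-good.
    """
    x = fault_rate_per_day * test_interval_days
    if x == 0:
        return 0.0
    # Time-average of 1 - exp(-rate * t) over one test interval:
    # 1 - (1 - exp(-x)) / x, written with expm1 for small-x accuracy.
    return 1.0 + math.expm1(-x) / x


if __name__ == "__main__":
    rate = 0.001  # hypothetical: one latent fault per 1000 days on average
    for interval in (7, 30, 90, 365):  # hypothetical test intervals in days
        p = standby_dead_probability(rate, interval)
        print(f"tested every {interval:>3} days: standby dead with p = {p:.4f}")
```

Under this model the risk grows roughly as rate times interval over two, so a standby that is never exercised drifts toward being no redundancy at all.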
Sometimes you can test failovers in stream. But sometimes those supposedly harmless failover tests fail for baroque reasons, taking down a service when the primary was in fact just fine.
This isn't (just) an economics problem. Reliability of complex systems is a mathematically, exponentially hard problem to crack at both the engineering and theoretical levels.
Some people don't try - and get what they deserve - and some people give it a good or best commercially reasonable effort, and still fail. Doing better than that is really hard.
--
-george william herbert
george.herbert@gmail.com
_______________________________________________
outages mailing list
outages@outages.org
https://puck.nether.net/mailman/listinfo/outages