Re: [outages] Level 3 down in Atlanta

22 Oct 2009

This is the whole point of learning from your past mistakes.  Create a
maintenance plan so it doesn't happen again!

Fool me once...

Josh Luthman
Office: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373

"When you have eliminated the impossible, that which remains, however
improbable, must be the truth."
--- Sir Arthur Conan Doyle

On Thu, Oct 22, 2009 at 10:43 PM, George Herbert
<george.herbert@gmail.com> wrote:
...
On Thu, Oct 22, 2009 at 7:03 PM, Jay R. Ashworth <jra@baylink.com> wrote:
...
----- "Jeremy Chadwick" <outages@jdc.parodius.com> wrote:
...
On Tue, Oct 20, 2009 at 09:28:21AM -0700, Scott Howard wrote:
...
Looks like it's all back up as of about 30 mins ago.
Apparently either a core switch or router failed, which took down much
of their network in Atlanta, as well as Memphis and Nashville.
Level 3 has a single router or switch handling packets at a major POP?
I doubt this, but the outage is confirmation something bad happened.
That said: where's the redundancy, and why didn't it kick in?
Oh; you're *always* asking that.
:-)
The Internet Backbone<tm> has been a commercial, rather than an
engineering, construct for over 15 years now.
The RFO that went out somewhat after he asked that was more useful...
N=2 redundancy was in place.  However, when the primary had a hardware
failure, the secondary had an (unknown / unstated) software, config, or
hardware fault that hadn't been detected or checked for, and it didn't
work either.
It's hard to test clusters of things well when they have near-100%
uptime requirements.  The dependability of the untested failover unit
is low, as you're not testing it well.
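
To put rough numbers on the untested-standby point, here is a minimal
sketch.  It is not from the RFO: the constant latent-fault rate, the
independence of primary and standby failures, and every number below are
assumptions chosen only for illustration.

```python
import math


def standby_dead_probability(fault_rate_per_day: float, days_since_test: float) -> float:
    """Chance the standby has silently failed since it was last verified.

    Assumption: latent faults arrive at a constant rate (exponential model).
    """
    return 1.0 - math.exp(-fault_rate_per_day * days_since_test)


def double_failure_probability(p_primary_fails: float,
                               fault_rate_per_day: float,
                               days_since_test: float) -> float:
    """Chance that the primary fails AND the standby is already dead.

    Assumption: the two failures are independent of each other.
    """
    return p_primary_fails * standby_dead_probability(fault_rate_per_day, days_since_test)


if __name__ == "__main__":
    # Hypothetical inputs: 5% chance the primary fails in the period,
    # 0.002 latent standby faults per day.
    for days in (1, 30, 365):
        p = double_failure_probability(0.05, 0.002, days)
        print(f"standby last verified {days:>3} days ago: P(outage) = {p:.5f}")
```

Under these assumptions, the longer the standby goes unverified, the
closer N=2 gets to behaving like N=1.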
Sometimes you can test failovers in stream.  But sometimes those
supposedly harmless failover tests fail for baroque reasons, taking
down a service when the primary was in fact just fine.
This isn't (just) an economics problem.  Reliability of complex
systems is a mathematically, exponentially hard problem to crack at both
the engineering and theoretical levels.
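
One way to read "exponentially hard", offered as an illustration rather
than as what was necessarily meant above: if each component is treated
as simply up or down (a simplifying assumption), the number of
combinations you would have to reason about or test doubles with every
component added.

```python
def state_count(components: int) -> int:
    """Number of distinct up/down combinations for two-state components."""
    if components < 0:
        raise ValueError("components must be non-negative")
    return 2 ** components


if __name__ == "__main__":
    # Hypothetical system sizes, chosen only to show the growth.
    for n in (2, 10, 20, 40):
        print(f"{n:>2} components -> {state_count(n):,} possible up/down states")
```

Real components also have partial and config-dependent failure modes, so
this count is, if anything, on the low side.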
Some people don't try - and get what they deserve - and some people
give it a good, or even best, commercially reasonable effort, and still
fail.  Doing better than that is really hard.
--
-george william herbert
george.herbert@gmail.com
_______________________________________________
outages mailing list
outages@outages.org
https://puck.nether.net/mailman/listinfo/outages