Lessons Learned: RRTB outage

So I had to renumber some servers this afternoon, cause I was expanding to a larger netblock (a 28 instead of a 29).

I renumbered my servers and my DNS (which I'd set the TTL on to 300 like a good boy on Wednesday), and then pulled the trigger with Road Runner. He "rescripted" his SMC router (the likely cause of some standard deviation noted by a couple of reporters -- the router, not the rescripting), and I pinged it and it was ok, and I mtr'd it and it was ok, so I hit the webserver, and that came up fine, too.

So then my boss calls me 15 minutes later: it's not working.

"I wonder what that could be", sez I; I'd even traced and hit the webserver from my Android phone (Sprint; Opera Mobile 11), and it had worked fine.

That was Red Herring #1.

So my boss uses a Mac. So does my best friend, and while he was on the way out the door to a second-anniversary-wake for a guy we went to school with, he took a moment to try to hit it as well. No luck.

That was Red Herring #2 (both of them use Macs).

Those of you who've been paying close, careful attention here may have noticed by now the thing I did *not* say I'd done:

Changing the default gateway on the server.

My office lan could hit it *because its uplink was in the same network*; *it* had a route for that network. Everyone else... couldn't.

Apparently, Sprint operates a caching server, even if you're using the version of Opera (Mobile, not Mini) that does *not*, which explains Red Herring #1.

As for Red Herring #2, well... Macs don't, apparently, hard-cache IPs the way WinXP does (I'm looking at *you*, "ipconfig /flushdns"), but I already knew that, because boss had the right address.

Lesson Learned: Make sure you know what your diagnostic tests are telling you, before you use them to rule out possible problems. Better yet: don't rule those potential problems out at all: work your whole diagnostic tree every time.

Oh: I forgot Red Herring #3: the traces that broke *didn't hit that carrier edge router* for some reason. No clue why.

Thanks to the dozen or so people who responded; a couple of whom have way too {much time,many servers} on their hands. :-)

Followups to -discuss

Cheers,
-- jra
--
Jay R. Ashworth          Baylink                     jra@baylink.com
Designer                 The Things I Think          RFC 2100
Ashworth & Associates    http://baylink.pitas.com    2000 Land Rover DII
St Petersburg FL USA     http://photo.imageinc.us    +1 727 647 1274
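For anyone who wants the failure mode spelled out, here is a minimal sketch of why the office LAN could reach the renumbered server while everyone else could not. It only models the server's return path: a reply can go out if the client is on-link, or if the default gateway is itself on-link. The addresses are documentation ranges (RFC 5737) and the function name `can_reply` is made up for illustration; the post does not say whether the gateway was stale or missing, so both cases are treated the same way here.

```python
import ipaddress


def can_reply(server_if, default_gw, client):
    """Return True if the server has a usable route back to `client`.

    server_if  -- server address with prefix, e.g. "198.51.100.2/28"
    default_gw -- configured default gateway, or None if there is none
    client     -- address the request came from
    """
    iface = ipaddress.ip_interface(server_if)
    client_ip = ipaddress.ip_address(client)

    # On-link clients need no gateway: the reply goes straight out the interface.
    if client_ip in iface.network:
        return True

    # Off-link clients need a default gateway that is reachable on-link.
    if default_gw is None:
        return False
    return ipaddress.ip_address(default_gw) in iface.network


# Hypothetical addresses: new /28 on the server, gateway left on the old /29.
NEW_IF = "198.51.100.2/28"
OLD_GW = "192.0.2.1"
NEW_GW = "198.51.100.1"

print(can_reply(NEW_IF, OLD_GW, "198.51.100.3"))   # office uplink, same /28
print(can_reply(NEW_IF, OLD_GW, "203.0.113.50"))   # client elsewhere on the net
print(can_reply(NEW_IF, NEW_GW, "203.0.113.50"))   # after fixing the gateway
```

The point of the sketch is the same as the lesson: a test run from inside the server's own subnet exercises only the first branch, so it says nothing about the default route.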

Wtf is the point you are trying to make?

Sent from my iPhone

On Sep 23, 2011, at 7:17 PM, Jay Ashworth <jra@baylink.com> wrote:
So I had to renumber some servers this afternoon, cause I was expanding to a larger netblock (a 28 instead of a 29). [...]

----- Original Message -----
From: "Kevin Kelley" <xirin6@yahoo.com>
Wtf is the point you are trying to make?
[ puts on moderator hat ]

The first point is: when you're *discussing* posts, do it on the -discussion list (as I requested).

The second point is: when you ask for off-list replies, and say you'll summarize, do.

The third point I will take up with you off-list, if you really insist.

Cheers,
-- jra

Again WTF? How is your post related to anything on this list? I thought this was outages, not Jay's lessons?

From: Jay Ashworth <jra@baylink.com>
To: outages@outages.org
Sent: Friday, September 23, 2011 8:58 PM
Subject: Re: [outages] Lessons Learned: RRTB outage

----- Original Message -----
From: "Kevin Kelley" <xirin6@yahoo.com>
Wtf is the point you are trying to make?
[ puts on moderator hat ] The first point is: when you're *discussing* posts, do it on the -discussion list (as I requested). [...]

I, for one, appreciate and support Jay for the information he provided. I appreciated his follow-up to the original report, and it's certainly more information and less noise than you and I are contributing. Go ahead and turn off your iPhone and move the constructive discussion to -discuss.

Kudos to Jay for thoughtfully following up with the outages@ community.

Thanks,

On Fri, 23 Sep 2011, kevin kelley wrote:
Again WTF? How is your post related to anything on this list? I thought this was outages, not Jay's lessons?

From: Jay Ashworth <jra@baylink.com>
To: outages@outages.org
Sent: Friday, September 23, 2011 8:58 PM
Subject: Re: [outages] Lessons Learned: RRTB outage

----- Original Message -----
From: "Kevin Kelley" <xirin6@yahoo.com>
Wtf is the point you are trying to make?
[ puts on moderator hat ] The first point is: when you're *discussing* posts, do it on the -discussion list (as I requested). [...]

Sorry, I was interested in outages, not Jay's mistakes. I really just need a site that reports outages as it affects business and not the details on how Jay fixed his issues.

From: Jay Ashworth <jra@baylink.com>
To: outages@outages.org
Sent: Friday, September 23, 2011 7:17 PM
Subject: [outages] Lessons Learned: RRTB outage

So I had to renumber some servers this afternoon, cause I was expanding to a larger netblock (a 28 instead of a 29). [...]

We are done with this. Please drop this topic from the list.

Josh Luthman
Office: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373

On Sep 23, 2011 9:36 PM, "kevin kelley" <xirin6@yahoo.com> wrote:
Sorry, I was interested in outages, not Jay's mistakes. I really just need a site that reports outages as it affects business and not the details on how Jay fixed his issues.

From: Jay Ashworth <jra@baylink.com>
To: outages@outages.org
Sent: Friday, September 23, 2011 7:17 PM
Subject: [outages] Lessons Learned: RRTB outage
So I had to renumber some servers this afternoon, cause I was expanding to a larger netblock (a 28 instead of a 29). [...]

On 09/23/2011 08:30 PM, kevin kelley wrote:
Sorry, I was interested in outages, not Jay's mistakes. I really just need a site that reports outages as it affects business and not the details on how Jay fixed his issues.

This looks like he was sending a follow-up regarding something he received help on. There may actually be people interested in knowing what happened, especially if they helped him.

If you don't like it then don't read it, or delete it or unsubscribe. Either way stfu. Sending several replies about how you don't like the message is actually more annoying than the initial email. We could really care less about what *you* are interested in.

Larry Brower
Linux System Administrator III
HostGator.com LLC
lbrower@hostgator.com
Http://www.hostgator.com
Http://support.hostgator.com/

Fedora Ambassador - North America
Fedora Quality Assurance
lbrower@fedoraproject.org
http://www.fedoraproject.org/

participants (6)
- Jay Ashworth
- Josh Luthman
- Kevin Kelley
- kevin kelley
- Larry Brower
- William R. Lorenz