Savvis/CenturyLink issues this AM

Unsure if this is related to the L3 issues or not. Savvis, at least at the CH3 datacenter, has had bad connectivity problems since 3:20AM CDT. Internet Pulse shows atrocious packet loss to almost everywhere from Savvis.
Master case number at Savvis is 4579123. They don't have any more info than "yep, we're down, working on it".
--
Bill Weiss

This outage is still ongoing and has far-reaching implications beyond their CH3 facility. I have servers in DC3 (Sterling, VA/Washington, DC) and haven't been able to reach 12.0.0.0/8 among other networks since early this morning.
Looks like things will get worse before they get better. Here's their latest update:
DATE OF EVENT: Tuesday, April 22, 2014
TIME OF EVENT: 03:20 CDT
MASTER CASE: 4579123
UPDATE: In effort to resolve an ongoing incident, CenturyLink will be reloading several internet facing backbone devices within our network. This emergency maintenance will potentially result in an intermittent loss of connectivity and/or latency for up to 15 minutes. Clients with non-redundant ATS MPLS and EVPL connectivity could experience additional impact during this maintenance. This activity is required to address a resource constraint on several peering and hosting routers within the CenturyLink backbone network. CenturyLink will notify all impacted clients via email at the beginning and after the completion of the maintenance activity. We apologize for the inconvenience.
CRC Management
Request@savvis.net
North America - 888 638 6771
EMEA - 00800 7288 4743
Asia-Pacific - +65 63058099
- Cary Wiedemann
carywiedemann@gmail.com
+1 703 592 6498
On Tue, Apr 22, 2014 at 10:34 AM, Bill Weiss <houdini+outages@clanspum.net> wrote:
Unsure if this is related to the L3 issues or not. Savvis, at least at the CH3 datacenter, has had bad connectivity problems since 3:20AM CDT. Internet Pulse shows atrocious packet loss to almost everywhere from Savvis.
Master case number at Savvis is 4579123. They don't have any more info than "yep, we're down, working on it".
-- Bill Weiss _______________________________________________ Outages mailing list Outages@outages.org https://puck.nether.net/mailman/listinfo/outages

Sounds like they ran out of TCAM space and need to reload with “mls cef maximum-routes ip 768” -Peter

It's convenient that there's no mention of when the "maintenances" will be (the DATE OF EVENT can't be it).
For those curious what this looks like (sorry I don't have return path traceroutes):
Host                                                                 Loss% Snt Rcv  Last   Avg  Best  Wrst
 1. gw.home.lan (192.168.1.1)                                         0.0%  31  31   0.2   0.3   0.2   0.4
 2. 76.102.12.1                                                       0.0%  31  31   8.6   9.3   8.0  16.2
 3. 68.86.249.253                                                     0.0%  31  31   8.3   9.2   8.2  16.7
 4. te-1-1-0-12-ar01.oakland.ca.sfba.comcast.net (69.139.199.102)     0.0%  30  30  10.8  12.4  10.0  18.6
 5. be-100-ar01.sfsutro.ca.sfba.comcast.net (68.85.155.18)            0.0%  30  30  10.8  12.3  10.3  14.0
 6. he-3-8-0-0-cr01.sanjose.ca.ibone.comcast.net (68.86.94.85)        0.0%  30  30  13.7  13.9  11.6  16.8
 7. er1-tengig2-4.sanjoseequinix.savvis.net (206.24.222.1)            0.0%  30  30  12.1  13.6  11.9  29.9
 8. er2-xe-7-0-0.SanJoseEquinix.savvis.net (204.70.198.142)           0.0%  30  30  56.1  16.6  11.7  56.1
 9. cr2-tengig0-7-3-0.sanfrancisco.savvis.net (204.70.206.57)         0.0%  30  30  14.7  19.4  14.0  74.8
10. cr2-tengig-0-7-0-0.chicago.savvis.net (204.70.196.246)            3.3%  30  29  65.8  67.7  65.0  85.5
11. hr2-tengigabitethernet-12-1.elkgrovech3.savvis.net (204.70.195.  76.7%  30   7  66.1  66.9  65.3  72.7
12. das5-v3032.ch3.savvis.net (64.37.207.158)                        86.2%  30   4  69.0  80.1  65.2 120.8
13. 64.27.160.194                                                    82.8%  30   5  65.1  68.9  65.0  78.2
14. star.slashdot.org (216.34.181.48)                                72.4%  30   8  65.4  65.6  65.3  66.0
--
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
On Tue, Apr 22, 2014 at 06:21:55PM -0400, Cary Wiedemann wrote:
This outage is still ongoing and has far-reaching implications beyond their CH3 facility. I have servers in DC3 (Sterling, VA/Washington, DC) and haven't been able to reach 12.0.0.0/8 among other networks since early this morning.
Looks like things will get worse before they get better. Here's their latest update:
DATE OF EVENT: Tuesday, April 22, 2014 TIME OF EVENT : 03:20 CDT MASTER CASE: 4579123
UPDATE: In effort to resolve an ongoing incident, CenturyLink will be reloading several internet facing backbone devices within our network. This emergency maintenance will potentially result in an intermittent loss of connectivity and/or latency for up to 15 minutes. Clients with non-redundant ATS MPLS and EVPL connectivity could experience additional impact during this maintenance. This activity is required to address a resource constraint on several peering and hosting routers within the CenturyLink backbone network. CenturyLink will notify all impacted clients via email at the beginning and after the completion of the maintenance activity. We apologize for the inconvenience.
CRC Management Request@savvis.net North America - 888 638 6771 EMEA - 00800 7288 4743 Asia-Pacific - +65 63058099
- Cary Wiedemann carywiedemann@gmail.com +1 703 592 6498
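A note on reading the per-hop loss report in Jeremy's message: loss that first appears at one hop and then persists through every later hop, including the destination, usually indicates a real problem at or just before that hop, while loss at a single intermediate hop that disappears further along is often only ICMP rate limiting. The Python sketch below applies that reading to the rows from the report. The names `parse_report` and `first_persistent_loss`, the regular expression, and the 10% threshold are assumptions chosen for illustration and are not from the thread; as Jeremy notes, without return-path traces the forward path alone cannot rule out loss on the way back.

```python
import re
from typing import NamedTuple, Optional


class Hop(NamedTuple):
    number: int
    host: str
    loss_pct: float
    sent: int
    received: int


# Rows copied from the report in the thread (hop 11's address is truncated in the source).
REPORT = """\
Host Loss% Snt Rcv Last Avg Best Wrst
1. gw.home.lan (192.168.1.1) 0.0% 31 31 0.2 0.3 0.2 0.4
2. 76.102.12.1 0.0% 31 31 8.6 9.3 8.0 16.2
3. 68.86.249.253 0.0% 31 31 8.3 9.2 8.2 16.7
4. te-1-1-0-12-ar01.oakland.ca.sfba.comcast.net (69.139.199.102) 0.0% 30 30 10.8 12.4 10.0 18.6
5. be-100-ar01.sfsutro.ca.sfba.comcast.net (68.85.155.18) 0.0% 30 30 10.8 12.3 10.3 14.0
6. he-3-8-0-0-cr01.sanjose.ca.ibone.comcast.net (68.86.94.85) 0.0% 30 30 13.7 13.9 11.6 16.8
7. er1-tengig2-4.sanjoseequinix.savvis.net (206.24.222.1) 0.0% 30 30 12.1 13.6 11.9 29.9
8. er2-xe-7-0-0.SanJoseEquinix.savvis.net (204.70.198.142) 0.0% 30 30 56.1 16.6 11.7 56.1
9. cr2-tengig0-7-3-0.sanfrancisco.savvis.net (204.70.206.57) 0.0% 30 30 14.7 19.4 14.0 74.8
10. cr2-tengig-0-7-0-0.chicago.savvis.net (204.70.196.246) 3.3% 30 29 65.8 67.7 65.0 85.5
11. hr2-tengigabitethernet-12-1.elkgrovech3.savvis.net (204.70.195. 76.7% 30 7 66.1 66.9 65.3 72.7
12. das5-v3032.ch3.savvis.net (64.37.207.158) 86.2% 30 4 69.0 80.1 65.2 120.8
13. 64.27.160.194 82.8% 30 5 65.1 68.9 65.0 78.2
14. star.slashdot.org (216.34.181.48) 72.4% 30 8 65.4 65.6 65.3 66.0
"""

# One row: hop number, host text, Loss%, Snt, Rcv, then four latency columns.
ROW = re.compile(
    r"^\s*(\d+)\.\s+(.+?)\s+(\d+(?:\.\d+)?)%\s+(\d+)\s+(\d+)"
    r"(?:\s+\d+(?:\.\d+)?){4}\s*$"
)


def parse_report(text: str) -> list[Hop]:
    """Parse report rows into Hop records; the header and other lines are skipped."""
    hops = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            hops.append(Hop(int(m[1]), m[2], float(m[3]), int(m[4]), int(m[5])))
    return hops


def first_persistent_loss(hops: list[Hop], threshold_pct: float = 10.0) -> Optional[Hop]:
    """Return the first hop after which every hop, through the last, exceeds the threshold."""
    candidate = None
    for hop in hops:
        if hop.loss_pct > threshold_pct:
            if candidate is None:
                candidate = hop
        else:
            candidate = None  # loss did not persist, so start over
    return candidate


if __name__ == "__main__":
    hop = first_persistent_loss(parse_report(REPORT))
    print(hop.number, hop.host, hop.loss_pct)
```

Applied to these rows, the sketch singles out hop 11, the elkgrovech3 router, which is consistent with the reports of trouble reaching CH3; hop 10's 3.3% falls under the assumed threshold and is ignored.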

Yeah. This has been a disaster, and we've been seeing this since at least 9:00 AM CDT. I can't even make it from my apartment on the west side of Chicago over to CH3 without seeing 85% packet loss (via Comcast). The most recent we've heard is that they'll be doing their rebooting of stuff 6-8 hours after 3:00 PM, so 9:00 to 11:00. We'll see. In the interim, I think I'll order another pint. ;)
On Apr 22, 2014, at 8:51 PM, Jeremy Chadwick <jdc@koitsu.org> wrote:
It's convenient that there's no mention of when the "maintenances" will be (the DATE OF EVENT can't be it). For those curious what this looks like (sorry I don't have return path traceroutes):
Host                                                                 Loss% Snt Rcv  Last   Avg  Best  Wrst
 1. gw.home.lan (192.168.1.1)                                         0.0%  31  31   0.2   0.3   0.2   0.4
 2. 76.102.12.1                                                       0.0%  31  31   8.6   9.3   8.0  16.2
 3. 68.86.249.253                                                     0.0%  31  31   8.3   9.2   8.2  16.7
 4. te-1-1-0-12-ar01.oakland.ca.sfba.comcast.net (69.139.199.102)     0.0%  30  30  10.8  12.4  10.0  18.6
 5. be-100-ar01.sfsutro.ca.sfba.comcast.net (68.85.155.18)            0.0%  30  30  10.8  12.3  10.3  14.0
 6. he-3-8-0-0-cr01.sanjose.ca.ibone.comcast.net (68.86.94.85)        0.0%  30  30  13.7  13.9  11.6  16.8
 7. er1-tengig2-4.sanjoseequinix.savvis.net (206.24.222.1)            0.0%  30  30  12.1  13.6  11.9  29.9
 8. er2-xe-7-0-0.SanJoseEquinix.savvis.net (204.70.198.142)           0.0%  30  30  56.1  16.6  11.7  56.1
 9. cr2-tengig0-7-3-0.sanfrancisco.savvis.net (204.70.206.57)         0.0%  30  30  14.7  19.4  14.0  74.8
10. cr2-tengig-0-7-0-0.chicago.savvis.net (204.70.196.246)            3.3%  30  29  65.8  67.7  65.0  85.5
11. hr2-tengigabitethernet-12-1.elkgrovech3.savvis.net (204.70.195.  76.7%  30   7  66.1  66.9  65.3  72.7
12. das5-v3032.ch3.savvis.net (64.37.207.158)                        86.2%  30   4  69.0  80.1  65.2 120.8
13. 64.27.160.194                                                    82.8%  30   5  65.1  68.9  65.0  78.2
14. star.slashdot.org (216.34.181.48)                                72.4%  30   8  65.4  65.6  65.3  66.0
--
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

Chris Swingler(chris@chrisswingler.com)@Tue, Apr 22, 2014 at 09:09:58PM -0500:
Yeah. This has been a disaster, and we've been seeing this since at least 9:00 AM CDT. I can't even make it from my apartment on the west side of Chicago over to CH3 without seeing 85% packet loss (via Comcast).
The most recent we've heard is that they'll be doing their rebooting of stuff 6-8 hours after 3:00 PM, so 9:00 to 11:00. We'll see. In the interim, I think I'll order another pint. ;)
When I spoke to someone at their incident team around 6:50PM, they said the same "in the next 6-8 hours" window, so until 3AM? Yikes. The ~4PM email said they would "notify all impacted clients via email at the beginning and after the completion of the maintenance activity" and I still haven't received anything.
--
Bill Weiss

Yeah, we've yet to receive anything as well. Their communication during this has been, uh, less than stellar.
On Apr 22, 2014, at 9:22 PM, Bill Weiss <houdini+outages@clanspum.net> wrote:
When I spoke to someone at their incident team around 6:50PM, they said the same "in the next 6-8 hours" window, so until 3AM? Yikes. The ~4PM email said they would "notify all impacted clients via email at the beginning and after the completion of the maintenance activity" and I still haven't received anything.
-- Bill Weiss

And right as I mention that:
EVENT UPDATE 2100: The maintenance to repair the connectivity/latency issue within the Chicago region will begin at 00:00 CT and completed by 04:00 CT. Next update will be provided as we start the maintenance and hourly updates will be provided throughout the maintenance.
On Apr 22, 2014, at 9:28 PM, Chris Swingler <chris@chrisswingler.com> wrote:
Yeah, we've yet to receive anything as well. Their communication during this has been, uh, less than stellar.

...and they were never heard from again.
On Tue, Apr 22, 2014 at 10:30 PM, Chris Swingler <chris@chrisswingler.com> wrote:
And right as I mention that:
EVENT UPDATE 2100: The maintenance to repair the connectivity/latency issue within the Chicago region will begin at 00:00 CT and completed by 04:00 CT. Next update will be provided as we start the maintenance and hourly updates will be provided throughout the maintenance.

Sean Lally(sean.lally@crownpeak.com)@Wed, Apr 23, 2014 at 05:48:47AM -0400:
...and they were never heard from again.
I went to bed in despair :)
Overnight updates:
2345-0330 things were rebooted (sessions bounced, etc).
0330 brought "Repairs on node 2 in the Chicago region uncovered a possible issue with a card in node 2. We are still expecting to complete repairs at 0400 CT. But may run over by a half hour."
Last update at 0446 was "We have completed repairs in the Chicago region, but are seeing a small subset of clients with aberrant behavior. Further updates will be sent within an hour."
Packet loss certainly seems to have gotten better.
--
Bill Weiss

Last email from me on the topic, I promise :)
As of 5:43 CDT:
EVENT RESOLUTION: We have completed repairs have been completed related to this issue. Any clients that continue to see latency or connectivity loss should contact the service desk at the below number and a case will be raised to address the issue.
--
Bill Weiss

participants (6)
- Bill Weiss
- Cary Wiedemann
- Chris Swingler
- Jeremy Chadwick
- Peter Kranz
- Sean Lally