Re: [outages] IPv6 tunnels in FRA1 on HE.net down?

12 Jun 2013

Indeed, traceroute to my fra1 tunnel is definitely getting mislaid within
he.net.

Time to ask for a refund?  oh wait... ;)

Tracing route to tmcc.me [2001:470:1f0a:3ee::2]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  6.9.1.2.2.d.e.f.f.f.f.f.6.7.a.0.3.f.7.9.d.8.4.6.0.b.8.0.1.0.0.2.ip6.arpa [2001:8b0:648d:97f3:a76:ffff:fed2:2196]
  2    65 ms    65 ms    66 ms  c.gormless.thn.aa.net.uk [2001:8b0:0:53::53]
  3    56 ms    58 ms    61 ms  2001:7f8:4::50e8:1
  4    70 ms    74 ms    74 ms  40gigabitethernet1-1.core1.lon1.he.net [2001:7f8:4::1b1b:1]
  5   145 ms   141 ms   132 ms  10gigabitethernet10-4.core1.nyc4.he.net [2001:470:0:128::1]
  6   146 ms   150 ms   154 ms  100gigabitethernet7-2.core1.chi1.he.net [2001:470:0:298::1]
  7   211 ms   216 ms   212 ms  10gigabitethernet11-4.core1.pao1.he.net [2001:470:0:283::1]
  8     *        *        *     Request timed out.
  9     *        *        *     Request timed out.
 10     *        *        *     Request timed out.

On 12 June 2013 23:19, Constantine A. Murenin <mureninc@gmail.com> wrote:
...
I smell a big outage:
% date; traceroute tserv1.fra1.he.net
Wed Jun 12 15:17:04 PDT 2013
traceroute to tserv1.fra1.he.net (216.66.80.30), 30 hops max, 60 byte packets
 1  192.168.105.3 (192.168.105.3)  0.673 ms  0.773 ms  0.923 ms
 2  10gigabitethernet7-6.core3.fmt2.he.net (65.49.10.217)  1.795 ms  1.811 ms  1.795 ms
 3  10gigabitethernet12-1.core1.lax1.he.net (184.105.213.26)  19.773 ms 10gigabitethernet10-1.core1.sjc2.he.net (184.105.222.14)  0.786 ms  0.756 ms
 4  10gigabitethernet10-8.core1.nyc4.he.net (72.52.92.225)  71.714 ms  76.569 ms 10gigabitethernet14-2.core1.nyc4.he.net (184.105.213.198)  71.690 ms
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * *^C
However, a "reverse" traceroute works fine:
% date; traceroute ns1.he.net
Wed Jun 12 15:18:47 PDT 2013
traceroute to ns1.he.net (216.218.130.2), 64 hops max, 40 byte packets
 1  static.33.203.4.46.clients.your-server.de (46.4.203.33)  0.682 ms  0.531 ms  0.481 ms
 2  hos-tr4.juniper2.rz13.hetzner.de (213.239.224.97)  0.245 ms hos-tr3.juniper2.rz13.hetzner.de (213.239.224.65)  0.242 ms hos-tr2.juniper1.rz13.hetzner.de (213.239.224.33)  0.241 ms
 3  hos-bb2.juniper4.ffm.hetzner.de (213.239.240.150)  5.951 ms hos-bb1.juniper1.ffm.hetzner.de (213.239.240.224)  4.803 ms  4.786 ms
 4  30gigabitethernet4-3.core1.fra1.he.net (80.81.192.172)  6.354 ms  6.81 ms  5.451 ms
 5  10gigabitethernet10-2.core1.par2.he.net (72.52.92.26)  22.803 ms  24.531 ms  24.787 ms
 6  10gigabitethernet15-1.core1.ash1.he.net (184.105.213.93)  101.504 ms  99.563 ms  99.958 ms
 7  10gigabitethernet11-1.core1.pao1.he.net (184.105.213.177)  163.687 ms  171.711 ms  175.34 ms
 8  10gigabitethernet1-2.core1.fmt1.he.net (184.105.213.65)  163.940 ms  171.362 ms  167.67 ms
 9  ns1.he.net (216.218.130.2)  163.52 ms  165.265 ms  164.143 ms
C.
On 12 June 2013 15:11, Constantine A. Murenin <mureninc@gmail.com> wrote:
...
he.net fra1 tserv1 has been down since about 10 minutes ago (~14:57 PT).
This time it seems like the whole thing is down: I cannot even ping
ordns.he.net over IPv6, and the connection of a friend of mine who's
running smokeping is also down, i.e. this is definitely widespread.
Not sure what exactly is down, but it seems to be BGP-related, perhaps:
 1  2600:3c01::8678:acff:fe0d:79c1 (2600:3c01::8678:acff:fe0d:79c1)  0.699 ms  0.828 ms  0.970 ms
 2  10gigabitethernet7-6.core3.fmt2.he.net (2001:470:1:3b8::1)  5.936 ms  5.877 ms  5.859 ms
 3  10gigabitethernet5-4.core1.pao1.he.net (2001:470:0:263::2)  6.743 ms  6.877 ms  6.978 ms
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
C.
On 14 May 2013 15:50, Constantine A. Murenin <mureninc@gmail.com> wrote:
...
For what it is worth, further details about the issue have surfaced.
I found a friend who also has a tunnel on tserv1.fra1.he.net., and he
has been running smokeping to various IPv4 and IPv6 resources for
quite a while.
According to several of his smokeping reports, it can be concluded
that this very outage occurred between 14T18:00/05 and 14T18:45/50; but
we've also noticed that there was another, 6-hour (yes, 6-hour) outage
a day earlier, from ~13T12 to ~13T18 (which corresponds to Monday early
to late morning Pacific Time).
I've contacted he.net again this time around, and they said that
they're trying to hunt down some obscure kernel bug that is causing
these issues.
tunnelbroker.net is a free service, but a 6-hour outage, spanning a
quarter of a whole day, is absolutely ridiculous.  I'm stunned that the
IPv6 connectivity of tserv1.fra1.he.net. is, apparently, still not
monitored, even though it's known to be having these issues.
 ???
Alternatively, it is, of course, possible that some engineer was
troubleshooting the root cause of this issue for those whole 5 or 6
hours on Sunday/Monday night, but I find that somewhat hard to
believe; more likely, it broke and no one responsible knew about it
for most of that time.
Even more troubling is that they don't even publish any reports about
these extended outages.
For tserv1.fra1.he.net. end users:  if you can `ping6 ordns.he.net`
(it runs on tserv itself, try $(host `dig +short -6 @ordns.he.net
whoami.akamai.net`)), but cannot `ping6 ns4.he.net`, then it most
likely means that tserv1.fra1.he.net IPv6 connectivity is down again,
and you must open a ticket with HE.net ASAP.  Perhaps someone should
set up smokeping with automatic emails to support@he.net?
C.
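The rule of thumb above (ordns.he.net reachable, ns4.he.net not) can be sketched as a small Python check. This is only an illustration, not anything he.net provides: the function names and the "inconclusive" label are made up here, and it assumes a `ping6` binary that accepts `-c <count>` is on the PATH (on some systems `ping -6` would be needed instead).

```python
import subprocess


def classify(ordns_ok: bool, ns4_ok: bool) -> str:
    """Apply the rule of thumb from the message above.

    Only the first branch comes from the message; the other two labels
    are assumptions added for completeness.
    """
    if ordns_ok and not ns4_ok:
        # Tunnel server answers, but nothing beyond it does.
        return "tserv-ipv6-down"
    if ordns_ok and ns4_ok:
        return "ok"
    # ordns itself is unreachable: the message gives no rule for this case.
    return "inconclusive"


def ping6(host: str, count: int = 3, timeout_s: int = 15) -> bool:
    """Return True if `host` answers ICMPv6 echo requests.

    Assumes a `ping6` command that accepts `-c <count>`.
    """
    try:
        result = subprocess.run(
            ["ping6", "-c", str(count), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def check_tunnel() -> str:
    """Run both probes from inside the tunnel and classify the result."""
    return classify(ping6("ordns.he.net"), ping6("ns4.he.net"))
```

Running `check_tunnel()` periodically from a host behind the tunnel, and mailing on "tserv-ipv6-down", would be one rough way to approximate the smokeping-with-emails idea.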
On 14 May 2013 09:42, Constantine A. Murenin <mureninc@gmail.com> wrote:
...
This just happened again:  all IPv6 tunnels on tserv1.fra1.he.net.
were inaccessible; from within a tunnel, cannot access any IPv6
resource, other than ordns.he.net, which runs on the tunnel server
itself.
Why does FRA1 lose IPv6 connectivity so often?
Update: Seems like it has been resolved as I've been writing this
email, but this would seem to happen a few times too many.
C.
On 23 April 2013 19:28, Constantine A. Murenin <mureninc@gmail.com> wrote:
...
On 23 April 2013 18:28, Constantine A. Murenin <mureninc@gmail.com> wrote:
...
As of a couple of minutes ago, my IPv6 tunnel seems to have no
connectivity; weirdly, I can still access ordns.he.net
(2001:470:20::2) just fine, but not ns{2,3,4,5}.he.net or any other
IPv6 host.
Update: reported to he.net support@ at 18:34, including a follow-up
phone call; everything's back online, as of at least 18:51 PT.
Pretty fast resolution, for a free service. :-)
According to he.net, tserv wasn't responding on its IPv6 address, and
has since been rebooted.
Which adds up as per my mtr from a Linode:
# mtr --report{,-wide,-cycles=60} --order "SRL BGAWV" -6 XXXXXXXXX ; date
  2. 10gigabitethernet2-3.core1.fmt1.he.net     60    60  0.0%   0.4   2.5   8.5  85.5  15.2
  3. 10gigabitethernet1-2.core1.sjc2.he.net     60    60  0.0%   0.8   2.5   5.3  51.6   8.5
  4. 10gigabitethernet3-3.core1.den1.he.net     60    60  0.0%  27.8  31.7  32.3  70.0   7.3
  5. 10gigabitethernet5-5.core1.mci3.he.net     60    60  0.0%  39.7  43.6  44.3 114.3  10.1
  6. 10gigabitethernet5-2.core1.chi1.he.net     60    60  0.0%  52.0  56.2  57.4 177.2  17.0
  7. 100gigabitethernet7-2.core1.nyc4.he.net    60    59  1.7%  69.1  72.2  72.5 119.8   7.1
  8. 10gigabitethernet1-2.core1.lon1.he.net     60    60  0.0% 137.8 141.8 142.2 229.2  12.0
  9. 10gigabitethernet4-2.core1.fra1.he.net     60    60  0.0% 149.4 154.0 154.4 242.8  12.7
 10. ???                                        60     0 100.0   0.0   0.0   0.0   0.0   0.0
Tue Apr 23 18:14:29 PDT 2013
...
  2. 10gigabitethernet2-3.core1.fmt1.he.net     60    60  0.0%   0.4   2.8   9.3  78.3  16.8
  3. 10gigabitethernet1-2.core1.sjc2.he.net     60    60  0.0%   0.8   2.3   3.9  13.9   3.8
  4. 10gigabitethernet3-3.core1.den1.he.net     60    60  0.0%  27.8  30.3  30.5  39.2   3.5
  5. 10gigabitethernet5-5.core1.mci3.he.net     60    60  0.0%  39.7  43.4  43.6  52.8   4.4
  6. 10gigabitethernet5-2.core1.chi1.he.net     60    60  0.0%  52.0  54.1  54.2  62.2   3.2
  7. 100gigabitethernet7-2.core1.nyc4.he.net    60    60  0.0%  69.1  71.5  71.6  82.1   3.5
  8. 10gigabitethernet1-2.core1.lon1.he.net     60    60  0.0% 137.8 140.3 140.4 149.1   3.5
  9. 10gigabitethernet4-2.core1.fra1.he.net     60    60  0.0% 149.4 152.3 152.3 162.5   3.7
 10. tserv1.fra1.he.net                         60    60  0.0% 154.1 155.5 155.5 163.4   2.2
 11. IPv6.XXXXXX                                60    59  1.7% 155.2 156.0 156.1 162.9   1.2
Tue Apr 23 18:55:34 PDT 2013
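The two mtr reports above differ at hop 10: during the outage every probe past core1.fra1 was lost, and after the reboot tserv1.fra1.he.net answers again. A minimal Python sketch of picking out that hop automatically follows. It assumes the column order produced by `--order "SRL BGAWV"` (sent, received, loss first) and one hop per line; the names `Hop`, `parse_hops` and `first_dead_hop` are hypothetical, and the sample rows are copied from the first report.

```python
from typing import List, NamedTuple, Optional


class Hop(NamedTuple):
    index: int
    host: str
    sent: int
    received: int


def parse_hops(report: str) -> List[Hop]:
    """Parse hop rows of the form '<n>. <host> <sent> <received> ...'."""
    hops = []
    for line in report.splitlines():
        fields = line.split()
        # Skip anything that does not start with a hop number like "10."
        if len(fields) < 4 or not fields[0].endswith("."):
            continue
        try:
            hops.append(
                Hop(int(fields[0][:-1]), fields[1], int(fields[2]), int(fields[3]))
            )
        except ValueError:
            continue
    return hops


def first_dead_hop(hops: List[Hop]) -> Optional[Hop]:
    """Return the first hop that received no replies at all, if any."""
    for hop in hops:
        if hop.sent > 0 and hop.received == 0:
            return hop
    return None


# Last three rows of the report taken during the outage.
SAMPLE = """\
  8. 10gigabitethernet1-2.core1.lon1.he.net     60    60  0.0% 137.8 141.8 142.2 229.2  12.0
  9. 10gigabitethernet4-2.core1.fra1.he.net     60    60  0.0% 149.4 154.0 154.4 242.8  12.7
 10. ???                                        60     0 100.0   0.0   0.0   0.0   0.0   0.0
"""

dead = first_dead_hop(parse_hops(SAMPLE))
```

Here `dead.index` is 10, directly after core1.fra1, which is consistent with he.net's statement that tserv itself had stopped responding on its IPv6 address.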
And I guess ordns.he.net (2001:470:20::2) really runs on tserv
(and hence wasn't affected during the outage).
Cns# echo {ordns,ns{2,3,4,5}}.he.net | xargs -n1 traceroute6 -l; date
traceroute6 to ordns.he.net (2001:470:20::2) from 2001:470:XXXX:YYYY::2, 64 hops max, 12 byte packets
 1  ordns.he.net (2001:470:20::2)  6.39 ms  6.495 ms  6.213 ms
traceroute6 to ns2.he.net (2001:470:200::2) from 2001:470:XXXX:YYYY::2, 64 hops max, 12 byte packets
 1  XXXXXX.tunnel.tserv6.fra1.ipv6.he.net (2001:470:XXXX:YYYY::1)  9.353 ms  8.783 ms  9.328 ms
 2  gige-g2-20.core1.fra1.he.net (2001:470:0:69::1)  13.12 ms  15.134 ms  6.252 ms
 3  10gigabitethernet5-3.core1.lon1.he.net (2001:470:0:1d2::1)  18.113 ms  23.554 ms  20.342 ms
 4  ns2.he.net (2001:470:200::2)  20.449 ms  22.873 ms  20.517 ms
traceroute6 to ns3.he.net (2001:470:300::2) from 2001:470:XXXX:YYYY::2, 64 hops max, 12 byte packets
 1  * XXXXXX.tunnel.tserv6.fra1.ipv6.he.net (2001:470:XXXX:YYYY::1)  8.743 ms  9.2 ms
 2  gige-g2-20.core1.fra1.he.net (2001:470:0:69::1)  5.934 ms  10.749 ms  6.197 ms
 3  10gigabitethernet5-3.core1.ams1.he.net (2001:470:0:225::1)  18.567 ms  17.728 ms  13.282 ms
 4  ns3.he.net (2001:470:300::2)  13.462 ms  13.412 ms  13.525 ms
traceroute6 to ns4.he.net (2001:470:400::2) from 2001:470:XXXX:YYYY::2, 64 hops max, 12 byte packets
 1  XXXXXX.tunnel.tserv6.fra1.ipv6.he.net (2001:470:XXXX:YYYY::1)  8.838 ms  8.684 ms  8.682 ms
 2  gige-g2-20.core1.fra1.he.net (2001:470:0:69::1)  6.233 ms  13.208 ms  5.97 ms
 3  ns4.he.net (2001:470:400::2)  6.145 ms  6.38 ms  6.384 ms
traceroute6 to ns5.he.net (2001:470:500::2) from 2001:470:XXXX:YYYY::2, 64 hops max, 12 byte packets
 1  * XXXXXX.tunnel.tserv6.fra1.ipv6.he.net (2001:470:XXXX:YYYY::1)  8.655 ms  8.848 ms
 2  gige-g2-20.core1.fra1.he.net (2001:470:0:69::1)  12.607 ms  6.527 ms  11.929 ms
 3  10gigabitethernet5-3.core1.ams1.he.net (2001:470:0:225::1)  13.289 ms  13.968 ms  16.754 ms
 4  ns5.he.net (2001:470:500::2)  14.111 ms  13.575 ms  13.385 ms
Tue Apr 23 18:59:48 PDT 2013
However, it's unclear why IPv6 connectivity of tserv1.fra1.he.net
doesn't seem to be monitored otherwise. :/
Cheers,
Constantine.
--
V. V. Putin on perfection, 24 December 2000: If a person is satisfied
with everything, he is a complete idiot. A healthy person of sound
memory cannot be satisfied with everything all the time.
_______________________________________________
Outages mailing list
Outages@outages.org
https://puck.nether.net/mailman/listinfo/outages

Tony McCrory