reliability issues with hetzner-nbg1-01

added component::internal services/tor sysadmin team owner::anarcat priority::medium resolution::fixed severity::blocker status::closed type::defect labels

Trac:

Trac:
Description: The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is seeing intermittent networking issues. It's proving very difficult to get reliable metrics out of it, in any case. From its perspective, random hosts blink in and out of existence unreliably, with almost all hosts (63 of the ~80 monitored) are affected over a period of a week. This leads me to believe the problem is not with all hosts, but with the monitoring server itself. The attached screenshot (tpo-overview.png) shows the randomness of the problem, as seen from hetzner-nbg1-01.torproject.org during the last 7 days.

We have another monitoring server hosted in the Hetzner cloud (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the same problems. From its perspective, most hosts are healthy over the same period, with an average availability of 99.876% over all hosts, which includes at least one outlier at 88%. The other (nagios) monitoring server sees the new monitoring server with only a 99.728% availbility, with a total 30 minutes downtime over the last 7 days. Note that those statistics have a large margin of error as the Nagios checks are much less frequent than the Prometheus ones, with a granularity ranging in tens of minutes instead of seconds.

The alert history graph (second attachment, histogram.cgi-nbg1-01.png) shows more clearly the problem, especially when compared to a similar host in the vincinity (hetzner-nbg01-02, third attachement, histogram.cgi-nbg1-02.png).

I would therefore conclude there is a severe and intermittent routing issue with this server.

to

The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is seeing intermittent networking issues. It's proving very difficult to get reliable metrics out of it, in any case. From its perspective, random hosts blink in and out of existence unreliably, with almost all hosts (63 of the ~80 monitored) are affected over a period of a week. This leads me to believe the problem is not with all hosts, but with the monitoring server itself. The attached screenshot (tpo-overview.png) shows the randomness of the problem, as seen from hetzner-nbg1-01.torproject.org during the last 7 days.

We have another monitoring server hosted in the Hetzner cloud (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the same problems. From its perspective, most hosts are healthy over the same period, with an average availability of 99.876% over all hosts, which includes at least one outlier at 88%. The other (nagios) monitoring server sees the new monitoring server with only a 99.728% availbility, with a total 30 minutes downtime over the last 7 days. Note that those statistics have a large margin of error as the Nagios checks are much less frequent than the Prometheus ones, with a granularity ranging in tens of minutes instead of seconds.

The alert history graph (second attachment, histogram.cgi-nbg1-01.png) shows more clearly the problem, especially when compared to a similar host in the vincinity (hetzner-nbg01-02, third attachement, histogram.cgi-nbg1-02.png).

I would therefore conclude there is a severe and intermittent routing issue with this server.

Trac:
Description: The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is seeing intermittent networking issues. It's proving very difficult to get reliable metrics out of it, in any case. From its perspective, random hosts blink in and out of existence unreliably, with almost all hosts (63 of the ~80 monitored) are affected over a period of a week. This leads me to believe the problem is not with all hosts, but with the monitoring server itself. The attached screenshot (tpo-overview.png) shows the randomness of the problem, as seen from hetzner-nbg1-01.torproject.org during the last 7 days.

We have another monitoring server hosted in the Hetzner cloud (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the same problems. From its perspective, most hosts are healthy over the same period, with an average availability of 99.876% over all hosts, which includes at least one outlier at 88%. The other (nagios) monitoring server sees the new monitoring server with only a 99.728% availbility, with a total 30 minutes downtime over the last 7 days. Note that those statistics have a large margin of error as the Nagios checks are much less frequent than the Prometheus ones, with a granularity ranging in tens of minutes instead of seconds.

The alert history graph (second attachment, histogram.cgi-nbg1-01.png) shows more clearly the problem, especially when compared to a similar host in the vincinity (hetzner-nbg01-02, third attachement, histogram.cgi-nbg1-02.png).

I would therefore conclude there is a severe and intermittent routing issue with this server.

to

The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is seeing intermittent networking issues. It's proving very difficult to get reliable metrics out of it, in any case. From its perspective, random hosts blink in and out of existence unreliably, with almost all hosts (63 of the ~80 monitored) are affected over a period of a week. This leads me to believe the problem is not with all hosts, but with the monitoring server itself. The attached screenshot (tpo-overview.png) shows the randomness of the problem, as seen from hetzner-nbg1-01.torproject.org during the last 7 days.

We have another monitoring server hosted in the Hetzner cloud (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the same problems. From its perspective, most hosts are healthy over the same period, with an average availability of 99.876% over all hosts, which includes at least one outlier at 88%. The other (nagios) monitoring server sees the new monitoring server with only a 99.728% availbility, with a total 30 minutes downtime over the last 7 days. Note that those statistics have a large margin of error as the Nagios checks are much less frequent than the Prometheus ones, with a granularity ranging in tens of minutes instead of seconds.

The alert history graph (second attachment, histogram.cgi-nbg1-01.png) shows more clearly the problem, especially when compared to a similar host in the vincinity (hetzner-nbg01-02, third attachement, histogram.cgi-nbg1-02.png).

I would therefore conclude there is a severe and intermittent routing issue with this server.

I filed this as an issue in the Hetzner "cloud" web interface and am waiting for feedback.

hetzner responded by asking for error messages, I sent them more logs from nagios:

 we see various errors from the nagios monitoring server
 (hetzner-hel1-01.torproject.org), looking at that one. here's an
 example, yesterday, of pings failing for about 15 minutes:

 [2019-10-01 16:35:44] SERVICE ALERT: hetzner-nbg1-01;PING;CRITICAL;SOFT;1;PING CRITICAL - Packet loss = 100%
 [2019-10-01 16:36:04] HOST ALERT: hetzner-nbg1-01;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
 [2019-10-01 16:36:34] SERVICE ALERT: hetzner-nbg1-01;process - apache2 - master;CRITICAL;HARD;1;CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
 [2019-10-01 16:36:44] SERVICE ALERT: hetzner-nbg1-01;PING;CRITICAL;HARD;1;PING CRITICAL - Packet loss = 100%
 [2019-10-01 16:37:24] HOST ALERT: hetzner-nbg1-01;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
 [2019-10-01 16:38:44] HOST ALERT: hetzner-nbg1-01;DOWN;SOFT;3;PING CRITICAL - Packet loss = 100%
 [2019-10-01 16:40:04] HOST ALERT: hetzner-nbg1-01;DOWN;SOFT;4;PING CRITICAL - Packet loss = 100%
 [2019-10-01 16:40:14] HOST ALERT: hetzner-nbg1-01;UP;SOFT;5;PING OK - Packet loss = 0%, RTA = 26.97 ms
 [2019-10-01 16:41:34] SERVICE ALERT: hetzner-nbg1-01;PING;OK;HARD;1;PING OK - Packet loss = 0%, RTA = 23.79 ms
 [2019-10-01 16:50:44] SERVICE ALERT: hetzner-nbg1-01;process - apache2 - master;OK;HARD;1;PROCS OK: 1 process with UID = 0 (root), args '/usr/sbin/apache2'

 I could run a cross ping between the two servers in a screen session to
 try and diagnose this better for you, but from what i can tell, the
 packets just get dropped to the floor somewhere.

i've started a cross-ping between the nagios and prometheus servers to see if this can confirm the packet loss issue.

this could correlate with ipsec problems as well.

i saw a case of a downtime, live. both server couldn't ping each other for a few minutes. this happened in the logs, on nbg1:

Oct  2 19:03:20 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 09[IKE] establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org{6}
Oct  2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 08[IKE] 95.216.141.241 is initiating an IKE_SA
Oct  2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 13[IKE] IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[54] established between 195.201.139.202[195.201.139.202]...95.216.141.241[95.216.141.241]
Oct  2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 13[IKE] CHILD_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org{164} established with SPIs c0bc9534_i cf59efd5_o and TS 195.201.139.202/32 2a01:4f8:c2c:1e17::1/128 === 95.216.141.241/32 2a01:4f9:c010:5f1::1/128

... ie. a new session, negociated in 6 seconds. that's pretty slow, but tolerable I guess. the problem is that, from nagios' point of view, this was a much longer downtime:

Oct  2 18:57:31 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 10[IKE] establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org{6}
Oct  2 19:00:16 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 16[IKE] initiating IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[44] to 195.201.139
.202

ie. the session started 6 minutes earlier, and took 3 more minutes to get to the initiating stage. then nagios noticed the node was down, naturally:

Oct  2 19:01:11 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE ALERT: hetzner-nbg1-01;process - apache2 - worker;CRITICAL;SOFT;1;CHECK_NRPE STATE CRITICAL: Sock
et timeout after 50 seconds.
Oct  2 19:01:31 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT: hetzner-nbg1-01;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%

then, another three minutes later, ipsec figured it out and fixed the outage:

Oct  2 19:03:02 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 12[IKE] initiating IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[45] to 195.201.139.202
Oct  2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 14[IKE] establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org{6}
Oct  2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 08[IKE] IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[45] established between 95.216.141.241[95.216.141.241]...195.201.139.202[195.201.139.202]
Oct  2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 08[IKE] CHILD_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org{139} established with SPIs cf59efd5_i c0bc9534_o and TS 95.216.141.241/32 2a01:4f9:c010:5f1::1/128 === 195.201.139.202/32 2a01:4f8:c2c:1e17::1/128
Oct  2 19:03:41 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT: hetzner-nbg1-01;UP;SOFT;4;PING OK - Packet loss = 0%, RTA = 26.98 ms

A similar problem occured when macrum rebooted. From nbg's point of view, it went down, for sure, but then it took forever to rekey, as it was stuck in this state:

hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: ESTABLISHED 47 minutes ago, 195.201.139.202[195.201.139.202]...138.201.192.11[138.201.192.11]
hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: IKEv2 SPIs: c1869d7083d456cd_i adcdb88009c5736e_r*, pre-shared key reauthentication in 117 minutes
hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: IKE proposal: AES_CBC_128/HMAC_SHA2_256_128/PRF_HMAC_SHA2_256/MODP_3072
hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: Tasks queued: CHILD_REKEY
hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: Tasks active: CHILD_REKEY
hetzner-nbg1-01.torproject.org-macrum.torproject.org{152}:  REKEYING, TUNNEL, reqid 2, expires in 12 minutes
hetzner-nbg1-01.torproject.org-macrum.torproject.org{152}:   195.201.139.202/32 2a01:4f8:c2c:1e17::1/128 === 138.201.192.11/32 138.201.212.224/28 172.30.133.0/24 2a01:4f8:172:39ca::2/128 2a01:4f8:172:39ca:0:dad3::/96

At that point, the tunnel had already been down for a while - macrum was rebooted at 19:04 - but it took a full 20 minutes for ipsec to recover:

Oct  2 19:18:45 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 16[IKE] establishing CHILD_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org{2}
Oct  2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 14[IKE] initiating IKE_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org[55] to 138.201.192.11
Oct  2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 12[IKE] establishing CHILD_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org{2}
Oct  2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 10[IKE] IKE_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org[55] established between 195.201.139.202[195.201.139.202]...138.201.192.11[138.201.192.11]
Oct  2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 10[IKE] CHILD_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org{166} established with SPIs cfe1be1c_i c2ce0e71_o and TS 195.201.139.202/32 2a01:4f8:c2c:1e17::1/128 === 138.201.192.11/32 138.201.212.224/28 172.30.133.0/24 2a01:4f8:172:39ca::2/128 2a01:4f8:172:39ca:0:dad3::/96

in comparison, nagios barely flinched when the server went down:

Oct  2 19:04:00 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 15[IKE] deleting IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[31] between 95.216.141.241[95.216.141.241]...138.201.192.11[138.201.192.11]
Oct  2 19:04:00 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 05[IKE] initiating IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[46] to 138.201.192.11
Oct  2 19:04:31 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT: macrum;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Oct  2 19:05:21 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE ALERT: macrum;unwanted process - postgrey;CRITICAL;HARD;1;CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
Oct  2 19:05:21 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE ALERT: macrum;SSL cert - host;CRITICAL;HARD;1;CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
Oct  2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 15[IKE] establishing CHILD_SA hetzner-hel1-01.torproject.org-macrum.torproject.org{2}
Oct  2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 06[IKE] IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[46] established between 95.216.141.241[95.216.141.241]...138.201.192.11[138.201.192.11]
Oct  2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 06[IKE] CHILD_SA hetzner-hel1-01.torproject.org-macrum.torproject.org{140} established with SPIs c93eb90f_i c8942f69_o and TS 95.216.141.241/32 2a01:4f9:c010:5f1::1/128 === 138.201.192.11/32 138.201.212.224/28 172.30.133.0/24 2a01:4f8:172:39ca::2/128 2a01:4f8:172:39ca:0:dad3::/96
Oct  2 19:05:41 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT: macrum;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 25.46 ms

19:04:00 is (basically) when macrum went down (syslog stopped writing at 19:03:59), and it brought back up its strongswan service at 19:05:26, so it took less than 5 seconds for nagios to reconnect.

so there's definitely something wrong with strongswan on that prometheus server. i've closed the ticket with hetzner as it now seems obvious they are not the cause of this problem.

for what it's worth, cycling through the tunnels up/down works fine on the prom server, so that's not a problem. configurations are fairly identical between the nagios and prom servers too: only the peers definition (obviously) is different.

will keep on digging.

there was just another outage between macrum and nbg1, and the trouble wasn't with the network: if i ipsec down on both sides, packets flow again, and same when i ipsec up back.

So a long running ping on the two hosts shows this is distinctly a problem with nbg1. Here's the view from the nagios server:

┌──── hetzner-nbg1-01.torproject.org ping statistics ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1121259 packets transmitted, 1090350 received, 2,76% packet loss, time 25923162,1ms                                                                                 │
│ RTT[ms]: min = 23, median = 24, p(95) = 24, max = 31                                                                                                                │
│ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
┌──── macrum.torproject.org ping statistics ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1121259 packets transmitted, 1121151 received, 0,01% packet loss, time 28170013,1ms                                                                                 │
│ RTT[ms]: min = 25, median = 25, p(95) = 25, max = 31                                                                                                                │
│ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
┌──── bungei.torproject.org ping statistics ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1121259 packets transmitted, 1121109 received, 0,01% packet loss, time 481253,4ms                                                                                   │
│ RTT[ms]: min = 0, median = 0, p(95) = 0, max = 5                                                                                                                    │
│ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

and here's the view from nbg1:

┌──── hetzner-hel1-01.torproject.org ping statistics ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1121178 packets transmitted, 1090274 received, 2,76% packet loss, time 25885640,9ms                                                                                 │
│ RTT[ms]: min = 23, median = 24, p(95) = 24, max = 30                                                                                                                │
│ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
┌──── macrum.torproject.org ping statistics ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1121178 packets transmitted, 1033877 received, 7,79% packet loss, time 2914262,8ms                                                                                  │
│ RTT[ms]: min = 3, median = 3, p(95) = 3, max = 8                                                                                                                    │
│ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
┌──── bungei.torproject.org ping statistics ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1121178 packets transmitted, 1121025 received, 0,01% packet loss, time 26312314,6ms                                                                                 │
│ RTT[ms]: min = 23, median = 23, p(95) = 23, max = 30                                                                                                                │
│ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

in other words, nbg1 and hel1 have trouble reaching each other, and nbg1 has trouble reaching macrum, but hel1 does not have trouble reaching macrum, which would point at a problem with nbg1. neither machines have trouble with bungei, which is out of the vpn too, which seem to point at a problem with ipsec as well.

weasel setup checks inside the onionoo infra to check their tunnels to reproduce a possible similar problem there, so I followed that lead and setup a similar set of checks in nagios for nbg1.

as I can't figure out the network issue, i'm trying another tack. i've extended the scrape_interval from 15s to 5m while raising the retention_period from 30d to 365d. the latter shouldn't take effect for 30 days while the former will finish converting the database within 30 days. if, after 30 days, we still have this problem, we'll know this is not because of the aggressive retention interval and we might want to consider setting up a secondary server (#31244 (moved)) to see if it can reproduce this problem.

or, as the commitlog said:

origin/master 7cda3928fe9c6bf83ee3e8977b74d58acbb7519a
Author:     Antoine Beaupré <anarcat@debian.org>
AuthorDate: Tue Oct 22 13:46:05 2019 -0400
Commit:     Antoine Beaupré <anarcat@debian.org>
CommitDate: Tue Oct 22 13:46:05 2019 -0400

Parent:     91e379a5 make all mpm_worker paramaters configurable
Merged:     master sudo-ldap
Contained:  master

downgrade scrape interval on internal prometheus server (#31916)

This is an attempt at fixing the reliability issues on the prometheus
server detailed in #31916. The current theory is that ipsec might be
the culprit, but it's also possible that the prometheus is overloaded
and that's creating all sorts of other, unrelated problems.

This is sidetracking the setup of a *separate* long term monitoring
server (#31244), of course, but I'm not sure that's really necessary
for now. Since we don't use prometheus for alerting (#29864), we don't
absolutely /need/ redundancy here so we can afford a SPOF for
Prometheus while we figure out this bug.

If, in thirday days, we still have reliability problems, we will know
this is not due to the retention period and can cycle back to the
other solutions, including creating a secondary server to see if it
reproduces the problem.

1 file changed, 2 insertions(+), 1 deletion(-)
modules/profile/manifests/prometheus/server/internal.pp | 3 ++-

modified   modules/profile/manifests/prometheus/server/internal.pp
@@ -42,7 +42,8 @@ class profile::prometheus::server::internal (
     vhost_name          => $vhost_name,
     collect_scrape_jobs => $collect_scrape_jobs,
     scrape_configs      => $scrape_configs,
-    storage_retention   => '30d',
+    storage_retention   => '365d',
+    scrape_interval     => '5m',
   }
   # expose our IP address to exporters so they can allow us in
   #

Trac:
Status: assigned to needs_review

Trac:
Status: needs_review to needs_information

just turned off ipsec on that host to see if that affects it.

before, ping, as seen from nbg1:

--- hetzner-hel1-01.torproject.org ping statistics ---
1296649 packets transmitted, 1261574 received, 2% packet loss, time 1299266373ms
rtt min/avg/max/mdev = 23.358/24.190/7731.164/37.891 ms, pipe 8

noping:

--- hetzner-hel1-01.torproject.org ping statistics ---
1298788 packets transmitted, 1262968 received, 2,76% packet loss, time 30172995,2ms
RTT[ms]: min = 23, median = 24, p(95) = 24, max = 34

--- macrum.torproject.org ping statistics ---
1298788 packets transmitted, 1258030 received, 3,14% packet loss, time 3645717,0ms
RTT[ms]: min = 3, median = 3, p(95) = 3, max = 10

--- bungei.torproject.org ping statistics ---
1298788 packets transmitted, 1298338 received, 0,03% packet loss, time 30554619,0ms
RTT[ms]: min = 23, median = 24, p(95) = 24, max = 33

from hel1:

--- hetzner-nbg1-01.torproject.org ping statistics ---
3023774 packets transmitted, 2938980 received, 2,80% packet loss, time 70054412,4ms
RTT[ms]: min = 23, median = 24, p(95) = 24, max = 32

--- macrum.torproject.org ping statistics ---
3023774 packets transmitted, 3023132 received, 0,02% packet loss, time 76052419,4ms
RTT[ms]: min = 25, median = 25, p(95) = 25, max = 32

--- bungei.torproject.org ping statistics ---
3023774 packets transmitted, 3023570 received, 0,01% packet loss, time 1292354,7ms
RTT[ms]: min = 0, median = 0, p(95) = 1, max = 1