The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is seeing intermittent networking issues, and it's proving very difficult to get reliable metrics out of it. From its perspective, random hosts blink in and out of existence, with almost all hosts (63 of the ~80 monitored) affected over a period of a week. This leads me to believe the problem is not with the monitored hosts themselves, but with the monitoring server. The attached screenshot (tpo-overview.png) shows the randomness of the problem, as seen from hetzner-nbg1-01.torproject.org during the last 7 days.
We have another monitoring server hosted in the Hetzner cloud (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the same problems. From its perspective, most hosts are healthy over the same period, with an average availability of 99.876% across all hosts, including at least one outlier at 88%. The other (nagios) monitoring server sees the new monitoring server with only 99.728% availability, i.e. a total of about 30 minutes of downtime over the last 7 days. Note that those statistics have a large margin of error, as the Nagios checks are much less frequent than the Prometheus ones, with a granularity in the tens of minutes instead of seconds.
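For reference, availability figures like these are the kind of thing that can be pulled straight out of Prometheus; here is a minimal sketch of such a query, assuming the standard `up` metric and a hypothetical `node` job label (not necessarily how the numbers above were produced):

```sh
# sketch: 7-day per-host availability, in percent, from the Prometheus HTTP API;
# the job label "node" is an assumption, adjust to whatever the scrape jobs use
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg by (instance) (avg_over_time(up{job="node"}[7d])) * 100'
```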
The alert history graph (second attachment, histogram.cgi-nbg1-01.png) shows the problem more clearly, especially when compared with a similar host in the vicinity (hetzner-nbg1-02, third attachment, histogram.cgi-nbg1-02.png).
I would therefore conclude there is a severe and intermittent routing issue with this server.
I filed this as an issue in the Hetzner "cloud" web interface and am waiting for feedback.
hetzner responded by asking for error messages, so I sent them more logs from nagios:
we see various errors from the nagios monitoring server (hetzner-hel1-01.torproject.org) when looking at that host. here's an example, from yesterday, of pings failing for about 15 minutes:

    [2019-10-01 16:35:44] SERVICE ALERT: hetzner-nbg1-01;PING;CRITICAL;SOFT;1;PING CRITICAL - Packet loss = 100%
    [2019-10-01 16:36:04] HOST ALERT: hetzner-nbg1-01;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
    [2019-10-01 16:36:34] SERVICE ALERT: hetzner-nbg1-01;process - apache2 - master;CRITICAL;HARD;1;CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
    [2019-10-01 16:36:44] SERVICE ALERT: hetzner-nbg1-01;PING;CRITICAL;HARD;1;PING CRITICAL - Packet loss = 100%
    [2019-10-01 16:37:24] HOST ALERT: hetzner-nbg1-01;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
    [2019-10-01 16:38:44] HOST ALERT: hetzner-nbg1-01;DOWN;SOFT;3;PING CRITICAL - Packet loss = 100%
    [2019-10-01 16:40:04] HOST ALERT: hetzner-nbg1-01;DOWN;SOFT;4;PING CRITICAL - Packet loss = 100%
    [2019-10-01 16:40:14] HOST ALERT: hetzner-nbg1-01;UP;SOFT;5;PING OK - Packet loss = 0%, RTA = 26.97 ms
    [2019-10-01 16:41:34] SERVICE ALERT: hetzner-nbg1-01;PING;OK;HARD;1;PING OK - Packet loss = 0%, RTA = 23.79 ms
    [2019-10-01 16:50:44] SERVICE ALERT: hetzner-nbg1-01;process - apache2 - master;OK;HARD;1;PROCS OK: 1 process with UID = 0 (root), args '/usr/sbin/apache2'

I could run a cross-ping between the two servers in a screen session to try and diagnose this better for you, but from what I can tell, the packets just get dropped to the floor somewhere.
I've started a cross-ping between the nagios and prometheus servers to see if this can confirm the packet loss issue.
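The cross-ping itself is nothing fancy; roughly the following on each side, with the other host as the target. The exact invocation below is an assumption (a sketch, not a copy of what is actually running):

```sh
# keep a 1-second ping running in a detached screen session, with kernel
# timestamps (-D) so that gaps can be correlated with the charon/icinga logs
screen -dmS crossping sh -c \
  'ping -D -i 1 hetzner-hel1-01.torproject.org > /var/tmp/crossping-hel1.log 2>&1'
```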
I saw a case of downtime, live: both servers couldn't ping each other for a few minutes. this is what happened in the logs, on nbg1:
    Oct 2 19:03:20 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 09[IKE] establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org{6}
    Oct 2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 08[IKE] 95.216.141.241 is initiating an IKE_SA
    Oct 2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 13[IKE] IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[54] established between 195.201.139.202[195.201.139.202]...95.216.141.241[95.216.141.241]
    Oct 2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 13[IKE] CHILD_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org{164} established with SPIs c0bc9534_i cf59efd5_o and TS 195.201.139.202/32 2a01:4f8:c2c:1e17::1/128 === 95.216.141.241/32 2a01:4f9:c010:5f1::1/128
... i.e. a new session, negotiated in 6 seconds. that's pretty slow, but tolerable, I guess. the problem is that, from nagios' point of view, this was a much longer downtime:
    Oct 2 18:57:31 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 10[IKE] establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org{6}
    Oct 2 19:00:16 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 16[IKE] initiating IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[44] to 195.201.139.202
i.e. on the nagios side the rekey started 6 minutes earlier, and took almost 3 more minutes just to get to the initiating stage. then nagios noticed the node was down, naturally:
    Oct 2 19:01:11 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE ALERT: hetzner-nbg1-01;process - apache2 - worker;CRITICAL;SOFT;1;CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
    Oct 2 19:01:31 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT: hetzner-nbg1-01;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
then, another three minutes later, ipsec figured it out and fixed the outage:
    Oct 2 19:03:02 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 12[IKE] initiating IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[45] to 195.201.139.202
    Oct 2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 14[IKE] establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org{6}
    Oct 2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 08[IKE] IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[45] established between 95.216.141.241[95.216.141.241]...195.201.139.202[195.201.139.202]
    Oct 2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 08[IKE] CHILD_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org{139} established with SPIs cf59efd5_i c0bc9534_o and TS 95.216.141.241/32 2a01:4f9:c010:5f1::1/128 === 195.201.139.202/32 2a01:4f8:c2c:1e17::1/128
    Oct 2 19:03:41 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT: hetzner-nbg1-01;UP;SOFT;4;PING OK - Packet loss = 0%, RTA = 26.98 ms
A similar problem occurred when macrum rebooted. From nbg1's point of view, it went down, for sure, but then it took forever to rekey, as it was stuck in this state:
At that point, the tunnel had already been down for a while - macrum was rebooted at 19:04 - but it took a full 20 minutes for ipsec to recover:
    Oct 2 19:18:45 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 16[IKE] establishing CHILD_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org{2}
    Oct 2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 14[IKE] initiating IKE_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org[55] to 138.201.192.11
    Oct 2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 12[IKE] establishing CHILD_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org{2}
    Oct 2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 10[IKE] IKE_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org[55] established between 195.201.139.202[195.201.139.202]...138.201.192.11[138.201.192.11]
    Oct 2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 10[IKE] CHILD_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org{166} established with SPIs cfe1be1c_i c2ce0e71_o and TS 195.201.139.202/32 2a01:4f8:c2c:1e17::1/128 === 138.201.192.11/32 138.201.212.224/28 172.30.133.0/24 2a01:4f8:172:39ca::2/128 2a01:4f8:172:39ca:0:dad3::/96
in comparison, nagios barely flinched when the server went down:
    Oct 2 19:04:00 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 15[IKE] deleting IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[31] between 95.216.141.241[95.216.141.241]...138.201.192.11[138.201.192.11]
    Oct 2 19:04:00 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 05[IKE] initiating IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[46] to 138.201.192.11
    Oct 2 19:04:31 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT: macrum;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
    Oct 2 19:05:21 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE ALERT: macrum;unwanted process - postgrey;CRITICAL;HARD;1;CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
    Oct 2 19:05:21 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE ALERT: macrum;SSL cert - host;CRITICAL;HARD;1;CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
    Oct 2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 15[IKE] establishing CHILD_SA hetzner-hel1-01.torproject.org-macrum.torproject.org{2}
    Oct 2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 06[IKE] IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[46] established between 95.216.141.241[95.216.141.241]...138.201.192.11[138.201.192.11]
    Oct 2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 06[IKE] CHILD_SA hetzner-hel1-01.torproject.org-macrum.torproject.org{140} established with SPIs c93eb90f_i c8942f69_o and TS 95.216.141.241/32 2a01:4f9:c010:5f1::1/128 === 138.201.192.11/32 138.201.212.224/28 172.30.133.0/24 2a01:4f8:172:39ca::2/128 2a01:4f8:172:39ca:0:dad3::/96
    Oct 2 19:05:41 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT: macrum;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 25.46 ms
19:04:00 is (basically) when macrum went down (its syslog stopped writing at 19:03:59), and it brought its strongswan service back up at 19:05:26, so it took less than 5 seconds for the nagios server to re-establish the tunnel.
so there's definitely something wrong with strongswan on that prometheus server. I've closed the ticket with hetzner, as it now seems obvious they are not the cause of this problem.
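For the record, the SA state on the prometheus server can be inspected with the stock strongSwan tooling while this is happening; a minimal sketch (the connection name is lifted from the charon logs above and is assumed to match the actual configuration):

```sh
# dump the IKE/CHILD SA state for the macrum tunnel, plus the recent charon chatter
ipsec statusall hetzner-nbg1-01.torproject.org-macrum.torproject.org
grep 'charon' /var/log/syslog | tail -n 50
```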
for what it's worth, cycling the tunnels up and down works fine on the prom server, so that's not the problem. configurations are essentially identical between the nagios and prom servers too: only the peer definitions (obviously) differ.
there was just another outage between macrum and nbg1, and the trouble wasn't with the network: if I run ipsec down on both sides, packets flow again, and they keep flowing when I bring the tunnels back up with ipsec up.
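In other words, the workaround amounts to bouncing the tunnel by hand; schematically, on the nbg1 side (the connection name is again taken from the logs above, not from the actual config):

```sh
# tear the tunnel down, confirm packets flow in the clear, then bring it back
ipsec down hetzner-nbg1-01.torproject.org-macrum.torproject.org
ping -c 5 macrum.torproject.org   # works while the tunnel is down
ipsec up hetzner-nbg1-01.torproject.org-macrum.torproject.org
ping -c 5 macrum.torproject.org   # and still works once the CHILD_SA is re-established
```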
So a long-running ping on the two hosts shows this is distinctly a problem with nbg1. Here's the view from the nagios server:
    hetzner-nbg1-01.torproject.org ping statistics
      1121259 packets transmitted, 1090350 received, 2,76% packet loss, time 25923162,1ms
      RTT[ms]: min = 23, median = 24, p(95) = 24, max = 31

    macrum.torproject.org ping statistics
      1121259 packets transmitted, 1121151 received, 0,01% packet loss, time 28170013,1ms
      RTT[ms]: min = 25, median = 25, p(95) = 25, max = 31

    bungei.torproject.org ping statistics
      1121259 packets transmitted, 1121109 received, 0,01% packet loss, time 481253,4ms
      RTT[ms]: min = 0, median = 0, p(95) = 0, max = 5
and here's the view from nbg1:
    hetzner-hel1-01.torproject.org ping statistics
      1121178 packets transmitted, 1090274 received, 2,76% packet loss, time 25885640,9ms
      RTT[ms]: min = 23, median = 24, p(95) = 24, max = 30

    macrum.torproject.org ping statistics
      1121178 packets transmitted, 1033877 received, 7,79% packet loss, time 2914262,8ms
      RTT[ms]: min = 3, median = 3, p(95) = 3, max = 8

    bungei.torproject.org ping statistics
      1121178 packets transmitted, 1121025 received, 0,01% packet loss, time 26312314,6ms
      RTT[ms]: min = 23, median = 23, p(95) = 23, max = 30
in other words, nbg1 and hel1 have trouble reaching each other, and nbg1 has trouble reaching macrum, but hel1 does not have trouble reaching macrum, which points at a problem with nbg1. neither machine has trouble with bungei, which is outside the VPN, which also seems to point at a problem with ipsec.
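Should this come back, one quick way to double-check that split from nbg1 is to compare loss on the encrypted path (macrum) with loss to the non-VPN host (bungei); a sketch, assuming mtr is available on the host:

```sh
# 100-cycle report mode: the VPN path should show the ~8% loss, bungei should not
mtr --report --report-cycles 100 macrum.torproject.org
mtr --report --report-cycles 100 bungei.torproject.org
```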
weasel set up checks inside the onionoo infra to monitor their tunnels and try to reproduce a possibly similar problem there, so I followed that lead and set up a similar set of checks in nagios for nbg1.
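I haven't copied the actual check definition here, but the general shape of such a tunnel check is simple enough; a hypothetical sketch (not the deployed check, and the ESTABLISHED match is an assumption about the ipsec status output):

```sh
#!/bin/sh
# check_ipsec_tunnel: hypothetical nagios-style plugin sketch; takes a
# strongswan connection name and reports CRITICAL when no IKE_SA is established
conn="$1"
if ipsec status "$conn" | grep -q ESTABLISHED; then
    echo "OK: $conn has an established IKE_SA"
    exit 0
else
    echo "CRITICAL: $conn has no established IKE_SA"
    exit 2
fi
```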
as I can't figure out the network issue, I'm trying another tack. I've extended the scrape_interval from 15s to 5m while raising the retention period from 30d to 365d. the latter shouldn't take effect for 30 days, while the former will have finished converting the database within 30 days. if, after 30 days, we still have this problem, we'll know it is not because of the aggressive scrape interval, and we might want to consider setting up a secondary server (#31244 (moved)) to see if it can reproduce this problem.
or, as the commit log said:
    origin/master 7cda3928fe9c6bf83ee3e8977b74d58acbb7519a
    Author:     Antoine Beaupré <anarcat@debian.org>
    AuthorDate: Tue Oct 22 13:46:05 2019 -0400
    Commit:     Antoine Beaupré <anarcat@debian.org>
    CommitDate: Tue Oct 22 13:46:05 2019 -0400
    Parent:     91e379a5 make all mpm_worker paramaters configurable
    Merged:     master sudo-ldap
    Contained:  master

    downgrade scrape interval on internal prometheus server (#31916)

    This is an attempt at fixing the reliability issues on the prometheus
    server detailed in #31916. The current theory is that ipsec might be
    the culprit, but it's also possible that the prometheus is overloaded
    and that's creating all sorts of other, unrelated problems.

    This is sidetracking the setup of a *separate* long term monitoring
    server (#31244), of course, but I'm not sure that's really necessary
    for now. Since we don't use prometheus for alerting (#29864), we don't
    absolutely /need/ redundancy here so we can afford a SPOF for
    Prometheus while we figure out this bug.

    If, in thirty days, we still have reliability problems, we will know
    this is not due to the retention period and can cycle back to the
    other solutions, including creating a secondary server to see if it
    reproduces the problem.

    1 file changed, 2 insertions(+), 1 deletion(-)
    modules/profile/manifests/prometheus/server/internal.pp | 3 ++-

    modified   modules/profile/manifests/prometheus/server/internal.pp
    @@ -42,7 +42,8 @@ class profile::prometheus::server::internal (
         vhost_name           => $vhost_name,
         collect_scrape_jobs  => $collect_scrape_jobs,
         scrape_configs       => $scrape_configs,
    -    storage_retention    => '30d',
    +    storage_retention    => '365d',
    +    scrape_interval      => '5m',
       }
       # expose our IP address to exporters so they can allow us in
       #
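Once puppet has applied that change, it can be double-checked on the server itself; a rough sketch (paths and flag names are the Debian package defaults, which is an assumption here):

```sh
# confirm the new scrape interval made it into the config and that the daemon
# is running with the longer retention
grep -n 'scrape_interval' /etc/prometheus/prometheus.yml
ps -o args= -C prometheus | tr ' ' '\n' | grep -i retention
```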