reliability issues with hetzner-nbg1-01
The new Prometheus monitoring server (hetzner-nbg1-01.torproject.org) is seeing intermittent networking issues, and it's proving very difficult to get reliable metrics out of it. From its perspective, random hosts blink in and out of existence, and almost all hosts (63 of the ~80 monitored) were affected over a period of a week. This leads me to believe the problem is not with all those hosts, but with the monitoring server itself. The attached screenshot (tpo-overview.png) shows the randomness of the problem, as seen from hetzner-nbg1-01.torproject.org during the last 7 days.
We have another monitoring server hosted in the Hetzner cloud (hetzner-hel1-01.torproject.org) which doesn't seem to suffer from the same problems. From its perspective, most hosts are healthy over the same period, with an average availability of 99.876% across all hosts, including at least one outlier at 88%. The other (Nagios) monitoring server sees the new monitoring server with only 99.728% availability, for a total of about 30 minutes of downtime over the last 7 days. Note that those statistics have a large margin of error, as the Nagios checks are much less frequent than the Prometheus ones, with a granularity in the tens of minutes instead of seconds.
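As a sanity check on the figures above, here is a minimal sketch of the conversion between an availability percentage and downtime over a window. The function names are hypothetical and only illustrate the arithmetic; they are not part of Nagios or Prometheus.

```python
# Hypothetical helpers: convert between availability (%) and downtime
# over a fixed observation window. Pure arithmetic, no monitoring API.

MINUTES_PER_DAY = 24 * 60


def downtime_minutes(availability_percent: float, window_days: float) -> float:
    """Minutes of downtime implied by an availability percentage."""
    window_minutes = window_days * MINUTES_PER_DAY
    return (100.0 - availability_percent) / 100.0 * window_minutes


def availability_percent(downtime_min: float, window_days: float) -> float:
    """Availability percentage implied by a number of minutes of downtime."""
    window_minutes = window_days * MINUTES_PER_DAY
    return 100.0 * (1.0 - downtime_min / window_minutes)


if __name__ == "__main__":
    # Figures quoted in this ticket, over a 7-day window.
    print(f"{downtime_minutes(99.728, 7):.1f} min")  # Nagios view of nbg1-01
    print(f"{downtime_minutes(99.876, 7):.1f} min")  # average host seen from hel1-01
```

Over 7 days (10080 minutes), 99.728% works out to roughly 27.4 minutes of downtime, which is consistent with the "30 minutes" figure given the coarse Nagios check interval.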
The alert history graph (second attachment, histogram.cgi-nbg1-01.png) shows the problem more clearly, especially when compared to a similar host in the vicinity (hetzner-nbg01-02, third attachment, histogram.cgi-nbg1-02.png).
I would therefore conclude there is a severe and intermittent routing issue with this server.
I filed this as an issue in the Hetzner "cloud" web interface and am waiting for feedback.