Opened 2 months ago

Closed 7 weeks ago

#29822 closed defect (fixed)

prometheus server cannot reach build-arm* boxes

Reported by: anarcat
Owned by: weasel
Priority: Medium
Component: Internal Services/Tor Sysadmin Team
Severity: Minor
Parent ID: #29681

Description

The build-arm-0[1-3].torproject.org boxes are behind NAT (or some sort of firewall?), which makes them unreachable from the global internet. They are therefore not monitored from the Prometheus server right now, although they *are* reachable from the Nagios server.

We need to set up a configuration similar to the Nagios one so those boxes get scraped like the other ones.
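A quick way to confirm the symptom from the Prometheus server is a plain TCP connection test against each box. This is a minimal sketch. It assumes the boxes run prometheus-node-exporter on its default port (9100), and it expands the build-arm-0[1-3] pattern into three hostnames. `is_reachable` and `BUILD_ARM_HOSTS` are illustrative names, not existing tooling.

```python
import socket

# Hostnames expanded from the build-arm-0[1-3] pattern in the description.
BUILD_ARM_HOSTS = [f"build-arm-{i:02d}.torproject.org" for i in range(1, 4)]

# Assumption: node exporter listens on its default port on these boxes.
NODE_EXPORTER_PORT = 9100


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts
        # (socket.timeout and socket.gaierror are OSError subclasses).
        return False


def check_all(hosts: list[str], port: int) -> dict[str, bool]:
    """Map each host to whether host:port accepted a TCP connection."""
    return {host: is_reachable(host, port) for host in hosts}
```

Run from the Prometheus server, `check_all(BUILD_ARM_HOSTS, NODE_EXPORTER_PORT)` would be expected to report `False` for every box while the NAT issue persists, and `True` from the Nagios server.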


Change History (7)

comment:1 Changed 2 months ago by anarcat

From what I gathered, the ARM boxes share an IPsec VPN with each other and the nagios server (and maybe other machines). There seems to be a gateway box (mikrotik.sbg.torproject.org) that creates that network and gives the monitoring server access to it. That gateway is not in Puppet or LDAP, and I do not believe I have access to it: I can reach it over SSH, but my SSH key is not recognized.

It seems there is also an IPsec VPN interconnecting macrum, kvm4, kvm5, textile and unifolium, but not moly. That part is configured in Puppet and fully accessible, so technically it *might* be possible to route through that VPN towards the gateway box, but I'm hesitant to mess around with that.

comment:2 Changed 2 months ago by anarcat

another approach: https://github.com/RobustPerception/PushProx

we also talked about wireguard and openvpn as alternatives. There are also Tor hidden services, which can obviously bypass NAT, and many other solutions.


comment:3 Changed 7 weeks ago by anarcat

Owner: changed from tpa to anarcat
Status: changed from new to assigned

comment:4 Changed 7 weeks ago by anarcat

Parent ID: #29681

comment:5 Changed 7 weeks ago by anarcat

Owner: changed from anarcat to weasel

I have tried setting up IPsec on nbg1, and it mostly works when connecting to the other TPO boxes. I've documented what I did in the wiki, but mostly I deployed everything through Puppet, following the existing configs, and rebooted the monitoring server. I then ran Puppet on all the other Puppet nodes, and things generally seem to work.

Unfortunately, this doesn't bypass NAT: I cannot ping the ARM boxes behind the mikrotik server. I assume I also need the local peers configuration that is deployed on the other hosts.

I have tried adding the following static configuration:

conn hetzner-nbg1-01.torproject.org-mikrotik.sbg.torproject.org
  ike = aes128-sha256-modp3072
  #type = tunnel

  left       = 195.201.139.202
  leftsubnet = 195.201.139.202/32, 172.30.142.0/24

  right = 141.201.12.27
  rightallowany = yes
  rightid     = mikrotik.sbg.torproject.org
  rightsubnet = 172.30.115.0/24

  auto = route

  forceencaps = yes
  dpdaction = hold
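Since the left subnet above was chosen arbitrarily, it can at least be sanity-checked locally before bringing the tunnel up. This is a minimal sketch using Python's `ipaddress` module. It confirms that the invented 172.30.142.0/24 is private address space and does not collide with the remote 172.30.115.0/24. It cannot tell whether the mikrotik expects that subnet, because the gateway has to be configured with a matching selector on its side. The helper names are hypothetical.

```python
import ipaddress

# Traffic selectors copied from the ipsec.conf snippet above.
LEFT_SUBNETS = ["195.201.139.202/32", "172.30.142.0/24"]  # second one was made up
RIGHT_SUBNETS = ["172.30.115.0/24"]


def overlaps(a: list[str], b: list[str]) -> list[tuple[str, str]]:
    """Return every (left, right) pair of networks that overlap."""
    return [
        (x, y)
        for x in a
        for y in b
        if ipaddress.ip_network(x).overlaps(ipaddress.ip_network(y))
    ]


def non_private(nets: list[str]) -> list[str]:
    """Return the networks that fall outside private (e.g. RFC 1918) space."""
    return [n for n in nets if not ipaddress.ip_network(n).is_private]
```

With the values above, `overlaps(LEFT_SUBNETS, RIGHT_SUBNETS)` is empty. `non_private(LEFT_SUBNETS)` only flags the host's own public /32, which is expected.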

I made up 172.30.142.0/24 because I didn't know what to put there. Trying to bring up that connection fails:

root@hetzner-nbg1-01:/etc/ipsec.conf.d# ipsec reload
Reloading strongSwan IPsec configuration...
root@hetzner-nbg1-01:/etc/ipsec.conf.d# ipsec up hetzner-nbg1-01.torproject.org-mikrotik.sbg.torproject.org
retransmit 3 of request with message ID 0
sending packet: from 195.201.139.202[500] to 141.201.12.27[500] (1300 bytes)
retransmit 4 of request with message ID 0
sending packet: from 195.201.139.202[500] to 141.201.12.27[500] (1300 bytes)
retransmit 5 of request with message ID 0
sending packet: from 195.201.139.202[500] to 141.201.12.27[500] (1300 bytes)
giving up after 5 retransmits
establishing IKE_SA failed, peer not responding
establishing connection 'hetzner-nbg1-01.torproject.org-mikrotik.sbg.torproject.org' failed
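The log shows silence rather than a rejection: charon keeps resending its first IKE request and never gets a reply. The following sketch shows how long that takes before it gives up. It assumes strongSwan's documented default retransmission settings (`retransmit_timeout` 4.0 s, `retransmit_base` 1.8, `retransmit_tries` 5), under which the n-th wait is timeout × base^n. The host may override these in strongswan.conf.

```python
# Assumption: strongSwan's default charon retransmission settings are in use.
RETRANSMIT_TIMEOUT = 4.0  # charon.retransmit_timeout, in seconds
RETRANSMIT_BASE = 1.8     # charon.retransmit_base
RETRANSMIT_TRIES = 5      # charon.retransmit_tries


def retransmit_schedule(timeout: float = RETRANSMIT_TIMEOUT,
                        base: float = RETRANSMIT_BASE,
                        tries: int = RETRANSMIT_TRIES) -> list[tuple[float, float]]:
    """Return (relative, absolute) waits in seconds.

    There is one wait after the initial send, then one after each of the
    `tries` retransmits; after the last wait charon gives up.
    """
    schedule = []
    elapsed = 0.0
    for n in range(tries + 1):
        wait = timeout * base ** n
        elapsed += wait
        schedule.append((wait, elapsed))
    return schedule
```

With the defaults, the waits are roughly 4, 7, 13, 23, 42 and 76 seconds. That adds up to about 165 seconds before "giving up after 5 retransmits", all spent waiting on a peer that never answers on UDP port 500.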

It looks like the mikrotik server is not answering us at all. I have also tried to connect to it as documented in tor-passwords, to no avail:

Authenticated to kvm4.torproject.org ([2a01:4f8:10b:239f::2]:22).
debug1: channel_connect_stdio_fwd mikrotik.sbg.torproject.org:22
debug1: channel 0: new [stdio-forward]
debug1: getpeername failed: Bad file descriptor
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
channel 0: open failed: connect failed: Connection timed out
stdio forwarding failed
ssh_exchange_identification: Connection closed by remote host
"ssh -v4 -J kvm4.torproject.org admin@mikrotik.sbg.torproject.org" took 2 mins 12 secs

So it seems I am missing part of the configuration, namely the mikrotik server side, and I don't seem to have the access needed to set it up.

Reassigning to weasel so he can hold my hand for that last step. :)

comment:6 Changed 7 weeks ago by anarcat

also note that I researched possible alternatives to VPNs for NAT bypass, and they were not quite satisfactory:

  • Prometheus Pushgateway: keeps *all* metrics on the gateway unless pruned, and doesn't integrate directly with other exporters (see this discussion for details)
  • pagekite: requires setting up a server on the Prometheus server and clients on all affected boxes, which may be more work than IPsec
  • PushProx: not packaged in Debian, no stable release, also requires clients on all affected boxes and extra server

This leaves us with other NAT bypass mechanisms as alternatives, namely Tor hidden services, wireguard, openvpn or tinc. For now, let's see if we can make the current solution work correctly.
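To illustrate the Pushgateway objection, here is a toy in-memory model. It is not the real Pushgateway code. It follows the grouping semantics described in the Pushgateway README: a PUT replaces every metric in its group, a DELETE removes the group, and nothing expires on its own. The class name and the grouping on (job, instance) are simplifications for illustration.

```python
class ToyPushgateway:
    """Simplified model of Pushgateway grouping; not its actual implementation."""

    def __init__(self) -> None:
        # (job, instance) grouping key -> {metric name: value}
        self._groups: dict[tuple[str, str], dict[str, float]] = {}

    def put(self, job: str, instance: str, metrics: dict[str, float]) -> None:
        """Replace all metrics of the (job, instance) group, like an HTTP PUT."""
        self._groups[(job, instance)] = dict(metrics)

    def delete(self, job: str, instance: str) -> None:
        """Prune the group; the only way its metrics ever disappear."""
        self._groups.pop((job, instance), None)

    def scrape(self) -> dict[tuple[str, str, str], float]:
        """What Prometheus would see when scraping the gateway."""
        return {
            (job, instance, name): value
            for (job, instance), metrics in self._groups.items()
            for name, value in metrics.items()
        }
```

If a box pushes `{"node_load1": 0.5}` once and then dies, `scrape()` keeps returning that last value indefinitely. A dead box therefore looks alive until someone calls `delete`, which is exactly the pruning burden we want to avoid.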


comment:7 Changed 7 weeks ago by anarcat

Resolution: fixed
Status: changed from assigned to closed

weasel fixed this by logging in through the ARM boxes; for some reason, the KVM boxes can no longer reach the mikrotik directly. He did the configuration on the mikrotik, and Prometheus can now scrape those metrics. I documented the process in the wiki, and we're all done here.
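To keep an eye on the fix, the Prometheus HTTP API's `/api/v1/targets` endpoint reports a `health` field ("up", "down" or "unknown") for every active target. This is a minimal sketch that extracts that field for the build-arm targets. `PROMETHEUS_URL` is a hypothetical placeholder. The sketch also assumes the targets' `instance` label starts with the hostname (e.g. `build-arm-01.torproject.org:9100`).

```python
import json
import urllib.request

# Hypothetical base URL; point it at the real Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"


def fetch_targets(base_url: str = PROMETHEUS_URL) -> dict:
    """Return the parsed JSON body of the /api/v1/targets endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def target_health(payload: dict, host_prefix: str = "build-arm-") -> dict[str, str]:
    """Map each active target whose instance starts with host_prefix to its health."""
    return {
        t["labels"]["instance"]: t["health"]
        for t in payload["data"]["activeTargets"]
        if t["labels"].get("instance", "").startswith(host_prefix)
    }
```

For example, `{i: h for i, h in target_health(fetch_targets()).items() if h != "up"}` would list any build-arm target that has dropped off again.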
