Opened 6 months ago

Closed 6 months ago

#29683 closed defect (fixed)

install prometheus-node-exporter everywhere

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID: #29681
Points:
Reviewer:
Sponsor:

Description

to replace munin (see parent), we need a thing like munin-node everywhere. it's called prometheus-node-exporter.

deploy this through puppet, preferably with a community-maintained module instead of a homegrown one.

Change History (8)

comment:1 Changed 6 months ago by anarcat

i installed prometheus-node-exporter from backports on vineale and the temporary mon server. samples are now being recorded by prometheus. we're just missing grafana (#29684) to have a complete prototype.

of course, this all needs to be put in puppet, but I'd like to do that with a real module instead of reinventing the wheel:

https://forge.puppet.com/puppet/prometheus

this will need some glue with ferm as well - right now i've done this by hand in ferm::per_host but that should go away.

Last edited 6 months ago by anarcat

comment:2 Changed 6 months ago by anarcat

here's an example graph that can already be done with prometheus, showing the 1m load over time:

http://37.218.247.100/graph?g0.range_input=1h&g0.expr=node_load1&g0.tab=0&g1.range_input=1h&g1.expr=&g1.tab=1
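
The link above is an ordinary Prometheus graph URL: the PromQL expression (`node_load1`) and the time range are passed as query parameters. A minimal Python sketch of building such a link, assuming only the parameter names visible in the link itself (`g0.expr`, `g0.range_input`, `g0.tab`); the helper name `graph_url` is made up for illustration and the empty second panel (`g1.*`) is left out:

```python
from urllib.parse import urlencode


def graph_url(base, expr, range_input="1h"):
    """Build a Prometheus graph link for a single panel.

    The parameter names are copied from the link in this comment;
    nothing else about the UI is assumed.
    """
    query = urlencode({
        "g0.range_input": range_input,  # time window shown in the panel
        "g0.expr": expr,                # PromQL expression to plot
        "g0.tab": 0,                    # same tab value as in the link
    })
    return f"{base}/graph?{query}"


# 1-minute load average, as exported by prometheus-node-exporter
print(graph_url("http://37.218.247.100", "node_load1"))
```

Expressions with special characters (braces, quotes, spaces) are percent-encoded by `urlencode`, so the same helper should work for label selectors as well.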

comment:3 Changed 6 months ago by anarcat

"progress": i deployed the prometheus module in 3rdparty which required a bit of wrangling with the code already there. i think it's ready to deploy now, and I'll be following the role/profile/module principle (RPM? hahaha) by creating a "profile" for all our nodes in modules/profile/prometheus/client.pp and then use that in a "role::monitored" class that gets sucked in everywhere.

phew!

comment:4 Changed 6 months ago by anarcat

alright, this is almost done. i've deployed it on vineale, kvm4, and bungei so far, the latter two with puppet.

the next step is to hook that up in a prometheus server deployed through puppet, because I don't want to have to add all those hosts by hand in the prometheus.yml the way I've been doing so far.
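
For illustration only, here is one way to avoid hand-editing the target list without Puppet: generate a target file for Prometheus' file-based service discovery (`file_sd_configs`). This is not the approach taken in this ticket, which collects targets through the Puppet module. The short hostnames come from this comment, port 9100 is the node exporter's default, and the function name and `job` label value are assumptions:

```python
import json

NODE_EXPORTER_PORT = 9100  # default port of prometheus-node-exporter


def file_sd_targets(hosts, job="node"):
    """Return a file_sd-style target group for the given hosts.

    The result is a list of target groups, each with "targets"
    ("host:port" strings) and optional "labels".
    """
    targets = sorted(f"{host}:{NODE_EXPORTER_PORT}" for host in hosts)
    return [{"targets": targets, "labels": {"job": job}}]


# Hosts named in this comment; real deployments would use full hostnames.
groups = file_sd_targets(["vineale", "kvm4", "bungei"])
print(json.dumps(groups, indent=2))
```

The JSON output would be written to a file referenced from a `file_sd_configs` entry in prometheus.yml, so that adding a host means regenerating the file rather than editing the main configuration.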

this will require setting up a new VM for prom (#29389) because puppet isn't running on the current one. it also requires hacking at the puppet module to properly collect resources because upstream has that as a WIP so far:

https://github.com/voxpupuli/puppet-prometheus/pull/141

i've been working on porting that patchset to the latest version and will deploy this on a new VM tomorrow, hopefully completing the Prometheus setup for the time being (!).

comment:5 Changed 6 months ago by anarcat

the collector patchset was more complicated than I expected, and will require more work, which I'll hopefully finish today.

comment:6 Changed 6 months ago by anarcat

i figured out how to fix the collector puppet module and deployed node-exporter everywhere. given the puppet run latency, it will take a few hours for the change to propagate everywhere, but we already have about 20 out of 73 servers being scraped by the central server.

i'll keep this ticket open until all nodes have run correctly. when that is done, i'll finally remove all traces of munin everywhere as well (#29682).

Last edited 6 months ago by anarcat

comment:7 Changed 6 months ago by anarcat

according to the targets list, we have 73 targets scraped by Prometheus. 3 of those (the arm boxes) are unreachable, from what I understand because they are behind NAT. i'm not sure how to fix that, so I'll just punt that to another ticket for now.
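
The same count can be obtained without the web UI from the Prometheus HTTP API, whose `/api/v1/targets` endpoint returns the active targets with a `health` field. A minimal sketch of summarizing such a response; the sample payload and its hostnames are invented for the example, and only the `activeTargets`, `labels`, `scrapeUrl` and `health` fields are relied on:

```python
from collections import Counter


def summarize_targets(payload):
    """Count active targets by health and list those that are not up."""
    active = payload["data"]["activeTargets"]
    by_health = Counter(target["health"] for target in active)
    not_up = sorted(
        target["labels"].get("instance", target["scrapeUrl"])
        for target in active
        if target["health"] != "up"
    )
    return by_health, not_up


# Invented sample in the shape of a /api/v1/targets response.
# A live payload could be fetched with something like:
#   json.load(urllib.request.urlopen(base_url + "/api/v1/targets"))
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"instance": "host-a.example:9100", "job": "node"},
             "scrapeUrl": "http://host-a.example:9100/metrics", "health": "up"},
            {"labels": {"instance": "host-b.example:9100", "job": "node"},
             "scrapeUrl": "http://host-b.example:9100/metrics", "health": "up"},
            {"labels": {"instance": "arm-box.example:9100", "job": "node"},
             "scrapeUrl": "http://arm-box.example:9100/metrics", "health": "down"},
        ]
    },
}

by_health, not_up = summarize_targets(sample)
print(dict(by_health), not_up)
```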

there are technically 74 hosts in Puppet right now, but one of those (hyalinum.torproject.org) hasn't checked in for a long time (see #29796). so I think we can consider this one FIXED YEAH! :)

comment:8 Changed 6 months ago by anarcat

Resolution: fixed
Status: assigned → closed