Opened 9 months ago

Closed 8 months ago

#29676 closed task (fixed)

monitor puppet runs

Reported by: anarcat
Owned by: anarcat
Priority: Low
Component: Internal Services/Tor Sysadmin Team
Severity: Normal

Description

Puppet should be running on all TPO machines. Yet it's not, as far as I can tell, formally monitored in any way.

There should be nagios checks that make sure catalog runs succeed and happen regularly. I deployed such checks at koumbit and it would be fairly easy to do here as well.

Change History (5)

comment:1 Changed 9 months ago by anarcat

started working on that as it's a requirement for stuff like #29682 because otherwise we can't tell if puppet ran everywhere before removing rotten code.

comment:2 Changed 9 months ago by anarcat

friends recommended this project. unfortunately it didn't run out of the box for us, because our puppetmaster doesn't have a report_hash field for some obscure reason. so i patched that out.

now i'm wondering how to deploy this - should that be added to the debian package? how do we make changes to debian packages?

comment:3 Changed 9 months ago by anarcat

the report_hash was empty because we didn't send reports to puppetdb. that is now fixed on the puppetmaster.

still need to figure out how to deploy.

comment:4 Changed 9 months ago by anarcat

another approach to check the puppet runs is https://github.com/aswen/nagios-plugins/blob/master/check_puppet_agent which runs on the nodes instead of the puppetmaster. interesting because the check is attached to each node instead of bundled up all in one check on the puppetmaster. downside is a puppetmaster failure will create a lot of noise, but that can be solved with service dependencies.

finally of course, this could be done in prometheus as well:

https://forge.puppet.com/puppet/prometheus_reporter
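
a minimal sketch of what a node-side check could look like, in python. this is not the code of the script linked above. it assumes the agent's last run summary has already been parsed into a dict with `time.last_run` (epoch seconds) and `events.failure` fields; the function name `check_agent` and both age thresholds are made up for illustration.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_agent(summary, now=None, warn_age=4 * 3600, crit_age=24 * 3600):
    """Return (exit_code, message) for one node's last Puppet run.

    `summary` is assumed to be the already-parsed run summary of the
    agent; the thresholds are hypothetical defaults, in seconds.
    """
    now = time.time() if now is None else now
    try:
        last_run = float(summary["time"]["last_run"])
        failures = int(summary["events"]["failure"])
    except (KeyError, TypeError, ValueError):
        return UNKNOWN, "UNKNOWN: run summary is missing expected fields"

    age = now - last_run
    if failures > 0:
        return CRITICAL, "CRITICAL: last run had %d failed events" % failures
    if age > crit_age:
        return CRITICAL, "CRITICAL: last run was %d seconds ago" % age
    if age > warn_age:
        return WARNING, "WARNING: last run was %d seconds ago" % age
    return OK, "OK: last run was %d seconds ago" % age
```

a real plugin would print the message and exit with the returned code, so nagios attaches the result to the node the check runs on.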

comment:5 Changed 8 months ago by anarcat

Resolution: fixed
Status: assigned → closed

i believe this is now complete. i have integrated the check_puppetdb_nodes script which, thanks to help from weasel, was added to the tor-nagios-checks package. thanks to weasel's guidance again, i figured out how to upload that to db.torproject.org and it was deployed everywhere. the puppetmaster now checks *all* nodes for anomalies, and any anomalies are filed under the affected node's services as well as the puppetmaster's.

all catalogs are mostly clean. there were two servers that were offline, one of which was decommissioned by weasel.
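
for illustration, a rough sketch of the kind of per-node classification a puppetmaster-side check performs, in python. this is not the check_puppetdb_nodes code. it assumes node records shaped like the ones puppetdb returns for its nodes query, with `certname`, `report_timestamp` and `latest_report_status` fields; the function name `find_anomalies` and the 24 hour threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_anomalies(nodes, now, max_age=timedelta(hours=24)):
    """Map certname -> problem description for nodes that look unhealthy.

    `nodes` is a list of dicts (assumed shape, see above) and `now` must
    be a timezone-aware datetime.
    """
    problems = {}
    for node in nodes:
        name = node["certname"]
        stamp = node.get("report_timestamp")
        if stamp is None:
            # The node is known but never submitted a report.
            problems[name] = "no report received"
            continue
        # Timestamps are assumed to be ISO 8601 with a trailing "Z".
        reported = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if now - reported > max_age:
            problems[name] = "last report is stale"
        elif node.get("latest_report_status") == "failed":
            problems[name] = "last run failed"
    return problems
```

keying the result by certname is what would let each problem be filed under the affected node as well as summarized on the puppetmaster.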