Opened 8 months ago

Closed 7 months ago

#29682 closed defect (fixed)

remove traces of munin-node everywhere

Reported by: anarcat Owned by: anarcat
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: #29681 Points:
Reviewer: Sponsor:

Description


Child Tickets

Change History (4)

comment:1 Changed 7 months ago by anarcat

I have implemented the removal of Munin everywhere in Puppet, as much as I could find.

Fun facts found while ripping that stuff out:

  1. there is a DSA-specific munin package, with the following plugins:
  • spamassassin: ham/spam/total counts, looks for `spamd: ((processing|checking) message|identified spam|clean message)` in mail.log, could be replaced with mtail
  • postgres-wal-traffic_: output of psql -p "$port" --no-align --command 'SELECT * FROM pg_current_xlog_insert_location()' --tuples-only --quiet | tr -d /, probably covered by the postgres exporter
  • ksm_scans: output of /sys/kernel/mm/ksm/full_scans, see KSM docs
  • ksm: same, but with pages_shared, _unshared, _volatile, _sharing, possibly covered by the node exporter, but hardly seems critical in any case
  • vsftpd: logtail of /var/log/ftp/vsftpd.log looking for upload/download/login/delete/connections, would require a custom mtail plugin as well
  • bind: logtail of /var/log/daemon.log, looking for queries etc, easy replacement with the Prometheus exporter
  • apache_servers: apache server-status, equivalent of the apache exporter, already deployed
  1. there's a packet counting script in ferm which seems to count per-IP packet stats from iptables:
    $munin_ips = split(regsubst($v4ips, '([^,]+)', 'ip_\1', 'G'), ',')
    munin::check { $munin_ips:
        script => "ip_";
    }
    if $v6ips {
        $munin6_ips = split(regsubst($v6ips, '([^,]+)', 'ip_\1', 'G'), ',')
        munin::check { $munin6_ips: script => 'ip_', }
    }
    
    I have just removed those, without a replacement.
  1. hiding in the haproxy puppet module is another munin plugin. there is also a prometheus exporter for haproxy which we can eventually deploy to replace this. in the meantime, it was deleted
  1. the VM image installer (modules/roles/files/virt/tor-install-VM) contains a line about setting up VM-specific plugins: echo ' for i in /usr/local/sbin/vm_du_ suggest; do ln -vsf /usr/local/sbin/vm_du_ /etc/munin/plugins/vm_du_$i; done'. the vm_du_ plugin itself does not seem to be deployed through Puppet; it is a script checking the disk space of all VMs. It looks something like this on kvm4 right now:
#!/bin/bash
# -*- sh -*-

MUNIN_LIBDIR=${MUNIN_LIBDIR:-/usr/share/munin}
. $MUNIN_LIBDIR/plugins/plugin.sh

BASE=/srv/vmstore

VM=${0##*vm_du_}
#VM=${VM//_/.}

case $1 in
    autoconf)
        if [[ -d "$BASE" ]]; then
            echo yes
            exit 0
        else
            echo "no ($BASE not found)"
            exit 0
        fi
        ;;
    suggest)
        if [[ -d "$BASE" ]]; then
            find "$BASE" -mindepth 1 -maxdepth 1 -type d -a ! -name lost+found -printf '%f\n' # | tr . _
        fi
        exit 0
        ;;
    config)
        echo "graph_title disk usage VM $VM"
        echo 'graph_args --base 1024 --lower-limit 0'
        echo 'graph_vlabel bytes'
        echo 'graph_category disk'
        echo 'graph_total Total'

        find "$BASE/$VM" -mindepth 1 -maxdepth 1 -type f |
        while read fn; do
            label="${fn##*/}"
            label=${label//./_}
            name=${label//-/_}
            echo "$name.label $label"
            echo "$name.cdef $name,1024,*"
        done
        exit 0
        ;;
esac

find "$BASE/$VM" -mindepth 1 -maxdepth 1 -type f -printf '%f %k\n' |
while read fn du; do
  fn=${fn//[.-]/_}
  echo "$fn.value $du"
done

that is covered by #29816.

  1. the munin-common package doesn't remove its own user/group by default, so I did that by hand. there's a possibility that some files are left over in /var or /etc, but I am ready to accept the consequences of a possible UID reuse there in order to remove an extra account from all servers
  1. normally, the package removal process should have removed all of /etc/munin/plugins, but there are some leftovers sometimes, e.g. on oo-hetzner-03:
diskstats     fw_forwarded_local  if_err_eth0  ip_38.229.72.27  ntp_kernel_err       ntp_kernel_pll_off  postfix_mailvolume  threads
fw_conntrack  fw_packets          if_eth0      netstat          ntp_kernel_pll_freq  postfix_mailqueue   proc_pri            users

Those are all symlinks to builtin plugins, so I think they can be safely removed and have done so.

  1. nagios had a check in its static configuration that munin was running everywhere; I have removed that check as well
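The leftover plugin symlinks mentioned above can be spotted with a small check before deleting anything. This is only a sketch: the function name `list_dangling_symlinks` is made up for illustration, and it assumes the leftovers are symlinks whose targets disappeared with the package, which matches what I saw but is not guaranteed on every host.

```shell
#!/bin/bash
# Hypothetical helper: list symlinks in a directory whose target no longer exists.
# The default directory is the plugin path named in the ticket.
list_dangling_symlinks() {
    local dir="${1:-/etc/munin/plugins}"
    # Nothing to report if the directory is already gone.
    [ -d "$dir" ] || return 0
    # -type l matches the symlinks themselves; `test -e` follows the link,
    # so it fails exactly when the target is missing.
    find "$dir" -mindepth 1 -maxdepth 1 -type l ! -exec test -e {} \; -print
}
```

Running `list_dangling_symlinks` with no argument inspects /etc/munin/plugins; reviewing the output before removing the files keeps the cleanup a deliberate step.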

All those changes will take some time to propagate everywhere, which will make Nagios noisy for a little while. Tomorrow, it will be possible to remove remaining Munin code from Puppet entirely, assuming all nodes will have run Puppet correctly.

So, next steps here:

  1. check that Puppet has run everywhere
  2. check that Nagios looks okay
  3. pick a random sample of machines and make sure that the munin users and groups are gone, that there's nothing from the munin user in /var and that /etc/munin is not present
  4. remove any remaining Munin code in Puppet
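Step 3 could be scripted roughly as follows on each sampled machine. This is a sketch, not what was actually run: the function name `check_account_removed` is hypothetical, and it assumes `getent` is available (true on Debian). It does not cover the /var check, since files from a deleted account show up as unowned rather than as owned by munin.

```shell
#!/bin/bash
# Hypothetical helper: report leftover traces of a removed account.
# Prints one line per finding and nothing when the host looks clean.
check_account_removed() {
    local name="$1" confdir="$2"
    # getent consults the configured name services, not just /etc/passwd.
    getent passwd "$name" >/dev/null 2>&1 && echo "user $name still exists"
    getent group "$name" >/dev/null 2>&1 && echo "group $name still exists"
    [ -e "$confdir" ] && echo "$confdir still present"
    # Always succeed: findings are reported on stdout, not via exit status.
    return 0
}
```

For this ticket the call would be `check_account_removed munin /etc/munin`; empty output means the account, group and configuration directory are all gone.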

Then we will be thoroughly done with Munin, at last.

Last edited 7 months ago by anarcat

comment:2 Changed 7 months ago by anarcat

so current status:

  1. Puppet has run everywhere
  2. Nagios looks mostly okay
  3. Munin user and group are gone, but there are leftover unowned files...
  4. ... so it's a bit early to remove the Munin code for now

I made this magic recipe to list the last check-in times of nodes in PuppetDB:

curl -s -G http://localhost:8080/pdb/query/v4/nodes  | jq -r 'sort_by(.report_timestamp) | .[] | "\(.certname) \(.report_timestamp)"' | column -s\  -t

... documented in https://help.torproject.org/tsa/howto/puppet/ of course. :)
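The output of that recipe can be filtered further to show only nodes that have not checked in recently. The sketch below is an illustration with a made-up function name (`stale_nodes`). It assumes the timestamps are ISO 8601 in UTC with a fixed layout, so that plain string comparison orders them chronologically, and that a node which never reported shows up as the literal string `null`.

```shell
#!/bin/bash
# Hypothetical helper: print nodes whose last report is older than a cutoff.
# Input: lines of "certname timestamp", as printed by the jq filter above.
# Assumption: timestamps share one ISO 8601 UTC layout, so comparing them
# as strings is the same as comparing them as dates.
stale_nodes() {
    local cutoff="$1"
    # Treat "null" (assumed marker for a node without any report) as stale.
    awk -v cutoff="$cutoff" '$2 == "null" || $2 < cutoff { print $1 }'
}
```

With GNU date, a cutoff of one day ago can be produced with `date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%S` and passed as the only argument, with the recipe's output piped in on stdin.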

comment:3 Changed 7 months ago by anarcat

To check for unowned files, I am using the magic Cumin incantation:

cumin -p 0 -b 5 --force -o txt '*' 'find / -ignore_readdir_race -path /proc -prune -nouser -o -nogroup' | tee unowned-files

The result is around 300,000 files spread over all the servers, a large part of which are */home/* files which I'm not sure what to do with. The remaining ~800 files are not related to Munin, so I'll just punt this problem elsewhere. :) (Specifically, in #29987.)

(Actually, there *was* a file related to Munin, /run/xtables.lock, which I have removed where relevant but that wasn't really a problem since it lived in a temporary filesystem and that file would have been removed eventually anyways.)
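One caveat on the find expression used above: find's implicit AND binds tighter than `-o`, so as written it parses as `( -path /proc -prune -nouser ) -o -nogroup` and, as far as I can tell, lists only files without a valid group. A variant with explicit grouping is sketched below. The function name `find_unowned` is hypothetical, the sketch assumes GNU find (for `-ignore_readdir_race`), and it is not the command that produced the numbers in this comment.

```shell
#!/bin/bash
# Hypothetical helper: list files that have no valid owner or no valid
# group, skipping /proc. Assumes GNU find.
find_unowned() {
    local root="${1:-/}"
    # The parentheses make -nouser and -nogroup alternatives of each other;
    # -print is explicit so the pruned /proc entry itself is not printed.
    find "$root" -ignore_readdir_race -path /proc -prune -o \( -nouser -o -nogroup \) -print
}
```

Through Cumin, the body of the function would replace the find command inside the quoted string, with the parentheses escaped the same way.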

Tomorrow I'll remove the remaining Munin code in Puppet and this ticket will be done.

Last edited 7 months ago by anarcat

comment:4 Changed 7 months ago by anarcat

Resolution: fixed
Status: assigned → closed

I have just now removed all traces of the munin code in Puppet, so this is done.
