#29682 closed defect (fixed)

remove traces of munin-node everywhere

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID: #29681
Points:
Reviewer:
Sponsor:

Description


Child Tickets

Change History (4)

comment:1 Changed 19 months ago by anarcat

I have implemented the removal of Munin everywhere in Puppet, as much as I could find.

Fun facts found while ripping that stuff out:

  1. there is a DSA-specific munin package, with the following plugins:
  • spamassassin: ham/spam/total counts, looks for `spamd: ((processing|checking) message|identified spam|clean message)` in mail.log, could be replaced with mtail
  • postgres-wal-traffic_: output of `psql -p "$port" --no-align --command 'SELECT * FROM pg_current_xlog_insert_location()' --tuples-only --quiet | tr -d /`, probably covered by the postgres exporter
  • ksm_scans: output of /sys/kernel/mm/ksm/full_scans, see KSM docs
  • ksm: same, but with pages_shared, _unshared, _volatile, _sharing, possibly covered by the node exporter, but hardly seems critical in any case
  • vsftpd: logtail of /var/log/ftp/vsftpd.log looking for upload/download/login/delete/connections, would require a custom mtail plugin as well
  • bind: logtail of /var/log/daemon.log, looking for queries etc, easy replacement with the Prometheus exporter
  • apache_servers: apache server-status, equivalent of the apache exporter, already deployed
  1. there's a packet counting script in ferm which seems to collect per-IP packet stats from iptables:
    $munin_ips = split(regsubst($v4ips, '([^,]+)', 'ip_\1', 'G'), ',')
    munin::check { $munin_ips:
        script => "ip_";
    }
    if $v6ips {
        $munin6_ips = split(regsubst($v6ips, '([^,]+)', 'ip_\1', 'G'), ',')
        munin::check { $munin6_ips: script => 'ip_', }
    }
    
    i have just removed those, without a replacement.
  1. hiding in the haproxy puppet module is another munin plugin. there is also a prometheus exporter for haproxy which we can eventually deploy to replace this. in the meantime, it was deleted
  1. the VM image installer (modules/roles/files/virt/tor-install-VM) makes noises about setting up VM-specific plugins: `echo ' for i in /usr/local/sbin/vm_du_ suggest; do ln -vsf /usr/local/sbin/vm_du_ /etc/munin/plugins/vm_du_$i; done'`. The vm_du_ plugin itself does not seem to be deployed through Puppet; it is a script that reports the disk usage of each VM. It looks something like this on kvm4 right now:
#!/bin/bash
# -*- sh -*-

MUNIN_LIBDIR=${MUNIN_LIBDIR:-/usr/share/munin}
. $MUNIN_LIBDIR/plugins/plugin.sh

BASE=/srv/vmstore

VM=${0##*vm_du_}
#VM=${VM//_/.}

case $1 in
    autoconf)
        if [[ -d "$BASE" ]]; then
            echo yes
            exit 0
        else
            echo "no ($BASE not found)"
            exit 0
        fi
        ;;
    suggest)
        if [[ -d "$BASE" ]]; then
            find "$BASE" -mindepth 1 -maxdepth 1 -type d -a ! -name lost+found -printf '%f\n' # | tr . _
        fi
        exit 0
        ;;
    config)
        echo "graph_title disk usage VM $VM"
        echo 'graph_args --base 1024 --lower-limit 0'
        echo 'graph_vlabel bytes'
        echo 'graph_category disk'
        echo 'graph_total Total'

        find "$BASE/$VM" -mindepth 1 -maxdepth 1 -type f |
        while read fn; do
            label="${fn##*/}"
            label=${label//./_}
            name=${label//-/_}
            echo "$name.label $label"
            echo "$name.cdef $name,1024,*"
        done
        exit 0
        ;;
esac

find "$BASE/$VM" -mindepth 1 -maxdepth 1 -type f -printf '%f %k\n' |
while read fn du; do
  fn=${fn//[.-]/_}
  echo "$fn.value $du"
done

that is covered by #29816.

  1. the munin-common package doesn't remove its own user/group by default so I did that by hand. there's a possibility that some files are left over in /var or /etc, but I am ready to accept the risk of a possible UID reuse there in order to remove an extra account from all servers
  1. normally, the package removal process should have removed all of /etc/munin/plugins, but there are some leftovers sometimes, e.g. on oo-hetzner-03:
diskstats     fw_forwarded_local  if_err_eth0  ip_38.229.72.27  ntp_kernel_err       ntp_kernel_pll_off  postfix_mailvolume  threads
fw_conntrack  fw_packets          if_eth0      netstat          ntp_kernel_pll_freq  postfix_mailqueue   proc_pri            users

Those are all symlinks to builtin plugins, so I think they can be safely removed and have done so.

  1. nagios was checking, in its static configuration, that munin was running everywhere; I have removed that check as well
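
For the leftover plugin symlinks mentioned above, here is a minimal sketch of how one could list them before deleting anything. The function name `list_stock_plugin_links` is hypothetical, and the two default paths are assumptions based on the usual Debian munin-node layout; it relies on GNU find for `-lname` and `-printf`.

```shell
#!/bin/bash
# Hypothetical helper: list symlinks in a plugin directory that point into
# the stock plugin directory, i.e. leftovers of builtin plugins.
# Both default paths are assumptions (Debian munin-node layout).
list_stock_plugin_links() {
    local plugin_dir="${1:-/etc/munin/plugins}"
    local stock_dir="${2:-/usr/share/munin/plugins}"
    # -lname matches on the symlink target, so dangling links
    # (target already removed with the package) are still listed.
    find "$plugin_dir" -mindepth 1 -maxdepth 1 -type l \
        -lname "$stock_dir/*" -printf '%f\n' | sort
}
```

Anything the function does not print (regular files, or links pointing elsewhere such as a locally installed vm_du_) deserves a manual look before /etc/munin is removed.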

All those changes will take some time to propagate everywhere, which will make Nagios noisy for a little while. Tomorrow, it will be possible to remove remaining Munin code from Puppet entirely, assuming all nodes will have run Puppet correctly.

So, next step here:

  1. check that Puppet has run everywhere
  2. check that Nagios looks okay
  3. sample a random number of machines and make sure that the munin users and groups are gone, that there's nothing from the munin user in /var and that /etc/munin is not present
  4. remove any remaining Munin code in Puppet
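
Step 3 could be scripted roughly as follows. This is only a sketch: `munin_leftovers` is a hypothetical name, and it reads the passwd and group files directly (instead of using getent) so that it can be pointed at an alternate root for a dry run. It assumes GNU find for `-quit`.

```shell
#!/bin/bash
# Hypothetical helper: report Munin leftovers under a filesystem root.
# Usage: munin_leftovers [ROOT]   (ROOT defaults to /)
# Returns 0 when nothing is found, 1 otherwise.
munin_leftovers() {
    local root="${1:-/}"
    root="${root%/}"   # "/" becomes "", so the paths below stay well-formed
    local rc=0

    # The munin user and group should be gone.
    if grep -q '^munin:' "$root/etc/passwd" 2>/dev/null; then
        echo "munin user still present"; rc=1
    fi
    if grep -q '^munin:' "$root/etc/group" 2>/dev/null; then
        echo "munin group still present"; rc=1
    fi

    # /etc/munin should not be present.
    if [ -e "$root/etc/munin" ]; then
        echo "$root/etc/munin still exists"; rc=1
    fi

    # Once the account is deleted, its files show up as unowned.
    if [ -d "$root/var" ] &&
       find "$root/var" \( -nouser -o -nogroup \) -print -quit 2>/dev/null | grep -q .; then
        echo "unowned files under $root/var"; rc=1
    fi

    return "$rc"
}
```

Note that the /var check flags any unowned file, not only former Munin files, so a hit there still needs a human to look at it.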

Then we will be thoroughly done with Munin, at last.

Last edited 19 months ago by anarcat

comment:2 Changed 19 months ago by anarcat

so current status:

  1. Puppet has run everywhere
  2. Nagios looks mostly okay
  3. Munin user and group are gone, but there are leftover unowned files...
  4. ... so it's a bit early to remove the Munin code for now

I made this magic recipe to list the last check-in times of nodes in PuppetDB:

curl -s -G http://localhost:8080/pdb/query/v4/nodes  | jq -r 'sort_by(.report_timestamp) | .[] | "\(.certname) \(.report_timestamp)"' | column -s\  -t

... documented in https://help.torproject.org/tsa/howto/puppet/ of course. :)
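
To turn that listing into a quick "who has not checked in" report, a small filter along these lines might help. It is a sketch under assumptions: `stale_nodes` is a hypothetical name, it expects space-separated "certname timestamp" pairs on stdin, and it assumes the timestamps are ISO 8601 strings in a single timezone so that plain string comparison orders them correctly.

```shell
#!/bin/bash
# Hypothetical filter: print the nodes whose last report is older than CUTOFF.
# Reads "certname timestamp" pairs on stdin; a node that never reported
# (timestamp printed as "null" by jq) is listed as well.
stale_nodes() {
    local cutoff="$1"
    awk -v cutoff="$cutoff" '$2 == "null" || $2 < cutoff { print $1 }'
}

# Possible usage (not run here; needs curl, jq and GNU date):
#   curl -s -G http://localhost:8080/pdb/query/v4/nodes |
#       jq -r '.[] | "\(.certname) \(.report_timestamp)"' |
#       stale_nodes "$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%S)"
```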

comment:3 Changed 19 months ago by anarcat

To check for unowned files, I am using the magic Cumin incantation:

cumin -p 0 -b 5 --force -o txt '*' 'find / -ignore_readdir_race -path /proc -prune -o \( -nouser -o -nogroup \) -print' | tee unowned-files

The result is around 300,000 files spread over all the servers, a large part of which are */home/* files which I'm not sure what to do with. The remaining ~800 files are not related to Munin, so I'll just punt this problem elsewhere. :) (Specifically, in #29987.)

(Actually, there *was* a file related to Munin, /run/xtables.lock, which I have removed where relevant but that wasn't really a problem since it lived in a temporary filesystem and that file would have been removed eventually anyways.)
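
The split between */home/* files and everything else can be counted from the saved output with something like the sketch below. The name `summarize_unowned` is hypothetical, and the code assumes each line of unowned-files looks like "hostname: /path", which is how I understand Cumin's txt output; adjust the prefix handling if the format differs.

```shell
#!/bin/bash
# Hypothetical helper: count unowned files under a home directory vs. the rest.
# Reads the saved Cumin output from the given files, or from stdin.
summarize_unowned() {
    awk '
        {
            path = $0
            sub(/^[^ ]+: /, "", path)   # strip the assumed "hostname: " prefix
            if (index(path, "/home/")) home++
            else other++
        }
        END { printf "home=%d other=%d\n", home, other }
    ' "$@"
}

# Possible usage:
#   summarize_unowned unowned-files
```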

Tomorrow I'll remove the remaining Munin code in Puppet and this ticket will be done.

Last edited 19 months ago by anarcat

comment:4 Changed 19 months ago by anarcat

Resolution: fixed
Status: assigned → closed

i have just now removed all traces of the munin code in Puppet, so this is done.
