replace munin with prometheus and grafana

munin died in a fire and people want to try out prometheus, so let's do that.

this will also involve setting up a Grafana instance as the built-in graphs in Prometheus are too limited and/or hard to configure.

comment:1

comment:2

comment:3

good progress here: prom-node-exporter is set up on vineale and the temp server, and samples are coming in.
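one way to confirm that samples are actually coming in is to ask the Prometheus HTTP API for the `up` metric (`GET /api/v1/query?query=up`) and see which targets report 1. the sketch below only parses such a response, so it runs without a server. the instance labels (`vineale:9100`, `temp-server.example:9100`) and the sample payload are made up for illustration, assuming the node exporter listens on its default port 9100; this is not the check that was actually run here.

```python
import json


def targets_by_state(query_response: str) -> dict:
    """Split scrape targets into up/down from an `up` instant-query response."""
    payload = json.loads(query_response)
    if payload.get("status") != "success":
        raise ValueError("prometheus query did not succeed")
    states = {"up": [], "down": []}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "<unknown>")
        # instant vectors carry [timestamp, "value"], with the value as a string
        _timestamp, value = series["value"]
        key = "up" if float(value) == 1.0 else "down"
        states[key].append(instance)
    return states


# hypothetical response body, shaped like the API output for `up`
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "vineale:9100", "job": "node"},
             "value": [1558000000.0, "1"]},
            {"metric": {"__name__": "up", "instance": "temp-server.example:9100", "job": "node"},
             "value": [1558000000.0, "0"]},
        ],
    },
})

if __name__ == "__main__":
    print(targets_by_state(SAMPLE))
```

a target that shows up under "down" is known to Prometheus but its last scrape failed, which usually points at the exporter or the firewall rather than at the scrape configuration.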

next step is to figure out how to deploy grafana and start working on puppetizing the entire thing.

comment:4

update: grafana is now deployed, although not completely automatically. the last stage of this is to deploy more "exporters" to cover things that might have been built into munin, like mail and database server monitoring. apache webserver monitoring is already partly implemented.

once that is done and munin is cleaned up, i think we'll be done here.

comment:5

the apache dashboard has been deployed on all hosts running apache.

i did a quick investigation of the other possible exporters and dashboards that we could use in our infrastructure. notes moved into #30028.

comment:6

for what it's worth, I opened #29864 to cover the other side of Prometheus: alerting, and with it replacing Nagios.

comment:7

after approximately two weeks of scraping, we are nearing 11GiB of disk usage. that is below the 20GiB estimated in #29388, which is good.
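the arithmetic behind that comparison can be sketched as follows. it uses only the figures quoted in this comment (about 11GiB, about 14 days, 20GiB estimated) and assumes the estimate in #29388 covers the same retention window, which this ticket does not state. the function names are made up for illustration.

```python
def daily_growth_gib(used_gib: float, days: float) -> float:
    """Average on-disk growth per day of scraping."""
    return used_gib / days


def headroom_gib(used_gib: float, estimate_gib: float) -> float:
    """Space left before the estimate is reached."""
    return estimate_gib - used_gib


USED_GIB = 11       # "nearing 11GiB", from this ticket
DAYS = 14           # "approximately two weeks"
ESTIMATE_GIB = 20   # estimate quoted from #29388

if __name__ == "__main__":
    print(f"average growth: {daily_growth_gib(USED_GIB, DAYS):.2f} GiB/day")
    print(f"share of estimate used: {USED_GIB / ESTIMATE_GIB:.0%}")
    print(f"headroom: {headroom_gib(USED_GIB, ESTIMATE_GIB)} GiB")
```

note that Prometheus drops old blocks once they pass the retention period (15 days by default), so if the default applies here, usage should flatten out near the current figure rather than keep growing at this rate.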

comment:8

i moved the dashboard / exporter evaluation stuff into #30028 instead of blowing up this ticket with even more of those.

comment:9

i think most of the work here is done. there are still open issues on grafana, namely authentication (#30023) and deployment (#30026), but I don't think those are a priority in the grand scheme of things right now. there's also the wider question of monitoring third-party services (for example #30152), which i consider out of scope for this original deployment, along with an eventual nagios replacement.

because everything is covered by other tickets, I will make the call and claim the first phase of this project is complete.

TL;DR: done! Next steps:

  • #30023 - grafana authentication
  • #30026 - grafana docker deployment
  • #30152 - third-party monitoring
  • #29864 - nagios replacement
