Opened 6 months ago

Closed 5 months ago

#29681 closed task (fixed)

replace munin with prometheus and grafana

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by anarcat)

munin died in a fire and people want to try out prometheus, let's do that.

this will also involve setting up a Grafana instance as the built-in graphs in Prometheus are too limited and/or hard to configure.

Child Tickets

Ticket | Status | Owner | Summary | Component
#29389 | closed | anarcat | Set up VM for Prometheus | Internal Services/Tor Sysadmin Team
#29410 | closed | tpa | Can Prometheus help with multiple checks turning into one single alarm? | Internal Services/Tor Sysadmin Team
#29682 | closed | anarcat | remove traces munin-node everywhere | Internal Services/Tor Sysadmin Team
#29683 | closed | anarcat | install prometheus-node-exporter everywhere | Internal Services/Tor Sysadmin Team
#29684 | closed | anarcat | setup a grafana server somewhere | Internal Services/Tor Sysadmin Team
#29822 | closed | weasel | prometheus server cannot reach build-arm* boxes | Internal Services/Tor Sysadmin Team
#30028 | closed | anarcat | additional prometheus/grafana exporters/dashboards | Internal Services/Tor Sysadmin Team

Change History (9)

comment:1 Changed 6 months ago by anarcat

Description: modified (diff)
Summary: changed from "replace munin with prometheus" to "replace munin with prometheus and grafana"

comment:2 Changed 6 months ago by arma

Component: changed from "- Select a component" to Internal Services/Tor Sysadmin Team
Owner: set to tpa
Type: changed from defect to task

comment:3 Changed 6 months ago by anarcat

good progress here: prometheus-node-exporter is set up on vineale and the temp server, and samples are coming in.
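
One way to check by hand that a node exporter is serving samples is to read its `/metrics` endpoint and parse the text exposition format. The following is only a minimal sketch, not what was actually run here: it assumes the exporter listens on its default port 9100, the function names are made up for illustration, and the sample values are invented. The parser is simplified and ignores optional timestamps.

```python
import urllib.request

# Illustrative exposition text; the metric names are real node exporter
# metrics, the values are made up for this example.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.25
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""


def parse_sample(line):
    """Split one exposition line into (series, value)."""
    if "}" in line:
        # Labels may contain spaces, so cut at the last closing brace.
        end = line.rindex("}") + 1
        series, rest = line[:end], line[end:]
    else:
        series, _, rest = line.partition(" ")
    # The first field after the series is the value; a timestamp may follow.
    return series, float(rest.split()[0])


def parse_samples(text):
    """Parse exposition text, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        samples.append(parse_sample(line))
    return samples


def fetch_metrics(host, port=9100, timeout=5):
    """Fetch raw metrics text from an exporter (port 9100 assumed)."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For a live check, something like `parse_samples(fetch_metrics("node.example.org"))` should return a non-empty list; the hostname is a placeholder.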

next step is to figure out how to deploy grafana and start working on puppetizing the entire thing.

comment:4 Changed 6 months ago by anarcat

Owner: changed from tpa to anarcat
Status: changed from new to accepted

update: grafana is now deployed, although not completely automatically. the last stage of this is to deploy more "exporters" to cover things that might have been built into munin, like mail and database server monitoring. apache webserver monitoring is already partly implemented.

once that is done and munin is cleaned up, i think we'll be done here.

comment:5 Changed 6 months ago by anarcat

the apache dashboard has been deployed on all hosts running apache.

i did a quick investigation of the other possible exporters and dashboards that we could use in our infrastructure. notes moved into #30028.

Last edited 5 months ago by anarcat

comment:6 Changed 6 months ago by anarcat

for what it's worth, I opened a ticket (#29864) about the other side of Prometheus: alerting, and replacing Nagios.

Last edited 5 months ago by anarcat

comment:7 Changed 6 months ago by anarcat

after approximately two weeks of scraping, we are nearing 11GiB of disk usage. that is below the 20GiB estimate established in #29388, which is good.
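
For context, the Prometheus documentation gives a rough sizing rule: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes). The sketch below applies that rule and also derives the observed growth rate from the figures above. It is an illustration only: the function names are hypothetical, "approximately two weeks" is assumed to mean 14 days, and the retention and target counts behind the #29388 estimate are not stated in this ticket.

```python
def estimate_disk_bytes(retention_days, targets, series_per_target,
                        scrape_interval_s, bytes_per_sample=2.0):
    """Rough Prometheus disk estimate.

    retention_seconds * ingested_samples_per_second * bytes_per_sample,
    where each target is assumed to expose the same number of series.
    """
    samples_per_second = targets * series_per_target / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample


def daily_growth_gib(used_gib, days):
    """Average growth per day, assuming linear growth so far."""
    return used_gib / days


# Observed figures from this ticket: about 11 GiB after about 14 days
# (the 14 is an assumption for "approximately two weeks").
observed_rate = daily_growth_gib(11, 14)
print(f"observed growth: {observed_rate:.2f} GiB/day")
```

Linear growth only holds until the retention period is reached, after which old blocks are deleted and usage should level off.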

comment:8 Changed 6 months ago by anarcat

i moved the dashboard / exporter evaluation stuff into #30028 instead of blowing up this ticket with even more of those.

Last edited 5 months ago by anarcat

comment:9 Changed 5 months ago by anarcat

Resolution: fixed
Status: changed from accepted to closed

i think most of the work here is done. there are still open issues on grafana, namely authentication (#30023) and deployment (#30026), but I don't think those are a priority in the grand scheme of things right now. there's also the wider question of monitoring third-party services (see #30152, for example), which i consider out of scope for this original deployment, along with an eventual nagios replacement.

because everything is covered by other tickets, I will make the call and claim the first phase of this project is complete.

TL;DR: done! Next steps:

  • #30023 - grafana authentication
  • #30026 - grafana docker deployment
  • #30152 - third-party monitoring
  • #29864 - nagios replacement