Opened 17 months ago

Last modified 10 months ago

#29864 new project

consider replacing nagios with prometheus

Reported by: anarcat Owned by: tpa
Priority: Low Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Major Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by anarcat)

As a followup to the Prometheus/Grafana setup started in #29681, I am wondering if we should also consider replacing the Nagios/Icinga server with Prometheus. I have done a little research on the subject and figured it might be good to at least document the current state of affairs.

This would remove a complex piece of architecture at TPO that was designed before Puppet was properly deployed. Prometheus has an interesting federated design that allows it to scale to multiple machines easily, along with a high-availability mode for the Alertmanager that makes it more reliable than a traditional Nagios setup. It would also simplify our architecture: the Nagios server automation is a complex mix of Debian packages and git hooks that has served us well, but is hard to comprehend and debug for new administrators. (I managed to wipe the entire Nagios config myself in my first week on the job by messing up a configuration file.) Having the monitoring server fully deployed by Puppet would be a huge improvement, even if it were done with Nagios instead of Prometheus, of course.

Right now the Nagios server is actually running Icinga 1.13, a Nagios fork, on a Hetzner machine (hetzner-hel1-01). It's generally doing its job well, although it feels a *little* noisy, but that's to be expected from Nagios servers. Reducing the number of alerts seems to be an objective, explicitly documented in #29410, for example.

Both Grafana and Prometheus can do alerting, with various mechanisms and plugins. I haven't investigated those deeply, but in general that's not the hard part of alerting: you fire some script or hit some API and the rest happens. I suspect we could port the current Nagios alerting scripts to Prometheus fairly easily, although I haven't looked at our scripts in detail.
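To illustrate, the dispatch side in the Alertmanager is just a routing tree plus receivers; a minimal sketch of what wiring our scripts into it could look like (addresses and the webhook shim are hypothetical, not our actual setup):

```yaml
route:
  receiver: tpo-admins
  group_by: [alertname, host]
  repeat_interval: 4h

receivers:
  - name: tpo-admins
    email_configs:
      - to: monitoring@example.org        # hypothetical address
    webhook_configs:
      # hypothetical local shim wrapping an existing notification script
      - url: http://localhost:8080/notify
```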

The problem is reproducing the check scripts and their associated alert thresholds. In the Nagios world, a check *comes* with its own health thresholds ("OK", "WARNING", "CRITICAL"), and TPO has developed a wide variety of such checks. According to the current Nagios dashboard, it monitors 4612 services on 88 hosts (which is interesting considering LDAP thinks there are 78). That looks terrifying, but it's actually a set of 9 commands running on the Nagios server, including the complex check_nrpe system, which is basically a client-side Nagios with its own set of checks. And that's where the combinatorial explosion happens: on a typical host, there are 315 such checks implemented.

That's the hard part: converting those 324 checks into Prometheus alerts, one at a time. Unfortunately, there are no "built-in" or even "third-party" "prometheus alert sets" that I could find in my original research, although that might have changed in the last year.

Each alert in Prometheus is basically a YAML rule with a query that, when it evaluates to "true" (e.g. disk usage > 90%), fires an alert. It's not impossible to do that conversion, it's just a lot of work.
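For example, a disk-space rule along those lines could look like the following (the alert name and threshold are illustrative, and the metric names assume a recent node exporter):

```yaml
groups:
  - name: disk
    rules:
      - alert: DiskAlmostFull
        # fire when a filesystem has been more than 90% full for 10 minutes
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.instance }}: {{ $labels.mountpoint }} over 90% full'
```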

To do this progressively while allowing us to create new alerts in Prometheus instead of Nagios, I suggest we proceed the same way Cloudflare did: establish a "Nagios to Prometheus" bridge, where Nagios stops sending alerts on its own and instead forwards them to the Prometheus server, through a plugin they called Promsaint.
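Promsaint itself is a Go daemon, but just to illustrate the idea, such a bridge can be as small as a Nagios notification command that reposts the alert to the Alertmanager HTTP API. A sketch (the label names, port and function name are made up for illustration):

```shell
# build_alert_payload: turn Nagios macro values into an Alertmanager v2
# JSON alert body (hypothetical label scheme).
build_alert_payload() {
    host="$1"; service="$2"; state="$3"; output="$4"
    printf '[{"labels":{"alertname":"nagios_%s","host":"%s","severity":"%s"},"annotations":{"summary":"%s"}}]' \
        "$service" "$host" "$state" "$output"
}

# A Nagios notification command would then POST this to the (assumed local)
# Alertmanager, something like:
#   build_alert_payload "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATE$" "$SERVICEOUTPUT$" \
#     | curl -s -XPOST --data-binary @- http://localhost:9093/api/v2/alerts
```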

With the bridge in place, Nagios checks can be migrated into Prometheus alerts progressively without disruption. Note that Cloudflare documented their experience with Prometheus in this 2017 promcon talk. Cloudflare also made an alert dashboard called unsee (see also the fork called karma) and elasticsearch integration which might be good to investigate further.

Another useful piece is this NRPE to Prometheus exporter, which allows Prometheus to directly scrape NRPE targets. It doesn't include Prometheus alerts and instead relies on a Grafana dashboard to show possible problems so, as such, I don't think it's that useful an alternative. There's a similar approach using check_mk instead.
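If we went the nrpe_exporter route, the scrape side would presumably follow the usual multi-target exporter pattern. A sketch, with the caveat that the parameter names, port and hostnames below are assumptions to be checked against the exporter's documentation:

```yaml
scrape_configs:
  - job_name: nrpe
    metrics_path: /export
    params:
      command: [check_load]               # the NRPE command to run
    static_configs:
      - targets: [nrpe-host.example.org:5666]   # hypothetical NRPE target
    relabel_configs:
      # multi-target exporter dance: pass the NRPE address as a parameter
      # and point the actual scrape at the exporter itself
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9275       # where the nrpe_exporter runs
```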

Another possible approach is to send alerts from Nagios based on Prometheus checks, using the Prometheus nagios plugins. This might allow us to get rid of NRPE everywhere but it would probably be useful only if we do want to keep Nagios in the long term and remove NRPE in favor of the existing Prometheus exporters.

So, battle plan is basically this:

  1. apt install prometheus-alertmanager
  2. reimplement the Nagios alerting commands
  3. send Nagios alerts through the alertmanager
  4. rewrite (non-NRPE) commands (9) as Prometheus alerts
  5. optionally, scrape the NRPE metrics from Prometheus
  6. optionally, create a dashboard and/or alerts for the NRPE metrics
  7. rewrite NRPE commands (300+) as Prometheus alerts
  8. turn off the Nagios server
  9. remove all traces of NRPE on all nodes

Child Tickets

Change History (6)

comment:1 Changed 17 months ago by anarcat

Description: modified (diff)

comment:2 Changed 16 months ago by anarcat

Description: modified (diff)

comment:3 Changed 14 months ago by anarcat

about such transitions, there was a good talk at SREcon 2019:

a key takeaway from the presentation is that they didn't replace nagios, but filtered its output. this matches our model where nagios sends notifications to the alertmanager.

we might also want to monitor the number of alerts sent out by nagios, if only to figure out trends that we could work to fix. for this I found two different solutions:

  1. a simple exporter relying on the check-mk-livestatus which did not make it to buster
  2. a more complex exporter that behaves like a broker but that is not packaged in Debian either

It seems neither is a good solution for now and that we should instead rely on the internal reporting metrics inside Nagios for the time being.

Last edited 14 months ago by anarcat (previous) (diff)

comment:4 Changed 10 months ago by anarcat

one thing i couldn't figure out how to do with prometheus so far is the "availability report" from nagios, which shows what percentage of time hosts are available over a given period. i was able to use:

sum(count(up==1))/sum(count(up)) by (alias)

which shows what percentage of hosts (not of time) are available at each point in time. but it doesn't give me what percentage of time each host is available over the entire period.
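one way to get a fraction of *time* per host, rather than a point-in-time ratio, is to average up over the period, since up is 1 when the host is reachable and 0 otherwise (the window length here is arbitrary):

avg_over_time(up[4w])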

comment:5 Changed 10 months ago by anarcat

the magic formula is, according to @bbrazil on #prometheus:


comment:6 Changed 10 months ago by anarcat

if we end up writing alerting rules, it might be important to design tests for the rules too:
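promtool has a built-in unit-test mode for exactly this; a sketch of what such a test could look like (the file and alert names are hypothetical):

```yaml
# tests.yml, run with: promtool test rules tests.yml
rule_files:
  - alerts.yml                    # hypothetical rule file under test

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{instance="web1"}'
        values: '1 1 0 0 0 0 0 0 0 0 0'   # host goes down after two minutes
    alert_rule_test:
      - eval_time: 10m
        alertname: HostDown       # hypothetical alert defined in alerts.yml
        exp_alerts:
          - exp_labels:
              instance: web1
```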
