Opened 5 months ago

Closed 4 months ago

Last modified 4 months ago

#30028 closed project (fixed)

additional prometheus/grafana exporters/dashboards

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID: #29681
Points:
Reviewer:
Sponsor:

Description (last modified by anarcat)

our munin replacement is not entirely complete, as there are key parts of the infrastructure that are not monitored. here's a short inventory of what I found in #29681:

  • email servers monitoring (eugeni, etc? postfix)
  • mailman monitoring
      • no known exporter or dashboard
  • databases
  • DNS / bind
  • GitLab
      • there is builtin support for prometheus that has to be configured

those are the other missing things I found during the audit performed while removing Munin:

  • spamassassin: ham/spam/total counts, looks for spamd: ((processing|checking) message|identified spam|clean message) in mail.log, could be replaced with mtail
  • postgres-wal-traffic_: should be covered by the postgres_exporter mentioned above; otherwise, hook the output of psql -p "$port" --no-align --command 'SELECT * FROM pg_current_xlog_insert_location()' --tuples-only --quiet | tr -d / into the node_exporter
  • ksm stats: extra memory statistics, might not be very important
  • haproxy: https://github.com/prometheus/haproxy_exporter
  • per VM disk usage: see #29816
  • vsftpd: custom mtail plugin, no known exporter or dashboard
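
As an illustration of the spamassassin item above, here is a minimal Python sketch of the counting that an mtail program would have to reproduce. It uses the pattern quoted in the list; the function name, the sample log lines and the mapping of "processing/checking" to the total count are assumptions made for the example, not something taken from the Munin plugin.

```python
import re

# Pattern quoted in the ticket for spamd lines in mail.log.
SPAMD_RE = re.compile(
    r"spamd: ((processing|checking) message|identified spam|clean message)"
)

# Made-up sample lines, only here to exercise the function.
SAMPLE = [
    "May  1 12:00:00 mx spamd[1]: spamd: processing message <a@example.org>",
    "May  1 12:00:01 mx spamd[1]: spamd: clean message (0.1/5.0) for nobody",
    "May  1 12:00:02 mx spamd[1]: spamd: checking message <b@example.org>",
    "May  1 12:00:03 mx spamd[1]: spamd: identified spam (9.9/5.0) for nobody",
    "May  1 12:00:04 mx postfix/smtpd[2]: connect from unknown",
]


def count_spamd(lines):
    """Return ham/spam/total counts for an iterable of mail.log lines."""
    counts = {"ham": 0, "spam": 0, "total": 0}
    for line in lines:
        match = SPAMD_RE.search(line)
        if match is None:
            continue  # not a spamd line we care about
        event = match.group(1)
        if event == "identified spam":
            counts["spam"] += 1
        elif event == "clean message":
            counts["ham"] += 1
        else:
            # "processing message" or "checking message":
            # assumed to mark one handled message
            counts["total"] += 1
    return counts


if __name__ == "__main__":
    print(count_spamd(SAMPLE))
```

In a real deployment the counters would be exposed to prometheus (by mtail or otherwise) rather than printed.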

See the full review in #29682 for details on those.
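
For the postgres-wal-traffic_ item, hooking the psql output into the node_exporter could go through its textfile collector, which reads *.prom files from a configured directory. The sketch below is one possible shape for that, under assumptions: the metric name is made up, and instead of stripping the slash with tr it converts the WAL location (two hexadecimal numbers separated by a slash) into a byte offset, which is a different conversion from the one in the command above.

```python
def lsn_to_bytes(lsn):
    """Convert a WAL location such as '16/B374D848' to a byte offset.

    The part before the slash holds the high 32 bits and the part
    after it the low 32 bits, both in hexadecimal.
    """
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) + int(low, 16)


def render_prom(lsn, metric="postgres_wal_insert_location_bytes"):
    """Render one sample in the Prometheus text exposition format.

    The metric name is a hypothetical choice for this example.
    """
    value = lsn_to_bytes(lsn)
    return (
        f"# HELP {metric} Current WAL insert location in bytes.\n"
        f"# TYPE {metric} counter\n"
        f"{metric} {value}\n"
    )


if __name__ == "__main__":
    # In practice the argument would be the output of the psql command
    # and the result would be written to a .prom file in the directory
    # watched by the textfile collector.
    print(render_prom("0/3000060"), end="")
```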

There were also demands from other teams for monitoring, see #29863 and #30006 for now.

Change History (10)

comment:1 Changed 5 months ago by anarcat

out of those, I think we can consider the issue complete when we have those monitored:

  • bind
  • postfix
  • spamassassin
  • postgresql

everything else is sugar on top we can add as needed eventually or that is covered by other tickets.

comment:2 Changed 5 months ago by anarcat

Description: modified (diff)
Owner: changed from tpa to anarcat
Status: changed from new to assigned

comment:3 Changed 5 months ago by anarcat

i added the postfix exporter on all mail_processing nodes. it's not all postfix servers, because, well, postfix is installed *everywhere*. i figured i would keep this only to a subset.

this was a little trickier than the apache exporter, because the postfix exporter is not available in stretch, so I had to do an unofficial backport, which i uploaded to db.torproject.org. i also took this opportunity to upload an NMU of the package in Debian, since it was out of date.

i imported the dashboard from this issue but it doesn't work out of the box, we'll probably have to write our own. i started working on one, but it's incomplete.

next step is to finish those graphs and deploy exporters for bind and (maybe?) pgsql.

comment:4 Changed 4 months ago by anarcat

postfix dashboard now online: https://grafana.torproject.org/d/h36Havfik/postfix

contributed upstream in https://grafana.com/dashboards/10013 as well

next up are bind and pgsql

comment:5 Changed 4 months ago by anarcat

upgraded nutans to buster to test the prometheus-bind-exporter package without having to mess around with backporting. things seem to work well. added the statistics-channel internal service (on localhost) so the exporter can slurp in the stats, and the metrics are now available locally.

https://kb.isc.org/docs/aa-00769

next step is to add the bind exporter to the prometheus module (like we did for postfix) and a dashboard to prometheus. for non-buster hosts, backport the package into db.tpo as well. upgrade (say) half the boxes to buster and leave the others on stretch to be on the safe side (say fallax and nova, because they might be removed anyways, and nevii, because it is the master and so more complicated).

comment:6 Changed 4 months ago by anarcat

Resolution: fixed
Status: changed from assigned to closed

i have finished working on the dashboard, which required extensive enough fixing that i ended up publishing a fork at https://grafana.com/dashboards/10024

i have also rebuilt the bind exporter and uploaded it to db.torproject.org. i did that because I mistakenly thought I needed a full source upload with .changes and .dsc and .orig.tgz files to get the package to propagate. as it turns out, it wasn't showing up on nova because it is a 32-bit system and i had only uploaded the amd64 binary.

so this is now fixed and the exporter should be deployed on all nameservers soon.

given the time I have had to spend to deploy this exporter and fix the dashboards, I'm not sure it's worth deploying the other ones. so until we have concrete problems we're trying to solve, I'll consider this particular ticket done. we have a reference of good exporters and dashboards for the remaining services anyways.

comment:7 Changed 4 months ago by anarcat

note that i pushed the grafana dashboards to a public repo here:

https://gitlab.com/anarcat/grafana-dashboards

i haven't used git.tpo because I want the community to possibly send us patches and pull requests, and it seems Trac is a bit heavy for that (and gitlab is not quite ready yet).

comment:8 Changed 4 months ago by anarcat

Resolution: fixed
Status: changed from closed to reopened

i'm looking at the psql exporter again because of a recent trac crash which i suspect might be related.

comment:9 Changed 4 months ago by anarcat

Description: modified (diff)
Resolution: fixed
Status: changed from reopened to closed

i deployed the psql exporter by hand on troodi. this required the following magic sql, run as the postgres user (sudo -u postgres psql):

  CREATE USER prometheus;
  ALTER USER prometheus SET SEARCH_PATH TO prometheus,pg_catalog;
  
  CREATE SCHEMA prometheus AUTHORIZATION prometheus;
  
  CREATE FUNCTION prometheus.f_select_pg_stat_activity()
  RETURNS setof pg_catalog.pg_stat_activity
  LANGUAGE sql
  SECURITY DEFINER
  AS $$
    SELECT * from pg_catalog.pg_stat_activity;
  $$;
  
  CREATE FUNCTION prometheus.f_select_pg_stat_replication()
  RETURNS setof pg_catalog.pg_stat_replication
  LANGUAGE sql
  SECURITY DEFINER
  AS $$
    SELECT * from pg_catalog.pg_stat_replication;
  $$;
  
  CREATE VIEW prometheus.pg_stat_replication
  AS
    SELECT * FROM prometheus.f_select_pg_stat_replication();
  
  CREATE VIEW prometheus.pg_stat_activity
  AS
    SELECT * FROM prometheus.f_select_pg_stat_activity();
  
  GRANT SELECT ON prometheus.pg_stat_replication TO prometheus;
  GRANT SELECT ON prometheus.pg_stat_activity TO prometheus;

then the following in /etc/default/prometheus-postgres-exporter:

DATA_SOURCE_NAME='user=prometheus host=/run/postgresql dbname=postgres'

Finally, I have deployed that configuration file through puppet. Remaining steps are to figure out how the heck to load that custom SQL in the server correctly and to deploy the exporter package properly.

There's a postgresql::psql resource which we might use to load the blurb, for what it's worth. We might also want to set a password on that user, although the README.Debian provided with the exporter says it doesn't really need a password, presumably because its only access is to read-only stats.
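
Once the prometheus role and the DATA_SOURCE_NAME setting are in place, one way to check that the exporter can actually log in is to look at the pg_up metric it publishes. The following is a rough sketch under assumptions: it expects the exporter on its default port (9187) on localhost, the helper names are made up, and the parser only handles simple samples (no spaces inside label values).

```python
import urllib.request

# Assumed location of the exporter; adjust if it listens elsewhere.
EXPORTER_URL = "http://localhost:9187/metrics"


def metric_value(exposition_text, name):
    """Return the first sample value for metric `name`, or None."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue
        # Accept both "name value" and "name{labels} value".
        if parts[0] == name or parts[0].startswith(name + "{"):
            return float(parts[1])
    return None


def exporter_is_up(url=EXPORTER_URL):
    """Fetch the metrics page and report whether pg_up is 1."""
    with urllib.request.urlopen(url, timeout=5) as response:
        text = response.read().decode("utf-8")
    return metric_value(text, "pg_up") == 1.0


if __name__ == "__main__":
    print("exporter can reach postgres:", exporter_is_up())
```

A pg_up value of 0 would usually point at the SQL above not being loaded or at a wrong DATA_SOURCE_NAME.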

I've also deployed the most popular psql dashboard (at the time of writing) in grafana. it provides basic stats and mostly works, but i've asked upstream for other suggestions.

it should also be noted that other debian fellows use the more generic sql exporter to do their magic sql stuff, which means they can deploy the same exporter everywhere, and just need to have the right SQL magic strings in a config file somewhere depending on the server backend. this is, in particular, what the folks at credative are doing with their elephant shed, which provides a grafana dashboard and sql exporter config.

that seems like a reasonable approach we could consider if we want to support mariadb as well in the future, but for now i focused on something that would just work.
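
To make the generic approach more concrete, here is a toy sketch of the idea: one collector, with the backend-specific SQL kept in a config mapping. This is not the configuration format of the sql exporter or of elephant shed; the mapping, metric names and queries are invented for the example, and sqlite3 stands in for a real server so the code runs on its own.

```python
import sqlite3

# Hypothetical config: backend -> {metric name: SQL returning one number}.
QUERIES = {
    "sqlite": {
        "db_tables": "SELECT count(*) FROM sqlite_master WHERE type = 'table'",
    },
    "postgresql": {
        "db_connections": "SELECT count(*) FROM pg_stat_activity",
    },
}


def collect(conn, backend):
    """Run every query configured for `backend`; return {metric: value}."""
    results = {}
    cursor = conn.cursor()
    for metric, sql in QUERIES[backend].items():
        cursor.execute(sql)
        row = cursor.fetchone()
        results[metric] = float(row[0])
    return results


def demo():
    """Collect the sqlite metrics from a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (id INTEGER)")
        return collect(conn, "sqlite")
    finally:
        conn.close()


if __name__ == "__main__":
    print(demo())
```

The collector code stays the same for every backend; only the connection and the entry picked from the mapping change, which is the property that makes this approach attractive for a mixed postgresql and mariadb fleet.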

comment:10 Changed 4 months ago by anarcat

Description: modified (diff)