Opened 11 months ago

Last modified 9 days ago

#31159 needs_information task

Monitor anti-censorship www services with prometheus

Reported by: phw Owned by: hiro
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords: tpa-roadmap-may
Cc: hiro, gman999 Actual Points:
Parent ID: Points: 1
Reviewer: Sponsor:

Description (last modified by phw)

In the anti-censorship team we currently monitor several services with sysmon. We recently discovered that sysmon doesn't seem to follow HTTP 301 redirects. This means that if a web service dies but the 301 redirect still works (e.g., BridgeDB is dead but its apache reverse proxy still works), sysmon won't notice.

Now that prometheus is running, we should fill this monitoring gap by testing the following web sites:

Our test should ensure that these sites serve the content we expect, e.g., make sure that bridges.tp.o contains the string "BridgeDB" in its HTML. Testing the HTTP status code does not suffice: if BridgeDB is down, the reverse proxy may still respond.
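One way to express such a content check with Prometheus's blackbox exporter (a sketch; the module name and regexp are illustrative, not an existing config) is an HTTP probe module that fails unless the body matches a string:

```yaml
# Hypothetical blackbox exporter module: the probe succeeds only if the
# page returns a 2xx status AND the body contains the string "BridgeDB".
modules:
  http_bridgedb_content:
    prober: http
    timeout: 10s
    http:
      fail_if_body_not_matches_regexp:
        - "BridgeDB"
```

This catches the reverse-proxy-alive-but-backend-dead case that a plain status-code check misses.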

I wonder if prometheus could also help us with #12802 by sending an email to bridges@… and making sure that it responds with at least one bridge?

Checklist:

  1. [ ] monitor services in Nagios: BridgeDB, Snowflake, and GetTor
  2. [ ] deploy Prometheus's "blackbox exporter" for default bridges, which are external services
  3. [ ] delegate to (and train) the anti-censorship team the blackbox exporter configuration
  4. [ ] experiment with Prometheus's "alertmanager", which can send notifications if a monitoring target goes offline
  5. [X] grant the anti-censorship team access to Prometheus's grafana dashboard.

Child Tickets

Attachments (1)

mmonit.png (99.1 KB) - added by phw 3 months ago.
mmonit Web UI


Change History (27)

comment:1 Changed 11 months ago by hiro

Hi,

There are a few things you are asking for that we might be able to do with Prometheus, and some that we can't do at the moment.

For example, we cannot send an email and parse the result, because Prometheus works by scraping HTTP endpoints. Also, we are not doing alerting yet, only monitoring.

There is another aspect to consider: in the case of a service like GetTor, monitoring the HTTPS endpoint will only give us information about the static HTML we are serving with Apache. GetTor itself (the service that sends the emails) is a separate Twisted service.

Maybe we can consider an approach in which services expose an HTTP endpoint that we can use to know that the service is alive. Otherwise, I think we could do some other monitoring via Nagios checks.
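As an illustration of that approach (a hypothetical sketch, not GetTor's actual code; the port and path are arbitrary), a daemon could expose a tiny liveness endpoint for Prometheus or Nagios to poll:

```python
# Minimal HTTP liveness endpoint a service process could expose so an
# external monitor can tell the daemon itself (not just the web proxy)
# is alive. Sketch only: port 8080 and the /health path are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"OK"
            self.send_response(200)
        else:
            body = b"not found"
            self.send_response(404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # keep periodic monitoring probes out of the logs
        pass

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```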

More about what can be done with Prometheus (and the instance I am setting up at the moment) is on https://trac.torproject.org/projects/tor/ticket/29863, where we discuss the idea of monitoring disk space for Snowflake.
See also https://help.torproject.org/tsa/howto/prometheus/ for more information.

comment:2 in reply to:  1 ; Changed 10 months ago by phw

Replying to hiro:

There is another aspect to consider: in the case of a service like GetTor, monitoring the HTTPS endpoint will only give us information about the static HTML we are serving with Apache. GetTor itself (the service that sends the emails) is a separate Twisted service.


Gotcha. We have a similar problem with BridgeDB because it is exposed over an Apache reverse proxy and you cannot directly talk to BridgeDB. However, if BridgeDB is down, bridges.torproject.org responds with an internal server error if I remember correctly, so we can still monitor BridgeDB despite the reverse proxy, right?

To monitor BridgeDB, we need to set up an exporter, right?

Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.


I think we already have that for BridgeDB and snowflake's website but not for GetTor.

comment:3 in reply to:  2 Changed 10 months ago by anarcat

Replying to phw:

Replying to hiro:

There is another aspect to consider: in the case of a service like GetTor, monitoring the HTTPS endpoint will only give us information about the static HTML we are serving with Apache. GetTor itself (the service that sends the emails) is a separate Twisted service.


Gotcha. We have a similar problem with BridgeDB because it is exposed over an Apache reverse proxy and you cannot directly talk to BridgeDB. However, if BridgeDB is down, bridges.torproject.org responds with an internal server error if I remember correctly, so we can still monitor BridgeDB despite the reverse proxy, right?

Should, yes.

To monitor BridgeDB, we need to set up an exporter, right?

In Prometheus, yes. This could be a simple configuration in a "blackbox exporter":

https://github.com/prometheus/blackbox_exporter/
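For reference, the Prometheus side of such a setup is typically a scrape job that relabels each target into the exporter's probe URL (a sketch of the standard pattern; the exporter address and module name are placeholders, not this instance's actual config):

```yaml
# Standard blackbox exporter scrape pattern (sketch; hostnames and the
# exporter address are placeholders).
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://bridges.torproject.org/
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
```

The relabeling sends the real target as a `?target=` parameter to the exporter rather than scraping the target directly.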

Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.

I think we already have that for BridgeDB and snowflake's website but not for GetTor.

From what I can tell, we check bridges.torproject.org:

  -
    name: bridges.tpo web service
    nrpe: "/usr/lib/nagios/plugins/check_http -H bridges.torproject.org -S --string=bridge"
    hosts: polyanthum
    depends: network service - https

We also check onionoo:

 # non-tpa services
 ####
  -
    name: network service - onionoo backend
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:8080"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics
  -
    name: network service - onionoo varnish
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:6081"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics
  -
    name: network service - onionoo haproxy
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo -s onionoo.torproject.org"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics

... but those are all TPA machines, so they can be monitored by Nagios.

comment:4 Changed 6 months ago by phw

I closed #32679 as a duplicate of this ticket. Here's a brief summary of what we concluded in #32679:

  • We will use Nagios for internal services: BridgeDB, Snowflake, and GetTor.
  • We will use Prometheus's "blackbox exporter" for default bridges, which are external services.
  • Our admins will handle our Nagios config and the anti-censorship team will handle Prometheus's blackbox exporter.
  • We will experiment with Prometheus's "alertmanager", which can send notifications if a monitoring target goes offline.
  • The anti-censorship team will also get access to Prometheus's grafana dashboard.

For Nagios, here are our monitoring targets:

Note that the strings that should be present in the respective pages are mere suggestions. Ultimately, we just need a test that guarantees these pages are serving the content we expect.
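Following the pattern of the existing bridges.tpo check in comment:3, the additional targets could look something like this (a sketch only; the hostnames, `--string` values, and `hosts` entries are illustrative placeholders):

```yaml
# Sketch: additional Nagios checks in the same style as the existing
# bridges.tpo entry. Hostnames, strings, and host placeholders are
# suggestions, not actual config.
  -
    name: snowflake.tpo web service
    nrpe: "/usr/lib/nagios/plugins/check_http -H snowflake.torproject.org -S --string=Snowflake"
    hosts: SNOWFLAKE-HOST
    depends: network service - https
  -
    name: gettor.tpo web service
    nrpe: "/usr/lib/nagios/plugins/check_http -H gettor.torproject.org -S --string=GetTor"
    hosts: GETTOR-HOST
    depends: network service - https
```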

comment:5 Changed 6 months ago by anarcat

Description: modified (diff)
Owner: changed from tpa to hiro
Status: new → assigned

awesome summary, thanks. i turned that into a checklist and assigned the ticket to hiro who, I think, will handle followup on this. hiro, let me know if you need help or if any of this is incorrect...

comment:6 Changed 4 months ago by gaba

Keywords: tpa-roadmap-february added

comment:7 Changed 4 months ago by gaba

Keywords: tpa-roadmap-march added

comment:8 Changed 4 months ago by hiro

Hi,
This is now available here: https://prometheus2.torproject.org/targets
Grafana: https://grafana2.torproject.org/d/NgEq8C0Zz/blackbox-exporter?orgId=1
I'll share the password separately.

comment:9 in reply to:  8 Changed 4 months ago by phw

Description: modified (diff)

Replying to hiro:

Hi,
This is now available here: https://prometheus2.torproject.org/targets
Grafana: https://grafana2.torproject.org/d/NgEq8C0Zz/blackbox-exporter?orgId=1
I'll share the password separately.


Thanks! I checked the grafana box on our todo list in the ticket description because we now have access to it.

I see that BridgeDB is already being monitored. Are we able to add our own targets to Prometheus?

comment:10 Changed 3 months ago by hiro

I can give you access to the machine and we can think of a way to do this, but it would be better if you could pass me the targets and I add them to Puppet directly. How does that sound?

comment:11 in reply to:  10 Changed 3 months ago by phw

Replying to hiro:

I can give you access to the machine and we can think of a way to do this, but it would be better if you could pass me the targets and I add them to Puppet directly. How does that sound?


Hmm, ok. Note that the entire reason for filing #32679 was that I wanted our team to have control over the list of monitoring targets, so we don't have to block on others. But we can go with your plan for now and see how it goes.

The list of default bridges is available in a table on this wiki page. Please ignore the last two rows in the table, 0.0.2.0:2 and 0.0.3.0:1; these are pseudo IP addresses.

comment:12 Changed 3 months ago by hiro

Hi phw,
This is all configured now. It is quite quick for us to add targets, and, as I mentioned, maybe we can give up on using Puppet for this and just let you edit the configuration file directly. Let's see how it goes.

Changed 3 months ago by phw

Attachment: mmonit.png added

mmonit Web UI

comment:13 in reply to:  12 Changed 3 months ago by phw

Replying to hiro:

This is all configured now. It is quite quick for us to add targets, and, as I mentioned, maybe we can give up on using Puppet for this and just let you edit the configuration file directly. Let's see how it goes.


Thanks!

I took a look at the Grafana dashboard and found it difficult to interpret the data. For example, 146.57.248.225:22 is currently offline and the panels don't reveal that. I understand that one can add panels (I think I would like an "Alert List") but I'm struggling with creating one.

I would like something similar to the following UI. Is this something you can help with?

mmonit Web UI

comment:14 Changed 2 months ago by phw

Status: assigned → needs_information

comment:15 Changed 6 weeks ago by anarcat

Status: needs_information → needs_review

this is indeed a complex panel to create! i managed to make one using "singlestat" - I couldn't figure out how to make the "alert list" thing work - but it's kind of clunky:

https://grafana2.torproject.org/d/fC77Nk6Wz/blackbox-probe-state

now after asking on #prometheus (freenode), i was told there's a Grafana plugin specifically for that purpose. it's really heavy on the Javascript, but it seems to actually work and provides a much better visualization. here's the dashboard I created with the plugin:

https://grafana2.torproject.org/d/6shXNz6Wz/blackbox-target-availability

the plugin is:

https://grafana.com/grafana/plugins/flant-statusmap-panel/installation

i installed it with:

sudo -u grafana grafana-cli plugins install flant-statusmap-panel
service grafana-server stop
service grafana-server start

... which needs to be added into Puppet if we're happy with the results.

let me know how that looks for you.

(and yes, it does seem like all blackbox targets except bridges.tpo are down.)

comment:16 Changed 6 weeks ago by hiro

I am ok with this if people are happy with the result. I will add it to puppet.

comment:17 in reply to:  16 Changed 6 weeks ago by phw

Status: needs_review → needs_information

Replying to hiro:

I am ok with this if people are happy with the result. I will add it to puppet.


The blackbox-target-availability plugin looks great and solves this problem. However, our default bridges aren't all down (only 146.57.248.225 is, as of 2020-04-27), so there seems to be an error with the blackbox exporter?

comment:18 Changed 5 weeks ago by hiro

Hi phw.
I see the hosts online on Prometheus. I think the issue is that those panels are checking HTTP/HTTPS connectivity.
The blackbox exporter I had configured instead checks general TCP connectivity. Here is an example where the host is clearly online:
https://grafana2.torproject.org/d/NgEq8C0Zz/blackbox-exporter?orgId=1&var-blackboxes=All&var-targets=216.252.162.21:46089&var-interval=5s
Let me see how that Grafana plugin can monitor TCP and not just HTTP/S.
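For context, the distinction described here maps onto two different blackbox exporter prober modules (a sketch of the standard configuration, not necessarily this instance's exact config):

```yaml
modules:
  http_2xx:       # probes an HTTP(S) URL and expects a 2xx response
    prober: http
    timeout: 10s
  tcp_connect:    # only checks that the TCP port accepts a connection
    prober: tcp
    timeout: 10s
```

A dashboard built on `http_2xx` probes will show a plain TCP service (like an SSH port on a default bridge) as down even when it is reachable, which would explain the discrepancy.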

comment:19 Changed 5 weeks ago by hiro

Keywords: tpa-roadmap-may added; tpa-roadmap-february tpa-roadmap-march removed

comment:20 Changed 4 weeks ago by hiro

Hi all,
I think this is now working: https://grafana2.torproject.org/d/6shXNz6Wz/blackbox-target-availability?orgId=1
Can you please confirm?

comment:21 in reply to:  20 Changed 4 weeks ago by phw

Replying to hiro:

Hi all,
I think this is now working: https://grafana2.torproject.org/d/6shXNz6Wz/blackbox-target-availability?orgId=1
Can you please confirm?


Hi hiro! Hmm, 146.57.248.225:22 is shown as online but it's currently offline (its port times out and it doesn't respond to pings).

comment:23 Changed 4 weeks ago by hiro

Ok, I see the tcp_connect probe at 0, so that's probably a good indicator in this case, unlike the others.

comment:25 in reply to:  24 Changed 9 days ago by phw

Replying to hiro:

I think this does it: https://grafana2.torproject.org/d/TVZiAIRGz/bridges?orgId=1


This looks good, thanks hiro.

In the meantime, we've set up a monit instance on my VPS, which is now monitoring all of our anti-censorship infrastructure. Frankly, this works better for us than Prometheus: it's simple, effective, and we control it. There's some merit in having Prometheus monitor our infrastructure, but given that the sysadmin team is stretched thin, I'm inclined to close this ticket as "wontfix".

Thoughts?

comment:26 Changed 9 days ago by phw

Parent ID: #30152

Removing the parent (because we are already monitoring our anti-censorship infrastructure).
