Opened 11 months ago

Last modified 9 days ago

#31159 needs_information task

Monitor anti-censorship www services with prometheus

Reported by: phw Owned by: hiro
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords: tpa-roadmap-may
Cc: hiro, gman999 Actual Points:
Parent ID: Points: 1
Reviewer: Sponsor:

Description (last modified by phw)

In the anti-censorship team we currently monitor several services with sysmon. We recently discovered that sysmon doesn't seem to follow HTTP 301 redirects. This means that if a web service dies but the 301 redirect still works (e.g., BridgeDB is dead but its apache reverse proxy still works), sysmon won't notice.

Now that prometheus is running, we should fill this monitoring gap by testing the following web sites:

Our test should ensure that these sites serve the content we expect, e.g., make sure that bridges.tp.o contains the string "BridgeDB" in its HTML. Testing the HTTP status code does not suffice: if BridgeDB is down, the reverse proxy may still respond.
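One way to express such a content check with Prometheus's blackbox exporter (a sketch; the module name and regexp are illustrative, not an existing config) is an HTTP probe module that fails unless the body matches a string:

```yaml
# Hypothetical blackbox exporter module: the probe succeeds only if the
# page returns a 2xx status AND the body contains the string "BridgeDB".
modules:
  http_bridgedb_content:
    prober: http
    timeout: 10s
    http:
      fail_if_body_not_matches_regexp:
        - "BridgeDB"
```

This catches the reverse-proxy-alive-but-backend-dead case that a plain status-code check misses.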

I wonder if prometheus could also help us with #12802 by sending an email to bridges@… and making sure that it responds with at least one bridge?

Checklist:

  1. [ ] monitor services in Nagios: BridgeDB, Snowflake, and GetTor
  2. [ ] deploy Prometheus's "blackbox exporter" for default bridges, which are external services
  3. [ ] delegate to (and train) the anti-censorship team the blackbox exporter configuration
  4. [ ] experiment with Prometheus's "alertmanager", which can send notifications if a monitoring target goes offline
  5. [X] grant the anti-censorship team access to Prometheus's grafana dashboard.

Child Tickets

Attachments (1)

mmonit.png (99.1 KB) - added by phw 3 months ago.
mmonit Web UI


Change History (27)

comment:1 Changed 11 months ago by hiro

Hi,

There are a few things you are asking for that we might be able to do with Prometheus, and some that we can't do at the moment.

For example, we cannot send an email and parse the result, because Prometheus works by scraping HTTP endpoints. Also, we are not doing alerting yet, only monitoring.

There is another aspect to consider: in the case of a service like GetTor, monitoring the HTTPS endpoint will only give us information about the static HTML we are serving with Apache. GetTor itself (the service that sends the emails) is a separate Twisted service.

Maybe we can consider an approach in which services expose an HTTP endpoint that we can use to know that the service is alive. Otherwise, I think we could do some other monitoring via Nagios checks.
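As an illustration of that approach (a hypothetical sketch, not GetTor's actual code; the port and path are arbitrary), a daemon could expose a tiny liveness endpoint for Prometheus or Nagios to poll:

```python
# Minimal HTTP liveness endpoint a service process could expose so an
# external monitor can tell the daemon itself (not just the web proxy)
# is alive. Sketch only: port 8080 and the /health path are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"OK"
            self.send_response(200)
        else:
            body = b"not found"
            self.send_response(404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # keep periodic monitoring probes out of the logs
        pass

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```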

More about what can be done with Prometheus (and the instance I am setting up at the moment) is on https://trac.torproject.org/projects/tor/ticket/29863, where we discuss the idea of monitoring disk space for Snowflake.
See also https://help.torproject.org/tsa/howto/prometheus/ for more information.

comment:2 in reply to:  1 ; Changed 10 months ago by phw

Replying to hiro:

There is another aspect to consider: in the case of a service like GetTor, monitoring the HTTPS endpoint will only give us information about the static HTML we are serving with Apache. GetTor itself (the service that sends the emails) is a separate Twisted service.


Gotcha. We have a similar problem with BridgeDB because it is exposed over an Apache reverse proxy and you cannot directly talk to BridgeDB. However, if BridgeDB is down, bridges.torproject.org responds with an internal server error if I remember correctly, so we can still monitor BridgeDB despite the reverse proxy, right?

To monitor BridgeDB, we need to set up an exporter, right?

Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.


I think we already have that for BridgeDB and snowflake's website but not for GetTor.

comment:3 in reply to:  2 Changed 10 months ago by anarcat

Replying to phw:

Replying to hiro:

There is another aspect to consider: in the case of a service like GetTor, monitoring the HTTPS endpoint will only give us information about the static HTML we are serving with Apache. GetTor itself (the service that sends the emails) is a separate Twisted service.


Gotcha. We have a similar problem with BridgeDB because it is exposed over an Apache reverse proxy and you cannot directly talk to BridgeDB. However, if BridgeDB is down, bridges.torproject.org responds with an internal server error if I remember correctly, so we can still monitor BridgeDB despite the reverse proxy, right?

Should, yes.

To monitor BridgeDB, we need to set up an exporter, right?

In Prometheus, yes. This could be a simple configuration in a "blackbox exporter":

https://github.com/prometheus/blackbox_exporter/
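For reference, the Prometheus side of such a setup is typically a scrape job that relabels each target into the exporter's probe URL (a sketch of the standard pattern; the exporter address and module name are placeholders, not this instance's actual config):

```yaml
# Standard blackbox exporter scrape pattern (sketch; hostnames and the
# exporter address are placeholders).
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://bridges.torproject.org/
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
```

The relabeling sends the real target as a `?target=` parameter to the exporter rather than scraping the target directly.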

Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.

I think we already have that for BridgeDB and snowflake's website but not for GetTor.

From what I can tell, we check bridges.torproject.org:

  -
    name: bridges.tpo web service
    nrpe: "/usr/lib/nagios/plugins/check_http -H bridges.torproject.org -S --string=bridge"
    hosts: polyanthum
    depends: network service - https

We also check onionoo:

 # non-tpa services
 ####
  -
    name: network service - onionoo backend
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:8080"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics
  -
    name: network service - onionoo varnish
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:6081"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics
  -
    name: network service - onionoo haproxy
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo -s onionoo.torproject.org"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics

... but those are all TPA machines, so they can be monitored by Nagios.

comment:4 Changed 6 months ago by phw

I closed #32679 as a duplicate of this ticket. Here's a brief summary of what we concluded in #32679:

  • We will use Nagios for internal services: BridgeDB, Snowflake, and GetTor.
  • We will use Prometheus's "blackbox exporter" for default bridges, which are external services.
  • Our admins will handle our Nagios config and the anti-censorship team will handle Prometheus's blackbox exporter.
  • We will experiment with Prometheus's "alertmanager", which can send notifications if a monitoring target goes offline.
  • The anti-censorship team will also get access to Prometheus's grafana dashboard.

For Nagios, here are our monitoring targets:

Note that the strings that should be present in the respective pages are mere suggestions. Ultimately, we just need a test that guarantees these pages are serving the content we expect.
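Following the pattern of the existing bridges.tpo check in comment:3, the additional targets could look something like this (a sketch only; the hostnames, `--string` values, and `hosts` entries are illustrative placeholders):

```yaml
# Sketch: additional Nagios checks in the same style as the existing
# bridges.tpo entry. Hostnames, strings, and host placeholders are
# suggestions, not actual config.
  -
    name: snowflake.tpo web service
    nrpe: "/usr/lib/nagios/plugins/check_http -H snowflake.torproject.org -S --string=Snowflake"
    hosts: SNOWFLAKE-HOST
    depends: network service - https
  -
    name: gettor.tpo web service
    nrpe: "/usr/lib/nagios/plugins/check_http -H gettor.torproject.org -S --string=GetTor"
    hosts: GETTOR-HOST
    depends: network service - https
```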

comment:5 Changed 6 months ago by anarcat

Description: modified (diff)
Owner: changed from tpa to hiro
Status: new → assigned

awesome summary, thanks. i turned that into a checklist and assigned the ticket to hiro who, I think, will handle followup on this. hiro, let me know if you need help or if any of this is incorrect...

comment:6 Changed 4 months ago by gaba

Keywords: tpa-roadmap-february added

comment:7 Changed 4 months ago by gaba

Keywords: tpa-roadmap-march added

comment:8 Changed 4 months ago by hiro

Hi,
This is now available here: https://prometheus2.torproject.org/targets
Grafana: https://grafana2.torproject.org/d/NgEq8C0Zz/blackbox-exporter?orgId=1
I'll share the password separately.

comment:9 in reply to:  8 Changed 4 months ago by phw

Description: modified (diff)

Replying to hiro:

Hi,
This is now available here: https://prometheus2.torproject.org/targets
Grafana: https://grafana2.torproject.org/d/NgEq8C0Zz/blackbox-exporter?orgId=1
I'll share the password separately.


Thanks! I checked the grafana box on our todo list in the ticket description because we now have access to it.

I see that BridgeDB is already being monitored. Are we able to add our own targets to Prometheus?

comment:10 Changed 3 months ago by hiro

I can give you access to the machine and we can think of a way to do this, but it would be better if you could pass me the targets and I add them to Puppet directly. How does that sound?

comment:11 in reply to:  10 Changed 3 months ago by phw

Replying to hiro:

I can give you access to the machine and we can think of a way to do this, but it would be better if you could pass me the targets and I add them to Puppet directly. How does that sound?


Hmm, ok. Note that the entire reason for filing #32679 was that I wanted our team to have control over the list of monitoring targets, so we don't have to block on others. But we can go with your plan for now and see how it goes.

The list of default bridges is available in a table on this wiki page. Please ignore the last two rows in the table, 0.0.2.0:2 and 0.0.3.0:1; these are pseudo IP addresses.

comment:12 Changed 3 months ago by hiro

Hi phw,
This is all configured now. It is quite quick for us to add targets, and, as I mentioned, maybe we can give up on using Puppet for this and just let you edit the configuration file directly. Let's see how it goes.

Changed 3 months ago by phw

Attachment: mmonit.png added

mmonit Web UI

comment:13 in reply to:  12 Changed 3 months ago by phw

Replying to hiro:

This is all configured now. It is quite quick for us to add targets, and, as I mentioned, maybe we can give up on using Puppet for this and just let you edit the configuration file directly. Let's see how it goes.


Thanks!

I took a look at the Grafana dashboard and found it difficult to interpret the data. For example, 146.57.248.225:22 is currently offline and the panels don't reveal that. I understand that one can add panels (I think I would like an "Alert List") but I'm struggling with creating one.

I would like something similar to the following UI. Is this something you can help with?

mmonit Web UI

comment:14 Changed 2 months ago by phw

Status: assigned → needs_information

comment:15 Changed 6 weeks ago by anarcat

Status: needs_information → needs_review

this is indeed a complex panel to create! i managed to make one using "singlestat" - I couldn't figure out how to make the "alert list" thing work - but it's kind of clunky:

https://grafana2.torproject.org/d/fC77Nk6Wz/blackbox-probe-state

now after asking on #prometheus (freenode), i was told there's a Grafana plugin specifically for that purpose. it's really heavy on the Javascript, but it seems to actually work and provides a much better visualization. here's the dashboard I created with the plugin:

https://grafana2.torproject.org/d/6shXNz6Wz/blackbox-target-availability

the plugin is:

https://grafana.com/grafana/plugins/flant-statusmap-panel/installation

i installed it with:

sudo -u grafana grafana-cli plugins install flant-statusmap-panel
service grafana-server stop
service grafana-server start

... which needs to be added into Puppet if we're happy with the results.

let me know how that looks for you.

(and yes, it does seem like all blackbox targets except bridges.tpo are down.)

comment:16 Changed 6 weeks ago by hiro

I am ok with this if people are happy with the result. I will add it to puppet.

comment:17 in reply to:  16 Changed 6 weeks ago by phw

Status: needs_review → needs_information

Replying to hiro:

I am ok with this if people are happy with the result. I will add it to puppet.


The blackbox-target-availability plugin looks great and solves this problem. However, our default bridges aren't all down (only 146.57.248.225 is, as of 2020-04-27), so there seems to be an error with the blackbox exporter?

comment:18 Changed 5 weeks ago by hiro

Hi phw.
I see the hosts online on Prometheus. I think the issue is that those panels are checking HTTP/HTTPS connectivity.
The blackbox exporter I had configured instead checks general TCP connectivity. Here is an example where the host is clearly online:
https://grafana2.torproject.org/d/NgEq8C0Zz/blackbox-exporter?orgId=1&var-blackboxes=All&var-targets=216.252.162.21:46089&var-interval=5s
Let me see how that Grafana plugin can monitor TCP and not just HTTP/S.
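For context, the distinction described here maps onto two different blackbox exporter prober modules (a sketch of the standard configuration, not necessarily this instance's exact config):

```yaml
modules:
  http_2xx:       # probes an HTTP(S) URL and expects a 2xx response
    prober: http
    timeout: 10s
  tcp_connect:    # only checks that the TCP port accepts a connection
    prober: tcp
    timeout: 10s
```

A dashboard built on `http_2xx` probes will show a plain TCP service (like an SSH port on a default bridge) as down even when it is reachable, which would explain the discrepancy.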

comment:19 Changed 5 weeks ago by hiro

Keywords: tpa-roadmap-may added; tpa-roadmap-february tpa-roadmap-march removed

comment:20 Changed 4 weeks ago by hiro

Hi all,
I think this is now working: https://grafana2.torproject.org/d/6shXNz6Wz/blackbox-target-availability?orgId=1
Can you please confirm?

comment:21 in reply to:  20 Changed 4 weeks ago by phw

Replying to hiro:

Hi all,
I think this is now working: https://grafana2.torproject.org/d/6shXNz6Wz/blackbox-target-availability?orgId=1
Can you please confirm?


Hi hiro! Hmm, 146.57.248.225:22 is shown as online but it's currently offline (its port times out and it doesn't respond to pings).

comment:23 Changed 4 weeks ago by hiro

Ok, I see the tcp_connect probe at 0, so that's probably a good indicator in this case, unlike the others.

comment:25 in reply to:  24 Changed 9 days ago by phw

Replying to hiro:

I think this does it: https://grafana2.torproject.org/d/TVZiAIRGz/bridges?orgId=1


This looks good, thanks hiro.

In the meantime, we've set up a monit instance on my VPS, which is now monitoring all of our anti-censorship infrastructure. Frankly, this works better for us than Prometheus: it's simple, effective, and we control it. There's some merit in having Prometheus monitor our infrastructure, but given that the sysadmin team is stretched thin, I'm inclined to close this ticket as "wontfix".

Thoughts?

comment:26 Changed 9 days ago by phw

Parent ID: #30152

Removing the parent (because we are already monitoring our anti-censorship infrastructure).
