Opened 7 months ago

Last modified 37 hours ago

#31159 assigned task

Monitor anti-censorship www services with prometheus

Reported by: phw
Owned by: hiro
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords: tpa-roadmap-february
Cc: hiro, gman999
Actual Points:
Parent ID: #30152
Points: 1
Reviewer:
Sponsor:

Description (last modified by anarcat)

In the anti-censorship team we currently monitor several services with sysmon. We recently discovered that sysmon doesn't seem to follow HTTP 301 redirects. This means that if a web service dies but the 301 redirect still works (e.g., BridgeDB is dead but its apache reverse proxy still works), sysmon won't notice.

Now that prometheus is running, we should fill this monitoring gap by testing the following web sites:

Our test should ensure that these sites serve the content we expect, e.g., make sure that bridges.tp.o contains the string "BridgeDB" in its HTML. Testing the HTTP status code does not suffice: if BridgeDB is down, the reverse proxy may still respond.

I wonder if prometheus could also help us with #12802 by sending an email to bridges@… and making sure that it responds with at least one bridge?

Checklist:

  1. [ ] monitor services in Nagios: BridgeDB, Snowflake, and GetTor
  2. [ ] deploy Prometheus's "blackbox exporter" for default bridges, which are external services
  3. [ ] delegate the blackbox exporter configuration to the anti-censorship team (and train them on it)
  4. [ ] experiment with Prometheus's "alertmanager", which can send notifications if a monitoring target goes offline
  5. [ ] grant the anti-censorship team access to Prometheus's Grafana dashboard
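
For the alertmanager experiment in item 4, a Prometheus alerting rule along these lines could serve as a starting point. It is only a sketch, not the TPA configuration: the group name, alert name, `for` duration, and `severity` label are assumptions. `probe_success` is the metric the blackbox exporter exposes for each probe.

```yaml
# Hypothetical rules file, loaded via rule_files in prometheus.yml.
groups:
  - name: anti_censorship_probes        # assumed group name
    rules:
      - alert: ProbeFailed              # assumed alert name
        # probe_success is 1 when the blackbox probe succeeded, 0 otherwise.
        expr: probe_success == 0
        # Fire only after the probe has failed for 5 minutes (assumed threshold).
        for: 5m
        labels:
          severity: warning             # assumed label, usable for routing in alertmanager
        annotations:
          summary: "Probe failed for {{ $labels.instance }}"
```

Prometheus evaluates the rule and forwards firing alerts to alertmanager, which decides who gets notified and how.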

Change History (6)

comment:1 Changed 6 months ago by hiro

Hi,

There are a few things about what you are asking that we might be able to do with prometheus, and some others that we can't do at the moment.

For example, we cannot send an email and parse the result, because Prometheus scrapes HTTP endpoints. Also, we are not doing alerting yet, only monitoring.

There is also another aspect to consider, in the case of a service like gettor, monitoring the https endpoint will only give us some info about the static html we are serving with apache. Gettor itself (the service sending emails) is a twisted service instead.

Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.

There is more about what can be done with Prometheus (and the instance I am setting up at the moment) in https://trac.torproject.org/projects/tor/ticket/29863, where we discuss the idea of monitoring disk space for Snowflake.
See also https://help.torproject.org/tsa/howto/prometheus/ for some more information.

comment:2 in reply to:  1 ; Changed 6 months ago by phw

Replying to hiro:

> There is also another aspect to consider, in the case of a service like gettor, monitoring the https endpoint will only give us some info about the static html we are serving with apache. Gettor itself (the service sending emails) is a twisted service instead.

Gotcha. We have a similar problem with BridgeDB because it is exposed over an Apache reverse proxy and you cannot directly talk to BridgeDB. However, if BridgeDB is down, bridges.torproject.org responds with an internal server error if I remember correctly, so we can still monitor BridgeDB despite the reverse proxy, right?

To monitor BridgeDB, we need to set up an exporter, right?

> Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.

I think we already have that for BridgeDB and snowflake's website but not for GetTor.

comment:3 in reply to:  2 Changed 6 months ago by anarcat

Replying to phw:

> Replying to hiro:
>
> > There is also another aspect to consider, in the case of a service like gettor, monitoring the https endpoint will only give us some info about the static html we are serving with apache. Gettor itself (the service sending emails) is a twisted service instead.
>
> Gotcha. We have a similar problem with BridgeDB because it is exposed over an Apache reverse proxy and you cannot directly talk to BridgeDB. However, if BridgeDB is down, bridges.torproject.org responds with an internal server error if I remember correctly, so we can still monitor BridgeDB despite the reverse proxy, right?

Should, yes.

> To monitor BridgeDB, we need to set up an exporter, right?

In Prometheus, yes. This could be a simple configuration in a "blackbox exporter":

https://github.com/prometheus/blackbox_exporter/
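
A minimal sketch of such a configuration, assuming a blackbox exporter module that checks page content rather than only the status code. The module name `http_bridgedb` is hypothetical, and the expected string "BridgeDB" is the one suggested in the ticket description; the option names are those of the exporter's HTTP prober.

```yaml
# blackbox.yml (sketch, not the deployed configuration)
modules:
  http_bridgedb:                 # hypothetical module name
    prober: http
    timeout: 10s
    http:
      # Only a 200 response counts as success; a 5xx from the proxy fails the probe.
      valid_status_codes: [200]
      # Fail the probe if the response was not served over TLS.
      fail_if_not_ssl: true
      # Fail unless the body contains the expected string, so a reverse
      # proxy answering for a dead backend does not count as "up".
      fail_if_body_not_matches_regexp:
        - "BridgeDB"
```

Prometheus would then scrape the exporter's `/probe` endpoint with `module=http_bridgedb` and `target=https://bridges.torproject.org`.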

> > Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.
>
> I think we already have that for BridgeDB and snowflake's website but not for GetTor.

From what I can tell, we check bridges.torproject.org:

  -
    name: bridges.tpo web service
    nrpe: "/usr/lib/nagios/plugins/check_http -H bridges.torproject.org -S --string=bridge"
    hosts: polyanthum
    depends: network service - https

We also check onionoo:

 # non-tpa services
 ####
  -
    name: network service - onionoo backend
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:8080"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics
  -
    name: network service - onionoo varnish
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:6081"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics
  -
    name: network service - onionoo haproxy
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo -s onionoo.torproject.org"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics

... but those are all TPA machines, so they can be monitored by Nagios.

comment:4 Changed 6 weeks ago by phw

I closed #32679 as a duplicate of this ticket. Here's a brief summary of what we concluded in #32679:

  • We will use Nagios for internal services: BridgeDB, Snowflake, and GetTor.
  • We will use Prometheus's "blackbox exporter" for default bridges, which are external services.
  • Our admins will handle our Nagios config and the anti-censorship team will handle Prometheus's blackbox exporter.
  • We will experiment with Prometheus's "alertmanager", which can send notifications if a monitoring target goes offline.
  • The anti-censorship team will also get access to Prometheus's grafana dashboard.
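
The blackbox exporter part of this plan could look roughly like the following Prometheus scrape job. This is a sketch under several assumptions: the job name is made up, the targets are placeholder addresses from the documentation range 192.0.2.0/24 rather than real default bridges, the exporter is assumed to listen on 127.0.0.1:9115, and `tcp_connect` is assumed to be defined as a `prober: tcp` module in the exporter's own configuration.

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: default_bridges_tcp        # hypothetical job name
    metrics_path: /probe
    params:
      module: [tcp_connect]              # assumed module using "prober: tcp"
    static_configs:
      - targets:
          - 192.0.2.10:443               # placeholder, not a real bridge
          - 192.0.2.20:9001              # placeholder, not a real bridge
    relabel_configs:
      # Pass the bridge address to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the bridge address as the instance label on the resulting metrics.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the blackbox exporter itself.
      - target_label: __address__
        replacement: 127.0.0.1:9115      # assumed exporter address
```

A TCP connect probe only shows that the port accepts connections; it does not verify that the bridge actually speaks its pluggable transport protocol.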

For Nagios, here are our monitoring targets:

Note that the strings that should be present in the respective pages are mere suggestions. Ultimately, we just need a test that guarantees that these pages are correctly serving content.

comment:5 Changed 6 weeks ago by anarcat

Description: modified (diff)
Owner: changed from tpa to hiro
Status: changed from new to assigned

awesome summary, thanks. i turned that into a checklist and assigned the ticket to hiro who, I think, will handle followup on this. hiro, let me know if you need help or if any of this is incorrect...

comment:6 Changed 37 hours ago by gaba

Keywords: tpa-roadmap-february added