Opened 3 months ago

Last modified 2 months ago

#31159 new task

Monitor anti-censorship www services with prometheus

Reported by: phw
Owned by: tpa
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc: hiro, gman999
Actual Points:
Parent ID: #30152
Points: 1
Reviewer:
Sponsor:

Description

In the anti-censorship team we currently monitor several services with sysmon. We recently discovered that sysmon doesn't seem to follow HTTP 301 redirects. This means that if a web service dies but the 301 redirect still works (e.g., BridgeDB is dead but its apache reverse proxy still works), sysmon won't notice.

Now that prometheus is running, we should fill this monitoring gap by testing the following web sites:

Our test should ensure that these sites serve the content we expect, e.g., make sure that bridges.tp.o contains the string "BridgeDB" in its HTML. Testing the HTTP status code does not suffice: if BridgeDB is down, the reverse proxy may still respond.

I wonder if prometheus could also help us with #12802 by sending an email to bridges@… and making sure that it responds with at least one bridge?

Change History (3)

comment:1 Changed 3 months ago by hiro

Hi,

Some of what you are asking for we might be able to do with prometheus, and some of it we can't do at the moment.

For example, we cannot send an email and parse the reply, because Prometheus scrapes http endpoints. Also, we are not doing alerting yet, only monitoring.

There is also another aspect to consider: in the case of a service like gettor, monitoring the https endpoint will only give us some info about the static html we are serving with apache. Gettor itself (the service sending emails) is a separate twisted service.

Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.

There is more about what can be done with prometheus (and the instance I am setting up atm) in https://trac.torproject.org/projects/tor/ticket/29863, where we discuss the idea of monitoring disk space for snowflake.
See also https://help.torproject.org/tsa/howto/prometheus/ for some more information.

comment:2 in reply to:  1 ; Changed 2 months ago by phw

Replying to hiro:

There is also another aspect to consider: in the case of a service like gettor, monitoring the https endpoint will only give us some info about the static html we are serving with apache. Gettor itself (the service sending emails) is a separate twisted service.


Gotcha. We have a similar problem with BridgeDB because it is exposed over an Apache reverse proxy and you cannot directly talk to BridgeDB. However, if BridgeDB is down, bridges.torproject.org responds with an internal server error if I remember correctly, so we can still monitor BridgeDB despite the reverse proxy, right?

To monitor BridgeDB, we need to set up an exporter, right?

Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.


I think we already have that for BridgeDB and snowflake's website but not for GetTor.

comment:3 in reply to:  2 Changed 2 months ago by anarcat

Replying to phw:

Replying to hiro:

There is also another aspect to consider: in the case of a service like gettor, monitoring the https endpoint will only give us some info about the static html we are serving with apache. Gettor itself (the service sending emails) is a separate twisted service.


Gotcha. We have a similar problem with BridgeDB because it is exposed over an Apache reverse proxy and you cannot directly talk to BridgeDB. However, if BridgeDB is down, bridges.torproject.org responds with an internal server error if I remember correctly, so we can still monitor BridgeDB despite the reverse proxy, right?

Should, yes.

To monitor BridgeDB, we need to set up an exporter, right?

In Prometheus, yes. This could be a simple configuration in a "blackbox exporter":

https://github.com/prometheus/blackbox_exporter/
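
For illustration, here is a minimal sketch of what such a blackbox exporter setup might look like for bridges.torproject.org. The module name `http_bridgedb`, the job name, the timeout, and the exporter address `127.0.0.1:9115` (the exporter's default port) are assumptions, not our actual configuration. The "BridgeDB" string is the one suggested in the ticket description. First, a module in the exporter's own configuration that fails the probe unless the response body matches:

```yaml
# blackbox exporter configuration (sketch; module name is hypothetical)
modules:
  http_bridgedb:
    prober: http
    timeout: 10s                  # assumed value
    http:
      valid_status_codes: [200]   # an internal server error fails the probe
      fail_if_not_ssl: true       # the site is expected to be served over TLS
      fail_if_body_not_matches_regexp:
        - "BridgeDB"              # content check: status code alone is not enough
```

Then a scrape job on the Prometheus side that asks the exporter to probe the site, using the usual relabeling so that the target URL is passed as a query parameter:

```yaml
# prometheus configuration (sketch; job name and exporter address are assumptions)
scrape_configs:
  - job_name: blackbox_bridgedb
    metrics_path: /probe
    params:
      module: [http_bridgedb]     # must match the module name in the exporter config
    static_configs:
      - targets:
          - https://bridges.torproject.org
    relabel_configs:
      # pass the listed target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # actually scrape the blackbox exporter, not the site itself
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

With something like this, the result of each probe shows up as the `probe_success` metric (1 or 0). The HTTP prober follows redirects by default, so this should also cover the 301 case from the description.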

Maybe we can consider an approach in which services expose an http endpoint that we can use to know that the service is alive. Otherwise I think we could do some other monitoring via nagios checks.

I think we already have that for BridgeDB and snowflake's website but not for GetTor.

From what I can tell, we check bridges.torproject.org:

  -
    name: bridges.tpo web service
    nrpe: "/usr/lib/nagios/plugins/check_http -H bridges.torproject.org -S --string=bridge"
    hosts: polyanthum
    depends: network service - https

We also check onionoo:

 # non-tpa services
 ####
  -
    name: network service - onionoo backend
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:8080"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics
  -
    name: network service - onionoo varnish
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:6081"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics
  -
    name: network service - onionoo haproxy
    nrpe: "/usr/lib/nagios/plugins/tor-check-onionoo -s onionoo.torproject.org"
    hostgroups: onionoo-backend
    depends: "process - haproxy - master"
    contacts: +metrics

... but those are all TPA machines, so they can be monitored by Nagios.
