Opened 6 years ago

Closed 3 weeks ago

#12802 closed enhancement (implemented)

BridgeDB needs Nagios checks for the Email Distributor

Reported by: isis
Owned by: phw
Priority: High
Milestone:
Component: Circumvention/BridgeDB
Version:
Severity: Normal
Keywords: bridgedb-email, nagios, anti-censorship-roadmap-2020
Cc: isis, dawuud, sysrqb, Lunar, gaba, hiro
Actual Points: 1
Parent ID: #30152
Points: 5
Reviewer: cohosh
Sponsor: Sponsor30-must

Description

BridgeDB needs Nagios checks that verify the Email Distributor is working. The best way to do this would be to send an email to bridges@… that says "get help".

Child Tickets

Change History (38)

comment:1 Changed 6 years ago by isis

Cc: Lunar added

comment:2 Changed 6 years ago by dawuud

I think what is needed here is a passive-style service check.
The check runs on its own schedule via cron or something similar; it sends an e-mail to the Email Distributor
and then periodically checks its e-mail inbox via IMAP.
If we don't receive an email matching the heuristics we are looking for within X minutes, it sends an alert to the nagios server.
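A Python sketch of that passive check follows. All hosts, addresses, and credentials are placeholders, the helper names are invented for illustration, and this is not the script that was eventually deployed; the OK/CRITICAL status format mirrors the one Nagios checks conventionally read.

```python
# Sketch of a passive Nagios-style check for the Email Distributor:
# send a "get help" probe over SMTP, then poll an IMAP inbox for the
# reply, and report CRITICAL if nothing arrives within the deadline.
# All hosts, addresses, and credentials are placeholders.
import imaplib
import smtplib
import time
from email.mime.text import MIMEText

SMTP_HOST = "mail.example.org"        # placeholder
IMAP_HOST = "mail.example.org"        # placeholder
MONITOR_ADDR = "monitor@example.org"  # placeholder
BRIDGES_ADDR = "bridges@example.org"  # placeholder
PASSWORD = "changeme"                 # placeholder
DEADLINE_SECS = 10 * 60               # the "X minutes" from the comment above

def send_probe():
    """Ask the Email Distributor for bridges."""
    msg = MIMEText("get help")
    msg["From"] = MONITOR_ADDR
    msg["To"] = BRIDGES_ADDR
    msg["Subject"] = "bridges"
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

def reply_arrived():
    """Return True if an unseen reply from the distributor is in the inbox."""
    imap = imaplib.IMAP4_SSL(IMAP_HOST)
    try:
        imap.login(MONITOR_ADDR, PASSWORD)
        imap.select("INBOX")
        _, data = imap.search(None, f'(UNSEEN FROM "{BRIDGES_ADDR}")')
        return bool(data[0].split())
    finally:
        imap.logout()

def status_line(ok):
    """Render a two-line status in the OK/CRITICAL format Nagios checks read."""
    if ok:
        return "OK\n0: BridgeDB's email responder works"
    return "CRITICAL\n2: no reply from BridgeDB's email responder"

def run_check():
    """Send the probe, wait for the reply, and return the status text."""
    send_probe()
    deadline = time.time() + DEADLINE_SECS
    while time.time() < deadline:
        if reply_arrived():
            return status_line(True)
        time.sleep(30)
    return status_line(False)
```

The heuristic here is simply "any unseen mail from the distributor"; a real check would match the probe more precisely, e.g. by subject or message-id.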

comment:3 Changed 6 years ago by isis

Status: new → accepted

I marked #10916 as a duplicate of this ticket. The pertinent points made there were:

Replying to sysrqb:

After chatting with lunar about it we began discussing additional monitoring for the email distributor. The check_email_delivery nagios plugin was suggested.


and

Replying to isis:

Replying to sysrqb:

I also wondered if we should consider whitelisting tp.o addresses for use by the monitoring system (among other reasons).

We can't safely whitelist torproject.org email addresses because the torproject.org mailserver doesn't do DKIM. Because of this, I started adding a (email_address, gpg_fingerprint) whitelisting feature, requiring that such whitelisted addresses be signed with a particular key. (See #9332 and note that this feature would present a maintainability nightmare.)


comment:4 Changed 2 years ago by teor

Severity: Normal

Set all open tickets without a severity to "Normal"

comment:5 Changed 19 months ago by gaba

Owner: changed from isis to hiro
Status: accepted → assigned

comment:6 Changed 17 months ago by gaba

Cc: gaba added

comment:7 Changed 17 months ago by gaba

Sponsor: Sponsor19

comment:8 Changed 16 months ago by gaba

Points: 4

comment:9 Changed 16 months ago by gaba

Points: 4 → 5

comment:10 Changed 16 months ago by dgoulet

Owner: changed from hiro to dgoulet

comment:11 Changed 14 months ago by phw

Parent ID: #30152

comment:12 Changed 13 months ago by phw

For what it's worth, we're now monitoring BridgeDB's SMTP port with sysmon. We will get notified if the SMTP server disappears but we are unable to detect more subtle, application-layer breakage.

comment:13 Changed 12 months ago by gaba

Keywords: ex-sponsor-19 added

Adding the keyword to mark everything that didn't fit into the time for sponsor 19.

comment:14 Changed 12 months ago by phw

Sponsor: Sponsor19 → Sponsor30-must

Moving from Sponsor 19 to Sponsor 30.

comment:15 Changed 12 months ago by gaba

Owner: dgoulet deleted

dgoulet will assign himself to the ones he is working on right now.

comment:16 Changed 11 months ago by gaba

Keywords: anti-censorship-roadmap-october added; ex-sponsor-19 removed

comment:17 Changed 4 months ago by gaba

Keywords: anti-censorship-roadmap-2020Q1 added; anti-censorship-roadmap-october removed

comment:18 Changed 3 months ago by teor

Status: assigned → new

Change tickets that are assigned to nobody to "new".

comment:19 Changed 2 months ago by phw

Status: new → needs_information

I refactored hiro's "check for emails" script in this commit. The script writes its output to /srv/bridges.torproject.org/check/status. I can set up a cron job that runs this script every, say, six hours. Hiro, can you remind me what will happen if nagios considers BridgeDB's email responder down? Will I be able to see this in the nagios web UI? I'm asking because we will probably encounter a few more hiccups with the "check email" script once it's running continuously in production.

comment:20 Changed 2 months ago by hiro

Hi phw,
you will be able to see the web UI, and we can also set up an email alert.
Are you already writing the status file on bridgedb, so I can finish setting everything up on the nagios side?

Last edited 2 months ago by hiro

comment:21 in reply to:  20 Changed 2 months ago by phw

Replying to hiro:

you will be able to see the web UI, and we can also set up an email alert.
Are you already writing the status file on bridgedb, so I can finish setting everything up on the nagios side?


Yes, /srv/bridges.torproject.org/check/status already exists.

comment:22 Changed 2 months ago by phw

For the record, I wrote a patch for bridgedb-admin, which whitelists our test address, and I have a ready-to-merge bridgedb branch that implements our test script.

I configured a cronjob that runs this script every three hours.
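For reference, a crontab entry for a three-hour interval looks roughly like this; the script path is a placeholder, since the ticket doesn't name the deployed location:

```shell
# Placeholder path; runs at minute 0 of every third hour (00:00, 03:00, ...).
0 */3 * * * /usr/local/bin/bridgedb-check-email
```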

comment:23 Changed 2 months ago by phw

Owner: set to phw
Status: needs_information → assigned

comment:24 Changed 8 weeks ago by phw

Cc: hiro added

Is the Nagios check deployed already? Does the email alert work too? Also, what's the URL for the new Nagios page? I only have a URL for gettor-01.

comment:25 Changed 5 weeks ago by hiro

Hi phw,
When I made the check for bridgedb, I was under the impression that the machine was managed via our infra, at least to some extent. We don't have nagios on bridgedb because the machine is managed by you guys. So I guess we can either add nagios to bridgedb or add an email check to prometheus.
Which would you prefer in this case?
Apologies for this.

comment:26 in reply to:  25 Changed 5 weeks ago by phw

Replying to hiro:

When I made the check for bridgedb, I was under the impression that the machine was managed via our infra, at least to some extent. We don't have nagios on bridgedb because the machine is managed by you guys. So I guess we can either add nagios to bridgedb or add an email check to prometheus.
Which would you prefer in this case?


I'm not sure what "managed by you guys" means. Cecylia and I administer the BridgeDB service but not the machine as a whole. We don't have root on polyanthum. Isn't this the same situation as with gettor-01? If so, I suggest installing nagios on polyanthum.

comment:27 Changed 5 weeks ago by hiro

I understand now what is happening with this host. Sorry about the confusion. I will enable the check.

comment:28 in reply to:  27 Changed 5 weeks ago by phw

Replying to hiro:

I understand now what is happening with this host. Sorry about the confusion. I will enable the check.


Thanks! As mentioned in this comment, we now have a mailing list to send service alerts to. Can you please configure nagios to send alerts to anti-censorship-alerts at lists dot tpo?

comment:29 Changed 4 weeks ago by hiro

Hi phw, this is now online: https://nagios.torproject.org/cgi-bin/icinga/extinfo.cgi?type=2&host=polyanthum&service=application+service+-+bridgedb+status

Can you check that the status file is giving the correct status?

For reference, GetTor publishes this in its status file:

OK
0: GetTor is good and sending emails with working links
Last edited 4 weeks ago by hiro

comment:30 in reply to:  29 Changed 4 weeks ago by phw

Replying to hiro:

Hi phw, this is now online: https://nagios.torproject.org/cgi-bin/icinga/extinfo.cgi?type=2&host=polyanthum&service=application+service+-+bridgedb+status

Can you check that the status file is giving the correct status?

For reference, GetTor publishes this in its status file:

OK
0: GetTor is good and sending emails with working links


Thanks, hiro! Yes, the status file should look good. Right now, it says:

OK
0: BridgeDB's email responder works

Is there a problem with the file format?

Also, are email alerts now going to anti-censorship-alerts@…?

comment:31 Changed 4 weeks ago by phw

Status: assigned → needs_review

comment:32 Changed 4 weeks ago by gaba

Keywords: anti-censorship-roadmap-2020 added; anti-censorship-roadmap-2020Q1 removed

No more Q1 for 2020.

comment:33 Changed 4 weeks ago by hiro

Everything looks good to me: +1 on the PR on GitHub, and Nagios shows the check as OK.
I am going to add the notification email now.

comment:34 Changed 3 weeks ago by cohosh

Reviewer: cohosh

comment:35 Changed 3 weeks ago by cohosh

Status: needs_review → needs_information

The script looks good to me. Some thoughts:

  • It's not super urgent in this case, but putting the email password in an environment variable would allow us to not have it in the bash history
  • 60 seconds seems a bit fast for email. How about making this 5-10 minutes? I'm thinking in terms of reducing false positives plus being realistic about our response time to the alert.
  • Is there a reason we're logging in twice (once at L56 and once at L128)?

Other than that, it looks good, but I'll put it in needs_information.
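On the first point, a minimal sketch of reading the secret from the environment instead of passing it on the command line; the `EMAIL_PASSWORD` name is illustrative, not what the actual script uses:

```python
# Read the mailbox password from the environment so it never lands in the
# shell history or in `ps` output. EMAIL_PASSWORD is an illustrative name.
import os
import sys

def get_password():
    password = os.environ.get("EMAIL_PASSWORD")
    if not password:
        sys.exit("EMAIL_PASSWORD is not set; refusing to run without it.")
    return password
```

Invoked as `EMAIL_PASSWORD=... check-email.py`; with bash's `HISTCONTROL=ignorespace`, prefixing the command with a space keeps even that line out of the history.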

comment:36 in reply to:  35 ; Changed 3 weeks ago by phw

Status: needs_information → needs_review

Replying to cohosh:

  • It's not super urgent in this case, but putting the email password in an environment variable would allow us to not have it in the bash history


That's a good point. This commit adds support for passing the key as an environment variable.

  • 60 seconds seems a bit fast for email. How about making this 5-10 minutes? I'm thinking in terms of reducing false positives plus being realistic about our response time to the alert.


I used to have it set to 5 minutes but that turned out to be excessive. BridgeDB reliably responds within a few seconds, so I think it's safe to assume that if the response didn't come within 60 seconds, it won't come at all.

  • Is there a reason we're logging in twice (once at L56 and once at L128)?


One login is for our SMTP connection (to send the email) and the other is for our IMAP connection (to check for the response). Is there a way to simplify this?

comment:37 in reply to:  36 Changed 3 weeks ago by cohosh

Status: needs_review → merge_ready

Replying to phw:

Replying to cohosh:

  • It's not super urgent in this case, but putting the email password in an environment variable would allow us to not have it in the bash history


That's a good point. This commit adds support for passing the key as an environment variable.

Looks good!


  • 60 seconds seems a bit fast for email. How about making this 5-10 minutes? I'm thinking in terms of reducing false positives plus being realistic about our response time to the alert.


I used to have it set to 5 minutes but that turned out to be excessive. BridgeDB reliably responds within a few seconds, so I think it's safe to assume that if the response didn't come within 60 seconds, it won't come at all.

Ah okay, cool :) This sounds good then.


  • Is there a reason we're logging in twice (once at L56 and once at L128)?


One login is for our SMTP connection (to send the email) and the other is for our IMAP connection (to check for the response). Is there a way to simplify this?

Oops thanks for clearing that up. I hadn't noticed the two different connection types.

Looks good to merge!

comment:38 Changed 3 weeks ago by phw

Actual Points: 1
Resolution: implemented
Status: merge_ready → closed

Merged in cc3277b.

Note: See TracTickets for help on using tickets.