Opened 8 months ago

Last modified 4 weeks ago

#28322 accepted project

Deploy better notification system for operational issues

Reported by: karsten
Owned by: irl
Priority: High
Milestone:
Component: Metrics
Version:
Severity: Normal
Keywords: metrics-roadmap-2019-q2
Cc: metrics-team
Actual Points:
Parent ID:
Points: 10
Reviewer:
Sponsor:

Description

We have been using Nagios to monitor Onionoo for a few years now, and we recently extended existing Nagios checks (#28242) and added new ones (#28271).

We should consider adding even more checks. One year ago we discussed what those checks could be, and it seems like this list could still serve as a starting point for adding new checks now.

Child Tickets

Ticket   Status    Owner  Summary                                Component
#28271   accepted  irl    Check OnionPerf instances from Nagios  Metrics/Onionperf

Attachments (1)

Screen Shot 2019-05-06 at 09.35.48.png (195.5 KB) - added by irl 6 weeks ago.


Change History (5)

comment:1 Changed 4 months ago by irl

Keywords: metrics-roadmap-2019-q2 added
Points: set to 5
Priority: changed from Medium to High
Type: changed from task to project

comment:2 Changed 6 weeks ago by irl

Owner: changed from metrics-team to irl
Points: changed from 5 to 10
Status: changed from new to accepted

Status update:

  • I think we're going to end up running our own Nagios instance, which is OK if it helps us move forward here.
  • I've got a testing environment running in Vagrant+Ansible and am looking at adding checks now.
  • I'm using bushel's library code to implement fetching/parsing of Tor-specific documents.
  • I'm going to build a new repo "tor-metrics-nagios-checks" that builds a Debian package with all the checks in it.
  • I'm going to continue expanding the fetching and parsing logic in bushel, such that it's reusable elsewhere.
  • Once I've worked out secret handling in Ansible we can also publish the git repo that stands up the testing environment.
  • bushel will need a Debian package if we plan to deploy on a TPA machine. I'm thinking though that we could instead deploy to an AWS/GCP/Azure VM (yet to decide which of these I like best, we might want to do more cloud-native things in the future).

Current tests:

  • Check that the latest index generated on CollecTor is reasonably recent.
  • Check that the latest documents published on CollecTor are reasonably recent.
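
For illustration, the two checks above could look roughly like the sketch below. This is not the code in the planned repo. It assumes the index is served as JSON at the URL shown, that it has an "index_created" field and per-file "last_modified" fields in "YYYY-MM-DD HH:MM" UTC format, and the one-hour and three-hour thresholds are made up for the example.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style freshness check for a CollecTor index."""
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumed index location and timestamp format; verify both before relying on them.
INDEX_URL = "https://collector.torproject.org/index/index.json"
TIME_FORMAT = "%Y-%m-%d %H:%M"

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def iter_files(node):
    """Yield every file entry at or below a directory node of the index."""
    for entry in node.get("files", []):
        yield entry
    for sub in node.get("directories", []):
        yield from iter_files(sub)


def check_freshness(index, now, warn=timedelta(hours=1), crit=timedelta(hours=3)):
    """Return (exit_code, message) for a parsed index; thresholds are hypothetical."""
    try:
        created = datetime.strptime(index["index_created"], TIME_FORMAT)
        newest = max(
            datetime.strptime(f["last_modified"], TIME_FORMAT)
            for f in iter_files(index)
        )
    except (KeyError, ValueError) as exc:
        # Missing fields, unparsable timestamps, or an index without files.
        return UNKNOWN, f"UNKNOWN - cannot read index: {exc!r}"
    index_age = now - created
    document_age = now - newest
    detail = f"index is {index_age} old, newest document is {document_age} old"
    worst = max(index_age, document_age)
    if worst >= crit:
        return CRITICAL, "CRITICAL - " + detail
    if worst >= warn:
        return WARNING, "WARNING - " + detail
    return OK, "OK - " + detail


def main():
    """Fetch the index, print one status line, and return the exit code."""
    try:
        with urllib.request.urlopen(INDEX_URL, timeout=30) as response:
            index = json.load(response)
    except (OSError, ValueError) as exc:
        print(f"UNKNOWN - cannot fetch index: {exc}")
        return UNKNOWN
    now = datetime.now(timezone.utc).replace(tzinfo=None)  # naive UTC
    code, message = check_freshness(index, now)
    print(message)
    return code

# A plugin entry point would end with: sys.exit(main())
```

Keeping the comparison in check_freshness() separate from the fetch means the thresholds can be tested against canned index data without network access.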

I'm increasing the points on this task to 10, as I think that is roughly the amount of time to spend to get something working and useful. I'll remove the points from this ticket once we have child tickets in place, each with specific points. Maybe this estimate will go up, maybe down.

Changed 6 weeks ago by irl: attachment Screen Shot 2019-05-06 at 09.35.48.png added

comment:3 Changed 6 weeks ago by irl


comment:4 Changed 4 weeks ago by irl

I had a go at deploying this on AWS. I had heard about this new fancy Lightsail and deployed it there. This was a mistake and ended up being a waste of time for a number of reasons.

Instead this is going to need to use EC2 (which Lightsail is based on anyway) so that we have better control over the firewall (ICMP is blocked on Lightsail) and so that it is possible to use the metadata service for AWS credentials (which will allow us to use SNS for alerting).
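
As a rough sketch of what SNS alerting from the instance could look like: boto3 resolves credentials from the EC2 instance role via the metadata service when none are configured explicitly, so no keys need to live on the box. The function names, the subject format and the topic ARN in the test data are hypothetical; only boto3.client("sns") and its publish() call are real API.

```python
"""Sketch of publishing a Nagios alert to an SNS topic."""


def publish_alert(sns_client, topic_arn, service, state, detail):
    """Publish one alert and return the SNS message ID.

    sns_client is anything with a boto3-compatible publish() method.
    """
    # SNS subjects are limited to 100 characters.
    subject = f"{state}: {service}"[:100]
    response = sns_client.publish(
        TopicArn=topic_arn,
        Subject=subject,
        Message=detail,
    )
    return response["MessageId"]


def make_client(region):
    """Create a real SNS client (hypothetical helper).

    boto3 is a third-party package and is imported lazily here. On EC2 with
    an instance role attached, it fetches temporary credentials from the
    metadata service on its own.
    """
    import boto3

    return boto3.client("sns", region_name=region)
```

Passing the client in as an argument keeps the alert formatting testable with a stub, without touching AWS.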

In good news though, the Ansible playbook works well for deploying the software and configuration.

As this is going to involve an EC2 instance, a couple of SNS topics, an IAM role and some glue, I would like to see if I can get a CloudFormation template for this together so that we don't have AWS resources scattered and forgotten (and billed for) when we change this in the future.
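
To give an idea of the shape such a template might take, here is a minimal sketch that builds it as a Python dict and can dump it as JSON. The logical names (WarningTopic, CriticalTopic, MonitorRole and so on), the split into two topics by severity and the default instance type are assumptions for illustration; networking and firewall rules are left out entirely.

```python
"""Sketch of a CloudFormation template for the monitoring host."""
import json


def build_template():
    """Return a minimal template: one instance, two topics, one role."""
    # Lets EC2 assume the role on behalf of the instance.
    assume_role = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    # Only allows publishing to the two topics defined below.
    # Ref on an AWS::SNS::Topic resolves to the topic ARN.
    publish_policy = {
        "PolicyName": "publish-alerts",
        "PolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": "sns:Publish",
                "Resource": [{"Ref": "WarningTopic"}, {"Ref": "CriticalTopic"}],
            }],
        },
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Nagios monitoring host with SNS alerting (sketch)",
        "Parameters": {
            "ImageId": {"Type": "AWS::EC2::Image::Id"},
            "InstanceType": {"Type": "String", "Default": "t3.micro"},
        },
        "Resources": {
            "WarningTopic": {"Type": "AWS::SNS::Topic"},
            "CriticalTopic": {"Type": "AWS::SNS::Topic"},
            "MonitorRole": {
                "Type": "AWS::IAM::Role",
                "Properties": {
                    "AssumeRolePolicyDocument": assume_role,
                    "Policies": [publish_policy],
                },
            },
            "MonitorProfile": {
                "Type": "AWS::IAM::InstanceProfile",
                "Properties": {"Roles": [{"Ref": "MonitorRole"}]},
            },
            "MonitorInstance": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "ImageId"},
                    "InstanceType": {"Ref": "InstanceType"},
                    "IamInstanceProfile": {"Ref": "MonitorProfile"},
                },
            },
        },
    }

# To write the template out: print(json.dumps(build_template(), indent=2))
```

Because everything lives in one stack, deleting the stack removes the instance, topics and role together. Creating the stack needs the CAPABILITY_IAM acknowledgement since it defines IAM resources.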
