Opened 6 years ago

Closed 9 months ago

#4407 closed task (wontfix)

Create a basic monitoring infrastructure for large scale events

Reported by: atagar
Owned by:
Priority: Medium
Milestone:
Component: Metrics/Website
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

At present we don't seem to have alerting infrastructure or anything else that would notify us of events (even noisy attack attempts) within the network. The bandwidth authorities already generate the heuristics that congestion attacks need to modify to affect client behavior, so we should be able to trivially detect large deltas (for instance, if half the network suddenly seems to vanish) and get a notification.

I don't believe that we should invest too much effort here since this probably isn't an arms race we'd do very well in. However, we shouldn't rely on users to notify us of DoS attempts or other particularly noisy attacks.

Putting this under metrics to get Karsten's thoughts. Maybe this is somehow already being monitored? If not, then general-purpose alerting (which could also replace my consensus tracker script) might be a fun project to hack on later, and possibly a good candidate for GSoC next year. :)

Cheers! -Damian

Child Tickets

Ticket | Status | Owner | Summary | Component
#5212 | closed | runa | Write up wiki page with requirements, questions etc for a basic monitoring infrastructure | Metrics/Website

Change History (10)

comment:1 Changed 6 years ago by karsten

Component: Metrics Utilities → Metrics Website
Summary: Basic Monitoring for Large Scale Events → Create a basic monitoring infrastructure for large scale events
Type: project → task

I agree that such a tool would be quite useful. If possible, I'd like to merge it with the consensus-health script and the script that checks whether a relay's bandwidth history timestamps are totally off (which I'm planning to write this week). So, the first question would be how we can combine these different monitoring scripts into a common infrastructure. My suggestion is to continue using a Java application that gets executed once per hour by cron. But I already know 1 person who won't like that suggestion. ;)

The second question would be what heuristics we should use to detect these large scale events. We can implement trivial things like "we just lost more than 25% of the relays in one hour." But what we really need is someone to sit down with the descriptor archives and work out which changes are expected and which would be unusual. This may be a fine GSoC project (or a fine volunteer project outside of GSoC). Having an existing infrastructure that someone can extend might be helpful.
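As a rough illustration of the trivial "lost more than 25% of the relays" heuristic, here is a minimal Python sketch. It assumes each consensus has already been reduced to a set of relay fingerprints; the function names and the 0.25 default are illustrative and not taken from any existing script.

```python
def relay_drop_fraction(previous, current):
    """Return the fraction of previously listed relays missing from `current`.

    Both arguments are sets of relay fingerprints (an assumption of this sketch).
    """
    if not previous:
        return 0.0
    return len(previous - current) / len(previous)


def lost_too_many(previous, current, threshold=0.25):
    """True if more than `threshold` of the previous relays have vanished."""
    return relay_drop_fraction(previous, current) > threshold


# Example: two of four relays vanish and one new relay appears.
previous = {"A", "B", "C", "D"}
current = {"A", "B", "E"}
print(relay_drop_fraction(previous, current))  # 0.5
print(lost_too_many(previous, current))        # True
```

Counting only relays that disappeared, rather than the net change, keeps a burst of new relays from masking a large loss. Whether that is the right definition is exactly the kind of question the descriptor archives would have to answer.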

comment:2 Changed 6 years ago by atagar

> My suggestion is to continue using a Java application that gets executed once per hour by cron.

The first question that comes to mind is: do we need monitors to have historical data? This was the reason I avoided the metrics codebase for my consensus tracker script. Once you add its Java and DB prerequisites, the installation and complexity of the system get much worse with, I think, little benefit.

What I'd like to see is for the alarm infrastructure to use a metrics service API, but itself be a separate and distinct component.

That said, this decision is really up to whoever codes it. If it's something like the above then I'd be happy to mentor, and if it's an expansion of the metrics codebase then I guess the ball's in your court. If no one gets to it first then I might hack on it later as a client for stem.

> But I already know 1 person who won't like that suggestion. ;)

Bold accusation! Actually, if you'd proposed a Java project when I first joined the community I would have been all over it - I have far more Java development experience than Python.

> We can implement trivial things like "we just lost more than 25% of the relays in one hour." But what we really need is someone to sit down with the descriptor archives and work out which changes are expected and which would be unusual.

Right. What I'd like to see first is alarms for when the sky is falling. After that it becomes a question of tuning and pattern matching which could then easily lead to interesting research projects - hint hint, researchy people. :)

Cheers! -Damian

comment:3 in reply to:  2 ; Changed 6 years ago by karsten

Replying to atagar:

> > My suggestion is to continue using a Java application that gets executed once per hour by cron.

> The first question that comes to mind is: do we need monitors to have historical data? This was the reason I avoided the metrics codebase for my consensus tracker script. Once you add its Java and DB prerequisites, the installation and complexity of the system get much worse with, I think, little benefit.

We don't need historical data for the monitoring infrastructure. Or rather, we'll want to keep our own state files, but we don't really need to have access to past descriptors. I agree with you that the monitoring infrastructure should be independent of the metrics database.
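To make the "own state files" idea concrete, here is a minimal Python sketch of a monitor that remembers only the last relay count between hourly runs. The file layout, the `relay_count` key, and the function names are assumptions of this sketch, not an existing format.

```python
import json
import os
import tempfile


def load_state(path):
    """Read the monitor's state file; an absent file means a first run."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_state(path, state):
    """Write the state to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def check_and_update(path, relay_count, threshold=0.25):
    """Compare the current relay count with the stored one, then store it.

    Returns True if the count dropped by more than `threshold` since the
    previous run. No past descriptors are needed, only the state file.
    """
    previous = load_state(path).get("relay_count")
    alarm = (
        previous is not None
        and previous > 0
        and (previous - relay_count) / previous > threshold
    )
    save_state(path, {"relay_count": relay_count})
    return alarm
```

Writing to a temporary file and renaming it means an interrupted run should leave the previous state file intact rather than a half-written one.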

I came to a similar conclusion a few weeks ago, but for a slightly different reason. We had a single cronjob to download descriptors, import them into the metrics database, and run the consensus-health script. This approach turned out to be terribly error-prone. Whenever the database import got stuck, the download stopped and the consensus-health script didn't work anymore. That's why I made the consensus-health script a separate component that is independent of the metrics database.
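The failure mode described here, one stuck step taking down everything after it, can also be avoided inside a single process by isolating each check. The following is a minimal Python sketch under that assumption; `run_checks` and the check names are hypothetical.

```python
import logging


def run_checks(checks):
    """Run every check even if an earlier one raises.

    `checks` maps a name to a zero-argument callable. The result maps each
    name to the check's return value, or to the exception it raised.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:
            # Log the failure and carry on with the remaining checks.
            logging.exception("check %r failed", name)
            results[name] = exc
    return results
```

This does not protect against a check that hangs forever; for that, separate cron jobs or per-check timeouts would still be needed.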

But hey, Java is not a prereq, it's a programming language. Whether we require a certain JVM and Java libraries or a certain Python version and Python APIs makes no difference. Well, besides the personal developer preferences that have an influence on development speed.

What I'd like to see is for the alarm infrastructure to use a metrics service API, but itself be a separate and distinct component.

I like the idea of such a metrics service API. I've had a TODO list item for way too many months to extract the common parts of metrics-web and metrics-db that handle relay descriptors and put them in a separate API. In the meantime, ExoneraTor copies that code, the consensus-health script copies it, the extra-info descriptor health script would copy it, and the monitoring infrastructure is going to copy it, too. Let's finally make an API. I'm going to open a ticket today once I have a rough idea of what the API could look like. Will post the ticket number here.

> That said, this decision is really up to whoever codes it. If it's something like the above then I'd be happy to mentor, and if it's an expansion of the metrics codebase then I guess the ball's in your court. If no one gets to it first then I might hack on it later as a client for stem.

We could also discuss what the API is supposed to do, and then implement it both in Java and Python. There are a few Java metrics programs that would make use of it, and I think you have a few Python applications which would use it, too.
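To illustrate what "a separate API" could mean for the monitoring side, here is a minimal Python sketch of a monitor that depends only on a small interface rather than on copied descriptor-handling code. Every name here (`DescriptorSource`, `latest_relay_fingerprints`, `InMemorySource`, `count_relays`) is hypothetical and is not the API proposed in any ticket.

```python
from abc import ABC, abstractmethod


class DescriptorSource(ABC):
    """Hypothetical interface a shared descriptor API might expose."""

    @abstractmethod
    def latest_relay_fingerprints(self):
        """Return the set of relay fingerprints in the latest consensus."""


class InMemorySource(DescriptorSource):
    """Stand-in implementation, useful for testing a monitor offline."""

    def __init__(self, fingerprints):
        self._fingerprints = set(fingerprints)

    def latest_relay_fingerprints(self):
        return set(self._fingerprints)


def count_relays(source):
    """A monitor-side function that only knows about the interface."""
    return len(source.latest_relay_fingerprints())


print(count_relays(InMemorySource(["A", "B", "B", "C"])))  # 3
```

Agreeing on an interface like this first would let a Java and a Python implementation coexist, since the monitors would be written against the contract rather than against one codebase.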

> > But I already know 1 person who won't like that suggestion. ;)

> Bold accusation! Actually, if you'd proposed a Java project when I first joined the community I would have been all over it - I have far more Java development experience than Python.

Doh! ;)

> > We can implement trivial things like "we just lost more than 25% of the relays in one hour." But what we really need is someone to sit down with the descriptor archives and work out which changes are expected and which would be unusual.

> Right. What I'd like to see first is alarms for when the sky is falling. After that it becomes a question of tuning and pattern matching which could then easily lead to interesting research projects - hint hint, researchy people. :)

Agreed. This research project might even turn out to be quite interesting!

comment:4 in reply to:  3 Changed 6 years ago by karsten

Replying to karsten:

> I'm going to open a ticket today once I have a rough idea of what the API could look like. Will post the ticket number here.

See #4439.

comment:5 Changed 6 years ago by runa

Sounds like an interesting project to hack on. Has anyone made any progress so far?

comment:6 Changed 6 years ago by atagar

Nope, no one has started yet.

comment:7 Changed 6 years ago by karsten

Owner: changed from karsten to runa
Status: new → assigned

Re-assigning to Runa who has expressed most interest in working on this.

comment:8 Changed 6 years ago by karsten

I just found #3237 which I opened 10 months ago and which may have more input on the topic.

comment:9 Changed 5 years ago by runa

Owner: runa deleted

comment:10 Changed 9 months ago by karsten

Resolution: wontfix
Severity: Normal
Status: assigned → closed

We have several scripts checking for bursts of newly added relays, groups of relays with similar fingerprints, relays doing bad things in the Tor network, and so on. We might want to have even more and better tools. But we very likely don't depend on keeping this ticket open in order to do so. Closing.
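For readers wondering what a "similar fingerprints" check might look like, here is a minimal Python sketch. It assumes "similar" means sharing a hex prefix, which is only one possible definition; the existing scripts are not described in this ticket, and the function name and defaults are illustrative.

```python
from collections import defaultdict


def similar_fingerprint_groups(fingerprints, prefix_len=8, min_size=3):
    """Group relay fingerprints that share the same leading hex digits.

    Returns {prefix: sorted fingerprints} for groups with at least
    `min_size` members. Fingerprints are compared case-insensitively.
    """
    groups = defaultdict(list)
    for fp in fingerprints:
        fp = fp.upper()
        groups[fp[:prefix_len]].append(fp)
    return {
        prefix: sorted(members)
        for prefix, members in groups.items()
        if len(members) >= min_size
    }


# Example with shortened, made-up fingerprints.
fps = ["ABCD01", "ABCD02", "abcd03", "FFFF00"]
print(similar_fingerprint_groups(fps, prefix_len=4))
# {'ABCD': ['ABCD01', 'ABCD02', 'ABCD03']}
```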
