Opened 6 years ago

Closed 6 years ago

#5212 closed task (fixed)

Write up wiki page with requirements, questions etc for a basic monitoring infrastructure

Reported by: runa
Owned by: runa
Priority: Medium
Milestone:
Component: Metrics/Website
Version:
Severity:
Keywords:
Cc: gsathya.ceg@…
Actual Points:
Parent ID: #4407
Points:
Reviewer:
Sponsor:

Description

We want to create some kind of basic monitoring infrastructure for large-scale events (#4407). A good first step would be to collect all requirements, thoughts, questions, etc. on a wiki page.

Change History (5)

comment:1 Changed 6 years ago by atagar

Parent ID: #4407

From the dev meeting, here are the thoughts that I recall...

  • Ideally this would replace my consensus tracker script, Karsten's consensus health checker, and SoaT. Those all would be plugins for this system.
  • I suggested using the onionoo protocol for this, in which case the first step would be to make an onionoo client in whatever language the alarming framework would use.
  • I asked Roger for thoughts on other things that we could monitor and he suggested...
    • Entropy of bandwidth authority weights, so we know when the authority heuristics radically change.
    • Tor weather notice for when people should get a shirt. Ideally we'd then reach out to them to figure out how their experience as a relay operator was going.
    • Notice about new, especially large relays.
  • After some more thought I realized that we should also look around to see what sort of alarming frameworks already exist. That might save us a lot of work and maybe provide a nice UI too.
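
As a rough illustration of the bandwidth-weight entropy idea in the list above, here is a minimal Python sketch that computes the Shannon entropy of a set of weights and flags a large change between two snapshots. The function names and the 0.5-bit threshold are assumptions made for illustration; nothing in the discussion specifies how the weights would be obtained or what change counts as "radical".

```python
import math


def shannon_entropy(weights):
    """Return the Shannon entropy (in bits) of a list of non-negative weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    entropy = 0.0
    for w in weights:
        if w > 0:  # zero-weight entries contribute nothing to the sum
            p = w / total
            entropy -= p * math.log2(p)
    return entropy


def entropy_changed(old_weights, new_weights, threshold_bits=0.5):
    """Report whether the entropy moved by more than threshold_bits.

    threshold_bits is a hypothetical alarm threshold, not a recommended value.
    """
    delta = shannon_entropy(new_weights) - shannon_entropy(old_weights)
    return abs(delta) > threshold_bits
```

A plugin could call `entropy_changed` on the weights from two consecutive snapshots and raise an alarm when it returns `True`. Evenly spread weights give high entropy, while weight concentrating on a few relays pulls it down.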

comment:2 in reply to:  1 Changed 6 years ago by karsten

Replying to atagar:

From the dev meeting, here are the thoughts that I recall...

  • Ideally this would replace my consensus tracker script, Karsten's consensus health checker, and SoaT. Those all would be plugins for this system.

My understanding was that the alarming framework would be passive and only look at descriptors or other data sources, but that it shouldn't make any active requests. We didn't explicitly discuss this, but I think if active requests are in scope, then the scope is too broad. I'm worried that we'll fall into the kitchen-sink trap again.

Your consensus tracker script should do fine with passively looking at the current descriptors and maybe server descriptors. Part of the consensus-health script will be fine with that, too, except for the parts where we check if a directory authority tells us a recent consensus or not. But SoaT is mostly based on actively measuring whether exits are evil or not; that's something we cannot learn passively from looking at descriptors.

  • I suggested using the onionoo protocol for this, in which case the first step would be to make an onionoo client in whatever language the alarming framework would use.

I still disagree that this is a good application for the Onionoo protocol. That protocol is meant for applications that want to learn about the status of single relays or bridges that were running in the past week. Onionoo clients should be able to find all information they care about in the latest documents they download. They shouldn't care about past Onionoo documents. That's why Onionoo contains bandwidth history objects, for example, instead of expecting Onionoo clients to collect their own histories. Applications that want to create a history of relays or bridges should look at the original descriptors.

Of course, I'm not going to stop you from using Onionoo for anything. But I can't promise not to break it in the future for applications that don't use it in the way it was designed to be used. Nor can I promise to add lots of stuff to the formats that is useless for the main purpose and that makes documents unnecessarily big.

I suggest rsync'ing the metrics-recent directory, or parts of it, from metrics once per hour. That's also what the Onionoo server does. Once the alarming framework is deployed on a Tor VM, that's just a local connection from one VM to another VM on the same physical host or in the same LAN.
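
A minimal sketch of what such an hourly sync could look like in Python, assuming rsync is installed locally. The rsync source string and the local directory name below are placeholders chosen for illustration and are not confirmed by this ticket; the helper names are hypothetical as well.

```python
import subprocess
import time

# Assumption: both values are placeholders, not confirmed by the discussion.
RSYNC_SOURCE = "metrics.torproject.org::metrics-recent/"
LOCAL_DIR = "metrics-recent/"


def build_rsync_command(source=RSYNC_SOURCE, dest=LOCAL_DIR):
    """Build an rsync invocation that mirrors source into dest."""
    # -a: archive mode, -z: compress in transit,
    # --delete: remove local files that disappeared upstream
    return ["rsync", "-az", "--delete", source, dest]


def sync_once():
    """Run one sync; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(), check=True)


def sync_hourly(interval_seconds=3600):
    """Sync repeatedly, sleeping interval_seconds between runs."""
    while True:
        sync_once()
        time.sleep(interval_seconds)
```

In a real deployment a cron entry calling `sync_once` would probably be simpler than the `sync_hourly` loop; the loop is only there to keep the sketch self-contained.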

  • I asked Roger for thoughts on other things that we could monitor and he suggested...
    • Entropy of bandwidth authority weights, so we know when the authority heuristics radically change.
    • Tor weather notice for when people should get a shirt. Ideally we'd then reach out to them to figure out how their experience as a relay operator was going.

Note that the original Tor Weather use case of notifying operators when a node goes down can be implemented using the Onionoo protocol just fine. That's based on information from the past week. The t-shirt thing is something that could be implemented using the Onionoo protocol, but it's not what Onionoo was designed for.
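
To make the "node goes down" use case concrete, here is a small Python sketch that inspects a single, already downloaded Onionoo-style details document and lists watched relays that are marked as not running. The field names (`relays`, `nickname`, `fingerprint`, `running`) reflect my understanding of Onionoo details documents and should be treated as assumptions; the sample data and the function name are invented for illustration.

```python
import json

# Invented sample shaped like an Onionoo details document (field names are assumptions).
SAMPLE_DOCUMENT = json.dumps({
    "relays": [
        {"nickname": "ExampleRelayA", "fingerprint": "A" * 40, "running": True},
        {"nickname": "ExampleRelayB", "fingerprint": "B" * 40, "running": False},
    ]
})


def relays_down(document_text, watched_fingerprints):
    """Return nicknames of watched relays that the document lists as not running."""
    document = json.loads(document_text)
    # Compare fingerprints case-insensitively
    watched = {fp.upper() for fp in watched_fingerprints}
    return [
        relay.get("nickname", "Unnamed")
        for relay in document.get("relays", [])
        if relay.get("fingerprint", "").upper() in watched
        and not relay.get("running", False)
    ]
```

The sketch only looks at the latest document and keeps no history of its own, which matches the intended use of the protocol described above.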

  • Notice about new, especially large relays.
  • After some more thought I realized that we should also look around to see what sort of alarming frameworks already exist. That might save us a lot of work and maybe provide a nice UI too.

Using something existing would be good.

comment:3 Changed 6 years ago by gsathya

Cc: gsathya.ceg@… added

comment:4 Changed 6 years ago by gsathya

Hopefully I've aggregated most of the discussion into a wiki page - https://trac.torproject.org/projects/tor/wiki/doc/MonitoringFramework

comment:5 Changed 6 years ago by runa

Resolution: fixed
Status: new → closed

I'm going to close this ticket since we have a wiki page up. Thanks, gsathya!