Opened 4 years ago

Last modified 14 months ago

#13137 assigned enhancement

Provide more historical data to facilitate debugging network problems

Reported by: Sebastian
Owned by: metrics-team
Priority: Low
Milestone:
Component: Metrics/Onionoo
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

It'd be great if it was possible to get an overview of historical data (for example, the past month). When was a node in the consensus, with what flags, bw, etc.

Change History (8)

comment:1 Changed 4 years ago by karsten

Thanks for the suggestion!

Can you provide more information on what you're looking for? We already provide historical data based on what's written about a relay in the consensus. For example, here's what a recent consensus says about gabelmoo:

r gabelmoo 8gREE9rC4C49a89HNaGbyh3pcoE 5KFudVCmNLttd+m4hgbJjZLquYA 2014-09-16 04:09:24 212.112.245.170 443 80
s Authority HSDir Running Stable V2Dir Valid
v Tor 0.2.6.0-alpha-dev
w Bandwidth=20 Unmeasured=1
p reject 1-65535
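
For readers less familiar with this format, the entry above can be read line by line. The following is a minimal Python sketch, assuming the usual status-entry layout (`r` carries nickname, identity, digest, publication date and time, address, OR port, and Dir port; `s` the flags; `v` the version; `w` the weights; `p` the exit policy summary). The function name `parse_status_entry` and the dictionary keys are made up for illustration and are not part of any Tor tool.

```python
def parse_status_entry(lines):
    """Parse the r/s/v/w/p lines of one consensus entry into a dict.

    Hypothetical helper for illustration only; it handles just the
    five line types shown in the example above.
    """
    entry = {}
    for line in lines:
        keyword, _, rest = line.partition(" ")
        if keyword == "r":
            parts = rest.split()
            entry["nickname"] = parts[0]
            entry["identity"] = parts[1]       # base64, padding stripped
            entry["digest"] = parts[2]
            entry["published"] = parts[3] + " " + parts[4]
            entry["address"] = parts[5]
            entry["or_port"] = int(parts[6])
            entry["dir_port"] = int(parts[7])
        elif keyword == "s":
            entry["flags"] = rest.split()
        elif keyword == "v":
            entry["version"] = rest
        elif keyword == "w":
            # "Bandwidth=20 Unmeasured=1" -> {"Bandwidth": 20, "Unmeasured": 1}
            pairs = (item.split("=", 1) for item in rest.split())
            entry["weights"] = {key: int(value) for key, value in pairs}
        elif keyword == "p":
            entry["exit_policy"] = rest
    return entry


gabelmoo = parse_status_entry([
    "r gabelmoo 8gREE9rC4C49a89HNaGbyh3pcoE 5KFudVCmNLttd+m4hgbJjZLquYA "
    "2014-09-16 04:09:24 212.112.245.170 443 80",
    "s Authority HSDir Running Stable V2Dir Valid",
    "v Tor 0.2.6.0-alpha-dev",
    "w Bandwidth=20 Unmeasured=1",
    "p reject 1-65535",
])
print(gabelmoo["nickname"], gabelmoo["flags"], gabelmoo["weights"])
```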

We already provide the following historical data:

  • details documents contain fields "first_seen" and "last_seen", telling you when the relay was first and last seen in a consensus.
  • details documents further contain a field "last_changed_address_or_port", indicating when the relay changed its primary OR address or port.
  • weights documents contain a field "consensus_weight" that is a history object with absolute consensus weight of the relay over time. This history is available for the past week, month, 3 months, year, and 5 years.
  • uptime documents contain a history object telling you what fraction of consensuses contained a given relay. This history is available for the same time frames as histories in weights documents.
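
As a rough illustration of how a client might consume one of these history objects, here is a Python sketch. It assumes the history object carries the keys `first`, `last`, `interval`, `factor`, `count`, and `values`, with values stored as scaled integers and `null` for missing data points; check the Onionoo protocol specification before relying on that. The function name `decode_history` and the sample numbers are invented for this example and are not real measurements.

```python
from datetime import datetime, timedelta


def decode_history(graph):
    """Turn one history object into a list of (timestamp, value) pairs.

    Assumes timestamps are spaced `interval` seconds apart starting at
    `first`, and that each stored integer must be multiplied by `factor`.
    """
    first = datetime.strptime(graph["first"], "%Y-%m-%d %H:%M:%S")
    step = timedelta(seconds=graph["interval"])
    points = []
    for index, raw in enumerate(graph["values"]):
        value = None if raw is None else raw * graph["factor"]
        points.append((first + index * step, value))
    return points


# Made-up sample, shaped like a short uptime history.
sample = {
    "first": "2014-08-16 12:00:00",
    "last": "2014-08-17 00:00:00",
    "interval": 14400,
    "factor": 0.001,
    "count": 4,
    "values": [999, None, 500, 250],
}

points = decode_history(sample)
for timestamp, value in points:
    print(timestamp, value)
```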

(Note that your favorite Onionoo client may not display these fields yet, but that's a task for the client author, not a feature request for Onionoo.)

What specifically are you interested in, in addition to the existing histories?

For example, do you really care whether a relay was contained in a specific consensus three weeks and two days ago at 16:00 UTC, and which relay flags and consensus weight it had? (It's not likely that we'll be able to provide this amount of detail, so I hope the answer is no.)

Or would it already help to know when a relay first got certain relay flags assigned? I'm thinking of Guard and Stable here, so I could imagine two new fields first_guard and first_stable for details documents. (Are there more relay flags that you care about in the context of historical data?)
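
To make the idea concrete: `first_guard` and `first_stable` are only proposed here and do not exist in Onionoo. A sketch of how such values could be derived, assuming we already have, per relay, a list of (consensus valid-after time, set of flags) observations. The function name `first_flag_times` and the sample observations are hypothetical.

```python
def first_flag_times(observations, flags=("Guard", "Stable")):
    """Return the earliest valid-after time at which each flag was seen.

    `observations` is an iterable of (valid_after, flags) pairs, where
    valid_after is a "YYYY-MM-DD HH:MM:SS" string (these sort correctly
    as plain strings) and flags is a set of relay flag names.
    """
    first_seen_with = {}
    for valid_after, relay_flags in sorted(observations, key=lambda o: o[0]):
        for flag in flags:
            if flag in relay_flags and flag not in first_seen_with:
                first_seen_with[flag] = valid_after
    return first_seen_with


# Invented observations for a single relay, deliberately out of order.
history = [
    ("2014-09-03 12:00:00", {"Running", "Valid", "Stable"}),
    ("2014-09-01 12:00:00", {"Running", "Valid"}),
    ("2014-09-10 12:00:00", {"Running", "Valid", "Stable", "Guard"}),
]
result = first_flag_times(history)
print(result)
```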

And for consensus weights, if the history in weights documents is not sufficient, what are you missing?

comment:2 Changed 4 years ago by Sebastian

Indeed I'd like exactly the stuff you might not provide. Basically, I'd like to be able to recreate the current consensus-health page for the past weeks or so, to see how stuff changes over time, with a bit more info :/

comment:3 Changed 4 years ago by karsten

Hang on. Let's focus more on the goals that you're trying to achieve, not on the suspected means that might get you there. Because I doubt that giving you tons of data and relying on another tool that re-creates consensus-health pages is the right approach. I think we can do better than that, but we need to do the hard thinking now, not when we're looking at the tons of data.

What stuff exactly do you want to see that changes over time? I suggested a few things that I think you might be interested in. What are you missing in that collection? Can you maybe give a few examples of what you're wondering about when you're debugging an issue with a relay?

comment:4 Changed 4 years ago by Sebastian

It's really hard for me to come up with an exact use case. Recently, I aimlessly scrolled through the consensus-health page and noticed gabelmoo wasn't voting Running for a bunch of relays. So I tried to check its log, but it logged that it found them reachable. So I thought maybe it was a fluke this hour, and waited for the next hour. Hrm, same thing. I then downloaded old consensuses to see if this was a recent development or something that came up a longer time ago, and learned it was somewhat recent (which was good, because I try to take good care of gabelmoo and hence scroll through consensus-health every now and then). Then after some more poking I realized something all those relays had in common: IPv6. I then noticed that my upstream had broken IPv6 on the host. I didn't have proper monitoring in place for that, but without something like consensus-health I'd still not vote Running on these relays. If I could click on the page, and it'd pull up the descriptors for the relays and show me the flags and when I voted what for them, it could help track down what I changed to cause an issue.

Another instance is when a relay operator complains because they aren't in the consensus. I look at consensus-health to see who is voting what, and try to figure out why. It just gives a quick overview for a relay. For that, it'd be good if it showed IP information, too, so I could search for that (in this case, I had the IP address and relay nickname, and fortunately the nickname was unique enough to identify the relay).

A third example is my work to get rid of the Naming flag, and redo the BadExiting/Rejecting/Valid-voting stuff. I tested a version of my patch, and it only voted 10/22 relays as BadExit, so obviously some were missing. Trouble was, none of the missing relays made it into the consensus; they were just in the votes. Their nicknames were "default", so searching for that didn't help either. After a bit, arma noticed that these were Syrian and Iranian relays. This, too, could've been resolved much more quickly if I could've clicked on the relay and it had pulled up the page with all the information about it (this time including country).

comment:5 in reply to: 4 Changed 4 years ago by karsten

Priority: normal → minor

Replying to Sebastian:

> It's really hard for me to come up with an exact use case.

Thanks for trying!

> Recently, I aimlessly scrolled through the consensus-health page and noticed gabelmoo wasn't voting Running for a bunch of relays. So I tried to check its log, but it logged that it found them reachable. So I thought maybe it was a fluke this hour, and waited for the next hour. Hrm, same thing. I then downloaded old consensuses to see if this was a recent development or something that came up a longer time ago, and learned it was somewhat recent (which was good, because I try to take good care of gabelmoo and hence scroll through consensus-health every now and then). Then after some more poking I realized something all those relays had in common: IPv6. I then noticed that my upstream had broken IPv6 on the host. I didn't have proper monitoring in place for that, but without something like consensus-health I'd still not vote Running on these relays. If I could click on the page, and it'd pull up the descriptors for the relays and show me the flags and when I voted what for them, it could help track down what I changed to cause an issue.

It seems that #9778 would have helped here, though it doesn't come with history. But I'd say let's add current vote information first and think about history in step two.

> Another instance is when a relay operator complains because they aren't in the consensus. I look at consensus-health to see who is voting what, and try to figure out why. It just gives a quick overview for a relay. For that, it'd be good if it showed IP information, too, so I could search for that (in this case, I had the IP address and relay nickname, and fortunately the nickname was unique enough to identify the relay).

You're aware that you can search for partial fingerprints, IP address, nicknames, or any combination of those in Atlas/Globe? For example, F2044413 uniquely identifies gabelmoo, as does gabelmoo 212.112.245.170.
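
As a sketch of where a partial fingerprint such as F2044413 comes from: the second field of the `r` line in a consensus entry is the relay identity in base64 with the padding stripped, and decoding it gives the hex fingerprint. The Python below assumes that encoding and assumes the Onionoo details endpoint accepts a `search` parameter with space-separated terms; the helper names `identity_to_fingerprint` and `search_url` are made up for this example, and nothing here contacts the network.

```python
import base64
from urllib.parse import urlencode


def identity_to_fingerprint(identity_b64):
    """Decode an unpadded base64 identity into an uppercase hex fingerprint."""
    padded = identity_b64 + "=" * (-len(identity_b64) % 4)
    return base64.b64decode(padded).hex().upper()


def search_url(*terms):
    """Build a details query URL; the base URL is an assumption."""
    query = urlencode({"search": " ".join(terms)})
    return "https://onionoo.torproject.org/details?" + query


# Identity taken from gabelmoo's "r" line quoted in comment:1.
fingerprint = identity_to_fingerprint("8gREE9rC4C49a89HNaGbyh3pcoE")
print(fingerprint[:8])  # F2044413, the partial fingerprint mentioned above
print(search_url(fingerprint[:8]))
print(search_url("gabelmoo", "212.112.245.170"))
```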

> A third example is my work to get rid of the Naming flag, and redo the BadExiting/Rejecting/Valid-voting stuff. I tested a version of my patch, and it only voted 10/22 relays as BadExit, so obviously some were missing. Trouble was, none of the missing relays made it into the consensus; they were just in the votes. Their nicknames were "default", so searching for that didn't help either. After a bit, arma noticed that these were Syrian and Iranian relays. This, too, could've been resolved much more quickly if I could've clicked on the relay and it had pulled up the page with all the information about it (this time including country).

This is also related to #9778. Once we parse votes, we'll also include relays that didn't make it into the consensus.

Similarly to the second use case, relays with nickname "default" shouldn't pose a problem, because you can always search by partial fingerprint.

To summarize, let's do #9778 first and then see whether we should add more historical data to facilitate debugging problems like the ones described here. Setting priority to minor.

comment:6 Changed 4 years ago by karsten

Type: defect → enhancement

Sounds like an enhancement to me, not a defect.

comment:7 Changed 14 months ago by karsten

Severity: Normal
Summary: Historical data → Provide more historical data to facilitate debugging network problems

Attempt to provide a more accurate summary.

comment:8 Changed 14 months ago by karsten

Owner: set to metrics-team
Status: new → assigned