Opened 2 months ago

Last modified 2 months ago

#33061 needs_information enhancement

archived bandwidth scanner files lack explicit source attribution

Reported by: starlight
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/CollecTor
Version:
Severity: Normal
Keywords:
Cc: metrics-team
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Current files in

https://collector.torproject.org/archive/relay-descriptors/bandwidths/
https://collector.torproject.org/recent/relay-descriptors/bandwidths/

lack any indication of which bandwidth scanner generated them. (Files archived from Tom's collection are attributed.)

I collect these files here and abandoned an attempt to fill a gap because of this issue. Ad-hoc logic to bin them may be possible but is not trivial. I can provide attribution by sha256 digest for most of them if the file naming is improved.

original ticket #21378

Change History (5)

comment:1 Changed 2 months ago by karsten

Cc: metrics-team added
Status: new → needs_information
Type: defect → enhancement

This was a deliberate design decision back when we added bandwidth files to CollecTor. Bandwidth scanners and the files they generate are not tied to directory authorities, except that usually one bandwidth file is used by one directory authority. But it could be that a bandwidth file is never used, or is used by more than one directory authority. The only way to be certain that a bandwidth file is used by a directory authority is to look at a vote and find the bandwidth file reference in there.
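The vote lookup described above could be sketched roughly as follows. This is only an illustration, not CollecTor code: it assumes the vote carries a `dir-source` line whose first argument is the authority nickname and a `bandwidth-file-digest` line with a `sha256=` entry, as described in the directory specification. The function name `vote_bandwidth_reference` is hypothetical.

```python
import re

# Assumption: votes contain "dir-source <nickname> ..." and
# "bandwidth-file-digest sha256=<base64 digest>" lines (per dir-spec).
DIR_SOURCE = re.compile(r"^dir-source (\S+) ", re.MULTILINE)
BW_DIGEST = re.compile(r"^bandwidth-file-digest .*?\bsha256=(\S+)", re.MULTILINE)


def vote_bandwidth_reference(vote_text):
    """Return (authority nickname, sha256 digest) or None if either is missing."""
    source = DIR_SOURCE.search(vote_text)
    digest = BW_DIGEST.search(vote_text)
    if source is None or digest is None:
        # Older votes, or votes made without a bandwidth file, carry no reference.
        return None
    return source.group(1), digest.group(1)
```

A vote without a `bandwidth-file-digest` line yields `None`, which matches the point that a bandwidth file may never be referenced by any authority.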

I'll leave this open as an enhancement and in needs_information for our next team meeting to decide whether we want to question this design decision.

comment:2 Changed 2 months ago by starlight

I have conflated bandwidth scanner with bandwidth authority in my thinking and in this ticket, but what's interesting is attribution of the bandwidth scanner source for each document. These originate in the bandwidth scanners, and as you say, generally one bandwidth scanner has been associated with one authority thus far. I have thought of the new mechanism mainly as a standardized way of making the information available, in contrast to the previous ad-hoc web-server hostings.

comment:3 Changed 2 months ago by starlight

To clarify further: Each bandwidth scanner has a unique perspective of available bandwidth capacities in the network. Associating documents in time series tied to individual scanners is critical to making sense of the data.
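As a rough illustration of the per-scanner time series meant here, the sketch below groups already-attributed documents by scanner and orders them by time. It assumes the first non-annotation line of a bandwidth file is a Unix timestamp, as in the bandwidth file specification, and that archived copies may begin with an `@type` annotation line. The scanner labels and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def file_timestamp(text):
    """Parse the leading Unix timestamp line of a bandwidth file."""
    for line in text.splitlines():
        if line.startswith("@"):
            # Assumption: archive annotations such as "@type ..." come first.
            continue
        return datetime.fromtimestamp(int(line), tz=timezone.utc)
    raise ValueError("no timestamp line found")


def series_by_scanner(attributed):
    """Group (scanner label, file text) pairs into sorted per-scanner series."""
    series = defaultdict(list)
    for scanner, text in attributed:
        series[scanner].append(file_timestamp(text))
    return {scanner: sorted(stamps) for scanner, stamps in series.items()}
```

Without the scanner label in the first place, this grouping step has nothing to key on, which is the gap this ticket describes.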

comment:4 in reply to: 3 Changed 2 months ago by karsten

Replying to starlight:

> To clarify further: Each bandwidth scanner has a unique perspective of available bandwidth capacities in the network. Associating documents in time series tied to individual scanners is critical to making sense of the data.

True. What you'll have to do is combine bandwidth files with votes to extract meaningful results. This is certainly more work than getting source information from bandwidth files directly. But it's also not trivial or maybe not even possible for CollecTor to include this information in bandwidth files while archiving them. That's why it needs to happen at the analysis stage right now.
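A minimal sketch of that analysis-stage join, under stated assumptions: the digest in a vote is taken to be the unpadded base64 encoding of the SHA-256 hash of the bandwidth file, and archived copies are assumed to carry leading `@type` annotation lines that are not part of the hashed bytes. Both points should be checked against real data. The function names and the `vote_refs` input shape are hypothetical.

```python
import base64
import hashlib


def strip_annotations(raw):
    """Drop leading "@..." lines (assumed to be added by the archive)."""
    while raw.startswith(b"@"):
        newline = raw.find(b"\n")
        if newline == -1:
            return b""
        raw = raw[newline + 1:]
    return raw


def bandwidth_file_digest(raw):
    """Unpadded base64 SHA-256 of the file body, for comparison with votes."""
    digest = hashlib.sha256(strip_annotations(raw)).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def attribute(files, vote_refs):
    """Map each file name to the sorted authorities whose votes reference it.

    files: {name: raw bytes}; vote_refs: iterable of (authority, digest).
    """
    by_digest = {}
    for authority, digest in vote_refs:
        by_digest.setdefault(digest, set()).add(authority)
    return {
        name: sorted(by_digest.get(bandwidth_file_digest(raw), ()))
        for name, raw in files.items()
    }
```

A file may map to no authority or to several, consistent with the earlier point that a bandwidth file can go unused or be used more than once.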

Note that combining descriptors is not unusual for an analysis. Right now I'm combining consensuses, votes, server descriptors, and extra-infos for another, unrelated analysis. Sometimes it's simply necessary to combine data from different data sources; in the bandwidth files case from bandwidth scanners and directory authorities using bandwidth scanner data.

Maybe we cannot decide this right now. Maybe we first need to experience how painful it would be to analyze bandwidth files when we include that data somewhere.

comment:5 in reply to: 4 Changed 2 months ago by starlight

Replying to karsten:

> Replying to starlight:
>
> > To clarify further: Each bandwidth scanner has a unique perspective of available bandwidth capacities in the network. Associating documents in time series tied to individual scanners is critical to making sense of the data.
>
> True. What you'll have to do is combine bandwidth files with votes to extract meaningful results.

I agree combining votes and bandwidth documents is useful, but I find significant value in bandwidth scanner documents alone provided the source scanners are attributed.

> ... it's also not trivial or maybe not even possible for CollecTor to include this information in bandwidth files while archiving them.

I'm curious why; I have no difficulty with attribution here. The scanner-to-authority correlation may not be the big-picture design, but it is the practical reality to date.

> Note that combining descriptors is not unusual for an analysis. Right now I'm combining consensuses, votes, server descriptors, and extra-infos for another, unrelated analysis. Sometimes it's simply necessary to combine data from different data sources; in the bandwidth files case from bandwidth scanners and directory authorities using bandwidth scanner data.

No disagreement: some forms of analysis are fine, or even better, without the source.

=====

I wrote a Perl script that successfully attributes scanner sources for the gaps filled from CollecTor. I am willing to make the results available.
