Opened 3 months ago

Last modified 7 days ago

#30219 needs_review enhancement

Add Tom's bandwidth file archive to CollecTor

Reported by: irl
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/CollecTor
Version:
Severity: Normal
Keywords: tor-bwauth, tor-dirauth, metrics-roadmap-2019-q2
Cc: tom, teor, metrics-team, starlight@…
Actual Points:
Parent ID: #21378
Points:
Reviewer: irl
Sponsor:

Description

We should think about how best to do this, but once we've transitioned to fetching bandwidth files from the dir auths we can probably do a simple backfill of Tom's data.

Change History (14)

comment:1 Changed 2 months ago by karsten

Hmm, I'm probably missing something obvious, but where would I find Tom's data?

comment:3 Changed 6 weeks ago by karsten

Okay, I finally got around to trying a local import with CollecTor. At least importing a small sample of these bandwidth files worked okay.

However, I'm wondering if we need to make an enhancement before importing these bandwidth files into CollecTor. Consider these subdirectories in Tom's tarball:

 24G	bastet
8.2G	faravahar
 22G	gabelmoo
 29G	maatuska
2.2G	maatuska-21697
1.9G	maatuska-fastly
616M	maatuska-nodns
 58M	maatuska-nofasthop
3.5G	maatuska-vanilla
 26G	moria1

As of now, when we import these files into CollecTor we're losing metadata like the source, as well as human-readable annotations like whether DNS was broken or which bandwidth file server was used. We could provide those annotations via some timeline (e.g., by saying when maatuska switched to Fastly), but there's no good way to retain source information in these files.

Note that we're facing the same issue with current bandwidth files. We just said that they'll be referenced from votes which provides source information indirectly.

Also note that we briefly discussed including source information in the file name. But even if we do that, we should consider adding something to the file contents, most likely as an annotation in order to keep the digest unchanged. We should not put relevant information only in the file name.

Hmmm.
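
A minimal sketch of why an annotation would leave the digest unchanged, assuming that annotations are leading lines starting with "@" and that the digest is a SHA-256 over everything after them. The helper names and the sample file contents are hypothetical, and this is not CollecTor's actual code:

```python
import hashlib
from typing import List, Tuple


def split_annotations(raw: bytes) -> Tuple[List[bytes], bytes]:
    """Split leading '@' annotation lines from the rest of the file."""
    lines = raw.splitlines(keepends=True)
    i = 0
    while i < len(lines) and lines[i].startswith(b"@"):
        i += 1
    annotations = [line.rstrip(b"\r\n") for line in lines[:i]]
    return annotations, b"".join(lines[i:])


def content_digest(raw: bytes) -> str:
    """Hex SHA-256 over the file contents, ignoring leading annotations."""
    _, body = split_annotations(raw)
    return hashlib.sha256(body).hexdigest().upper()


# Made-up bandwidth file contents, for illustration only.
plain = b"1555837200\nversion=1.2.0\n"
annotated = b"@source maatuska\n" + plain

# The annotation is retained as metadata but does not affect the digest.
same_digest = content_digest(plain) == content_digest(annotated)
```

Under these assumptions, a file stored with and without the annotation can still be matched by digest.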

comment:4 Changed 6 weeks ago by tom

You should disregard all of the maatuska-* directories and only use the maatuska directory. All the others were a parallel bwauth running experiments.

comment:5 Changed 6 weeks ago by karsten

Okay, good to know. This solves the "annotations" problem, but it doesn't yet solve the "source" problem. Still hmmmm.

comment:6 Changed 5 weeks ago by irl

tor already uses @source when writing out cached descriptors to disk.

https://gitweb.torproject.org/tor.git/tree/src/feature/dirclient/dirclient.c#n1816

They are IP addresses when tor does it, but the annotation is not defined anywhere. I don't know that anything is parsing this back into an IP address and we should be able to treat this as an opaque string. In most cases all we will care about is equality comparison to another string.
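
A small sketch of treating the annotation as an opaque string, assuming annotations are leading lines of the form "@keyword value". The function name and the sample inputs are hypothetical, and the exact formatting tor uses for the value is not reproduced here:

```python
from typing import Optional


def source_of(text: str) -> Optional[str]:
    """Return the value of a leading '@source' annotation, or None.

    The value is returned as-is: no attempt is made to parse it as an
    IP address, a hostname, or a nickname.
    """
    for line in text.splitlines():
        if not line.startswith("@"):
            break  # annotations only appear before the document body
        keyword, _, value = line.partition(" ")
        if keyword == "@source":
            return value
    return None


# Illustrative inputs only; the bodies are placeholders.
a = source_of("@source maatuska\n1555837200\n")
b = source_of("@type bandwidth-file 1.0\n@source maatuska\n1555837200\n")
c = source_of("1555837200\n")

# Equality comparison is the only operation needed on the value.
same_source = a == b
```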

comment:7 Changed 5 weeks ago by atagar

Hi Karsten, hi irl. I'm not against new annotations but if we go that route I'd like for us to formally specify them in the dir-spec.

Presently tor has two kinds of annotations...

  1. CollecTor's @type annotations. These are speced on the metrics site and only exist in CollecTor archives.
  2. Server descriptors cached into tor's data directory. These aren't speced anywhere and effectively nobody uses them.

If we'd like to add a new annotation type that's great! But I'd appreciate if we formally spec annotations in the dir-spec.

comment:8 Changed 5 weeks ago by atagar

Oh right! Forgot we have a ticket for this. Here it is...

https://trac.torproject.org/projects/tor/ticket/28067

comment:9 in reply to:  6 Changed 5 weeks ago by teor

Replying to irl:

tor already uses @source when writing out cached descriptors to disk.

https://gitweb.torproject.org/tor.git/tree/src/feature/dirclient/dirclient.c#n1816

They are IP addresses when tor does it, but the annotation is not defined anywhere. I don't know that anything is parsing this back into an IP address and we should be able to treat this as an opaque string. In most cases all we will care about is equality comparison to another string.

"Address" is an IP address or domain name, with a port:
https://gitweb.torproject.org/tor.git/tree/src/core/or/connection_st.h#n120

comment:10 Changed 8 days ago by karsten

Alright, it sounds like specifying annotations, including this one, is already under discussion in #28067.

What remains to be discussed is what annotation we're using in this case.

Regarding "@source" annotations, it's true that tor writes these as "FQDN (or IP) and port", but it also accepts other values such as "@source controller" when it attempts to parse a router descriptor received via the controller in router_load_single_router(). So it really doesn't care what's written in that annotation, which is fine.

In this case we might want to use the directory authority nickname rather than IP address or domain name, with a port, because nicknames work so much better in file names than the others.

How about we include an optional "@source $nickname" annotation for bandwidth files that we also add to the file name, such as:

2019-04-21-09-00-00-bandwidth-file-$nickname-42D07217935283D42CC559A61E7F788E3A92F0E2CB54BE58B5B83E25563C55A2 // nickname given
2019-04-21-09-00-00-bandwidth-file-42D07217935283D42CC559A61E7F788E3A92F0E2CB54BE58B5B83E25563C55A2 // no nickname given
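
The proposed naming could be sketched as follows. The function name and parameters are hypothetical, and the sketch assumes the nickname itself contains no "-" so that the name stays unambiguous to split:

```python
from datetime import datetime
from typing import Optional


def bandwidth_file_name(published: datetime, digest_hex: str,
                        nickname: Optional[str] = None) -> str:
    """Build an archive file name, inserting the nickname only if given."""
    parts = [published.strftime("%Y-%m-%d-%H-%M-%S"), "bandwidth-file"]
    if nickname:
        parts.append(nickname)
    parts.append(digest_hex.upper())
    return "-".join(parts)


digest = "42D07217935283D42CC559A61E7F788E3A92F0E2CB54BE58B5B83E25563C55A2"
when = datetime(2019, 4, 21, 9, 0, 0)

with_nickname = bandwidth_file_name(when, digest, "maatuska")
without_nickname = bandwidth_file_name(when, digest)
```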

comment:11 Changed 8 days ago by irl

Is there always a one-to-one mapping of bwauth to dirauth? Can you run a bwauth without having a dirauth attached to it? Can one bwauth be consumed by multiple dirauths?

If we're going to say $nickname then we should not have this be the same nickname as in a server descriptor, but rather one that we keep a registry of and can be more precise about. It's a key that you can use to look up more details about the source by asking the Metrics team (we'd probably put the list on collector.html).

comment:12 in reply to:  11 Changed 8 days ago by tom

Replying to irl:

Is there always a one to one mapping of bwauth to dirauth?

No

Can you run a bwauth without having a dirauth attached to it?

Absolutely. In fact, all of the maatuska-* files in the dataset come from a bwauth whose files were never used by a dirauth. Other people have also run bwauths on the tor network without sending their files to a dirauth.

Whether collector wants to support those is a whole other matter. For one thing, the files would need to be imported manually (or a new collector-scanner set up to download them in a non-standard way).

Can one bwauth be consumed by multiple dirauths?

Yes, although this is discouraged. I'm not sure it's ever happened, although we did consider it at one point to get the network kicked out of a state where we weren't using bwauth measurements for path selection.

Also you didn't ask: can a dirauth use multiple bwauths?

It can, although I don't think this has ever happened. The dirauth would need to switch between the bwauths on some schedule or otherwise combine the files in a way I don't think we've ever considered. Typically, to achieve diversity, a bwauth performs its downloads from multiple differently geolocated servers.

comment:13 in reply to:  11 Changed 8 days ago by karsten

Good questions! I think our goal here is to archive bandwidth files as they're being used in the network. If a bandwidth file had been used by more than one dirauth, we'd store the same file multiple times, once under each source name. (Or if we'd download it without including the source annotation, we'd store it only once, and it would be referenced using the same digest from multiple votes.) And if a bandwidth file had not been used by any dirauth, we wouldn't get to see it at all.
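
The two storage options can be sketched like this, assuming the input is the list of (dirauth nickname, bandwidth file digest) references seen in votes. The function names and the placeholder digests are hypothetical:

```python
from typing import Iterable, Set, Tuple

Reference = Tuple[str, str]  # (dirauth nickname, bandwidth file digest)


def stored_with_source(refs: Iterable[Reference]) -> Set[Reference]:
    """One stored copy per (source, digest) pair."""
    return set(refs)


def stored_by_digest(refs: Iterable[Reference]) -> Set[str]:
    """One stored copy per digest, regardless of which dirauth used it."""
    return {digest for _, digest in refs}


# Hypothetical case: two dirauths reference the same file.
refs = [
    ("moria1", "AAAA"),    # placeholder digests, not real ones
    ("gabelmoo", "AAAA"),
    ("maatuska", "BBBB"),
]

copies_with_source = len(stored_with_source(refs))
copies_by_digest = len(stored_by_digest(refs))
```

In both variants, a bandwidth file that no dirauth references never shows up in the input, so it is never stored.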

comment:14 Changed 7 days ago by karsten

Reviewer: irl
Status: new → needs_review

Please review commit abad358 in my task-30219 branch, which includes the directory name as @source annotation and as part of the file name.
