Opened 23 months ago

Last modified 3 weeks ago

#21378 accepted enhancement

Archive bwauth bandwidth files

Reported by: tom
Owned by: irl
Priority: Medium
Milestone:
Component: Metrics/CollecTor
Version:
Severity: Normal
Keywords: tor-bwauth tor-dirauth
Cc: teor, metrics-team, starlight@…
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by karsten)

The raw bwauth votes (sample: https://bwauth.ritter.vg/bwauth/bwscan.V3BandwidthsFile) contain information such as last measured time, circuit failures and (eventually) scanner information. This can be used for debugging purposes.

Blocked by #21377, possible next steps in comment 14.

Change History (27)

comment:1 Changed 15 months ago by karsten

Summary: Collector should archive bwauth votes → Archive bwauth votes

Keep summary as short and precise as possible.

comment:2 Changed 6 months ago by juga

As commented in https://trac.torproject.org/projects/tor/ticket/21377#comment:4

This is now fixed in sbws, though it is still not being used by any DirAuth.
It is not planned to add this to Torflow.

We can correlate the votes with the Torflow bandwidth measurement files by the timestamp, right? So I wonder if it would still be possible to archive the files produced by all the bwauths that are running Torflow.

Another more exotic idea (which probably needs another ticket if it makes sense) is to collect the data from the testnet, since we have DirAuths running sbws there. Would this require a lot of extra work?

comment:3 Changed 6 months ago by juga

Keywords: tor-bwauth tor-dirauth added

comment:4 Changed 6 months ago by juga

Cc: teor added

comment:5 in reply to:  2 Changed 6 months ago by teor

Summary: Archive bwauth votes → Archive bwauth bandwidth files

Replying to juga:

As commented in https://trac.torproject.org/projects/tor/ticket/21377#comment:4

This is now fixed in sbws, though it is still not being used by any DirAuth.
It is not planned to add this to Torflow.

We can correlate the votes with the Torflow bandwidth measurement files by the timestamp, right? So I wonder if it would still be possible to archive the files produced by all the bwauths that are running Torflow.

Another more exotic idea (which probably needs another ticket if it makes sense) is to collect the data from the testnet, since we have DirAuths running sbws there. Would this require a lot of extra work?

This ticket is about archiving the entire v3bw file from each directory authority.
It's not enough to archive the files from the testnet.

We could implement this ticket by making the bandwidth file part of the directory protocol. We have a spec for the bandwidth file format, so all we need to do is specify the URL for the file in torspec, and implement it in the code.

comment:6 Changed 6 months ago by juga

So do you mean adding something like this to dir-spec.txt?

"bandwidth-file-url"

[At most once]

The Bandwidth file URL used to obtain the measured bandwidth.
These files SHOULD be available at:
http://<hostname>/tor/bwfiles/<bwfile>

Since Torflow and sbws use different names for the bandwidth files, I guess it's fine not to specify the name that the file needs to have, but just to have the files available in a known directory/path?

Should all files be available or just the one used for the last vote or for some period of time?

comment:7 Changed 6 months ago by teor

No, I suggest that we make the current bandwidth file available on every directory authority at:
http://<hostname>/tor/status-vote/next/bandwidth.z

Just like the rest of the directory documents:
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n3294

Edit:
We could also store a temporary copy of the exact file we used, and serve it from http://<hostname>/tor/status-vote/current/bandwidth.z
But that is more complicated, so let's get the next URL working first.
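A minimal sketch of how a client could build the URLs suggested above and decode the response. The function names are made up for illustration, and the fallback for an uncompressed body is an assumption about server behaviour, not something specified in this ticket.

```python
import zlib


def bandwidth_file_url(authority: str, which: str = "next") -> str:
    """Build the URL proposed in this ticket for an authority's bandwidth file.

    `authority` is a "host:port" string; `which` is "next" or "current".
    """
    if which not in ("next", "current"):
        raise ValueError("which must be 'next' or 'current'")
    return f"http://{authority}/tor/status-vote/{which}/bandwidth.z"


def decode_body(body: bytes) -> str:
    """Decompress a '.z' (zlib) response body.

    Assumption: if the body is not valid zlib data, treat it as plain text.
    """
    try:
        body = zlib.decompress(body)
    except zlib.error:
        pass
    return body.decode("utf-8", errors="replace")
```

The body passed to decode_body would come from whatever HTTP client the caller already uses; nothing here performs a network request.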

Last edited 6 months ago by teor

comment:8 Changed 6 months ago by irl

Using the fixed URL http://<hostname>/tor/status-vote/next/bandwidth.z sounds like it would be very easy to add this to CollecTor.

We have discussed in the Metrics team extending dir-spec.txt to allow fetching "recent" files as well as just next/current. In the case that there is a wide CollecTor outage, and we miss a file, it would be good to have those files cached (on a best-effort basis, not necessarily persisted to disk) and available via some URL.

I don't know if karsten already had some ideas about what these URLs would look like, but we should perhaps consider this before implementing changes to dir-spec.txt.

comment:9 in reply to:  8 ; Changed 6 months ago by teor

Replying to irl:

Using the fixed URL http://<hostname>/tor/status-vote/next/bandwidth.z sounds like it would be very easy to add this to CollecTor.

Thanks for the feedback!

We have discussed in the Metrics team extending dir-spec.txt to allow fetching "recent" files as well as just next/current. In the case that there is a wide CollecTor outage, and we miss a file, it would be good to have those files cached (on a best-effort basis, not necessarily persisted to disk) and available via some URL.

How is this any different to losing descriptors or consensuses?
(Please answer this question on a separate ticket.)

I don't know if karsten already had some ideas about what these URLs would look like, but we should perhaps consider this before implementing changes to dir-spec.txt.

Please open a separate ticket for this feature. It's potentially a large feature. And it's not essential for the initial release of this feature.

comment:10 in reply to:  9 ; Changed 6 months ago by irl

It looks like the changes that are needed are:

1) Teach RelayDescriptorDownloader to download the new URL (in the downloadDescriptors function)
2) Teach RelayDescriptorParser to recognise the file
3) Teach ArchiveWriter where it should put the files in CollecTor's hierarchy
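The three steps above can be sketched roughly as follows. This is only an illustration in Python, not CollecTor's code: the function names, the injected fetch callable, and the directory layout are all hypothetical. The only format fact relied on is that a bandwidth file starts with a timestamp line.

```python
from pathlib import Path
from typing import Callable, Optional


def looks_like_bandwidth_file(text: str) -> bool:
    """Step 2 (recognise): the first line should be a Unix timestamp."""
    first_line = text.split("\n", 1)[0].strip()
    return first_line.isascii() and first_line.isdigit()


def archive_bandwidth_file(
    fetch: Callable[[str], Optional[str]],  # hypothetical downloader (step 1)
    authority: str,
    out_dir: Path,  # hypothetical location in the archive (step 3)
    file_name: str,
) -> Optional[Path]:
    """Download, recognise, and store one bandwidth file.

    Returns the written path, or None if nothing usable was fetched.
    """
    text = fetch(authority)
    if text is None or not looks_like_bandwidth_file(text):
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / file_name
    path.write_text(text, encoding="utf-8")
    return path
```

Passing the downloader in as a callable keeps the sketch testable without network access.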

comment:11 in reply to:  7 Changed 5 months ago by juga

Replying to teor:

No, I suggest that we make the current bandwidth file available on every directory authority at:
http://<hostname>/tor/status-vote/next/bandwidth.z

See comment https://trac.torproject.org/projects/tor/ticket/21377#comment:9

comment:12 in reply to:  9 ; Changed 5 months ago by teor

Replying to teor:

Replying to irl:

Using the fixed URL http://<hostname>/tor/status-vote/next/bandwidth.z sounds like it would be very easy to add this to CollecTor.

Thanks for the feedback!

We have discussed in the Metrics team extending dir-spec.txt to allow fetching "recent" files as well as just next/current. In the case that there is a wide CollecTor outage, and we miss a file, it would be good to have those files cached (on a best-effort basis, not necessarily persisted to disk) and available via some URL.

How is this any different to losing descriptors or consensuses?
(Please answer this question on a separate ticket.)

I don't know if karsten already had some ideas about what these URLs would look like, but we should perhaps consider this before implementing changes to dir-spec.txt.

Please open a separate ticket for this feature. It's potentially a large feature. And it's not essential for the initial release of this feature.

If you open a separate ticket for historical directory documents, please make #26698 a child of that ticket. We'll need bandwidth file hashes to work out the exact file used in each vote.

comment:13 in reply to:  12 Changed 4 months ago by teor

I opened #27047 for tor authorities to keep recent documents.

comment:14 in reply to:  10 ; Changed 5 weeks ago by karsten

Cc: metrics-team added
Description: modified
Priority: Low → Medium

I just went through the long discussion above and tried to identify next steps. irl's list of needed changes looks pretty good. I'll add some thoughts to these steps below that we need to discuss when implementing this.

Replying to irl:

It looks like the changes that are needed are:
1) Teach RelayDescriptorDownloader to download the new URL (in the downloadDescriptors function)

  • We can either attempt to fetch this file from each authority every time, or we can have a config option which authorities should have them. In the future, we can switch to fetching only those files that are referenced from votes, unless for some reason we want to have non-referenced files, too.
  • The relaydescs module runs twice per hour, so it's going to download the file twice every hour. Again, if we only fetch referenced files, we wouldn't download the same file more than once. But it sounds like the initial version will be rather simple in this regard. Which is fine.
  • I assume there are no plans that authorities serve bandwidth files of other authorities? That's different for votes which are cached by other authorities. Should be fine, but something to consider for the future.

2) Teach RelayDescriptorParser to recognise the file

  • While we're waiting for #21377, can we have a sample file to start writing some parsing code?

3) Teach ArchiveWriter where it should put the files in CollecTor's hierarchy

  • Let's discuss what should go into the file name. Timestamp, fingerprint, and digest? Maybe something similar to the vote file name format (with some parts shortened): 2018-11-05-09-00-00-vote-EFCBE720[...]-0D97EDB6[...]?
  • As part of this step, we might have to teach metrics-lib to recognize the new descriptor type. I believe that CollecTor will store it anyway, but it's going to complain loudly. Just in case it acts up, we can teach metrics-lib to just recognize the descriptor type without providing getters for descriptor contents.
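To make the file name question concrete, here is one possible shape, modelled on the vote file name format quoted above. The "bandwidth" label, the field order, and the use of upper-case hex are assumptions for discussion, not a decided format.

```python
from datetime import datetime, timezone


def archive_file_name(timestamp: int, fingerprint: str, digest_hex: str) -> str:
    """Build a name like <UTC time>-bandwidth-<fingerprint>-<digest>.

    `timestamp` is the Unix timestamp from the file, `fingerprint` the
    authority's identity fingerprint, `digest_hex` a hex digest of the file.
    The layout is hypothetical and only mirrors the vote file names.
    """
    when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return f"{when:%Y-%m-%d-%H-%M-%S}-bandwidth-{fingerprint.upper()}-{digest_hex.upper()}"
```

Putting the timestamp first keeps the files sorted chronologically in a directory listing, as with votes.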

comment:15 in reply to:  14 ; Changed 5 weeks ago by teor

Replying to karsten:

I just went through the long discussion above and tried to identify next steps. irl's list of needed changes looks pretty good. I'll add some thoughts to these steps below that we need to discuss when implementing this.

Replying to irl:

It looks like the changes that are needed are:
1) Teach RelayDescriptorDownloader to download the new URL (in the downloadDescriptors function)

  • We can either attempt to fetch this file from each authority every time, or we can have a config option which authorities should have them.

I suggest "each authority every time", because a hard-coded config will miss some of the bandwidth files on new bandwidth authorities.
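A minimal sketch of "each authority every time": try every known authority and simply skip the ones that do not serve the file. The authority list is supplied by the caller, and the function names are made up for this example.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def http_fetch(url: str, timeout: float = 30.0) -> Optional[bytes]:
    """Return the response body, or None on any HTTP or network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError):
        return None


def fetch_from_all(
    authorities: Iterable[str],  # "host:port" strings, supplied by the caller
    fetch: Callable[[str], Optional[bytes]] = http_fetch,
) -> Dict[str, bytes]:
    """Ask every authority for its bandwidth file; keep whatever is served."""
    results: Dict[str, bytes] = {}
    for authority in authorities:
        body = fetch(f"http://{authority}/tor/status-vote/next/bandwidth.z")
        if body:
            results[authority] = body
    return results
```

An authority that is not a bandwidth authority would just produce no entry, so no per-authority configuration is needed.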

In the future, we can switch to fetching only those files that are referenced from votes, unless for some reason we want to have non-referenced files, too.

How are bandwidth files referenced from votes?

I don't think we will implement "bandwidth-file-url" from https://trac.torproject.org/projects/tor/ticket/21378?replyto=14#comment:6

Tor 0.3.5? and later add bandwidth file headers to each vote, and we may add a bandwidth file hash in future. Once all authorities upgrade, you can fetch the bandwidth file if the vote contains headers.

  • The relaydescs module runs twice per hour, so it's going to download the file twice every hour. Again, if we only fetch referenced files, we wouldn't download the same file more than once.

I am not sure if we plan on implementing "referenced files" in Tor. Can you explain what you mean?

But it sounds like the initial version will be rather simple in this regard. Which is fine.

I think Juga has written code for a more complex version. But we will focus on getting the simple version working first.
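For the simple version, one way to avoid archiving the same file twice when the module runs twice per hour is to remember a digest of what has already been stored. This is a sketch under that assumption; the class name is hypothetical and the state is kept in memory only.

```python
import hashlib
from typing import Set, Tuple


class SeenBandwidthFiles:
    """Remember which (authority, content digest) pairs were already archived."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def is_new(self, authority: str, body: bytes) -> bool:
        """Return True the first time this exact content is seen for an authority."""
        key = (authority, hashlib.sha256(body).hexdigest())
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

This still downloads the file on every run; it only avoids storing duplicates.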

  • I assume there are no plans that authorities serve bandwidth files of other authorities? That's different for votes which are cached by other authorities. Should be fine, but something to consider for the future.

Votes are posted, fetched, and cached by authorities so that each authority can create a consensus.
There's no equivalent for bandwidth files, so we probably won't implement bandwidth file caching.
But if you tell us you really need it, we could work something out.

2) Teach RelayDescriptorParser to recognise the file

  • While we're waiting for #21377, can we have a sample file to start writing some parsing code?

moria1's votes have a bandwidth-file-headers line, but it's based on torflow's bandwidth file, so it only has a timestamp:
http://128.31.0.34:9131/tor/status-vote/current/authority

The bandwidth file spec contains sample data:
https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt#n450

We're working on version 1.2.0 of the format for sbws 1.0 in #28085. When sbws 1.0 is ready, we will update the spec with sample data from the latest sbws.
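Until real sample data from sbws is available, parsing code could start from a sketch like the one below. It assumes the layout described in the bandwidth file spec: a timestamp line, optional key=value header lines ended by a line consisting only of "=" characters, then one line of space-separated key=value pairs per relay. A file without such a terminator line is treated as having no headers. The function name and the sample values in the usage note are made up.

```python
from typing import Dict, List, Tuple


def parse_bandwidth_file(text: str) -> Tuple[int, Dict[str, str], List[Dict[str, str]]]:
    """Split a bandwidth file into (timestamp, headers, relay lines)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].isdigit():
        raise ValueError("first line must be a Unix timestamp")
    timestamp = int(lines[0])
    body = lines[1:]

    # Assumption: the header ends at the first line made up only of '='.
    terminator = next((i for i, line in enumerate(body) if set(line) == {"="}), None)
    header_lines = body[:terminator] if terminator is not None else []
    relay_lines = body[terminator + 1:] if terminator is not None else body

    headers: Dict[str, str] = {}
    for line in header_lines:
        key, _, value = line.partition("=")
        headers[key] = value

    relays = [
        dict(pair.split("=", 1) for pair in line.split() if "=" in pair)
        for line in relay_lines
    ]
    return timestamp, headers, relays
```

For example, parsing "1541408400\nversion=1.2.0\n=====\nnode_id=$AAAA bw=760\n" yields the timestamp 1541408400, one header, and one relay entry.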

3) Teach ArchiveWriter where it should put the files in CollecTor's hierarchy

  • Let's discuss what should go into the file name. Timestamp, fingerprint, and digest? Maybe something similar to the vote file name format (with some parts shortened): 2018-11-05-09-00-00-vote-EFCBE720[...]-0D97EDB6[...]?
  • As part of this step, we might have to teach metrics-lib to recognize the new descriptor type. I believe that CollecTor will store it anyway, but it's going to complain loudly. Just in case it acts up, we can teach metrics-lib to just recognize the descriptor type without providing getters for descriptor contents.

comment:16 in reply to:  15 ; Changed 5 weeks ago by karsten

Replying just to the parts where I wouldn't reply "okay, cool!":

Replying to teor:

Replying to karsten:

In the future, we can switch to fetching only those files that are referenced from votes, unless for some reason we want to have non-referenced files, too.

How are bandwidth files referenced from votes?

I don't think we will implement "bandwidth-file-url" from https://trac.torproject.org/projects/tor/ticket/21378?replyto=14#comment:6

Tor 0.3.5? and later add bandwidth file headers to each vote, and we may add a bandwidth file hash in future. Once all authorities upgrade, you can fetch the bandwidth file if the vote contains headers.

I guess I was thinking of 0.3.5 then. I'm not aware of any other plans.

  • The relaydescs module runs twice per hour, so it's going to download the file twice every hour. Again, if we only fetch referenced files, we wouldn't download the same file more than once.

I am not sure if we plan on implementing "referenced files" in Tor. Can you explain what you mean?

Same as above: bandwidth files referenced from votes.

  • I assume there are no plans that authorities serve bandwidth files of other authorities? That's different for votes which are cached by other authorities. Should be fine, but something to consider for the future.

Votes are posted, fetched, and cached by authorities so that each authority can create a consensus.
There's no equivalent for bandwidth files, so we probably won't implement bandwidth file caching.
But if you tell us you really need it, we could work something out.

Sounds reasonable. I don't think we'll need it.

2) Teach RelayDescriptorParser to recognise the file

  • While we're waiting for #21377, can we have a sample file to start writing some parsing code?

moria1's votes have a bandwidth-file-headers line, but it's based on torflow's bandwidth file, so it only has a timestamp:
http://128.31.0.34:9131/tor/status-vote/current/authority

The bandwidth file spec contains sample data:
https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt#n450

We're working on version 1.2.0 of the format for sbws 1.0 in #28085. When sbws 1.0 is ready, we will update the spec with sample data from the latest sbws.

Thanks! I didn't look just yet, but this should be a good start to write some code.

comment:17 in reply to:  16 ; Changed 5 weeks ago by teor

Replying to karsten:

Replying just to the parts where I wouldn't reply "okay, cool!":

Replying to teor:

Replying to karsten:

In the future, we can switch to fetching only those files that are referenced from votes, unless for some reason we want to have non-referenced files, too.

How are bandwidth files referenced from votes?

I don't think we will implement "bandwidth-file-url" from https://trac.torproject.org/projects/tor/ticket/21378?replyto=14#comment:6

Tor 0.3.5? and later add bandwidth file headers to each vote, and we may add a bandwidth file hash in future. Once all authorities upgrade, you can fetch the bandwidth file if the vote contains headers.

I guess I was thinking of 0.3.5 then. I'm not aware of any other plans.

The ticket for putting the bandwidth file hash in the votes is #26698.

Will you use a hexadecimal hash when you archive the bandwidth files?
If so, maybe we should switch to a hexadecimal hash in the vote.
(I said base64 when I did the initial design, but consistency is more important than saving a few bytes.)

comment:18 in reply to:  17 ; Changed 5 weeks ago by karsten

Replying to teor:

Will you use a hexadecimal hash when you archive the bandwidth files?
If so, maybe we should switch to a hexadecimal hash in the vote.
(I said base64 when I did the initial design, but consistency is more important than saving a few bytes.)

Either works for us. We're converting base64 to hex and back in other places of the code, and it's fine to do that in this case, too.
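For reference, the conversion in both directions is short. This sketch assumes the base64 form may arrive without trailing "=" padding and that the hex form is wanted in upper case; both are assumptions, and the function names are made up.

```python
import base64
import binascii


def base64_to_hex(b64_digest: str) -> str:
    """Convert a base64 digest (padded or not) to upper-case hex."""
    padded = b64_digest + "=" * (-len(b64_digest) % 4)
    return binascii.hexlify(base64.b64decode(padded)).decode("ascii").upper()


def hex_to_base64(hex_digest: str, strip_padding: bool = True) -> str:
    """Convert a hex digest to base64, optionally dropping '=' padding."""
    encoded = base64.b64encode(binascii.unhexlify(hex_digest)).decode("ascii")
    return encoded.rstrip("=") if strip_padding else encoded
```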

comment:19 in reply to:  18 Changed 5 weeks ago by teor

Replying to karsten:

Replying to teor:

Will you use a hexadecimal hash when you archive the bandwidth files?
If so, maybe we should switch to a hexadecimal hash in the vote.
(I said base64 when I did the initial design, but consistency is more important than saving a few bytes.)

Either works for us. We're converting base64 to hex and back in other places of the code, and it's fine to do that in this case, too.

I think we might use hex for the sake of the humans looking for files on collector.

comment:20 Changed 5 weeks ago by starlight

Cc: starlight@… added

comment:21 Changed 5 weeks ago by irl

I've already started work on a prototype Python version of this CollecTor module. Can we use the same encoding for the digest that is used to reference vote, server descriptor or extra info descriptor digests? This will allow for better code reuse. We will convert things into whatever at the point they are read into CollecTor and convert them back when read out. Consistency within the spec is more important than consistency with code that I'm currently replacing.

Humans using CollecTor will typically use either stem or metrics-lib to interact with it. I'm also playing with the idea that CollecTor will itself implement a dir-spec server and pretend to be a directory cache that also serves historical descriptors. We might still have legacy things internally that we need to make conversions for but these should be things that we can eventually get rid of, not have to keep forever because that's how the spec is written.

This means that ideally, we would have hex encoded SHA1 and/or base64 encoded SHA256. (At least, in my opinion, I'm happy to hear counter arguments. Maybe I've not understood something.)

       "sha1-digest" is a hex-encoded digest (using upper-case characters) of
       the router's extra-info document, as signed in the router's extra-info
       (that is, not including the signature).  (If this field is absent, the
       router is not uploading a corresponding extra-info document.)

       "sha256-digest" is a base64-encoded SHA256 digest of the extra-info
       document, computed over the same data.

comment:22 Changed 4 weeks ago by juga

Following #26698, the latest changes are in https://github.com/torproject/tor/pull/492, which adds a sha1-digest (hex-encoded).

comment:23 Changed 4 weeks ago by irl

I'm not sure why human readable is a concern. I've been digging in this some more and CollecTor is currently not even consistent with upper-case vs. lower-case for SHA-1 hex digests. I'm also concerned about SHA-1 collisions that may become easier over time. I'd like to start using SHA-256 for references, especially where I'm using the reference to trust another document based on the signature over that hash. Currently looking at doing this for consensus->server descriptor and server descriptor->extra info relationships, I don't see any reason to not use SHA256 for bandwidth files too.

comment:24 Changed 4 weeks ago by juga

That's easy to change.
Now, if human readability is not a concern, then we might actually want to encode it in base64 instead of hexadecimal. My first PR would accept any digest algorithm (I put SHA-256 as the default) and encode in base64 by default (which we might still not want to be the case).

comment:25 Changed 4 weeks ago by irl

I think that a base64-encoded SHA256 digest of the file would be the thing to have.
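As a sketch of what that would mean for an archiver: hash the exact bytes of the file with SHA-256 and base64-encode the result. Whether the trailing "=" padding is kept is an assumption left open here, and the function name is hypothetical.

```python
import base64
import hashlib


def bandwidth_file_digest(file_bytes: bytes, strip_padding: bool = True) -> str:
    """Return the base64-encoded SHA-256 digest of a bandwidth file's bytes."""
    encoded = base64.b64encode(hashlib.sha256(file_bytes).digest()).decode("ascii")
    return encoded.rstrip("=") if strip_padding else encoded
```

The digest must be computed over the file exactly as served, before any parsing or re-encoding, or it will not match a hash published elsewhere.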

comment:26 in reply to:  25 Changed 4 weeks ago by teor

Replying to irl:

I've already started work on a prototype Python version of this CollecTor module. Can we use the same encoding for the digest that is used to reference vote, server descriptor or extra info descriptor digests? This will allow for better code reuse. We will convert things into whatever at the point they are read into CollecTor and convert them back when read out. Consistency within the spec is more important than consistency with code that I'm currently replacing.

Humans using CollecTor will typically use either stem or metrics-lib to interact with it.

I am a human who regularly uses CollecTor via the web interface (not stem or metrics-lib).

But if someone wrote a nice easy program or web interface for me to get a bandwidth file from a vote, I would use that instead.

Replying to juga:

Following #26698, the latest changes are in https://github.com/torproject/tor/pull/492, which adds a sha1-digest (hex-encoded).

We do not use broken hash algorithms in new code.

Replying to irl:

I think that a base64-encoded SHA256 digest of the file would be the thing to have.

Ok.

If we need to discuss the hash any more, let's do it on the hash ticket #26698.

comment:27 Changed 3 weeks ago by irl

Owner: changed from metrics-team to irl
Status: new → accepted

I'll make a patch for Collector's File Structure Protocol for storing these files. Once we merge that, I will throw this back to metrics-team (or straight to karsten) for implementation in CollecTor.
