Opened 4 months ago

Last modified 5 weeks ago

#30196 needs_revision enhancement

Add the tor version to the sbws bandwidth file header

Reported by: teor
Owned by:
Priority: High
Milestone: sbws: 1.2.x-final
Component: Core Tor/sbws
Version:
Severity: Major
Keywords: sbws-roadmap-september
Cc: juga, irl
Actual Points:
Parent ID: #30255
Points: 1
Reviewer:
Sponsor:

Description

See #30184 for a situation where we wanted the tor version.

Change History (11)

comment:1 Changed 4 months ago by irl

I can see how in practice this is really useful, but this may not be the best field to add to the headers. Are we going to add fields for all of the versions of libraries on the system (e.g. OpenSSL), the operating system kernel, etc.? All of these fields could hold important information that helps us to interpret the results we see, but every change to the headers also means a change to parsers.

There is also an implicit assumption here that the bandwidth scanner uses tor at all. A future bandwidth scanner implementation may use stem.client or an alternative implementation.

For exit lists we are considering using a free-form text field with some suggested formats, e.g. something like "ExitScanner 55.5 using Tor 1.0.0 on Windows 10". If there is later something that affects the specific implementation we can adjust that string without a spec change to report library details.

For quick diagnosis of problems, it would be great if the bandwidth scanners added some contact information to the output. We are planning to add contact strings to exit scanners.

Whatever we do here, we should come up with good reasons why we are doing it, and then apply the decisions to both exit lists and bandwidth files if those reasons hold for both.

comment:2 Changed 4 months ago by irl

Cc: irl added

comment:3 Changed 4 months ago by irl

I had a discussion with juga on IRC. Here are some things we discussed:

  • The need for the inline contact information becomes evident when we look at how much trouble we've had tracking down who runs the default bridges in Tor Browser #30121
  • Inline metadata is better than out-of-band metadata because it is archived by Tor Metrics along with the bandwidth measurement data
  • The need for AS as well as country allows us to correlate BGP events with bandwidth changes. Ideally we can know both the AS of the scanner and the target. Tor Metrics already knows the AS of the relay.
  • In some cases, the target is on a CDN, which means the AS may not be known in advance. In these cases we might choose to omit the AS number and instead use a user-assigned country code for the target.
  • We could assign QM-QZ (user-assigned codes not used elsewhere in Tor to my knowledge) to CDN providers but each new provider used would require a spec change, and external researchers would end up using whatever value and not being able to interop their data with ours if we clash.
  • We could use the OO escape code and build codes like "OOFastly" which is probably the more maintainable solution, but would require that parsers do not panic when they see >2 characters (I know the standard is known as alpha-2 but it isn't actually limited to 2 characters!)
  • We could use Tor's geoip database to resolve the IP address resolved via the same circuit of the CDN into a country/IP address to add to the measurements, but this may prove to be unreliable especially in the longer term as IPv4 exhaustion causes more and more IP address swapping about.
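
To illustrate the "OOFastly" option from the list above: a parser could accept either a plain two-letter code or an OO-prefixed provider name, instead of failing on anything longer than two characters. This is only a sketch. The function name `parse_target_code` and the allowed characters in the provider suffix are assumptions, not part of any spec.

```python
import re

# Two uppercase letters: an ordinary ISO 3166-1 alpha-2 style code.
_ALPHA2 = re.compile(r"[A-Z]{2}")
# Hypothetical extended form: the "OO" escape code followed by a provider name.
_ESCAPED = re.compile(r"OO([A-Za-z0-9]+)")


def parse_target_code(code):
    """Return ("country", code) or ("provider", name); raise ValueError otherwise."""
    match = _ESCAPED.fullmatch(code)
    if match:
        return ("provider", match.group(1))
    if _ALPHA2.fullmatch(code):
        return ("country", code)
    raise ValueError("unrecognised target code: %r" % code)
```

With this shape, `parse_target_code("US")` is treated as a country and `parse_target_code("OOFastly")` as a provider, so adding a new CDN would not need a parser change.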

comment:4 in reply to:  1 Changed 4 months ago by teor

Replying to irl:

I can see how in practice this is really useful, but this may not be the best field to add to the headers. Are we going to add fields for all of the versions of libraries on the system (e.g. OpenSSL), the operating system kernel, etc.? All of these fields could hold important information that helps us to interpret the results we see

You're right: we need to define scope.

Here's the scope of the sbws data file:

  • questions we have asked bandwidth authority operators (or they have asked us)
  • data we wanted Torflow to provide, but it couldn't (or didn't)
  • questions that relay operators ask
  • factors that have a large impact on measurements

The scope you're suggesting is much larger, and potentially unlimited.

I suggest that we restrict ourselves to data that we need to know to run the network.

In this context:

  • we wanted to know the tor version to recommend tor upgrades to bandwidth authority operators (#30184)
  • we might be interested in OpenSSL and NSS versions in future, because they have both had different bugs that stop relays connecting to each other
  • we have asked authority operators about operating systems before
    • but I have never wanted to know kernel versions
  • I have never even wondered about the versions of any other libraries

Here's another possible rule:

  • if it's in Tor relay descriptors, then it's useful for administering the network
    • tor version
    • operating system

but every change to the headers also means a change to parsers.

Parsers which follow the spec MUST allow unknown headers:
https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt#n481

A well-written parser will present unknown headers (or all headers) in a flexible, dictionary-like structure.
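
A minimal sketch of such a dictionary-based parser follows. It assumes a simplified header layout (a timestamp line, then `key=value` lines, then a terminator made only of `=` characters). The function name `parse_header_lines` and all example values are made up for illustration, and `tor_version` is the key name used in the branch mentioned in this ticket.

```python
def parse_header_lines(lines):
    """Parse simplified bandwidth file header lines into (timestamp, headers)."""
    lines = iter(lines)
    timestamp = int(next(lines).strip())
    headers = {}
    for line in lines:
        line = line.strip()
        if line and set(line) == {"="}:
            break  # terminator reached; relay lines follow
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line; skip rather than fail
        headers[key] = value  # unknown keys are kept, not rejected
    return timestamp, headers


example = [
    "1556114414",
    "version=1.4.0",
    "software=sbws",
    "tor_version=0.4.0.5",       # hypothetical new header
    "some_future_key=whatever",  # unknown to this parser, still accepted
    "=====",
    "bw=1 node_id=$0000000000000000000000000000000000000000",
]
timestamp, headers = parse_header_lines(example)
```

Because every header lands in the same dictionary, adding a new key such as `tor_version` does not require any change to this parser.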

There is also an implicit assumption here that the bandwidth scanner uses tor at all. A future bandwidth scanner implementation may use stem.client or an alternative implementation.

Most headers are optional. We can specify that if there's no tor, then the tor version header SHOULD NOT be present.
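
As a rough sketch of that rule, a generator could simply leave the line out when it has no tor version to report. The function `build_header_lines` and the example values are hypothetical and are not taken from sbws.

```python
def build_header_lines(software, software_version, tor_version=None):
    """Return "key=value" header lines, leaving out tor_version when unknown."""
    pairs = [
        ("software", software),
        ("software_version", software_version),
    ]
    if tor_version is not None:
        # Only scanners that actually run tor report a tor version.
        pairs.append(("tor_version", tor_version))
    return ["%s=%s" % (key, value) for key, value in pairs]


with_tor = build_header_lines("sbws", "1.1.0", tor_version="0.4.0.5")
without_tor = build_header_lines("otherscanner", "0.1")
```

A scanner built on stem.client or another implementation would then produce a valid file with no `tor_version` line at all.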

For exit lists we are considering using a free-form text field with some suggested formats, e.g. something like "ExitScanner 55.5 using Tor 1.0.0 on Windows 10". If there is later something that affects the specific implementation we can adjust that string without a spec change to report library details.

Unstructured text is a nightmare to parse. Just look at the browser user-agent. Or the Tor ContactInfo field.

If we need something, let's specify it as a separate header.
If we're not sure, then let's leave it out.
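
The difference can be sketched with the example string quoted above. The structured key names (`software`, `software_version`, `tor_version`, `os`) and the regular expression are assumptions chosen for illustration only.

```python
import re

free_form = "ExitScanner 55.5 using Tor 1.0.0 on Windows 10"
structured = [
    "software=ExitScanner",
    "software_version=55.5",
    "tor_version=1.0.0",
    "os=Windows 10",
]

# Structured: one generic rule covers every field, including new ones.
fields = dict(line.split("=", 1) for line in structured)

# Free-form: a pattern tied to one exact phrasing.
pattern = re.compile(r"(\S+) (\S+) using Tor (\S+) on (.+)")
match = pattern.fullmatch(free_form)

# A small wording change defeats the pattern, so the data is lost.
reworded = "ExitScanner 55.5 (Tor 1.0.0, Windows 10)"
reworded_match = pattern.fullmatch(reworded)
```

Here `fields["tor_version"]` and `match.group(3)` both give the tor version, but `reworded_match` is `None`, while the structured form keeps working when fields are added or reordered.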

For quick diagnosis of problems, it would be great if the bandwidth scanners added some contact information to the output. We are planning to add contact strings to exit scanners.

Let's not specify another unstructured contact string, please.

Instead, if we need to email someone, let's specify an operator email address.

Whatever we do here, we should come up with good reasons why we are doing it, and then apply the decisions to both exit lists and bandwidth files if those reasons hold for both.

I agree.

comment:5 in reply to:  4 Changed 4 months ago by juga

Summarizing the metadata proposed:

Replying to irl:

  • The need for the inline contact information becomes evident when we look at how much trouble we've had tracking down who runs the default bridges in Tor Browser #30121

Do we need contact information if we have an identifier (see https://trac.torproject.org/projects/tor/ticket/30217#comment:2) and initially know who runs the scanners (usually the same dirauth)?

  • The need for AS as well as country allows us to correlate BGP events with bandwidth changes. Ideally we can know both the AS of the scanner and the target. Tor Metrics already knows the AS of the relay.

When we decided to include the country (#29299), I remember arma commenting on IRC that we should just add it to the configuration, to avoid making the code more complicated. I'm commenting more on this in #30229.

  • In some cases, the target is on a CDN, which means the AS may not be known in advance. In these cases we might choose to omit the AS number and instead use a user-assigned country code for the target.

Do we currently have more than one CDN? I only know of one, though it would be great to have more.

  • We could assign QM-QZ (user-assigned codes not used elsewhere in Tor to my knowledge) to CDN providers but each new provider used would require a spec change, and external researchers would end up using whatever value and not being able to interop their data with ours if we clash.
  • We could use the OO escape code and build codes like "OOFastly" which is probably the more maintainable solution, but would require that parsers do not panic when they see >2 characters (I know the standard is known as alpha-2 but it isn't actually limited to 2 characters!)
  • We could use Tor's geoip database to resolve the IP address resolved via the same circuit of the CDN into a country/IP address to add to the measurements, but this may prove to be unreliable especially in the longer term as IPv4 exhaustion causes more and more IP address swapping about.

I'm not convinced by any of the options for CDNs, but I can't think of another solution.

Replying to teor:

  • we wanted to know the tor version to recommend tor upgrades to bandwidth authority operators (#30184)

Easy.

  • we might be interested in OpenSSL and NSS versions in future, because they have both had different bugs that stop relays connecting to each other

Is there a way to ask Tor for this? GETINFO only returns the Tor version. Is it possible that Python's ssl.OPENSSL_VERSION might differ from the version used by the running Tor?
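
For reference, this is what Python itself can report. The `ssl` module describes the OpenSSL library loaded by the Python interpreter. Since tor is a separate process, possibly built against a different library, the value below should not be assumed to match tor's. The `ssl_version` key name is taken from the branch mentioned later in this ticket.

```python
import ssl

# Version string of the OpenSSL library that this Python interpreter loaded.
# This is not necessarily the library that the tor process is linked against.
python_openssl_version = ssl.OPENSSL_VERSION
# The same information as a tuple of integers, easier to compare.
python_openssl_version_info = ssl.OPENSSL_VERSION_INFO

print("ssl_version=%s" % python_openssl_version)
```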

  • we have asked authority operators about operating systems before

What, concretely, would we like to know about the operating system? See the options in https://docs.python.org/3/library/platform.html
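
A quick way to see what those options return on a given machine is sketched below. The dictionary keys are chosen only for this illustration and are not proposed header names.

```python
import platform

os_details = {
    "system": platform.system(),      # e.g. "Linux", "Darwin", "Windows"
    "release": platform.release(),    # kernel or OS release string
    "machine": platform.machine(),    # hardware architecture
    "platform": platform.platform(),  # single combined summary string
}
for key, value in os_details.items():
    print("%s=%s" % (key, value))
```

If only the operating system name is wanted, `platform.system()` alone might be enough, since the other values include kernel-level detail.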

comment:6 Changed 4 months ago by juga

Reminder to myself: I've implemented the operating system, ssl_version and tor_version in branch https://github.com/juga0/sbws/tree/ticket30196

comment:7 in reply to:  5 Changed 4 months ago by teor

Parent ID: #30255

I am a bit confused by this conversation about many different features in the same ticket.

Let's deal with each feature in a separate ticket?

I made them all children of #30255.

comment:8 Changed 4 months ago by teor

Replying to teor:

Replying to irl:

For exit lists we are considering using a free-form text field with some suggested formats, e.g. something like "ExitScanner 55.5 using Tor 1.0.0 on Windows 10". If there is later something that affects the specific implementation we can adjust that string without a spec change to report library details.

Unstructured text is a nightmare to parse. Just look at the browser user-agent. Or the Tor ContactInfo field.

If we need something, let's specify it as a separate header.
If we're not sure, then let's leave it out.

Replying to teor:

Replying to irl:

There is also an implicit assumption here that the bandwidth scanner uses tor at all. A future bandwidth scanner implementation may use stem.client or an alternative implementation.

Most headers are optional. We can specify that if there's no tor, then the tor version header SHOULD NOT be present.

Replying to juga:

Replying to teor:

  • we wanted to know the tor version to recommend tor upgrades to bandwidth authority operators (#30184)

easy.

comment:9 Changed 4 months ago by teor

Status: new → needs_revision

Replying to juga:

Reminder to myself: I've implemented the operating system, ssl_version and tor_version in branch https://github.com/juga0/sbws/tree/ticket30196

Here's my review of tor_version:

  • the code looks good
  • please refactor EXTRA_ARG_KEYVALUES and BANDWIDTH_HEADER_KEY_VALUES_INIT so you're not adding tor_version in two different places
  • please submit a bandwidth-file-spec patch, so the reviewer knows what your code should do
  • does the "chg: v3bwfile: !refactor, obtain headers from state" commit belong in a separate pull request? It's quite large.
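
One possible shape for the refactoring requested in the review above is sketched here. It keeps the two constant names from the review, but the list contents are made up and the real sbws lists will differ. The idea is to define the new key once and derive the second list from the first.

```python
# Hypothetical contents: the real sbws lists hold different keys.
# Keys whose values must be passed in when the header object is created.
BANDWIDTH_HEADER_KEY_VALUES_INIT = [
    "earliest_bandwidth",
    "generator_started",
    "tor_version",  # added in exactly one place
]

# Derived list: keys set by the header itself, plus everything above.
EXTRA_ARG_KEYVALUES = [
    "software",
    "software_version",
    "file_created",
] + BANDWIDTH_HEADER_KEY_VALUES_INIT
```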

Edit:

  • please submit each feature in a separate pull request
Last edited 4 months ago by teor

comment:10 Changed 2 months ago by teor

Priority: Medium → High
Severity: Normal → Major

We also want the tor version so we can work out if bandwidth measurements change when the tor version changes:
https://lists.torproject.org/pipermail/tor-scaling/2019-June/000048.html

comment:11 Changed 5 weeks ago by gaba

Keywords: sbws-roadmap-september added