Opened 5 months ago

Closed 5 months ago

#30216 closed enhancement (fixed)

Add bandwidth file parser to metrics-lib

Reported by: irl Owned by: karsten
Priority: High Milestone:
Component: Metrics/Library Version:
Severity: Normal Keywords: tor-bwauth, tor-dirauth, metrics-roadmap-2019-q2
Cc: teor, metrics-team, starlight@…, tom Actual Points:
Parent ID: #21378 Points:
Reviewer: irl Sponsor:

Description

Ideally we can parse both 1.0.0 and 1.4.0 files, but I'm not sure that we have any of the versions in between that would ever end up in our archive.

The specification is at:

https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt

Child Tickets

Change History (13)

comment:1 Changed 5 months ago by irl

Type: defectenhancement

comment:2 Changed 5 months ago by karsten

Owner: changed from metrics-team to karsten
Status: newaccepted

comment:3 Changed 5 months ago by karsten

I just finished reading the spec. Should we start with a minimal parser that does some basic format verification and extracts only those fields that are necessary for storing the document? We could then parse the remaining fields in a later metrics-lib patch. What do you think?

comment:4 Changed 5 months ago by irl

That sounds fine, but we need to figure out what those fields are.

Likely we can ignore everything after the terminator for now, and just do the headers?

If we wanted to do even less than that then this will block on #30217.

comment:5 in reply to:  description Changed 5 months ago by teor

Replying to irl:

Ideally we can parse both 1.0.0 and 1.4.0 files, but I'm not sure that we have any of the versions in between that would ever end up in our archive.

Most bandwidth authority operators keep bandwidth file archives on disk. You might see some 1.1.0, 1.2.0, or 1.3.0 files in longclaw's and bastet's archives.

But a well-written 1.4.0 parser should be able to handle all the previous versions. The fields that were added after 1.0.0 are all marked as optional, or marked with the version they were introduced in. (Hopefully both, but let us know if we've made a mistake.)

comment:6 Changed 5 months ago by karsten

Priority: MediumHigh
Status: acceptedneeds_review

Okay, it turned out to be faster to just parse all fields than to figure out what fields we can add in a later step.

Please review commit 75f50f3 in my task-30216 branch.

comment:7 Changed 5 months ago by irl

Status: needs_reviewneeds_revision
      In version 1.0.0, Header List ends when the first relay bandwidth
      is found conforming to the next section.

but

switch (spaceSeparatedLineParts.length) {

is used to determine if you've reached the end of the headers. This should instead search the line for "bw=" as future header lines may contain spaces.

      Additional header Lines MUST NOT use any keywords specified in the
      relay measurements format.

To be safer, it should be looking for either start of line or space preceeding bw=. In PCRE, (^| )bw=. This test should also be disabled if we've already seen a version declared that expects a terminator.

When we are parsing country lists, ISO 2-alpha *can* be more than 2 characters, and we're looking at using that functionality for describing CDNs in an extensible way in #30196. Codes will only be longer than 2 characters if they start "OO".

Tests and checks are OK.

comment:8 in reply to:  7 ; Changed 5 months ago by karsten

Status: needs_revisionneeds_review

Thanks for the review!

Replying to irl:

      In version 1.0.0, Header List ends when the first relay bandwidth
      is found conforming to the next section.

but

switch (spaceSeparatedLineParts.length) {

is used to determine if you've reached the end of the headers. This should instead search the line for "bw=" as future header lines may contain spaces.

      Additional header Lines MUST NOT use any keywords specified in the
      relay measurements format.

To be safer, it should be looking for either start of line or space preceeding bw=. In PCRE, (^| )bw=. This test should also be disabled if we've already seen a version declared that expects a terminator.

Right, I thought about this, too, and you have a point that bandwidth-file-spec does not explicitly state whether future header lines can ever contain spaces or not. However, dir-spec implicitly rules that out.

The only place where future header lines could contain a space is between two or more KeyValue elements, which would basically be two or more header lines combined. But why would anybody want to introduce such lines if they can as well have two or more separate header lines?

How about we ask the bandwidth-file-spec authors to clarify whether this is planned in the future? If there are no such plans, that is, header lines with 1.x versions can never have spaces, and this is stated explicitly in the spec, I'd like to keep this simple and efficient parser implementation. Otherwise we can add a check like you suggested, which certainly makes the parser more complex, but which would address this case.

When we are parsing country lists, ISO 2-alpha *can* be more than 2 characters, and we're looking at using that functionality for describing CDNs in an extensible way in #30196. Codes will only be longer than 2 characters if they start "OO".

Hmm, bandwidth-file-spec explicitly says that country codes are exactly two characters long.

It's certainly easy to change the regex here, it's just not following the spec.

In fact, I'm not even sure if the authors can change the spec here for 1.x versions, as a correct parser that can parse a valid 1.4 version with exactly two country code characters wouldn't parse a 1.5 version with more characters. This change might require new CountryCodeExt and CountryCodeExtList nonterminals in newly defined header lines.

Tests and checks are OK.

Great! My suggestion would be to merge and release this code as a valid 1.0 to 1.4 parser, ask spec authors for clarifications/amendments, and possibly release any changes in the next metrics-lib version. It would be good to unblock further code to archive bandwidth files.

What do you think?

comment:9 in reply to:  8 Changed 5 months ago by karsten

Status: needs_reviewneeds_revision

Replying to karsten:

How about we ask the bandwidth-file-spec authors to clarify whether this is planned in the future? If there are no such plans, that is, header lines with 1.x versions can never have spaces, and this is stated explicitly in the spec, I'd like to keep this simple and efficient parser implementation. Otherwise we can add a check like you suggested, which certainly makes the parser more complex, but which would address this case.

Or here's another suggestion: We make the change you suggested, ask the spec authors to clarify, and then hopefully simplify the parser again. I'll work on a patch later today.

comment:10 Changed 5 months ago by karsten

Status: needs_revisionneeds_review

comment:11 Changed 5 months ago by irl

Reviewer: irl

comment:12 Changed 5 months ago by irl

Status: needs_reviewmerge_ready

Looks good to me. Checks and tests OK.

comment:13 Changed 5 months ago by karsten

Resolution: fixed
Status: merge_readyclosed

Great, thanks! Merged. Opening a new ticket for releasing metrics-lib 2.6.0.

Note: See TracTickets for help on using tickets.