Opened 3 years ago

Last modified 21 months ago

#20548 assigned enhancement

Handle bad input more consistently in metrics code bases

Reported by: karsten
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics
Version:
Severity: Normal
Keywords: metrics-2018
Cc: iwakeh
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

We started thinking about handling bad input while sanitizing bridge descriptors in CollecTor (#19834) and while reading relay descriptors in Onionoo (#20412). But before we implement any changes, we should generalize our strategies for handling bad input to avoid solving the same problem over and over. The result of this discussion can also serve as a guide for future code.

What we're not covering here (but what we should think about anyway) is how we're handling issues during processing that are not directly related to bad input, like problems with sanitizing bridge IP addresses.

So, we can distinguish a couple of use cases where we're handling descriptors as input:

  1. CollecTor downloads descriptors from the directory authorities, TorDNSEL, or Torperf, or processes previously downloaded descriptors.
  2. CollecTor synchronizes descriptors from other CollecTor instances.
  3. CollecTor reads previously uploaded bridge descriptors and produces sanitized versions of them.
  4. Metrics, Onionoo, ExoneraTor, and other applications download descriptors from CollecTor and use them locally.

The requirements on input data are quite different for these four use cases. Let's go through them:

  1. As of a few weeks ago we're storing and serving descriptors even if metrics-lib cannot parse them. The idea is that CollecTor shouldn't decide what goes into the archive, but the directory authorities (or TorDNSEL and Torperf) should. As long as we can detect the descriptor type, extract the publication date, and possibly calculate the digest, we can accept a descriptor.
  2. The requirements are pretty much the same as for 1.
  3. We need to be very picky about bridge descriptors, in particular about unknown parts, because those might contain sensitive information that we'd rather not copy over to sanitized bridge descriptors.
  4. Most applications would want to skip bad descriptors and not bother much about it. As of a few days ago, they all use metrics-lib for parsing descriptors.
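The differing strictness requirements of these four use cases could be made explicit as a per-use-case policy. The following enum is purely illustrative; the type and constant names are made up for this sketch and do not exist in any of our code bases:

```java
/* Hypothetical sketch: one policy constant per use case listed above. */
enum BadInputPolicy {

  /* Use cases 1 and 2: accept a descriptor as long as its type,
   * publication time, and (if needed) digest can be determined. */
  ARCHIVE_IF_IDENTIFIABLE,

  /* Use case 3: reject any bridge descriptor containing unknown parts,
   * because those might carry sensitive information that must not end
   * up in sanitized bridge descriptors. */
  REJECT_UNKNOWN_PARTS,

  /* Use case 4: skip bad descriptors and continue with the rest. */
  SKIP_AND_CONTINUE
}
```

Making the policy an explicit value (rather than implicit behavior scattered across parsers) would let one shared parsing pipeline serve all four use cases.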

So much for the differences. Let's also list the commonalities, or possible goals for common behavior:

  • Regardless of the chosen strategy, we should apply it to all variants of descriptor badness. This may sound obvious, but we're currently not doing this. For example, if we encounter an invalid "a" line in a bridge descriptor, we skip that line, whereas an invalid "r" line makes us skip the entire descriptor. This is part of what we're trying to streamline in #19834, even though this case is not explicitly listed there.
  • Whenever we encounter an error while processing a descriptor, we should attempt to recover and continue with subsequent good descriptors. The reason is that descriptors in a common source can come from different original sources, and we cannot blame valid descriptors for following a faulty one. This was the cause of #20412.

This is just the start. Let's add more thoughts to this ticket and assemble a guide that we can apply to existing tickets and future code changes.

Child Tickets

Ticket   Type         Status    Owner         Summary
#19834   enhancement  assigned  metrics-team  Rethink how we handle issues while sanitizing bridge descriptors
#20412   defect       closed    metrics-team  Skip bad archived descriptors rather than aborting the entire import
#20421   defect       assigned  metrics-team  Investigate invalid descriptors in out/ and recent/ subdirectories

Change History (4)

comment:1 Changed 3 years ago by iwakeh

Some thoughts:

One step is unifying the parsing process by replacing all parsing code with metrics-lib provided parsing (which is already under way for CollecTor). This addresses goal number one in the description above.

Goal number two (of the bullet point list in the description above) is fine, too, as descriptors are separate data units and failure of parsing one should not influence parsing and storing of subsequent descriptors only because these happened to be stored in the same file temporarily.

Regarding the second list: privacy and client expectation, i.e. topics 3. and 4., are the most important.

One way to combine storing-of-all-that-is-seen with privacy and client expectation would be to store invalid descriptors separately. The separate location can also be public for relay descriptors and sanitized bridge descriptors, i.e., public folders for download would be 'archive', 'relay', and 'substandard' (or some better name). All bridge descriptors that cannot be sanitized should be stored too, but not yet be offered to the public.

Advantages:

  • privacy is ensured
  • clients can choose the quality of descriptors they're interested in
  • we'd get an overview of how many 'bad' descriptors show up every month and can analyze them
  • others can analyze the 'substandard' descriptors, too, or use them if they choose to.
  • Given that descriptors are not supposed to be altered other than for privacy reasons, some could still be integrated into the 'normal' archives later, for example when more robust parsing is available.

Disadvantages:

  • implementation of the third storage all over, i.e. for 'recent', 'out', and 'substandard', but the implementation should be easy.
  • maintenance of the third storage location.

Concerning already archived data there are two options:

  • leave them as they are
  • or re-parse and sort substandard historic descriptors into tarballs in the 'substandard' directory.

comment:2 in reply to:  1 Changed 3 years ago by karsten

Replying to iwakeh:

Some thoughts:

One step is unifying the parsing process by replacing all parsing code with metrics-lib provided parsing (which is already under way for CollecTor).

Agreed.

This addresses goal number one in the description above.

Hmm, I'm not sure which goals you refer to. What I described above were different use cases, not goals. Nevertheless, unifying the parsing process seems worthwhile.

Goal number two (of the bullet point list in the description above) is fine, too, as descriptors are separate data units and failure of parsing one should not influence parsing and storing of subsequent descriptors only because these happened to be stored in the same file temporarily.

Agreed.

Regarding the second list: privacy and client expectation, i.e. topics 3. and 4., are the most important.

One way to combine storing-of-all-that-is-seen with privacy and client expectation, would be to store invalid descriptors separately. [...]

Hmmmm. Those are two big disadvantages there. :)

How about we do the following instead:

  • If we attempt to parse a relay descriptor in CollecTor (use cases 1 and 2) and cannot figure out the descriptor type, publication time, or digest, we append the raw bytes to a new local file per execution, say, bad/2016-11-15-10-23-55, and log a warning. The operator can then look at that file, possibly reconfigure or fix the parsing code, and put it back in some in/ subdirectory to be parsed again.
  • If we attempt to parse a bridge descriptor in CollecTor (use case 3) and encounter anything that prevents us from sanitizing it, we print out a warning including the tarball file name. The operator can look at the tarball, get the parsing code fixed or extended, and remove the line from the parse history file, so that the file will be parsed again next time.
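The first bullet could be sketched as a small sink for unparseable raw bytes. This is a hypothetical illustration, not actual CollecTor code: the class name, constructor, and bad/ directory layout are assumptions taken from the comment above, and the timestamped file name follows the bad/2016-11-15-10-23-55 example:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/* Hypothetical sink that appends raw bytes of unparseable relay
 * descriptors to one file per execution, named after the start time. */
public class BadDescriptorSink {

  private final Path badFile;

  public BadDescriptorSink(Path badDir, LocalDateTime executionStart)
      throws IOException {
    Files.createDirectories(badDir);
    this.badFile = badDir.resolve(executionStart.format(
        DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss")));
  }

  /* Append the raw descriptor bytes and warn the operator, who can
   * later inspect the file and feed it back into an in/ subdirectory. */
  public void store(byte[] rawDescriptorBytes) throws IOException {
    Files.write(this.badFile, rawDescriptorBytes,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    System.err.println("Warning: stored unparseable descriptor in "
        + this.badFile);
  }
}
```

Using one append-only file per execution keeps the failure path trivial: no parsing, no renaming, just raw bytes plus a warning in the log.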

comment:3 Changed 21 months ago by karsten

Keywords: metrics-2018 added

comment:4 Changed 21 months ago by karsten

Owner: set to metrics-team
Status: new → assigned