Handle bad input more consistently in metrics code bases

added component::metrics metrics-2018 owner::metrics-team priority::medium severity::normal status::assigned type::enhancement labels

Some thoughts:

One step is unifying the parsing process by replacing all parsing code with metrics-lib provided parsing (which is already under way for CollecTor). This addresses goal number one in the description above.

Goal number two (of the bullet point list in the description above) is fine, too, as descriptors are separate data units and failure of parsing one should not influence parsing and storing of subsequent descriptors only because these happened to be stored in the same file temporarily.
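To illustrate the point about descriptors being separate data units, here is a minimal sketch of parsing each descriptor independently, so that one failure does not affect the others read from the same file. It is written in Python purely for illustration; `parse_all` and `toy_parse` are hypothetical names and not part of metrics-lib or CollecTor.

```python
from typing import Callable, Iterable, List, Tuple


def parse_all(
    raw_descriptors: Iterable[bytes],
    parse_one: Callable[[bytes], dict],
) -> Tuple[List[dict], List[bytes]]:
    """Parse each descriptor on its own and collect failures instead of aborting."""
    parsed: List[dict] = []
    failed: List[bytes] = []
    for raw in raw_descriptors:
        try:
            parsed.append(parse_one(raw))
        except ValueError:
            # A bad descriptor is set aside; the loop continues with the next one.
            failed.append(raw)
    return parsed, failed


def toy_parse(raw: bytes) -> dict:
    """Hypothetical stand-in parser: expects 'keyword value' on the first line."""
    first_line = raw.split(b"\n", 1)[0]
    parts = first_line.split(b" ", 1)
    if len(parts) != 2 or not parts[1]:
        raise ValueError("malformed first line")
    return {"type": parts[0].decode(), "value": parts[1].decode()}
```

With this shape, the caller decides what happens to the `failed` list (log, store elsewhere, or discard), independently of the descriptors that parsed fine.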

Regarding the second list: privacy and client expectation, i.e. topics 3. and 4., are the most important.

One way to combine storing-of-all-that-is-seen with privacy and client expectation, would be to store invalid descriptors separately. The separate location can also be public for relay descriptors and sanitized bridge descriptors, i.e., the public folders for download would be 'archive', 'relay', and 'substandard' (or some better name). All bridge descriptors that cannot be sanitized should be stored, too, but not yet be offered to the public.

Advantages:

privacy is ensured
clients can choose the quality of descriptors they're interested in
we'd get an overview of how many 'bad' descriptors show up every month and can analyze them
others can analyze the 'substandard' descriptors, too, or use them, if they choose to.
Given that descriptors are not supposed to be altered other than for privacy reasons, some could still be integrated into the 'normal' archives later, for example when more robust parsing is available.

Disadvantages:

implementation of the third storage (all over, i.e. for 'recent', 'out', and 'substandard'), but the implementation should be easy.
maintenance of third storage location.

Concerning already archived data there are two options:

leave them as they are
or re-parse and sort substandard historic descriptors into tarballs in the 'substandard' directory.

Replying to iwakeh:

Some thoughts:

One step is unifying the parsing process by replacing all parsing code with metrics-lib provided parsing (which is already under way for CollecTor).

Agreed.

This addresses goal number one in the description above.

Hmm, I'm not sure which goals you refer to. What I described above were different use cases, not goals. Nevertheless, unifying the parsing process seems worthwhile.

Goal number two (of the bullet point list in the description above) is fine, too, as descriptors are separate data units and failure of parsing one should not influence parsing and storing of subsequent descriptors only because these happened to be stored in the same file temporarily.

Agreed.

Regarding the second list: privacy and client expectation, i.e. topics 3. and 4., are the most important.

One way to combine storing-of-all-that-is-seen with privacy and client expectation, would be to store invalid descriptors separately. [...]

Hmmmm. Those are two big disadvantages there. :)

How about we do the following instead:

If we attempt to parse a relay descriptor in CollecTor (use cases 1 and 2) and cannot figure out the descriptor type, publication time, or digest, we append the raw bytes to a new local file per execution, say, bad/2016-11-15-10-23-55, and log a warning. The operator can then look at that file, possibly reconfigure or fix the parsing code, and put it back into some in/ subdirectory to have it parsed again.
If we attempt to parse a bridge descriptor in CollecTor (use case 3) and encounter anything that prevents us from sanitizing it, we print out a warning including the tarball file name. The operator can look at the tarball, get the parsing code fixed or extended, and remove the line from the parse history file, so that the file will be parsed again next time.
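A rough sketch of the relay descriptor case, assuming the parser hands back a dict of extracted fields: if type, publication time, or digest is missing, the raw bytes are appended to one file per execution and a warning is logged. This is Python for illustration only; `BadDescriptorSink`, `store_or_set_aside`, the field names, and the logger name are hypothetical and do not exist in CollecTor.

```python
import logging
import os
from datetime import datetime, timezone

logger = logging.getLogger("collector.sketch")  # hypothetical logger name

# Fields we must know before a descriptor can be stored (assumed key names).
REQUIRED = ("type", "published", "digest")


class BadDescriptorSink:
    """Appends unparseable raw descriptors to a single file per execution."""

    def __init__(self, bad_dir: str, started: datetime):
        os.makedirs(bad_dir, exist_ok=True)
        # File name pattern taken from the example: bad/2016-11-15-10-23-55
        self.path = os.path.join(bad_dir, started.strftime("%Y-%m-%d-%H-%M-%S"))

    def append(self, raw: bytes, reason: str) -> None:
        # Append mode, so all bad descriptors of this run end up in one file.
        with open(self.path, "ab") as f:
            f.write(raw)
        logger.warning(
            "Could not parse descriptor (%s); raw bytes appended to %s",
            reason, self.path)


def store_or_set_aside(fields: dict, raw: bytes, sink: BadDescriptorSink) -> bool:
    """Return True if the descriptor can be stored normally, False if set aside."""
    missing = [key for key in REQUIRED if not fields.get(key)]
    if missing:
        sink.append(raw, "missing " + ", ".join(missing))
        return False
    return True
```

The sink would be created once at startup, e.g. `BadDescriptorSink("bad", datetime.now(timezone.utc))`, so that the operator finds exactly one file per run to inspect. The bridge descriptor case is not covered here, since it only logs a warning with the tarball file name and relies on the operator editing the parse history file.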

Trac:
Keywords: N/A deleted, metrics-2018 added

Trac:
Owner: N/A to metrics-team
Status: new to assigned
