The weird characters show up in the dirreq-v3-ips and dirreq-v3-reqs lines. They are also visible in a direct download, e.g. when retrieving extra/all directly from an authority.
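For reference, these are the affected lines as they normally look in an extra-info descriptor (format as in dir-spec; the country codes and counts below are made up). In the problematic descriptors, some of the two-letter country codes are replaced by non-ASCII bytes:

```
dirreq-v3-ips us=16,de=16,fr=8,gb=8
dirreq-v3-reqs us=24,de=16,fr=8,gb=8
```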
Huh, fine questions. Let's go through some possible ways of handling this:
1. Continue to reject descriptors containing problematic characters: Admittedly, this is not pretty, because nobody will know why a descriptor is missing, even if it's missing for a good reason. After all, descriptors should only contain ASCII, period. But, meh.
2. Replace problematic characters: I don't think we can do that, because it would break the signature. In general, I think we shouldn't touch descriptor contents.
3. Ignore problematic characters: We're accepting non-ASCII characters in other lines (contact and platform, for example), so we might as well accept them in lines containing country codes. The ugly part is that we'd have to distinguish DescriptorParseExceptions by severity, which is not really supported by metrics-lib yet.
4. Accept anything as long as the signature is valid: This is like the previous suggestion, but goes one step further and accepts even lines we think are otherwise malformed. We're not verifying signatures yet, so this could only be a longer-term solution. Basically, whatever the directory authorities accept is also what we accept, regardless of whether there's a bug in the authorities or not.
I don't yet know which variant I like most. Thoughts?
I think 3 and 4 are the way to go. I can look into point 3 first and see what implementation options there are.
Still, the non-ASCII descriptor parts should be reported in some way; basically, they don't conform to the protocol. And if the numbers are very high, either measures have to be taken to prevent them at their source or the protocol has to be officially changed. But of course that will be the very, very long-term solution.
Idea for an "accept non-ASCII characters" solution (follow-up of comment:3):
As Karsten pointed out above (comment:2), this would require changes in metrics-lib, but not necessarily in the direction of having to distinguish exception severity.
Another option would be to make use of the descriptor.parser and descriptor.reader properties and supply a different parser that accepts non-ASCII, let's call it LenientParser, as well as a LenientReader.
The necessary change in CollecTor would be to set the descriptor.parser and descriptor.reader properties to the LenientParser and LenientReader classes.
The necessary change in metrics-lib would be the addition of the LenientParser, which consists mostly of providing additional ParserHelper methods that accept non-ASCII and calling these in the appropriate places; most of the code will be the same as in the current, stricter implementation. A LenientReader would also have to be supplied.
That way we could switch between implementations.
Users of metrics-lib would also have another option.
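To make the switching idea concrete, here is a rough sketch of how it might look from a user's perspective, assuming the descriptor.parser and descriptor.reader properties are read by DescriptorSourceFactory as they are today, and using the (not yet existing) LenientParser and LenientReader class names purely as placeholders:

```java
import org.torproject.descriptor.DescriptorParser;
import org.torproject.descriptor.DescriptorReader;
import org.torproject.descriptor.DescriptorSourceFactory;

public class LenientParsingExample {
  public static void main(String[] args) {
    /* Hypothetical: point the factory at the lenient implementations
     * before creating parser and reader instances. */
    System.setProperty("descriptor.parser",
        "org.torproject.descriptor.impl.LenientParser");
    System.setProperty("descriptor.reader",
        "org.torproject.descriptor.impl.LenientReader");
    DescriptorParser parser = DescriptorSourceFactory.createDescriptorParser();
    DescriptorReader reader = DescriptorSourceFactory.createDescriptorReader();
    /* From here on, parser and reader would accept non-ASCII characters
     * in lines containing country codes instead of rejecting them. */
  }
}
```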
Another question that would need to be investigated: how will CollecTor clients deal with the additional non-compliant data?
This seems an odd question. CollecTor serves tarballs of the published descriptor data. If the authorities publish it then CollecTor should provide it, malformed or not.
Of course, clients should deal with the data as it is collected. Currently, they are 'shielded' from non-conformant extra-info descriptors, because CollecTor drops them. After the change some might trip over that newly available data. I intended to find out what the change would trigger, for example what additional work we'd have with clients like Onionoo etc.
There may be special cases where that statement doesn't hold, but in general, I agree that CollecTor should aim for providing all data that the directory authorities published.
Onionoo et al. shouldn't be affected, because they're using metrics-lib to parse descriptors, which shields them from malformed descriptors. Other clients not using metrics-lib might be affected, but those clients would also break when parsing Tor data directly, so I don't think we have to take special care there.
Regarding the LenientParser idea, I wonder whether we should just skip the metrics-lib check to see whether we can parse a descriptor before writing it to disk. See ArchiveWriter#store(). At that point we have already parsed all relevant fields that we need for storing the descriptor without using metrics-lib, and that check is only there to make sure that metrics-lib will be able to parse the descriptor later. But if we want to take that check out, which I think we should, then let's just change that code to print an informational log statement and store the file anyway. What do you think?
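A minimal sketch of what that relaxed check could look like, assuming an SLF4J-style logger and the parseDescriptors(byte[], String) method that throws DescriptorParseException; the surrounding store() code and the writeToDisk() helper are made up:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.torproject.descriptor.DescriptorParseException;
import org.torproject.descriptor.DescriptorParser;
import org.torproject.descriptor.DescriptorSourceFactory;

public class StoreSketch {

  private static final Logger logger = LoggerFactory.getLogger(StoreSketch.class);

  private final DescriptorParser descriptorParser =
      DescriptorSourceFactory.createDescriptorParser();

  /* Try to parse with metrics-lib, but only log an informational message
   * on failure instead of rejecting the descriptor; storing proceeds
   * either way. */
  void store(byte[] rawDescriptorBytes, String fileName) {
    try {
      this.descriptorParser.parseDescriptors(rawDescriptorBytes, fileName);
    } catch (DescriptorParseException e) {
      logger.info("Could not parse descriptor " + fileName
          + " with metrics-lib; storing it anyway.", e);
    }
    writeToDisk(rawDescriptorBytes, fileName);
  }

  private void writeToDisk(byte[] rawDescriptorBytes, String fileName) {
    /* Hypothetical placeholder for the actual file-writing logic. */
  }
}
```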
I agree, removing the metrics-lib check and informing known clients that do not use metrics-lib for parsing seems to be the simplest option here. It will decrease the number of missing descriptors immediately and is a 'minimally invasive' code change.
Summary:
CollecTor should not require parsability of descriptors - except for the fields necessary - when downloading/storing, and should only log a warning if metrics-lib complains. (Eventually CollecTor could verify the signature, but that's a different thing.)
1. Without adapting metrics-lib, these descriptors would trigger an exception during parsing in a metrics-lib-dependent client, which already handles parse exceptions in some way (probably by dropping the descriptor, as sketched below). Clients without metrics-lib might have additional problems now and should be warned before such a change is deployed.
2. Possible future option: When augmenting metrics-lib with a LenientParser, the decision has to be made for the various clients whether they also want to use that parser option. If the client keeps the strict parser, this will be like case 1. If the client uses the LenientParser, it might provide unreadable characters (didn't we have tickets about that in Onionoo somewhere?).
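To illustrate case 1, a rough sketch of how a metrics-lib-based client typically guards against unparseable descriptors today (the process() helper and the error handling are made up; real clients differ):

```java
import org.torproject.descriptor.Descriptor;
import org.torproject.descriptor.DescriptorParseException;
import org.torproject.descriptor.DescriptorParser;
import org.torproject.descriptor.DescriptorSourceFactory;

public class ClientSketch {

  private final DescriptorParser parser =
      DescriptorSourceFactory.createDescriptorParser();

  void handle(byte[] rawDescriptorBytes, String fileName) {
    try {
      for (Descriptor descriptor
          : parser.parseDescriptors(rawDescriptorBytes, fileName)) {
        process(descriptor);
      }
    } catch (DescriptorParseException e) {
      /* Typical current behavior: drop the whole descriptor. */
      System.err.println("Dropping unparseable descriptor " + fileName);
    }
  }

  private void process(Descriptor descriptor) {
    /* Hypothetical placeholder for client-side processing. */
  }
}
```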
Should I make this change, or would you want to do that?
Please do.
Let's postpone the LenientParser discussion, as I'd also want to make unrelated changes to the parser, including making it lazy for performance reasons. And we'd want to include signature verification at some point. But that's all for later.
That's fine.
Can you create a ticket for this, so we don't forget?
The patch doesn't correct the extra-info descriptor part here and below (afaict)?
logger should be accessed in a static way; currently all accesses to the logger are done like this.logger. Maybe just include this minor change in the patch?
Hmm, I don't yet see what you mean about the extra-info part. The only place where we're using metrics-lib's DescriptorParser to parse a byte array is ArchiveWriter#store(). The haveParsed... method called below the place you referenced only tells the downloader that we have parsed a descriptor and that it should attempt to fetch any descriptors referenced from it. And the ASCII conversion above is only there to find descriptor start and end, though we're computing the digest on the non-converted byte array. Can you give an example of a case where we're still rejecting a descriptor only because of metrics-lib?
You mean just the one logger call in the patch? Certainly! Pushed a fixup commit to the same branch. I'd also love to change the remaining calls, but I'm a bit worried that it'll make merging the other branches harder. But I can easily do that prior to the release when all branches are in. I'll add a note to the release ticket.
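For illustration only, this is the kind of change meant here, assuming an SLF4J-style logger (the class name is made up):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggerStyleExample {

  /* Logger declared as a static final field ... */
  private static final Logger logger =
      LoggerFactory.getLogger(LoggerStyleExample.class);

  public void store() {
    /* ... and accessed statically, instead of via this.logger. */
    logger.info("Stored descriptor despite parse warning.");
  }
}
```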
Should this go into 1.0.0? Assigning to the milestone just to make sure we don't forget about it, but feel free to move to another milestone if 1.0.0 won't work.
Not because of metrics-lib, but this ticket is mostly about those extra-info descriptors that are not stored because of ASCII conversion failure. The metrics-lib part is fine, but that's a different issue.
So, only half is fixed.
I'm still unclear what you're referring to. I believe we're now accepting those descriptors with non-ASCII country codes in extra-info descriptors that we rejected earlier. What case do you have in mind where we're not accepting descriptors because of ASCII conversion failures? Note that we need to perform some basic checks and computations to parse the descriptor publication time, find out the descriptor digest, etc., so that we can store it. Part of the requirements is an ASCII string at the beginning and the end of the descriptor that tells us what part to compute the digest of. If those are not ASCII, we cannot process the descriptor. And if anything in between is not ASCII, we shouldn't complain, well, after applying this patch.
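A rough sketch of the kind of check and digest computation meant here, for extra-info descriptors (boundary strings per dir-spec; the actual CollecTor code differs): only the start and end markers need to be ASCII so the digest range can be located, while the bytes in between are hashed as-is.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestRangeSketch {

  /* Computes the SHA-1 digest of an extra-info descriptor over the range
   * from "extra-info " up to and including "\nrouter-signature\n".
   * Decoding to US-ASCII maps each byte to exactly one char, so string
   * indices line up with byte offsets even if some bytes in between are
   * not ASCII. */
  public static byte[] digest(byte[] raw) throws NoSuchAlgorithmException {
    String ascii = new String(raw, StandardCharsets.US_ASCII);
    String startToken = "extra-info ";
    String endToken = "\nrouter-signature\n";
    int start = ascii.indexOf(startToken);
    int end = ascii.indexOf(endToken);
    if (start < 0 || end < 0) {
      /* Start or end marker not found or not ASCII: give up. */
      throw new IllegalArgumentException("Descriptor boundaries not found.");
    }
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    /* Digest is computed on the raw, non-converted bytes. */
    sha1.update(raw, start, end + endToken.length() - start);
    return sha1.digest();
  }
}
```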
Your change in here is fine and can go into 1.0.0.
It's just not sufficient to solve this ticket: the extra-info descriptor could be read, all information necessary for storing is ASCII and readable, just these few country codes are non-ASCII.
Or am I overlooking something?
Indeed, I overlooked something obvious and you're right. Sorry for the confusion; for some reason I thought a different class was responsible. I luckily still had the old log file where I could see which class logged.
So, ready for merge, all fine.
(Except that a test documenting the change is lacking, but that might be tricky to implement currently?)
Okay, cool, cherry-picked into my task-19720 branch that I'll merge into master as soon as #19720 is ready to be merged. (And yes, writing a test for this is indeed tricky. Not for 1.0.0, at least not if we're still planning to release today.)