Opened 3 years ago

Closed 3 years ago

#19170 closed defect (fixed)

make parsing more robust (extra-info)

Reported by: iwakeh Owned by: iwakeh
Priority: Medium Milestone: CollecTor 1.0.0
Component: Metrics/CollecTor Version:
Severity: Normal Keywords: ctip
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Almost all of the "Could not parse" warnings on the CollecTor mirror are caused by SweTor247.

It seems that some of the country codes use non-ascii characters. The code parsing this information is here.

Shouldn't the parser just replace the problematic characters by a default character and thus keep the readable information?

I'll add a stack trace when this happens again.

Child Tickets

Change History (26)

comment:1 Changed 3 years ago by iwakeh

The weird characters show up in dirreq-v3-ips and dirreq-v3-reqs. When looking at a direct download e.g. retrieving extra/all directly from an authority, these are visible, too.

comment:2 Changed 3 years ago by karsten

Huh, fine questions. Let's go through some possible ways of handling this:

  • Continue to reject descriptors containing problematic characters: Admittedly, this is not pretty, because nobody will know why a descriptor is missing, even if it's missing for a good reason. After all, descriptors should only contain ASCII, period. But, meh.
  • Replace problematic characters: I don't think we can do that, because that would break the signature. In general, I think we shouldn't touch descriptor contents.
  • Ignore problematic characters: We're accepting non-ASCII characters in other lines (contact and platform, for example), so we could as well accept them in lines containing country codes. The ugly part is that we'd have to distinguish DescriptorParseExceptions by severity, which is not really supported by metrics-lib yet.
  • Accept anything as long as the signature is valid: This is like the previous suggestion but going one step further and even accepting lines we think are otherwise malformed. We're not verifying signatures yet, so this could only be a longer-term solution. Basically, whatever the directory authorities accept is also what we accept, regardless of whether there's a bug in the authorities or not.

I don't yet know which variant I like most. Thoughts?

comment:3 Changed 3 years ago by iwakeh

Status: newaccepted

I think 3 and 4 are the way to go. I can look into point 3 first and see what implementation options there are.

Still the non-ASCII descriptor parts should be reported in some way. Basically, these don't conform to the protocol. And if the numbers are very high either some measures have to be taken to prevent these at their source or the protocol has to officially be changed. But, of course that will be the very, very long time solution.

comment:4 Changed 3 years ago by iwakeh

NB:
Conclusion from #18938: at some point in the future extra-info will only contain printable ASCII.

comment:5 Changed 3 years ago by atagar

This is a tor bug. Since it hasn't been mentioned here yet here's the tracking ticket about this relay...

https://trac.torproject.org/projects/tor/ticket/18656

It's a malformed descriptor (probably corruption with their geoip db). Stem considers these extrainfo documents to be invalid.

comment:6 Changed 3 years ago by iwakeh

Milestone: CollecTor 1.0.0

Added to first release milestone.

comment:7 Changed 3 years ago by iwakeh

Idea for an "accept non-ASCII characters" solution (follow-up of comment:3):

As Karsten pointed out above (comment:2) this would require changes in metrics-lib, but not necessarily in the direction of having to distinguish Exception severity.

Another option would be to make use of the descriptor.parser and descriptor.reader properties and supply a different non-ascci-accepting parser, let's call it LenientParser, as well as a LenientReader.

  • Necessary change in CollecTor would be to set the descriptor.parser and descriptor.reader properties to the LenientParser class.
  • Necessary change in metrics-lib would be the addition of the LenientParser, which consist mostly in providing additional ParserHelper methods that accept non-ascii and calling these in the appropriate places; most of the code will be the same as in the current, stricter implementation. Also a LenientReader would have to be supplied.

That way we could switch between implementations.
Users of metrics-lib would also have another option.

Another question that would need to be investigated: how will CollecTor clients deal with the additional non-compliant data?

comment:8 Changed 3 years ago by atagar

Another question that would need to be investigated: how will CollecTor clients deal with the additional non-compliant data?

This seems an odd question. CollecTor serves tarballs of the published descriptor data. If the authorities publish it then CollecTor should provide it, malformed or not.

comment:9 in reply to:  8 ; Changed 3 years ago by iwakeh

Replying to atagar:

Another question that would need to be investigated: how will CollecTor clients deal with the additional non-compliant data?

This seems an odd question. CollecTor serves tarballs of the published descriptor data. If the authorities publish it then CollecTor should provide it, malformed or not.

Of course, clients should deal with the data as it is collected. Currently, they are 'shielded' from non-conformant extra-info decriptors, b/c CollecTor drops them. After the change some might trip over that newly available data. I intended to find out what the change would trigger, for example what additional work we'd have with clients like Onionoo etc.

comment:10 Changed 3 years ago by iwakeh

Milestone: CollecTor 1.0.0

Removed milestone setting, b/c this topic might have to be addressed in metrics-lib first.
Leave the component setting unchanged for now.

comment:11 in reply to:  9 ; Changed 3 years ago by karsten

Replying to iwakeh:

Replying to atagar:

Another question that would need to be investigated: how will CollecTor clients deal with the additional non-compliant data?

This seems an odd question. CollecTor serves tarballs of the published descriptor data. If the authorities publish it then CollecTor should provide it, malformed or not.

There may be special cases where that statement doesn't hold, but in general, I agree that CollecTor should aim for providing all data that the directory authorities published.

Of course, clients should deal with the data as it is collected. Currently, they are 'shielded' from non-conformant extra-info decriptors, b/c CollecTor drops them. After the change some might trip over that newly available data. I intended to find out what the change would trigger, for example what additional work we'd have with clients like Onionoo etc.

Onionoo et al. shouldn't be affected, because they're using metrics-lib to parse descriptors which shields them from malformed descriptors. Other clients not using metrics-lib might be affected, but those clients would also break when parsing Tor data directly, so I don't think that we have to take special care there.

Regarding the LenientParser idea, I wonder whether we should just skip the metrics-lib check to see whether we can parse a descriptor before writing it to disk. See ArchiveWriter#store(). At that point we already parsed all relevant fields that we need for storing the descriptor without using metrics-lib, and that check is only there to make sure that metrics-lib will be able parse the descriptor later. But if we want to take that check out, which I think we should, then let's just change that code to print out an informational log statement and store the file anyway. What do you think?

comment:12 in reply to:  11 Changed 3 years ago by iwakeh

Replying to karsten:

...
Regarding the LenientParser idea, I wonder whether we should just skip the metrics-lib check to see whether we can parse a descriptor before writing it to disk. See ArchiveWriter#store(). At that point we already parsed all relevant fields that we need for storing the descriptor without using metrics-lib, and that check is only there to make sure that metrics-lib will be able parse the descriptor later. But if we want to take that check out, which I think we should, then let's just change that code to print out an informational log statement and store the file anyway. What do you think?

I agree, removing the metrics-lib check and informing known clients that do not use metrics-lib for parsing seems to be the simplest option here. It will decrease the number of missing descriptors immediately and is a 'minimal-invasive' code change.

Summary:

CollecTor should not require parsability - except for the fields necessary - of descriptors when downloading/storing and only logs a warning if metrics-lib complains. (Eventually CollecTor could verify the signature. But, that's a different thing.)

Without adapting metrics-lib these descriptors would trigger an exception during parsing in a metrics-lib dependent client, which handles parser exception already in some way (probably drop the descriptor). Clients without metrics-lib might have additional problems now and should be warned before such a change is deployed.

Possible future option: When augmenting metrics-lib with a LenientParser the decision has to be made for the various clients, if they also want to use that parser option. If the client keeps the strict parser this will be like case 1. If the client uses the LenientParser, it might provide unreadable characters (didn't we have tickets about that in Onionoo somewhere?).

comment:13 Changed 3 years ago by karsten

Alright, I sent a heads-up to tor-dev@.

Should I make this change, or would you want to do that?

Let's postpone the LenientParser discussion, as I'd also want to make unrelated changes to the parser, including making it lazy for performance reasons. And we'd want to include signature verification at some point. But that's all for later.

comment:14 in reply to:  13 ; Changed 3 years ago by iwakeh

Replying to karsten:

Alright, I sent a heads-up to tor-dev@.

Good, nice warning mail.

Should I make this change, or would you want to do that?

Please do.

Let's postpone the LenientParser discussion, as I'd also want to make unrelated changes to the parser, including making it lazy for performance reasons. And we'd want to include signature verification at some point. But that's all for later.

That's fine.

comment:15 in reply to:  14 Changed 3 years ago by karsten

Status: acceptedneeds_review

Replying to iwakeh:

Replying to karsten:

Alright, I sent a heads-up to tor-dev@.

Good, nice warning mail.

Thanks. :)

Should I make this change, or would you want to do that?

Please do.

Sure, please find my branch task-19170.

Let's postpone the LenientParser discussion, as I'd also want to make unrelated changes to the parser, including making it lazy for performance reasons. And we'd want to include signature verification at some point. But that's all for later.

That's fine.

Can you create a ticket for this, so we don't forget?

comment:16 Changed 3 years ago by iwakeh

The patch doesn't correct the extra-info descriptor part here and below (afaict)?

logger should be accessed in a static way; currently all access to logger are done like this.logger. Maybe, just include this minor change in the patch?

comment:17 Changed 3 years ago by iwakeh

LenientParser ticket #19842

comment:18 in reply to:  16 ; Changed 3 years ago by karsten

Replying to iwakeh:

The patch doesn't correct the extra-info descriptor part here and below (afaict)?

Hmm, I don't yet see what you mean. The only place where we're using metrics-lib's DescriptorParser to parse a byte array is ArchiveWriter#store(). The haveParsed... method called below the place you referenced only tells the downloader that we have parsed a descriptor and that it should attempt to fetch any descriptors referenced from it. And the ASCII conversion above is only there to find descriptor start and end, though we're computing the digest on the non-converted byte array. Can you give an example of a case where we're still rejecting a descriptor only because of metrics-lib?

logger should be accessed in a static way; currently all access to logger are done like this.logger. Maybe, just include this minor change in the patch?

You mean just the one call in the patch? Certainly! Pushed a fixup commit to the same branch. I'd also love to change the remaining calls, but I'm a bit worried that it'll make merging of the other branches harder. But I can easily do that prior to the release when all branches are in. I'll add a note to the release ticket.

comment:19 Changed 3 years ago by karsten

Milestone: CollecTor 1.0.0

Should this go into 1.0.0? Assigning to the milestone just to make sure we don't forget about it, but feel free to move to another milestone if 1.0.0 won't work.

comment:20 in reply to:  18 Changed 3 years ago by iwakeh

Status: needs_reviewneeds_revision

Replying to karsten:

Replying to iwakeh:

The patch doesn't correct the extra-info descriptor part here and below (afaict)?

Hmm, I don't yet see what you mean. The only place where we're using metrics-lib's DescriptorParser to parse a byte array is ArchiveWriter#store(). The haveParsed... method called below the place you referenced only tells the downloader that we have parsed a descriptor and that it should attempt to fetch any descriptors referenced from it. And the ASCII conversion above is only there to find descriptor start and end, though we're computing the digest on the non-converted byte array. Can you give an example of a case where we're still rejecting a descriptor only because of metrics-lib?

Not because of metrics-lib, but this ticket is mostly about these extra-info descriptors that are not stored, b/c of ascii conversion failure. The metrics-lib part is fine, but a different reason.
So, only half is fixed.

Can this be fixed for 1.0.0?

comment:21 Changed 3 years ago by karsten

I'm still unclear what you refer to. I believe we're now accepting those descriptors with non-ASCII country codes in extra-info descriptors that we rejected earlier. What case do you have in mind where we're not accepting descriptors because of ASCII conversion failures? Note that we need to perform some basic checks and computations to parse the descriptor publication time, find out the descriptor digest, etc., so that we can store it. Part of the requirements are an ASCII string at the beginning and the end of the descriptor that tells us what part to compute the digest of. If those are not ASCII, we cannot process the descriptor. And if anything in between is not ASCII, we shouldn't complain, well, after applying this patch.

comment:22 Changed 3 years ago by iwakeh

You're change in here is fine and can go into 1.0.0.

It's just not sufficiently solving this ticket. The extra-info could be read, all information necessary for storing is ascii and readable, just these few country codes are non-ascii.
Or, am I overlooking something?

The following will be skipped and not stored:

@type extra-info 1.0

extra-info SweTor247 4FF069CA2B2D1A76C7AB451EFB7117F29AEF750A
published 2016-05-22 16:43:55
write-history 2016-05-22 16:32:25 (900 s) 1357860864,1486376960,1279031296,1351328768,1743697920,2200178688,1234156544,1523289088,1551633408,1621464064,1254597632,1775264768,1689665536,1753520128,1103904768,1434341376,1150939136,967985152,895990784,1302690816,1300997120,947147776,1441388544,1407930368,1558607872,1063849984,916183040,1294087168,1107414016,1026692096,1144520704,1861045248,1282239488,1273192448,1158875136,1321010176,945436672,720959488,723689472,1121559552,1056056320,746829824,705452032,910871552,1741348864,809389056,817775616,1078247424,1004999680,1189643264,705234944,1284806656,1980861440,1514937344,838163456,1152607232,1203712000,1020697600,906333184,1218160640,1198831616,835339264,1586464768,2203707392,1559570432,1039450112,1134445568,1238086656,1253769216,1233560576,1691033600,1659950080,1356476416,1166150656,889988096,1218139136,1569614848,1237613568,1162537984,1312793600,1228486656,1187644416,1113488384,1428703232,1186930688,998151168,1278302208,1259377664,1092085760,1001015296,1191922688,1477314560,1337351168,1455647744,974535680,1528930304
read-history 2016-05-22 16:32:25 (900 s) 1332578304,1448282112,1254513664,1324864512,1714417664,2170276864,1203768320,1480467456,1523788800,1583406080,1221485568,1746613248,1660780544,1711060992,1076665344,1403919360,1125062656,931348480,868599808,1276554240,1274421248,923617280,1413637120,1372585984,1531207680,1032589312,898709504,1270477824,1081565184,1002877952,1121116160,1838573568,1259815936,1247204352,1134850048,1294400512,920386560,693218304,710129664,1097239552,1029796864,715806720,690363392,885669888,1722977280,782941184,789390336,1056364544,981941248,1157256192,674372608,1265626112,1953830912,1484488704,815069184,1126297600,1173146624,989871104,883805184,1192287232,1172829184,808004608,1563955200,2174199808,1531340800,1005859840,1106042880,1209826304,1215944704,1204684800,1657624576,1623773184,1321722880,1129592832,859763712,1191492608,1539031040,1200889856,1137287168,1283620864,1196333056,1147316224,1086354432,1395887104,1154003968,975105024,1249856512,1233748992,1061946368,968668160,1165745152,1449252864,1302113280,1418886144,941774848,1504149504
dirreq-write-history 2016-05-22 16:38:42 (900 s) 26721280,26181632,22713344,18153472,30959616,27491328,23710720,33908736,30176256,29278208,26059776,23006208,31754240,32760832,26696704,19438592,26868736,30183424,19349504,20831232,20793344,20324352,35196928,19519488,23652352,23080960,14424064,17801216,22862848,16696320,19476480,17430528,21992448,21087232,18184192,23530496,23248896,17436672,12994560,19902464,25379840,23402496,10625024,18203648,19986432,20084736,23708672,17466368,19121152,30393344,16679936,14265344,27670528,21859328,23498752,22908928,21556224,25786368,16757760,23324672,20441088,21089280,19623936,22488064,26152960,26180608,23634944,25591808,25244672,26862592,26347520,32198656,31865856,26145792,23252992,24669184,29142016,20024320,22441984,25981952,26183680,31656960,23510016,28493824,26350592,21965824,20175872,27035648,23478272,25636864,23432192,23070720,33317888,23294976,28601344,27546624
dirreq-read-history 2016-05-22 16:38:42 (900 s) 1080320,988160,893952,980992,803840,1124352,1157120,522240,451584,670720,1185792,701440,845824,804864,617472,487424,409600,909312,688128,741376,381952,664576,1137664,591872,783360,539648,562176,498688,319488,802816,352256,1174528,293888,785408,299008,1057792,389120,432128,291840,414720,657408,318464,351232,724992,330752,365568,648192,856064,930816,601088,514048,332800,381952,486400,753664,768000,588800,633856,472064,621568,693248,582656,687104,483328,508928,835584,675840,1278976,832512,596992,728064,1417216,1032192,706560,791552,623616,710656,568320,901120,917504,1327104,1090560,954368,1106944,1076224,946176,679936,1252352,1022976,1053696,1033216,536576,933888,888832,885760,1418240
geoip-db-digest 2ACCCC7A9A2449D86D5F0A7A3F80588028EE7102
geoip6-db-digest 4D916E7A0FB1B80C6DAFF63689E36AA0E50D8E06
dirreq-stats-end 2016-05-22 07:12:47 (86400 s)
dirreq-v3-ips us=176,ru=136,??=112,de=104,fr=96,gb=56,es=40,br=32,it=32,jp=32,ar=24,cn=24,pl=24,tw=24,ua=24,at=16,au=16,ca=16,ch=16,cz=16,in=16,mx=16,nl=16,ro=16,se=16,vn=16,]h=8,ae=8,al=8,ao=8,bd=8,be=8,bg=8,bo=8,by=8,cd=8,cl=8,co=8,dd=8,dk=8,do=8,du=8,dz=8,ec=8,ee=8,eg=8,fi=8,gd=8,gh=8,gr=8,hk=8,hn=8,hr=8,hu=8,id=8,il=8,is=8,i�=8,ke=8,kr=8,kw=8,kz=8,lb=8,lt=8,lu=8,lv=8,ma=8,md=8,mn=8,my=8,ng=8,no=8,nz=8,pe=8,ph=8,pr=8,pt=8,py=8,qa=8,re=8,rs=8,sd=8,sg=8,si=8,sk=8,st=8,sy=8,sz=8,th=8,tj=8,tn=8,tr=8,uy=8,uz=8,ve=8,za=8
dirreq-v3-reqs us=984,ru=280,??=240,de=200,fr=144,gb=120,jp=56,se=56,es=48,it=48,br=40,ca=40,ua=40,cn=32,nl=32,ro=32,ar=24,at=24,by=24,ch=24,fi=24,il=24,kr=24,kz=24,lv=24,mx=24,pl=24,tw=24,vn=24,au=16,be=16,cd=16,cz=16,ee=16,hk=16,id=16,in=16,md=16,ng=16,nz=16,sg=16,th=16,]h=8,ae=8,al=8,ao=8,bd=8,bg=8,bo=8,cl=8,co=8,dd=8,dk=8,do=8,du=8,dz=8,ec=8,eg=8,gd=8,gh=8,gr=8,hn=8,hr=8,hu=8,is=8,i�=8,ke=8,kw=8,lb=8,lt=8,lu=8,ma=8,mn=8,my=8,no=8,pe=8,ph=8,pr=8,pt=8,py=8,qa=8,re=8,rs=8,sd=8,si=8,sk=8,st=8,sy=8,sz=8,tj=8,tn=8,tr=8,uy=8,uz=8,ve=8,za=8
dirreq-v3-resp ok=2944,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=504,busy=0
dirreq-v3-direct-dl complete=0,timeout=0,running=0
dirreq-v3-tunneled-dl complete=2808,timeout=140,running=0,min=1966,d1=135428,d2=267335,q1=320462,d3=397691,d4=535243,md=661980,d6=798489,d7=877941,q3=940817,d8=1053538,d9=1474341,max=3124308
router-signature
-----BEGIN SIGNATURE-----
dsAqdP897XSqSBMwePHUTSRsCBAtPR42pQmrJTg44b7GIFyVj0MpAJNazOiNy97n
xm7aVTAGfaU4Q4sPDFXTpi/Ax4ubc1tdiOSiWxBwlLgbQA93rKd93IbHNN/pIw+r
+z1GuFb33/tCPFKWBPTrjvfnIAC77FZkML6tSiVoZEQ=
-----END SIGNATURE-----

comment:23 Changed 3 years ago by karsten

I believe that descriptor will not be skipped and stored. What makes you think otherwise?

comment:24 Changed 3 years ago by iwakeh

Status: needs_revisionmerge_ready

Indeed, I overlooked something obvious and you're right. Sorry for the confusion, for some reason I though a different class was responsible. I luckily still had the old log file were I could see which class logged.

So, ready for merge, all fine.
(Except that a test documenting the change is lacking, but that might be tricky to implement currently?)

comment:25 Changed 3 years ago by karsten

Okay, cool, cherry-picked into my task-19720 branch that I'll merge into master as soon as #19720 is ready to be merged. (And yes, writing a test for this is indeed tricky. Not for 1.0.0, at least not if we're still planning to release today.)

comment:26 Changed 3 years ago by karsten

Resolution: fixed
Status: merge_readyclosed

Merged together with the task-19720 branch. Closing. Thanks!

Note: See TracTickets for help on using tickets.