Opened 5 years ago

Closed 5 years ago

#8049 closed defect (fixed)

Stem's DescriptorReader misses 10% of descriptors in tarballs

Reported by: karsten Owned by: atagar
Priority: Medium Milestone:
Component: Core Tor/Stem Version:
Severity: Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

When I parse server descriptors in a metrics tarball and in extracted form using DescriptorReader, I get different numbers of descriptors:

from stem.descriptor.reader import DescriptorReader

descriptors = 0
with DescriptorReader('server-descriptors-2012-12.tar') as reader:
    for descriptor in reader:
        descriptors += 1
print "%d descriptors in tarball." % (descriptors, )
descriptors = 0
with DescriptorReader('server-descriptors-2012-12/') as reader:
    for descriptor in reader:
        descriptors += 1
print "%d descriptors in extracted directory." % (descriptors, )

250048 descriptors in tarball.
279042 descriptors in extracted directory.

What happens to the rest?

This bug means that most of #7828 must be re-run. :(

Child Tickets

Change History (13)

comment:1 Changed 5 years ago by atagar

Status: newneeds_information

Hi Karsten. Please add a skip listener, as I did in #7828. Most likely the reader is finding something that seems malformed and is concluding that it isn't descriptor content. The skip listener will tell you the things that are skipped and why.

comment:2 Changed 5 years ago by atagar

Hmm, on more thought that does seem screwy though. I haven't a clue why it would differ over the exact same content. If you don't find anything interesting with the skip listener then please attach the tarball.

comment:3 in reply to:  2 Changed 5 years ago by rransom

Replying to atagar:

Hmm, on more thought that does seem screwy though. I haven't a clue why it would differ over the exact same content. If you don't find anything interesting with the skip listener then please attach the tarball.

Presumably the server-descriptors-2012-12.tar tarball he used is the same one that can be obtained using curl https://metrics.torproject.org/data/server-descriptors-2012-12.tar.bz2 |bzip2 -cd >server-descriptors-2012-12.tar .

comment:4 Changed 5 years ago by karsten

No luck with the skip listener, though I'm not 100% certain that I used it correctly. And rransom is right, this is the server descriptor tarball that's available on the metrics website.

comment:5 Changed 5 years ago by atagar

Status: needs_informationassigned

What a headache. I've been picking at this over the last couple days but investigating it is slow since running over this tarball takes roughly an hour on my tiny little netbook.

I'm able to repro this and got a 'fingerprint => file' mapping for the skipped files, but I'm not spotting anything interesting about them. For added fun, if I add a hack so the reader skips all of the files except those then it *does* read them.

I'm gonna next do a run that enumerates the filenames read by the tarfile module so we can at least figure out where they're going poof.

comment:6 Changed 5 years ago by karsten

Does extracting and recreating the tarball change anything? If so, maybe Stem/Python doesn't like how we create tarballs. I vaguely remember we had a similar problem with a Java library a few years ago, but I'm not sure.

comment:7 Changed 5 years ago by atagar

Was worth a shot but nay...

atagar@morrigan:~/Desktop/stem$ tar -cf server-descriptors-2012-12_recreated.tar server-descriptors-2012-12
atagar@morrigan:~/Desktop/stem$ python example.py
277790 descriptors in tarball.
279042 descriptors in extracted directory.

comment:8 Changed 5 years ago by atagar

Ah ha! After a week of banging my head on the wall I got it...

atagar@morrigan:~/Desktop/stem$ python example.py 
279042 descriptors in tarball.
279042 descriptors in extracted directory.

The difference is that I dropped a single descriptor:

server-descriptors-2012-12/5/6/564778a0a0aea0d6c6c0ee4bbce9c9233a0d70cd

Something is seriously screwy about its contact line. In vim it appears as...

Xcontact root@[ipaddress]

... but while parsing it causes an exception because it can't be converted to ASCII. When I tried (... repeatedly) to log output the logging exception with those characters corrupted the output (causing the log file to be truncated to 6k of binary, and peg my cpu when viewed in vim or less).

So... er, yea. Fun. The descriptor file is malformed (since the characters appear before the contact keyword) but stem certainly shouldn't choke this way on... whatever that is. I'll add this descriptor to our integ tests and make it fail more gracefully.

comment:9 Changed 5 years ago by karsten

Nice catch!

I wonder if I should make metrics-db/-lib stricter when it comes to malformed descriptors. Right now, they don't consider this descriptor malformed, because there are no malformed arguments in a known-keyword line. They consider "?contact" (with ? being those non-ASCII bytes) an unknown keyword, and unknown keywords are allowed. I guess I could require

KeywordChar ::= 'A' ... 'Z' | 'a' ... 'z' | '0' ... '9' | '-'

The result would be that this descriptor is malformed, not just that it contains an unrecognized line. Does that make sense? How will Stem handle this?

comment:10 Changed 5 years ago by atagar

How will Stem handle this?

If validation is enabled then Stem should raise a ValueError with...

Line contains invalid characters: <line>

... and simply skip the line if validation is disabled. Stem's presently failing before that with a UnicodeDecodeError, but that's a bug.

comment:11 Changed 5 years ago by atagar

Uggg, this ticket fills me with rage. I've tried a couple different approaches to repro this problem in our integ tests but no luck. Things that I've figured out are...

  • The troublesome descriptor parses just fine when alone or with a handful of other descriptors in a tarball.
  • If validation is turned off then things parse correctly...
atagar@morrigan:~/Desktop/stem$ python example.py 
READING TAR
279043 descriptors in tarball.
READING DIRECTORY
279043 descriptors in extracted directory.
  • The problem is that this descriptor somehow causes thousands of descriptors to go unread in the tarball as you observed, yet I haven't a clue why that is...
atagar@morrigan:~/Desktop/stem$ python example.py 
READING TAR
250048 descriptors in tarball.
READING DIRECTORY
279042 descriptors in extracted directory.

comment:12 Changed 5 years ago by atagar

... oh wow. That's a little embarrassing - I can't believe that it took me this long to spot this. We abort parsing archives when we encounter a non-descriptor. Changing the behavior to instead treat the tarball contents as individual files.

comment:13 Changed 5 years ago by atagar

Resolution: fixed
Status: assignedclosed

Fixed and retested with that tarball. I'm a little confused why the integ test I wrote earlier to repro this didn't surface this problem, but I've banged my head against this issue far too long to continue to care. :)

It now reads 279042 descriptors from the tarball as it does with the extracted version. Thanks for the catch!

https://gitweb.torproject.org/stem.git/commitdiff/6927e68d80bf78d898afe321a599bf174e116e40

Note: See TracTickets for help on using tickets.