Stem's DescriptorReader misses 10% of descriptors in tarballs

added component::archived/stem owner::atagar priority::medium resolution::fixed status::closed type::defect labels

Hi Karsten. Please add a skip listener, as I did in #7828 (closed). Most likely the reader is finding something that seems malformed and is concluding that it isn't descriptor content. The skip listener will tell you the things that are skipped and why.

Trac:
Status: new to needs_information

Hmm, on more thought that does seem screwy though. I haven't a clue why it would differ over the exact same content. If you don't find anything interesting with the skip listener then please attach the tarball.

Replying to atagar:

Hmm, on more thought that does seem screwy though. I haven't a clue why it would differ over the exact same content. If you don't find anything interesting with the skip listener then please attach the tarball.

Presumably the server-descriptors-2012-12.tar tarball he used is the same one that can be obtained using curl https://metrics.torproject.org/data/server-descriptors-2012-12.tar.bz2 |bzip2 -cd >server-descriptors-2012-12.tar .

No luck with the skip listener, though I'm not 100% certain that I used it correctly. And rransom is right, this is the server descriptor tarball that's available on the metrics website.

What a headache. I've been picking at this over the last couple days but investigating it is slow since running over this tarball takes roughly an hour on my tiny little netbook.

I'm able to repro this and got a 'fingerprint => file' mapping for the skipped files, but I'm not spotting anything interesting about them. For added fun, if I add a hack so the reader skips all of the files except those then it does read them.

I'm gonna next do a run that enumerates the filenames read by the tarfile module so we can at least figure out where they're going poof.

Trac:
Status: needs_information to assigned

Does extracting and recreating the tarball change anything? If so, maybe Stem/Python doesn't like how we create tarballs. I vaguely remember we had a similar problem with a Java library a few years ago, but I'm not sure.

Was worth a shot but nay...

atagar@morrigan:~/Desktop/stem$ tar -cf server-descriptors-2012-12_recreated.tar server-descriptors-2012-12
atagar@morrigan:~/Desktop/stem$ python example.py
277790 descriptors in tarball.
279042 descriptors in extracted directory.

Ah ha! After a week of banging my head on the wall I got it...

atagar@morrigan:~/Desktop/stem$ python example.py 
279042 descriptors in tarball.
279042 descriptors in extracted directory.

The difference is that I dropped a single descriptor:

server-descriptors-2012-12/5/6/564778a0a0aea0d6c6c0ee4bbce9c9233a0d70cd

Something is seriously screwy about its contact line. In vim it appears as...

^Tþ^Xcontact root@[ipaddress]

... but while parsing it causes an exception because it can't be converted to ASCII. When I tried (... repeatedly) to log output the logging exception with those characters corrupted the output (causing the log file to be truncated to 6k of binary, and peg my cpu when viewed in vim or less).

So... er, yea. Fun. The descriptor file is malformed (since the characters appear before the contact keyword) but stem certainly shouldn't choke this way on... whatever that is. I'll add this descriptor to our integ tests and make it fail more gracefully.

Nice catch!

I wonder if I should make metrics-db/-lib stricter when it comes to malformed descriptors. Right now, they don't consider this descriptor malformed, because there are no malformed arguments in a known-keyword line. They consider "?contact" (with ? being those non-ASCII bytes) an unknown keyword, and unknown keywords are allowed. I guess I could require

KeywordChar ::= 'A' ... 'Z' | 'a' ... 'z' | '0' ... '9' | '-'

The result would be that this descriptor is malformed, not just that it contains an unrecognized line. Does that make sense? How will Stem handle this?

How will Stem handle this?

If validation is enabled then Stem should raise a ValueError with...

Line contains invalid characters:

... and simply skip the line if validation is disabled. Stem's presently failing before that with a UnicodeDecodeError, but that's a bug.

Uggg, this ticket fills me with rage. I've tried a couple different approaches to repro this problem in our integ tests but no luck. Things that I've figured out are...

The troublesome descriptor parses just fine when alone or with a handful of other descriptors in a tarball.
If validation is turned off then things parse correctly...

atagar@morrigan:~/Desktop/stem$ python example.py 
READING TAR
279043 descriptors in tarball.
READING DIRECTORY
279043 descriptors in extracted directory.

The problem is that this descriptor somehow causes thousands of descriptors to go unread in the tarball as you observed, yet I haven't a clue why that is...

atagar@morrigan:~/Desktop/stem$ python example.py 
READING TAR
250048 descriptors in tarball.
READING DIRECTORY
279042 descriptors in extracted directory.

... oh wow. That's a little embarrassing - I can't believe that it took me this long to spot this. We abort parsing archives when we encounter a non-descriptor. Changing the behavior to instead treat the tarball contents as individual files.

Fixed and retested with that tarball. I'm a little confused why the integ test I wrote earlier to repro this didn't surface this problem, but I've banged my head against this issue far too long to continue to care. :)

It now reads 279042 descriptors from the tarball as it does with the extracted version. Thanks for the catch!

https://gitweb.torproject.org/stem.git/commitdiff/6927e68d80bf78d898afe321a599bf174e116e40

Trac:
Status: assigned to closed
Resolution: N/A to fixed

closed

mentioned in issue #8798 (moved)

Stem's DescriptorReader misses 10% of descriptors in tarballs

Child items ...

Activity