When I parse server descriptors in a metrics tarball and in extracted form using DescriptorReader, I get different numbers of descriptors:
from stem.descriptor.reader import DescriptorReaderdescriptors = 0with DescriptorReader('server-descriptors-2012-12.tar') as reader: for descriptor in reader: descriptors += 1print "%d descriptors in tarball." % (descriptors, )descriptors = 0with DescriptorReader('server-descriptors-2012-12/') as reader: for descriptor in reader: descriptors += 1print "%d descriptors in extracted directory." % (descriptors, )
250048 descriptors in tarball.
279042 descriptors in extracted directory.
What happens to the rest?
This bug means that most of #7828 (closed) must be re-run. :(
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
...
Show closed items
Linked items
0
Link issues together to show that they're related.
Learn more.
Hi Karsten. Please add a skip listener, as I did in #7828 (closed). Most likely the reader is finding something that seems malformed and is concluding that it isn't descriptor content. The skip listener will tell you the things that are skipped and why.
Hmm, on more thought that does seem screwy though. I haven't a clue why it would differ over the exact same content. If you don't find anything interesting with the skip listener then please attach the tarball.
Hmm, on more thought that does seem screwy though. I haven't a clue why it would differ over the exact same content. If you don't find anything interesting with the skip listener then please attach the tarball.
Presumably the server-descriptors-2012-12.tar tarball he used is the same one that can be obtained using curl https://metrics.torproject.org/data/server-descriptors-2012-12.tar.bz2 |bzip2 -cd >server-descriptors-2012-12.tar .
No luck with the skip listener, though I'm not 100% certain that I used it correctly. And rransom is right, this is the server descriptor tarball that's available on the metrics website.
What a headache. I've been picking at this over the last couple days but investigating it is slow since running over this tarball takes roughly an hour on my tiny little netbook.
I'm able to repro this and got a 'fingerprint => file' mapping for the skipped files, but I'm not spotting anything interesting about them. For added fun, if I add a hack so the reader skips all of the files except those then it does read them.
I'm gonna next do a run that enumerates the filenames read by the tarfile module so we can at least figure out where they're going poof.
Does extracting and recreating the tarball change anything? If so, maybe Stem/Python doesn't like how we create tarballs. I vaguely remember we had a similar problem with a Java library a few years ago, but I'm not sure.
atagar@morrigan:~/Desktop/stem$ tar -cf server-descriptors-2012-12_recreated.tar server-descriptors-2012-12atagar@morrigan:~/Desktop/stem$ python example.py277790 descriptors in tarball.279042 descriptors in extracted directory.
Something is seriously screwy about its contact line. In vim it appears as...
^Tþ^Xcontact root@[ipaddress]
... but while parsing it causes an exception because it can't be converted to ASCII. When I tried (... repeatedly) to log output the logging exception with those characters corrupted the output (causing the log file to be truncated to 6k of binary, and peg my cpu when viewed in vim or less).
So... er, yea. Fun. The descriptor file is malformed (since the characters appear before the contact keyword) but stem certainly shouldn't choke this way on... whatever that is. I'll add this descriptor to our integ tests and make it fail more gracefully.
I wonder if I should make metrics-db/-lib stricter when it comes to malformed descriptors. Right now, they don't consider this descriptor malformed, because there are no malformed arguments in a known-keyword line. They consider "?contact" (with ? being those non-ASCII bytes) an unknown keyword, and unknown keywords are allowed. I guess I could require
The result would be that this descriptor is malformed, not just that it contains an unrecognized line. Does that make sense? How will Stem handle this?
Uggg, this ticket fills me with rage. I've tried a couple different approaches to repro this problem in our integ tests but no luck. Things that I've figured out are...
The troublesome descriptor parses just fine when alone or with a handful of other descriptors in a tarball.
If validation is turned off then things parse correctly...
atagar@morrigan:~/Desktop/stem$ python example.py READING TAR279043 descriptors in tarball.READING DIRECTORY279043 descriptors in extracted directory.
The problem is that this descriptor somehow causes thousands of descriptors to go unread in the tarball as you observed, yet I haven't a clue why that is...
atagar@morrigan:~/Desktop/stem$ python example.py READING TAR250048 descriptors in tarball.READING DIRECTORY279042 descriptors in extracted directory.
... oh wow. That's a little embarrassing - I can't believe that it took me this long to spot this. We abort parsing archives when we encounter a non-descriptor. Changing the behavior to instead treat the tarball contents as individual files.
Fixed and retested with that tarball. I'm a little confused why the integ test I wrote earlier to repro this didn't surface this problem, but I've banged my head against this issue far too long to continue to care. :)
It now reads 279042 descriptors from the tarball as it does with the extracted version. Thanks for the catch!