Stem's descriptor parser has gotten a pretty good workout, being exercised every time we do a run of the integ tests. However, I've only done spot checks for historical data.
We should talk with Karsten about running a small stem script on one of the metrics hosts that attempts to parse all of the historical descriptors. The script would be trivial to write, and given a week or so we'd know either that stem can handle all historical descriptor content, or where the issues lie.
Thanks! Here's a script that should do the trick. Just fill in the 'LOG_FILE' with the destination for the output, and provide the descriptor paths to the reader. The DescriptorReader's paths can be either files or directories.
# Reads a series of files, logging issues that it comes across.

import logging

from stem.descriptor.reader import DescriptorReader

LOG_FILE = "/home/atagar/Desktop/check_descriptors_log"

LOGGER = logging.getLogger("check_descriptors")
LOGGER.addHandler(logging.FileHandler(LOG_FILE))
LOGGER.setLevel(logging.DEBUG)

reader = DescriptorReader((
  "/home/atagar/Desktop/stem/test/data/cached-descriptors",
  "/home/atagar/Desktop/stem/test/data/cached-consensus",
))

reader.register_read_listener(
  lambda path: LOGGER.debug("Reading %s" % path)
)

reader.register_skip_listener(
  lambda path, exc: LOGGER.warning(" skipped due to %s" % exc)
)

with reader:
  for descriptor in reader:
    unrecognized_lines = descriptor.get_unrecognized_lines()

    if unrecognized_lines:
      LOGGER.warning(" unrecognized descriptor content: %s" % unrecognized_lines)
Are the descriptors in text files or tarballs? I'm hoping for the former since I suspect that we still have performance concerns around tarballs, but there's no rush on this so as long as it finishes eventually I'm happy.
Okay, I started running this on serra. This will take a few days to run. Good thing serra is bored anyway.
I'm feeding it with decompressed tarballs. That's what's fastest with metrics-lib. Do you know if that's different for stem? If so, can we do anything to improve parsing decompressed tarballs, because that's most convenient for all sorts of analyses? (Extracting years of descriptor tarballs is somewhat painful, in particular if you accidentally include those directories in a backup.)
Trac: Status: new to accepted Owner: atagar to karsten
Thanks!
I'm feeding it with decompressed tarballs. That's what's fastest with metrics-lib.
In looking back at our "Python metrics-lib" thread from 3/25/12 it looks like stem was slower with uncompressed tarballs, but not disastrously so. It's something that would be really nice to fix, but probably isn't critical for this.
Ahh, you're right. I tried again with an uncompressed tarball and the runtime for the same cached descriptor was 7.94 seconds (0.0059 seconds per entry). That's about 1.5x slower than a plaintext descriptor, which is bad, but not outside the realm of being reasonable.
Started parsing consensuses, ran into #7866 (moved), will resume once that one is resolved.
Completed relay server descriptors, only issues were "reject6" lines ("unrecognized descriptor content: ['reject6 [::1]/8:*']") which were never in dir-spec.txt and apparently will never be. Considering this done.
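For anyone repeating this run, a quick way to confirm that "reject6" really is the only offender is to tally the unrecognized lines by keyword. The following is a minimal Python 3 sketch, assuming the log lines look like the ones written by the script above (a list repr after "unrecognized descriptor content: "). The names `PREFIX` and `tally_unrecognized` are made up for this example and are not part of stem.

```python
import ast
from collections import Counter

# Assumed message prefix, matching the LOGGER.warning() call in the script above.
PREFIX = "unrecognized descriptor content: "


def tally_unrecognized(log_lines):
    """Counts unrecognized descriptor lines by their leading keyword."""
    counts = Counter()

    for line in log_lines:
        line = line.strip()

        if not line.startswith(PREFIX):
            continue  # "Reading ..." and "skipped ..." lines are ignored

        # The script logs the repr of a list of strings, so literal_eval
        # can turn it back into a list without executing anything.
        for entry in ast.literal_eval(line[len(PREFIX):]):
            keyword = entry.split(" ", 1)[0]
            counts[keyword] += 1

    return counts
```

Feeding it the open log file (`tally_unrecognized(open(LOG_FILE))`) should give one count per keyword, so anything other than "reject6" stands out immediately.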
Started parsing consensuses, ran into #7866 (moved), will resume once that one is resolved.
Thanks! Done.
Cool! Resuming to parse consensuses with the new Stem version after handling the fallout of parsing 2011 and 2010 votes.
Completed relay server descriptors, only issues were "reject6" lines
That's odd. Was including them a tor bug?
I think this was people experimenting with adding IPv6 exit support to Tor. I wouldn't worry about these lines, but Nick would be in a better position to answer this.
FYI, there might also be an issue with parsing the following line in the consensus documents
Are you sure? That consensus header was in your example for #7866 (moved), so if it recognized it there then I'm not sure what sort of issue you mean.
I didn't run into problems here. Note that peer wrote this, so maybe (s)he ran into problems parsing files with some other tool.
Not really. See #8049 (closed) which defeated all past efforts here. I was waiting for that ticket to be resolved before starting over, because extracting tarballs containing lots of files is quite painful. But I'll start with network statuses now. It would be cool to have a fix for #8049 (closed) for server and extra-info descriptors though.
Consensuses and votes are parsed as of two days ago. No new problems there.
Server descriptors and extra-info descriptors are inflated as of yesterday and are running now. I expect that to take a few days. Will let you know how it goes.
There's a problem, but I can't track it down right now:
karsten@serra:~/tasks/task-7828/stem$ ./parse.py
ParsingFailure!
Exception in thread Descriptor Reader:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/karsten/tasks/task-7828/stem/stem/descriptor/reader.py", line 434, in _read_descriptor_files
    self._handle_walker(walker, new_processed_files)
  File "/home/karsten/tasks/task-7828/stem/stem/descriptor/reader.py", line 462, in _handle_walker
    self._handle_file(os.path.join(root, filename), new_processed_files)
  File "/home/karsten/tasks/task-7828/stem/stem/descriptor/reader.py", line 515, in _handle_file
    self._handle_archive(target)
  File "/home/karsten/tasks/task-7828/stem/stem/descriptor/reader.py", line 571, in _handle_archive
    self._notify_skip_listeners(target, ParsingFailure(exc))
  File "/home/karsten/tasks/task-7828/stem/stem/descriptor/reader.py", line 586, in _notify_skip_listeners
    listener(path, exception)
  File "./parse.py", line 22, in <lambda>
    lambda path, exc: LOGGER.warning(" skipped %s due to '%s' (type: %s)" % (path, exc, type(exc), ))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-35: ordinal not in range(128)
^C^Z
[3]+ Stopped ./parse.py
karsten@serra:~/tasks/task-7828/stem$
karsten@serra:~/tasks/task-7828/stem$ git diff
diff --git a/stem/descriptor/__init__.py b/stem/descriptor/__init__.py
index 25b180b..395cbe6 100644
--- a/stem/descriptor/__init__.py
+++ b/stem/descriptor/__init__.py
@@ -331,11 +331,14 @@ class _UnicodeReader(object):
   def readline(self):
     return stem.util.str_tools.to_unicode(self.wrapped_file.readline())

-  def readlines(self, sizehint = 0):
+  def readlines(self, sizehint = None):
     # being careful to do in-place conversion so we don't accidently double our
     # memory usage

-    results = self.wrapped_file.readlines(sizehint)
+    if sizehint is not None:
+      results = self.wrapped_file.readlines(sizehint)
+    else:
+      results = self.wrapped_file.readlines()

     for i in xrange(len(results)):
       results[i] = stem.util.str_tools.to_unicode(results[i])
diff --git a/stem/descriptor/reader.py b/stem/descriptor/reader.py
index 0125a49..55ef886 100644
--- a/stem/descriptor/reader.py
+++ b/stem/descriptor/reader.py
@@ -126,8 +126,8 @@ class ParsingFailure(FileSkipped):
   def __init__(self, parsing_exception):
     super(ParsingFailure, self).__init__(parsing_exception)
     self.exception = parsing_exception
-    print "ParsingFailure: %s" % (parsing_exception, )
-
+    print "ParsingFailure!"
+    #print "ParsingFailure: %s" % (parsing_exception.encode('ascii', 'ignore'), )

 class UnrecognizedType(FileSkipped):
   """
karsten@serra:~/tasks/task-7828/stem$ git log | head
commit 3fd28f26a86e6e071906d77c5bc8d6f6c6fb52aa
Merge: 8615af1 be9a532
Author: Karsten Loesing <karsten@serra.torproject.org>
Date: Tue Feb 26 11:58:50 2013 +0000

    Merge branch 'master' of https://git.torproject.org/stem

commit be9a5323a37ea0f1b7d497d7fc33e101453eb2cf
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date: Wed Feb 20 12:26:29 2013 +0100
karsten@serra:~/tasks/task-7828/stem$ ls data/
extra-infos-2007-08.tar  extra-infos-2010-09.tar         server-descriptors-2006-11.tar  server-descriptors-2009-12.tar
extra-infos-2007-09.tar  extra-infos-2010-10.tar         server-descriptors-2006-12.tar  server-descriptors-2010-01.tar
extra-infos-2007-10.tar  extra-infos-2010-11.tar         server-descriptors-2007-01.tar  server-descriptors-2010-02.tar
extra-infos-2007-11.tar  extra-infos-2010-12.tar         server-descriptors-2007-02.tar  server-descriptors-2010-03.tar
extra-infos-2007-12.tar  extra-infos-2011-01.tar         server-descriptors-2007-03.tar  server-descriptors-2010-04.tar
extra-infos-2008-01.tar  extra-infos-2011-02.tar         server-descriptors-2007-04.tar  server-descriptors-2010-05.tar
extra-infos-2008-02.tar  extra-infos-2011-03.tar         server-descriptors-2007-05.tar  server-descriptors-2010-06.tar
extra-infos-2008-03.tar  extra-infos-2011-04.tar         server-descriptors-2007-06.tar  server-descriptors-2010-07.tar
extra-infos-2008-04.tar  extra-infos-2011-05.tar         server-descriptors-2007-07.tar  server-descriptors-2010-08.tar
extra-infos-2008-05.tar  extra-infos-2011-06.tar         server-descriptors-2007-08.tar  server-descriptors-2010-09.tar
extra-infos-2008-06.tar  extra-infos-2011-07.tar         server-descriptors-2007-09.tar  server-descriptors-2010-10.tar
extra-infos-2008-07.tar  extra-infos-2011-08.tar         server-descriptors-2007-10.tar  server-descriptors-2010-11.tar
extra-infos-2008-08.tar  extra-infos-2011-09.tar         server-descriptors-2007-11.tar  server-descriptors-2010-12.tar
extra-infos-2008-09.tar  extra-infos-2011-10.tar         server-descriptors-2007-12.tar  server-descriptors-2011-01.tar
extra-infos-2008-10.tar  extra-infos-2011-11.tar         server-descriptors-2008-01.tar  server-descriptors-2011-02.tar
extra-infos-2008-11.tar  extra-infos-2011-12.tar         server-descriptors-2008-02.tar  server-descriptors-2011-03.tar
extra-infos-2008-12.tar  extra-infos-2012-01.tar         server-descriptors-2008-03.tar  server-descriptors-2011-04.tar
extra-infos-2009-01.tar  extra-infos-2012-02.tar         server-descriptors-2008-04.tar  server-descriptors-2011-05.tar
extra-infos-2009-02.tar  extra-infos-2012-03.tar         server-descriptors-2008-05.tar  server-descriptors-2011-06.tar
extra-infos-2009-03.tar  extra-infos-2012-04.tar         server-descriptors-2008-06.tar  server-descriptors-2011-07.tar
extra-infos-2009-04.tar  extra-infos-2012-05.tar         server-descriptors-2008-07.tar  server-descriptors-2011-08.tar
extra-infos-2009-05.tar  extra-infos-2012-06.tar         server-descriptors-2008-08.tar  server-descriptors-2011-09.tar
extra-infos-2009-06.tar  extra-infos-2012-07.tar         server-descriptors-2008-09.tar  server-descriptors-2011-10.tar
extra-infos-2009-07.tar  extra-infos-2012-08.tar         server-descriptors-2008-10.tar  server-descriptors-2011-11.tar
extra-infos-2009-08.tar  extra-infos-2012-09.tar         server-descriptors-2008-11.tar  server-descriptors-2011-12.tar
extra-infos-2009-09.tar  extra-infos-2012-10.tar         server-descriptors-2008-12.tar  server-descriptors-2012-01.tar
extra-infos-2009-10.tar  extra-infos-2012-11.tar         server-descriptors-2009-01.tar  server-descriptors-2012-02.tar
extra-infos-2009-11.tar  server-descriptors-2005-12.tar  server-descriptors-2009-02.tar  server-descriptors-2012-03.tar
extra-infos-2009-12.tar  server-descriptors-2006-02.tar  server-descriptors-2009-03.tar  server-descriptors-2012-04.tar
extra-infos-2010-01.tar  server-descriptors-2006-03.tar  server-descriptors-2009-04.tar  server-descriptors-2012-05.tar
extra-infos-2010-02.tar  server-descriptors-2006-04.tar  server-descriptors-2009-05.tar  server-descriptors-2012-06.tar
extra-infos-2010-03.tar  server-descriptors-2006-05.tar  server-descriptors-2009-06.tar  server-descriptors-2012-07.tar
extra-infos-2010-04.tar  server-descriptors-2006-06.tar  server-descriptors-2009-07.tar  server-descriptors-2012-08.tar
extra-infos-2010-05.tar  server-descriptors-2006-07.tar  server-descriptors-2009-08.tar  server-descriptors-2012-09.tar
extra-infos-2010-06.tar  server-descriptors-2006-08.tar  server-descriptors-2009-09.tar  server-descriptors-2012-10.tar
extra-infos-2010-07.tar  server-descriptors-2006-09.tar  server-descriptors-2009-10.tar  server-descriptors-2012-11.tar
extra-infos-2010-08.tar  server-descriptors-2006-10.tar  server-descriptors-2009-11.tar
Want to get an account on serra and try parsing descriptors yourself? I might not be able to look into this in the next week or two, or I'll run into trouble with deliverables. :/
Want to get an account on serra and try parsing descriptors yourself?
That's probably not necessary. The script's read listener should have registered the file it was last on in check_descriptors_log, right? If so then I can just snag the particular tarball in question.
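As a rough illustration of that idea, here is a minimal Python 3 sketch that pulls the last file the read listener logged. It assumes the log uses the "Reading <path>" message format from the original check script. The helper name `last_read_path` is hypothetical.

```python
# Assumed message prefix, matching the read listener in the check script.
READ_PREFIX = "Reading "


def last_read_path(log_lines):
    """Returns the path from the final 'Reading ...' entry, or None."""
    last = None

    for line in log_lines:
        line = line.rstrip("\n")

        if line.startswith(READ_PREFIX):
            last = line[len(READ_PREFIX):]

    return last
```

Since the reader works through files in a background thread, the last logged path is only a strong hint at the tarball that triggered the failure, not a guarantee.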
Ugh, this was another instance of #8049 (closed). The parsing script (not stem) broke when printing out " skipped %s due to '%s' (type: %s)" for non-ASCII line "<9C><FD>^Xcontact root@[ipaddress]". See the traceback above.
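One way to keep a logging script from tripping over such content is to escape non-ASCII characters before formatting the message. Below is a minimal sketch in Python 3 (the run above used Python 2.6, where the details differ). The helpers `ascii_safe` and `make_skip_listener` are hypothetical names for this example. Only `register_skip_listener` comes from the script earlier in this ticket.

```python
import logging


def ascii_safe(value):
    """Renders any value as ASCII text, escaping characters outside ASCII."""
    # backslashreplace turns e.g. U+009C into the four characters \x9c
    # rather than raising UnicodeEncodeError.
    return str(value).encode("ascii", "backslashreplace").decode("ascii")


def make_skip_listener(logger):
    """Builds a skip listener whose log messages are always plain ASCII."""
    def on_skip(path, exc):
        logger.warning(
            " skipped %s due to '%s' (type: %s)",
            ascii_safe(path), ascii_safe(exc), type(exc).__name__,
        )

    return on_skip
```

It would be hooked up with something like `reader.register_skip_listener(make_skip_listener(LOGGER))`, so a descriptor with binary junk in its contact line gets logged in escaped form instead of killing the reader thread.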
I guess we have reached a point where the effort for finding new bugs is higher than fixing bugs we're going to find during normal operation. For example, we might find more bugs in metrics tarballs when initializing pyonionoo once it's written. But I think we should stop actively searching for bugs at this point. I'd say feel free to close the ticket.
Great, thanks for the help! Given how successful this was at finding issues I should set up a daemon to continually do this kind of check, but that can fall under another ticket (#8677 (closed)).
Again, all the help is much appreciated! -Damian
Trac: Status: accepted to closed Resolution: N/A to fixed