Opened 5 years ago

Closed 4 years ago

#7828 closed task (fixed)

Run descriptor parser over all prior descriptors

Reported by: atagar
Owned by: karsten
Priority: Medium
Milestone:
Component: Core Tor/Stem
Version:
Severity:
Keywords: descriptors
Cc: gsathya, karsten
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Stem's descriptor parser has gotten a pretty good workout, being exercised every time we do a run of the integ tests. However, I've only done spot checks for historical data.

We should talk with Karsten about running a small stem script on one of the metrics hosts that attempts to parse all of the historical descriptors. The script would be trivial to write, and given a week or so we'd know either that stem can handle all historical descriptor content, or where the issues lie.

Child Tickets

Change History (19)

comment:1 Changed 5 years ago by gsathya

  • Cc gsathya added

comment:2 Changed 5 years ago by karsten

  • Cc karsten added

Happy to run such a script. Or, if gsathya wants to run it, serra seems to be a fine place to do that.

comment:3 follow-up: Changed 5 years ago by atagar

Thanks! Here's a script that should do the trick. Just fill in the 'LOG_FILE' with the destination for the output, and provide the descriptor paths to the reader. The DescriptorReader's paths can be either files or directories.

#!/usr/bin/env python
# Reads a series of files, logging issues that it comes across.

import logging
from stem.descriptor.reader import DescriptorReader

LOG_FILE = "/home/atagar/Desktop/check_descriptors_log"

LOGGER = logging.getLogger("check_descriptors")
LOGGER.addHandler(logging.FileHandler(LOG_FILE))
LOGGER.setLevel(logging.DEBUG)

reader = DescriptorReader((
  "/home/atagar/Desktop/stem/test/data/cached-descriptors",
  "/home/atagar/Desktop/stem/test/data/cached-consensus",
))

reader.register_read_listener(
  lambda path: LOGGER.debug("Reading %s" % path)
)

reader.register_skip_listener(
  lambda path, exc: LOGGER.warning("  skipped due to %s" % exc)
)

with reader:
  for descriptor in reader:
    unrecognized_lines = descriptor.get_unrecognized_lines()
    
    if unrecognized_lines:
      LOGGER.warning("  unrecognized descriptor content: %s" % unrecognized_lines)

Are the descriptors in text files or tarballs? I'm hoping for the former since I suspect that we still have performance concerns around tarballs, but there's no rush on this so as long as it finishes eventually I'm happy.

Cheers! -Damian

comment:4 in reply to: ↑ 3 Changed 5 years ago by karsten

  • Owner changed from atagar to karsten
  • Status changed from new to accepted

Replying to atagar:

Thanks! Here's a script that should do the trick. Just fill in the 'LOG_FILE' with the destination for the output, and provide the descriptor paths to the reader. The DescriptorReader's paths can be either files or directories.

Okay, I started running this on serra. This will take a few days to run. Good thing serra is bored anyway.

Are the descriptors in text files or tarballs? I'm hoping for the former since I suspect that we still have performance concerns around tarballs, but there's no rush on this so as long as it finishes eventually I'm happy.

I'm feeding it with decompressed tarballs. That's what's fastest with metrics-lib. Do you know if that's different for stem? If so, can we do anything to improve parsing decompressed tarballs, because that's most convenient for all sorts of analyses? (Extracting years of descriptor tarballs is somewhat painful, in particular if you accidentally include those directories in a backup.)
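
For reference, the only change from the script above is pointing the reader at the .tar archives themselves; stem's DescriptorReader detects tar files and iterates over their members. The paths below are placeholders, not the actual locations on serra:

# Same reader setup as above, but fed decompressed (.tar) archives.
# Placeholder paths -- substitute the real tarball locations.
reader = DescriptorReader((
  "/path/to/server-descriptors-2012-01.tar",
  "/path/to/extra-infos-2012-01.tar",
))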

comment:5 Changed 5 years ago by atagar

Okay, I started running this on serra. This will take a few days to run. Good thing serra is bored anyway.

Thanks!

I'm feeding it with decompressed tarballs. That's what's fastest with metrics-lib.

Looking back at our "Python metrics-lib" thread from 3/25/12, it looks like stem was slower with uncompressed tarballs, but not disastrously so. It's something that would be really nice to fix, but probably isn't critical for this.

Ahh, you're right. I tried again with an uncompressed tarball and the runtime for the same cached descriptor was 7.94 seconds (0.0059 seconds per entry). That's about 1.5x slower than a plaintext descriptor, which is bad, but not outside the realm of being reasonable.
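
For what it's worth, the sort of comparison behind that measurement looks roughly like this (a sketch with placeholder paths, not the original benchmark script):

#!/usr/bin/env python
# Rough timing sketch: compares iterating a plaintext descriptor file
# against a tarred copy of the same content. Paths are placeholders.

import time

from stem.descriptor.reader import DescriptorReader

TARGETS = {
  "plaintext": "/path/to/cached-descriptors",
  "tarball": "/path/to/cached-descriptors.tar",
}

for label, path in TARGETS.items():
  start, count = time.time(), 0

  with DescriptorReader([path]) as reader:
    for descriptor in reader:
      count += 1

  runtime = time.time() - start
  print "%s: %.2f seconds (%.4f seconds per entry)" % (label, runtime, runtime / max(count, 1))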

Cheers! -Damian

comment:6 Changed 5 years ago by karsten

Some progress:

  • Started parsing consensuses, ran into #7866, will resume once that one is resolved.
  • Completed relay server descriptors, only issues were "reject6" lines ("unrecognized descriptor content: ['reject6 [::1]/8:*']") which were never in dir-spec.txt and apparently will never be. Considering this done.
  • Relay extra-info descriptors will be next.

comment:7 Changed 5 years ago by peer

FYI, there might also be an issue with parsing the following line in the consensus documents (see #7241).

@type network-status-consensus-3 1.0

comment:8 follow-up: Changed 5 years ago by atagar

Started parsing consensuses, ran into #7866, will resume once that one is resolved.

Thanks! Done.

Completed relay server descriptors, only issues were "reject6" lines

That's odd. Was including them a tor bug?

FYI, there might also be an issue with parsing the following line in the consensus documents

Are you sure? That consensus header was in your example for #7866, so if it recognized it there then I'm not sure what sort of issue you mean.

Cheers! -Damian

comment:9 in reply to: ↑ 8 Changed 5 years ago by karsten

Replying to atagar:

Started parsing consensuses, ran into #7866, will resume once that one is resolved.

Thanks! Done.

Cool! Resuming to parse consensuses with the new Stem version after handling the fallout of parsing 2011 and 2010 votes.

Completed relay server descriptors, only issues were "reject6" lines

That's odd. Was including them a tor bug?

I think this was people experimenting with adding IPv6 exit support to Tor. I wouldn't worry about these lines, but Nick would be in a better position to answer this.

FYI, there might also be an issue with parsing the following line in the consensus documents

Are you sure? That consensus header was in your example for #7866, so if it recognized it there then I'm not sure what sort of issue you mean.

I didn't run into problems here. Note that peer wrote this, so maybe (s)he ran into problems parsing files with some other tool.

comment:10 Changed 4 years ago by atagar

Cool! Resuming to parse consensuses with the new Stem version after handling the fallout of parsing 2011 and 2010 votes.

Hi Karsten. Any more finds?

comment:11 Changed 4 years ago by karsten

Not really. See #8049 which defeated all past efforts here. I was waiting for that ticket to be resolved before starting over, because extracting tarballs containing lots of files is quite painful. But I'll start with network statuses now. It would be cool to have a fix for #8049 for server and extra-info descriptors though.

comment:12 Changed 4 years ago by atagar

Ahhh, gotcha. #8049 is the next thing on my dance card so I'll be looking at it tomorrow morning.

comment:13 Changed 4 years ago by atagar

Hi Karsten. How is this going?

comment:14 Changed 4 years ago by karsten

Consensuses and votes are parsed as of two days ago. No new problems there.

Server descriptors and extra-info descriptors are inflated as of yesterday and are running now. I expect that to take a few days. Will let you know how it goes.

comment:15 Changed 4 years ago by atagar

Hi Karsten, has the parser finished with the server and extra-info descriptors? Any new finds?

comment:16 Changed 4 years ago by karsten

There's a problem, but I can't track it down right now:

karsten@serra:~/tasks/task-7828/stem$ ./parse.py
ParsingFailure!
Exception in thread Descriptor Reader:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/karsten/tasks/task-7828/stem/stem/descriptor/reader.py", line 434, in _read_descriptor_files
    self._handle_walker(walker, new_processed_files)
  File "/home/karsten/tasks/task-7828/stem/stem/descriptor/reader.py", line 462, in _handle_walker
    self._handle_file(os.path.join(root, filename), new_processed_files)
  File "/home/karsten/tasks/task-7828/stem/stem/descriptor/reader.py", line 515, in _handle_file
    self._handle_archive(target)
  File "/home/karsten/tasks/task-7828/stem/stem/descriptor/reader.py", line 571, in _handle_archive
    self._notify_skip_listeners(target, ParsingFailure(exc))
  File "/home/karsten/tasks/task-7828/stem/stem/descriptor/reader.py", line 586, in _notify_skip_listeners
    listener(path, exception)
  File "./parse.py", line 22, in <lambda>
    lambda path, exc: LOGGER.warning("  skipped %s due to '%s' (type: %s)" % (path, exc, type(exc), ))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-35: ordinal not in range(128)

^C^Z
[3]+  Stopped                 ./parse.py
karsten@serra:~/tasks/task-7828/stem$
karsten@serra:~/tasks/task-7828/stem$ git diff
diff --git a/stem/descriptor/__init__.py b/stem/descriptor/__init__.py
index 25b180b..395cbe6 100644
--- a/stem/descriptor/__init__.py
+++ b/stem/descriptor/__init__.py
@@ -331,11 +331,14 @@ class _UnicodeReader(object):
   def readline(self):
     return stem.util.str_tools.to_unicode(self.wrapped_file.readline())

-  def readlines(self, sizehint = 0):
+  def readlines(self, sizehint = None):
     # being careful to do in-place conversion so we don't accidently double our
     # memory usage

-    results = self.wrapped_file.readlines(sizehint)
+    if sizehint is not None:
+      results = self.wrapped_file.readlines(sizehint)
+    else:
+      results = self.wrapped_file.readlines()

     for i in xrange(len(results)):
       results[i] = stem.util.str_tools.to_unicode(results[i])
diff --git a/stem/descriptor/reader.py b/stem/descriptor/reader.py
index 0125a49..55ef886 100644
--- a/stem/descriptor/reader.py
+++ b/stem/descriptor/reader.py
@@ -126,8 +126,8 @@ class ParsingFailure(FileSkipped):
   def __init__(self, parsing_exception):
     super(ParsingFailure, self).__init__(parsing_exception)
     self.exception = parsing_exception
-    print "ParsingFailure: %s" % (parsing_exception, )
-
+    print "ParsingFailure!"
+    #print "ParsingFailure: %s" % (parsing_exception.encode('ascii', 'ignore'), )

 class UnrecognizedType(FileSkipped):
   """
karsten@serra:~/tasks/task-7828/stem$ git log | head
commit 3fd28f26a86e6e071906d77c5bc8d6f6c6fb52aa
Merge: 8615af1 be9a532
Author: Karsten Loesing <karsten@serra.torproject.org>
Date:   Tue Feb 26 11:58:50 2013 +0000

    Merge branch 'master' of https://git.torproject.org/stem

commit be9a5323a37ea0f1b7d497d7fc33e101453eb2cf
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Wed Feb 20 12:26:29 2013 +0100
karsten@serra:~/tasks/task-7828/stem$ ls data/
extra-infos-2007-08.tar  extra-infos-2010-09.tar         server-descriptors-2006-11.tar  server-descriptors-2009-12.tar
extra-infos-2007-09.tar  extra-infos-2010-10.tar         server-descriptors-2006-12.tar  server-descriptors-2010-01.tar
extra-infos-2007-10.tar  extra-infos-2010-11.tar         server-descriptors-2007-01.tar  server-descriptors-2010-02.tar
extra-infos-2007-11.tar  extra-infos-2010-12.tar         server-descriptors-2007-02.tar  server-descriptors-2010-03.tar
extra-infos-2007-12.tar  extra-infos-2011-01.tar         server-descriptors-2007-03.tar  server-descriptors-2010-04.tar
extra-infos-2008-01.tar  extra-infos-2011-02.tar         server-descriptors-2007-04.tar  server-descriptors-2010-05.tar
extra-infos-2008-02.tar  extra-infos-2011-03.tar         server-descriptors-2007-05.tar  server-descriptors-2010-06.tar
extra-infos-2008-03.tar  extra-infos-2011-04.tar         server-descriptors-2007-06.tar  server-descriptors-2010-07.tar
extra-infos-2008-04.tar  extra-infos-2011-05.tar         server-descriptors-2007-07.tar  server-descriptors-2010-08.tar
extra-infos-2008-05.tar  extra-infos-2011-06.tar         server-descriptors-2007-08.tar  server-descriptors-2010-09.tar
extra-infos-2008-06.tar  extra-infos-2011-07.tar         server-descriptors-2007-09.tar  server-descriptors-2010-10.tar
extra-infos-2008-07.tar  extra-infos-2011-08.tar         server-descriptors-2007-10.tar  server-descriptors-2010-11.tar
extra-infos-2008-08.tar  extra-infos-2011-09.tar         server-descriptors-2007-11.tar  server-descriptors-2010-12.tar
extra-infos-2008-09.tar  extra-infos-2011-10.tar         server-descriptors-2007-12.tar  server-descriptors-2011-01.tar
extra-infos-2008-10.tar  extra-infos-2011-11.tar         server-descriptors-2008-01.tar  server-descriptors-2011-02.tar
extra-infos-2008-11.tar  extra-infos-2011-12.tar         server-descriptors-2008-02.tar  server-descriptors-2011-03.tar
extra-infos-2008-12.tar  extra-infos-2012-01.tar         server-descriptors-2008-03.tar  server-descriptors-2011-04.tar
extra-infos-2009-01.tar  extra-infos-2012-02.tar         server-descriptors-2008-04.tar  server-descriptors-2011-05.tar
extra-infos-2009-02.tar  extra-infos-2012-03.tar         server-descriptors-2008-05.tar  server-descriptors-2011-06.tar
extra-infos-2009-03.tar  extra-infos-2012-04.tar         server-descriptors-2008-06.tar  server-descriptors-2011-07.tar
extra-infos-2009-04.tar  extra-infos-2012-05.tar         server-descriptors-2008-07.tar  server-descriptors-2011-08.tar
extra-infos-2009-05.tar  extra-infos-2012-06.tar         server-descriptors-2008-08.tar  server-descriptors-2011-09.tar
extra-infos-2009-06.tar  extra-infos-2012-07.tar         server-descriptors-2008-09.tar  server-descriptors-2011-10.tar
extra-infos-2009-07.tar  extra-infos-2012-08.tar         server-descriptors-2008-10.tar  server-descriptors-2011-11.tar
extra-infos-2009-08.tar  extra-infos-2012-09.tar         server-descriptors-2008-11.tar  server-descriptors-2011-12.tar
extra-infos-2009-09.tar  extra-infos-2012-10.tar         server-descriptors-2008-12.tar  server-descriptors-2012-01.tar
extra-infos-2009-10.tar  extra-infos-2012-11.tar         server-descriptors-2009-01.tar  server-descriptors-2012-02.tar
extra-infos-2009-11.tar  server-descriptors-2005-12.tar  server-descriptors-2009-02.tar  server-descriptors-2012-03.tar
extra-infos-2009-12.tar  server-descriptors-2006-02.tar  server-descriptors-2009-03.tar  server-descriptors-2012-04.tar
extra-infos-2010-01.tar  server-descriptors-2006-03.tar  server-descriptors-2009-04.tar  server-descriptors-2012-05.tar
extra-infos-2010-02.tar  server-descriptors-2006-04.tar  server-descriptors-2009-05.tar  server-descriptors-2012-06.tar
extra-infos-2010-03.tar  server-descriptors-2006-05.tar  server-descriptors-2009-06.tar  server-descriptors-2012-07.tar
extra-infos-2010-04.tar  server-descriptors-2006-06.tar  server-descriptors-2009-07.tar  server-descriptors-2012-08.tar
extra-infos-2010-05.tar  server-descriptors-2006-07.tar  server-descriptors-2009-08.tar  server-descriptors-2012-09.tar
extra-infos-2010-06.tar  server-descriptors-2006-08.tar  server-descriptors-2009-09.tar  server-descriptors-2012-10.tar
extra-infos-2010-07.tar  server-descriptors-2006-09.tar  server-descriptors-2009-10.tar  server-descriptors-2012-11.tar
extra-infos-2010-08.tar  server-descriptors-2006-10.tar  server-descriptors-2009-11.tar

Want to get an account on serra and try parsing descriptors yourself? I might not be able to look into this in the next week or two, or I'll run into trouble with deliverables. :/

comment:17 Changed 4 years ago by atagar

Want to get an account on serra and try parsing descriptors yourself?

That's probably not necessary. The script's read listener should have registered the file it was last on in check_descriptors_log, right? If so then I can just snag the particular tarball in question.

comment:18 Changed 4 years ago by karsten

Ugh, this was another instance of #8049. The parsing script (not stem) broke when printing out " skipped %s due to '%s' (type: %s)" for non-ASCII line "<9C><FD>^Xcontact root@[ipaddress]". See the traceback above.
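
For what it's worth, a skip listener along these lines shouldn't choke on that content; formatting the exception with %r instead of %s keeps the message ASCII-safe (just a sketch against the script above):

# Unicode-safe variant of the skip listener. Using %r avoids the implicit
# str() conversion of the exception, which is what raised the
# UnicodeEncodeError in the traceback above.

def log_skip(path, exc):
  LOGGER.warning("  skipped %s due to %r (type: %s)" % (path, exc, type(exc)))

reader.register_skip_listener(log_skip)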

I guess we have reached a point where the effort of hunting for new bugs is higher than that of just fixing the bugs we're going to find during normal operation anyway. For example, we might find more bugs in metrics tarballs when initializing pyonionoo once it's written. But I think we should stop actively searching for bugs at this point. I'd say feel free to close the ticket.

comment:19 Changed 4 years ago by atagar

  • Resolution set to fixed
  • Status changed from accepted to closed

Great, thanks for the help! Given how successful this was at finding issues I should set up a daemon to continually do this kind of check, but that can fall under another ticket (#8677).
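
Roughly what I have in mind for that daemon is below (just a sketch with placeholder paths; the reader's processed-files helpers let it skip content it has already checked):

#!/usr/bin/env python
# Sketch of a recurring checker: re-scan a descriptor directory, skipping
# files that were already processed in earlier passes. Paths are placeholders.

import time

from stem.descriptor.reader import (
  DescriptorReader,
  load_processed_files,
  save_processed_files,
)

WATCH_DIR = "/path/to/recent/descriptors"
STATE_FILE = "/path/to/processed_files"

while True:
  try:
    processed = load_processed_files(STATE_FILE)
  except IOError:
    processed = {}  # first run, no state file yet

  reader = DescriptorReader([WATCH_DIR])
  reader.set_processed_files(processed)

  with reader:
    for descriptor in reader:
      if descriptor.get_unrecognized_lines():
        print "unrecognized content in %s" % descriptor.get_path()

  save_processed_files(STATE_FILE, reader.get_processed_files())
  time.sleep(60 * 60)  # check again in an hour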

Again, all the help is much appreciated! -Damian
