Opened 5 years ago

Closed 5 years ago

#6569 closed task (implemented)

Stem network status document parsing

Reported by: neena
Owned by: neena
Priority: Medium
Milestone:
Component: Core Tor/Stem
Version:
Severity:
Keywords:
Cc: atagar
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Stem should parse v3 network status documents. I've implemented this in my branch here.

Child Tickets

Change History (13)

comment:1 Changed 5 years ago by neena

  • Status changed from new to needs_review

Note: some of the private methods, _generate_router in particular, were written so that they could be overridden for microdescriptor consensus parsing.

comment:2 Changed 5 years ago by atagar

  • Status changed from needs_review to needs_revision

Hi Ravi. Great to see that this is progressing!

stem/descriptor/__init__.py
+ import stem.descriptor.networkstatus_descriptor

Wrong import and mix of new/old usage. I'm guessing that the tests don't cover parse_file()? If it is being tested then this might be due to a phantom networkstatus_descriptor.pyc (I hate it when that happens).

+def _strptime(string, validate = True, optional = False):
+  try:
+    return datetime.datetime.strptime(string, "%Y-%m-%d %H:%M:%S")
+  except ValueError, exc:
+    if validate or not optional: raise exc

Please end with "else: return None". I know this is the default behavior when there isn't a return, but explicit is better than implicit. :)
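
For illustration, the helper with that explicit else would read like this (just a sketch, in the same Python 2 style as the snippet above)...

import datetime

def _strptime(string, validate = True, optional = False):
  try:
    return datetime.datetime.strptime(string, "%Y-%m-%d %H:%M:%S")
  except ValueError, exc:
    if validate or not optional:
      raise exc
    else:
      return None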

+class DescriptorParser:
+  """
+  Helper class to parse documents.
+
+  :var str line: current line being parsed
+  :var list lines: list of remaining lines to be parsed
+  """

Needs to extend object. I'm also kinda worried about how it holds on to a copy of both _raw_content and lines. Ideally it would have neither. I realize that the Descriptor class has a _raw_content attribute for its str() method. In DescriptorParser's case though _raw_content is presently unused and it should be lazily fetched since we're dealing with *far* larger amounts of data here. My suggestion would be for the constructor to accept a file object, then read from it on demand. This will let us have constant memory usage.

Speaking of which does this now work when processing a full consensus? I'm not spotting the test we talked about for this though maybe I'm missing it...

This class reminds me a lot of the ControlLine class. I didn't need this style of helper for the server or extrainfo descriptor classes... are you sure that it's really necessary here? The descriptors do a lot of this (first pass parsing, 'opt' prefix handling, etc) via the _get_descriptor_components() helper, and your read_block() method reminds me a bit of the _get_pseudo_pgp_block() helper.

This style of helper class generally hurts code readability. I made the ControlLine class because controller responses are generally structured like a stack, and I wanted to avoid the numerous regexes that TorCtl uses to handle it. That said, I'd love to get rid of the ControlLine class if we had a good alternative.

So in summary:

  • I'd suggest lazily evaluating what you need from a file object. See the server or extrainfo descriptor's parse_file() functions - they read from a file line by line and yield descriptors, so memory usage isn't greater than that of a single descriptor.
  • I'd also suggest replacing this helper class with helper functions that give you back python primitives. This module has several such helpers that we might be able to expand to meet your purposes.
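
As a rough sketch of the first suggestion (RouterDescriptor here is just a stand-in for whatever entry class the branch ends up with, and header/footer handling is glossed over), the generator pattern would be along the lines of...

def parse_file(descriptor_file):
  # accumulate lines for one router entry at a time, yielding each entry as
  # soon as it's complete so memory stays bounded by a single entry
  entry_lines = []

  for line in descriptor_file:
    if line.startswith("r ") and entry_lines:
      yield RouterDescriptor("".join(entry_lines))
      entry_lines = []

    entry_lines.append(line)

  if entry_lines:
    yield RouterDescriptor("".join(entry_lines))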

Thoughts?

stem/descriptor/networkstatus.py
+ :var bool is_valid: \* router is valid
+ :var bool is_guard: \* router is suitable for use as an entry guard
+ :var bool is_named: \* router is named

Interesting, I hadn't thought of doing a series of boolean flags for this. However, I'd suggest adding a Flags enum and making this a tuple instead. Users can do the same with fewer methods and also query the RouterDescriptor for all of its flags (which you can't do with these booleans). This would look like...

Flag = stem.util.enum.Enum(
  ("GUARD", "Guard"),
  ("NAMED", "Named"),
  ...
)

... then used like...

>>> my_router_descriptor = RouterDescriptor(...)
>>> print Flag.NAMED in my_router_descriptor.flags
True

>>> for flag in my_router_descriptor.flags:
...   print flag
Named
Guard
Valid

I'll hold off on reviewing the rest for the moment since the DescriptorParser might lead to quite a bit of refactoring (sorry about that!).

comment:3 follow-up: Changed 5 years ago by atagar

Oh, forgot to mention - with the flags we'll need both a 'flags' and 'unknown_flags' attribute. Anything that isn't in the Flag enum would go into 'unknown_flags'. We should also assert in the test_cached_consensus() that this is empty so we're alerted if Tor adds a new flag.
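
For illustration, a minimal sketch of that split (using a plain set of flag strings rather than assuming a particular Enum API)...

KNOWN_FLAGS = set(["Authority", "BadExit", "Exit", "Fast", "Guard", "HSDir",
                   "Named", "Running", "Stable", "Unnamed", "V2Dir", "Valid"])

def split_flags(flag_values):
  # divides a status entry's flags into ones we recognize and ones we don't
  flags = tuple(f for f in flag_values if f in KNOWN_FLAGS)
  unknown_flags = tuple(f for f in flag_values if f not in KNOWN_FLAGS)
  return flags, unknown_flags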

comment:4 in reply to: ↑ 3 Changed 5 years ago by neena

  • Status changed from needs_revision to needs_review

Replying to atagar:

stem/descriptor/__init__.py
+ import stem.descriptor.networkstatus_descriptor

Wrong import and mix of new/old usage. I'm guessing that the tests don't cover parse_file()? If it is being tested then this might be due to a phantom networkstatus_descriptor.pyc (I hate it when that happens).

Yeah, it was a phantom pyc. fixed.

+class DescriptorParser:
+ """

Needs to extend object. I'm also kinda worried about how it holds on to a copy of both _raw_content and lines. Ideally it would have neither. I realize that the Descriptor class has a _raw_content attribute for its str() method. In DescriptorParser's case though _raw_content is presently unused and it should be lazily fetched since we're dealing with *far* larger amounts of data here. My suggestion would be for the constructor to accept a file object, then read from it on demand. This will let us have constant memory usage.

Speaking of which does this now work when processing a full consensus? I'm not spotting the test we talked about for this though maybe I'm missing it...

It does. test_cached_consensus parses the full consensus. I set it to fail if it uses more than 100 MB, and it failed randomly a few times, so I raised that to 200 MB.
Though, I still don't know why :/

This style of helper class generally hurts code readability. I made the ControlLine class because controller responses are generally structured like a stack, and I wanted to avoid the numerous regexes that TorCtl uses to handle it. That said, I'd love to get rid of the ControlLine class if we had a good alternative.

Initially I started by editing the existing helper methods, but it seemed easier to do it this way at the time.

So in summary:

  • I'd suggest lazily evaluating what you need from a file object. See the server or extrainfo descriptor's parse_file() functions - they read from a file line by line and yield descriptors, so memory usage isn't greater than that of a single descriptor.

Done.

  • I'd also suggest replacing this helper class with helper functions that give you back python primitives. This module has several such helpers that we might be able to expand to meet your purposes.

Agreed, and done. I cheated by using StringIO objects in most places, but I don't think that should affect performance too much. I wouldn't mind replacing them with lists of strings, though.

Thoughts?

stem/descriptor/networkstatus.py
+ :var bool is_valid: \* router is valid
+ :var bool is_guard: \* router is suitable for use as an entry guard
+ :var bool is_named: \* router is named

Interesting, I hadn't thought of doing a series of boolean flags for this. However, I'd suggest adding a Flags enum and making this a tuple instead. Users can do the same with fewer methods and also query the RouterDescriptor for all of its flags (which you can't do with these booleans). This would look like...

Yeah... this definitely makes a lot more sense.

Replying to atagar:

Oh, forgot to mention - with the flags we'll need both a 'flags' and 'unknown_flags' attribute. Anything that isn't in the Flag enum would go into 'unknown_flags'. We should also assert in the test_cached_consensus() that this is empty so we're alerted if Tor adds a new flag.

Since we aren't doing anything special with the flags, we could just reuse the known_flags data from the NetworkStatusDocument. Folks then won't have to update stem when a new flag is added.

comment:5 follow-up: Changed 5 years ago by atagar

Hi Ravi. Just some first thoughts so far from my bus ride...

+ file_parser = lambda f: stem.descriptor.networkstatus.parse_file(f, True, "microdesc")

Very neat, I've never seen this trick before. I like it.

+ desc = stem.descriptor.networkstatus.NetworkStatusDocument(descriptor_file.read())

Again this is frontloading the read rather than passing the file object itself to the NetworkStatusDocument. This means that the memory usage is, at a minimum, as large as the whole consensus file. If we process individual consensus entries then that'll be reduced to the size of an individual consensus entry.

I realize that this is a large change but it's likely necessary to avoid swapping on some systems.

+ if " " in line:
+   keyword = line.split(" ", 1)[0]
+   if keyword == "opt":
+     keyword = line.split(" ", 2)[1]
+ else: keyword = line.strip()

I probably would have done the same though, on reflection, this can actually be reduced to...

if line.startswith("opt "):
  line = line[4:]

keyword = line.split(" ", 1)[0].strip()

... this is because split doesn't need the whitespace to have a first element...

>>> line = "foobar"
>>> line.split(" ", 1)[0]
'foobar'

+try:
+  from cStringIO import StringIO
+except:
+  from StringIO import StringIO

Out of curiosity when is cStringIO unavailable? Iirc we're using it elsewhere too so if it, say, isn't available on python 2.5 then that would be good to know.

+Flag = stem.util.enum.Enum(
+  ("AUTHORITY", "Authority"),
+  ("BADEXIT", "BadExit"),
+  ("EXIT", "Exit"),
+  ("FAST", "Fast"),
+  ("GUARD", "Guard"),
+  ("HSDIR", "HSDir"),
+  ("NAMED", "Named"),
+  ("RUNNING", "Running"),
+  ("STABLE", "Stable"),
+  ("UNNAMED", "Unnamed"),
+  ("V2DIR", "V2Dir"),
+  ("VALID", "Valid"),
+  )
+
+Flag = stem.util.enum.Enum(*[(flag.upper(), flag) for flag in ["Authority", "BadExit", "Exit", "Fast", ...

Duplicate assignment. Personally I think the first is more readable though up to you. If you like the second better then please put the flag list into a temporary variable first to avoid having a single enormous line.

Also, minor nitpick: when doing a multi-line block please place the last parentheses at the same indentation as the block. Ie...

Flag = stem.util.enum.Enum(
  ("AUTHORITY", "Authority"),
  ...
)

... mostly just for consistency and to give a little better visual cue that it's the end of the block.

comment:6 Changed 5 years ago by atagar

  • Status changed from needs_review to needs_revision

Bit more feedback from looking it over this morning. I'm gonna swap this back into needs_revision until these are addressed.

stem/descriptor/__init__.py
# Metrics descriptor handling. These contain a single descriptor per file.

This comment of mine is no longer accurate with this addition. Maybe this would now be a little nicer as chained iterators...

def parse_file(path, descriptor_file):
  ...

  filename, file_parser = os.path.basename(path), None

  if filename == "cached-descriptors":
    file_parser = stem.descriptor.server_descriptor.parse_file
  elif filename == "cached-extrainfo":
    file_parser = stem.descriptor.extrainfo_descriptor.parse_file
  else:
    # check if this is a metrics descriptor
    metrics_header_match = re.match("^@type (\S+) (\d+)\.(\d+)$", first_line)

    if metrics_header_match:
      desc_type, major_version, minor_version = metrics_header_match.groups()
      file_parser = lambda f: _parse_metrics_file(desc_type, int(major_version), int(minor_version), f)

  if file_parser:
    for desc in file_parser(descriptor_file):
      desc._set_path(path)
      yield desc

    return
  else:
    raise TypeError("Unable to determine the descriptor's type. filename: '%s', first line: '%s'" % (filename, first_line))

def _parse_metrics_file(descriptor_type, major_version, minor_version, descriptor_file):
  # Parses descriptor files from metrics, yielding individual descriptors. This
  # throws a TypeError if the descriptor_type or version isn't recognized.

  if descriptor_type == "server-descriptor" and major_version == 1:
    yield stem.descriptor.server_descriptor.RelayDescriptor(descriptor_file.read())
  elif descriptor_type == "bridge-server-descriptor" and major_version == 1:
    yield stem.descriptor.server_descriptor.BridgeDescriptor(descriptor_file.read())
  ... etc...
  elif descriptor_type in ("network-status-consensus-3", "network-status-vote-3") and major_version == 1:
    # TODO: this upfront read still makes me twitch
    desc = stem.descriptor.networkstatus.NetworkStatusDocument(descriptor_file.read())
    for desc in desc.router_descriptors:
      yield desc
  else:
    raise TypeError("Unrecognized metrics descriptor, type: %s, major version: %i, minor version: %i" % (descriptor_type, major_version, minor_version))
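
As a usage sketch (the path here is made up), the chained generators could then be consumed without ever holding more than one descriptor in memory...

path = "/tmp/cached-descriptors"

with open(path) as descriptor_file:
  for desc in parse_file(path, descriptor_file):
    print desc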

+def _peek_keyword(descriptor_file):
...
+  last_position = descriptor_file.tell()
+  line = descriptor_file.readline()
+  if not line: return None

Let's use _peek_line() here. We could probably replace this function with...

def _peek_keyword(descriptor_file):
  ...
  line = _peek_line(descriptor_file)

  if line.startswith("opt "):
    line = line[4:]

  if line:
    return line.split(' ', 1)[0]
  else:
    return None
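
For reference, a _peek_line() along those lines would just restore the file position after reading (a sketch of the tell/readline pattern above, not necessarily the exact helper)...

def _peek_line(descriptor_file):
  # returns the next line without advancing our position in the file
  last_position = descriptor_file.tell()
  line = descriptor_file.readline()
  descriptor_file.seek(last_position)

  return line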

+def _read_keyword_line(keyword, descriptor_file, validate = True, optional = False):
...
+ :param str keyword: keyword the line must begin with
+ :param bool optional: if the current line must begin with the given keyword
+ :param bool validate: validation is enabled

Missing a parameter and out of order. The opt handling in this method is more complicated than it needs to be - like _peek_keyword(), let's just strip it off the line variable if it's there.

+  elif line.startswith("opt"):
+    # if this is something new we don't recognize
+    # ignore it and go to the next line
+    descriptor_file.readline()
+    return _read_keyword_line(keyword, descriptor_file, optional)

This doesn't look right. I think you meant "elif optional:"; having this start with 'opt' really doesn't mean anything.

Also, when checking for an 'opt' prefix please do a startswith() check for 'opt ' so you don't match against, say, our new "optometrist" keyword. :P
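
For instance (a quick interpreter check; the keywords and values are made up)...

>>> "optometrist 20:20".startswith("opt")
True
>>> "optometrist 20:20".startswith("opt ")
False
>>> "opt fingerprint ABCD".startswith("opt ")
True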

+def _read_keyword_line(keyword, descriptor_file, validate = True, optional = False):
+ """
+ Returns the rest of the line if the first keyword matches the given keyword. If
+ it doesn't, a ValueError is raised if optional and validate are True, if
+ not, None is returned.
+
+ Respects the opt keyword and returns the next keyword if the first is "opt".

... vs...

+def _read_keyword_line_str(keyword, lines, validate = True, optional = False):
+ """
+ Returns the rest of the line if the first keyword matches the given keyword. If
+ it doesn't, a ValueError is raised if optional and validate are True, if
+ not, None is returned.
+
+ Respects the opt keyword and returns the next keyword if the first is "opt".

These pydocs look really identical to me. How do these functions differ? Can we get rid of _read_keyword_line_str()?

+def _read_until_keywords(keywords, descriptor_file, inclusive = False, ignore_first = False):
...
+ :param bool ignore_first: doesn't check if the first line read has one of the given keywords

This new parameter seems oddly specific. The problem that you're trying to solve is that a router status entry starts with a 'r ' line and continues until... well, that part isn't defined in the spec. At present you're hardcoding our check to be for...

+_router_desc_end_kws = ["r", "bandwidth-weights", "directory-footer", "directory-signature"]

This strikes me as being both a bit hacky and vulnerable to breakage if tor adds a new footer value. I'd suggest asking for clarification from Nick - this looks like a spec bug to me.

comment:7 in reply to: ↑ 5 Changed 5 years ago by neena

  • Status changed from needs_revision to needs_review

Replying to atagar:

+ desc = stem.descriptor.networkstatus.NetworkStatusDocument(descriptor_file.read())

Again this is frontloading the read rather than passing the file object itself to the NetworkStatusDocument. This means that the memory usage is, at a minimum, as large as the whole consensus file. If we process individual consensus entries then that'll be reduced to the size of an individual consensus entry.

I realize that this is a large change but it's likely necessary to avoid swapping on some systems.

Eek. fixed.
As it is right now, NetworkStatusDocument parses descriptor data, while stem.descriptor.networkstatus.parse_file handles the router descriptors individually. So, if there is ever a need to parse a NetworkStatusDocument from a string, we can use the class directly.

I probably would have done the same though, on reflection, this can actually be reduced to...

if line.startswith("opt "):
  line = line[4:]

keyword = line.split(" ", 1)[0].strip()

... this is because split doesn't need the whitespace to have a first element...

Fixed.

+try:
+  from cStringIO import StringIO
+except:
+  from StringIO import StringIO

Out of curiosity when is cStringIO unavailable? Iirc we're using it elsewhere too so if it, say, isn't available on python 2.5 then that would be good to know.

From what I understand, cStringIO is unavailable on 'some platforms'. I'm not sure which those are.

Duplicate assignment. Personally I think the first is more readable though up to you. If you like the second better then please put the flag list into a temporary variable first to avoid having a single enormous line.

Fixed.

Also, minor nitpick: when doing a multi-line block please place the last parentheses at the same indentation as the block. Ie...

and fixed.

Replying to atagar:

stem/descriptor/__init__.py
# Metrics descriptor handling. These contain a single descriptor per file.

This comment of mine is no longer accurate with this addition. Maybe this would now be a little nicer as chained iterators...

It is accurate now. I modified stem.descriptor.networkstatus.parse_file to return NetworkStatusDocument objects instead of iterating over the router descriptors.

Let's use _peek_line() here. We could probably replace this function with...

done

Missing a parameter and out of order. The opt handling in this method is more complicated than it needs to be - like _peek_keyword(), let's just strip it off the line variable if it's there.

ugh, fixed.

+  elif line.startswith("opt"):
+    # if this is something new we don't recognize
+    # ignore it and go to the next line
+    descriptor_file.readline()
+    return _read_keyword_line(keyword, descriptor_file, optional)

This doesn't look right. I think you meant "elif optional:"; having this start with 'opt' really doesn't mean anything.

If it starts with "opt ", it should mean it's something from the future that the parser doesn't understand yet. We should be ignoring that and using the next line.

Also, when checking for an 'opt' prefix please do a startswith() check for 'opt ' so you don't match against, say, our new "optometrist" keyword. :P

Eek, yes, I thought I fixed that.

These pydocs look really identical to me. How do these functions differ? Can we get rid of _read_keyword_line_str()?

I added _read_keyword_line_str to handle lists of strings, because I didn't want to use StringIO initially. We could get rid of it and convert everything to use StringIO, but I'd rather leave it as it is and eventually convert everything to use lists of strings internally.

+def _read_until_keywords(keywords, descriptor_file, inclusive = False, ignore_first = False):
...
+ :param bool ignore_first: doesn't check if the first line read has one of the given keywords

This new parameter seems oddly specific. The problem that you're trying to solve is that a router status entry starts with a 'r ' line and continues until... well, that part isn't defined in the spec. At present you're hardcoding our check to be for...

+_router_desc_end_kws = ["r", "bandwidth-weights", "directory-footer", "directory-signature"]

This strikes me as being both a bit hacky and vulnerable to breakage if tor adds a new footer value. I'd suggest asking for clarification from Nick - this looks like a spec bug to me.

It isn't exactly hacky. From the order in the dir-spec, the first line after the router descriptors will be one of:

  1. directory-footer if the consensus method is >= 9
  2. bandwidth-weights if it's a consensus (the spec says at most once for consensuses)
  3. directory-signature otherwise

If tor adds a new footer value, it would start with the "opt" keyword. This would be ignored because of the "elif line.startswith("opt"):" branch.

I also committed the microdescriptor flavoured consensus parsing code. Should I create a separate ticket for that, or will this do?

comment:8 follow-up: Changed 5 years ago by atagar

You have impeccable timing to push code changes right when I get to work. :P

If it starts with "opt ", it should mean it's something from the future that the parser doesn't understand yet.

Where is that in the spec? From what I can tell 'opt ' is a historical thing that is supposed to be completely ignored:
https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec-v2.txt#l154

At least for server and extrainfo descriptors it's fine for the parser to see new keywords. They just get added to the unrecognized lines listing.
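
For illustration, a minimal sketch of that approach (the keyword set and function here are hypothetical, not stem's actual helpers)...

RECOGNIZED_KEYWORDS = set(["r", "s", "v", "w", "p"])

def _split_unrecognized(entry_lines):
  # keeps lines whose keyword we don't recognize rather than dropping them
  unrecognized_lines = []

  for line in entry_lines:
    keyword = line.split(" ", 1)[0]

    if keyword not in RECOGNIZED_KEYWORDS:
      unrecognized_lines.append(line)

  return unrecognized_lines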

comment:9 in reply to: ↑ 8 Changed 5 years ago by neena

Replying to atagar:

You have impeccable timing to push code changes right when I get to work. :P

heh

If it starts with "opt ", it should mean it's something from the future that the parser doesn't understand yet.

Where is that in the spec? From what I can tell 'opt ' is a historical thing that is supposed to be completely ignored:
https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec-v2.txt#l154

I seem to have misread. ugh. Fixed now.

Alternatively, if something was added to the protocol, would that still be a v3 document?

comment:10 Changed 5 years ago by atagar

Alternatively, if something was added to the protocol, would that still be a v3 document?

The protocol allows for blank lines and new additions, so as I understand it, new additions to the consensus do not require a version bump...
https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec-v2.txt#l350

comment:11 follow-up: Changed 5 years ago by atagar

Hi Ravi. Just finished making the changes that I wanted to for network status document handling. Chief changes are...

  • addition of unit tests... lots of them :P
  • rewrite of the parsing to address some misunderstandings and make it similar to how the other descriptors are parsed
  • addition of v2 document support

Commits can be found in the document-parsing branch of my repo...
https://gitweb.torproject.org/user/atagar/stem.git/shortlog/refs/heads/document-parsing

This is... an immense change...
https://gitweb.torproject.org/user/atagar/stem.git/commitdiff/31eb0e1?hp=ac5be8c

If you would like to review it then I'd welcome your comments. However, it took me weeks to get through the diff when you sent it to me for review, so I'd certainly understand if this is too big to reasonably go through. Up to you.

Cheers! -Damian

comment:12 in reply to: ↑ 11 Changed 5 years ago by neena

  • Status changed from needs_review to accepted

Replying to atagar:

Hi Ravi. Just finished making the changes that I wanted to for network status document handling. Chief changes are...

  • addition of unit tests... lots of them :P
  • rewrite of the parsing to address some misunderstandings and make it similar to how the other descriptors are parsed
  • addition of v2 document support

Commits can be found in the document-parsing branch of my repo...
https://gitweb.torproject.org/user/atagar/stem.git/shortlog/refs/heads/document-parsing

I had been following them for a while, up until about a week ago. It looked good, AFAIR.

This is... an immense change...
https://gitweb.torproject.org/user/atagar/stem.git/commitdiff/31eb0e1?hp=ac5be8c

If you would like to review it then I'd welcome your comments. However, it took me weeks to get through the diff when you sent it to me for review, so I'd certainly understand if this is too big to reasonably go through. Up to you.

I'm travelling and won't be able to review it until the middle of next week. I think it's best if you merge it now; I will suggest modifications later, when I'm back home.

comment:13 Changed 5 years ago by atagar

  • Resolution set to implemented
  • Status changed from accepted to closed

Great! Pushing, improvements welcome. The various Descriptor subclasses have a lot of redundancy which we should probably clean up at some point...
