Wrong import and mix of new/old usage. I'm guessing that the tests don't cover parse_file()? If it is being tested then this might be due to a phantom networkstatus_descriptor.pyc (I hate it when that happens).
Please end with "else: return None". I know this is the default behavior when there isn't a return, but explicit is better than implicit. :)
+class DescriptorParser:
"""
Helper class to parse documents.
:var str line: current line being parsed
:var list lines: list of remaining lines to be parsed
"""
Needs to extend object. I'm also kinda worried about how it holds on to a copy of both _raw_content and lines. Ideally it would have neither. I realize that the Descriptor class has a _raw_content attribute for its str() method. In DescriptorParser's case though _raw_content is presently unused and it should be lazily fetched since we're dealing with far larger amounts of data here. My suggestion would be for the constructor to accept a file object, then read from it on demand. This will let us have constant memory usage.
Speaking of which does this now work when processing a full consensus? I'm not spotting the test we talked about for this though maybe I'm missing it...
This class reminds me a lot of the ControlLine class. I didn't need this style of helper for the server or extrainfo descriptor classes... are you sure that it's really necessary here? The descriptors do a lot of this (first pass parsing, 'opt' prefix handling, etc) via the _get_descriptor_components() helper, and your read_block() method reminds me a bit of the _get_pseudo_pgp_block() helper.
This style of helper class generally hurts code readability. I made the ControlLine class because controller responses are generally structured like a stack, and I wanted to avoid the numerous regexes that TorCtl uses to handle it. That said, I'd love to get rid of the ControlLine class if we had a good alternative.
So in summary:
I'd suggest lazily evaluating what you need from a file object. See the server or extrainfo descriptor's parse_file() functions - they read from a file line by line and yield descriptors so its memory usage isn't greater than a single descriptor.
I'd also suggest replacing this helper class with helper functions that give you back python primitives. This module has several such helpers that we might be able to expand to meet your purposes.
Thoughts?
stem/descriptor/networkstatus.py
:var bool is_valid: * router is valid
:var bool is_guard: * router is suitable for use as an entry guard
:var bool is_named: * router is named
Interesting, I hadn't thought of doing a series of boolean flags for this. However, I'd suggest adding a Flags enum and making this a tuple instead. Users can do the same with fewer methods and also query the RouterDescriptor for all of its flags (which you can't do with these booleans). This would look like...
{{{
Flag = stem.util.enum.Enum(
  ("GUARD", "Guard"),
  ("NAMED", "Named"),
  ...
)
}}}
... then used like...
{{{
>>> my_router_descriptor = RouterDescriptor(...)
>>> print Flag.NAMED in my_router_descriptor.flags
True
>>> for flag in my_router_descriptor.flags:
...   print flag
Named
Guard
Valid
}}}
I'll hold off on reviewing the rest for the moment since the DescriptorParser might lead to quite a bit of refactoring (sorry about that!).
Oh, forgot to mention - with the flags we'll need both a 'flags' and 'unknown_flags' attribute. Anything that isn't in the Flag enum would go into 'unknown_flags'. We should also assert in the test_cached_consensus() that this is empty so we're alerted if Tor adds a new flag.
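The known/unknown split could be as simple as the following sketch (the KNOWN_FLAGS tuple and partition_flags helper are hypothetical names for illustration, not stem's API):

```python
# flags from the suggested Flag enum above
KNOWN_FLAGS = ("Authority", "BadExit", "Exit", "Fast", "Guard", "HSDir",
               "Named", "Running", "Stable", "Unnamed", "V2Dir", "Valid")

def partition_flags(s_line):
    # an "s" line looks like: "s Fast Guard Running Valid"
    all_flags = s_line.split()[1:]
    flags = tuple(f for f in all_flags if f in KNOWN_FLAGS)
    unknown_flags = tuple(f for f in all_flags if f not in KNOWN_FLAGS)
    return flags, unknown_flags
```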
import stem.descriptor.networkstatus_descriptor
Wrong import and mix of new/old usage. I'm guessing that the tests don't cover parse_file()? If it is being tested then this might be due to a phantom networkstatus_descriptor.pyc (I hate it when that happens).
Yeah, it was a phantom pyc. fixed.
+class DescriptorParser:
"""
Needs to extend object. I'm also kinda worried about how it holds on to a copy of both _raw_content and lines. Ideally it would have neither. I realize that the Descriptor class has a _raw_content attribute for its str() method. In DescriptorParser's case though _raw_content is presently unused and it should be lazily fetched since we're dealing with far larger amounts of data here. My suggestion would be for the constructor to accept a file object, then read from it on demand. This will let us have constant memory usage.
Speaking of which does this now work when processing a full consensus? I'm not spotting the test we talked about for this though maybe I'm missing it...
It does. The test_cached_consensus test parses the full consensus. I set it to fail if it uses more than 100 MB, and it failed randomly a few times, so I raised that to 200 MB.
Though, I still don't know why :/
This style of helper class generally hurts code readability. I made the ControlLine class because controller responses are generally structured like a stack, and I wanted to avoid the numerous regexes that TorCtl uses to handle it. That said, I'd love to get rid of the ControlLine class if we had a good alternative.
Initially I started by editing the existing helper methods, but it seemed easier to do this instead.
So in summary:
I'd suggest lazily evaluating what you need from a file object. See the server or extrainfo descriptor's parse_file() functions - they read from a file line by line and yield descriptors so its memory usage isn't greater than a single descriptor.
Done.
I'd also suggest replacing this helper class with helper functions that give you back python primitives. This module has several such helpers that we might be able to expand to meet your purposes.
Agreed, and done. I cheated by using StringIO objects in most places, but I don't think that should affect performance too much. I wouldn't mind replacing them with lists of strings though.
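For reference, the StringIO trick is just wrapping already-read content so the file-oriented helpers can consume it. A minimal sketch (using python 3's io.StringIO here; the patch itself uses the cStringIO/StringIO fallback):

```python
from io import StringIO

# content that was already read into memory, e.g. a single descriptor's text
content = "r router1 ABCD 10.0.0.1\ns Fast Running\n"

# wrapping it in StringIO lets existing file-oriented helpers consume it
descriptor_file = StringIO(content)
first_line = descriptor_file.readline()
```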
Thoughts?
stem/descriptor/networkstatus.py
:var bool is_valid: * router is valid
:var bool is_guard: * router is suitable for use as an entry guard
:var bool is_named: * router is named
Interesting, I hadn't thought of doing a series of boolean flags for this. However, I'd suggest adding a Flags enum and making this a tuple instead. Users can do the same with fewer methods and also query the RouterDescriptor for all of its flags (which you can't do with these booleans). This would look like...
Yeah... this definitely makes a lot more sense.
Oh, forgot to mention - with the flags we'll need both a 'flags' and 'unknown_flags' attribute. Anything that isn't in the Flag enum would go into 'unknown_flags'. We should also assert in the test_cached_consensus() that this is empty so we're alerted if Tor adds a new flag.
Since we aren't doing anything special with the flags, we could just reuse the known_flags data from the NetworkStatusDocument. Folks then won't have to update stem when a new flag is added.
Again this is frontloading the read rather than passing the file object itself to the NetworkStatusDocument. This means that the memory usage is, at a minimum, as large as the whole consensus file. If we process individual consensus entries then that'll be reduced to the size of an individual consensus entry.
I realize that this is a large change but it's likely necessary to avoid swapping on some systems.
if " " in line:
keyword = line.split(" ", 1)[0]
if keyword == "opt":
keyword = line.split(" ", 2)[1]
else: keyword = line.strip()
I probably would have done the same though, on reflection, this can actually be reduced to...
{{{
if line.startswith("opt "):
  line = line[4:]

keyword = line.split(" ", 1)[0].strip()
}}}
... this is because split doesn't need the whitespace to have a first element...
{{{
>>> line = "foobar"
>>> line.split(" ", 1)[0]
'foobar'
}}}
+try:
from cStringIO import StringIO
+except:
from StringIO import StringIO
Out of curiosity when is cStringIO unavailable? Iirc we're using it elsewhere too so if it, say, isn't available on python 2.5 then that would be good to know.
+Flag = stem.util.enum.Enum(
("AUTHORITY", "Authority"),
("BADEXIT", "BadExit"),
("EXIT", "Exit"),
("FAST", "Fast"),
("GUARD", "Guard"),
("HSDIR", "HSDir"),
("NAMED", "Named"),
("RUNNING", "Running"),
("STABLE", "Stable"),
("UNNAMED", "Unnamed"),
("V2DIR", "V2Dir"),
("VALID", "Valid"),
)
+Flag = stem.util.enum.Enum(*[(flag.upper(), flag) for flag in ["Authority", "BadExit", "Exit", "Fast", ...
Duplicate assignment. Personally I think the first is more readable though up to you. If you like the second better then please put the flag list into a temporary variable first to avoid having a single enormous line.
Also, minor nitpick: when doing a multi-line block please place the last parentheses at the same indentation as the block. Ie...
{{{
Flag = stem.util.enum.Enum(
  ("AUTHORITY", "Authority"),
  ...
)
}}}
... mostly just for consistency and to give a little better visual cue that it's the end of the block.
:param str keyword: keyword the line must begin with
:param bool optional: if the current line must begin with the given keyword
:param bool validate: validation is enabled
Missing a parameter and out of order. The opt handling in this method is more complicated than it needs to be - like _peek_keyword() lets just strip it off of the line variable if it's there.
:param bool ignore_first: doesn't check if the first line read has one of the given keywords
This new parameter seems oddly specific. The problem that you're trying to solve is that a router status entry starts with a 'r ' line and continues until... well, that part isn't defined in the spec. At present you're hardcoding our check to be for...
This strikes me as being both a bit hacky and vulnerable to breakage if tor adds a new footer value. I'd suggest asking for clarification from Nick - this looks like a spec bug to me.
Again this is frontloading the read rather than passing the file object itself to the NetworkStatusDocument. This means that the memory usage is, at a minimum, as large as the whole consensus file. If we process individual consensus entries then that'll be reduced to the size of an individual consensus entry.
I realize that this is a large change but it's likely necessary to avoid swapping on some systems.
Eek. fixed.
As it is right now, NetworkStatusDocument parses descriptor data. stem.descriptor.networkstatus.parse_file does handle the router descriptors individually. So, if ever there is a need to parse NetworkStatusDocuments from a string, we can use the class directly.
I probably would have done the same though, on reflection, this can actually be reduced to...
{{{
if line.startswith("opt "):
line = line[4:]
keyword = line.split(" ", 1)[0].strip()
}}}
... this is because split doesn't need the whitespace to have a first element...
Fixed.
+try:
from cStringIO import StringIO
+except:
from StringIO import StringIO
Out of curiosity when is cStringIO unavailable? Iirc we're using it elsewhere too so if it, say, isn't available on python 2.5 then that would be good to know.
From what I understand cStringIO is unavailable on 'some platforms'. Not sure what those are.
Duplicate assignment. Personally I think the first is more readable though up to you. If you like the second better then please put the flag list into a temporary variable first to avoid having a single enormous line.
Fixed.
Also, minor nitpick: when doing a multi-line block please place the last parentheses at the same indentation as the block. Ie...
and fixed.
Metrics descriptor handling. These contain a single descriptor per file.
This comment of mine is no longer accurate with this addition. Maybe this would now be a little nicer as chained iterators...
It is accurate now. I modified stem.descriptor.networkstatus.parse_file to return NetworkStatusDocument objects instead of iterating over the router descriptors.
Lets use _peek_line() here. We could probably replace this function with...
done
Missing a parameter and out of order. The opt handling in this method is more complicated than it needs to be - like _peek_keyword() lets just strip it off of the line variable if it's there.
ugh, fixed.
This doesn't look right. I think you meant "elif optional:", having this start with 'opt' really doesn't mean anything.
If it starts with "opt ", it should mean it's something from the future that the parser doesn't understand yet. We should be ignoring that and using the next line.
Also, when checking for an 'opt' prefix please do a startswith() check for 'opt ' so you don't match against, say, our new "optometrist" keyword. :P
Eek, yes, I thought I fixed that.
These pydocs look really identical to me. How do these functions differ? Can we get rid of _read_keyword_line_str()?
I added _read_keyword_line_str to handle lists of strings, because I didn't want to use StringIO initially. We could get rid of it and convert everything to use StringIO. But, I'd rather leave it as it is, and eventually convert everything to use lists of strings internally.
:param bool ignore_first: doesn't check if the first line read has one of the given keywords
This new parameter seems oddly specific. The problem that you're trying to solve is that a router status entry starts with a 'r ' line and continues until... well, that part isn't defined in the spec. At present you're hardcoding our check to be for...
This strikes me as being both a bit hacky and vulnerable to breakage if tor adds a new footer value. I'd suggest asking for clarification from Nick - this looks like a spec bug to me.
It isn't exactly hacky. From the order in the dir-spec, the first line after the router descriptors will be:
 * directory-footer if consensus method >= 9
 * bandwidth-weights if it's a consensus (at most once for a consensus, it says)
 * directory-signature otherwise
If tor adds a new footer value, it would start with the "opt" keyword. This would be ignored because of " elif line.startswith("opt"):".
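Putting that reasoning together, the end-of-entries check amounts to something like this hypothetical helper (FOOTER_KEYWORDS and is_footer_line are illustrative names, not the patch's actual code):

```python
# keywords that can begin the footer section, per the dir-spec ordering
FOOTER_KEYWORDS = ("directory-footer", "bandwidth-weights", "directory-signature")

def is_footer_line(line):
    # unrecognized future keywords carry an "opt " prefix and are skipped
    # rather than treated as the start of the footer
    if line.startswith("opt "):
        return False

    return line.split(" ", 1)[0] in FOOTER_KEYWORDS
```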
I also committed the microdescriptor flavoured consensus parsing code. Should I create a separate ticket for that? or will this do?
If you would like to review it then I'd welcome your comments. However, it took me weeks to get through the diff when you sent it to me for review, so I'd certainly understand if this is too big to reasonably go through. Up to you.
If you would like to review it then I'd welcome your comments. However, it took me weeks to get through the diff when you sent it to me for review, so I'd certainly understand if this is too big to reasonably go through. Up to you.
I'm travelling and I won't be able to review it until middle of next week. I think it's best if you merge it now. I will suggest modifications later when I'm back home.