Currently we're lenient about how we parse descriptors, in some cases. For example, if we only need the first four values on a line, but the spec mandates there be 6 we presently do:
{{{
if len(items) >= 4:
    nickname = items[1]
    ip = items[2].strip('[]')
    orport = int(items[3])
}}}
We should ensure the line is well-formed before accepting the values. Something closer to:
{{{
if len(items) == 6:
    nickname = items[1]
    ip = items[2].strip('[]')
    orport = int(items[3])
else:
    logging.warn("Parsed malformed descriptor 'router' line. "
                 "Expected 6 items, but line has %d" % len(items))
    logging.debug("  Line: %s" % items)
}}}
(Logging sold separately; untested code.)
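For illustration, the stricter check above can be sketched as a standalone function. This is a hedged sketch, not BridgeDB's actual code: `parse_router_line` is a hypothetical name, and the six-field layout ("router" nickname address ORPort SOCKSPort DirPort) is the one the ticket assumes.

```python
import logging

def parse_router_line(line):
    # Hypothetical helper sketching the stricter check proposed above:
    # reject the line unless it has exactly the six mandated fields.
    items = line.split()
    if len(items) != 6:
        logging.warning("Parsed malformed descriptor 'router' line. "
                        "Expected 6 items, but line has %d", len(items))
        logging.debug("  Line: %s", line)
        return None
    nickname = items[1]
    ip = items[2].strip('[]')
    orport = int(items[3])
    return (nickname, ip, orport)

print(parse_router_line("router Unnamed 128.31.0.34 9101 0 9131"))
# → ('Unnamed', '128.31.0.34', 9101)
print(parse_router_line("router Unnamed 128.31.0.34 9101"))
# → None
```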
Possibly naively, I think another option is that we scrap the current parsers and dep on stem. I have neither reviewed stem's parsers nor have I looked at the api...but I think it sounds like a reasonable idea. :)
atagar, thoughts? Basically, we need to parse bridge descriptors, extra-info descriptors, and networkstatuses and retrieve ~10% of the information they contain. I'm under the impression stem has support for all tor docs, but I may be wrong. Do you think this is a good application for stem, and is it worth pulling in the entire lib just for the parsers? I really appreciate any input you have!
> atagar, thoughts? Basically, we need to parse bridge descriptors, extra-info descriptors, and networkstatuses and retrieve ~10% of the information they contain.
Hi sysrqb. Yup, this would be a perfect use case for stem. I assume by 'bridge descriptors' you mean 'bridge variant of server descriptors' in which case yup, we have support for all of those...
If you have need of a descriptor parser then I definitely think it would be worth snagging stem. Stem does rather strict validation of the content (which you're looking for) and has quite a bit of tests around these modules. Also, I keep it up to date with the specs.
Oh, it just crossed my mind that the 'bridge' variants might not be what you're looking for. Those handle the sanitized bridge descriptor output on metrics. You might want the non-scrubbed RelayDescriptor and RelayExtraInfoDescriptor counterparts.
Arg. When replying to this two days ago, I did a stupid thing where I Ctrl-X'd the text I'd written to paste it into my editor, and then I highlighted something else and lost it.
tl;dr rewrite:
BridgeDB should use stem.
Most of BridgeDB's descriptor parsing code should be completely rewritten. (see the later point on unittests!)
Bridge descriptors are not well documented, and in some places not clearly spec'd. This should also be fixed.
There is not yet, as far as I know, an easy way to generate fake unsanitised bridge descriptors which would actually pass as real, current unsanitised bridge descriptors.
If atagar is amenable to such, we should create classes for bridge descriptors (or possibly modify the existing ones with some sort of is_sanitised_version=False attribute).
We need a real unittest framework. I started looking into a library which Lunar pointed me at, called sure, to see if it plays well with twisted.trial. Especially for checking that descriptor parsing is going according to plan, sure seems really nice. (This should possibly be another ticket.)
Rather than merging, I am going to rename #9462 (moved) and set it as this ticket's parent. To make it more concrete, should this ticket then be about "BridgeDB should use stem for parsing descriptors according to torspec"?
Trac: Status: new to needs_revision Keywords: N/A deleted; tor-bridge, stem, refactoring, descriptor, bridgedb added Cc: atagar to atagar, isis@torproject.org Parent: N/A to #9462 (moved)
Additionally, for this ticket, there is a whole mess of code, mostly in bridgedb.Main and bridgedb.Bridges which deals with descriptor parsing which should be killed with fire. Or, nuked from orbit, to use weasel's terms.
This is something we've discussed several times but are you sure BridgeNetworkStatusDocument is what you want? Your earlier comment indicates that these files are simply a list of router status entries (no consensus header or footer). If that's the case then you could simplify this whole function to be something like...
{{{
def parseNetworkStatusFile(filename, validate=True):
    with open(filename) as descriptor_file:
        return list(stem.descriptor.router_status_entry._parse_file(
            descriptor_file,
            validate,
            entry_class = stem.descriptor.router_status_entry.RouterStatusEntryV2,
        ))
}}}
This will provide you a list of RouterStatusEntryV2 entries. I think you really want RouterStatusEntryV3, but your current code is providing you RouterStatusEntryV2 instances (maybe a bug).
{{{
routers = [router for router in document]
return routers
}}}
This could also be...
{{{
return list(document)
}}}
{{{
for descriptor in descriptors:
    router = descriptors.pop(descriptors.index(descriptor))
}}}
This has a couple issues...
You're modifying a list while you iterate over it. This is a bad idea in just about any language (iirc java would raise a ConcurrentModificationException).
Doing this is both pointless and very, very inefficient. You're doing an O(n) iteration over the descriptors, then an O(n) lookup to find the index, then another O(n) modification. In other words this is O(n^2) twice.
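A minimal runnable demonstration of the first issue (this toy snippet is mine, not from the ticket): popping from a list while iterating over it silently skips elements.

```python
# Mutating a list while iterating over it: the iterator's index keeps
# advancing while the list shrinks, so every other element is skipped.
items = [1, 2, 3, 4]
for item in items:
    items.pop(items.index(item))

# The loop only ever visited 1 and 3; 2 and 4 were never seen.
print(items)  # → [2, 4]
```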
This would be both simpler and more efficient as...
{{{
while descriptors:
    router = descriptors.pop()
}}}
That said, this whole function can be simplified as...
{{{
def deduplicate(descriptors):
    # Dictionary of {fingerprint: descriptor} for the newest descriptor with a
    # given fingerprint.
    fp_to_descriptor = {}
    duplicates = []

    for desc in descriptors:
        if desc.fingerprint in fp_to_descriptor:
            conflict = fp_to_descriptor[desc.fingerprint]

            if desc.published > conflict.published:
                fp_to_descriptor[desc.fingerprint] = desc
                duplicates.append(conflict)
            else:
                duplicates.append(desc)
        else:
            fp_to_descriptor[desc.fingerprint] = desc

    return fp_to_descriptor.values()
}}}
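The keep-the-newest behaviour of this approach is easy to sanity-check with stand-in objects. `Desc` below is a hypothetical stand-in carrying only the two attributes the logic reads, not a stem class:

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for a descriptor; real stem descriptors also have
# .fingerprint and .published attributes, which is all this logic uses.
Desc = namedtuple("Desc", ["fingerprint", "published"])

def deduplicate(descriptors):
    # Same algorithm as the sketch above: keep the newest descriptor per
    # fingerprint, collecting the rest as duplicates.
    fp_to_descriptor = {}
    duplicates = []
    for desc in descriptors:
        if desc.fingerprint in fp_to_descriptor:
            conflict = fp_to_descriptor[desc.fingerprint]
            if desc.published > conflict.published:
                fp_to_descriptor[desc.fingerprint] = desc
                duplicates.append(conflict)
            else:
                duplicates.append(desc)
        else:
            fp_to_descriptor[desc.fingerprint] = desc
    return list(fp_to_descriptor.values())

old = Desc("A" * 40, datetime(2014, 1, 1))
new = Desc("A" * 40, datetime(2014, 6, 1))
print(deduplicate([old, new]) == [new])  # → True
```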
{{{
descriptors.extend([router for router in document])
}}}
Again, there's no reason to do list comprehension to just convert an iterable to a list. Calling...
Hi isis, here's a quick code review for the file you pointed me toward (descriptors.py)...
{{{
if not descriptors.startswith("published"):
    precise = datetime.datetime.now().isoformat(sep=chr(0x20))
    timestamp = precise.rsplit('.', 1)[0]
    descriptors = "published {t}\n{d}".format(t=timestamp, d=descriptors)
}}}

Interesting, any reason to do 'sep=chr(0x20)' rather than "sep=' '"?
tl;dr: The datetime module is braindamaged. Or Python2.x string types are braindamaged. Or both.
This was done to fix a unicode error resulting on some of the test boxes, via the implicit concatenation/formatting of a string (datetime only returns ascii strings) with what might be a unicode string on some systems. Though which system was having the error, and whether it was ponticum (the BridgeDB production server), I do not remember.
Regardless, this would be simpler as...
{{{
if not descriptors.startswith("published"):
published_line = "published %s\n" % datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
descriptors = published_line + descriptors
}}}
Using strftime() certainly looks nicer! I'll have to try that and see if I hit the same unicode problem again. Or just fix the problem that the bridge networkstatus documents don't have a published line. :)
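For a fixed timestamp, the two spellings produce the same string, which is easy to check (the timestamp below is arbitrary, chosen so the comparison is deterministic):

```python
import datetime

# A fixed timestamp so both spellings are compared deterministically.
dt = datetime.datetime(2014, 8, 15, 12, 30, 5, 123456)

# Original approach: isoformat with a space separator, strip microseconds.
via_isoformat = dt.isoformat(sep=chr(0x20)).rsplit('.', 1)[0]

# Suggested approach: strftime directly.
via_strftime = dt.strftime('%Y-%m-%d %H:%M:%S')

print(via_isoformat)                  # → 2014-08-15 12:30:05
print(via_isoformat == via_strftime)  # → True
```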
This is something we've discussed several times but are you sure BridgeNetworkStatusDocument is what you want? Your earlier comment indicates that these files are simply a list of router status entries (no consensus header or footer). If that's the case then you could simplify this whole function to be something like...
{{{
def parseNetworkStatusFile(filename, validate=True):
with open(filename) as descriptor_file:
return list(stem.descriptor.router_status_entry._parse_file(
descriptor_file,
validate,
entry_class = stem.descriptor.router_status_entry.RouterStatusEntryV2,
))
}}}
This will provide you a list of RouterStatusEntryV2 entries. I think you really want RouterStatusEntryV3, but your current code is providing you RouterStatusEntryV2 instances (maybe a bug).
Either way (with your code or mine), Stem is getting angry about the flag-thresholds line in the networkstatus file:
{{{
ValueError: Router status entries (v3) must have a 'r' line:
flag-thresholds stable-uptime=613624 stable-mtbf=2488616 fast-speed=15000 guard-wfu=98.000% guard-tk=691200 guard-bw-inc-exits=55000 guard-bw-exc-exits=55000 enough-mtbf=1 ignoring-advertised-bws=0
}}}

It stops getting angry if the networkstatus document isn't validated, i.e. `parseNetworkStatusFile(filename, validate=False)`. (We should probably just leave it at `False` for this one, since [there isn't a BridgeAuth signature](https://trac.torproject.org/projects/tor/ticket/12254) or anything else, other than sound formatting, to validate anyway.)

----

> {{{
> routers = [router for router in document]
> return routers
> }}}
>
> This could also be...
>
> {{{
> return list(document)
> }}}

Oh, fancy! Thanks, atagar! I've changed it to your version.

----

> {{{
> for descriptor in descriptors:
>     router = descriptors.pop(descriptors.index(descriptor))
> }}}
>
> This has a couple issues...
>
> * You're modifying a list while you iterate over it. This is a bad idea in just about any language (iirc java would raise a ConcurrentModificationException).
> * Doing this is both pointless and very, very inefficient. You're doing an O(n) iteration over the descriptors, then an O(n) lookup to find the index, then another O(n) modification. In other words this is O(n^2) twice.
>
> This would be both simpler and more efficient as...
>
> {{{
> while descriptors:
>     router = descriptors.pop()
> }}}
>
> That said, this whole function can be simplified as...
>
> {{{
> def deduplicate(descriptors):
>     # Dictionary of {fingerprint: descriptor} for the newest descriptor with a
>     # given fingerprint.
>     fp_to_descriptor = {}
>     duplicates = []
>
>     for desc in descriptors:
>         if desc.fingerprint in fp_to_descriptor:
>             conflict = fp_to_descriptor[desc.fingerprint]
>
>             if desc.published > conflict.published:
>                 fp_to_descriptor[desc.fingerprint] = desc
>                 duplicates.append(conflict)
>             else:
>                 duplicates.append(desc)
>         else:
>             fp_to_descriptor[desc.fingerprint] = desc
>
>     return fp_to_descriptor.values()
> }}}

This has been refactored as a mix between my original code and yours (because I wanted to keep the warnings for misbehaving OR implementations which might be submitting multiple bridge descriptors, as we likely have seen in the past):

{{{
def deduplicate(descriptors):
    """Deduplicate some descriptors, returning only the newest for each router.

    .. note:: If two descriptors for the same router are discovered, AND both
        descriptors have the **same** published timestamp, then the router's
        fingerprint WILL BE LOGGED ON PURPOSE, because we assume that router
        to be malicious (deliberately, or unintentionally).

    :param list descriptors: A list of
        :api:`stem.descriptor.server_descriptor.RelayDescriptor`s,
        :api:`stem.descriptor.extrainfo_descriptor.BridgeExtraInfoDescriptor`s,
        or :api:`stem.descriptor.router_status_entry.RouterStatusEntryV2`s.
    """
    duplicates = {}
    nonDuplicates = {}

    for descriptor in descriptors:
        fingerprint = descriptor.fingerprint
        logging.debug("Deduplicating %s descriptor for router %s"
                      % (str(descriptor.__class__).rsplit('.', 1)[1],
                         safelog.logSafely(fingerprint)))

        if fingerprint in nonDuplicates.keys():
            # We already found a descriptor for this fingerprint:
            conflict = nonDuplicates[fingerprint]

            # If the descriptor we are currently parsing is newer than the
            # last one we found:
            if descriptor.published > conflict.published:
                # And if this is the first duplicate we've found for this
                # router, then create a list in the ``duplicates`` dictionary
                # for the router:
                if not fingerprint in duplicates.keys():
                    duplicates[fingerprint] = list()
                # Add this to the list of duplicates for this router:
                duplicates[fingerprint].append(conflict)
                # Finally, put the newest descriptor in the ``nonDuplicates``
                # dictionary:
                nonDuplicates[fingerprint] = descriptor
            # Same thing, but this time the one we're parsing is older:
            elif descriptor.published < conflict.published:
                if not fingerprint in duplicates.keys():
                    duplicates[fingerprint] = list()
                duplicates[fingerprint].append(descriptor)
            # This *shouldn't* happen. It would mean that two descriptors for
            # the same router had the same timestamps, probably meaning there
            # is a severely-messed up OR implementation out there. Let's log
            # its fingerprint (no matter what!) so that we can look up its
            # ``platform`` line in its server-descriptor and tell whoever
            # wrote that code that they're probably (D)DOSing the Tor network.
            else:
                logging.warn(("Duplicate descriptor with identical timestamp "
                              "(%s) for router with fingerprint '%s'!")
                             % (descriptor.published, fingerprint))
        # Hoorah! No duplicates! (yet...)
        else:
            nonDuplicates[fingerprint] = descriptor

    logging.info("Descriptor deduplication finished.")
    logging.info("Number of duplicates: %d" % len(duplicates))
    for (fingerprint, dittos) in duplicates.items():
        logging.info("  For %s: %d duplicates"
                     % (safelog.logSafely(fingerprint), len(dittos)))
    logging.info("Number of non-duplicates: %d" % len(nonDuplicates))
    return nonDuplicates
}}}
{{{
descriptors.extend([router for router in document])
}}}
Again, there's no reason to do list comprehension to just convert an iterable to a list. Calling...
> tl;dr: The datetime module is braindamaged. Or Python2.x string types are braindamaged. Or both.
Curious! I'm not sure why one particular system would exhibit a bytes vs unicode issue. That's certainly a gotcha if you're using python 3.x on some systems, but I've never seen it be related to the platform itself. I'm also unsure why chr(0x20) would've helped since that simply returns a string.
Regardless, if you get a clearer understanding of what value might be either bytes or unicode then we can normalize to avoid that issue.
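One common way to do such normalization is a tiny decode-if-bytes helper. This is a hypothetical sketch of the idea, not stem or BridgeDB API:

```python
def to_text(value, encoding="utf-8"):
    # Hypothetical helper: decode bytes to text up front, so later string
    # formatting never mixes bytes with text (the source of the original
    # unicode error).
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

print(to_text(b"published 2014-08-15 12:30:05"))  # → published 2014-08-15 12:30:05
print(to_text(u"already text"))                   # → already text
```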
> Either way (with your code or mine), Stem is getting angry about the flag-thresholds line in the networkstatus file:
Oh, ok. So there are some network status document headers. Honestly if you don't care about these headers you could just skip down to the first router status entry then use router_status_entry._parse_file() as suggested.
> {{{
> if fingerprint in nonDuplicates.keys():
> }}}
You don't need to specify keys() here. Iteration and contents checks are via a dictionary's keys so this is equivalent to...
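A quick illustration of the equivalence (my own toy snippet):

```python
d = {"fingerprint": "descriptor"}

# Membership tests go through the dictionary's keys either way; the bare
# `in` avoids building a key view (or, on Python 2, a whole key list).
print("fingerprint" in d.keys())  # → True
print("fingerprint" in d)         # → True
```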
I've got some more progress at https://github.com/joelanders/bridgedb/commits/fix/9380-stem_r3. Basically using isis's new functions from descriptors.py in Bridges.py and Main.py. Probably doesn't work quite right yet, and I haven't written any specs yet, but wanted to let someone chime in if I'm going in the wrong direction :)
> I've got some more progress at https://github.com/joelanders/bridgedb/commits/fix/9380-stem_r3. Basically using isis's new functions from descriptors.py in Bridges.py and Main.py. Probably doesn't work quite right yet, and I haven't written any specs yet, but wanted to let someone chime in if I'm going in the wrong direction :)
Thanks! Before getting into code review, I went ahead and tested your patches.
I tested your changes and they failed. The first reason was that #12953 (moved) was causing parser errors (not the fault of the parsers, it was Leekspin's fault), so I fixed that, pushed Leekspin-0.2.1, and bumped BridgeDB's Leekspin dependency to 0.2.1.
Because the Leekspin version would need to be bumped for all development branches, I did it in the develop branch. I now have a fix/9380-stem_r4 branch which is my same commits as before (in fix/9380-stem_r3) but with the Leekspin bump commit underneath them.
Next, to test them again, I cherry-picked your commits on top. There's still a few issues shown here in the CI results.
We don't normally throw out code or unittests; we throw old code into lib/bridgedb/test/deprecated.py, and then use a monkey.MonkeyPatcher in lib/bridgedb/test/test_Tests.py to run unittests with the old code and the new code to ensure compatibility (except in the cases where we want to change behaviours).
In this case, apart from fixing the duplicated transports issue (#11216 (moved)), I would actually prefer that the parsers behave the same.
Case in point: while it was in the tor-spec.txt (until a few months ago) that the "a" line in a networkstatus-bridges file could be an IPv4 or IPv6 address, in reality, in implementation, this was never true: they only contain IPv6 addresses.
I guess what I mean to say is "just because the spec says something, doesn't at all mean that it's true, because nobody ever sees these descriptors in their unsanitised form (esp. the extrainfo ones) and therefore these things rarely get checked." We need to be certain that BridgeDB will still behave the same way after switching to Stem's parsers, otherwise things might get ugly.
Everything in BridgeDB whose filename starts with a capital letter is slated to be killed with fire. Try not to add new code to those files. Make new lowercased modules wherever you think they should live.
There's some general refactoring work on the bridgedb.Bridges module which this ticket could likely benefit from in my fix/12029-dist-api_r2 branch. I'm going to try to get those patches merge-able now, before the rest of the Distributor API changes (#12029 (moved)) so that we're not forcing the results from these parsers into the horrible mess of bad code in bridgedb.Bridges.
Cool, cool, thanks! I shoulda mentioned I hadn't run the existing tests on this new code. Just sketching the outline, mostly. Figuring out where to put things and stuff.
I've run into the specs vs. implementation tension a bit. Stem's RouterStatusEntryV2 doesn't have the or_addresses attribute (corresponding to the "a" line in the network status document). V3 does? Also, I think these or_addresses can have a "portlist" (as called in dir-spec.txt) as opposed to just an integer.
> Cool, cool, thanks! I shoulda mentioned I hadn't run the existing tests on this new code. Just sketching the outline, mostly. Figuring out where to put things and stuff.
> I've run into the specs vs. implementation tension a bit. Stem's RouterStatusEntryV2 doesn't have the or_addresses attribute (corresponding to the "a" line in the network status document). V3 does? Also, I think these or_addresses can have a "portlist" (as called in dir-spec.txt) as opposed to just an integer.
I think both you and Damian are correct; this would be one of the instances where stem.descriptor.router_status_entry.RouterStatusEntryV3 would work better, as it has an or_addresses attribute which would contain the IPv6 addresses from a bridge networkstatus "a"-line. We'll need to switch to using router_status_entry._parse_file() as Damian suggests.
About the portlist for IPv6 addresses in one of these lines... I'm actually not sure what the spec says. In practice, I only recall seeing one IPv6 address and one port to go with it, but that doesn't mean it couldn't be otherwise. You might have to look at tor's source to determine what's allowed.
> Also, I think these or_addresses can have a "portlist" (as called in dir-spec.txt) as opposed to just an integer.
Hi joelanders, I think it's a mistake for the spec to say "portlist" there (filed a ticket).
> About the portlist for IPv6 addresses in one of these lines... I'm actually not sure what the spec says.
I'm not quite sure what this is in reference to, but the a-line can be IPv4 or IPv6 addresses. Stem's or_addresses attribute is a three value tuple telling you which it is.
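Based on that description, consuming the attribute looks something like the following sketch. The sample values are made up; the (address, port, is_ipv6) shape is the one described for stem's or_addresses attribute.

```python
# Made-up sample data in the (address, port, is_ipv6) shape described
# above for stem's or_addresses attribute.
or_addresses = [
    ("194.109.206.212", 443, False),
    ("2001:db8::42", 9001, True),
]

for address, port, is_ipv6 in or_addresses:
    kind = "IPv6" if is_ipv6 else "IPv4"
    print("%s address %s on port %d" % (kind, address, port))
```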
> About the portlist for IPv6 addresses in one of these lines... I'm actually not sure what the spec says.
> I'm not quite sure what this is in reference to, but the a-line can be IPv4 or IPv6 addresses. Stem's or_addresses attribute is a three value tuple telling you which it is.
tl;dr: The "a"/"or-address" lines, in implementation, only happen once each per router, and only ever contain IPv6 addresses, despite what dir-spec.txt says.
"or-address" SP ADDRESS ":" PORT NL [Any number] ADDRESS = IP6ADDR | IP4ADDR IPV6ADDR = an ipv6 address, surrounded by square brackets. IPV4ADDR = an ipv4 address, represented as a dotted quad. PORT = a number between 1 and 65535 inclusive. An alternative for the address and ORPort of the "router" line, but with two added capabilities: * or-address can be either an IPv4 or IPv6 address * or-address allows for multiple ORPorts and addresses A descriptor SHOULD NOT include an or-address line that does nothing but duplicate the address:port pair from its "router" line. The ordering of or-address lines and their PORT entries matter because Tor MAY accept a limited number of addresses or ports. As of Tor 0.2.3.x only the first address and the first port are used.
In terms of how many "a"/"or-address" lines there may be, the spec is only correct if you pay super close attention to the last sentence (this is actually the first time I've noticed it :) ).
Unless we're aiming to build parsers for some other, spec-compliant¹, imaginary OR implementation? Which is totally worth doing, and if that's what Stem is doing, then that's great! But if that is the case, then BridgeDB would likely have to modify Stem's parsing, because it certainly wouldn't be very good if someone were to write an alternate OR implementation which started submitting the fully-spec-compliant "[Any number]" of addresses to BridgeDB and spamming the databases.
¹ I'm ignoring Orchid because last I checked (two months ago), it wasn't what I'd call "spec-compliant".
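The grammar quoted above can be exercised with a small parser sketch. This is illustrative only (not stem's or BridgeDB's parser); splitting on the *last* colon is what keeps the colons inside a bracketed IPv6 address intact.

```python
def parse_or_address(line):
    # "or-address" SP ADDRESS ":" PORT — rsplit on the last ':' so the
    # colons inside a bracketed IPv6 address are left alone, then strip
    # the brackets off the address.
    keyword, addr_port = line.split(" ", 1)
    assert keyword == "or-address"
    address, port = addr_port.rsplit(":", 1)
    return (address.strip("[]"), int(port))

print(parse_or_address("or-address [2001:db8::1]:9001"))  # → ('2001:db8::1', 9001)
print(parse_or_address("or-address 203.0.113.5:443"))     # → ('203.0.113.5', 443)
```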
Trac: Summary: "BridgeDB should use stem for parsing descriptors according to torspec" to "BridgeDB should use Stem for parsing descriptors"
> Unless we're aiming to build parsers for some other, spec-compliant¹, imaginary OR implementation?
We are. Stem aims to conform with tor's spec, not its implementation. If tor doesn't match its spec then that's a tor bug and either tor or its spec needs to change.
Please shoot Nick a ticket if the spec's inaccurate about the 'a' lines.
The latest changes are in my fix/9380-stem_r10 branch, with 100% unittest coverage for all included changes. (Overall, BridgeDB's test coverage is now up to nearly 80%!) Please review!
BridgeDB's verification chain for descriptors currently (as of #9380 (moved)) goes like this:
Parse the @type bridge-networkstatus documents in the networkstatus-bridges file.
Create Bridge class instances for each entry we parsed in step 1. Call the Bridge.updateFromNetworkStatus() method with the corresponding networkstatus document for each Bridge. This includes storing the descriptor digest for each Bridge.
Parse the @type bridge-server-descriptors found in the bridge-descriptors file.
Store the extra-info-digest from each @type bridge-server-descriptor.
Parse and deduplicate the @type bridge-extrainfo descriptors in cached-extrainfo and cached-extrainfo.new.
Verify the router-signature on the @type bridge-extrainfo descriptor for each bridge, using the signing-key from the Bridge's @type bridge-server-descriptor.
{{{
from stem.descriptor import extrainfo_descriptor
from stem.descriptor import server_descriptor
}}}
Unused.
{{{
def parseServerDescriptorsFile(filename, validate=True):
    """Parse a file which contains ``@type bridge-server-descriptor``s.

    .. note:: ``validate`` defaults to ``False`` because there appears to be
        a bug in Leekspin, the fake descriptor generator, where Stem thinks
        the fingerprint doesn't match the key…

    .. note:: We have to lie to Stem, pretending that these are ``@type
        server-descriptor``s, **not** ``@type bridge-server-descriptor``s.
        See ticket `#11257`_.

    .. _`#11257`: https://trac.torproject.org/projects/tor/ticket/11257

    :param str filename: The file to parse descriptors from.
    :param bool validate: Whether or not to validate descriptor contents.
        (default: ``False``)
    :rtype: list
    :returns: A list of
        :api:`stem.descriptor.server_descriptor.RelayDescriptor`s.
    """
    logging.info("Parsing server descriptors with Stem: %s" % filename)
    descriptorType = 'server-descriptor 1.0'
    document = parse_file(filename, descriptorType, validate=validate)
    routers = list(document)
    return routers
}}}
It provides type inferencing, defaults, etc without doing exec() as you presently are (... which is almost never a great idea). You can then use config_dict() to specify the config values each file cares about and their default values. For example...
{{{
def parseNetworkStatusFile(filename, validate=True, skipAnnotations=True,
                           descriptorClass=RouterStatusEntryV3):
    """Parse a file which contains an ``@type bridge-networkstatus`` document.

    See `ticket #12254 <https://bugs.torproject.org/12254>`__ for why
    networkstatus-bridges documents don't look anything like the networkstatus
    v2 documents that they are purported to look like. They are missing all
    headers, and the entire footer including authority signatures.
}}}
Actually, you could probably simply parse it like a normal network status document but without validation.
def deduplicate(descriptors): """Deduplicate some descriptors, returning only the newest for each router. .. note:: If two descriptors for the same router are discovered, AND both descriptors have the **same** published timestamp, then the router's fingerprint WILL BE LOGGED ON PURPOSE, because we assume that router to be malicious (deliberately, or unintentionally). :param list descriptors: A list of :api:`stem.descriptor.server_descriptor.RelayDescriptor`s, :api:`stem.descriptor.extrainfo_descriptor.BridgeExtraInfoDescriptor`s, :api:`stem.descriptor.router_status_entry.RouterStatusEntryV2`s. """ duplicates = {} nonDuplicates = {} for descriptor in descriptors: fingerprint = descriptor.fingerprint logging.debug("Deduplicating %s descriptor for router %s" % (str(descriptor.__class__).rsplit('.', 1)[1], safelog.logSafely(fingerprint))) if fingerprint in nonDuplicates.keys(): # We already found a descriptor for this fingerprint: conflict = nonDuplicates[fingerprint] # If the descriptor we are currently parsing is newer than the # last one we found: if descriptor.published > conflict.published: # And if this is the first duplicate we've found for this # router, then create a list in the ``duplicates`` dictionary # for the router: if not fingerprint in duplicates.keys(): duplicates[fingerprint] = list() # Add this to the list of duplicates for this router: duplicates[fingerprint].append(conflict) # Finally, put the newest descriptor in the ``nonDuplicates`` # dictionary: nonDuplicates[fingerprint] = descriptor # Same thing, but this time the one we're parsing is older: elif descriptor.published < conflict.published: if not fingerprint in duplicates.keys(): duplicates[fingerprint] = list() duplicates[fingerprint].append(descriptor) # This *shouldn't* happen. It would mean that two descriptors for # the same router had the same timestamps, probably meaning there # is a severely-messed up OR implementation out there. Let's log # its fingerprint (no matter what!) 
so that we can look up its # ``platform`` line in its server-descriptor and tell whoever # wrote that code that they're probably (D)DOSing the Tor network. else: raise DescriptorWarning( ("Duplicate descriptor with identical timestamp (%s) for " "router with fingerprint '%s'!") % (descriptor.published, fingerprint)) # Hoorah! No duplicates! (yet...) else: nonDuplicates[fingerprint] = descriptor logging.info("Descriptor deduplication finished.") logging.info("Number of duplicates: %d" % len(duplicates)) for (fingerprint, dittos) in duplicates.items(): logging.info(" For %s: %d duplicates" % (safelog.logSafely(fingerprint), len(dittos))) logging.info("Number of non-duplicates: %d" % len(nonDuplicates)) return nonDuplicates
Off the top of my head think the following does the same as all the above...
{{{
# makes a map of {fingerprint => [all descriptors with that fingerprint]}
fingerprint_to_descriptors = {}

for desc in descriptors:
    fingerprint_to_descriptors.setdefault(desc.fingerprint, []).append(desc)

# provide back the latest
return [sorted(descs, key = lambda d: d.published, reverse = True)[0]
        for descs in fingerprint_to_descriptors.values()]
}}}
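This newest-per-fingerprint approach is easy to check with stand-in descriptors. `Desc` and `newest_per_fingerprint` below are my hypothetical names for the test, not stem or BridgeDB identifiers:

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in carrying only the attributes the logic reads.
Desc = namedtuple("Desc", ["fingerprint", "published"])

def newest_per_fingerprint(descriptors):
    # Map each fingerprint to all of its descriptors, then keep only the
    # newest of each group (sort descending by published, take the first).
    fingerprint_to_descriptors = {}
    for desc in descriptors:
        fingerprint_to_descriptors.setdefault(desc.fingerprint, []).append(desc)
    return [sorted(descs, key=lambda d: d.published, reverse=True)[0]
            for descs in fingerprint_to_descriptors.values()]

old = Desc("B" * 40, datetime(2014, 1, 1))
new = Desc("B" * 40, datetime(2014, 6, 1))
print(newest_per_fingerprint([old, new]) == [new])  # → True
```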
{{{
from stem.descriptor import extrainfo_descriptor
from stem.descriptor import server_descriptor
}}}
Unused.
Removed. Thanks!
{{{
def parseServerDescriptorsFile(filename, validate=True):
"""Parse a file which contains @type bridge-server-descriptors.
    .. note:: ``validate`` defaults to ``False`` because there appears to be
        a bug in Leekspin, the fake descriptor generator, where Stem thinks
        the fingerprint doesn't match the key…

    .. note:: We have to lie to Stem, pretending that these are ``@type
        server-descriptor``s, **not** ``@type bridge-server-descriptor``s.
        See ticket `#11257`_.

    .. _`#11257`: https://trac.torproject.org/projects/tor/ticket/11257

    :param str filename: The file to parse descriptors from.
    :param bool validate: Whether or not to validate descriptor contents.
        (default: ``False``)
    :rtype: list
    :returns: A list of
        :api:`stem.descriptor.server_descriptor.RelayDescriptor`s.
    """
    logging.info("Parsing server descriptors with Stem: %s" % filename)
    descriptorType = 'server-descriptor 1.0'
    document = parse_file(filename, descriptorType, validate=validate)
    routers = list(document)
    return routers
}}}
Really doesn't seem like it's providing any value.
Yep… but adding a log statement in there makes it satisfy #9377 (moved). And documentation is important (rather than just importing a magickal function from Stem without explaining what is happening). And someday, with #12506 (moved), BridgeDB's database manager is going to do this parsing, so having everything all neatly contained within the bridgedb.parse.descriptors module will come in handy when it comes time to separate (rather than just calling list(parse_file([…])) in bridgedb.Main.load()).
{{{
lib/bridgedb/configure.py:
def loadConfig(configFile=None, configCls=None):
"""Load configuration settings on top of the current settings.
...
}}}
It provides type inferencing, defaults, etc without doing exec() as you presently are (... which is almost never a great idea). You can then use config_dict() to specify the config values each file cares about and their default values. For example...
Then when the config for that singleton is read all the configurations get the updated values.
Awesome! Yes, BridgeDB had a particularly nasty manner of handling config files when I started working on it. Not to mention that it made the servers reset themselves to whatever settings the config file had when they were first started (#9277 (moved)). I didn't change the config parsing much, I just hacked in the parsing for updating configs while running, and (de)serialising them.
If you or someone else ever feels like hacking on making BridgeDB's configs saner… I'd be pretty happy about it. Otherwise I'll likely check out your Stem config utility at some point.
I recall that that didn't work for catching the unparseable descriptors and saving them to separate files and/or that it broke my integration tests? I'll have to look at that more.
{{{
def parseNetworkStatusFile(filename, validate=True, skipAnnotations=True,
descriptorClass=RouterStatusEntryV3):
"""Parse a file which contains an @type bridge-networkstatus document.
    See `ticket #12254 <https://bugs.torproject.org/12254>`__ for why
    networkstatus-bridges documents don't look anything like the networkstatus
    v2 documents that they are purported to look like. They are missing all
    headers, and the entire footer including authority signatures.
}}}
Actually, you could probably simply parse it like a normal network status document but without validation.
Validation seems like a good idea?
Someday I'll just fix the BridgeAuthority. Or kill it, since it basically does little more than collecting everything that comes its way and then shoving an everything.tar.gz at BridgeDB.
I've included fixes for some of your suggestions in the same fix/9380-stem_r10 branch.
Thanks, atagar!
Trac: Status: needs_review to closed Resolution: N/A to fixed