Opened 4 years ago

Closed 2 years ago

Last modified 17 months ago

#9380 closed enhancement (fixed)

BridgeDB should use Stem for parsing descriptors

Reported by: sysrqb Owned by: isis
Priority: Immediate Milestone:
Component: Obfuscation/BridgeDB Version:
Severity: Normal Keywords: stem, bridgedb-0.3.0, bridgedb-parsers, isis2014Q3Q4, isisExB
Cc: atagar, isis, sysrqb Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Currently we're lenient about how we parse descriptors, in some cases. For example, if we only need the first four values on a line, but the spec mandates there be 6, we presently do:

if len(items) >= 4:
    nickname = items[1]
    ip = items[2].strip('[]')
    orport = int(items[3])

We should ensure the line is well-formed before accepting the values. Something closer to:

if len(items) == 6:
    nickname = items[1]
    ip = items[2].strip('[]')
    orport = int(items[3])
else:
    logging.warn("Parsed malformed descriptor 'router' line. Expected 6 items, but line has %d" % len(items))
    logging.debug("  Line: %s" % items)

(Logging sold separately; untested code.)
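A minimal, runnable version of the stricter check sketched above (the function name and the example line are illustrative, not BridgeDB's actual parser):

```python
import logging

def parse_router_line(line):
    """Parse a 'router' line, accepting it only if it has exactly 6 items.

    Returns (nickname, ip, orport), or None for a malformed line.
    (Illustrative sketch; BridgeDB's real parsing differs.)
    """
    items = line.split()
    if len(items) == 6:
        nickname = items[1]
        ip = items[2].strip('[]')
        orport = int(items[3])
        return (nickname, ip, orport)
    logging.warning("Parsed malformed descriptor 'router' line. "
                    "Expected 6 items, but line has %d", len(items))
    logging.debug("  Line: %s", items)
    return None

# A well-formed line parses; a truncated one is rejected rather than
# silently accepted:
ok = parse_router_line("router Unnamed 203.0.113.1 9001 0 0")
bad = parse_router_line("router Unnamed 203.0.113.1 9001")
```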

Child Tickets

Change History (44)

comment:1 Changed 4 years ago by sysrqb

First version on bug9380_stricter_parser https://github.com/sysrqb/bridgedb.git

This ticket may get merged into #9462.

comment:2 Changed 4 years ago by isis

I started reviewing this code today, briefly, but I will need to look at it more.

comment:3 Changed 4 years ago by sysrqb

  • Cc atagar added

Possibly naively, I think another option is that we scrap the current parsers and dep on stem. I have neither reviewed stem's parsers nor have I looked at the api...but I think it sounds like a reasonable idea. :)

atagar, thoughts? Basically, we need to parse bridge descriptors, extra-info descriptors, and networkstatuses and retrieve ~10% of the information they contain. I'm under the impression stem has support for all tor docs, but this may be wrong. Do you think this is a good application for stem, and is it worth pulling in the entire lib just for the parsers? I really appreciate any input you have!

comment:4 Changed 4 years ago by atagar

atagar, thoughts? Basically, we need to parse bridge descriptors, extra-info descriptors, and networkstatuses and retrieve ~10% of the information they contain.

Hi sysrqb. Yup, this would be a perfect use case for stem. I assume by 'bridge descriptors' you mean 'bridge variant of server descriptors' in which case yup, we have support for all of those...

If you have need of a descriptor parser then I definitely think it would be worth snagging stem. Stem does rather strict validation of the content (which you're looking for) and has quite a bit of tests around these modules. Also, I keep it up to date with the specs.

Cheers! -Damian

comment:5 Changed 4 years ago by atagar

Oh, it just crossed my mind that the 'bridge' variants might not be what you're looking for. Those handle the sanitized bridge descriptor output on metrics. You might want the non-scrubbed RelayDescriptor and RelayExtraInfoDescriptor counterparts.

comment:6 Changed 4 years ago by isis

  • Cc isis@… added
  • Keywords descriptor tor-bridge stem bridgedb refactoring added
  • Parent ID set to #9462
  • Status changed from new to needs_revision

Arg. When replying to this two days ago, I did a stupid thing where I Ctrl-X'd the text I'd written to paste it into my editor, and then I highlighted something else and lost it.

tl;dr rewrite:

  • BridgeDB should use stem.
  • Most of BridgeDB's descriptor parsing code should be completely rewritten. (see the later point on unittests!)
  • Bridge descriptors are not well documented, and in some places not clearly spec'd. This should also be fixed.
  • There is not yet, as far as I know, an easy way to generate fake unsanitised bridge descriptors which would actually pass as real, current unsanitised bridge descriptors.
  • If atagar is amenable, we should create classes for bridge descriptors (or possibly modify the existing ones with some sort of is_sanitised_version=False attribute).
  • We need a real unittest framework. I started looking into a library which Lunar pointed me at, called sure, to see if it plays well with twisted.trial. Especially for checking that descriptor parsing is going according to plan, sure seems really nice. (This should possibly be another ticket.)

Rather than merging, I am going to rename #9462 and set it as this ticket's parent. To make it more concrete, should this ticket then be about "BridgeDB should use stem for parsing descriptors according to torspec"?

comment:7 Changed 4 years ago by isis

  • Summary changed from Increase strictness of descriptor parsers according to spec to BridgeDB should use stem for parsing descriptors according to torspec

comment:8 Changed 3 years ago by isis

  • Parent ID #9462 deleted

comment:9 Changed 3 years ago by isis

I have started working on this. What I've got so far is in my fix/9380-stem_r2 branch.

comment:10 Changed 3 years ago by isis

  • Keywords bridgedb-0.2.x bridgedb-parsers added; descriptor tor-bridge bridgedb refactoring removed
  • Owner set to isis
  • Status changed from needs_revision to accepted

comment:11 Changed 3 years ago by isis

  • Status changed from accepted to needs_revision

comment:12 Changed 3 years ago by isis

  • Cc isis sysrqb added; isis@… removed

comment:13 Changed 3 years ago by isis

Additionally, for this ticket, there is a whole mess of code, mostly in bridgedb.Main and bridgedb.Bridges which deals with descriptor parsing which should be killed with fire. Or, nuked from orbit, to use weasel's terms.

comment:14 follow-up: Changed 3 years ago by joelanders

Think I'll have a go at lighting the nukes on fire.
Sorry if I'm stealing someone's fun.
Progress at https://github.com/joelanders/bridgedb/commits/fix/9380-stem_r2.

comment:15 in reply to: ↑ 14 Changed 3 years ago by isis

Replying to joelanders:

Think I'll have a go at lighting the nukes on fire.
Sorry if I'm stealing someone's fun.
Progress at https://github.com/joelanders/bridgedb/commits/fix/9380-stem_r2.

Awesome, thanks for working on this! :D

comment:16 follow-up: Changed 3 years ago by atagar

Hi isis, here's a quick code review for the file you pointed me toward (descriptors.py)...


if not descriptors.startswith("published"):
  precise = datetime.datetime.now().isoformat(sep=chr(0x20))
  timestamp = precise.rsplit('.', 1)[0]
  descriptors = "published {t}\n{d}".format(t=timestamp, d=descriptors)

Interesting, any reason to do 'sep=chr(0x20)' rather than sep=' '?

Regardless, this would be simpler as...

if not descriptors.startswith("published"):
  published_line = "published %s\n" % datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
  descriptors = published_line + descriptors

routers = networkstatus.BridgeNetworkStatusDocument(descriptors,
  validate=validate)

This is something we've discussed several times but are you sure BridgeNetworkStatusDocument is what you want? Your earlier comment indicates that these files are simply a list of router status entries (no consensus header or footer). If that's the case then you could simplify this whole function to be something like...

def parseNetworkStatusFile(filename, validate=True):
  with open(filename) as descriptor_file:
    return list(stem.descriptor.router_status_entry._parse_file(
      descriptor_file,
      validate,
      entry_class = stem.descriptor.router_status_entry.RouterStatusEntryV2,
    ))

This will provide you a list of RouterStatusEntryV2 entries. I think you really want RouterStatusEntryV3, but your current code is providing you RouterStatusEntryV2 instances (maybe a bug).


routers = [router for router in document]
return routers

This could also be...

return list(document)

for descriptor in descriptors:
  router = descriptors.pop(descriptors.index(descriptor))

This has a couple issues...

  • You're modifying a list while you iterate over it. This is a bad idea in just about any language (iirc java would raise a ConcurrentModificationException).
  • Doing this is both pointless and very, very inefficient. You're doing an O(n) iteration over the descriptors, then an O(n) lookup to find the index, then another O(n) modification. In other words this is O(n²) twice.

This would be both simpler and more efficient as...

while descriptors:
  router = descriptors.pop()

That said, this whole function can be simplified as...

def deduplicate(descriptors):
  # Dictionary of {fingerprint: descriptor} for the newest descriptor with a
  # given fingerprint.

  fp_to_descriptor = {}
  duplicates = []

  for desc in descriptors:
    if desc.fingerprint in fp_to_descriptor:
      conflict = fp_to_descriptor[desc.fingerprint]

      if desc.published > conflict.published:
        fp_to_descriptor[desc.fingerprint] = desc
        duplicates.append(conflict)
      else:
        duplicates.append(desc)
    else:
      fp_to_descriptor[desc.fingerprint] = desc

  return fp_to_descriptor.values()

descriptors.extend([router for router in document])

Again, there's no reason to do list comprehension to just convert an iterable to a list. Calling...

[router for router in document]

... is the same as...

list(document)

Cheers! -Damian
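A condensed, runnable version of the deduplicate() sketch above (dropping the duplicates bookkeeping), exercised with a stand-in Desc namedtuple rather than real Stem descriptor classes:

```python
from collections import namedtuple

# Stand-in for a parsed descriptor; real code would use Stem's classes,
# which also expose .fingerprint and .published attributes.
Desc = namedtuple('Desc', ['fingerprint', 'published'])

def deduplicate(descriptors):
    """Keep only the newest descriptor per fingerprint."""
    fp_to_descriptor = {}
    for desc in descriptors:
        conflict = fp_to_descriptor.get(desc.fingerprint)
        if conflict is None or desc.published > conflict.published:
            fp_to_descriptor[desc.fingerprint] = desc
    return list(fp_to_descriptor.values())

old = Desc('A' * 40, published=1)
new = Desc('A' * 40, published=2)
other = Desc('B' * 40, published=1)
result = deduplicate([old, new, other])
```

With two descriptors for fingerprint 'AAA…', only the one with the later published time survives.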

comment:17 in reply to: ↑ 16 Changed 3 years ago by isis

Replying to atagar:

Hi isis, here's a quick code review for the file you pointed me toward (descriptors.py)...


if not descriptors.startswith("published"):
  precise = datetime.datetime.now().isoformat(sep=chr(0x20))
  timestamp = precise.rsplit('.', 1)[0]
  descriptors = "published {t}\n{d}".format(t=timestamp, d=descriptors)

Interesting, any reason to do 'sep=chr(0x20)' rather than sep=' '?


tl;dr: The datetime module is braindamaged. Or Python2.x string types are braindamaged. Or both.

This was done to fix a unicode error resulting on some of the test boxes, via the implicit concatenation/formatting of a string (datetime only returns ascii strings) with what might be a unicode string on some systems. Though I don't remember which system was having the error, or whether it was ponticum (the BridgeDB production server).

Regardless, this would be simpler as...

if not descriptors.startswith("published"):
  published_line = "published %s\n" % datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
  descriptors = published_line + descriptors


Using strftime() certainly looks nicer! I'll have to try that and see if I hit the same unicode problem again. Or just fix the problem that the bridge networkstatus documents don't have a published line. :)


routers = networkstatus.BridgeNetworkStatusDocument(descriptors,
  validate=validate)

This is something we've discussed several times but are you sure BridgeNetworkStatusDocument is what you want? Your earlier comment indicates that these files are simply a list of router status entries (no consensus header or footer). If that's the case then you could simplify this whole function to be something like...

def parseNetworkStatusFile(filename, validate=True):
  with open(filename) as descriptor_file:
    return list(stem.descriptor.router_status_entry._parse_file(
      descriptor_file,
      validate,
      entry_class = stem.descriptor.router_status_entry.RouterStatusEntryV2,
    ))

This will provide you a list of RouterStatusEntryV2 entries. I think you really want RouterStatusEntryV3, but your current code is providing you RouterStatusEntryV2 instances (maybe a bug).


Thanks for the clarification! (Thanks also for causing me to find a bug in leekspin.)

Either way (with your code or mine), Stem is getting angry about the flag-thresholds line in the networkstatus file:

ValueError: Router status entries (v3) must have a 'r' line:
flag-thresholds stable-uptime=613624 stable-mtbf=2488616 fast-speed=15000 guard-wfu=98.000% guard-tk=691200 guard-bw-inc-exits=55000 guard-bw-exc-exits=55000 enough-mtbf=1 ignoring-advertised-bws=0


It stops getting angry if the networkstatus document isn't validated, i.e. parseNetworkStatusFile(filename, validate=False). (We should probably just leave it at False for this one, since there isn't a BridgeAuth signature or anything else, other than sound formatting, to validate anyway.)


routers = [router for router in document]
return routers

This could also be...

return list(document)


Oh, fancy! Thanks, atagar! I've changed it to your version.


for descriptor in descriptors:
  router = descriptors.pop(descriptors.index(descriptor))

This has a couple issues...

  • You're modifying a list while you iterate over it. This is a bad idea in just about any language (iirc java would raise a ConcurrentModificationException).
  • Doing this is both pointless and very, very inefficient. You're doing an O(n) iteration over the descriptors, then an O(n) lookup to find the index, then another O(n) modification. In other words this is O(n²) twice.

This would be both simpler and more efficient as...

while descriptors:
  router = descriptors.pop()

That said, this whole function can be simplified as...

def deduplicate(descriptors):
  # Dictionary of {fingerprint: descriptor} for the newest descriptor with a
  # given fingerprint.

  fp_to_descriptor = {}
  duplicates = []

  for desc in descriptors:
    if desc.fingerprint in fp_to_descriptor:
      conflict = fp_to_descriptor[desc.fingerprint]

      if desc.published > conflict.published:
        fp_to_descriptor[desc.fingerprint] = desc
        duplicates.append(conflict)
      else:
        duplicates.append(desc)
    else:
      fp_to_descriptor[desc.fingerprint] = desc

  return fp_to_descriptor.values()


This has been refactored as a mix between my original code and yours (because I wanted to keep the warnings for misbehaving OR implementations which might be submitting multiple bridge descriptors as we likely have seen in the past):

def deduplicate(descriptors):
    """Deduplicate some descriptors, returning only the newest for each router.

    .. note:: If two descriptors for the same router are discovered, AND both
        descriptors have the **same** published timestamp, then the router's
        fingerprint WILL BE LOGGED ON PURPOSE, because we assume that router
        to be malicious (deliberately, or unintentionally).

    :param list descriptors: A list of
        :api:`stem.descriptor.server_descriptor.RelayDescriptor`s,
        :api:`stem.descriptor.extrainfo_descriptor.BridgeExtraInfoDescriptor`s,
        :api:`stem.descriptor.router_status_entry.RouterStatusEntryV2`s.
    """
    duplicates = {}
    nonDuplicates = {}

    for descriptor in descriptors:
        fingerprint = descriptor.fingerprint

        logging.debug("Deduplicating %s descriptor for router %s"
                      % (str(descriptor.__class__).rsplit('.', 1)[1],
                         safelog.logSafely(fingerprint)))

        if fingerprint in nonDuplicates.keys():
            # We already found a descriptor for this fingerprint:
            conflict = nonDuplicates[fingerprint]

            # If the descriptor we are currently parsing is newer than the
            # last one we found:
            if descriptor.published > conflict.published:
                # And if this is the first duplicate we've found for this
                # router, then create a list in the ``duplicates`` dictionary
                # for the router:
                if not fingerprint in duplicates.keys():
                    duplicates[fingerprint] = list()
                # Add this to the list of duplicates for this router:
                duplicates[fingerprint].append(conflict)
                # Finally, put the newest descriptor in the ``nonDuplicates``
                # dictionary:
                nonDuplicates[fingerprint] = descriptor
            # Same thing, but this time the one we're parsing is older:
            elif descriptor.published < conflict.published:
                if not fingerprint in duplicates.keys():
                    duplicates[fingerprint] = list()
                duplicates[fingerprint].append(descriptor)
            # This *shouldn't* happen. It would mean that two descriptors for
            # the same router had the same timestamps, probably meaning there
            # is a severely-messed up OR implementation out there. Let's log
            # its fingerprint (no matter what!) so that we can look up its
            # ``platform`` line in its server-descriptor and tell whoever
            # wrote that code that they're probably (D)DOSing the Tor network.
            else:
                logging.warn(("Duplicate descriptor with identical timestamp "
                              "(%s) for router with fingerprint '%s'!")
                             % (descriptor.published, fingerprint))

        # Hoorah! No duplicates! (yet...)
        else:
            nonDuplicates[fingerprint] = descriptor

    logging.info("Descriptor deduplication finished.")
    logging.info("Number of duplicates: %d" % len(duplicates))
    for (fingerprint, dittos) in duplicates.items():
        logging.info("  For %s: %d duplicates"
                     % (safelog.logSafely(fingerprint), len(dittos)))
    logging.info("Number of non-duplicates: %d" % len(nonDuplicates))

    return nonDuplicates



descriptors.extend([router for router in document])

Again, there's no reason to do list comprehension to just convert an iterable to a list. Calling...

[router for router in document]

... is the same as...

list(document)


Also changed.

Cheers! -Damian


Thanks for the code review!

comment:18 Changed 3 years ago by isis

  • Status changed from needs_revision to needs_review

comment:19 Changed 3 years ago by atagar

tl;dr: The datetime module is braindamaged. Or Python2.x string types are braindamaged. Or both.

Curious! I'm not sure why one particular system would exhibit a bytes vs unicode issue. That's certainly a gotcha if you're using python 3.x on some systems, but I've never seen it be related to the platform itself. I'm also unsure why chr(0x20) would've helped since that simply returns a string.

Regardless, if you get a clearer understanding of what value might be either bytes or unicode then we can normalize to avoid that issue.
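One generic way to normalize the ambiguous value before concatenation (a sketch, not BridgeDB code; written for Python 3, where the bytes/text distinction is explicit):

```python
def to_text(value, encoding='utf-8'):
    """Return value as a text string, decoding it first if it is bytes.

    Normalizing at the boundary avoids implicit mixed bytes/unicode
    concatenation like the 'published' line formatting discussed above.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

Both bytes and text inputs then concatenate safely with the "published " prefix.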

Either way (with your code or mine), Stem is getting angry about the flag-thresholds line in the networkstatus file:

Oh, ok. So there *are* some network status document headers. Honestly if you don't care about these headers you could just skip down to the first router status entry then use router_status_entry._parse_file() as suggested.

if fingerprint in nonDuplicates.keys():

You don't need to call keys() here. Iteration and membership checks on a dictionary already go over its keys, so this is equivalent to...

if fingerprint in nonDuplicates:

Glad you found the review helpful! -Damian

comment:20 follow-up: Changed 3 years ago by joelanders

I've got some more progress at https://github.com/joelanders/bridgedb/commits/fix/9380-stem_r3. Basically using isis's new functions from descriptors.py in Bridges.py and Main.py. Probably doesn't work quite right yet, and I haven't written any specs yet, but wanted to let someone chime in if I'm going in the wrong direction :)

comment:21 in reply to: ↑ 20 Changed 3 years ago by isis

Replying to joelanders:

I've got some more progress at https://github.com/joelanders/bridgedb/commits/fix/9380-stem_r3. Basically using isis's new functions from descriptors.py in Bridges.py and Main.py. Probably doesn't work quite right yet, and I haven't written any specs yet, but wanted to let someone chime in if I'm going in the wrong direction :)


Thanks! Before getting into code review, I went ahead and tested your patches.

I tested your changes and they failed. The first reason was that #12953 was causing parser errors (not the fault of the parsers, it was Leekspin's fault), so I fixed that, pushed Leekspin-0.2.1, and bumped BridgeDB's Leekspin dependency to 0.2.1.

Because the Leekspin version would need to be bumped for all development branches, I did it in the develop branch. I now have a fix/9380-stem_r4 branch which is my same commits as before (in fix/9380-stem_r3) but with the Leekspin bump commit underneath them.

Next, to test them again, I cherry-picked your commits on top. There's still a few issues shown here in the CI results.

comment:22 Changed 3 years ago by isis

General review, not specifically on any commits:

  1. We don't normally throw out code or unittests; we throw old code into lib/bridgedb/test/deprecated.py, and then use a monkey.MonkeyPatcher in lib/bridgedb/test/test_Tests.py to run unittests with the old code and the new code to ensure compatibility (except in the cases where we want to change behaviours).
  2. In this case, apart from fixing the duplicated transports issue (#11216), I would actually prefer that the parsers behave the same.

Case in point: while it was in the tor-spec.txt (until a few months ago) that the "a" line in a networkstatus-bridges file could be an IPv4 or IPv6 address, in reality, in implementation, this was never true. They only contain IPv6 addresses.

I guess what I mean to say is "just because the spec says something, doesn't at all mean that it's true, because nobody ever sees these descriptors in their unsanitised form (esp. the extrainfo ones) and therefore these things rarely get checked." We need to be certain that BridgeDB will still behave the same way after switching to Stem's parsers, otherwise things might get ugly.

  3. Everything in BridgeDB whose filename starts with a capital letter is slated to be killed with fire. Try not to add new code to those files. Make new lowercased modules wherever you think they should live.

comment:23 Changed 3 years ago by isis

There's some general refactoring work on the bridgedb.Bridges module which this ticket could likely benefit from in my fix/12029-dist-api_r2 branch. I'm going to try to get those patches merge-able now, before the rest of the Distributor API changes (#12029) so that we're not forcing the results from these parsers into the horrible mess of bad code in bridgedb.Bridges.

comment:24 follow-up: Changed 3 years ago by joelanders

Cool, cool, thanks! I shoulda mentioned I hadn't run the existing tests on this new code. Just sketching the outline, mostly. Figuring out where to put things and stuff.

I've run into the specs vs. implementation tension a bit. Stem's RouterStatusEntryV2 doesn't have the or_addresses attribute (corresponding to the "a" line in the network status document); V3 does? Also, I think these or_addresses can have a "portlist" (as called in dir-spec.txt) as opposed to just an integer.

Last edited 3 years ago by joelanders

comment:25 in reply to: ↑ 24 Changed 3 years ago by isis

Replying to joelanders:

Cool, cool, thanks! I shoulda mentioned I hadn't run the existing tests on this new code. Just sketching the outline, mostly. Figuring out where to put things and stuff.

I've run into the specs vs. implementation tension a bit. Stem's RouterStatusEntryV2 doesn't have the or_addresses attribute (corresponding to the "a" line in the network status document); V3 does? Also, I think these or_addresses can have a "portlist" (as called in dir-spec.txt) as opposed to just an integer.


I think both you and Damian are correct; this would be one of the instances where stem.descriptor.router_status_entry.RouterStatusEntryV3 would work better, as it has an or_addresses attribute which would contain the IPv6 addresses from a bridge networkstatus "a"-line. We'll need to switch to using router_status_entry._parse_file() as Damian suggests.

About the portlist for IPv6 addresses in one of these lines... I'm actually not sure what the spec says. In practice, I only recall seeing one IPv6 address and one port to go with it, but that doesn't mean it couldn't be otherwise. You might have to look at tor's source to determine what's allowed.

comment:26 follow-up: Changed 3 years ago by atagar

Also, I think these or_addresses can have a "portlist" (as called in dir-spec.txt) as opposed to just an integer.

Hi joelanders, I think it's a mistake for the spec to say "portlist" there (filed a ticket).

About the portlist for IPv6 addresses in one of these lines... I'm actually not sure what the spec says.

I'm not quite sure what this is in reference to, but the a-line can be IPv4 or IPv6 addresses. Stem's or_addresses attribute is a three value tuple telling you which it is.

comment:27 in reply to: ↑ 26 Changed 3 years ago by isis

  • Summary changed from BridgeDB should use stem for parsing descriptors according to torspec to BridgeDB should use Stem for parsing descriptors

Replying to atagar:

About the portlist for IPv6 addresses in one of these lines... I'm actually not sure what the spec says.

I'm not quite sure what this is in reference to, but the a-line can be IPv4 or IPv6 addresses. Stem's or_addresses attribute is a three value tuple telling you which it is.


tl;dr: The "a"/"or-address" lines, in implementation, only happen once each per router, and only ever contain IPv6 addresses, despite what dir-spec.txt says.


The spec says:

     "a" SP address ":" port NL

        [Any number]

        The "or-address" element as specified in section 2.1.1.

and:

   "or-address" SP ADDRESS ":" PORT NL

       [Any number]
      
       ADDRESS = IP6ADDR | IP4ADDR
       IPV6ADDR = an ipv6 address, surrounded by square brackets.
       IPV4ADDR = an ipv4 address, represented as a dotted quad.
       PORT = a number between 1 and 65535 inclusive.                
       An alternative for the address and ORPort of the "router" line, but with
       two added capabilities:  
      
         * or-address can be either an IPv4 or IPv6 address
         * or-address allows for multiple ORPorts and addresses

       A descriptor SHOULD NOT include an or-address line that does nothing but
       duplicate the address:port pair from its "router" line.

       The ordering of or-address lines and their PORT entries matter because
       Tor MAY accept a limited number of addresses or ports. As of Tor 0.2.3.x
       only the first address and the first port are used.


  • In terms of how many "a"/"or-address" lines there may be, the spec is only correct if you pay super close attention to the last sentence (this is actually the first time I've noticed it :) ).
  • In terms of whether IPv4 and/or IPv6 addresses are acceptable, the spec is currently wrong, according to the functions router_rebuild_descriptor() [source] and router_dump_router_to_string() [source] in src/or/router.c in tor's source code.


Unless we're aiming to build parsers for some other, spec-compliant¹, imaginary OR implementation? Which is totally worth doing, and if that's what Stem is doing, then that's great! But if that is the case, then BridgeDB would likely have to modify Stem's parsing, because it certainly wouldn't be very good if someone were to write an alternate OR implementation which started submitting the fully-spec-compliant "[Any number]" of addresses to BridgeDB and spamming the databases.


¹ I'm ignoring Orchid because last I checked (two months ago), it wasn't what I'd call "spec-compliant".

comment:28 follow-up: Changed 3 years ago by atagar

Unless we're aiming to build parsers for some other, spec-compliant¹, imaginary OR implementation?

We are. Stem aims to conform with tor's spec, not its implementation. If tor doesn't match its spec then that's a tor bug and either tor or its spec needs to change.

Please shoot Nick a ticket if the spec's inaccurate about the 'a' lines.

comment:29 in reply to: ↑ 28 Changed 3 years ago by isis

Replying to atagar:

Please shoot Nick a ticket if the spec's inaccurate about the 'a' lines.


Okay, that ticket is now #13043.

Last edited 3 years ago by isis

comment:30 follow-up: Changed 3 years ago by isis

My latest work is now in fix/9380-stem_r5.

comment:31 Changed 3 years ago by isis

Closed #10725 as a duplicate of this ticket.

comment:32 in reply to: ↑ 30 Changed 3 years ago by isis

  • Status changed from needs_review to needs_revision

Replying to isis:

My latest work is now in fix/9380-stem_r5.


Also, this isn't really ready for review yet. Changing to "needs_revision".

comment:33 Changed 3 years ago by isis

  • Priority changed from normal to blocker

Elevating to "blocker" level because #11216, #12029, #12031, #12505, #12506, and #12872 (and therefore #12843) are all blocking on this.

comment:34 Changed 2 years ago by isis

  • Keywords bridgedb-0.3.x added; bridgedb-0.2.x removed

Nearly done. Latest version is in my fix/9380-stem_r8 branch.

I need a few more unittests and integration tests before I'm comfortable merging such a huge and drastic set of changes.

comment:35 Changed 2 years ago by isis

  • Keywords isis2014Q3Q4 added

comment:36 Changed 2 years ago by isis

  • Keywords isisExB added

comment:37 Changed 2 years ago by isis

  • Status changed from needs_revision to needs_review

The latest changes are in my fix/9380-stem_r10 branch, with 100% unittest coverage for all included changes. (Overall, BridgeDB's test coverage is now up to nearly 80%!) Please review!

comment:38 Changed 2 years ago by isis

  • Keywords bridgedb-0.3.0 added; bridgedb-0.3.x removed

comment:39 Changed 2 years ago by isis

FWIW, from #2895:

BridgeDB's verification chain for descriptors currently (as of #9380) goes like this:

  1. Parse the @type bridge-networkstatus documents in the networkstatus-bridges file.
  2. Create Bridge class instances for each entry we parsed in step 1. Call the Bridge.updateFromNetworkStatus() method with the corresponding networkstatus document for each Bridge. This includes storing the descriptor digest for each Bridge.
  3. Parse the @type bridge-server-descriptors found in the bridge-descriptors file.
  4. Update each Bridge only if the descriptor digest matches the digested value of the @type bridge-server-descriptor that was just parsed.
  5. Store the extra-info-digest from each @type bridge-server-descriptor.
  6. Parse and deduplicate the @type bridge-extrainfo descriptors in cached-extrainfo and cached-extrainfo.new.
  7. Verify the router-signature on the @type bridge-extrainfo descriptor for each bridge, using the signing-key from the Bridge's @type bridge-server-descriptor.
  8. Update the Bridge's PluggableTransport class instances.
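The digest-matching step of this chain can be sketched with hashlib. Per my reading of dir-spec, a server descriptor's digest is the SHA-1 over its text from the start of the "router" line through the newline after "router-signature"; the descriptor below is a fabricated fragment for illustration, not a real bridge descriptor:

```python
import hashlib

def descriptor_digest(descriptor_text):
    """SHA-1 (uppercase hex) over the signed portion of a server descriptor:
    from the start of the 'router' line through 'router-signature\\n'.
    (Sketch; Stem computes this for you on its descriptor classes.)
    """
    start = descriptor_text.index('router ')
    marker = '\nrouter-signature\n'
    end = descriptor_text.index(marker) + len(marker)
    signed = descriptor_text[start:end].encode('utf-8')
    return hashlib.sha1(signed).hexdigest().upper()

# Fabricated descriptor fragment:
descriptor = (
    "router Unnamed 203.0.113.1 9001 0 0\n"
    "published 2014-09-04 12:00:00\n"
    "router-signature\n"
    "-----BEGIN SIGNATURE-----\n...\n-----END SIGNATURE-----\n"
)

# A bridge's networkstatus entry carries this digest; only when it matches
# would we trust and apply the parsed server descriptor.
digest = descriptor_digest(descriptor)
```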

comment:40 follow-up: Changed 2 years ago by atagar

Hi Isis, here's some quick feedback...


from stem.descriptor import extrainfo_descriptor
from stem.descriptor import server_descriptor

Unused.


def parseServerDescriptorsFile(filename, validate=True):
    """Parse a file which contains ``@type bridge-server-descriptor``s.

    .. note:: ``validate`` defaults to ``False`` because there appears to be a
        bug in Leekspin, the fake descriptor generator, where Stem thinks the
        fingerprint doesn't match the key…

    .. note:: We have to lie to Stem, pretending that these are ``@type
        server-descriptor``s, **not** ``@type bridge-server-descriptor``s.
        See ticket `#11257`_.

    .. _`#11257`: https://trac.torproject.org/projects/tor/ticket/11257

    :param str filename: The file to parse descriptors from.
    :param bool validate: Whether or not to validate descriptor
        contents. (default: ``False``)
    :rtype: list
    :returns: A list of
        :api:`stem.descriptor.server_descriptor.RelayDescriptor`s.
    """
    logging.info("Parsing server descriptors with Stem: %s" % filename)
    descriptorType = 'server-descriptor 1.0'
    document = parse_file(filename, descriptorType, validate=validate)
    routers = list(document)
    return routers

This whole function is just 25 lines for doing...

routers = list(parse_file(filename, 'server-descriptor 1.0', validate = validate))

Really doesn't seem like it's providing any value.


lib/bridgedb/configure.py:

def loadConfig(configFile=None, configCls=None):
    """Load configuration settings on top of the current settings.
    ...

You might want to take a look at...

https://stem.torproject.org/api/util/conf.html

This is a module used by stem and arm so they can have nice config files...

https://gitweb.torproject.org/stem.git/tree/test/settings.cfg
https://gitweb.torproject.org/arm.git/tree/armrc.sample

It provides type inferencing, defaults, etc without doing exec() as you presently are (... which is almost never a great idea). You can then use config_dict() to specify the config values each file cares about and their default values. For example...

https://gitweb.torproject.org/arm.git/tree/arm/header_panel.py#n27

Then when the config for that singleton is read all the configurations get the updated values.


document = parse_file(filename, descriptorType, validate=validate)

try:
  for router in document:
    descriptors.append(router)
except (ValueError, ProtocolError) as error:
  logging.error(
    ("Stem exception while parsing extrainfo descriptor from "
     "file '%s':\n%s") % (filename, str(error)))
  _copyUnparseableDescriptorFile(filename)

No reason for the for loop, same as...

descriptors = list(parse_file(filename, descriptorType, validate=validate))

def parseNetworkStatusFile(filename, validate=True, skipAnnotations=True,
                           descriptorClass=RouterStatusEntryV3):
    """Parse a file which contains an ``@type bridge-networkstatus`` document.

    See `ticket #12254 <https://bugs.torproject.org/12254>`__ for why
    networkstatus-bridges documents don't look anything like the networkstatus
    v2 documents that they are purported to look like. They are missing all
    headers, and the entire footer including authority signatures.

Actually, you could probably simply parse it like a normal network status document but without validation.


def deduplicate(descriptors):
    """Deduplicate some descriptors, returning only the newest for each router.

    .. note:: If two descriptors for the same router are discovered, AND both
        descriptors have the **same** published timestamp, then the router's
        fingerprint WILL BE LOGGED ON PURPOSE, because we assume that router
        to be malicious (deliberately, or unintentionally).

    :param list descriptors: A list of
        :api:`stem.descriptor.server_descriptor.RelayDescriptor`s,
        :api:`stem.descriptor.extrainfo_descriptor.BridgeExtraInfoDescriptor`s,
        :api:`stem.descriptor.router_status_entry.RouterStatusEntryV2`s.
    """
    duplicates = {}
    nonDuplicates = {}

    for descriptor in descriptors:
        fingerprint = descriptor.fingerprint

        logging.debug("Deduplicating %s descriptor for router %s"
                      % (str(descriptor.__class__).rsplit('.', 1)[1],
                         safelog.logSafely(fingerprint)))

        if fingerprint in nonDuplicates.keys():
            # We already found a descriptor for this fingerprint:
            conflict = nonDuplicates[fingerprint]

            # If the descriptor we are currently parsing is newer than the
            # last one we found:
            if descriptor.published > conflict.published:
                # And if this is the first duplicate we've found for this
                # router, then create a list in the ``duplicates`` dictionary
                # for the router:
                if not fingerprint in duplicates.keys():
                    duplicates[fingerprint] = list()
                # Add this to the list of duplicates for this router:
                duplicates[fingerprint].append(conflict)
                # Finally, put the newest descriptor in the ``nonDuplicates``
                # dictionary:
                nonDuplicates[fingerprint] = descriptor
            # Same thing, but this time the one we're parsing is older:
            elif descriptor.published < conflict.published:
                if not fingerprint in duplicates.keys():
                    duplicates[fingerprint] = list()
                duplicates[fingerprint].append(descriptor)
            # This *shouldn't* happen. It would mean that two descriptors for
            # the same router had the same timestamps, probably meaning there
            # is a severely-messed up OR implementation out there. Let's log
            # its fingerprint (no matter what!) so that we can look up its
            # ``platform`` line in its server-descriptor and tell whoever
            # wrote that code that they're probably (D)DOSing the Tor network.
            else:
                raise DescriptorWarning(
                    ("Duplicate descriptor with identical timestamp (%s) for "
                     "router with fingerprint '%s'!")
                    % (descriptor.published, fingerprint))

        # Hoorah! No duplicates! (yet...)
        else:
            nonDuplicates[fingerprint] = descriptor

    logging.info("Descriptor deduplication finished.")
    logging.info("Number of duplicates: %d" % len(duplicates))
    for (fingerprint, dittos) in duplicates.items():
        logging.info("  For %s: %d duplicates"
                     % (safelog.logSafely(fingerprint), len(dittos)))
    logging.info("Number of non-duplicates: %d" % len(nonDuplicates))

    return nonDuplicates

Off the top of my head think the following does the same as all the above...

# makes a map of {fingerprint => [all descriptors with that fingerprint]}

fingerprint_to_descriptors = {}

for desc in descriptors:
  fingerprint_to_descriptors.setdefault(desc.fingerprint, []).append(desc)

# provide back the latest

return [sorted(descs, key=lambda d: d.published, reverse=True)[0] for descs in fingerprint_to_descriptors.values()]
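
A self-contained, runnable version of the same idea, using stdlib stand-ins (``Desc`` is a hypothetical substitute for a Stem descriptor object, carrying only the two attributes deduplication needs) and ``max()`` per bucket rather than sorting each one:

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for a Stem descriptor object.
Desc = namedtuple("Desc", ["fingerprint", "published"])

def deduplicate(descriptors):
    """Return only the newest descriptor for each fingerprint."""
    by_fingerprint = {}
    for desc in descriptors:
        by_fingerprint.setdefault(desc.fingerprint, []).append(desc)
    # max() with a key picks the newest without sorting the whole bucket.
    return [max(group, key=lambda d: d.published)
            for group in by_fingerprint.values()]

descs = [
    Desc("AAAA", datetime(2014, 1, 1)),
    Desc("AAAA", datetime(2014, 6, 1)),
    Desc("BBBB", datetime(2014, 3, 1)),
]
newest = deduplicate(descs)
assert sorted(d.fingerprint for d in newest) == ["AAAA", "BBBB"]
assert dict((d.fingerprint, d.published) for d in newest)["AAAA"] == datetime(2014, 6, 1)
```

Note this sketch silently keeps the newest descriptor and drops the rest; BridgeDB's version above additionally tracks the duplicates so it can log routers that publish identical timestamps.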

comment:41 Changed 2 years ago by isis

There is a hotfix/9380-duplicate-extrainfo-error branch which fixes a minor error from the above branch.

comment:42 in reply to: ↑ 40 Changed 2 years ago by isis

  • Resolution set to fixed
  • Status changed from needs_review to closed

Replying to atagar:

Hi Isis, here's some quick feedback...


Thanks, atagar!

from stem.descriptor import extrainfo_descriptor
from stem.descriptor import server_descriptor

Unused.


Removed. Thanks!

def parseServerDescriptorsFile(filename, validate=False):
    """Parse a file which contains ``@type bridge-server-descriptor``s.

    .. note:: ``validate`` defaults to ``False`` because there appears to be a
        bug in Leekspin, the fake descriptor generator, where Stem thinks the
        fingerprint doesn't match the key…

    .. note:: We have to lie to Stem, pretending that these are ``@type
        server-descriptor``s, **not** ``@type bridge-server-descriptor``s.
        See ticket `#11257`_.

    .. _`#11257`: https://trac.torproject.org/projects/tor/ticket/11257

    :param str filename: The file to parse descriptors from.
    :param bool validate: Whether or not to validate descriptor
        contents. (default: ``False``)
    :rtype: list
    :returns: A list of
        :api:`stem.descriptor.server_descriptor.RelayDescriptor`s.
    """
    logging.info("Parsing server descriptors with Stem: %s" % filename)
    descriptorType = 'server-descriptor 1.0'
    document = parse_file(filename, descriptorType, validate=validate)
    routers = list(document)
    return routers

This whole function is just 25 lines for doing...

routers = list(parse_file(filename, 'server-descriptor 1.0', validate = validate))

Really doesn't seem like it's providing any value.

Yep… but adding a log statement in there makes it satisfy #9377. And documentation is important (rather than just importing a magickal function from Stem without explaining what is happening). And someday, with #12506, BridgeDB's database manager is going to do this parsing, so having everything neatly contained within the bridgedb.parse.descriptors module will come in handy when it comes time to separate them (rather than just calling list(parse_file([…])) in bridgedb.Main.load()).

lib/bridgedb/configure.py:

def loadConfig(configFile=None, configCls=None):
    """Load configuration settings on top of the current settings.
    ...

You might want to take a look at...

https://stem.torproject.org/api/util/conf.html

This is a module used by stem and arm so they can have nice config files...

https://gitweb.torproject.org/stem.git/tree/test/settings.cfg
https://gitweb.torproject.org/arm.git/tree/armrc.sample

It provides type inferencing, defaults, etc without doing exec() as you presently are (... which is almost never a great idea). You can then use config_dict() to specify the config values each file cares about and their default values. For example...

https://gitweb.torproject.org/arm.git/tree/arm/header_panel.py#n27

Then when the config for that singleton is read all the configurations get the updated values.


Awesome! Yes, BridgeDB had a particularly nasty manner of handling config files when I started working on it. Not to mention that it made the servers reset themselves to whatever settings the config file had when they were first started (#9277). I didn't change the config parsing much, I just hacked in the parsing for updating configs while running, and (de)serialising them.

If you or someone else ever feels like hacking on making BridgeDB's configs saner… I'd be pretty happy about it. Otherwise I'll likely check out your Stem config utility at some point.

document = parse_file(filename, descriptorType, validate=validate)

try:
  for router in document:
    descriptors.append(router)
except (ValueError, ProtocolError) as error:
  logging.error(
    ("Stem exception while parsing extrainfo descriptor from "
     "file '%s':\n%s") % (filename, str(error)))
  _copyUnparseableDescriptorFile(filename)

No reason for the for loop, same as...

descriptors = list(parse_file(filename, descriptorType, validate=validate))


I recall that that didn't work for catching the unparseable descriptors and saving them to separate files, and/or that it broke my integration tests? I'll have to look at that more.
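
For context on why the loop could matter here: since parse_file() is a generator, the exception surfaces at the same point in iteration either way, but item-by-item consumption keeps the descriptors yielded before the error, while list() raises before anything is bound and so discards the partial results. A stdlib sketch (``flaky_parser`` is a hypothetical stand-in for the Stem generator):

```python
def flaky_parser():
    """Hypothetical stand-in for Stem's parse_file() generator:
    yields one good descriptor, then hits a malformed one."""
    yield "desc-1"
    raise ValueError("malformed descriptor")

# Iterating item-by-item keeps whatever was yielded before the error...
collected = []
try:
    for item in flaky_parser():
        collected.append(item)
except ValueError:
    pass
assert collected == ["desc-1"]

# ...whereas list() raises before the assignment happens, so the
# partial results are lost.
try:
    everything = list(flaky_parser())
except ValueError:
    everything = None
assert everything is None
```

Either way, per-descriptor recovery (skipping only the bad descriptor and continuing) isn't possible from outside the generator; the exception ends iteration, which is why the unparseable file gets copied aside wholesale.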

def parseNetworkStatusFile(filename, validate=True, skipAnnotations=True,
                           descriptorClass=RouterStatusEntryV3):
    """Parse a file which contains an ``@type bridge-networkstatus`` document.

    See `ticket #12254 <https://bugs.torproject.org/12254>`__ for why
    networkstatus-bridges documents don't look anything like the networkstatus
    v2 documents that they are purported to look like. They are missing all
    headers, and the entire footer including authority signatures.

Actually, you could probably simply parse it like a normal network status document but without validation.


Validation seems like a good idea?

Someday I'll just fix the BridgeAuthority. Or kill it, since it basically does little more than collect everything that comes its way and then shove an everything.tar.gz at BridgeDB.

---

I've included fixes for some of your suggestions in the same fix/9380-stem_r10 branch.

Thanks, atagar!

comment:43 Changed 2 years ago by atagar

My pleasure. Great work, Isis!

comment:44 Changed 17 months ago by isis

  • Severity set to Normal

This branch is now merged into BridgeDB-0.3.4.

Note: See TracTickets for help on using tickets.