Opened 6 years ago

Last modified 4 months ago

#12505 new defect

Refactor BridgeDB's hashrings

Reported by: isis Owned by:
Priority: High Milestone:
Component: Circumvention/BridgeDB Version:
Severity: Normal Keywords: bridgedb-dist, bridgedb-1.0.x, ex-sponsor-19
Cc: sysrqb Actual Points:
Parent ID: Points: 5
Reviewer: Sponsor: Sponsor30-can

Description (last modified by isis)

I have slowly been refactoring all of BridgeDB. Code which has been already refactored is named with "proper" (according to PEP8) lower-cased module names in lib/bridgedb in the BridgeDB repository.

Some of the largest, least-unitested, (and most difficult to refactor) sections of BridgeDB's code are the module and the module. This code primarily controls the hashrings and the distributors (which for some reason are subclasses of the very-confused hashrings structures).

The biggest problems are:

  1. The code for the various types of hashrings in bridgedb.Bridges is a complete mess. In some places, hashrings are referred to as BridgeHolders, in other places as Buckets (though not to be confused with the module, and in other places as "hashrings". Subclassing is haphazard and confusing to say the least. In addition, the hashrings are not algorithmically as efficient as they could be. Throughout the hashring code were old-style classes, unused methods, half-implemented methods, and unnecessary parameters. All of this code needs some serious help.

  1. The Distributors in bridgedb.Dist inherit, for some unknown reason, from the improperly implemented "base class" bridgedb.Bridges.BridgeHolder. One, this isn't how one implements a proper Python base class (by deriving from a class with __metaclass__ = abc.ABCMeta). Two, why a Distributor should be a subclass of a the "base" hashring class is unclear and unnecessary, and we should move away from that. A Distributor is something which distributes bridges to users, not some weird half-thought-out hashring subclass.

  1. The various Distributors in bridgedb.Dist should be different modules.

  1. Almost none of the code in and is unittested. These modules have the highest number of untested lines of code currently in BridgeDB.

After this is finished, I am mostly willing to tag BridgeDB-1.0.0. There may be a few other refactoring that should get finished before then, but this is the main piece remaining to be completed.

Child Tickets

Change History (12)

comment:1 Changed 6 years ago by isis

Status: newassigned

This is currently blocked on #9380. However, for #9380, I have started to redesign large section of bridgedb.Bridges, so part of this will already be done by the time it is no longer blocked.

comment:2 Changed 5 years ago by isis

Keywords: isis2014Q3Q4 added

comment:3 Changed 5 years ago by isis

Keywords: isisExB added

comment:4 Changed 5 years ago by isis

Some problems in lib/bridgedb/ (from a comment on #4771):

The function is bridgedb.Dist.IPBasedDistributor.getBridgesForIP(), and it currently looks like:

    def getBridgesForIP(self, ip, epoch, N=1, countryCode=None,
        """Return a list of bridges to give to a user.

        :param str ip: The user's IP address, as a dotted quad.
        :param str epoch: The time period when we got this request.  This can
                          be any string, so long as it changes with every
        :param int N: The number of bridges to try to give back. (default: 1)
        :param str countryCode: The two-letter geoip country code of the
            client's IP address. If given, the client will be placed in that
            "area". Clients within the same area receive the same bridges per
            period. If not given, the **ip** is truncated to it's CIDR /24
            representation and used as the "area". (default: None)
        :param list bridgeFilterRules: A list of callables used filter the
                                       bridges returned in the response to the
                                       client. See :mod:`~bridgedb.Filters`.
        :rtype: list
        :return: A list of :class:`~bridgedb.Bridges.Bridge`s to include in
                 the response. See
                 for an example of how this is used.
        """"Attempting to return %d bridges to client %s..."
                     % (N, ip))

        if not bridgeFilterRules:

        if not len(self.splitter):
            logging.warn("Bailing! Splitter has zero bridges!")
            return []

        logging.debug("Bridges in splitter:\t%d" % len(self.splitter))
        logging.debug("Client request epoch:\t%s" % epoch)
        logging.debug("Active bridge filters:\t%s"
                      % ' '.join([x.func_name for x in bridgeFilterRules]))

        area = self.areaMapper(ip)
        logging.debug("IP mapped to area:\t%s"
                      % logSafely("{0}.0/24".format(area)))

        key1 = ''
        pos = 0
        n = self.nClusters

        # only one of ip categories or area clustering is active
        # try to match the request to an ip category
        for category in self.categories:
            # IP Categories
            if category.contains(ip):
                g = filterAssignBridgesToRing(self.splitter.hmac,
                                              self.nClusters +
      "category<%s>%s", epoch, logSafely(area))
                pos = self.areaOrderHmac("category<%s>%s" % (epoch, area))
                key1 = getHMAC(self.splitter.key,
                               "Order-Bridges-In-Ring-%d" % n)
            n += 1

        # if no category matches, use area clustering
            # IP clustering
            h = int( self.areaClusterHmac(area)[:8], 16)
            # length of numClusters
            clusterNum = h % self.nClusters

            g = filterAssignBridgesToRing(self.splitter.hmac,
                                          self.nClusters +
            pos = self.areaOrderHmac("<%s>%s" % (epoch, area))
            key1 = getHMAC(self.splitter.key,
                           "Order-Bridges-In-Ring-%d" % clusterNum)

        # try to find a cached copy
        ruleset = frozenset(bridgeFilterRules)

        # See if we have a cached copy of the ring,
        # otherwise, add a new ring and populate it
        if ruleset in self.splitter.filterRings.keys():
            logging.debug("Cache hit %s" % ruleset)
            _, ring = self.splitter.filterRings[ruleset]

        # else create the ring and populate it
            logging.debug("Cache miss %s" % ruleset)
            ring = bridgedb.Bridges.BridgeRing(key1, self.answerParameters)

        # get an appropriate number of bridges
        numBridgesToReturn = getNumBridgesPerAnswer(ring,
        answer = ring.getBridges(pos, numBridgesToReturn)
        return answer

A couple things to note:

  • The countryCode parameter is entirely unused. What was it for?

The only place in BridgeDB's code where IPBasedDistributor.getBridgesForIP() is called is in bridgedb.HTTPServer.WebResourceBridges.getBridgeRequestAnswer(), where the countryCode is passed in, and it is the two-letter geoIP countryCode for the client's IP address.

Are we supposed to be grouping clients by which country they are coming from? Shouldn't we group them by whether or they are coming from Tor or another known open proxy, and if not, what country they are coming from?

Or was the countryCode parameter supposed to be the country which the bridge shouldn't be blocked in?

(By the way, the docstring was written by me. It was my best guess as to what countryCode was supposed to be for. There was previously no documentation on it.)

  • Should we still be grouping clients by /24s? What adversary is that effective against? I realise that it isn't very difficult to get a class C subnet, but it isn't very difficult to get addresses in different /24s. Should we make the groups bigger, i.e. group clients by which /16 they are coming from?
  • Why are we still using the /24 (the area) in the code for serving bridges to clients coming from Tor exits? This means that changing your exit node would get you different bridges. (But not the same bridges as people not using Tor.)

(Fixed as part of #4771)

  • It seems like a lot of these bugs come from commit f022b905ca01a193aabd4d78107f27fce85c40cd which implemented #4297 and is where this whole business of making an "IPv5 hashring" and putting Tor users in it came from… we should probably look at the other changes in that commit and review them.

comment:5 Changed 5 years ago by isis

Description: modified (diff)
Summary: Refactor and in BridgeDBRefactor BridgeDB's hashrings

comment:6 Changed 5 years ago by isis

As part of #5418, I ripped out the old COUNTRY_BLOCK_FILES parsers and some old classes for storing information on bridge blocks (from #5027).

Although that code had nothing to do the hashrings, it merits mention here because some of the code was in bridgedb.Bridges. It had literally never been used and needed to be entirely replaced anyway. The tickets for replacing the functionalities which that code was supposed to provide are #12547 and #12031.

That change was made in my fix/12505-5418-remove-old-block-code branch.

comment:7 Changed 4 years ago by isis

Keywords: TorCoreTeam201608 added

Adding to my august tickets.

comment:8 Changed 3 years ago by teor

Severity: Normal

Set all open tickets without a severity to "Normal"

comment:9 Changed 17 months ago by gaba

Cc: isis removed
Keywords: isis2014Q3Q4 isisExB TorCoreTeam201608 removed
Owner: isis deleted
Points: 5
Sponsor: Sponsor19

comment:10 Changed 12 months ago by gaba

Keywords: ex-sponsor-19 added

Adding the keyword to mark everything that didn't fit into the time for sponsor 19.

comment:11 Changed 12 months ago by phw

Sponsor: Sponsor19Sponsor30-can

Moving from Sponsor 19 to Sponsor 30.

comment:12 Changed 4 months ago by teor

Status: assignednew

Change tickets that are assigned to nobody to "new".

Note: See TracTickets for help on using tickets.