Opened 8 years ago

Closed 5 years ago

#4771 closed defect (fixed)

bridgedb should make clearer in its logs which addresses it knows are from bulk-exitlist

Reported by: arma
Owned by: isis
Priority: Low
Component: Circumvention/BridgeDB
Keywords: bridgedb-dist, bridgedb-0.3.2, isis2015Q1Q2, isisExB, isisExC
Cc: isis

Description

When I start bridgedb with loglevel INFO, it says things like

Dec 24 19:58:54 [INFO]   by location set: 59 50 56 50
Dec 24 19:58:54 [INFO]   by category set: 40

and then when I hit it from an IP in the 'category set', it says

Dec 24 19:59:17 [INFO] area is 87.236.194
Dec 24 19:59:17 [INFO] ---------------------------------
Dec 24 19:59:17 [INFO] category<1970>87.236.194
Dec 24 19:59:17 [INFO] Replying to web request from 87.236.194.158.  Parameters were {}

So I assume that means the category set is the same as the set of bridges it gives out when an IP from the bulk-exitlist file asks.

But the assignments.log gives no hint about which bridges are in this category set. Is it a subset of each ring? Or some other set that never gets logged in the assignments list? Confusing.
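For reference, the "category" check amounts to membership in a set of addresses loaded from the bulk-exitlist file: a client is in the category set iff its address appears in that list. A rough sketch (file format and all names here are assumptions, not BridgeDB's actual code):

```python
# Sketch: the bulk-exitlist file is treated as one IPv4 address per
# line; a client matches the category iff its address is in that set.
# parse_exit_list/load_exit_list/in_category are illustrative names.

def parse_exit_list(lines):
    """Collect one IPv4 address per line, skipping blanks and comments."""
    addrs = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            addrs.add(line)
    return addrs

def load_exit_list(path):
    with open(path) as f:
        return parse_exit_list(f)

def in_category(ip, exit_addrs):
    return ip in exit_addrs
```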

Change History (27)

comment:1 Changed 8 years ago by aagbsn

There is an additional ring per IP category. Bridges are allocated with equal probability.
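The equal-probability allocation can be sketched as follows: a keyed HMAC of the bridge's identity, reduced modulo the total ring count (area clusters plus one ring per IP category), assigns each bridge to exactly one sub-ring. Names here are illustrative; the real logic lives in BridgeDB's filterAssignBridgesToRing (and SHA-256 stands in for whatever digest BridgeDB actually uses).

```python
import hashlib
import hmac

# Sketch of "one additional ring per IP category": each bridge lands in
# exactly one of (n_clusters + n_categories) sub-rings, chosen uniformly
# at random (from the adversary's perspective) by a keyed HMAC.

def ring_index(key, bridge_id, n_clusters, n_categories):
    digest = hmac.new(key, bridge_id, hashlib.sha256).hexdigest()
    return int(digest[:8], 16) % (n_clusters + n_categories)
```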

comment:2 in reply to:  1 Changed 5 years ago by isis

Cc: isis added
Owner: set to isis
Priority: normal → minor
Status: new → assigned

Replying to aagbsn:

There is an additional ring per IP category. Bridges are allocated with equal probability.


Ah-ha! So then this is the mysterious extra IP version ring! I've been referring to it as "the IPv5 hashring" this whole time.

That probably should have been documented when the change was added. :(

comment:3 Changed 5 years ago by isis

Keywords: isis2015Q1Q2 isisExB isisExC added

comment:4 Changed 5 years ago by isis

I assume that I shouldn't finish the assignments.log part of this ticket, since (as of #14082) BridgeDB no longer creates assignment log files…

comment:5 Changed 5 years ago by arma

Sounds reasonable. That said, does whatever we replaced it with (the thing that lets atlas tell you how your bridge is being distributed) know about this ipv5 hashring too?

comment:6 in reply to:  5 Changed 5 years ago by isis

Replying to arma:

Sounds reasonable. That said, does whatever we replaced it with (the thing that lets atlas tell you how your bridge is being distributed) know about this ipv5 hashring too?

Onionoo no longer contains 'pool_assignment' fields in its bridge details.json documents, and #13921 is the ticket for changing both Atlas and Globe to no longer have "Pool assignment" fields in the UIs. In Globe's case, it now has a "Transports" field instead which lists the PT methodnames supported by that bridge (or rather, it will, as soon as someone redeploys Globe).

Last edited 5 years ago by isis

comment:7 Changed 5 years ago by arma

Oh. So we are no longer documenting anywhere which pool a bridge was in?

The goal there was to be able to answer research questions like "which distribution strategies cause bridges to get blocked quickly, or cause bridges to get lots of use" or similar questions.

So, if I asked you (1) which distribution strategies (https, https-to-Tor-IPs, email, etc) are being successful right now, how would you go about answering it? And if I asked you (2) what the answer was last November, is the data that you'd use to answer question 1 gone for those trying to answer question 2?

comment:8 in reply to:  7 Changed 5 years ago by isis

Replying to arma:

Oh. So we are no longer documenting anywhere which pool a bridge was in?

Correct. I have an email from Karsten on 8 December 2014 which said that Karsten had disabled the code for yatei to sanitise BridgeDB's assignments.log files, and that the code would be removed entirely from metrics-lib within a couple days if nothing broke.

The goal there was to be able to answer research questions like "which distribution strategies cause bridges to get blocked quickly, or cause bridges to get lots of use" or similar questions.

Right; I understand and agree. But the system we had for doing that was really slow for both the BridgeDB and Metrics servers. Not to mention really buggy.

I don't think that we shouldn't do it; I just think that if we are to do it again, then we probably want to reopen #2755. I would opt for storing the information in one of the databases for #12030 or #12031, and optionally providing some interface to the distributor assignment data, e.g. something like what you were asking for in #7877. And, actually, once #12506 is finished, we could do something like that without placing much extra load on BridgeDB (or, if we wanted to get really fancy, it could even be hosted on a separate BridgeMetrics machine).

So, if I asked you (1) which distribution strategies (https, https-to-Tor-IPs, email, etc) are being successful right now, how would you go about answering it?

Well… if "successful" means "the highest ratio of real clients who are successfully given bridges to bot traffic and everything else", then I'd run some grep queries on the bridgedb.log file to find out those numbers. But what did you mean by "successful"?

And if I asked you (2) what the answer was last November, is the data that you'd use to answer question 1 gone for those trying to answer question 2?

Well, for last November in particular there would not be an issue, because Metrics has data up until 8 December 2014… however, that obviously isn't going to continue answering the question moving forward. (Also January and February 2015 are forever missing.) There is also the issue that the current assignments.log implementation wouldn't exactly have answered these questions, since researchers would probably need a rounded number on how many seeming-unique clients BridgeDB has distributed a given bridge to within a given time period in order to correlate the effectiveness of any bridge distribution strategy to actual bridge client connections and/or bridge bandwidth usage.

comment:9 Changed 5 years ago by isis

Status: assigned → needs_information

Should I assume this ticket is about the "category set" and the "IPv5" ring and all the surrounding terminology and code being really confusing and in general need of a cleanup? Or something else? Or the first thing plus something else?

comment:10 Changed 5 years ago by arma

I would define success of a distribution strategy as a function of how many people are using the bridges that are given out by that strategy.

That means if a strategy never gives bridges to anybody, it would score low. And if it gives out a lot of bridges but they never get used because they got blocked, it would also score low.

If we wanted to get fancier, we would then have a per-country success value. And then we could compare distribution strategies for a given country.

The intuition comes from Damon's Proximax paper from long ago.
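Under this definition, a strategy that never gives out bridges scores low, and so does one whose bridges get blocked and go unused. A minimal sketch of the aggregation (all names and inputs are hypothetical; BridgeDB does not currently export this data):

```python
# Score each distribution strategy by how many users its bridges have.
# bridges_by_strategy: strategy name -> list of bridge ids (hypothetical)
# users_per_bridge: bridge id -> estimated user count (hypothetical)

def strategy_success(bridges_by_strategy, users_per_bridge):
    """Map each strategy name to the total user count of its bridges."""
    return {
        strategy: sum(users_per_bridge.get(b, 0) for b in bridges)
        for strategy, bridges in bridges_by_strategy.items()
    }
```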

comment:11 Changed 5 years ago by arma

Well, it sounds like the code and documentation definitely need cleanup. And also I think we should work towards being able to compute these success metrics as described above, per distribution strategy. That probably involves getting a lot of moving parts to line up.

comment:12 in reply to:  10 Changed 5 years ago by arma

Replying to arma:

The intuition comes from Damon's Proximax paper from long ago.

See the "Putting it all together" section of
https://blog.torproject.org/blog/research-problem-five-ways-test-bridge-reachability

comment:13 in reply to:  10 Changed 5 years ago by isis

Replying to arma:

I would define success of a distribution strategy as a function of how many people are using the bridges that are given out by that strategy.

That means if a strategy never gives bridges to anybody, it would score low. And if it gives out a lot of bridges but they never get used because they got blocked, it would also score low.

If we wanted to get fancier, we would then have a per-country success value. And then we could compare distribution strategies for a given country.

The intuition comes from Damon's Proximax paper from long ago.


Tracking distributor success is an excellent idea; I've added it to #9316.

comment:14 Changed 5 years ago by isis

I am considering changing the way we treat clients making requests from Tor exits. Actually, I'm questioning if this has ever worked.

The part of BridgeDB's codebase where this is supposed to happen is buried deep in lib/bridgedb/Dist.py, in a function that I've hardly touched, and which still contains code from when Nick wrote BridgeDB and the code was kept in SVN. The function is bridgedb.Dist.IPBasedDistributor.getBridgesForIP(), and it currently looks like:

    def getBridgesForIP(self, ip, epoch, N=1, countryCode=None,
                        bridgeFilterRules=None):
        """Return a list of bridges to give to a user.

        :param str ip: The user's IP address, as a dotted quad.
        :param str epoch: The time period when we got this request.  This can
                          be any string, so long as it changes with every
                          period.
        :param int N: The number of bridges to try to give back. (default: 1)
        :param str countryCode: The two-letter geoip country code of the
            client's IP address. If given, the client will be placed in that
            "area". Clients within the same area receive the same bridges per
            period. If not given, the **ip** is truncated to it's CIDR /24
            representation and used as the "area". (default: None)
        :param list bridgeFilterRules: A list of callables used to filter the
                                       bridges returned in the response to the
                                       client. See :mod:`~bridgedb.Filters`.
        :rtype: list
        :return: A list of :class:`~bridgedb.Bridges.Bridge`s to include in
                 the response. See
                 :meth:`bridgedb.HTTPServer.WebResource.getBridgeRequestAnswer`
                 for an example of how this is used.
        """
        logging.info("Attempting to return %d bridges to client %s..."
                     % (N, ip))

        if not bridgeFilterRules:
            bridgeFilterRules=[]

        if not len(self.splitter):
            logging.warn("Bailing! Splitter has zero bridges!")
            return []

        logging.debug("Bridges in splitter:\t%d" % len(self.splitter))
        logging.debug("Client request epoch:\t%s" % epoch)
        logging.debug("Active bridge filters:\t%s"
                      % ' '.join([x.func_name for x in bridgeFilterRules]))

        area = self.areaMapper(ip)
        logging.debug("IP mapped to area:\t%s"
                      % logSafely("{0}.0/24".format(area)))

        key1 = ''
        pos = 0
        n = self.nClusters

        # only one of ip categories or area clustering is active
        # try to match the request to an ip category
        for category in self.categories:
            # IP Categories
            if category.contains(ip):
                g = filterAssignBridgesToRing(self.splitter.hmac,
                                              self.nClusters +
                                              len(self.categories),
                                              n)
                bridgeFilterRules.append(g)
                logging.info("category<%s>%s", epoch, logSafely(area))
                pos = self.areaOrderHmac("category<%s>%s" % (epoch, area))
                key1 = getHMAC(self.splitter.key,
                               "Order-Bridges-In-Ring-%d" % n)
                break
            n += 1

        # if no category matches, use area clustering
        else:
            # IP clustering
            h = int( self.areaClusterHmac(area)[:8], 16)
            # length of numClusters
            clusterNum = h % self.nClusters

            g = filterAssignBridgesToRing(self.splitter.hmac,
                                          self.nClusters +
                                          len(self.categories),
                                          clusterNum)
            bridgeFilterRules.append(g)
            pos = self.areaOrderHmac("<%s>%s" % (epoch, area))
            key1 = getHMAC(self.splitter.key,
                           "Order-Bridges-In-Ring-%d" % clusterNum)

        # try to find a cached copy
        ruleset = frozenset(bridgeFilterRules)

        # See if we have a cached copy of the ring,
        # otherwise, add a new ring and populate it
        if ruleset in self.splitter.filterRings.keys():
            logging.debug("Cache hit %s" % ruleset)
            _, ring = self.splitter.filterRings[ruleset]

        # else create the ring and populate it
        else:
            logging.debug("Cache miss %s" % ruleset)
            ring = bridgedb.Bridges.BridgeRing(key1, self.answerParameters)
            self.splitter.addRing(ring,
                                  ruleset,
                                  filterBridgesByRules(bridgeFilterRules),
                                  populate_from=self.splitter.bridges)

        # get an appropriate number of bridges
        numBridgesToReturn = getNumBridgesPerAnswer(ring,
                                                    max_bridges_per_answer=N)
        answer = ring.getBridges(pos, numBridgesToReturn)
        return answer

A couple things to note:

  • The countryCode parameter is entirely unused. What was it for?

The only place in BridgeDB's code where IPBasedDistributor.getBridgesForIP() is called is in bridgedb.HTTPServer.WebResourceBridges.getBridgeRequestAnswer(), where the countryCode is passed in, and it is the two-letter geoIP countryCode for the client's IP address.

Are we supposed to be grouping clients by which country they are coming from? Shouldn't we group them by whether or not they are coming from Tor or another known open proxy, and, if not, by what country they are coming from?

Or was the countryCode parameter supposed to be the country which the bridge shouldn't be blocked in?

(By the way, the docstring was written by me. It was my best guess as to what countryCode was supposed to be for. There was previously no documentation on it.)

  • Should we still be grouping clients by /24s? What adversary is that effective against? I realise that it isn't very difficult to get a class C subnet, but neither is it difficult to get addresses in different /24s. Should we make the groups bigger, i.e. group clients by which /16 they are coming from?
  • Why are we still using the /24 (the area) in the code for serving bridges to clients coming from Tor exits? This means that changing your exit node would get you different bridges. (But not the same bridges as people not using Tor.)
  • It seems like a lot of these bugs come from commit f022b905ca01a193aabd4d78107f27fce85c40cd which implemented #4297 and is where this whole business of making an "IPv5 hashring" and putting Tor users in it came from… we should probably look at the other changes in that commit and review them.

comment:15 Changed 5 years ago by isis

For now, I propose the following changes:

diff --git i/lib/bridgedb/Dist.py w/lib/bridgedb/Dist.py
index c2a8620..65a2f75 100644
--- i/lib/bridgedb/Dist.py
+++ w/lib/bridgedb/Dist.py
@@ -284,16 +283,21 @@ class IPBasedDistributor(Distributor):
         # try to match the request to an ip category
         for category in self.categories:
             # IP Categories
-            if category.contains(ip):
+            if ip in category:
+                # The tag is a tag applied to a proxy IP address when it is
+                # added to the bridgedb.proxy.ProxySet. For Tor Exit relays,
+                # the default is 'exit_relay'. For other proxies loaded from
+                # the PROXY_LIST_FILES config option, the default tag is the
+                # full filename that the IP address originally came from.
+                tag = category.getTag(ip)
+                logging.info("Client was from known proxy (tag: %s): %s"
+                             % (tag, ip))
                 g = filterAssignBridgesToRing(self.splitter.hmac,
                                               self.nClusters +
                                               len(self.categories),
                                               n)
                 bridgeFilterRules.append(g)
-                logging.info("category<%s>%s", epoch, logSafely(area))
-                pos = self.areaOrderHmac("category<%s>%s" % (epoch, area))
+                pos = self.areaOrderHmac("<%s>known-proxy" % epoch)
                 break
             n += 1

This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.

comment:16 in reply to:  15 ; Changed 5 years ago by isis

Status: needs_information → needs_review

Replying to an email from Robert Ransom:

Replying to isis:

This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.

That was intentional, and (at least back in 2011) arma/Roger considered it a good feature.

At least it should spread out the load due to honest users who obtain bridges by HTTPS-over-Tor better than serving the same small set to all HTTPS-over-Tor bridge users.


Ah, that is a good point! But it also means that the whole subhashring for Tor users can be super easily scraped, meaning that if a user in China has already gotten their Tor working, and then they ask for bridges over Tor, they'll likely get bridges that are already blocked. :(

comment:17 in reply to:  16 ; Changed 5 years ago by isis

Replying to isis:

Replying to an email from Robert Ransom:

Replying to isis:

This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.

That was intentional, and (at least back in 2011) arma/Roger considered it a good feature.

At least it should spread out the load due to honest users who obtain bridges by HTTPS-over-Tor better than serving the same small set to all HTTPS-over-Tor bridge users.


Ah, that is a good point! But it also means that the whole subhashring for Tor users can be super easily scraped, meaning that if a user in China has already gotten their Tor working, and then they ask for bridges over Tor, they'll likely get bridges that are already blocked. :(


If we want to spread out the load more, we could do something like int(ip) % 4 and put that into the HMACed data, in order to split the Tor/proxy users into four groups, with separate bridges for each one. That would still make it impossible to get the whole subhashring in one go.
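A sketch of that idea, assuming the int(ip) % 4 group number is folded into the HMACed position data (names are illustrative, and SHA-256 stands in for whatever digest BridgeDB actually uses):

```python
import hashlib
import hmac
from ipaddress import IPv4Address

# Sketch (not BridgeDB's actual code) of the proposed int(ip) % 4 split:
# Tor/proxy users are spread across four disjoint groups, each mapped to
# its own position in the subhashring, so no single request can
# enumerate the whole known-proxy bridge set.

def proxy_user_position(hmac_key, epoch, ip):
    group = int(IPv4Address(ip)) % 4
    data = ("<%s>known-proxy-%d" % (epoch, group)).encode()
    return hmac.new(hmac_key, data, hashlib.sha256).hexdigest()
```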

comment:18 Changed 5 years ago by isis

Replying to an email from Robert Ransom:

Replying to isis:

  • Should we still be grouping clients by /24s? What adversary is that effective against? I realise that it isn't very difficult to get a class C subnet, but it isn't very difficult to get addresses in different /24s. Should we make the groups bigger, i.e. group clients by which /16 they are coming from?


I thought it was /16, or at least intended to be /16, once, but I was probably confusing BridgeDB with Tor's implicit IP-based ‘families’ (i.e. no two relays in the same /16 will be chosen for the circuit).


Do you think it should be changed to /16? Truncating to /24 just seems like it would stop someone at Noisebridge from getting multiple sets of bridge lines (Noisebridge has a /24). I don't really see what that accomplishes. I thought that the NSA has a bunch of /8s? And China is even crazier: they can just spoof the IP of *anything* in China.

I kind of think we should be grouping clients according to what country they are coming from… that is at least marginally difficult to change.

comment:19 in reply to:  17 Changed 5 years ago by isis

Replying to isis:

Replying to isis:

Replying to an email from Robert Ransom:

Replying to isis:

This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.

That was intentional, and (at least back in 2011) arma/Roger considered it a good feature.

At least it should spread out the load due to honest users who obtain bridges by HTTPS-over-Tor better than serving the same small set to all HTTPS-over-Tor bridge users.


Ah, that is a good point! But it also means that the whole subhashring for Tor users can be super easily scraped, meaning that if a user in China has already gotten their Tor working, and then they ask for bridges over Tor, they'll likely get bridges that are already blocked. :(


If we want to spread out the load more, we could do something like int(ip) % 4 and put that into the HMACed data, in order to split the Tor/proxy users into four groups, with separate bridges for each one. That would still make it impossible to get the whole subhashring in one go.


Okay, I went with doing the int(ip) % 4 thing. See commit 6cfee6452ac63fa019cdf08c1f633dcb9aba8c81 in my fix/4771-log-tor-exits branch.

comment:20 in reply to:  18 ; Changed 5 years ago by isis

Replying to an email from Robert Ransom:

Replying to isis:

Replying to an email from Robert Ransom:

Replying to isis:

  • Should we still be grouping clients by /24s? What adversary is that effective against? I realise that it isn't very difficult to get a class C subnet, but it isn't very difficult to get addresses in different /24s. Should we make the groups bigger, i.e. group clients by which /16 they are coming from?


I thought it was /16, or at least intended to be /16, once, but I was probably confusing BridgeDB with Tor's implicit IP-based ‘families’ (i.e. no two relays in the same /16 will be chosen for the circuit).


Do you think it should be changed to /16? Truncating to /24 just seems like it would stop someone at Noisebridge from getting multiple sets of bridge lines (Noisebridge has a /24). I don't really see what that accomplishes. I thought that the NSA has a bunch of /8s? And China is even crazier: they can just spoof the IP of *anything* in China.

I kind of think we should be grouping clients according to what country they are coming from… that is at least marginally difficult to change.

Think about it this way: If BridgeDB splits the bridge supply by /24, then every bridge provided by HTTPS to a user in China can be obtained by Chinese censors. If BridgeDB splits the bridge supply by /16, then every bridge provided by HTTPS to a user in China can be obtained by Chinese censors. If BridgeDB splits the bridge supply by GeoIP country, then every bridge provided by HTTPS to a user in China can be obtained by Chinese censors, *and* every HTTPS-over-Tor bridge user in China will DDoS the snot out of the same few bridges with honest connection attempts and GFW RST packets.


Yeah, you're right, using the countryCode as the area would be a bad idea. It would give a government which doesn't already have the power to pretend to be any IP within its country exactly that power.

I'm still somewhat inclined to change /24 to /16, if for nothing else than to mimic Tor's behaviour with respect to what constitutes addresses which could feasibly be under the control of the same adversary.

comment:21 Changed 5 years ago by isis

The problems that I noted above with lib/bridgedb/Dist.py have also been noted in #12505, which is the canonical ticket for refactoring all the hashring code, including the code in that file.

comment:22 Changed 5 years ago by isis

Arma also noted on IRC that we need to be certain that the available bridge rotation problems are fixed (#1839) before restricting the sets of bridges available to all Tor/proxy users at a given time.

comment:23 in reply to:  20 ; Changed 5 years ago by isis

Replying to isis:

I'm still somewhat inclined to change /24 to /16, if for nothing else than to mimic Tor's behaviour with respect to what constitutes addresses which could feasibly be under the control of the same adversary.


I'm changing the IPv4 /24 to /16 to mimic Tor's logic for EnforceDistinctSubnets. For IPv6 behaviour, see #15517.
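The change is conceptually just widening the truncation when mapping a client address to its "area"; a sketch (function name illustrative, IPv4 only):

```python
# Sketch of the /24 -> /16 change: truncate a client's IPv4 address to
# its /16 "area", mirroring Tor's EnforceDistinctSubnets grouping.
# area_for_ip is an illustrative name, not BridgeDB's actual areaMapper.

def area_for_ip(ip):
    octets = ip.split(".")
    return "%s.%s" % (octets[0], octets[1])
```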

comment:24 Changed 5 years ago by isis

Keywords: bridgedb-dist bridgedb-0.3.2 added

My fix/4771-log-tor-exits_r1 branch contains the final changes for this ticket. The only new changes are the switch from grouping subnets by /24 to grouping them by /16.

comment:25 in reply to:  23 ; Changed 5 years ago by arma

Replying to isis:

I'm changing the IPv4 /24 to /16 to mimic Tor's logic for EnforceDistinctSubnets.

I haven't looked at the other changes, but this one sounds plausible to me. I imagine we're going to have a much more non-uniform load on bridges given out with this distributor, since some /16s have lots of users in them and some /16s have zero users in them. Or actually, maybe this isn't true, since we're breaking the bridges into a reasonably small number of buckets and mapping each /16 onto a bucket, so there will still be many, many /16s that map to each bucket, thus making the distribution more uniform? Or am I misunderstanding where the design has gone?

comment:26 in reply to:  25 Changed 5 years ago by isis

Replying to arma:

Replying to isis:

I'm changing the IPv4 /24 to /16 to mimic Tor's logic for EnforceDistinctSubnets.

I haven't looked at the other changes, but this one sounds plausible to me. I imagine we're going to have a much more non-uniform load on bridges given out with this distributor, since some /16s have lots of users in them and some /16s have zero users in them. Or actually, maybe this isn't true, since we're breaking the bridges into a reasonably small number of buckets and mapping each /16 onto a bucket, so there will still be many, many /16s that map to each bucket, thus making the distribution more uniform? Or am I misunderstanding where the design has gone?


If the space of all /16s is 2¹⁶, then there are 65536 possible subnets that clients can be in. I'm not sure what fraction of those 65536 is in use by clients at any given point in time, but suppose, as a worst case, that ¼ of them are, i.e. 16384 distinct subnets in use at once. For each in-use subnet to get its own set of bridges, the HTTPS, non-Tor subhashring would need to contain ≳ 16384 × 3 ≈ 49000 bridges (ignoring the overlap of Alice and Bob who end up in adjacent positions in the hashring, such that Alice gets bridges A, B, and C, and Bob gets B, C, and D). If we say that a "normal" number of bridges in the HTTPS hashrings is 3000 (that sounds about right), then (3000/3)/2¹⁶ gives the maximum fraction of those 65536 subnets which may be in use while still (probably, again, roughly ignoring overlap) allowing each set of handed-out bridges to map to only one in-use /16 subnet at a time: (3000/3)/2¹⁶ ≈ 0.0153, or ~1.5% of the possible /16s.

If we think that our users are coming from significantly less than ~1.5% of the possible /16s, then we need to change this number. While there are rather large swaths of IP space not in use by general public end-users, I'm inclined to believe that the percentage in use isn't that low… but that's just an intuition with little evidence to back it.

However, if my intuition and assumptions are correct, we should end up with many multiple /16s mapped to the same set of bridges within a time period, meaning a more uniform distribution. I suspect that previously, there may have been positions within the hashrings which might have only been obtainable via requesting bridges without using Tor from any of the many /8s and /16s assigned to the "DoD", for example. Not that I think that the NSA whistleblowers on those networks shouldn't have bridges, but rather that I would assume their bridges should have sufficient unrelated cover traffic to decrease any potential success rate of correlation attacks. Or does this mean that I should be wanting to map /16s and sets of bridges 1:1 to avoid their colleagues conducting the zig-zag attack you described in point #10 of your blog post? Would it be safer to have a /16 only have one set of bridges (or some sets uniquely assigned to it) for all the potential users in it?
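The back-of-the-envelope numbers above can be checked directly (2¹⁶ = 65536 possible /16s; the bridge and per-answer counts are the rough figures from this comment):

```python
total_subnets = 2 ** 16        # 65536 possible IPv4 /16 subnets
bridges = 3000                 # rough size of the HTTPS hashrings
per_answer = 3                 # bridges handed out per request

# Number of disjoint bridge sets available, and the fraction of /16s
# that could each receive a unique set:
distinct_answers = bridges // per_answer     # 1000
fraction = distinct_answers / total_subnets  # ~0.0153, i.e. ~1.5%

# If a quarter of all /16s were active at once, a 1:1 mapping of
# subnets to bridge sets would need roughly:
needed_bridges = (total_subnets // 4) * per_answer  # 49152, i.e. ~49000
```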

comment:27 Changed 5 years ago by isis

Resolution: fixed
Status: needs_review → closed

My fix/4771-log-tor-exits_r1 branch was merged for the release of bridgedb-0.3.2.
