Opened 8 years ago

Closed 5 years ago

#4771 closed defect (fixed)

bridgedb should make clearer in its logs which addresses it knows are from bulk-exitlist

Reported by: arma
Owned by: isis
Priority: Low
Component: Circumvention/BridgeDB
Keywords: bridgedb-dist, bridgedb-0.3.2, isis2015Q1Q2, isisExB, isisExC
Cc: isis

Description

When I start bridgedb with loglevel INFO, it says things like

Dec 24 19:58:54 [INFO]   by location set: 59 50 56 50
Dec 24 19:58:54 [INFO]   by category set: 40

and then when I hit it from an IP in the 'category set', it says

Dec 24 19:59:17 [INFO] area is 87.236.194
Dec 24 19:59:17 [INFO] ---------------------------------
Dec 24 19:59:17 [INFO] category<1970>87.236.194
Dec 24 19:59:17 [INFO] Replying to web request from 87.236.194.158.  Parameters were {}

So I assume that means the category set is the same as the set of bridges it gives out when an IP from the bulk-exitlist file asks.

But the assignments.log gives no hint about which bridges are in this category set. Is it a subset of each ring? Or some other set that never gets logged in the assignments list? Confusing.
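For reference, the "category" check amounts to membership in a set of addresses loaded from the bulk-exitlist file: a client is in the category set iff its address appears in that list. A rough sketch (file format and all names here are assumptions, not BridgeDB's actual code):

```python
# Sketch: the bulk-exitlist file is treated as one IPv4 address per
# line; a client matches the category iff its address is in that set.
# parse_exit_list/load_exit_list/in_category are illustrative names.

def parse_exit_list(lines):
    """Collect one IPv4 address per line, skipping blanks and comments."""
    addrs = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            addrs.add(line)
    return addrs

def load_exit_list(path):
    with open(path) as f:
        return parse_exit_list(f)

def in_category(ip, exit_addrs):
    return ip in exit_addrs
```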

Change History (27)

comment:1 Changed 8 years ago by aagbsn

There is an additional ring per IP category. Bridges are allocated with equal probability.
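The equal-probability allocation can be sketched as follows: a keyed HMAC of the bridge's identity, reduced modulo the total ring count (area clusters plus one ring per IP category), assigns each bridge to exactly one sub-ring. Names here are illustrative; the real logic lives in BridgeDB's filterAssignBridgesToRing (and SHA-256 stands in for whatever digest BridgeDB actually uses).

```python
import hashlib
import hmac

# Sketch of "one additional ring per IP category": each bridge lands in
# exactly one of (n_clusters + n_categories) sub-rings, chosen uniformly
# at random (from the adversary's perspective) by a keyed HMAC.

def ring_index(key, bridge_id, n_clusters, n_categories):
    digest = hmac.new(key, bridge_id, hashlib.sha256).hexdigest()
    return int(digest[:8], 16) % (n_clusters + n_categories)
```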

comment:2 in reply to:  1 Changed 5 years ago by isis

Cc: isis added
Owner: set to isis
Priority: normal → minor
Status: new → assigned

Replying to aagbsn:

There is an additional ring per IP category. Bridges are allocated with equal probability.


Ah-ha! So then this is the mysterious extra IP version ring! I've been referring to it as "the IPv5 hashring" this whole time.

That probably should have been documented when the change was added. :(

comment:3 Changed 5 years ago by isis

Keywords: isis2015Q1Q2 isisExB isisExC added

comment:4 Changed 5 years ago by isis

I assume that I shouldn't finish the assignments.log part of this ticket, since (as of #14082) BridgeDB no longer creates assignment log files…

comment:5 Changed 5 years ago by arma

Sounds reasonable. That said, does whatever we replaced it with (the thing that lets atlas tell you how your bridge is being distributed) know about this ipv5 hashring too?

comment:6 in reply to:  5 Changed 5 years ago by isis

Replying to arma:

Sounds reasonable. That said, does whatever we replaced it with (the thing that lets atlas tell you how your bridge is being distributed) know about this ipv5 hashring too?

Onionoo no longer contains 'pool_assignment' fields in its bridge details.json documents, and #13921 is the ticket for changing both Atlas and Globe to no longer have "Pool assignment" fields in the UIs. In Globe's case, it now has a "Transports" field instead which lists the PT methodnames supported by that bridge (or rather, it will, as soon as someone redeploys Globe).

Last edited 5 years ago by isis

comment:7 Changed 5 years ago by arma

Oh. So we are no longer documenting anywhere which pool a bridge was in?

The goal there was to be able to answer research questions like "which distribution strategies cause bridges to get blocked quickly, or cause bridges to get lots of use" or similar questions.

So, if I asked you (1) which distribution strategies (https, https-to-Tor-IPs, email, etc) are being successful right now, how would you go about answering it? And if I asked you (2) what the answer was last November, is the data that you'd use to answer question 1 gone for those trying to answer question 2?

comment:8 in reply to:  7 Changed 5 years ago by isis

Replying to arma:

Oh. So we are no longer documenting anywhere which pool a bridge was in?

Correct. I have an email from Karsten on 8 December 2014 which said that Karsten had disabled the code for yatei to sanitise BridgeDB's assignments.log files, and that the code would be removed entirely from metrics-lib within a couple days if nothing broke.

The goal there was to be able to answer research questions like "which distribution strategies cause bridges to get blocked quickly, or cause bridges to get lots of use" or similar questions.

Right; I understand and agree. But the system we had for doing that was really slow for both the BridgeDB and Metrics servers. Not to mention really buggy.

I don't think that we shouldn't do it; I just think that if we are to do it again, then we probably want to reopen #2755. I would opt for storing the information in one of the databases for #12030 or #12031, and optionally providing some interface to the distributor assignment data, e.g. something like what you were asking for in #7877. And, actually, once #12506 is finished, we could do something like that without placing much extra load on BridgeDB (or, if we wanted to get really fancy, it could even be hosted on a separate BridgeMetrics machine).

So, if I asked you (1) which distribution strategies (https, https-to-Tor-IPs, email, etc) are being successful right now, how would you go about answering it?

Well… if "successful" means "the highest ratio of real clients who are successfully given bridges to bot traffic and everything else", then I'd run some grep queries on the bridgedb.log file to find out those numbers. But what did you mean by "successful"?

And if I asked you (2) what the answer was last November, is the data that you'd use to answer question 1 gone for those trying to answer question 2?

Well, for last November in particular there would not be an issue, because Metrics has data up until 8 December 2014… however, that obviously isn't going to continue answering the question moving forward. (Also January and February 2015 are forever missing.) There is also the issue that the current assignments.log implementation wouldn't exactly have answered these questions, since researchers would probably need a rounded number on how many seeming-unique clients BridgeDB has distributed a given bridge to within a given time period in order to correlate the effectiveness of any bridge distribution strategy to actual bridge client connections and/or bridge bandwidth usage.

comment:9 Changed 5 years ago by isis

Status: assigned → needs_information

Should I assume this ticket is about the "category set" and the "IPv5" ring and all the surrounding terminology and code being really confusing and in general need of a cleanup? Or something else? Or the first thing plus something else?

comment:10 Changed 5 years ago by arma

I would define success of a distribution strategy as a function of how many people are using the bridges that are given out by that strategy.

That means if a strategy never gives bridges to anybody, it would score low. And if it gives out a lot of bridges but they never get used because they got blocked, it would also score low.

If we wanted to get fancier, we would then have a per-country success value. And then we could compare distribution strategies for a given country.

The intuition comes from Damon's Proximax paper from long ago.
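Under this definition, a strategy that never gives out bridges scores low, and so does one whose bridges get blocked and go unused. A minimal sketch of the aggregation (all names and inputs are hypothetical; BridgeDB does not currently export this data):

```python
# Score each distribution strategy by how many users its bridges have.
# bridges_by_strategy: strategy name -> list of bridge ids (hypothetical)
# users_per_bridge: bridge id -> estimated user count (hypothetical)

def strategy_success(bridges_by_strategy, users_per_bridge):
    """Map each strategy name to the total user count of its bridges."""
    return {
        strategy: sum(users_per_bridge.get(b, 0) for b in bridges)
        for strategy, bridges in bridges_by_strategy.items()
    }
```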

comment:11 Changed 5 years ago by arma

Well, it sounds like the code and documentation definitely need cleanup. And also I think we should work towards being able to compute these success metrics as described above, per distribution strategy. That probably involves getting a lot of moving parts to line up.

comment:12 in reply to:  10 Changed 5 years ago by arma

Replying to arma:

The intuition comes from Damon's Proximax paper from long ago.

See the "Putting it all together" section of
https://blog.torproject.org/blog/research-problem-five-ways-test-bridge-reachability

comment:13 in reply to:  10 Changed 5 years ago by isis

Replying to arma:

I would define success of a distribution strategy as a function of how many people are using the bridges that are given out by that strategy.

That means if a strategy never gives bridges to anybody, it would score low. And if it gives out a lot of bridges but they never get used because they got blocked, it would also score low.

If we wanted to get fancier, we would then have a per-country success value. And then we could compare distribution strategies for a given country.

The intuition comes from Damon's Proximax paper from long ago.


Tracking distributor success is an excellent idea; I've added it to #9316.

comment:14 Changed 5 years ago by isis

I am considering changing the way we treat clients making requests from Tor exits. Actually, I'm questioning if this has ever worked.

The part of BridgeDB's codebase where this is supposed to happen is buried deep in lib/bridgedb/Dist.py, in a function that I've hardly touched, and which still contains code from when Nick wrote BridgeDB and the code was kept in SVN. The function is bridgedb.Dist.IPBasedDistributor.getBridgesForIP(), and it currently looks like:

    def getBridgesForIP(self, ip, epoch, N=1, countryCode=None,
                        bridgeFilterRules=None):
        """Return a list of bridges to give to a user.

        :param str ip: The user's IP address, as a dotted quad.
        :param str epoch: The time period when we got this request.  This can
                          be any string, so long as it changes with every
                          period.
        :param int N: The number of bridges to try to give back. (default: 1)
        :param str countryCode: The two-letter geoip country code of the
            client's IP address. If given, the client will be placed in that
            "area". Clients within the same area receive the same bridges per
            period. If not given, the **ip** is truncated to it's CIDR /24
            representation and used as the "area". (default: None)
        :param list bridgeFilterRules: A list of callables used to filter the
                                       bridges returned in the response to the
                                       client. See :mod:`~bridgedb.Filters`.
        :rtype: list
        :return: A list of :class:`~bridgedb.Bridges.Bridge`s to include in
                 the response. See
                 :meth:`bridgedb.HTTPServer.WebResource.getBridgeRequestAnswer`
                 for an example of how this is used.
        """
        logging.info("Attempting to return %d bridges to client %s..."
                     % (N, ip))

        if not bridgeFilterRules:
            bridgeFilterRules=[]

        if not len(self.splitter):
            logging.warn("Bailing! Splitter has zero bridges!")
            return []

        logging.debug("Bridges in splitter:\t%d" % len(self.splitter))
        logging.debug("Client request epoch:\t%s" % epoch)
        logging.debug("Active bridge filters:\t%s"
                      % ' '.join([x.func_name for x in bridgeFilterRules]))

        area = self.areaMapper(ip)
        logging.debug("IP mapped to area:\t%s"
                      % logSafely("{0}.0/24".format(area)))

        key1 = ''
        pos = 0
        n = self.nClusters

        # only one of ip categories or area clustering is active
        # try to match the request to an ip category
        for category in self.categories:
            # IP Categories
            if category.contains(ip):
                g = filterAssignBridgesToRing(self.splitter.hmac,
                                              self.nClusters +
                                              len(self.categories),
                                              n)
                bridgeFilterRules.append(g)
                logging.info("category<%s>%s", epoch, logSafely(area))
                pos = self.areaOrderHmac("category<%s>%s" % (epoch, area))
                key1 = getHMAC(self.splitter.key,
                               "Order-Bridges-In-Ring-%d" % n)
                break
            n += 1

        # if no category matches, use area clustering
        else:
            # IP clustering
            h = int( self.areaClusterHmac(area)[:8], 16)
            # length of numClusters
            clusterNum = h % self.nClusters

            g = filterAssignBridgesToRing(self.splitter.hmac,
                                          self.nClusters +
                                          len(self.categories),
                                          clusterNum)
            bridgeFilterRules.append(g)
            pos = self.areaOrderHmac("<%s>%s" % (epoch, area))
            key1 = getHMAC(self.splitter.key,
                           "Order-Bridges-In-Ring-%d" % clusterNum)

        # try to find a cached copy
        ruleset = frozenset(bridgeFilterRules)

        # See if we have a cached copy of the ring,
        # otherwise, add a new ring and populate it
        if ruleset in self.splitter.filterRings.keys():
            logging.debug("Cache hit %s" % ruleset)
            _, ring = self.splitter.filterRings[ruleset]

        # else create the ring and populate it
        else:
            logging.debug("Cache miss %s" % ruleset)
            ring = bridgedb.Bridges.BridgeRing(key1, self.answerParameters)
            self.splitter.addRing(ring,
                                  ruleset,
                                  filterBridgesByRules(bridgeFilterRules),
                                  populate_from=self.splitter.bridges)

        # get an appropriate number of bridges
        numBridgesToReturn = getNumBridgesPerAnswer(ring,
                                                    max_bridges_per_answer=N)
        answer = ring.getBridges(pos, numBridgesToReturn)
        return answer

A couple things to note:

  • The countryCode parameter is entirely unused. What was it for?

The only place in BridgeDB's code where IPBasedDistributor.getBridgesForIP() is called is in bridgedb.HTTPServer.WebResourceBridges.getBridgeRequestAnswer(), where the countryCode is passed in, and it is the two-letter geoIP countryCode for the client's IP address.

Are we supposed to be grouping clients by which country they are coming from? Shouldn't we group them by whether or not they are coming from Tor or another known open proxy, and, if not, by what country they are coming from?

Or was the countryCode parameter supposed to be the country which the bridge shouldn't be blocked in?

(By the way, the docstring was written by me. It was my best guess as to what countryCode was supposed to be for. There was previously no documentation on it.)

  • Should we still be grouping clients by /24s? What adversary is that effective against? I realise that it isn't very difficult to get a class C subnet, but neither is it difficult to get addresses in different /24s. Should we make the groups bigger, i.e. group clients by which /16 they are coming from?
  • Why are we still using the /24 (the area) in the code for serving bridges to clients coming from Tor exits? This means that changing your exit node would get you different bridges. (But not the same bridges as people not using Tor.)
  • It seems like a lot of these bugs come from commit f022b905ca01a193aabd4d78107f27fce85c40cd which implemented #4297 and is where this whole business of making an "IPv5 hashring" and putting Tor users in it came from… we should probably look at the other changes in that commit and review them.

comment:15 Changed 5 years ago by isis

For now, I propose the following changes:

diff --git i/lib/bridgedb/Dist.py w/lib/bridgedb/Dist.py
index c2a8620..65a2f75 100644
--- i/lib/bridgedb/Dist.py
+++ w/lib/bridgedb/Dist.py
@@ -284,16 +283,21 @@ class IPBasedDistributor(Distributor):
         # try to match the request to an ip category
         for category in self.categories:
             # IP Categories
-            if category.contains(ip):
+            if ip in category:
+                # The tag is a tag applied to a proxy IP address when it is
+                # added to the bridgedb.proxy.ProxySet. For Tor Exit relays,
+                # the default is 'exit_relay'. For other proxies loaded from
+                # the PROXY_LIST_FILES config option, the default tag is the
+                # full filename that the IP address originally came from.
+                tag = category.getTag(ip)
+                logging.info("Client was from known proxy (tag: %s): %s"
+                             % (tag, ip))
                 g = filterAssignBridgesToRing(self.splitter.hmac,
                                               self.nClusters +
                                               len(self.categories),
                                               n)
                 bridgeFilterRules.append(g)
-                logging.info("category<%s>%s", epoch, logSafely(area))
-                pos = self.areaOrderHmac("category<%s>%s" % (epoch, area))
+                pos = self.areaOrderHmac("<%s>known-proxy" % epoch)
                 break
             n += 1

This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.

comment:16 in reply to:  15 ; Changed 5 years ago by isis

Status: needs_information → needs_review

Replying to an email from Robert Ransom:

Replying to isis:

This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.

That was intentional, and (at least back in 2011) arma/Roger considered it a good feature.

At least it should spread out the load due to honest users who obtain bridges by HTTPS-over-Tor better than serving the same small set to all HTTPS-over-Tor bridge users.


Ah, that is a good point! But it also means that the whole subhashring for Tor users can be super easily scraped, meaning that if a user in China has already gotten their Tor working, and then they ask for bridges over Tor, they'll likely get bridges that are already blocked. :(

comment:17 in reply to:  16 ; Changed 5 years ago by isis

Replying to isis:

Replying to an email from Robert Ransom:

Replying to isis:

This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.

That was intentional, and (at least back in 2011) arma/Roger considered it a good feature.

At least it should spread out the load due to honest users who obtain bridges by HTTPS-over-Tor better than serving the same small set to all HTTPS-over-Tor bridge users.


Ah, that is a good point! But it also means that the whole subhashring for Tor users can be super easily scraped, meaning that if a user in China has already gotten their Tor working, and then they ask for bridges over Tor, they'll likely get bridges that are already blocked. :(


If we want to spread out the load more, we could do something like int(ip) % 4 and put that into the HMACed data, in order to split the Tor/proxy users into four groups, with separate bridges for each one. That would still make it impossible to get the whole subhashring in one go.
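A sketch of that idea, assuming the int(ip) % 4 group number is folded into the HMACed position data (names are illustrative, and SHA-256 stands in for whatever digest BridgeDB actually uses):

```python
import hashlib
import hmac
from ipaddress import IPv4Address

# Sketch (not BridgeDB's actual code) of the proposed int(ip) % 4 split:
# Tor/proxy users are spread across four disjoint groups, each mapped to
# its own position in the subhashring, so no single request can
# enumerate the whole known-proxy bridge set.

def proxy_user_position(hmac_key, epoch, ip):
    group = int(IPv4Address(ip)) % 4
    data = ("<%s>known-proxy-%d" % (epoch, group)).encode()
    return hmac.new(hmac_key, data, hashlib.sha256).hexdigest()
```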

comment:18 Changed 5 years ago by isis

Replying to an email from Robert Ransom:

Replying to isis:

  • Should we still be grouping clients by /24s? What adversary is that effective against? I realise that it isn't very difficult to get a class C subnet, but it isn't very difficult to get addresses in different /24s. Should we make the groups bigger, i.e. group clients by which /16 they are coming from?


I thought it was /16, or at least intended to be /16, once, but I was probably confusing BridgeDB with Tor's implicit IP-based ‘families’ (i.e. no two relays in the same /16 will be chosen for the circuit).


Do you think it should be changed to /16? Truncating to /24 just seems like it would stop someone at Noisebridge from getting multiple sets of bridge lines (Noisebridge has a /24). I don't really see what that accomplishes. I thought that the NSA has a bunch of /8s? And China is even crazier: they can just spoof the IP of *anything* in China.

I kind of think we should be grouping clients according to what country they are coming from… that is at least marginally difficult to change.

comment:19 in reply to:  17 Changed 5 years ago by isis

Replying to isis:

Replying to isis:

Replying to an email from Robert Ransom:

Replying to isis:

This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.

That was intentional, and (at least back in 2011) arma/Roger considered it a good feature.

At least it should spread out the load due to honest users who obtain bridges by HTTPS-over-Tor better than serving the same small set to all HTTPS-over-Tor bridge users.


Ah, that is a good point! But it also means that the whole subhashring for Tor users can be super easily scraped, meaning that if a user in China has already gotten their Tor working, and then they ask for bridges over Tor, they'll likely get bridges that are already blocked. :(


If we want to spread out the load more, we could do something like int(ip) % 4 and put that into the HMACed data, in order to split the Tor/proxy users into four groups, with separate bridges for each one. That would still make it impossible to get the whole subhashring in one go.


Okay, I went with doing the int(ip) % 4 thing. See commit 6cfee6452ac63fa019cdf08c1f633dcb9aba8c81 in my fix/4771-log-tor-exits branch.

comment:20 in reply to:  18 ; Changed 5 years ago by isis

Replying to an email from Robert Ransom:

Replying to isis:

Replying to an email from Robert Ransom:

Replying to isis:

  • Should we still be grouping clients by /24s? What adversary is that effective against? I realise that it isn't very difficult to get a class C subnet, but it isn't very difficult to get addresses in different /24s. Should we make the groups bigger, i.e. group clients by which /16 they are coming from?


I thought it was /16, or at least intended to be /16, once, but I was probably confusing BridgeDB with Tor's implicit IP-based ‘families’ (i.e. no two relays in the same /16 will be chosen for the circuit).


Do you think it should be changed to /16? Truncating to /24 just seems like it would stop someone at Noisebridge from getting multiple sets of bridge lines (Noisebridge has a /24). I don't really see what that accomplishes. I thought that the NSA has a bunch of /8s? And China is even crazier: they can just spoof the IP of *anything* in China.

I kind of think we should be grouping clients according to what country they are coming from… that is at least marginally difficult to change.

Think about it this way: If BridgeDB splits the bridge supply by /24, then every bridge provided by HTTPS to a user in China can be obtained by Chinese censors. If BridgeDB splits the bridge supply by /16, then every bridge provided by HTTPS to a user in China can be obtained by Chinese censors. If BridgeDB splits the bridge supply by GeoIP country, then every bridge provided by HTTPS to a user in China can be obtained by Chinese censors, *and* every HTTPS-over-Tor bridge user in China will DDoS the snot out of the same few bridges with honest connection attempts and GFW RST packets.


Yeah, you're right, using the countryCode as the area would be a bad idea. It would give a government which doesn't already have the power to pretend to be any IP within its country exactly that power.

I'm still somewhat inclined to change /24 to /16, if for nothing else than to mimic Tor's behaviour with respect to what constitutes addresses which could feasibly be under the control of the same adversary.

comment:21 Changed 5 years ago by isis

The problems that I noted above with lib/bridgedb/Dist.py have also been noted in #12505, which is the canonical ticket for refactoring all the hashring code, including the code in that file.

comment:22 Changed 5 years ago by isis

Arma also noted on IRC that we need to be certain that the available bridge rotation problems are fixed (#1839) before restricting the sets of bridges available to all Tor/proxy users at a given time.

comment:23 in reply to:  20 ; Changed 5 years ago by isis

Replying to isis:

I'm still somewhat inclined to change /24 to /16, if for nothing else than to mimic Tor's behaviour with respect to what constitutes addresses which could feasibly be under the control of the same adversary.


I'm changing the IPv4 /24 to /16 to mimic Tor's logic for EnforceDistinctSubnets. For IPv6 behaviour, see #15517.
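The change is conceptually just widening the truncation when mapping a client address to its "area"; a sketch (function name illustrative, IPv4 only):

```python
# Sketch of the /24 -> /16 change: truncate a client's IPv4 address to
# its /16 "area", mirroring Tor's EnforceDistinctSubnets grouping.
# area_for_ip is an illustrative name, not BridgeDB's actual areaMapper.

def area_for_ip(ip):
    octets = ip.split(".")
    return "%s.%s" % (octets[0], octets[1])
```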

comment:24 Changed 5 years ago by isis

Keywords: bridgedb-dist bridgedb-0.3.2 added

My fix/4771-log-tor-exits_r1 branch contains the final changes for this ticket. The only new changes are the switch from grouping subnets by /24 to grouping them by /16.

comment:25 in reply to:  23 ; Changed 5 years ago by arma

Replying to isis:

I'm changing the IPv4 /24 to /16 to mimic Tor's logic for EnforceDistinctSubnets.

I haven't looked at the other changes, but this one sounds plausible to me. I imagine we're going to have a much more non-uniform load on bridges given out with this distributor, since some /16s have lots of users in them and some /16s have zero users in them. Or actually, maybe this isn't true, since we're breaking the bridges into a reasonably small number of buckets and mapping each /16 onto a bucket, so there will still be many, many /16s that map to each bucket, thus making the distribution more uniform? Or am I misunderstanding where the design has gone?

comment:26 in reply to:  25 Changed 5 years ago by isis

Replying to arma:

Replying to isis:

I'm changing the IPv4 /24 to /16 to mimic Tor's logic for EnforceDistinctSubnets.

I haven't looked at the other changes, but this one sounds plausible to me. I imagine we're going to have a much more non-uniform load on bridges given out with this distributor, since some /16s have lots of users in them and some /16s have zero users in them. Or actually, maybe this isn't true, since we're breaking the bridges into a reasonably small number of buckets and mapping each /16 onto a bucket, so there will still be many, many /16s that map to each bucket, thus making the distribution more uniform? Or am I misunderstanding where the design has gone?


If the space of all /16s is 2¹⁶, then there are 65536 possible subnets that clients can be in. I'm not sure what fraction of those 65536 is in use by clients at any given point in time, but suppose, as a worst case, that ¼ of them are, i.e. 16384 distinct subnets in use at once. For each in-use subnet to get its own set of bridges, the HTTPS, non-Tor subhashring would need to contain ≳ 16384 × 3 ≈ 49000 bridges (ignoring the overlap of Alice and Bob who end up in adjacent positions in the hashring, such that Alice gets bridges A, B, and C, and Bob gets B, C, and D). If we say that a "normal" number of bridges in the HTTPS hashrings is 3000 (that sounds about right), then (3000/3)/2¹⁶ gives the maximum fraction of those 65536 subnets which may be in use while still (probably, again, roughly ignoring overlap) allowing each set of handed-out bridges to map to only one in-use /16 subnet at a time: (3000/3)/2¹⁶ ≈ 0.0153, or ~1.5% of the possible /16s.

If we think that our users are coming from significantly less than ~1.5% of the possible /16s, then we need to change this number. While there are rather large swaths of IP space not in use by general public end-users, I'm inclined to believe that the percentage in use isn't that low… but that's just an intuition with little evidence to back it.

However, if my intuition and assumptions are correct, we should end up with many multiple /16s mapped to the same set of bridges within a time period, meaning a more uniform distribution. I suspect that previously, there may have been positions within the hashrings which might have only been obtainable via requesting bridges without using Tor from any of the many /8s and /16s assigned to the "DoD", for example. Not that I think that the NSA whistleblowers on those networks shouldn't have bridges, but rather that I would assume their bridges should have sufficient unrelated cover traffic to decrease any potential success rate of correlation attacks. Or does this mean that I should be wanting to map /16s and sets of bridges 1:1 to avoid their colleagues conducting the zig-zag attack you described in point #10 of your blog post? Would it be safer to have a /16 only have one set of bridges (or some sets uniquely assigned to it) for all the potential users in it?
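The back-of-the-envelope numbers above can be checked directly (2¹⁶ = 65536 possible /16s; the bridge and per-answer counts are the rough figures from this comment):

```python
total_subnets = 2 ** 16        # 65536 possible IPv4 /16 subnets
bridges = 3000                 # rough size of the HTTPS hashrings
per_answer = 3                 # bridges handed out per request

# Number of disjoint bridge sets available, and the fraction of /16s
# that could each receive a unique set:
distinct_answers = bridges // per_answer     # 1000
fraction = distinct_answers / total_subnets  # ~0.0153, i.e. ~1.5%

# If a quarter of all /16s were active at once, a 1:1 mapping of
# subnets to bridge sets would need roughly:
needed_bridges = (total_subnets // 4) * per_answer  # 49152, i.e. ~49000
```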

comment:27 Changed 5 years ago by isis

Resolution: fixed
Status: needs_review → closed

My fix/4771-log-tor-exits_r1 branch was merged for the release of bridgedb-0.3.2.
