When I start bridgedb with loglevel INFO, it says things like
```
Dec 24 19:58:54 [INFO] by location set: 59 50 56 50
Dec 24 19:58:54 [INFO] by category set: 40
```
and then when I hit it from an IP in the 'category set', it says
```
Dec 24 19:59:17 [INFO] area is 87.236.194
Dec 24 19:59:17 [INFO] ---------------------------------
Dec 24 19:59:17 [INFO] category<1970>87.236.194
Dec 24 19:59:17 [INFO] Replying to web request from 87.236.194.158. Parameters were {}
```
So I assume that means the category set is the same as the set of bridges it gives out when an IP from the bulk-exitlist file asks.
But the assignments.log gives no hint about which bridges are in this category set. Is it a subset of each ring? Or some other set that never gets logged in the assignments list? Confusing.
Sounds reasonable. That said, does whatever we replaced it with (the thing that lets atlas tell you how your bridge is being distributed) know about this ipv5 hashring too?
> Sounds reasonable. That said, does whatever we replaced it with (the thing that lets atlas tell you how your bridge is being distributed) know about this ipv5 hashring too?
Onionoo no longer contains 'pool_assignment' fields in its bridge details.json documents, and #13921 (moved) is the ticket for changing both Atlas and Globe to no longer have "Pool assignment" fields in the UIs. In Globe's case, it now has a "Transports" field instead which lists the PT methodnames supported by that bridge (or rather, it will, as soon as someone redeploys Globe).
Oh. So we are no longer documenting anywhere which pool a bridge was in?
The goal there was to be able to answer research questions like "which distribution strategies cause bridges to get blocked quickly, or cause bridges to get lots of use" or similar questions.
So, if I asked you (1) which distribution strategies (https, https-to-Tor-IPs, email, etc) are being successful right now, how would you go about answering it? And if I asked you (2) what the answer was last November, is the data that you'd use to answer question 1 gone for those trying to answer question 2?
> Oh. So we are no longer documenting anywhere which pool a bridge was in?
Correct. I have an email from Karsten on 8 December 2014 which said that Karsten had disabled the code for yatei to sanitise BridgeDB's assignments.log files, and that the code would be removed entirely from metrics-lib within a couple days if nothing broke.
> The goal there was to be able to answer research questions like "which distribution strategies cause bridges to get blocked quickly, or cause bridges to get lots of use" or similar questions.
Right; I understand and agree. But the system we had for doing that was really slow for both the BridgeDB and Metrics servers. Not to mention really buggy.
I don't think that we shouldn't do it; I just think that if we are to do it again, then we probably want to reopen #2755 (moved). I would opt for storing the information in one of the databases for #12030 (moved) or #12031 (moved), and optionally providing some interface to the distributor assignment data, e.g. something like what you were asking for in #7877 (moved). And, actually, once #12506 (moved) is finished, we could do something like that without placing much extra load on BridgeDB (or, if we wanted to get really fancy, it could even be hosted on a separate BridgeMetrics machine).
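To sketch what storing that might look like (the table and column names below are invented for illustration, not anything specified in those tickets), a minimal assignments table would be enough to let someone query historical pool membership later:

```python
# Hypothetical sketch: persist "which distributor was this bridge assigned
# to, and when" in a small table, instead of the old assignments.log files.
import sqlite3

conn = sqlite3.connect("bridge_assignments.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS distributor_assignments (
        fingerprint  TEXT NOT NULL,   -- bridge identity fingerprint
        distributor  TEXT NOT NULL,   -- e.g. 'https', 'email', 'unallocated'
        assigned_at  TEXT NOT NULL    -- ISO-8601 timestamp of the assignment run
    )""")

def recordAssignment(fingerprint, distributor, assignedAt):
    conn.execute("INSERT INTO distributor_assignments VALUES (?, ?, ?)",
                 (fingerprint, distributor, assignedAt))
    conn.commit()

def assignmentsDuring(month):
    """Return (fingerprint, distributor) pairs for a given 'YYYY-MM' month."""
    return conn.execute("SELECT fingerprint, distributor "
                        "FROM distributor_assignments "
                        "WHERE assigned_at LIKE ?", (month + "%",)).fetchall()
```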
> So, if I asked you (1) which distribution strategies (https, https-to-Tor-IPs, email, etc) are being successful right now, how would you go about answering it?
Well… if "successful" means "the highest ratio of real clients who are successfully given bridges to bot traffic and everything else", then I'd run some grep queries on the bridgedb.log file to find out those numbers. But what did you mean by "successful"?
> And if I asked you (2) what the answer was last November, is the data that you'd use to answer question 1 gone for those trying to answer question 2?
Well, for last November in particular there would not be an issue, because Metrics has data up until 8 December 2014… however, that obviously isn't going to continue answering the question moving forward. (Also, January and February 2015 are forever missing.) There is also the issue that the current assignments.log implementation wouldn't exactly have answered these questions: to correlate the effectiveness of a distribution strategy with actual bridge client connections and/or bridge bandwidth usage, researchers would probably need a rounded count of how many seemingly-unique clients BridgeDB has handed a given bridge to within a given time period.
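To make that last point concrete, here is a rough sketch of the kind of aggregate a researcher would want. Nothing in BridgeDB produces this today; the helper names and the rounding granularity are assumptions.

```python
import hashlib
import hmac

def recordRequest(counts, secretKey, fingerprint, period, clientIP):
    """Add one client request to the (bridge, period) bucket.

    ``counts`` maps (fingerprint, period) -> set of HMACed client IDs, so
    that raw client addresses are never stored.  ``secretKey`` is a secret
    bytes key used only for this aggregation.
    """
    clientID = hmac.new(secretKey, clientIP.encode(), hashlib.sha256).hexdigest()
    counts.setdefault((fingerprint, period), set()).add(clientID)

def roundedCounts(counts, granularity=8):
    """Round each bucket's size up to a multiple of ``granularity``."""
    return {bucket: ((len(ids) + granularity - 1) // granularity) * granularity
            for bucket, ids in counts.items()}
```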
Should I assume this ticket is about the "category set" and the "IPv5" ring and all the surrounding terminology and code being really confusing and in general need of a cleanup? Or something else? Or the first thing plus something else?
I would define success of a distribution strategy as a function of how many people are using the bridges that are given out by that strategy.
That means if a strategy never gives bridges to anybody, it would score low. And if it gives out a lot of bridges but they never get used because they got blocked, it would also score low.
If we wanted to get fancier, we would then have a per-country success value. And then we could compare distribution strategies for a given country.
The intuition comes from Damon's Proximax paper from long ago.
Well, it sounds like the code and documentation definitely need cleanup. And I also think we should work towards being able to compute these success metrics as described above, per distribution strategy. That probably involves getting a lot of moving parts to line up.
> I would define success of a distribution strategy as a function of how many people are using the bridges that are given out by that strategy.
> That means if a strategy never gives bridges to anybody, it would score low. And if it gives out a lot of bridges but they never get used because they got blocked, it would also score low.
> If we wanted to get fancier, we would then have a per-country success value. And then we could compare distribution strategies for a given country.
> The intuition comes from Damon's Proximax paper from long ago.
Tracking distributor success is an excellent idea, I've added it to #9316 (moved).
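For concreteness, a rough sketch of that score, assuming (purely hypothetically) that we had both a bridge-to-distributor mapping and a per-bridge usage estimate available; neither mapping exists in this form today, and the function below is an illustration, not anything in BridgeDB or Metrics:

```python
from collections import defaultdict

def distributorSuccess(assignments, usage):
    """Score each distribution strategy by how much its bridges get used.

    ``assignments``: dict mapping bridge fingerprint -> distributor name
                     (e.g. 'https', 'email', 'unallocated').
    ``usage``:       dict mapping bridge fingerprint -> estimated number of
                     clients that used that bridge in some period.
    Returns a dict mapping distributor name -> total estimated clients, so a
    strategy whose bridges are blocked (and therefore unused) scores low even
    if it handed out many of them.
    """
    scores = defaultdict(int)
    for fingerprint, distributor in assignments.items():
        scores[distributor] += usage.get(fingerprint, 0)
    return scores

# A per-country variant would accumulate scores[(distributor, country)]
# instead, given per-country usage estimates.
```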
I am considering changing the way we treat clients making requests from Tor exits. Actually, I'm questioning if this has ever worked.
The part of BridgeDB's codebase where this is supposed to happen is buried deep in lib/bridgedb/Dist.py, in a function that I've hardly touched, and which still contains code from when Nick wrote BridgeDB and the code was kept in SVN. The function is bridgedb.Dist.IPBasedDistributor.getBridgesForIP(), and it currently looks like:
```python
def getBridgesForIP(self, ip, epoch, N=1, countryCode=None, bridgeFilterRules=None):
    """Return a list of bridges to give to a user.

    :param str ip: The user's IP address, as a dotted quad.
    :param str epoch: The time period when we got this request. This can
        be any string, so long as it changes with every period.
    :param int N: The number of bridges to try to give back. (default: 1)
    :param str countryCode: The two-letter geoip country code of the
        client's IP address. If given, the client will be placed in that
        "area". Clients within the same area receive the same bridges per
        period. If not given, the **ip** is truncated to it's CIDR /24
        representation and used as the "area". (default: None)
    :param list bridgeFilterRules: A list of callables used filter the
        bridges returned in the response to the client. See
        :mod:`~bridgedb.Filters`.
    :rtype: list
    :return: A list of :class:`~bridgedb.Bridges.Bridge`s to include in
        the response. See
        :meth:`bridgedb.HTTPServer.WebResource.getBridgeRequestAnswer`
        for an example of how this is used.
    """
    logging.info("Attempting to return %d bridges to client %s..."
                 % (N, ip))

    if not bridgeFilterRules:
        bridgeFilterRules=[]

    if not len(self.splitter):
        logging.warn("Bailing! Splitter has zero bridges!")
        return []

    logging.debug("Bridges in splitter:\t%d" % len(self.splitter))
    logging.debug("Client request epoch:\t%s" % epoch)
    logging.debug("Active bridge filters:\t%s"
                  % ' '.join([x.func_name for x in bridgeFilterRules]))

    area = self.areaMapper(ip)
    logging.debug("IP mapped to area:\t%s"
                  % logSafely("{0}.0/24".format(area)))

    key1 = ''
    pos = 0
    n = self.nClusters

    # only one of ip categories or area clustering is active
    # try to match the request to an ip category
    for category in self.categories:
        # IP Categories
        if category.contains(ip):
            g = filterAssignBridgesToRing(self.splitter.hmac,
                                          self.nClusters +
                                          len(self.categories),
                                          n)
            bridgeFilterRules.append(g)
            logging.info("category<%s>%s", epoch, logSafely(area))
            pos = self.areaOrderHmac("category<%s>%s" % (epoch, area))
            key1 = getHMAC(self.splitter.key,
                           "Order-Bridges-In-Ring-%d" % n)
            break
        n += 1
    # if no category matches, use area clustering
    else:
        # IP clustering
        h = int( self.areaClusterHmac(area)[:8], 16)
        # length of numClusters
        clusterNum = h % self.nClusters
        g = filterAssignBridgesToRing(self.splitter.hmac,
                                      self.nClusters +
                                      len(self.categories),
                                      clusterNum)
        bridgeFilterRules.append(g)
        pos = self.areaOrderHmac("<%s>%s" % (epoch, area))
        key1 = getHMAC(self.splitter.key,
                       "Order-Bridges-In-Ring-%d" % clusterNum)

    # try to find a cached copy
    ruleset = frozenset(bridgeFilterRules)

    # See if we have a cached copy of the ring,
    # otherwise, add a new ring and populate it
    if ruleset in self.splitter.filterRings.keys():
        logging.debug("Cache hit %s" % ruleset)
        _, ring = self.splitter.filterRings[ruleset]
    # else create the ring and populate it
    else:
        logging.debug("Cache miss %s" % ruleset)
        ring = bridgedb.Bridges.BridgeRing(key1, self.answerParameters)
        self.splitter.addRing(ring, ruleset,
                              filterBridgesByRules(bridgeFilterRules),
                              populate_from=self.splitter.bridges)

    # get an appropriate number of bridges
    numBridgesToReturn = getNumBridgesPerAnswer(ring,
                                                max_bridges_per_answer=N)
    answer = ring.getBridges(pos, numBridgesToReturn)
    return answer
```
A couple things to note:
The countryCode parameter is entirely unused. What was it for?
The only place in BridgeDB's code where IPBasedDistributor.getBridgesForIP() is called is in bridgedb.HTTPServer.WebResourceBridges.getBridgeRequestAnswer(), where the countryCode is passed in, and it is the two-letter geoIP countryCode for the client's IP address.
Are we supposed to be grouping clients by which country they are coming from? Shouldn't we group them by whether or not they are coming from Tor or another known open proxy, and, if not, by what country they are coming from?
Or was the countryCode parameter supposed to be the country which the bridge shouldn't be blocked in?
(By the way, the docstring was written by me. It was my best guess as to what countryCode was supposed to be for. There was previously no documentation on it.)
Should we still be grouping clients by /24s? What adversary is that effective against? I realise that it isn't exactly easy to get an entire class C subnet, but it also isn't very difficult to get addresses in several different /24s. Should we make the groups bigger, i.e. group clients by which /16 they are coming from?
Why are we still using the /24 (the area) in the code for serving bridges to clients coming from Tor exits? This means that changing your exit node would get you different bridges. (But not the same bridges as people not using Tor.)
It seems like a lot of these bugs come from commit f022b905ca01a193aabd4d78107f27fce85c40cd which implemented #4297 (moved) and is where this whole business of making an "IPv5 hashring" and putting Tor users in it came from… we should probably look at the other changes in that commit and review them.
```diff
diff --git i/lib/bridgedb/Dist.py w/lib/bridgedb/Dist.py
index c2a8620..65a2f75 100644
--- i/lib/bridgedb/Dist.py
+++ w/lib/bridgedb/Dist.py
@@ -284,16 +283,21 @@ class IPBasedDistributor(Distributor):
         # try to match the request to an ip category
         for category in self.categories:
             # IP Categories
-            if category.contains(ip):
+            if ip in category:
+                # The tag is a tag applied to a proxy IP address when it is
+                # added to the bridgedb.proxy.ProxySet. For Tor Exit relays,
+                # the default is 'exit_relay'. For other proxies loaded from
+                # the PROXY_LIST_FILES config option, the default tag is the
+                # full filename that the IP address originally came from.
+                tag = category.getTag(ip)
+                logging.info("Client was from known proxy (tag: %s): %s"
+                             % (tag, ip))
                 g = filterAssignBridgesToRing(self.splitter.hmac,
                                               self.nClusters +
                                               len(self.categories),
                                               n)
                 bridgeFilterRules.append(g)
-                logging.info("category<%s>%s", epoch, logSafely(area))
-                pos = self.areaOrderHmac("category<%s>%s" % (epoch, area))
+                pos = self.areaOrderHmac("<%s>known-proxy" % epoch)
                 break
             n += 1
```
This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.
> This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.
That was intentional, and (at least back in 2011) arma/Roger considered it a good feature.
At least it should spread out the load due to honest users who obtain bridges by HTTPS-over-Tor better than serving the same small set to all HTTPS-over-Tor bridge users.
Ah, that is a good point! But it also means that the whole subhashring for Tor users can be super easily scraped, meaning that if a user in China has already gotten their Tor working, and then they ask for bridges over Tor, they'll likely get bridges that are already blocked. :(
> This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.
> That was intentional, and (at least back in 2011) arma/Roger considered it a good feature.
> At least it should spread out the load due to honest users who obtain bridges by HTTPS-over-Tor better than serving the same small set to all HTTPS-over-Tor bridge users.
> Ah, that is a good point! But it also means that the whole subhashring for Tor users can be super easily scraped, meaning that if a user in China has already gotten their Tor working, and then they ask for bridges over Tor, they'll likely get bridges that are already blocked. :(
If we want to spread out the load more, we could do something like int(ip) % 4 and put that into the HMACed data, in order to split the Tor/proxy users into four groups, with separate bridges for each one. That would still make it impossible to get the whole subhashring in one go.
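A minimal sketch of that idea (an illustration only, not the code in the branch mentioned below): derive a stable group number from the client's address and mix it into the HMACed data used to pick a position in the proxy sub-hashring.

```python
import socket
import struct

def proxySubgroup(ip, groups=4):
    """Map a dotted-quad IP to one of ``groups`` stable buckets."""
    numeric = struct.unpack("!I", socket.inet_aton(ip))[0]
    return numeric % groups

# Inside getBridgesForIP(), the position for clients coming from known
# proxies/Tor exits could then be derived from the epoch plus the group,
# e.g. (hypothetically):
#
#     group = proxySubgroup(ip)
#     pos = self.areaOrderHmac("known-proxy-group-%d<%s>" % (group, epoch))
#
# so that all clients in the same bucket see the same bridges for that
# period, but no single exit IP can enumerate the whole sub-hashring.
```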
> Should we still be grouping clients by /24s? What adversary is that effective against? I realise that it isn't exactly easy to get an entire class C subnet, but it also isn't very difficult to get addresses in several different /24s. Should we make the groups bigger, i.e. group clients by which /16 they are coming from?
I thought it was /16, or at least intended to be /16, once, but I was probably confusing BridgeDB with Tor's implicit IP-based ‘families’ (i.e. no two relays in the same /16 will be chosen for the circuit).
Do you think it should be changed to /16? Truncating to /24 just seems like it would stop someone at Noisebridge from getting multiple sets of bridge lines (Noisebridge has a /24). I don't really see what that accomplishes. I thought that the NSA has a bunch of /8s? And China is even crazier: they can just spoof the IP of anything in China.
I kind of think we should be grouping clients according to what country they are coming from… that is at least marginally difficult to change.
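For comparison, a hypothetical sketch of that coarser grouping (not what BridgeDB does today; the helper below is invented): group by country when a geoIP lookup succeeds, otherwise fall back to the /16.

```python
def clientArea(ip, countryCode=None):
    """Return the 'area' string used to group clients together.

    Clients in the same area would get the same bridges for a given period.
    """
    if countryCode:
        # Marginally harder to hop between countries than between /24s.
        return "country-%s" % countryCode
    octets = ip.split(".")
    return "%s.%s.0.0/16" % (octets[0], octets[1])
```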
> This fixes the issue with confusing logging, and also fixes the issue that changing your Tor exit gets you different bridges.
> That was intentional, and (at least back in 2011) arma/Roger considered it a good feature.
> At least it should spread out the load due to honest users who obtain bridges by HTTPS-over-Tor better than serving the same small set to all HTTPS-over-Tor bridge users.
> Ah, that is a good point! But it also means that the whole subhashring for Tor users can be super easily scraped, meaning that if a user in China has already gotten their Tor working, and then they ask for bridges over Tor, they'll likely get bridges that are already blocked. :(
> If we want to spread out the load more, we could do something like int(ip) % 4 and put that into the HMACed data, in order to split the Tor/proxy users into four groups, with separate bridges for each one. That would still make it impossible to get the whole subhashring in one go.
Okay, I went with doing the int(ip) % 4 thing. See commit 6cfee6452ac63fa019cdf08c1f633dcb9aba8c81 in my fix/4771-log-tor-exits branch.