Opened 9 years ago

Closed 7 years ago

#1837 closed defect (fixed)

bridgedb learns to load file of which bridges are blocked where

Reported by: arma Owned by:
Priority: Medium Milestone:
Component: Circumvention/BridgeDB Version:
Severity: Keywords:
Cc: aagbsn@… Actual Points:
Parent ID: #5484 Points:
Reviewer: Sponsor:

Description (last modified by karsten)

#1608 says we'll avoid giving out bridges that are blocked in china, when a user asks for bridges that will work in china.

So as part of that task we need to figure out the file format that we'll write the list of which bridges are unusable from china, and then make bridgedb load, parse, and use it.

Child Tickets

Attachments (1)

filterblocked.patch (18.1 KB) - added by aagbsn 8 years ago.
patch of diff between master and branch filterblocked

Download all attachments as: .zip

Change History (26)

comment:1 Changed 9 years ago by nickm

Status: newneeds_review

comment:2 in reply to:  1 ; Changed 9 years ago by kaner

Replying to nickm:

See https://github.com/aagbsn/bridgedb/commits/filterblocked

Two comments here:

  • Is there a feature planned where we can turn the blocking filter on and off per configuration? That would be very handy I think. Always filtering the results for, say, China currently only would lead to more bridges in China being blocked more effectively.

comment:3 in reply to:  2 Changed 9 years ago by aagbsn

Replying to kaner:

Replying to nickm:

See https://github.com/aagbsn/bridgedb/commits/filterblocked

Two comments here:

I am. Would this be better managed with a separate branch for each bug?

  • Is there a feature planned where we can turn the blocking filter on and off per configuration? That would be very handy I think.

This could be added. I think it should be enabled by default.

Always filtering the results for, say, China currently only would lead to more bridges in China being blocked more effectively.

This isn't true - blocked bridges are just filtered from the results. nickm suggests flagging blocked bridges as 'Might be blocked' in the response.

comment:4 Changed 9 years ago by aagbsn

Cc: aagbsn@… added

comment:5 Changed 8 years ago by karsten

Code looks good. A few comments:

  • Can you rebase this branch to origin/master, too?
  • What's the format of the blocked-bridges file? I think this should go into the bridge spec (which is not in origin/master, yet).
  • I found that you're removing blocked bridges from the result, similar to the /16 filtering. I'm not sure if this is the right thing to do. Maybe it is. I'm adding this comment mostly for myself to think about it when looking at your branch again.
  • You have the line self.db = db = bridgedb.Storage.getDB() in both BridgeBlock and CountryBlock. I think BridgeBlock doesn't need it. Also, what's the db in the middle doing?
  • Can you add two log statements for successfully loading and failing to load the GeoIP database?
  • You have at least two instances of SELECT * FROM in your code where you reference columns by their index. This is fragile, because someone might overlook these places when changing the schema. Better put the column names in here. Also, is the (fingerprint,) and (countryCode,) syntax correct?
  • How would I migrate the schema from version 1 to 2? Can you add a sentence or two to the README for that?

Thanks!

comment:6 in reply to:  5 Changed 8 years ago by aagbsn

Replying to karsten:

Code looks good. A few comments:

  • Can you rebase this branch to origin/master, too?

this seems to be replaying the entire branch over origin/master (again). Perhaps it makes more sense to apply the diff as a patch than to replay the branch on top of origin/master again?

  • What's the format of the blocked-bridges file? I think this should go into the bridge spec (which is not in origin/master, yet).

1 block entry per line: (do not include < >)
fingerprint <bridge fingerprint> country-code <country code>

I've also updated the README accordingly

  • I found that you're removing blocked bridges from the result, similar to the /16 filtering. I'm not sure if this is the right thing to do. Maybe it is. I'm adding this comment mostly for myself to think about it when looking at your branch again.

I believe this is the correct approach -- if bridgedb always returns 3 good bridges, an adversary could iteratively request and block bridges until there are none left.

  • You have the line self.db = db = bridgedb.Storage.getDB() in both BridgeBlock and CountryBlock. I think BridgeBlock doesn't need it. Also, what's the db in the middle doing?
  • Can you add two log statements for successfully loading and failing to load the GeoIP database?

done

  • You have at least two instances of SELECT * FROM in your code where you reference columns by their index. This is fragile, because someone might overlook these places when changing the schema. Better put the column names in here. Also, is the (fingerprint,) and (countryCode,) syntax correct?

yes, the extra ',' is correct.

I have replaced the SELECT *, and added another test for getBlockingCountries to Tests.py.

  • How would I migrate the schema from version 1 to 2? Can you add a sentence or two to the README for that?

there are a couple extra tables that need to be created, and the schema-version needs to be updated:

CREATE TABLE BlockedBridges ( id INTEGER PRIMARY KEY NOT NULL, hex_key, blocking_country);
CREATE INDEX BlockedBridgesBlockingCountry on BlockedBridges(hex_key);
REPLACE INTO Config VALUES ( 'schema-version', 2 );

README also updated.

Thanks!

Thanks for your feedback!

Changed 8 years ago by aagbsn

Attachment: filterblocked.patch added

patch of diff between master and branch filterblocked

comment:7 Changed 8 years ago by karsten

Thanks for making the changes! I ran your code on a local machine and it did what I expected. Yay!

Here are a few more comments, none of them preventing us from merging your patch:

  • There's a typo in the README: "<ountry code>" should be "<country code>"
  • Can you add a short explanation for installing the Maxmind GeoIP library and database to the README? On Debian, this is just "apt-get install python-geoip", but adding a link to http://www.maxmind.com/app/python might be good, too.
  • Should BridgeDB only attempt to load a GeoIP database if we're using a blocked-bridges file? Is there an easy way to change this?
  • I noticed that the way to request unblocked bridges via HTTPS is to request https://bridges.tpo/cc , right? Wouldn't it make sense to resolve the IP address to a country code, now that we have a GeoIP database? No need to implement this now, but I'd like to know if it makes sense to do this in the future.
  • The email distributor doesn't support removing blocked bridges from results, yet, right? Again, no need to implement this now, but how would we implement it? How would people provide their country code here?
  • I'm planning to add the following paragraph to the end of Section 4 of the BridgeDB spec. Does that paragraph make sense to you, and does it cover the most important parts of your patch?

"BridgeDB can be configured to read a file with the fingerprints and country codes where bridges are assumed to be blocked. Bridge users requesting bridges via the HTTPS distributor can specify their country code cc by requesting URL /cc. BridgeDB will then remove all bridges that are assumed to be blocked in that country from the result. The idea is that bridge users don't learn about bridges that very likely don't work for them."

So, I say let's merge your patch. Can you add a single commit on top of origin/master that contains your diff (plus the changes above)? Bonus points for a good commit message. :)

comment:8 Changed 8 years ago by karsten

Description: modified (diff)

Ah, there's one more thing that might be problematic or not:

def getCCFromRequest(request):
    # return the path
    if len(request.path) > 1:
        return request.path[1:3].lower()
    return None

In this code we're using the cc part of the URL /cc requested by a bridge user to determine which bridges should be removed from the result. Should we try harder to make sure that cc are exactly two letters?

comment:9 Changed 8 years ago by karsten

Description: modified (diff)

Oops. I didn't mean to change the description. The sentence "If not, can you add a comment saying why this is safe?" was meant as part of the comment.

comment:10 in reply to:  7 ; Changed 8 years ago by aagbsn

Replying to karsten:

Thanks for making the changes! I ran your code on a local machine and it did what I expected. Yay!

Here are a few more comments, none of them preventing us from merging your patch:

  • There's a typo in the README: "<ountry code>" should be "<country code>"
  • Can you add a short explanation for installing the Maxmind GeoIP library and database to the README? On Debian, this is just "apt-get install python-geoip", but adding a link to http://www.maxmind.com/app/python might be good, too.

all above fixed

  • Should BridgeDB only attempt to load a GeoIP database if we're using a blocked-bridges file? Is there an easy way to change this?

good idea. It is easy to change:

BridgeDB could check to see if the configuration object has a COUNTRY_BLOCK_FILE defined and if so try to import the GeoIP module there.

Probably the best place for this is in the call to addWebServer() -- a similar approach is used to only import twisted ssl support if HTTPS_PORT is defined.

  • I noticed that the way to request unblocked bridges via HTTPS is to request https://bridges.tpo/cc , right? Wouldn't it make sense to resolve the IP address to a country code, now that we have a GeoIP database? No need to implement this now, but I'd like to know if it makes sense to do this in the future.

This is the current behavior. If the client specifies a cc this will override GeoIP.

  • The email distributor doesn't support removing blocked bridges from results, yet, right? Again, no need to implement this now, but how would we implement it? How would people provide their country code here?

Blocked bridge filtering for email will work if a country code is passed to Dist.EmailBasedDistributor.getBridgesForEmail()

People could provide the country code in the subject or body of the email.
This might be a bit confusing because emails sent to bridgedb+cc@tpo allow the user to specify the language of the email and a second cc field may be confusing.

The language of an email doesn't indicate the clients network location, so filtering options should be explicit and not assumed.

  • I'm planning to add the following paragraph to the end of Section 4 of the BridgeDB spec. Does that paragraph make sense to you, and does it cover the most important parts of your patch?

"BridgeDB can be configured to read a file with the fingerprints and country codes where bridges are assumed to be blocked. Bridge users requesting bridges via the HTTPS distributor can specify their country code cc by requesting URL /cc. BridgeDB will then remove all bridges that are assumed to be blocked in that country from the result. The idea is that bridge users don't learn about bridges that very likely don't work for them."

+
BridgeDB will use GeoIP-based country detection to remove blocked bridges from the result if GeoIP support is available.

So, I say let's merge your patch. Can you add a single commit on top of origin/master that contains your diff (plus the changes above)? Bonus points for a good commit message. :)

comment:11 in reply to:  8 Changed 8 years ago by aagbsn

Replying to karsten:

Ah, there's one more thing that might be problematic or not:

def getCCFromRequest(request):
    # return the path
    if len(request.path) > 1:
        return request.path[1:3].lower()
    return None

In this code we're using the cc part of the URL /cc requested by a bridge user to determine which bridges should be removed from the result. Should we try harder to make sure that cc are exactly two letters?

Agreed that it doesn't make sense to service bogus requests.

def getCCFromRequest(request):
    path = request.path.strip('/')
    if len(path) ==  2:
       return path.lower()
   return None

comment:12 in reply to:  9 Changed 8 years ago by aagbsn

Replying to karsten:

Oops. I didn't mean to change the description. The sentence "If not, can you add a comment saying why this is safe?" was meant as part of the comment.

This is safe because the input (country code) is only used as a query against the blocked bridges list. I have updated the function to strip all characters except a-zA-Z

def getCCFromRequest(request):
    path = re.sub(r'[^a-zA-Z]', '', request.path)
    if len(path) ==  2:
        return path.lower()
    return None

comment:13 Changed 8 years ago by karsten

Looks good. Do you want to add the diff as a new single commit (so that you're the author, not I), or shall I do that?

comment:14 Changed 8 years ago by karsten

Here's a last request for change: Instead of filtering blocked bridges from the results, can we flag them as 'Might be blocked' in the response as suggested by Nick above? I think that's the much better behavior here. Otherwise, the user provides us with more information, but gets less information in return. (This also applies to #2678.)

aagbsn, can you prepare a branch with single commit on top of master that I can merge?

comment:15 in reply to:  14 ; Changed 8 years ago by aagbsn

Replying to karsten:

Here's a last request for change: Instead of filtering blocked bridges from the results, can we flag them as 'Might be blocked' in the response as suggested by Nick above? I think that's the much better behavior here. Otherwise, the user provides us with more information, but gets less information in return. (This also applies to #2678.)

aagbsn, can you prepare a branch with single commit on top of master that I can merge?

Done, but I don't think it makes sense for #2678
https://github.com/aagbsn/bridgedb/commits/master

comment:16 in reply to:  15 Changed 8 years ago by karsten

Replying to aagbsn:

Replying to karsten:

aagbsn, can you prepare a branch with single commit on top of master that I can merge?

Done, but I don't think it makes sense for #2678
https://github.com/aagbsn/bridgedb/commits/master

That branch does not contain the #1837 changes in a single commit on top of master. When I look at the history of your master branch, I see 42 commits that are not in origin/master, including recaptcha stuff that shouldn't be there, because it's not part of #1837. I can't merge that branch. Can you clean up the branch to only contain the #1837 parts, or should I do that?

comment:17 Changed 8 years ago by kaner

I'm currently not sure where exactly to look for the code changes to review, so I just tried running the latest master code as a test. A few (minor) observations (I might add more when I figured out what code I need to review):

  • If the "COUNTRY_BLOCK_FILE" doesn't exist, BridgeDB exits with a stack trace. Maybe we should add a more graceful way to exit and a hint for the user. Could BridgeDB maybe add an empty file in case it doesn't find one?
  • The README file doesn't contain hints on how to update the database yet. I see there's some instructions in filterblocked.patch, maybe we should add those.
  • Is there an easy way to make the ReCaptcha stuff barrier-free? (e.g. audio output of the captchas)

Thanks for your patience and code so far!

comment:18 Changed 8 years ago by kaner

Since you've rebased the commits on https://github.com/aagbsn/bridgedb/tree/fix_filterblocked i think not all of my comments above are still valid. (ReCaptcha code isn't in this branch anymore)

Overall looks good to me. Did you test the fix_filterblocked seperately?

comment:19 in reply to:  18 Changed 8 years ago by aagbsn

Replying to kaner:

Since you've rebased the commits on https://github.com/aagbsn/bridgedb/tree/fix_filterblocked i think not all of my comments above are still valid. (ReCaptcha code isn't in this branch anymore)

Overall looks good to me. Did you test the fix_filterblocked seperately?

I ran pylint and rechecked this branch.

comment:20 in reply to:  10 Changed 8 years ago by arma

Replying to aagbsn:

Replying to karsten:

  • The email distributor doesn't support removing blocked bridges from results, yet, right? Again, no need to implement this now, but how would we implement it? How would people provide their country code here?

Blocked bridge filtering for email will work if a country code is passed to Dist.EmailBasedDistributor.getBridgesForEmail()

People could provide the country code in the subject or body of the email.
This might be a bit confusing because emails sent to bridgedb+cc@tpo allow the user to specify the language of the email and a second cc field may be confusing.

The language of an email doesn't indicate the clients network location, so filtering options should be explicit and not assumed.

In practice right the cc in https://bridges.torproject.org/cc is a language code, not a country code. See https://bridges.torproject.org/zh_CN/ for a live example.

So is it possible there's a bug where you're expecting them to say https://bridges.torproject.org/cn ?

comment:21 Changed 7 years ago by karsten

Parent ID: #1608#5484

comment:22 Changed 7 years ago by aagbsn

BridgeDB should be rewritten to take arguments via POST and not try to overload these path schemes.
This is trivial to do. Are there any scripts (other than the ones china runs ;-) ) that will be broken if I change the user interface in this way?

comment:23 in reply to:  22 ; Changed 7 years ago by arma

Replying to aagbsn:

BridgeDB should be rewritten to take arguments via POST and not try to overload these path schemes.
This is trivial to do. Are there any scripts (other than the ones china runs ;-) ) that will be broken if I change the user interface in this way?

I believe nothing uses them currently (including people, for the most part).

comment:24 in reply to:  23 ; Changed 7 years ago by karsten

Replying to aagbsn:

BridgeDB should be rewritten to take arguments via POST and not try to overload these path schemes.

Why the switch to POST? What about GET URLs like https://bridges.torproject.org/index.html?lang=zh_CN&country=cn? That way people would still be able to fetch the list of bridges they want directly and bookmark it rather than having to use some HTML form.

comment:25 in reply to:  24 Changed 7 years ago by aagbsn

Resolution: fixed
Status: needs_reviewclosed

Replying to karsten:

Replying to aagbsn:

BridgeDB should be rewritten to take arguments via POST and not try to overload these path schemes.

Why the switch to POST? What about GET URLs like https://bridges.torproject.org/index.html?lang=zh_CN&country=cn? That way people would still be able to fetch the list of bridges they want directly and bookmark it rather than having to use some HTML form.

Good point. BridgeDB uses GET URLs now for querying bridges by transport, ipv6 support, and unblocked status. Closing this ticket because it was replaced by work done in #5027 and #4297.

Note: See TracTickets for help on using tickets.