Opened 7 days ago

Last modified 3 days ago

#29734 needs_review enhancement

Broker should receive country stats information from Proxy and Client

Reported by: cohosh
Owned by: cohosh
Priority: Medium
Milestone:
Component: Obfuscation/Snowflake
Version:
Severity: Normal
Keywords: snowflake, geoip, stats
Cc: ahf, cohosh, dcf, arlolra
Actual Points:
Parent ID: #29207
Points: 1
Reviewer: ahf
Sponsor: Sponsor19

Description

We can use existing geoip data to collect statistics about where clients are connecting from in order to detect possible blocking events. These should be gathered both from the initial domain-fronted client connection and from the proxies (to be passed to the broker) in order to detect the blocking of individual proxies or the blocking of the WebRTC connections.

Change History (13)

comment:1 Changed 7 days ago by ahf

We have a couple of options here for the implementation:

  • Strictly speaking, the broker doesn't depend on anything Tor, but we could re-use the tor-geoipdb databases that are bundled in Debian/Ubuntu to get updates. These databases have a slightly different format than the official MaxMind GeoIP databases.

Once the broker is able to update per-country stats for the domain-fronted client connection, it should also be able to relay information about which database it is using to the Snowflake proxies, so that they can keep stats about incoming proxy connections from clients and where these are coming from. This would (maybe?) allow us to notice if WebRTC filtering is happening in a country, in that the broker will see multiple connections from the given country but the proxies report no incoming clients from it.

The proxies MUST NOT be required to forward the client IP to the broker, which is why it is better for the proxies to fetch the GeoIP DB from the broker and cache it locally.

The format used by Tor itself is very simple: each entry is an IP encoded as an integer, followed by the country. You keep the entries in an ordered vector and binary search it whenever you need to look up the country for a given IP. The simplicity of this data structure might make it more interesting than MaxMind's binary format, since we need to do the same implementation in both Go and JavaScript.

The Tor implementation can be found in https://github.com/torproject/tor/blob/2f683465d4b666c5d8f84fb3b234ad539d8511cd/src/lib/geoip/geoip.c

The Tor GeoIP database format can be seen here: https://github.com/torproject/tor/tree/master/src/config (see geoip, geoip6 and the mmdb-convert.py conversion script)

comment:2 Changed 7 days ago by cohosh

Owner: set to cohosh
Status: new → assigned

comment:3 in reply to: 1; Changed 6 days ago by cohosh

Replying to ahf:

Once the broker is able to update per-country stats for the domain-fronted client connection, it should also be able to relay information about which database it is using to the Snowflake proxies, so that they can keep stats about incoming proxy connections from clients and where these are coming from. This would (maybe?) allow us to notice if WebRTC filtering is happening in a country, in that the broker will see multiple connections from the given country but the proxies report no incoming clients from it.

Are we already collecting per-country usage stats for snowflake bridges (as we do for other types of bridges)? If so, this might give us what we need automatically for noticing WebRTC filtering. Especially at the moment where there is one broker and one bridge, if clients are able to connect to snowflake proxies, there shouldn't be any censorship related reason that they cannot connect to bridges.

I think per-country usage stats at the broker side are still useful, of course: they give us extra information when clients are able to connect to the broker but are ultimately unable to connect to the snowflake bridge.

On a different note, it might also be useful to us to collect per-country stats on where the proxies are being run from.

comment:4 in reply to:  3 Changed 6 days ago by dcf

Replying to cohosh:

Are we already collecting per-country usage stats for snowflake bridges (as we do for other types of bridges)?

Yes, this was #18628. How it works is, the snowflake proxy (both proxy and proxy-go) forwards the client's IP address to the bridge in a client_ip= URL query parameter. The server then parses it and passes it to tor in the pt.DialOr call. It's similar to what we worked out for meek, which was #13171.
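
As a rough illustration of the server-side parsing step, here is a small Go sketch that pulls client_ip out of a request URL and accepts it only if it parses as an IP address. The function name clientAddr and the example URL are hypothetical, and the hand-off to tor through pt.DialOr is deliberately left out; this is not the actual server code.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// clientAddr extracts the client_ip query parameter from a request URL and
// returns it only if it parses as an IP address. An empty string means the
// parameter was missing or invalid.
func clientAddr(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	ip := net.ParseIP(u.Query().Get("client_ip"))
	if ip == nil {
		return ""
	}
	return ip.String()
}

func main() {
	// Hypothetical bridge URL; 192.0.2.1 is a documentation-only address.
	fmt.Println(clientAddr("wss://bridge.example/?client_ip=192.0.2.1"))
}
```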

I don't think that Snowflake has enough users to show up on any of the by-country graphs at Tor Metrics, but you can see the stats in the uploaded descriptor files. Example: https://collector.torproject.org/archive/bridge-descriptors/extra-infos/bridge-extra-infos-2019-02.tar.xz

$ tar -O -xf bridge-extra-infos-2019-02.tar.xz | grep -A 24 '^extra-info flakey 5481936581E23D2D178105D44DB6915AB06BFB7F$' | grep -E '^dirreq-v3-reqs '
dirreq-v3-reqs ru=16,tr=16,ae=8,cn=8,gb=8,je=8,us=8
dirreq-v3-reqs tr=24,cn=16,ae=8,je=8,nl=8,ru=8,us=8
dirreq-v3-reqs tr=16,cn=8,gb=8,ru=8,us=8
...
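
If we want to work with these lines programmatically, a sketch like the following would turn one dirreq-v3-reqs line into a map of country code to count. The function name parseDirreqV3Reqs is made up for this example, and it handles only the simple cc=n,cc=n shape shown in the output above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDirreqV3Reqs parses a line such as "dirreq-v3-reqs tr=16,cn=8" into a
// map from country code to request count.
func parseDirreqV3Reqs(line string) (map[string]int, error) {
	fields := strings.Fields(line)
	if len(fields) == 0 || fields[0] != "dirreq-v3-reqs" {
		return nil, fmt.Errorf("not a dirreq-v3-reqs line: %q", line)
	}
	counts := make(map[string]int)
	if len(fields) == 1 {
		return counts, nil // keyword present but no entries
	}
	for _, pair := range strings.Split(fields[1], ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed entry: %q", pair)
		}
		n, err := strconv.Atoi(kv[1])
		if err != nil {
			return nil, fmt.Errorf("bad count in %q: %v", pair, err)
		}
		counts[kv[0]] = n
	}
	return counts, nil
}

func main() {
	counts, err := parseDirreqV3Reqs("dirreq-v3-reqs tr=16,cn=8,gb=8,ru=8,us=8")
	if err != nil {
		panic(err)
	}
	fmt.Println(counts["tr"], counts["cn"]) // prints "16 8"
}
```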

If so, this might give us what we need automatically for noticing WebRTC filtering. Especially at the moment where there is one broker and one bridge, if clients are able to connect to snowflake proxies, there shouldn't be any censorship related reason that they cannot connect to bridges.

This logic makes sense to me.

comment:5 Changed 5 days ago by cohosh

Here's a first commit that does something similar to little-t-tor for mapping IP addresses to country codes. The functions parse and load a database file into memory and then binary search the resulting table for a provided address to efficiently find the country code.

https://github.com/cohosh/snowflake/commit/eedca1cbe49ff84468806fd630a9f104d9ca230a

For now I've just included the geoip and geoipv6 database files in the repository... is there any easier way to get these?

comment:6 Changed 5 days ago by cohosh

Status: assigned → needs_review

Here's a merge candidate for geoip in the broker: https://github.com/cohosh/snowflake/compare/geoip

I added some very simple count-based usage statistics for clients (mostly just to show how it works). We can do something a lot nicer here. We can add the same statistics for proxies as well.

comment:7 in reply to:  5 Changed 5 days ago by dcf

Replying to cohosh:

For now I've just included the geoip and geoipv6 database files in the repository... is there any easier way to get these?

One way, if we're comfortable relying on Debian dependencies, is to ask the operator to install the tor-geoipdb or geoip-database package.

In the tests, I would also test an address that maps to "" and perhaps special cases like 127.0.0.1, 0.0.0.0, 255.255.255.255.

comment:8 Changed 4 days ago by ahf

I think this looks good, with a few comments/questions:

  • I don't think we should include the two geoip databases in the repository by default?
  • We should make the path to the two GeoIP databases configurable (either via a command line parameter and/or a small config file?)
  • I don't know if this is a common thing to do in Go code, but in many functional languages that have type aliases, people tend to alias string types to make them "more specific". In this case the country-string type could be called Country, so the metrics table would be a mapping from a Country to a monotonically increasing counter.
  • What should we do with these values when they are here? Should we have an API end-point that can dump them? Should we save them to a log file with some heartbeat interval? Chelsea Komlo showed me a neat library for collecting internal metrics in Go applications, but it might be too early to introduce additional dependencies just for this. It was this library: https://github.com/armon/go-metrics
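
For the Country idea in the list above, a Go version could look roughly like this. It is only a sketch: the Country type, the CountryStats struct and its methods are hypothetical, and the mutex is there on the assumption that the broker updates the counts from several goroutines at once.

```go
package main

import (
	"fmt"
	"sync"
)

// Country is a named string type for country codes, so that signatures and
// map keys say what kind of string they expect.
type Country string

// CountryStats maps each country to a counter that only ever increases.
type CountryStats struct {
	mu     sync.Mutex
	counts map[Country]uint64
}

// NewCountryStats returns an empty, ready-to-use table.
func NewCountryStats() *CountryStats {
	return &CountryStats{counts: make(map[Country]uint64)}
}

// Increment adds one to the counter for c.
func (s *CountryStats) Increment(c Country) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[c]++
}

// Count returns the current counter for c, or zero if c was never seen.
func (s *CountryStats) Count(c Country) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.counts[c]
}

func main() {
	stats := NewCountryStats()
	stats.Increment("tr")
	stats.Increment("tr")
	stats.Increment("cn")
	fmt.Println(stats.Count("tr"), stats.Count("cn"), stats.Count("us")) // prints "2 1 0"
}
```

Note that Go distinguishes a defined type (type Country string) from a true alias (type Country = string); only the former stops a plain string variable from being passed where a Country is expected.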

comment:9 Changed 4 days ago by ahf

Oh, and one more thing I forgot. Should we have a SIGHUP handler that reloads the tables?

comment:10 Changed 3 days ago by cohosh

Reviewer: ahf
Status: needs_review → needs_revision

comment:11 Changed 3 days ago by cohosh

One way, if we're comfortable relying on Debian dependencies, is to ask the operator to install the tor-geoipdb or geoip-database package.

We should make the path to the two GeoIP databases configurable (either via a command line parameter and/or a small config file?)

I think this is the best of both worlds: https://github.com/cohosh/snowflake/commit/fbb87b508641bbbcfd3163d1f2a43b9aff4e0085

The broker now allows the operator to pass in paths to the geoip files (for IPv4 and IPv6) as command-line arguments. The default is the install location of the Debian tor-geoipdb package. If an invalid filename is provided (or none is provided and the package is not installed), the table will fail to load but will not cause any crashes. There's a test for that here: https://github.com/cohosh/snowflake/commit/09dd27f9408b1ff3ff916e374bcd5f659ad5b26b
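
The general shape of this, sketched in Go with the standard flag package, is below. The flag names (-geoipdb, -geoip6db), the default paths under /usr/share/tor, and the helpers parseGeoipFlags and loadTable are assumptions made for this sketch and may not match the commit; loadTable only reads the file and stands in for the real parser.

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"os"
)

// geoipPaths holds the locations of the IPv4 and IPv6 database files.
type geoipPaths struct {
	v4, v6 string
}

// parseGeoipFlags reads the two paths from args. Flag names and default
// paths are assumptions made for this sketch.
func parseGeoipFlags(args []string) (geoipPaths, error) {
	var p geoipPaths
	fs := flag.NewFlagSet("broker", flag.ContinueOnError)
	fs.StringVar(&p.v4, "geoipdb", "/usr/share/tor/geoip", "path to the IPv4 GeoIP database")
	fs.StringVar(&p.v6, "geoip6db", "/usr/share/tor/geoip6", "path to the IPv6 GeoIP database")
	err := fs.Parse(args)
	return p, err
}

// loadTable stands in for the real parser: it only reads the file. A failure
// is returned to the caller rather than ending the process.
func loadTable(path string) ([]byte, error) {
	return os.ReadFile(path)
}

func main() {
	paths, err := parseGeoipFlags(os.Args[1:])
	if err != nil {
		log.Fatal(err)
	}
	for _, path := range []string{paths.v4, paths.v6} {
		if _, err := loadTable(path); err != nil {
			// Keep running without GeoIP data instead of crashing.
			log.Printf("geoip table not loaded: %v", err)
			continue
		}
		fmt.Println("loaded", path)
	}
}
```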

In the tests, I would also test an address that maps to "" and perhaps special cases like 127.0.0.1, 0.0.0.0, 255.255.255.255.

Thanks! Got some bugs :) Here are the tests and fixes: https://github.com/cohosh/snowflake/commit/be4d245375722d958dd85f1a53849cdc37b3382b

comment:12 in reply to:  8 Changed 3 days ago by cohosh

Here's a new candidate: https://github.com/cohosh/snowflake/compare/geoip

In addition to the changes above, here are the other changes I made:

Replying to ahf:

  • I don't know if this is a common thing to do in Go code, but in many functional languages that have type aliases, people tend to alias string types to make them "more specific". In this case the country-string type could be called Country, so the metrics table would be a mapping from a Country to a monotonically increasing counter.

I did that for CountryStats (which is the map from country codes to counts). Is doing this for the country strings too noisy?

  • What should we do with these values when they are here? Should we have an API end-point that can dump them? Should we save them to a log file with some heartbeat interval? Chelsea Komlo showed me a neat library for collecting internal metrics in Go applications, but it might be too early to introduce additional dependencies just for this. It was this library: https://github.com/armon/go-metrics

I think it's still very early... my suggestion is to do something simple, close this ticket, and then think about what we want a bit more before adding new dependencies. Right now it just logs the country counts to a log file every hour.

Oh, and one more thing I forgot. Should we have a SIGHUP handler that reloads the tables?

Added.

comment:13 Changed 3 days ago by cohosh

Status: needs_revision → needs_review