Opened 2 months ago

Last modified 10 days ago

#29734 merge_ready enhancement

Broker should receive country stats information from Proxy and Client

Reported by: cohosh
Owned by: cohosh
Priority: Medium
Milestone:
Component: Circumvention/Snowflake
Version:
Severity: Normal
Keywords: snowflake, geoip, stats
Cc: ahf, cohosh, dcf, arlolra, irl, karsten
Actual Points: 2
Parent ID: #29207
Points: 1
Reviewer: ahf
Sponsor: Sponsor19

Description

We can use existing geoip data to collect statistics about where clients are connecting from in order to detect possible blocking events. These should be gathered both from the initial domain-fronted client connection and from the proxies (to be passed to the broker) in order to detect the blocking of individual proxies or the blocking of the WebRTC connections.

Child Tickets

Change History (34)

comment:1 Changed 2 months ago by ahf

We have a couple of options here for the implementation:

  • The broker strictly doesn't depend on anything Tor, but we could re-use the tor-geoipdb databases that are bundled in Debian/Ubuntu to get updates. These databases have a slightly different format than the official MaxMind GeoIP databases.

Once the broker is able to update per-country stats for the domain-fronted client connection, it should also be able to relay information about which database it is using to the Snowflake proxies, such that they can keep stats about incoming proxy connections from clients and where these are coming from. This would (maybe?) allow us to notice if WebRTC filtering is happening in a country, in that the broker will see multiple connections from the given country, but the proxies report no incoming clients from it.

The proxies MUST NOT have to forward the client IP to the broker, which is why it is better for the proxies to fetch the GeoIP DB from the broker and cache it locally.

The format used by Tor itself is very simple (an IP encoded as an integer, followed by the country code), kept in an ordered vector that you binary-search whenever you need to look up the country for a given IP. The simplicity of this data structure might make it more interesting than MaxMind's binary format, since we need to do the same implementation in both Go and JavaScript.
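That lookup can be sketched in Go roughly as follows. This is a simplification with hypothetical names (entry, lookup), not little-t-tor's actual code: a slice of inclusive integer-encoded ranges sorted by their low end, binary-searched with sort.Search.

```go
package main

import (
	"fmt"
	"sort"
)

// entry mirrors one line of Tor's geoip file: an inclusive
// integer-encoded IPv4 range and a two-letter country code.
type entry struct {
	low, high uint32
	cc        string
}

// lookup binary-searches a slice sorted by the low end of each
// (non-overlapping) range.
func lookup(table []entry, ip uint32) string {
	// Find the first entry whose high end is >= ip ...
	i := sort.Search(len(table), func(i int) bool { return table[i].high >= ip })
	// ... and check that ip actually falls inside that range.
	if i < len(table) && table[i].low <= ip {
		return table[i].cc
	}
	return "" // no match
}

func main() {
	table := []entry{ // must be sorted by low
		{16777216, 16777471, "au"},
		{16777472, 16778239, "cn"},
	}
	fmt.Println(lookup(table, 16777473)) // prints "cn"
}
```

The same structure works for IPv6 by comparing 16-byte values instead of uint32s.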

The Tor implementation can be found in https://github.com/torproject/tor/blob/2f683465d4b666c5d8f84fb3b234ad539d8511cd/src/lib/geoip/geoip.c

The Tor GeoIP database format can be seen here: https://github.com/torproject/tor/tree/master/src/config (see geoip, geoip6 and the mmdb-convert.py conversion script)

comment:2 Changed 2 months ago by cohosh

Owner: set to cohosh
Status: new → assigned

comment:3 in reply to:  1 ; Changed 2 months ago by cohosh

Replying to ahf:

Once the broker is able to update per-country stats for the domain-fronted client connection, it should also be able to relay information about which database it is using to the Snowflake proxies, such that they can keep stats about incoming proxy connections from clients and where these are coming from. This would (maybe?) allow us to notice if WebRTC filtering is happening in a country, in that the broker will see multiple connections from the given country, but the proxies report no incoming clients from it.

Are we already collecting per-country usage stats for snowflake bridges (as we do for other types of bridges)? If so, this might give us what we need automatically for noticing WebRTC filtering. Especially now, while there is one broker and one bridge: if clients are able to connect to snowflake proxies, there shouldn't be any censorship-related reason that they cannot connect to bridges.

I think per-country usage stats at the broker side are still useful, of course; they give us extra information when clients are able to connect to the broker but ultimately unable to connect to the snowflake bridge.

On a different note, it might also be useful to us to collect per-country stats on where the proxies are being run from.

comment:4 in reply to:  3 Changed 2 months ago by dcf

Replying to cohosh:

Are we already collecting per-country usage stats for snowflake bridges (as we do for other types of bridges)?

Yes, this was #18628. How it works is: the snowflake proxy (proxy, proxy-go) forwards the client's IP address to the bridge in a client_ip= URL query parameter. Then the server parses it and passes it to tor in the pt.DialOr call. It's similar to what we worked out for meek, which was #13171.

I don't think that Snowflake has enough users to show up on any of the by-country graphs at Tor Metrics, but you can see the stats in the uploaded descriptor files. Example: https://collector.torproject.org/archive/bridge-descriptors/extra-infos/bridge-extra-infos-2019-02.tar.xz

$ tar -O -xf bridge-extra-infos-2019-02.tar.xz | grep -A 24 '^extra-info flakey 5481936581E23D2D178105D44DB6915AB06BFB7F$' | grep -E '^dirreq-v3-reqs '
dirreq-v3-reqs ru=16,tr=16,ae=8,cn=8,gb=8,je=8,us=8
dirreq-v3-reqs tr=24,cn=16,ae=8,je=8,nl=8,ru=8,us=8
dirreq-v3-reqs tr=16,cn=8,gb=8,ru=8,us=8
...

If so, this might give us what we need automatically for noticing WebRTC filtering. Especially now, while there is one broker and one bridge: if clients are able to connect to snowflake proxies, there shouldn't be any censorship-related reason that they cannot connect to bridges.

This logic makes sense to me.

comment:5 Changed 2 months ago by cohosh

Here's a first commit that does something similar to little-t-tor for mapping IP addresses to country codes. The functions parse and load a database file into memory and then binary search that on a provided address to efficiently find the country code.

https://github.com/cohosh/snowflake/commit/eedca1cbe49ff84468806fd630a9f104d9ca230a

For now I've just included the geoip and geoip6 database files in the repository... is there any easier way to get these?

comment:6 Changed 2 months ago by cohosh

Status: assigned → needs_review

Here's a merge candidate for geoip in the broker: https://github.com/cohosh/snowflake/compare/geoip

I added some very simple count-based usage statistics for clients (mostly just to show how it works). We can do something a lot nicer here. We can add the same statistics for proxies as well.

comment:7 in reply to:  5 Changed 2 months ago by dcf

Replying to cohosh:

For now I've just included the geoip and geoip6 database files in the repository... is there any easier way to get these?

One way, if we're comfortable relying on Debian dependencies, is to ask the operator to install tor-geoipdb or geoip-database package.

In the tests, I would also test an address that maps to "" and perhaps special cases like 127.0.0.1, 0.0.0.0, 255.255.255.255.

comment:8 Changed 2 months ago by ahf

I think this looks good, with a few comments/questions:

  • I don't think we should include the two geoip databases in the repository by default?
  • We should make the path to the two GeoIP databases configurable (either via a command line parameter and/or a small config file?)
  • I don't know if this is a common thing to do in Go, but in many functional languages with type aliases, people tend to create aliases for string types to make them "more specific". In this case the country-string type could be called Country, so the metrics table would be a mapping from a Country to a monotonically increasing counter.
  • What should we do with these values when they are here? Should we have an API end-point that can dump them? Should we save them to a log file with some heartbeat interval? Chelsea Komlo showed me a neat library for collecting internal metrics in Go applications, but it might be too early to introduce additional dependencies just for this. It was this library: https://github.com/armon/go-metrics
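The naming suggestion in the third bullet might look like this in Go, using a defined type (Go's closest idiomatic equivalent to the aliases ahf mentions) to make the map's intent explicit. Names here (Country, CountryStats, Increment) are hypothetical, a sketch rather than the branch's actual code:

```go
package main

import "fmt"

// Country is a defined type wrapping string, so signatures say what
// kind of string they expect.
type Country string

// CountryStats maps a country code to a monotonically increasing counter.
type CountryStats map[Country]uint64

// Increment bumps the counter for one country.
func (s CountryStats) Increment(c Country) {
	s[c]++
}

func main() {
	stats := make(CountryStats)
	stats.Increment("ru")
	stats.Increment("ru")
	stats.Increment("tr")
	fmt.Println(stats["ru"], stats["tr"]) // prints "2 1"
}
```

A defined type costs nothing at runtime and lets the compiler catch accidental mixing of country codes with other strings.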

comment:9 Changed 2 months ago by ahf

Oh, and one more thing I forgot. Should we have a SIGHUP handler that reloads the tables?

comment:10 Changed 2 months ago by cohosh

Reviewer: ahf
Status: needs_review → needs_revision

comment:11 Changed 2 months ago by cohosh

One way, if we're comfortable relying on Debian dependencies, is to ask the operator to install ​tor-geoipdb or ​geoip-database package.

We should make the path to the two GeoIP databases configurable (either via a command line parameter and/or a small config file?)

I think this is the best of both worlds: https://github.com/cohosh/snowflake/commit/fbb87b508641bbbcfd3163d1f2a43b9aff4e0085

The broker now allows the operator to pass in paths to geoip files (for IPv4 and IPv6) as command-line arguments. The default is the install location of the Debian tor-geoipdb package. If an invalid filename is provided (or none is provided and the package is not installed), the table will fail to load but will not cause any crashes. There's a test for that here: https://github.com/cohosh/snowflake/commit/09dd27f9408b1ff3ff916e374bcd5f659ad5b26b

In the tests, I would also test an address that maps to "" and perhaps special cases like 127.0.0.1, 0.0.0.0, 255.255.255.255.

Thanks! Got some bugs :) Here's the tests and fixes: https://github.com/cohosh/snowflake/commit/be4d245375722d958dd85f1a53849cdc37b3382b

comment:12 in reply to:  8 Changed 2 months ago by cohosh

Here's a new candidate: https://github.com/cohosh/snowflake/compare/geoip

In addition to the changes above, here are the other changes I made:

Replying to ahf:

  • I don't know if this is a common thing to do in Go, but in many functional languages with type aliases, people tend to create aliases for string types to make them "more specific". In this case the country-string type could be called Country, so the metrics table would be a mapping from a Country to a monotonically increasing counter.

I did that for CountryStats (which is the map from country codes to counts). Is doing this for the country strings too noisy?

  • What should we do with these values when they are here? Should we have an API end-point that can dump them? Should we save them to a log file with some heartbeat interval? Chelsea Komlo showed me a neat library for collecting internal metrics in Go applications, but it might be too early to introduce additional dependencies just for this. It was this library: https://github.com/armon/go-metrics

I think it's still very early... my suggestion is to do something simple, close this ticket, and then think about what we want a bit more before adding new dependencies. Right now it just logs the country counts to a log file every hour.

Oh, and more thing I forgot. Should we have a SIGHUP handler that reloads the tables?

Added.

comment:13 Changed 2 months ago by cohosh

Status: needs_revision → needs_review

comment:14 Changed 2 months ago by ahf

Status: needs_review → merge_ready

I think this looks good. Maybe dcf has some comments?

comment:15 Changed 2 months ago by cohosh

Oops, just fixed the test to look for the default geoip files: https://github.com/cohosh/snowflake/commit/fbcb980dd5e5b42dd80c2e0a8b3501cd17fa34a2

comment:16 Changed 2 months ago by cohosh

Actual Points: 2

Updating actual points.

comment:17 Changed 2 months ago by cohosh

Updated PR (fixed paths in the tests): https://github.com/cohosh/snowflake/compare/geoip

comment:18 Changed 2 months ago by dcf

I have many comments, but overall my impression is good and I think you can move ahead with this.

My big-picture question is: what do we plan to do with this data? If it's to detect blocking events by comparing the broker's statistics against the bridge's, I think we should at least sketch out those analysis scripts, in order to see whether the client geoip data we will be collecting is suited to the requirements. My main point is that we shouldn't collect data just because it may be useful; instead we should design safe data collection around some question we want to answer. As it stands, the branch will collect more precise client data than Tor Metrics does (Tor Metrics doesn't publish raw numbers but applies some fuzzing and binning). Having /debug display precise counts is a danger in the following scenario: an observer wants to determine whether a particular client is accessing the Snowflake broker. Whenever the observer sees a suspected connection, it checks the /debug output to see whether the count has incremented.

Perhaps we could do a test deployment for a few days, to get an idea of what the data looks like. In fact, I think it's a good idea to try that, before merging. If there's a research question that we think this data could help us answer, we could ask the Safety Board to evaluate it.


  • This branch only tracks client accesses, not proxy accesses. What will happen with proxy accesses? Are they planned to go in the same metrics.log file?
  • broker.go: The SIGHUP handler introduces a race condition because other goroutines may be using the tables while they are being replaced. Even though the table is replaced all at once with an assignment, I don't think that even that assignment is guaranteed to be atomic. The table really needs a lock—not only while being reloaded from a file but on every access (because SIGHUP means the data structure can change at any time). Alternately, consider removing the SIGHUP feature and only loading the database once at startup.
  • broker.go
    flag.StringVar(&geoipDatabase, "geoipdb", "/usr/share/tor/geoip", "path to correctly formatted geoip database mapping IPv4 address ranges to country codes")
    flag.StringVar(&geoip6Database, "geoip6db", "/usr/share/tor/geoip6", "path to correctly formatted geoip database mapping IPv6 address ranges to country codes")
    
    Is there any way to disable the geoip feature, if I don't have the necessary files, other than by providing an invalid path? Maybe we don't care to provide a way to disable the feature and it's the operator's responsibility to provide some kind of geoip files; but in that case, it may be better to exit with an error if the files cannot be read, rather than logging the error and continuing on. It is perhaps surprising that you can explicitly ask for a certain file on the command line with -geoipdb path, and the broker will run even if the path cannot be read.
  • UpdateCountryStats: It looks to me like if a geoip lookup fails, it is silently ignored—this could be misleading. It would be good to distinguish three cases: geoip file present and lookup found a country code; geoip file present but lookup found nothing; geoip file absent so no lookup is possible.
  • Adding to that, tor records the hashes of its geoip files: geoip-db-digest and geoip6-db-digest. It would be good to record that information so we can retroactively determine if geoip files were out of date, or diagnose geoip failures.
  • I think that some of the logging is happening at too low a level. The most basic functions should return errors and not log. For example geoipStringToIP should just return the err from strconv.ParseUint. NewMetrics, too, should return any error resulting from filesystem operations.
  • GeoIPRangeClosure: I don't understand why this is called "closure"; it's just a function.
  • geoip.go "4-byte unsigned integers": Please document whether it's big-endian or little-endian. "INTIPLOW,INTIPHIGH,CC": Similarly, document whether the ranges are inclusive of the high element or exclusive (I guess it would have to be inclusive, in order to represent 255.255.255.255?).
  • The parseEntry functions need error checking on geoipStringToIP and net.ParseIP, or else they will store a nil that will result in a panic later on. This error will be more useful if it includes a line number. Let me suggest a different factoring of the geoip parsing code. Have parseEntry be a plain function, not a method on GeoIPv4Table, and have it return (GeoIPEntry, error) rather than inserting into the table itself. GeoIPLoadFile can do the entry insertion, count lines and annotate the error result from parseEntry.
  • geoip.go: Check scanner.Err() after the loop.
  • GeoipError: Unless you need to do type matching on errors, you can just use errors.New or fmt.Errorf instead of making a new type.
  • GetCountryByAddr: I think a more natural API for this function would be to have it take a net.IP instead of a string (seeing as all it does with the string is immediately parse it, and UpdateCountryStats is already parsing it).
  • Can GeoIPRangeClosure be written as bytes.Compare(key.To16(), entry.ipHigh.To16()) <= 0, and GeoIPCheckRange as bytes.Compare(key.To16(), entry.ipLow.To16()) >= 0 && bytes.Compare(key.To16(), entry.ipHigh.To16()) <= 0?
  • metrics.go
    m.countryStats.counts[country] = m.countryStats.counts[country] + 1
    
    You can do this as m.countryStats.counts[country]++ or m.countryStats.counts[country] += 1.
  • metrics.go: I don't think the initial nil assignment does anything.
    m.tablev4 = nil // garbage collect previous table, if applicable
    m.tablev4 = tablev4
    
  • metrics.go
    for {
    	<-heartbeat
    
    Consider doing this as for range heartbeat {. That way, the loop will terminate if the heartbeat channel gets closed.
  • If NewMetrics is called more than once, there will be multiple open file handles to metrics.log (and only the first one will actually work, I think). Is it possible that NewMetrics could be called more than once?
  • snowflake-broker_test.go: I'm not crazy about having the tests depend on external files, especially ones that may change. Ideally the test code would include its own small geoip files, either as separate files or as literals in the source code. But at least, the test should t.Skip if a file is missing, rather than report spurious failure. I suggest adding a layer of abstraction: a GeoIPLoad(io.Reader) that parses any io.Reader (for example a bytes.Reader), and writing GeoIPLoadFile in terms of it.
  • It would be nice to also have tests that exercise the error paths in geoip parsing.
  • I would like to see unit tests for basic functions like GeoIPRangeClosure and GeoIPCheckRange. They don't have to be long, just test the edge conditions of an instance.

comment:19 in reply to:  18 ; Changed 8 weeks ago by cohosh

Replying to dcf:

I have many comments, but overall my impression is good and I think you can move ahead with this.

My big-picture question is: what do we plan to do with this data? If it's to detect blocking events by comparing the broker's statistics against the bridge's, I think we should at least sketch out those analysis scripts, in order to see whether the client geoip data we will be collecting is suited to the requirements. My main point is that we shouldn't collect data just because it may be useful; instead we should design safe data collection around some question we want to answer. As it stands, the branch will collect more precise client data than Tor Metrics does (Tor Metrics doesn't publish raw numbers but applies some fuzzing and binning). Having /debug display precise counts is a danger in the following scenario: an observer wants to determine whether a particular client is accessing the Snowflake broker. Whenever the observer sees a suspected connection, it checks the /debug output to see whether the count has incremented.

Perhaps we could do a test deployment for a few days, to get an idea of what the data looks like. In fact, I think it's a good idea to try that, before merging. If there's a research question that we think this data could help us answer, we could ask the Safety Board to evaluate it.

Thanks for this, I agree we should think about it some more.

Whatever we decide, we should eventually stop displaying this data in /debug, and instead log it and use that log file to display metrics somewhere else. I also think that we should not be revealing more information about clients than the bridge is.

I'm also willing to believe that collecting client country stats at the broker, even though it would tell us more information about censorship events, may not be *that* useful to us at the moment and is undesirable due to privacy concerns. We could always take a deeper dive into our investigations if we notice a drop in clients from a specific region at the bridge to figure out exactly what is going on.

On the other hand, perhaps we want to collect country stats of the snowflake proxies? This is discussed to some extent in #21315. Do we have privacy concerns about proxies that are similar to those concerning clients?

comment:20 in reply to:  19 Changed 8 weeks ago by dcf

Replying to cohosh:

On the other hand, perhaps we want to collect country stats of the snowflake proxies? This is discussed to some extent in #21315. Do we have privacy concerns about proxies that are similar to those concerning clients?

I think that doing it for proxies is less concerning than doing it for clients. I would be fine with merging code to collect stats on proxies right away, as I think the risk is low. We can ask the safety board if they can think of any dangers we missed, but I don't think we have to wait for that before starting.

comment:21 Changed 7 weeks ago by cohosh

Status: merge_ready → needs_revision

Putting this in needs_revision until we've actually modified it to collect proxy rather than client data.

comment:22 in reply to:  18 ; Changed 7 weeks ago by cohosh

Replying to dcf:
Thanks, I went with most of these suggestions. Here's a new PR: geoip. There are a few items that require additional notes.

  • This branch only tracks client accesses, not proxy accesses. What will happen with proxy accesses? Are they planned to go in the same metrics.log file?

We're now just tracking proxy accesses.

  • Is there any way to disable the geoip feature, if I don't have the necessary files, other than by providing an invalid path? Maybe we don't care to provide a way to disable the feature and it's the operator's responsibility to provide some kind of geoip files; but in that case, it may be better to exit with an error if the files cannot be read, rather than logging the error and continuing on. It is perhaps surprising that you can explicitly ask for a certain file on the command line with -geoipdb path, and the broker will run even if the path cannot be read.
  • It looks to me like if a geoip lookup fails, it is silently ignored—this could be misleading. It would be good to distinguish three cases: geoip file present and lookup found a country code; geoip file present but lookup found nothing; geoip file absent so no lookup is possible.

Added a -disablegeoip option that will not load geoip tables and will not return an error if UpdateCountryStats fails. Otherwise UpdateCountryStats will return an error if it fails and loading the table will panic if the file does not exist or if it is incorrectly formatted.

  • The parseEntry functions need error checking on geoipStringToIP and net.ParseIP, or else they will store a nil that will result in a panic later on. This error will be more useful if it includes a line number. Let me suggest a different factoring of the geoip parsing code. Have parseEntry be a plain function, not a method on GeoIPv4Table, and have it return (GeoIPEntry, error) rather than inserting into the table itself. GeoIPLoadFile can do the entry insertion, count lines and annotate the error result from parseEntry.

The difficulty in refactoring it this way is that the IPv4 and IPv6 tables have differently formatted entries that need to be parsed differently. I added error checking to parseEntry and had it return (GeoIPEntry, error) as suggested, but left it as a method on GeoIPv4Table and GeoIPv6Table.

  • If NewMetrics is called more than once, there will be multiple open file handles to metrics.log (and only the first one will actually work, I think). Is it possible that NewMetrics could be called more than once?

Making a singleton *Metrics variable causes problems with how Convey does tests. It shouldn't be called more than once, but for now I'm using sync.Once on the logging, at least, so it's explicit.

  • snowflake-broker_test.go: I'm not crazy about having the tests depend on external files, especially ones that may change. Ideally the test code would include its own small geoip files, either as separate files or as literals in the source code. But at least, the test should t.Skip if a file is missing, rather than report spurious failure. I suggest adding a layer of abstraction: a GeoIPLoad(io.Reader) that parses any io.Reader (for example a bytes.Reader), and writing GeoIPLoadFile in terms of it.
  • It would be nice to also have tests that exercise the error paths in geoip parsing.

I haven't made this abstraction at the moment and went for the small geoip files option.

  • I would like to see unit tests for basic functions like GeoIPRangeClosure and GeoIPCheckRange. They don't have to be long, just test the edge conditions of an instance.

I replaced these functions with much simpler one-line statements, but I added edge cases to the existing geoip test to make sure we correctly categorize IPs at the edge of a range.

comment:23 Changed 7 weeks ago by cohosh

Status: needs_revision → needs_review

comment:24 in reply to:  22 Changed 5 weeks ago by dcf

Status: needs_review → needs_revision

I have just a few further changes to recommend.

Replying to cohosh:

Replying to dcf:

  • Is there any way to disable the geoip feature, if I don't have the necessary files, other than by providing an invalid path? Maybe we don't care to provide a way to disable the feature and it's the operator's responsibility to provide some kind of geoip files; but in that case, it may be better to exit with an error if the files cannot be read, rather than logging the error and continuing on. It is perhaps surprising that you can explicitly ask for a certain file on the command line with -geoipdb path, and the broker will run even if the path cannot be read.
  • It looks to me like if a geoip lookup fails, it is silently ignored—this could be misleading. It would be good to distinguish three cases: geoip file present and lookup found a country code; geoip file present but lookup found nothing; geoip file absent so no lookup is possible.

Added a -disablegeoip option that will not load geoip tables and will not return an error if UpdateCountryStats fails. Otherwise UpdateCountryStats will return an error if it fails and loading the table will panic if the file does not exist or if it is incorrectly formatted.

Okay. Let's replace the panic with a log.Fatal so the failure gets logged.

  • The parseEntry functions need error checking on geoipStringToIP and net.ParseIP, or else they will store a nil that will result in a panic later on. This error will be more useful if it includes a line number. Let me suggest a different factoring of the geoip parsing code. Have parseEntry be a plain function, not a method on GeoIPv4Table, and have it return (GeoIPEntry, error) rather than inserting into the table itself. GeoIPLoadFile can do the entry insertion, count lines and annotate the error result from parseEntry.

The difficulty in refactoring it this way is that the IPv4 and IPv6 tables have differently formatted entries that need to be parsed differently. I added error checking to parseEntry and had it return (GeoIPEntry, error) as suggested, but left it as a method on GeoIPv4Table and GeoIPv6Table.

I didn't explain this well. I meant that there should be separate functions for the two formats e.g. parseEntryIPv4 and parseEntryIPv6. The existing parseEntry methods never refer to table, so they don't need to be methods. But leaving them as methods is fine too.

IMO annotating the error message with the problematic line should be done in GeoIPLoadFile, not in parseEntry. This will eliminate the duplication of the common "Provided geoip file is incorrectly formatted" string and ensure that all the error paths get annotated. What I mean is, in the scanner.Scan() loop, you can replace return err with return fmt.Errorf("Provided geoip file is incorrectly formatted: %s. Line is: %+q", err, scanner.Text()).

I was actually thinking of annotating with a line number, but including the literal contents of the line works as well. I suggest using the %+q format specifier instead of %s, because it will escape any weird bytes and ensure the output remains on one log line.

  • If NewMetrics is called more than once, there will be multiple open file handles to metrics.log (and only the first one will actually work, I think). Is it possible that NewMetrics could be called more than once?

Making a singleton *Metrics variable causes problems with how Convey does tests. It shouldn't be called more than once, but for now I'm using sync.Once on the logging, at least, so it's explicit.

Okay.

  • snowflake-broker_test.go: I'm not crazy about having the tests depend on external files, especially ones that may change. Ideally the test code would include its own small geoip files, either as separate files or as literals in the source code. But at least, the test should t.Skip if a file is missing, rather than report spurious failure. I suggest adding a layer of abstraction: a GeoIPLoad(io.Reader) that parses any io.Reader (for example a bytes.Reader), and writing GeoIPLoadFile in terms of it.
  • It would be nice to also have tests that exercise the error paths in geoip parsing.

I haven't made this abstraction at the moment and went for the small geoip files option.

Okay.

  • I would like to see unit tests for basic functions like GeoIPRangeClosure and GeoIPCheckRange. They don't have to be long, just test the edge conditions of an instance.

I replaced these functions with much simpler one-line statements, but I added edge cases to the existing geoip test to make sure we correctly categorize IPs at the edge of a range.

Okay.

comment:25 Changed 5 weeks ago by dcf

Oh and it looks like country stats don't get incremented whenever GetCountryByAddr doesn't find a match. I'm afraid this will make analysis difficult if a large fraction of proxies aren't covered by the geoip database, or the database is improperly installed, or something. Could we count them with a country code of "??" or something?

https://github.com/cohosh/snowflake/blob/fb01c2d1c9d402cc4d96f01df2e6fb6cb37bc9a6/broker/geoip.go#L213
GetCountryByAddr returns (string, error), but failure to find an entry in a database is not really an "error". It makes it seem like there are three return states possible (found, not found, error) when really there are only two (found, not found). How about changing the signature to

func GetCountryByAddr(table GeoIPTable, ip net.IP) (string, bool)

where the second return value is an ok flag, like when indexing a map or reading from a channel. The caller can then do the "??" substitution whenever !ok.

comment:26 Changed 5 weeks ago by cohosh

Cc: irl karsten added

comment:27 in reply to:  25 ; Changed 4 weeks ago by cohosh

Replying to dcf:

I have just a few further changes to recommend.

Updated branch: https://github.com/cohosh/snowflake/compare/geoip

Ack >.< thanks. Done!

Done

Documented as unsupported so we know what our line of thought was in case we want to support it later


Okay. Let's replace the panic with a log.Fatal so the failure gets logged.

Done

  • The parseEntry functions need error checking on geoipStringToIP and net.ParseIP, or else they will store a nil that will result in a panic later on. This error will be more useful if it includes a line number. Let me suggest a different factoring of the geoip parsing code. Have parseEntry be a plain function, not a method on GeoIPv4Table, and have it return (GeoIPEntry, error) rather than inserting into the table itself. GeoIPLoadFile can do the entry insertion, count lines and annotate the error result from parseEntry.

The difficulty in refactoring it this way is that the IPv4 and IPv6 tables have differently formatted entries that need to be parsed differently. I added error checking to parseEntry and had it return (GeoIPEntry, error) as suggested, but left it as a method on GeoIPv4Table and GeoIPv6Table.

I didn't explain this well. I meant that there should be separate functions for the two formats e.g. parseEntryIPv4 and parseEntryIPv6. The existing parseEntry methods never refer to table, so they don't need to be methods. But leaving them as methods is fine too.

Okay, I'm going to leave them as methods to avoid having two different LoadFile functions for the different types of table.

IMO annotating the error message with the problematic line should be done in GeoIPLoadFile, not in parseEntry. This will eliminate the duplication of the common "Provided geoip file is incorrectly formatted" string and ensure that all the error paths get annotated. What I mean is, in the scanner.Scan() loop, you can replace return err with return fmt.Errorf("Provided geoip file is incorrectly formatted: %s. Line is: %+q", err, line).

Done, is there a better way of handling the error stubs in parseEntry other than returning fmt.Errorf("")?

Replying to dcf:

Oh and it looks like country stats don't get incremented whenever GetCountryByAddr doesn't find a match. I'm afraid this will make analysis difficult if a large fraction of proxies aren't covered by the geoip database, or the database is improperly installed, or something. Could we count them with a country code of "??" or something?

https://github.com/cohosh/snowflake/blob/fb01c2d1c9d402cc4d96f01df2e6fb6cb37bc9a6/broker/geoip.go#L213
GetCountryByAddr returns (string, error), but failure to find an entry in a database is not really an "error". It makes it seem like there are three possible return states (found, not found, error) when really there are only two (found, not found). How about changing the signature to

func GetCountryByAddr(table GeoIPTable, ip net.IP) (string, bool)

where the second return value is an ok flag, like when indexing a map or reading from a channel. The caller can then do the "??" substitution whenever !ok.

Okay, so I've changed the GetCountryByAddr signature to what you've suggested and then just removed the error return value from UpdateCountryStats. I've also added setting the country to "??" in UpdateCountryStats.

comment:28 Changed 4 weeks ago by cohosh

Status: needs_revision → needs_review

comment:29 in reply to:  27 Changed 4 weeks ago by dcf

Status: needs_review → merge_ready

Looks good! Nice work on this.

Replying to cohosh:

Replying to dcf:

IMO annotating the error message with the problematic line should be done in GeoIPLoadFile, not in parseEntry. This will eliminate the duplication of the common "Provided geoip file is incorrectly formatted" string and ensure that all the error paths get annotated. What I mean is, in the scanner.Scan() loop, you can replace return err with return fmt.Errorf("Provided geoip file is incorrectly formatted: %s. Line is: %+q", err, line).

Done, is there a better way of handling the error stubs in parseEntry other than returning fmt.Errorf("")?

There's nothing wrong with returning a specific "couldn't parse IP: " error message IMO, as long as the message gets printed at the top level in GeoIPLoadFile, not in parseEntry. Or, if you just want a generic error sentinel without a meaningful message, you can create an error object at the top-level scope and just return that:

var errParseError error = errors.New("parse error")

Replying to dcf:

Oh and it looks like country stats don't get incremented whenever GetCountryByAddr doesn't find a match. I'm afraid this will make analysis difficult if a large fraction of proxies aren't covered by the geoip database, or the database is improperly installed, or something. Could we count them with a country code of "??" or something?

Okay, so I've changed the GetCountryByAddr signature to what you've suggested and then just removed the error return value from UpdateCountryStats. I've also added setting the country to "??" in UpdateCountryStats.

The documentation for GetCountryByAddr should explain the second return value.

I think you can remove this condition now:

	if country != "" {

comment:30 Changed 2 weeks ago by dcf

While you are waiting for feedback on #21315, I think it would be okay to do a provisional deployment of (say) 7 days in order to get an idea of what a typical metrics.log file looks like. In our meeting today, cohosh suggested removing the proxy geoip information from /debug during the provisional deployment, and only storing it in metrics.log.

comment:31 in reply to:  30 Changed 2 weeks ago by cohosh

Replying to dcf:

While you are waiting for feedback on #21315, I think it would be okay to do a provisional deployment of (say) 7 days in order to get an idea of what a typical metrics.log file looks like. In our meeting today, cohosh suggested removing the proxy geoip information from /debug during the provisional deployment, and only storing it in metrics.log.

I did this and a few other things to make it work better with our specific deployment: https://github.com/cohosh/snowflake/compare/geoip_squashed

  • the broker wasn't failing correctly if the hardcoded "metrics.log" file couldn't be opened, so I added a command-line option to specify this file, defaulting to stdout. I also made sure to log.Fatal if the specified file can't be opened. This required a bit of refactoring.
  • Fixed the log output to not print an extra newline character
  • Removed the if country != "" {} condition and added the documentation as requested above

comment:32 Changed 2 weeks ago by cohosh

More on deployment:

I tried the following steps on snowflake-broker:

  • add deb https://deb.torproject.org/torproject.org stretch main to /etc/apt/sources.list.d/tor.list
  • installed tor-geoipdb (which also installed tor and torsocks)
  • changed /etc/runit/snowflake-broker to:
    exec chpst -u snowflake-broker /usr/local/bin/broker --acme-hostnames snowflake-broker.bamsoftware.com --acme-email dcf@torproject.org --metrics-log /home/snowflake-broker/metrics.log 2>&1
    

But I'm still getting the error message
2019/05/10 18:07:13 open metrics.log: permission denied

I see that snowflake-broker is the owner of /home/snowflake-broker but it's assigned to group nogroup. Is that the cause of this problem?

comment:33 Changed 11 days ago by cohosh

It was an error in the code that's now fixed. The snowflake-broker appears to be opening the log file just fine.

comment:34 Changed 10 days ago by cohosh

Okay finally provisionally deployed the geoip updates and merged the changes to master.
