Opened 2 months ago

Last modified 10 days ago

#29734 merge_ready enhancement

Broker should receive country stats information from Proxy and Client

Reported by: cohosh
Owned by: cohosh
Priority: Medium
Milestone:
Component: Circumvention/Snowflake
Version:
Severity: Normal
Keywords: snowflake, geoip, stats
Cc: ahf, cohosh, dcf, arlolra, irl, karsten
Actual Points: 2
Parent ID: #29207
Points: 1
Reviewer: ahf
Sponsor: Sponsor19

Description

We can use existing geoip data to collect statistics about where clients are connecting from in order to detect possible blocking events. These should be gathered both from the initial domain-fronted client connection and from the proxies (to be passed to the broker) in order to detect the blocking of individual proxies or the blocking of the WebRTC connections.

Child Tickets

Change History (34)

comment:1 Changed 2 months ago by ahf

We have a couple of options here for the implementation:

  • The broker strictly doesn't depend on anything Tor, but we could re-use the tor-geoipdb databases that are bundled in Debian/Ubuntu to get updates. These databases have a slightly different format than the official MaxMind GeoIP databases.

Once the broker is able to update per-country stats for the domain-fronted client connection, it should also be able to relay information about which database it is using to the Snowflake proxies, such that they can keep stats about incoming proxy connections from clients and where these are coming from. This would (maybe?) allow us to notice if WebRTC filtering is happening in a country, in that the broker will see multiple connections from the given country, but the proxies report no incoming clients from it.

The proxies MUST NOT have to forward the client IP to the broker, which is why it is better for the proxies to fetch the GeoIP DB from the broker and cache it locally.

The format used by Tor itself is very simple (an IP encoded as an integer, followed by the country code), kept in an ordered vector that you binary-search whenever you need to look up the country for a given IP. The simplicity of this data structure might make it more interesting than MaxMind's binary format, since we need to do the same implementation in both Go and JavaScript.
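That lookup can be sketched in Go roughly as follows. This is a simplification with hypothetical names (entry, lookup), not little-t-tor's actual code: a slice of inclusive integer-encoded ranges sorted by their low end, binary-searched with sort.Search.

```go
package main

import (
	"fmt"
	"sort"
)

// entry mirrors one line of Tor's geoip file: an inclusive
// integer-encoded IPv4 range and a two-letter country code.
type entry struct {
	low, high uint32
	cc        string
}

// lookup binary-searches a slice sorted by the low end of each
// (non-overlapping) range.
func lookup(table []entry, ip uint32) string {
	// Find the first entry whose high end is >= ip ...
	i := sort.Search(len(table), func(i int) bool { return table[i].high >= ip })
	// ... and check that ip actually falls inside that range.
	if i < len(table) && table[i].low <= ip {
		return table[i].cc
	}
	return "" // no match
}

func main() {
	table := []entry{ // must be sorted by low
		{16777216, 16777471, "au"},
		{16777472, 16778239, "cn"},
	}
	fmt.Println(lookup(table, 16777473)) // prints "cn"
}
```

The same structure works for IPv6 by comparing 16-byte values instead of uint32s.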

The Tor implementation can be found in https://github.com/torproject/tor/blob/2f683465d4b666c5d8f84fb3b234ad539d8511cd/src/lib/geoip/geoip.c

The Tor GeoIP database format can be seen here: https://github.com/torproject/tor/tree/master/src/config (see geoip, geoip6 and the mmdb-convert.py conversion script)

comment:2 Changed 2 months ago by cohosh

Owner: set to cohosh
Status: new → assigned

comment:3 in reply to:  1 ; Changed 2 months ago by cohosh

Replying to ahf:

Once the broker is able to update per-country stats for the domain-fronted client connection, it should also be able to relay information about which database it is using to the Snowflake proxies, such that they can keep stats about incoming proxy connections from clients and where these are coming from. This would (maybe?) allow us to notice if WebRTC filtering is happening in a country, in that the broker will see multiple connections from the given country, but the proxies report no incoming clients from it.

Are we already collecting per-country usage stats for snowflake bridges (as we do for other types of bridges)? If so, this might give us what we need automatically for noticing WebRTC filtering. Especially now, while there is one broker and one bridge: if clients are able to connect to snowflake proxies, there shouldn't be any censorship-related reason that they cannot connect to bridges.

I think per-country usage stats at the broker side are still useful, of course; they give us extra information when clients are able to connect to the broker but ultimately unable to connect to the snowflake bridge.

On a different note, it might also be useful to us to collect per-country stats on where the proxies are being run from.

comment:4 in reply to:  3 Changed 2 months ago by dcf

Replying to cohosh:

Are we already collecting per-country usage stats for snowflake bridges (as we do for other types of bridges)?

Yes, this was #18628. How it works is: the snowflake proxy (proxy, proxy-go) forwards the client's IP address to the bridge in a client_ip= URL query parameter. Then the server parses it and passes it to tor in the pt.DialOr call. It's similar to what we worked out for meek, which was #13171.

I don't think that Snowflake has enough users to show up on any of the by-country graphs at Tor Metrics, but you can see the stats in the uploaded descriptor files. Example: https://collector.torproject.org/archive/bridge-descriptors/extra-infos/bridge-extra-infos-2019-02.tar.xz

$ tar -O -xf bridge-extra-infos-2019-02.tar.xz | grep -A 24 '^extra-info flakey 5481936581E23D2D178105D44DB6915AB06BFB7F$' | grep -E '^dirreq-v3-reqs '
dirreq-v3-reqs ru=16,tr=16,ae=8,cn=8,gb=8,je=8,us=8
dirreq-v3-reqs tr=24,cn=16,ae=8,je=8,nl=8,ru=8,us=8
dirreq-v3-reqs tr=16,cn=8,gb=8,ru=8,us=8
...

If so, this might give us what we need automatically for noticing WebRTC filtering. Especially now, while there is one broker and one bridge: if clients are able to connect to snowflake proxies, there shouldn't be any censorship-related reason that they cannot connect to bridges.

This logic makes sense to me.

comment:5 Changed 2 months ago by cohosh

Here's a first commit that does something similar to little-t-tor for mapping IP addresses to country codes. The functions parse and load a database file into memory and then binary search that on a provided address to efficiently find the country code.

https://github.com/cohosh/snowflake/commit/eedca1cbe49ff84468806fd630a9f104d9ca230a

For now I've just included the geoip and geoip6 database files in the repository... is there any easier way to get these?

comment:6 Changed 2 months ago by cohosh

Status: assigned → needs_review

Here's a merge candidate for geoip in the broker: https://github.com/cohosh/snowflake/compare/geoip

I added some very simple count-based usage statistics for clients (mostly just to show how it works). We can do something a lot nicer here. We can add the same statistics for proxies as well.

comment:7 in reply to:  5 Changed 2 months ago by dcf

Replying to cohosh:

For now I've just included the geoip and geoip6 database files in the repository... is there any easier way to get these?

One way, if we're comfortable relying on Debian dependencies, is to ask the operator to install tor-geoipdb or geoip-database package.

In the tests, I would also test an address that maps to "" and perhaps special cases like 127.0.0.1, 0.0.0.0, 255.255.255.255.

comment:8 Changed 2 months ago by ahf

I think this looks good, with a few comments/questions:

  • I don't think we should include the two geoip databases in the repository by default?
  • We should make the path to the two GeoIP databases configurable (either via a command line parameter and/or a small config file?)
  • I don't know if this is a common thing to do in Go, but in many functional languages with type aliases, people tend to create aliases for string types to make them "more specific". In this case the country-string type could be called Country, so the metrics table would be a mapping from a Country to a monotonically increasing counter.
  • What should we do with these values when they are here? Should we have an API end-point that can dump them? Should we save them to a log file with some heartbeat interval? Chelsea Komlo showed me a neat library for collecting internal metrics in Go applications, but it might be too early to introduce additional dependencies just for this. It was this library: https://github.com/armon/go-metrics
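The naming suggestion in the third bullet might look like this in Go, using a defined type (Go's closest idiomatic equivalent to the aliases ahf mentions) to make the map's intent explicit. Names here (Country, CountryStats, Increment) are hypothetical, a sketch rather than the branch's actual code:

```go
package main

import "fmt"

// Country is a defined type wrapping string, so signatures say what
// kind of string they expect.
type Country string

// CountryStats maps a country code to a monotonically increasing counter.
type CountryStats map[Country]uint64

// Increment bumps the counter for one country.
func (s CountryStats) Increment(c Country) {
	s[c]++
}

func main() {
	stats := make(CountryStats)
	stats.Increment("ru")
	stats.Increment("ru")
	stats.Increment("tr")
	fmt.Println(stats["ru"], stats["tr"]) // prints "2 1"
}
```

A defined type costs nothing at runtime and lets the compiler catch accidental mixing of country codes with other strings.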

comment:9 Changed 2 months ago by ahf

Oh, and one more thing I forgot. Should we have a SIGHUP handler that reloads the tables?

comment:10 Changed 2 months ago by cohosh

Reviewer: ahf
Status: needs_review → needs_revision

comment:11 Changed 2 months ago by cohosh

One way, if we're comfortable relying on Debian dependencies, is to ask the operator to install ​tor-geoipdb or ​geoip-database package.

We should make the path to the two GeoIP databases configurable (either via a command line parameter and/or a small config file?)

I think this is the best of both worlds: https://github.com/cohosh/snowflake/commit/fbb87b508641bbbcfd3163d1f2a43b9aff4e0085

The broker now allows the operator to pass in paths to geoip files (for IPv4 and IPv6) as command-line arguments. The default is the install location of the Debian tor-geoipdb package. If an invalid filename is provided (or none is provided and the package is not installed), the table will fail to load but will not cause any crashes. There's a test for that here: https://github.com/cohosh/snowflake/commit/09dd27f9408b1ff3ff916e374bcd5f659ad5b26b

In the tests, I would also test an address that maps to "" and perhaps special cases like 127.0.0.1, 0.0.0.0, 255.255.255.255.

Thanks! Got some bugs :) Here's the tests and fixes: https://github.com/cohosh/snowflake/commit/be4d245375722d958dd85f1a53849cdc37b3382b

comment:12 in reply to:  8 Changed 2 months ago by cohosh

Here's a new candidate: https://github.com/cohosh/snowflake/compare/geoip

In addition to the changes above, here are the other changes I made:

Replying to ahf:

  • I don't know if this is a common thing to do in Go, but in many functional languages with type aliases, people tend to create aliases for string types to make them "more specific". In this case the country-string type could be called Country, so the metrics table would be a mapping from a Country to a monotonically increasing counter.

I did that for CountryStats (which is the map from country codes to counts). Is doing this for the country strings too noisy?

  • What should we do with these values when they are here? Should we have an API end-point that can dump them? Should we save them to a log file with some heartbeat interval? Chelsea Komlo showed me a neat library for collecting internal metrics in Go applications, but it might be too early to introduce additional dependencies just for this. It was this library: https://github.com/armon/go-metrics

I think it's still very early... my suggestion is to do something simple, close this ticket, and then think about what we want a bit more before adding new dependencies. Right now it just logs the country counts to a log file every hour.

Oh, and more thing I forgot. Should we have a SIGHUP handler that reloads the tables?

Added.

comment:13 Changed 2 months ago by cohosh

Status: needs_revision → needs_review

comment:14 Changed 2 months ago by ahf

Status: needs_review → merge_ready

I think this looks good. Maybe dcf has some comments?

comment:15 Changed 2 months ago by cohosh

Oops, just fixed the test to look for the default geoip files: https://github.com/cohosh/snowflake/commit/fbcb980dd5e5b42dd80c2e0a8b3501cd17fa34a2

comment:16 Changed 2 months ago by cohosh

Actual Points: 2

Updating actual points.

comment:17 Changed 2 months ago by cohosh

Updated PR (fixed paths in the tests): https://github.com/cohosh/snowflake/compare/geoip

comment:18 Changed 2 months ago by dcf

I have many comments, but overall my impression is good and I think you can move ahead with this.

My big-picture question is: what do we plan to do with this data? If it's to detect blocking events by comparing the broker's statistics against the bridge's, I think we should at least sketch out those analysis scripts, in order to see whether the client geoip data we will be collecting is suited to the requirements. My main point is that we shouldn't collect data just because it may be useful; instead we should design safe data collection around some question we want to answer. As it stands, the branch will collect more precise client data than Tor Metrics does (Tor Metrics doesn't publish raw numbers but applies some fuzzing and binning). Having /debug display precise counts is a danger in the following scenario: an observer wants to determine whether a particular client is accessing the Snowflake broker. Whenever the observer sees a suspected connection, it checks the /debug output to see whether the count has incremented.

Perhaps we could do a test deployment for a few days, to get an idea of what the data looks like. In fact, I think it's a good idea to try that, before merging. If there's a research question that we think this data could help us answer, we could ask the Safety Board to evaluate it.


  • This branch only tracks client accesses, not proxy accesses. What will happen with proxy accesses? Are they planned to go in the same metrics.log file?
  • broker.go: The SIGHUP handler introduces a race condition because other goroutines may be using the tables while they are being replaced. Even though the table is replaced all at once with an assignment, I don't think that even that assignment is guaranteed to be atomic. The table really needs a lock—not only while being reloaded from a file but on every access (because SIGHUP means the data structure can change at any time). Alternately, consider removing the SIGHUP feature and only loading the database once at startup.
  • broker.go
    flag.StringVar(&geoipDatabase, "geoipdb", "/usr/share/tor/geoip", "path to correctly formatted geoip database mapping IPv4 address ranges to country codes")
    flag.StringVar(&geoip6Database, "geoip6db", "/usr/share/tor/geoip6", "path to correctly formatted geoip database mapping IPv6 address ranges to country codes")
    
    Is there any way to disable the geoip feature, if I don't have the necessary files, other than by providing an invalid path? Maybe we don't care to provide a way to disable the feature and it's the operator's responsibility to provide some kind of geoip files; but in that case, it may be better to exit with an error if the files cannot be read, rather than logging the error and continuing on. It is perhaps surprising that you can explicitly ask for a certain file on the command line with -geoipdb path, and the broker will run even if the path cannot be read.
  • UpdateCountryStats: It looks to me like if a geoip lookup fails, it is silently ignored—this could be misleading. It would be good to distinguish three cases: geoip file present and lookup found a country code; geoip file present but lookup found nothing; geoip file absent so no lookup is possible.
  • Adding to that, tor records the hashes of its geoip files: geoip-db-digest and geoip6-db-digest. It would be good to record that information so we can retroactively determine if geoip files were out of date, or diagnose geoip failures.
  • I think that some of the logging is happening at too low a level. The most basic functions should return errors and not log. For example geoipStringToIP should just return the err from strconv.ParseUint. NewMetrics, too, should return any error resulting from filesystem operations.
  • GeoIPRangeClosure: I don't understand why this is called "closure"; it's just a function.
  • geoip.go "4-byte unsigned integers": Please document whether it's big-endian or little-endian. "INTIPLOW,INTIPHIGH,CC": Similarly, document whether the ranges are inclusive of the high element or exclusive (I guess it would have to be inclusive, in order to represent 255.255.255.255?).
  • The parseEntry functions need error checking on geoipStringToIP and net.ParseIP, or else they will store a nil that will result in a panic later on. This error will be more useful if it includes a line number. Let me suggest a different factoring of the geoip parsing code. Have parseEntry be a plain function, not a method on GeoIPv4Table, and have it return (GeoIPEntry, error) rather than inserting into the table itself. GeoIPLoadFile can do the entry insertion, count lines and annotate the error result from parseEntry.
  • geoip.go: Check scanner.Err() after the loop.
  • GeoipError: Unless you need to do type matching on errors, you can just use errors.New or fmt.Errorf instead of making a new type.
  • GetCountryByAddr: I think a more natural API for this function would be to have it take a net.IP instead of a string (seeing as all it does with the string is immediately parse it, and UpdateCountryStats is already parsing it).
  • Can GeoIPRangeClosure be written as bytes.Compare(key.To16(), entry.ipHigh.To16()) <= 0, and GeoIPCheckRange as bytes.Compare(key.To16(), entry.ipLow.To16()) >= 0 && bytes.Compare(key.To16(), entry.ipHigh.To16()) <= 0?
  • metrics.go
    m.countryStats.counts[country] = m.countryStats.counts[country] + 1
    
    You can do this as m.countryStats.counts[country]++ or m.countryStats.counts[country] += 1.
  • metrics.go: I don't think the initial nil assignment does anything.
    m.tablev4 = nil // garbage collect previous table, if applicable
    m.tablev4 = tablev4
    
  • metrics.go
    for {
    	<-heartbeat
    
    Consider doing this as for range heartbeat {. That way, the loop will terminate if the heartbeat channel gets closed.
  • If NewMetrics is called more than once, there will be multiple open file handles to metrics.log (and only the first one will actually work, I think). Is it possible that NewMetrics could be called more than once?
  • snowflake-broker_test.go: I'm not crazy about having the tests depend on external files, especially ones that may change. Ideally the test code would include its own small geoip files, either as separate files or as literals in the source code. But at least, the test should t.Skip if a file is missing, rather than report spurious failure. I suggest adding a layer of abstraction: a GeoIPLoad(io.Reader) that parses any io.Reader (for example a bytes.Reader), and writing GeoIPLoadFile in terms of it.
  • It would be nice to also have tests that exercise the error paths in geoip parsing.
  • I would like to see unit tests for basic functions like GeoIPRangeClosure and GeoIPCheckRange. They don't have to be long, just test the edge conditions of an instance.

comment:19 in reply to:  18 ; Changed 8 weeks ago by cohosh

Replying to dcf:

I have many comments, but overall my impression is good and I think you can move ahead with this.

My big-picture question is: what do we plan to do with this data? If it's to detect blocking events by comparing the broker's statistics against the bridge's, I think we should at least sketch out those analysis scripts, in order to see whether the client geoip data we will be collecting is suited to the requirements. My main point is that we shouldn't collect data just because it may be useful; instead we should design safe data collection around some question we want to answer. As it stands, the branch will collect more precise client data than Tor Metrics does (Tor Metrics doesn't publish raw numbers but applies some fuzzing and binning). Having /debug display precise counts is a danger in the following scenario: an observer wants to determine whether a particular client is accessing the Snowflake broker. Whenever the observer sees a suspected connection, it checks the /debug output to see whether the count has incremented.

Perhaps we could do a test deployment for a few days, to get an idea of what the data looks like. In fact, I think it's a good idea to try that, before merging. If there's a research question that we think this data could help us answer, we could ask the Safety Board to evaluate it.

Thanks for this, I agree we should think about it some more.

Whatever we decide, we should eventually stop displaying this data in /debug, and instead log it and use that log file to display metrics somewhere else. I also think that we should not be revealing more information about clients than the bridge is.

I'm also willing to believe that collecting client country stats at the broker, even though it would tell us more information about censorship events, may not be *that* useful to us at the moment and is undesirable due to privacy concerns. We could always take a deeper dive into our investigations if we notice a drop in clients from a specific region at the bridge to figure out exactly what is going on.

On the other hand, perhaps we want to collect country stats of the snowflake proxies? This is discussed to some extent in #21315. Do we have privacy concerns about proxies that are similar to those concerning clients?

comment:20 in reply to:  19 Changed 8 weeks ago by dcf

Replying to cohosh:

On the other hand, perhaps we want to collect country stats of the snowflake proxies? This is discussed to some extent in #21315. Do we have privacy concerns about proxies that are similar to those concerning clients?

I think that doing it for proxies is less concerning than doing it for clients. I would be fine with merging code to collect stats on proxies right away, as I think the risk is low. We can ask the safety board if they can think of any dangers we missed, but I don't think we have to wait for that before starting.

comment:21 Changed 7 weeks ago by cohosh

Status: merge_ready → needs_revision

Putting this in needs_revision until we've actually modified it to collect proxy rather than client data.

comment:22 in reply to:  18 ; Changed 7 weeks ago by cohosh

Replying to dcf:
Thanks, I went with most of these suggestions. Here's a new PR: geoip. There are a few items that require additional notes.

  • This branch only tracks client accesses, not proxy accesses. What will happen with proxy accesses? Are they planned to go in the same metrics.log file?

We're now just tracking proxy accesses.

  • Is there any way to disable the geoip feature, if I don't have the necessary files, other than by providing an invalid path? Maybe we don't care to provide a way to disable the feature and it's the operator's responsibility to provide some kind of geoip files; but in that case, it may be better to exit with an error if the files cannot be read, rather than logging the error and continuing on. It is perhaps surprising that you can explicitly ask for a certain file on the command line with -geoipdb path, and the broker will run even if the path cannot be read.
  • It looks to me like if a geoip lookup fails, it is silently ignored—this could be misleading. It would be good to distinguish three cases: geoip file present and lookup found a country code; geoip file present but lookup found nothing; geoip file absent so no lookup is possible.

Added a -disablegeoip option that will not load geoip tables and will not return an error if UpdateCountryStats fails. Otherwise UpdateCountryStats will return an error if it fails and loading the table will panic if the file does not exist or if it is incorrectly formatted.

  • The parseEntry functions need error checking on geoipStringToIP and net.ParseIP, or else they will store a nil that will result in a panic later on. This error will be more useful if it includes a line number. Let me suggest a different factoring of the geoip parsing code. Have parseEntry be a plain function, not a method on GeoIPv4Table, and have it return (GeoIPEntry, error) rather than inserting into the table itself. GeoIPLoadFile can do the entry insertion, count lines and annotate the error result from parseEntry.

The difficulty in refactoring it this way is that the IPv4 and IPv6 tables have differently formatted entries that need to be parsed differently. I added error checking to parseEntry and had it return (GeoIPEntry, error) as suggested, but left it as a method on GeoIPv4Table and GeoIPv6Table.

  • If NewMetrics is called more than once, there will be multiple open file handles to metrics.log (and only the first one will actually work, I think). Is it possible that NewMetrics could be called more than once?

Making a singleton *Metrics variable causes problems with how Convey does tests. It shouldn't be called more than once, but for now I'm using sync.Once on the logging, at least, so it's explicit.

  • snowflake-broker_test.go: I'm not crazy about having the tests depend on external files, especially ones that may change. Ideally the test code would include its own small geoip files, either as separate files or as literals in the source code. But at least, the test should t.Skip if a file is missing, rather than report spurious failure. I suggest adding a layer of abstraction: a GeoIPLoad(io.Reader) that parses any io.Reader (for example a bytes.Reader), and writing GeoIPLoadFile in terms of it.
  • It would be nice to also have tests that exercise the error paths in geoip parsing.

I haven't made this abstraction at the moment and went for the small geoip files option.

  • I would like to see unit tests for basic functions like GeoIPRangeClosure and GeoIPCheckRange. They don't have to be long, just test the edge conditions of an instance.

I replaced these functions with much simpler one-line statements, but I added edge cases to the existing geoip test to make sure we correctly categorize IPs at the edge of a range.

comment:23 Changed 7 weeks ago by cohosh

Status: needs_revision → needs_review

comment:24 in reply to:  22 Changed 5 weeks ago by dcf

Status: needs_review → needs_revision

I have just a few further changes to recommend.

Replying to cohosh:

Replying to dcf:

  • Is there any way to disable the geoip feature, if I don't have the necessary files, other than by providing an invalid path? Maybe we don't care to provide a way to disable the feature and it's the operator's responsibility to provide some kind of geoip files; but in that case, it may be better to exit with an error if the files cannot be read, rather than logging the error and continuing on. It is perhaps surprising that you can explicitly ask for a certain file on the command line with -geoipdb path, and the broker will run even if the path cannot be read.
  • It looks to me like if a geoip lookup fails, it is silently ignored—this could be misleading. It would be good to distinguish three cases: geoip file present and lookup found a country code; geoip file present but lookup found nothing; geoip file absent so no lookup is possible.

Added a -disablegeoip option that will not load geoip tables and will not return an error if UpdateCountryStats fails. Otherwise UpdateCountryStats will return an error if it fails and loading the table will panic if the file does not exist or if it is incorrectly formatted.

Okay. Let's replace the panic with a log.Fatal so the failure gets logged.

  • The parseEntry functions need error checking on geoipStringToIP and net.ParseIP, or else they will store a nil that will result in a panic later on. This error will be more useful if it includes a line number. Let me suggest a different factoring of the geoip parsing code. Have parseEntry be a plain function, not a method on GeoIPv4Table, and have it return (GeoIPEntry, error) rather than inserting into the table itself. GeoIPLoadFile can do the entry insertion, count lines and annotate the error result from parseEntry.

The difficulty in refactoring it this way is that the IPv4 and IPv6 tables have differently formatted entries that need to be parsed differently. I added error checking to parseEntry and had it return (GeoIPEntry, error) as suggested, but left it as a method on GeoIPv4Table and GeoIPv6Table.

I didn't explain this well. I meant that there should be separate functions for the two formats e.g. parseEntryIPv4 and parseEntryIPv6. The existing parseEntry methods never refer to table, so they don't need to be methods. But leaving them as methods is fine too.

IMO annotating the error message with the problematic line should be done in GeoIPLoadFile, not in parseEntry. This will eliminate the duplication of the common "Provided geoip file is incorrectly formatted" string and ensure that all the error paths get annotated. What I mean is, in the scanner.Scan() loop, you can replace return err with return fmt.Errorf("Provided geoip file is incorrectly formatted: %s. Line is: %+q", err, scanner.Text()).

I was actually thinking of annotating with a line number, but including the literal contents of the line works as well. I suggest using the %+q format specifier instead of %s, because it will escape any weird bytes and ensure the output remains on one log line.

  • If NewMetrics is called more than once, there will be multiple open file handles to metrics.log (and only the first one will actually work, I think). Is it possible that NewMetrics could be called more than once?

Making a singleton *Metrics variable causes problems with how Convey does tests. It shouldn't be called more than once, but for now I'm using sync.Once on the logging, at least, so it's explicit.

Okay.

  • snowflake-broker_test.go: I'm not crazy about having the tests depend on external files, especially ones that may change. Ideally the test code would include its own small geoip files, either as separate files or as literals in the source code. But at least, the test should t.Skip if a file is missing, rather than report spurious failure. I suggest adding a layer of abstraction: a GeoIPLoad(io.Reader) that parses any io.Reader (for example a bytes.Reader), and writing GeoIPLoadFile in terms of it.
  • It would be nice to also have tests that exercise the error paths in geoip parsing.

I haven't made this abstraction at the moment and went for the small geoip files option.

Okay.

  • I would like to see unit tests for basic functions like GeoIPRangeClosure and GeoIPCheckRange. They don't have to be long, just test the edge conditions of an instance.

I replaced these functions with much simpler one-line statements, but I added edge cases to the existing geoip test to make sure we correctly categorize IPs at the edge of a range.

Okay.

comment:25 Changed 5 weeks ago by dcf

Oh and it looks like country stats don't get incremented whenever GetCountryByAddr doesn't find a match. I'm afraid this will make analysis difficult if a large fraction of proxies aren't covered by the geoip database, or the database is improperly installed, or something. Could we count them with a country code of "??" or something?

https://github.com/cohosh/snowflake/blob/fb01c2d1c9d402cc4d96f01df2e6fb6cb37bc9a6/broker/geoip.go#L213
GetCountryByAddr returns (string, error), but failure to find an entry in a database is not really an "error". It makes it seem like there are three return states possible (found, not found, error) when really there are only two (found, not found). How about changing the signature to

func GetCountryByAddr(table GeoIPTable, ip net.IP) (string, bool)

where the second return value is an ok flag, like when indexing a map or reading from a channel. The caller can then do the "??" substitution whenever !ok.

comment:26 Changed 5 weeks ago by cohosh

Cc: irl karsten added

comment:27 in reply to:  25 ; Changed 4 weeks ago by cohosh

Replying to dcf:

I have just a few further changes to recommend.

Updated branch: https://github.com/cohosh/snowflake/compare/geoip

Ack >.< thanks. Done!

Done

Documented as unsupported so we know what our line of thought was in case we want to support it later


Okay. Let's replace the panic with a log.Fatal so the failure gets logged.

Done

  • The parseEntry functions need error checking on geoipStringToIP and net.ParseIP, or else they will store a nil that will result in a panic later on. This error will be more useful if it includes a line number. Let me suggest a different factoring of the geoip parsing code. Have parseEntry be a plain function, not a method on GeoIPv4Table, and have it return (GeoIPEntry, error) rather than inserting into the table itself. GeoIPLoadFile can do the entry insertion, count lines and annotate the error result from parseEntry.

The difficulty in refactoring it this way is that the IPv4 and IPv6 tables have differently formatted entries that need to be parsed differently. I added error checking to parseEntry and had it return (GeoIPEntry, error) as suggested, but left it as a method on GeoIPv4Table and GeoIPv6Table.

I didn't explain this well. I meant that there should be separate functions for the two formats e.g. parseEntryIPv4 and parseEntryIPv6. The existing parseEntry methods never refer to table, so they don't need to be methods. But leaving them as methods is fine too.

Okay, I'm going to leave them as methods to avoid having two different LoadFile functions for the different types of table.

IMO annotating the error message with the problematic line should be done in GeoIPLoadFile, not in parseEntry. This will eliminate the duplication of the common "Provided geoip file is incorrectly formatted" string and ensure that all the error paths get annotated. What I mean is, in the scanner.Scan() loop, you can replace return err with return fmt.Errorf("Provided geoip file is incorrectly formatted: %s. Line is: %+q", err, line).

Done, is there a better way of handling the error stubs in parseEntry other than returning fmt.Errorf("")?

Replying to dcf:

Oh and it looks like country stats don't get incremented whenever GetCountryByAddr doesn't find a match. I'm afraid this will make analysis difficult if a large fraction of proxies aren't covered by the geoip database, or the database is improperly installed, or something. Could we count them with a country code of "??" or something?

https://github.com/cohosh/snowflake/blob/fb01c2d1c9d402cc4d96f01df2e6fb6cb37bc9a6/broker/geoip.go#L213
GetCountryByAddr returns (string, error), but failure to find an entry in a database is not really an "error". It makes it seem like there are three possible return states (found, not found, error) when really there are only two (found, not found). How about changing the signature to

func GetCountryByAddr(table GeoIPTable, ip net.IP) (string, bool)

where the second return value is an ok flag, like when indexing a map or reading from a channel. The caller can then do the "??" substitution whenever !ok.

Okay, so I've changed the GetCountryByAddr signature to what you've suggested and then just removed the error return value from UpdateCountryStats. I've also added setting the country to "??" in UpdateCountryStats.

comment:28 Changed 4 weeks ago by cohosh

Status: needs_revision → needs_review

comment:29 in reply to:  27 Changed 4 weeks ago by dcf

Status: needs_review → merge_ready

Looks good! Nice work on this.

Replying to cohosh:

Replying to dcf:

IMO annotating the error message with the problematic line should be done in GeoIPLoadFile, not in parseEntry. This will eliminate the duplication of the common "Provided geoip file is incorrectly formatted" string and ensure that all the error paths get annotated. What I mean is, in the scanner.Scan() loop, you can replace return err with return fmt.Errorf("Provided geoip file is incorrectly formatted: %s. Line is: %+q", err, line).

Done, is there a better way of handling the error stubs in parseEntry other than returning fmt.Errorf("")?

There's nothing wrong with returning a specific "couldn't parse IP: " error message IMO, as long as the message gets printed at the top level in GeoIPLoadFile, not in parseEntry. Or, if you just want a generic error sentinel without a meaningful message, you can create an error object at the top-level scope and just return that:

var errParseError error = errors.New("parse error")

Replying to dcf:

Oh and it looks like country stats don't get incremented whenever GetCountryByAddr doesn't find a match. I'm afraid this will make analysis difficult if a large fraction of proxies aren't covered by the geoip database, or the database is improperly installed, or something. Could we count them with a country code of "??" or something?

Okay, so I've changed the GetCountryByAddr signature to what you've suggested and then just removed the error return value from UpdateCountryStats. I've also added setting the country to "??" in UpdateCountryStats.

The documentation for GetCountryByAddr should explain the second return value.

I think you can remove this condition now:

	if country != "" {

comment:30 Changed 2 weeks ago by dcf

While you are waiting for feedback on #21315, I think it would be okay to do a provisional deployment of (say) 7 days in order to get an idea of what a typical metrics.log file looks like. In our meeting today, cohosh suggested removing the proxy geoip information from /debug during the provisional deployment, and only storing it in metrics.log.

comment:31 in reply to:  30 Changed 2 weeks ago by cohosh

Replying to dcf:

While you are waiting for feedback on #21315, I think it would be okay to do a provisional deployment of (say) 7 days in order to get an idea of what a typical metrics.log file looks like. In our meeting today, cohosh suggested removing the proxy geoip information from /debug during the provisional deployment, and only storing it in metrics.log.

I did this and a few other things to make it work better with our specific deployment: https://github.com/cohosh/snowflake/compare/geoip_squashed

  • the broker wasn't failing correctly if the hardcoded "metrics.log" file couldn't be opened, so I added a command-line option to specify this file, defaulting to stdout. I also made sure to log.Fatal if the specified file can't be opened. This required a bit of refactoring.
  • Fixed the log output to not print an extra newline character
  • Removed the if country != "" {} condition and added the documentation as requested above

comment:32 Changed 2 weeks ago by cohosh

More on deployment:

I tried the following steps on snowflake-broker:

  • add deb https://deb.torproject.org/torproject.org stretch main to /etc/apt/sources.list.d/tor.list
  • installed tor-geoipdb (which also installed tor and torsocks)
  • changed /etc/runit/snowflake-broker to:
    exec chpst -u snowflake-broker /usr/local/bin/broker --acme-hostnames snowflake-broker.bamsoftware.com --acme-email dcf@torproject.org --metrics-log /home/snowflake-broker/metrics.log 2>&1
    

But I'm still getting the error message
2019/05/10 18:07:13 open metrics.log: permission denied

I see that snowflake-broker is the owner of /home/snowflake-broker but it's assigned to group nogroup. Is that the cause of this problem?

comment:33 Changed 11 days ago by cohosh

It was an error in the code that's now fixed. The snowflake-broker appears to be opening the log file just fine.

comment:34 Changed 10 days ago by cohosh

Okay finally provisionally deployed the geoip updates and merged the changes to master.
