Add auxiliary data on Tor relays and bridges to CollecTor
This ticket is the result of a local TODO list review and combines a few related ideas. Some of the ideas here are new, but some are really old and have been sitting on my list forever.
The general idea here is that CollecTor could provide auxiliary data on Tor relays and bridges. The main goal is for other applications, such as Onionoo and Metrics but also Nyx, to use this data to provide richer information on relays and bridges to their users. A secondary goal is for CollecTor to serve as an archive of this data for future applications that don't exist yet.
Auxiliary data might include:
1. GeoIP country database: This is the same data that the Tor daemon uses internally to resolve relay IP addresses to country codes. We would be able to produce historical data by extracting src/config/geoip files from the Tor daemon Git repository. This data could be used by Metrics to bring back the relays by country graph.
2. GeoIP city database: This is the same data that Onionoo uses to resolve relay IP addresses to city names. The main advantage of having this file in CollecTor would be that Onionoo could automatically pull this data instead of relying on the operator to update GeoIP files.
3. GeoIP ASN database: This is similar to 2 but for ASN information.
4. Bridge GeoIP country database: Here's an idea to provide country information for bridges despite replacing IP addresses by hashes. CollecTor could keep a list of all bridge IP addresses in a given month and use the GeoIP country database from 1 to produce a custom database for resolving bridge IP addresses to country codes. Basically, that database would contain hashed fingerprints, 10.x.y.z IP addresses, and country codes. CollecTor would add a new line to this file whenever it observes a new bridge IP address, which would happen up to once per hour, in particular at the beginning of a month. The file would start over once per month when hashes for 10.x.y.z addresses change. Generating older database files, however, would require reprocessing the entire bridge tarball archive, because we have long deleted the inputs for generating those old 10.x.y.z IP addresses. Consumers of this data would be Onionoo but also Metrics for a new bridge country graph.
5. Relay reverse DNS entries: Right now, Onionoo runs its own rDNS resolver. But we could just as well run that as part of CollecTor and provide the output data in a new data format to everyone who needs it. There would also be other consumers of this data, including the relay controller Nyx, which could display rDNS entries without leaking who is fetching that information.
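To make the bridge GeoIP country database idea a bit more concrete, here is a rough Python sketch, not a proposed implementation. It assumes the country database consists of comma-separated "start,end,CC" lines with integer IPv4 range bounds and "#" comments. The names parse_geoip, lookup_country, and BridgeCountryDatabase, the "fingerprint address country" output line, and the sample ranges and country codes are all made up for illustration. The sanitized 10.x.y.z address is assumed to come from the existing sanitizing step and is simply passed in.

```python
import bisect
import ipaddress


def parse_geoip(lines):
    """Parse 'start,end,CC' lines into a sorted list of (start, end, cc)."""
    ranges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        start, end, country = line.split(",")
        ranges.append((int(start), int(end), country.lower()))
    ranges.sort()
    return ranges


def lookup_country(ranges, address):
    """Return the country code for an IPv4 address, or '??' if unknown."""
    value = int(ipaddress.IPv4Address(address))
    starts = [r[0] for r in ranges]
    # Index of the last range that starts at or before the address.
    index = bisect.bisect_right(starts, value) - 1
    if index >= 0 and ranges[index][0] <= value <= ranges[index][1]:
        return ranges[index][2]
    return "??"


class BridgeCountryDatabase:
    """Append-only monthly database (hypothetical format).

    Maps (hashed fingerprint, sanitized address) to a country code that
    was resolved from the original address before it was discarded.
    """

    def __init__(self, ranges):
        self.ranges = ranges
        self.entries = {}

    def observe(self, hashed_fingerprint, original_address, sanitized_address):
        """Return a new database line, or None if the entry is already known."""
        key = (hashed_fingerprint, sanitized_address)
        if key in self.entries:
            return None
        country = lookup_country(self.ranges, original_address)
        self.entries[key] = country
        return f"{hashed_fingerprint} {sanitized_address} {country}"


# Made-up sample data using documentation address ranges.
sample = [
    "# start,end,country",
    "3221225984,3221226239,DE",  # 192.0.2.0 - 192.0.2.255
    "3325256704,3325256959,US",  # 198.51.100.0 - 198.51.100.255
]
ranges = parse_geoip(sample)
db = BridgeCountryDatabase(ranges)
first = db.observe("A" * 40, "192.0.2.7", "10.1.2.3")
second = db.observe("A" * 40, "192.0.2.7", "10.1.2.3")
```

Only the hashed fingerprint, the sanitized address, and the country code end up in the output line, so the original address never leaves the sanitizing process. A real version would create a fresh database each month, when the 10.x.y.z mapping changes.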
This is a lot, but maybe there's even more. It's probably useful to discuss these different new data sets together. Once we decide to provide some or even all of them, we should switch to child tickets. And just to set expectations: even if we think it's a good idea, it will probably take months to find enough time to implement these new data sets.