#21220 closed defect (not a bug)

remove data entries for non-existing or not-anymore-existing countries

Reported by: iwakeh
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Website
Version:
Severity: Minor
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The datasets client.csv, userstats-combined.csv, and servers.csv contain some two-letter country codes for countries that do not exist or no longer exist, i.e., a1, a2, ap, cs, dd, du, eu. These codes are not used for serving graphs on the Metrics website, but they are included in the data files.

Shouldn't these be removed from the data?

Change History (6)

comment:1 Changed 11 months ago by karsten

Status: changed from new to needs_information

Hmm, what's the benefit of removing them, other than saving a few lines?

A possible downside is that whenever a new country code gets added, we'll need to update the filter ASAP, or we'll later have to reprocess data. But if we don't remove any data entries in the aggregate step, we can easily add new country codes to the graphs and be done with it.

Oh, and other people might use our .csv files, so we shouldn't restrict ourselves to our own visualizations here.

I'd say let's leave them in, unless there are other reasons for taking them out that I didn't think of here. What do you think?

comment:2 Changed 11 months ago by iwakeh

This ticket is not about providing only the data displayed on Tor Metrics, but about providing only meaningful data.
It'll be perfectly fine to provide data for an existing country in the csv-file that cannot be shown on the web-site (yet).

I mentioned that graphs for these country codes are not displayed on the site because that is a good thing, i.e., there is no need to act on the website side.

The main reason for removing the data mentioned in the description is that these country codes don't make sense.

Maybe also use this ticket to investigate why they are there?

comment:3 Changed 11 months ago by karsten

Those country codes are there because relays reported those codes in their directory-request statistics. Here are the codes used by MaxMind, which probably cover most of the ones you listed. Others might come from custom GeoIP files provided by relay operators. There's no good way for us to remove some of these country codes and still be prepared for new codes to be added in the future. I'd rather leave it to applications to use whichever codes they want and ignore the rest.
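
As an illustration of the "leave it to applications" approach, here is a minimal sketch of filtering on the consumer side. The column names (`date`, `country`, `users`), the sample rows, and the helper `filter_rows` are assumptions made up for this example; they are not taken from the actual .csv files.

```python
import csv
import io

# Hypothetical sample in the spirit of the published .csv files; the column
# names and all values are assumptions made for illustration only.
SAMPLE_CSV = """date,country,users
2017-01-01,de,1000
2017-01-01,a1,12
2017-01-01,dd,3
2017-01-01,us,2000
"""

# The application decides which codes it cares about; the data file stays complete.
WANTED_CODES = {"de", "us"}


def filter_rows(csv_text, wanted_codes):
    """Return only the rows whose country code is in wanted_codes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["country"] in wanted_codes]


rows = filter_rows(SAMPLE_CSV, WANTED_CODES)
for row in rows:
    print(row["date"], row["country"], row["users"])
```

With this split, supporting a newly added country code only means extending `WANTED_CODES` in the application; nothing has to be reprocessed in the aggregated data.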

comment:4 Changed 11 months ago by iwakeh

Yes, on second thought it makes sense to leave the codes reported as they are. The data explanation should point out that (as country resolution can be influenced by the relay operator) there can be weird and outdated codes. I'll add a separate ticket for adapting the texts later.

I could only find cs and dd, thanks for the GeoIp link. Now, knowing that a1 and a2 translate to "Anonymous Proxy" and "Satellite Provider", I'd say that at least the satellite related data should also be graphed.

Other than that, it seems this ticket is finished.

comment:5 in reply to: 4 Changed 11 months ago by karsten

Replying to iwakeh:

> Yes, on second thought it makes sense to leave the codes reported as they are. The data explanation should point out that (as country resolution can be influenced by the relay operator) there can be weird and outdated codes. I'll add a separate ticket for adapting the texts later.

In theory, I'm all for improving documentation. But I wonder how many people would benefit from such an explanation and how many would be even more confused or would stop reading because of too much text. Please keep that in mind when writing more text. ;)

> I could only find cs and dd, thanks for the GeoIp link. Now, knowing that a1 and a2 translate to "Anonymous Proxy" and "Satellite Provider", I'd say that at least the satellite related data should also be graphed.

The problem with "Anonymous Proxy" and "Satellite Provider" is that MaxMind put those in for otherwise valid country codes. That's why we switched to their GeoLite2 format which doesn't have this issue anymore. So, if there are still any users reported as "Anonymous Proxy" (what's that, after all?) or "Satellite Provider", those are likely heavily undercounted. If we had information on countries and on connection type, I'd say, let's add a graph for it. But with the given data, I'd say let's just consider "Anonymous Proxy" and "Satellite Provider" users as any other users whose IP address did not resolve to a country code.
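
A minimal sketch of counting a1 and a2 together with unresolved users could look as follows. The `"??"` marker for unresolved addresses, the helper `fold_pseudo_codes`, and all numbers are assumptions for illustration only.

```python
from collections import defaultdict

# Assumed marker for "did not resolve to a country code"; hypothetical here.
UNRESOLVED = "??"
# Codes the ticket identifies as "Anonymous Proxy" and "Satellite Provider".
PSEUDO_CODES = {"a1", "a2"}


def fold_pseudo_codes(counts):
    """Sum per-code counts, adding a1/a2 counts to the unresolved bucket."""
    folded = defaultdict(int)
    for code, users in counts:
        key = UNRESOLVED if code in PSEUDO_CODES else code
        folded[key] += users
    return dict(folded)


# Made-up numbers for illustration only.
sample = [("de", 1000), ("a1", 12), ("a2", 5), ("??", 40), ("us", 2000)]
print(fold_pseudo_codes(sample))
```

The folding happens when reading the data, so the published files keep the codes exactly as relays reported them.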

> Other than that, it seems this ticket is finished.

Okay. Feel free to close if you're happy with the answers. :)

comment:6 in reply to:  5 Changed 11 months ago by iwakeh

Resolution: not a bug
Status: changed from needs_information to closed

Replying to karsten:

> Replying to iwakeh:
>
> > Yes, on second thought it makes sense to leave the codes reported as they are. The data explanation should point out that (as country resolution can be influenced by the relay operator) there can be weird and outdated codes. I'll add a separate ticket for adapting the texts later.
>
> In theory, I'm all for improving documentation. But I wonder how many people would benefit from such an explanation and how many would be even more confused or would stop reading because of too much text. Please keep that in mind when writing more text. ;)

Yes, I'll keep it short; just in order to prevent future tickets like this one ;-)
(I could simply refer to this ticket ;-)

> > I could only find cs and dd, thanks for the GeoIp link. Now, knowing that a1 and a2 translate to "Anonymous Proxy" and "Satellite Provider", I'd say that at least the satellite related data should also be graphed.
>
> The problem with "Anonymous Proxy" and "Satellite Provider" is that MaxMind put those in for otherwise valid country codes. That's why we switched to their GeoLite2 format which doesn't have this issue anymore. So, if there are still any users reported as "Anonymous Proxy" (what's that, after all?) or "Satellite Provider", those are likely heavily undercounted. If we had information on countries and on connection type, I'd say, let's add a graph for it. But with the given data, I'd say let's just consider "Anonymous Proxy" and "Satellite Provider" users as any other users whose IP address did not resolve to a country code.

Real world data is always unclean, ok, sigh.

> > Other than that, it seems this ticket is finished.
>
> Okay. Feel free to close if you're happy with the answers. :)

Thanks for the discussion! Even if nothing is changed, now it's clear why things are as they are :-)

Closing.
