Opened 7 years ago

Last modified 9 months ago

#8127 assigned enhancement

Bring back the relays-by-country graph

Reported by: karsten
Owned by: metrics-team
Priority: Low
Milestone:
Component: Metrics/Website
Version:
Severity: Normal
Keywords:
Cc: anadahz
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Aggregating data for the relays-by-country graph has become prohibitively expensive. It keeps the server busy for 2 hours every day, which affects more important tasks like downloading descriptors. That's why I disabled this aggregation step on February 1; the relays-by-country graphs are still available but won't receive new data. The problem is the PostgreSQL-based IP-to-country lookup. I should look into making this lookup much, much faster. Creating this ticket so I don't forget.
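One possible direction for a faster lookup, sketched here purely as an assumption and not as the approach taken on the metrics server: load the IP ranges into memory once, sort them, and answer each query with a binary search instead of a database round trip. The class name, the range format, and the sample data below are all hypothetical.

```python
import bisect
import ipaddress


class CountryLookup:
    """In-memory IPv4-to-country lookup over sorted, non-overlapping ranges."""

    def __init__(self, ranges):
        # ranges: iterable of (first_ip, last_ip, country_code) strings.
        rows = sorted(
            (int(ipaddress.IPv4Address(first)),
             int(ipaddress.IPv4Address(last)),
             code)
            for first, last, code in ranges
        )
        self._starts = [row[0] for row in rows]
        self._ends = [row[1] for row in rows]
        self._codes = [row[2] for row in rows]

    def lookup(self, address):
        """Return the country code for address, or None if no range matches."""
        value = int(ipaddress.IPv4Address(address))
        # Index of the last range whose start is <= value.
        i = bisect.bisect_right(self._starts, value) - 1
        if i >= 0 and value <= self._ends[i]:
            return self._codes[i]
        return None


# Hypothetical sample ranges, for illustration only (not real geoip data).
SAMPLE_RANGES = [
    ("10.0.0.0", "10.0.0.255", "aa"),
    ("10.0.2.0", "10.0.2.255", "bb"),
]
lookup = CountryLookup(SAMPLE_RANGES)
```

Each lookup is then O(log n) in the number of ranges, with no per-address query against PostgreSQL. Whether this removes the actual bottleneck would need measuring.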


Change History (6)

comment:1 Changed 3 years ago by karsten

Severity: Normal
Summary: changed from "Fix and re-enable relays-by-country graph on metrics website" to "Bring back the relays-by-country graph"
Type: changed from defect to enhancement

comment:2 Changed 2 years ago by karsten

Owner: changed from karsten to metrics-team
Status: changed from new to assigned

Handing over to metrics-team, because I'm not currently working on this.

comment:3 Changed 9 months ago by anadahz

Cc: anadahz added

What kind of resources are required?

Can a dedicated server allocated only to do this task help to bring back the relays-by-country graphs?

comment:4 in reply to: 3 Changed 9 months ago by karsten

Replying to anadahz:

> What kind of resources are required?
>
> Can a dedicated server allocated only to do this task help to bring back the relays-by-country graphs?

Unfortunately, it's not just a question of hardware. The code used for the blog post is good enough to run once for a blog post, but it needs more work before it can be run periodically. Here are a few issues:

  • Every time this code runs, it processes *all* descriptors in the in/ directory. In a production environment we'd want it to skip descriptors it has processed before and reuse the aggregations previously computed from them.
  • Updating geoip files is a manual step. In fact, we're currently using the very same geoip file in a graph covering years of data. We'll need to find a way to automate geoip file updates. And we need to define which geoip file we're using for any given consensus. That last part alone is far from trivial if we want to ensure that two people have a chance to independently produce the same graph.
  • Everything here works with files, but we'll want to use a database, or we'll be sad whenever the server reboots at the wrong moment. And we want the database schema to scale for the next five years.
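To illustrate the first and third points together, here is a minimal sketch, assuming SQLite as a stand-in database and a caller-supplied `aggregate` function. The table names, the schema, and both function names are hypothetical and not taken from the existing code. The idea is to record each processed file in the same transaction as its results, so a reboot at the wrong moment leaves either both or neither.

```python
import sqlite3
from pathlib import Path


def open_state(db_path):
    """Open (and if needed create) the hypothetical state database."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (path TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relays_by_country ("
        "valid_after TEXT, country TEXT, relays INTEGER, "
        "PRIMARY KEY (valid_after, country))"
    )
    return conn


def process_new(conn, in_dir, aggregate):
    """Aggregate only files not seen before; return how many were new.

    aggregate(path) is a caller-supplied (hypothetical) function returning
    (valid_after, {country: relay_count}) for one descriptor file.
    """
    new_files = 0
    for path in sorted(Path(in_dir).iterdir()):
        if not path.is_file():
            continue
        seen = conn.execute(
            "SELECT 1 FROM processed WHERE path = ?", (path.name,)
        ).fetchone()
        if seen:
            continue  # Already aggregated in an earlier run.
        valid_after, counts = aggregate(path)
        # One transaction per file: the results and the "processed" marker
        # are committed together, or rolled back together on failure.
        with conn:
            conn.executemany(
                "INSERT OR REPLACE INTO relays_by_country VALUES (?, ?, ?)",
                [(valid_after, cc, n) for cc, n in counts.items()],
            )
            conn.execute("INSERT INTO processed VALUES (?)", (path.name,))
        new_files += 1
    return new_files
```

A second run over an unchanged in/ directory would then do no aggregation work at all. Whether this schema scales for five years of data is a separate question that the sketch does not answer.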

comment:5 in reply to: 4 Changed 9 months ago by anadahz

Replying to karsten:

> Replying to anadahz:
>
> > What kind of resources are required?
> >
> > Can a dedicated server allocated only to do this task help to bring back the relays-by-country graphs?
>
> Unfortunately, it's not just a question of hardware. The code used for the blog post is good enough to run once for a blog post, but it needs more work before it can be run periodically. Here are a few issues:
>
>   • Every time this code runs, it processes *all* descriptors in the in/ directory. In a production environment we'd want it to skip descriptors it has processed before and reuse the aggregations previously computed from them.
>   • Updating geoip files is a manual step. In fact, we're currently using the very same geoip file in a graph covering years of data. We'll need to find a way to automate geoip file updates. And we need to define which geoip file we're using for any given consensus. That last part alone is far from trivial if we want to ensure that two people have a chance to independently produce the same graph.

Aren't these the same GeoIP files as the ones used for Tor metrics currently?

>   • Everything here works with files, but we'll want to use a database, or we'll be sad whenever the server reboots at the wrong moment. And we want the database schema to scale for the next five years.

Nonetheless, do you think these issues could be created as separate sub-tickets?

comment:6 in reply to:  5 Changed 9 months ago by karsten

Replying to anadahz:

> Replying to karsten:
>
> >   • Updating geoip files is a manual step. In fact, we're currently using the very same geoip file in a graph covering years of data. We'll need to find a way to automate geoip file updates. And we need to define which geoip file we're using for any given consensus. That last part alone is far from trivial if we want to ensure that two people have a chance to independently produce the same graph.
>
> Aren't these the same GeoIP files as the ones used for Tor metrics currently?

Well, Onionoo uses the latest of these GeoIP files in MaxMind's format. But nothing else in Tor Metrics uses these files. None of this is hard; it's just a couple of substeps that need to be done.
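As an illustration of one such substep, here is a sketch of one possible rule for tying a geoip file to a consensus: use the file published most recently at or before the consensus valid-after time. The rule, the function name, and the release list are assumptions made for this example, not an agreed policy.

```python
import bisect
from datetime import datetime


def pick_geoip(releases, valid_after):
    """Return the geoip file published most recently at or before valid_after.

    releases: list of (published, filename) tuples sorted by published.
    Returns None if no file had been published yet at that time.
    """
    published = [p for p, _ in releases]
    i = bisect.bisect_right(published, valid_after) - 1
    return releases[i][1] if i >= 0 else None


# Hypothetical release list; names and dates are made up for illustration.
RELEASES = [
    (datetime(2019, 1, 1), "geoip-2019-01"),
    (datetime(2019, 2, 1), "geoip-2019-02"),
]
```

With a fixed release list and a fixed rule like this, two people processing the same consensus would pick the same geoip file.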

> >   • Everything here works with files, but we'll want to use a database, or we'll be sad whenever the server reboots at the wrong moment. And we want the database schema to scale for the next five years.
>
> Nonetheless, do you think these issues could be created as separate sub-tickets?

Not really. These were just some examples, not a list of things that need to be done to resolve this ticket. I'd like to leave the implementation steps to whoever implements this.
