Opened 9 months ago

Closed 8 months ago

#28116 closed enhancement (fixed)

Split up legacy module into more maintainable parts

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/Statistics Version:
Severity: Normal Keywords:
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: irl Sponsor:

Description

Our legacy module is a mess. That code dates back to a time when we tried to use a single database for all our statistics and for a service called relay search, which was not the same service as today's relay search. While I'm not ruling out that we can make a single-database approach work for everything we want to do with our data, it's not going to be this database.

It's time to move away from this legacy database and take a similar approach as we're taking for the other modules, where we only store the relevant parts that we need for our graphs.

As of now, the legacy module provides data for the following graphs:

  • In the Servers category:
    1. Relays and bridges
    2. Relays by relay flag
    3. Relays by tor version
    4. Relays by platform
  • In the Traffic category:
    1. Total relay bandwidth
    2. Advertised and consumed bandwidth by relay flag
    3. Consumed bandwidth by Exit/Guard flag combination
    4. Bandwidth spent on answering directory requests

Viewed from a different perspective, these 8 graphs show 3 different metrics:

  • Relay or bridge counts in graphs 1 to 4
  • Advertised bandwidths in graphs 5 and 6
  • Bandwidth histories in graphs 5 to 8

I could imagine that we make the following changes to split up the legacy module into more maintainable parts:

  1. Use existing data from the ipv6servers module for graph 1 and for the advertised bandwidth portions in graphs 5 and 6. This data already exists with only trivial differences affecting how we're treating missing data. We could just switch.
  2. Extend the ipv6servers module to also provide data for graphs 2 to 4. This extension would require us to reimport the entire archive, so it's more of a rewrite. But the ipv6servers module code is much cleaner and easier to extend than the legacy module code. And when we extend that module, we can relatively easily add bridge statistics and other relay metrics like consensus weight or path selection probabilities that we can use in new graphs later on. All in all not a trivial amount of work, but probably worth it.
  3. Keep the remaining parts of the legacy module for the bandwidth history parts in graphs 5 to 8. Bandwidth histories are going to be replaced by PrivCount data in the medium term anyway. We could keep the legacy module around for another year or two without planning to change much during that time. And when we shut it down, we can keep a copy of the aggregate data around, just like we're going to keep a static summary of the Tor Messenger statistics (#26047).

I'll start working on the second suggested change above. The two other changes depend on whether that second change can be made successfully.

Child Tickets

Attachments (1)

versions-2018-10-29.png (130.4 KB) - added by karsten 9 months ago.

Download all attachments as: .zip

Change History (12)

Changed 9 months ago by karsten

Attachment: versions-2018-10-29.png added

comment:1 Changed 9 months ago by karsten

Status: assignedneeds_review

Has it been 10 days since I created this ticket. And I didn't do much else than work on this ticket since then. Whee, not a trivial amount of work.

Alright. I finally have a branch that implements the second suggested change: extend the ipv6servers module to produce statistics on relays/bridges by relay flag, version, and platform.

Please review commit f3ceaa3 in my task-28116-2 branch. I admit, it's a lot of new code. But it's going to let us to remove a lot of old code, too!

So, when this code is reviewed, merged, and deployed, the result is that four existing graphs look pretty much as they looked before. Ohhhhh.

That's why I made another graph to showcase the cool new data that this extension/rewrite produces as a by-product:


Here's what's new in this graph:

  • By version it considers -dev, -alpha, -beta, -rc, and stable versions rather than x.y.z version numbers. Both perspectives can be useful, but this perspective was not possible with the legacy database.
  • Consensus weight fraction and the various guard/middle/exit probabilities were not available in the legacy database.
  • The legacy database did not have any numbers for bridges. The new ipv6servers database has them, including bridge numbers by relay flag and platform.

We can make more graphs later. But we first need to replace existing data, so let's prioritize that.

comment:2 Changed 9 months ago by irl

Status: needs_reviewmerge_ready

This looks good to me.

We might want to consider renaming the not-very-IPv6-specific-anymore module.

comment:3 in reply to:  2 ; Changed 9 months ago by karsten

Replying to irl:

This looks good to me.

Awesome! I'll start deploying changes over the next days. That will take a while, though.

We might want to consider renaming the not-very-IPv6-specific-anymore module.

Agreed. I'd like to rename it to just servers. But we still have a servers module which is just another name for the legacy module. Confusing, right?

My plan is to:

  1. merge/deploy these changes,
  2. remove then unused code from the legacy module,
  3. rename the legacy module to bandwidth, because it's only remaining purpose will be to produce our bandwidth history statistics, and
  4. rename the ipv6servers module to servers.

Anyway, doing step 1 first. Keeping this ticket open until that's the case, and then creating tickets for the next steps.

comment:4 Changed 9 months ago by irl

This plan sounds good.

comment:5 Changed 8 months ago by karsten

Merged to master with two fixes, 73f162d and 57c7fd6. Still working on deployment.

comment:6 Changed 8 months ago by karsten

Deployed. Next steps will be next.

comment:7 in reply to:  3 Changed 8 months ago by karsten

Status: merge_readyneeds_review

Replying to karsten:

My plan is to:

  1. merge/deploy these changes,

Already done.

For the remaining steps I started a new branch task-28116-3 with six commits so far:

  • 88535e1 fixes unit tests;
  1. remove then unused code from the legacy module,
  • 6765b8e stops generating the now unused servers.csv in the legacy module;
  • 09cfdfd stops including advertised bandwidth data in bandwidth.csv, which is also produced by the legacy module, and instead switches to data from the ipv6servers module;
  • ca5fa45 removes quite a bit of now unused code from the legacy module;
  1. rename the legacy module to bandwidth, because it's only remaining purpose will be to produce our bandwidth history statistics, and
  • f8fa108 modernizes a bit more of the legacy module and renames it to bwhist;
  1. rename the ipv6servers module to servers.
  • c95125a finally renames the ipv6servers module to just servers.

All commits are tested locally, though I suspect that deployment will be another challenge, just like deployment of the first step was not exactly trivial. Still very much worth making this change!

Please take a look. :)

comment:8 Changed 8 months ago by irl

Reviewer: irl

comment:9 Changed 8 months ago by irl

Status: needs_reviewmerge_ready

This all looks OK.

comment:10 Changed 8 months ago by karsten

Thanks for looking. Pushed to master, currently deploying.

comment:11 Changed 8 months ago by karsten

Resolution: fixed
Status: merge_readyclosed

Deployed! Closing. Thanks!

Note: See TracTickets for help on using tickets.