Opened 2 years ago

Closed 8 months ago

#21588 closed enhancement (fixed)

Rewrite the censorship detector used by the Tor Metrics website in Java

Reported by: karsten
Owned by: karsten
Priority: Medium
Milestone:
Component: Metrics/Statistics
Version:
Severity: Normal
Keywords: metrics-2018
Cc: metrics-team
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The censorship detector written by George Danezis in 2011 is the only part of the Tor Metrics website that is written in Python. We should consider rewriting it in Java in order to integrate it more closely into the rest of the Tor Metrics website code. This is also related to #19754.

iwakeh, want to comment on whether this makes sense or not, before somebody else comes and picks this up?

(The following thoughts depend on whether we reach consensus in the metrics team that this is even a good idea.)

The first step of this rewrite should be to create a minimal setup of the Python code that doesn't require setting up a full instance of the Tor Metrics website. I'll attach a compressed version of the input file userstats-detector.csv to this ticket. Running the Python version should be as simple as downloading that attachment and the two Python files detector.py and country_info.py from metrics-web's clients module and running:

unxz userstats-detector.csv.xz
python detector.py

Those commands should run for a few minutes and produce a couple of files, including userstats-ranges.csv, which is the only output file we care about:

date,country,minusers,maxusers
2011-09-08,a1,559.698186453,1399.64885163
2011-09-09,a1,469.497090181,1451.46081727
2011-09-11,a1,639.857484235,1457.19233381
2011-09-12,a1,597.260782974,1312.46735446
[...]

Step two could be to throw out any code that is not required to produce this output file. Ideally, this would happen in one or more separate commits.

Step three would be to look at the external dependencies required to rewrite the remaining code in Java. I haven't looked into this at all yet, so maybe it is doable without adding external dependencies, which would be best. But if external dependencies are necessary, maybe there's something in Apache Commons that we can use here. In any case, adding external dependencies requires discussion on this ticket.

Step four would be to do the rewrite and to verify that it produces roughly the same results (we're cutting off decimal places, for example). There's a guide on coding style here.

Step five would be to review the new code and integrate it into metrics-web.

All in all, I could imagine that steps 1 to 4 might be an interesting task for a new volunteer. Optimistically adding the metrics-help keyword.

But let's first discuss whether this rewrite makes sense, or whether there's a better plan to do it!

Child Tickets

Attachments (1)

userstats-detector.csv.xz (766.7 KB) - added by karsten 2 years ago.


Change History (31)

comment:1 Changed 2 years ago by karsten

Component: changed from "- Select a component" to Metrics/Website
Keywords: metrics-help added
Owner: set to metrics-team

Changed 2 years ago by karsten

Attachment: userstats-detector.csv.xz added

comment:2 in reply to:  description Changed 2 years ago by iwakeh

Keywords: metrics-help removed

Replying to karsten:

The censorship detector written by George Danezis in 2011 is the only part of the Tor Metrics website that is written in Python. We should consider rewriting it in Java in order to integrate it more closely into the rest of the Tor Metrics website code. This is also related to #19754.

iwakeh, want to comment on whether this makes sense or not, before somebody else comes and picks this up?

Sure.

(The following thoughts depend on whether we reach consensus in the metrics team that this is even a good idea.)

It is definitely a good idea to remove another language from the stack and the need to call external scripts.

[snip]

Steps 1 and 2 are for getting to know the current code (and there is a lot of it that is not used in metrics-web); required reading is the tech report describing the detector.

Step three would be to look at the external dependencies required to rewrite the remaining code in Java. I haven't looked into this at all yet, so maybe it is doable without adding external dependencies, which would be best. But if external dependencies are necessary, maybe there's something in Apache Commons that we can use here. In any case, adding external dependencies requires discussion on this ticket.

For fitting a normal distribution and calculating percentiles for normal and Poisson distributions, a library will be necessary. I know that commons-math offers some functionality in that direction, but it might not support fitting a normal distribution. So, quite some research is necessary here. Requirements for a useful library:

  • pure java
  • no further dependencies
  • and available in Debian stable.

It might even be necessary to tweak the detector calculations.
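As a point of reference, some of these primitives are small enough to need no library at all. Below is a dependency-free sketch (class and method names are invented for illustration, not from any existing code) of a Poisson percentile computed by accumulating the CDF, and of the normal CDF via the well-known Abramowitz & Stegun erf approximation (formula 7.1.26, maximum error around 1.5e-7):

```java
public class DistributionPrimitives {

  /** Smallest k with P(X <= k) >= p for X ~ Poisson(lambda). */
  static int poissonQuantile(double lambda, double p) {
    double pmf = Math.exp(-lambda);  // P(X = 0)
    double cdf = pmf;
    int k = 0;
    while (cdf < p) {
      k++;
      pmf *= lambda / k;  // recurrence: P(X = k) = P(X = k - 1) * lambda / k
      cdf += pmf;
    }
    return k;
  }

  /** P(X <= x) for X ~ Normal(mu, sigma). */
  static double normalCdf(double x, double mu, double sigma) {
    return 0.5 * (1.0 + erf((x - mu) / (sigma * Math.sqrt(2.0))));
  }

  /** erf via Abramowitz & Stegun 7.1.26 (max error ~1.5e-7). */
  static double erf(double x) {
    double sign = Math.signum(x);
    x = Math.abs(x);
    double t = 1.0 / (1.0 + 0.3275911 * x);
    return sign * (1.0 - (((((1.061405429 * t - 1.453152027) * t)
        + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t
        * Math.exp(-x * x));
  }
}
```

Distribution *fitting*, as opposed to evaluating CDFs and quantiles, is the harder part and is what would actually drive the library choice.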

Step four would be to do the rewrite and to verify that it produces roughly the same results (we're cutting off decimal places, for example). There's a guide on coding style here.

A rough description of the design should be discussed here first to ensure encapsulation, object-oriented design, and the other goals we listed in the MetricsJavaStyle guide (not only CodingStyle), i.e., step 4a) rough design suggestion, step 4b) design review, step 4c) coding.

Step five would be to review the new code and integrate it into metrics-web.

This and step 4 depend quite a bit on redesigning the entire MetricsWeb.

All in all, I could imagine that steps 1 to 4 might be an interesting task for a new volunteer. Optimistically adding the metrics-help keyword.

But let's first discuss whether this rewrite makes sense, or whether there's a better plan to do it!

Because of 'let's first discuss' I would tend to remove "metrics-help". I think metrics-help tasks should be clearly defined and possible to take up by a developer without prior experience with the Metrics codebase. So, we should wait and then add metrics-help to more clearly defined tasks derived from this one.
(Hope you don't mind me removing the tag!)

comment:3 Changed 2 years ago by karsten

Sure, removing the keyword and discussing more internally works for me!

I guess one of my assumptions above was that step 4 wouldn't produce more than 100 or 200 lines of Java code, where we wouldn't have to worry much about design and could easily transform the output into whatever design we want as part of step 5.

Another assumption was that reading the report in steps 1 and 2 wouldn't be necessary, because there's plenty of sample data to test that the rewrite produces the same results as the original.

But I made those assumptions without having looked at the code for years (pretty much since 2011). And I believe that you're more into this topic right now than I am, because of your work on that other technical report. Would you want to give this a few more thoughts, in particular whether we can turn this into a mostly self-contained task for a new volunteer? I agree that we should be pretty sure what we want before asking somebody else to do it. :)

comment:4 in reply to:  3 Changed 2 years ago by iwakeh

Replying to karsten:

Sure, removing the keyword and discussing more internally works for me!

Phew :-)

I guess one of my assumptions above was that step 4 wouldn't produce more than 100 or 200 lines of Java code, where we wouldn't have to worry much about design and could easily transform the output into whatever design we want as part of step 5.

Well, we don't want to re-implement the Java re-implementation ;-)
(And there might be more lines depending on the country code setup, which might also be used elsewhere ...)

Another assumption was that reading the report in steps 1 and 2 wouldn't be necessary, because there's plenty of sample data to test that the rewrite produces the same results as the original.

As this is going to be done by a Java person, the Python might be less readable to them than the report, which really describes the implementation.

But I made those assumptions without having looked at the code for years (pretty much since 2011). And I believe that you're more into this topic right now than I am, because of your work on that other technical report. Would you want to give this a few more thoughts, in particular whether we can turn this into a mostly self-contained task for a new volunteer?

Sure, I'll think about this. (Actually, that was already on my list of suggestions for MetricsWeb.)

I agree that we should be pretty sure what we want before asking somebody else to do it. :)

And that way the task will be more rewarding for a volunteer.

comment:5 Changed 2 years ago by karsten

Component: changed from Metrics/Website to Metrics/Statistics

Moving all tickets to Metrics/Statistics that are more related to the data-aggregating modules than to the website parts of metrics-web.

comment:6 Changed 2 years ago by karsten

Keywords: metrics-2018 added

comment:7 Changed 17 months ago by iwakeh

Owner: changed from metrics-team to karsten
Status: changed from new to assigned

Assigning to karsten as he is already working on this.

comment:8 Changed 17 months ago by iwakeh

Regarding distribution fitting, which is currently done by scipy and used as part of the detector:

Scipy didn't re-implement the functionality, but uses minpack (www.netlib.org/minpack). We briefly discussed using minpack (written in Fortran) via JNI. This would add a dependency on the Debian package 'libcminpack1' (currently installed as a dependency of scipy), which would also have to be provided on the Java library classpath. In addition, an interface would need to be created to access the used functions (hybrd, hybrj, etc.) natively. All in all, more fragile 'plumbing' than calling scipy from Java.

Next steps:

  • decide whether the native access should really be implemented from Java
  • create a new package org.torproject.metrics.math for the functionality around distribution fitting and other advanced optimization math, for detection and other purposes.

comment:9 in reply to: 8 Changed 17 months ago by karsten

Hmm, okay, that sounds like we need to choose the least bad solution.

If we go for calling scipy from Java, let's at least reduce that to a single call and not create a Python process for each norm.fit call. We could write all input data to a file, call Python to fit normal distributions and write an output file, and read that in Java. Not pretty, but that at least reduces overhead a bit.

Are there really no native Java libraries that do the same thing?

comment:10 in reply to: 9 Changed 17 months ago by iwakeh

Replying to karsten:

Hmm, okay, that sounds like we need to choose the least bad solution.

If we go for calling scipy from Java, let's at least reduce that to a single call and not create a Python process for each norm.fit call. We could write all input data to a file, call Python to fit normal distributions and write an output file, and read that in Java. Not pretty, but that at least reduces overhead a bit.

Yep, let's be pragmatic. I would still create the new package for the calls in case a different solution comes up.

Are there really no native Java libraries that do the same thing?

Fortran is well suited for scientific computations, so why re-implement? Apache Commons does re-implement one minpack algorithm in a simplified way, without any means to judge the quality. The algorithm they provide doesn't give the good results that scipy's choice of minpack algorithm delivers. I didn't encounter any free Java implementation that would provide the functionality. Maybe Apache Commons Math will one day provide all the optimization algorithms.

comment:11 Changed 17 months ago by iwakeh

Cc: metrics-team added

Adding metrics-team to cc, because I only noticed the update by accident.

comment:12 in reply to: 10 Changed 17 months ago by karsten

Replying to iwakeh:

Replying to karsten:

Hmm, okay, that sounds like we need to choose the least bad solution.

If we go for calling scipy from Java, let's at least reduce that to a single call and not create a Python process for each norm.fit call. We could write all input data to a file, call Python to fit normal distributions and write an output file, and read that in Java. Not pretty, but that at least reduces overhead a bit.

Yep, let's be pragmatic. I would still create the new package for the calls in case a different solution comes up.

Okay. Can we design this package in a way that math calls can either be executed directly, or computed in a batch fashion? Because, for scipy, I think we should really go for batch execution, whereas future solutions might not require that.

Are there really no native Java libraries that do the same thing?

Fortran is well suited for scientific computations, so why re-implement? Apache Commons does re-implement one minpack algorithm in a simplified way, without any means to judge the quality. The algorithm they provide doesn't give the good results that scipy's choice of minpack algorithm delivers. I didn't encounter any free Java implementation that would provide the functionality. Maybe Apache Commons Math will one day provide all the optimization algorithms.

Okay. Let's be pragmatic and assume it won't happen.

comment:13 in reply to:  12 Changed 17 months ago by iwakeh

Replying to karsten:

...
Okay. Can we design this package in a way that math calls can either be executed directly, or computed in a batch fashion? Because, for scipy, I think we should really go for batch execution, whereas future solutions might not require that.

Sure. Future solutions might not need the batch processing, but it might be convenient anyway for the calling part.
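The batch hand-off discussed in this exchange could look roughly like the following sketch. Everything here is an assumption for illustration: the file format, the class and method names, and the hypothetical fit_normals.py script are all invented, not taken from the actual implementation. The idea is that Java writes every sample series into one file, a single Python process fits them all, and Java reads the results back:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class ScipyBatchSketch {

  /** Writes one line per series: key,v1,v2,... */
  static void writeBatchInput(Path file, Map<String, double[]> series)
      throws IOException {
    try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
      for (Map.Entry<String, double[]> e : series.entrySet()) {
        StringBuilder line = new StringBuilder(e.getKey());
        for (double v : e.getValue()) {
          line.append(',').append(v);
        }
        out.println(line);
      }
    }
  }

  /** Parses the fitted results back: one "key,mu,sigma" line per series. */
  static Map<String, double[]> parseBatchOutput(Path file) throws IOException {
    Map<String, double[]> result = new LinkedHashMap<>();
    for (String line : Files.readAllLines(file)) {
      String[] parts = line.split(",");
      result.put(parts[0], new double[] {
          Double.parseDouble(parts[1]), Double.parseDouble(parts[2]) });
    }
    return result;
  }

  // The single external call would go between these two steps, e.g.:
  //   new ProcessBuilder("python", "fit_normals.py",
  //       in.toString(), out.toString()).inheritIO().start().waitFor();
  // where fit_normals.py is a hypothetical script that loops over the input
  // lines, calls scipy's norm.fit on each, and writes one result line each.
}
```

This keeps exactly one Python process per detector run instead of one per norm.fit call, and a future pure-Java fitter could implement the same two-file contract or skip the files entirely.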

comment:14 Changed 17 months ago by iwakeh

Owner: changed from karsten to iwakeh
Status: changed from assigned to accepted

comment:15 Changed 17 months ago by iwakeh

As discussed in the comments above, the pragmatic solution for the moment will be to call scipy from Java in a well-encapsulated way.

comment:16 Changed 16 months ago by iwakeh

Another solution would be to replace the fitting procedures with simpler means of estimating the relevant distribution parameters (as suggested and discussed via mail by Karsten).

We should check how this affects the censorship event estimates.
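For a normal distribution, the "simpler means" can be quite simple indeed: the maximum-likelihood fit reduces to the sample mean and (population) standard deviation, which needs no fitting library at all. A sketch in Java (class and method names invented for illustration):

```java
import java.util.Arrays;

public class MomentEstimator {

  /** Returns {mu, sigma} estimated directly from the sample moments. */
  static double[] estimateNormal(double[] samples) {
    double mean = Arrays.stream(samples).average().orElse(0.0);
    double var = Arrays.stream(samples)
        .map(x -> (x - mean) * (x - mean))
        .average().orElse(0.0);
    return new double[] { mean, Math.sqrt(var) };
  }
}
```

Whether the detector's event estimates stay the same under this replacement is exactly what needs checking against the existing output.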

comment:17 Changed 16 months ago by iwakeh

Owner: changed from iwakeh to metrics-team
Status: changed from accepted to assigned

Setting to metrics-team as this is currently a lower priority on my task list.


comment:18 in reply to:  16 Changed 16 months ago by karsten

Owner: changed from metrics-team to karsten
Status: changed from assigned to accepted

Replying to iwakeh:

Another solution will be to replace the fitting procedures with simpler means of estimating the relevant distribution parameters (as suggested and discussed via mail by Karsten).

We should check how this affects the censorship event estimates.

Alright, I tried this out, and it produced the exact same result. I'd say let's change this. Please review my commit 13cbcea in my task-21588 branch.

(Deploying this change will be trivial, as the censorship detector recomputes new lower/upper bounds for the entire data set every time it runs.)

comment:19 Changed 16 months ago by karsten

Status: changed from accepted to needs_review

comment:20 Changed 16 months ago by iwakeh

Reviewer: iwakeh

comment:21 Changed 16 months ago by karsten

Priority: changed from Medium to High

Setting priority to high, because it would allow me to move forward with sponsor 13 deliverable 2.

comment:22 Changed 16 months ago by iwakeh

Status: changed from needs_review to merge_ready

This looks fine and reasonable! And your simulation results show the same.

comment:23 Changed 16 months ago by iwakeh

Please adjust the status for the Java implementation as needed.

comment:24 Changed 16 months ago by karsten

Status: changed from merge_ready to needs_information

Great! I'm going to merge and deploy this change in the next one to two hours, when the current background update is done. It will first run tonight.

Before we rewrite the censorship detector in Java, I'd like to make another change: let's replace the median and IQR calculation with something more standard from SciPy/NumPy. I'll do that, but I'd like to wait for the first patches to #26035 before making this change. Once the Python code implements exactly what we'd like to specify for Sponsor 13 deliverable 2, we can easily rewrite it in Java. Setting to needs_information while this is blocked on #26035.
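Whatever standard routine the Python code ends up using, the Java rewrite will have to reproduce its semantics. numpy.percentile defaults to linear interpolation between closest ranks, which can be sketched in Java as follows (a sketch for discussion with illustrative names, not the final code):

```java
import java.util.Arrays;

public class Percentiles {

  /** p in [0, 100]; linear interpolation, matching numpy's default. */
  static double percentile(double[] values, double p) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    double h = (sorted.length - 1) * p / 100.0;  // fractional rank
    int lo = (int) Math.floor(h);
    int hi = (int) Math.ceil(h);
    return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]);
  }

  static double median(double[] values) {
    return percentile(values, 50.0);
  }

  static double interquartileRange(double[] values) {
    return percentile(values, 75.0) - percentile(values, 25.0);
  }
}
```

Since the detector recomputes all bounds on every run, any discrepancy between the two percentile definitions would show up immediately when comparing output files.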

comment:25 Changed 15 months ago by iwakeh

Yes, using more standard Python functionality makes sense.

comment:26 Changed 13 months ago by karsten

Priority: changed from High to Medium

This ticket is about the same priority as most other Metrics/* tickets. Setting priority back to medium.

comment:27 Changed 9 months ago by karsten

Reviewer: iwakeh
Status: changed from needs_information to needs_review

I finally re-implemented the Python censorship detector in Java and integrated it into the clients module. Please review commits f16f360 and 70f5e0f in my task-21588 branch. Note that these two commits are based on #28799 and #28801, but that shouldn't really matter for the review. Also note that most of the comments aren't relevant for this review, because this Java rewrite is solely based on the latest Python version.

comment:28 Changed 8 months ago by notirl

Status: changed from needs_review to needs_revision

The copyright should be updated to include the conversion to Java.

Otherwise it looks OK.

comment:29 Changed 8 months ago by karsten

Well, the copyright does include the Java conversion:

+/* Copyright 2011 George Danezis <gdane@microsoft.com>
+ *
+ * All rights reserved.
+ *
...
+ *
+ * (Clear BSD license:
+ * http://labs.metacarta.com/license-explanation.html#license)
+ *
+ * Copyright 2018 The Tor Project
+ * See LICENSE for licensing information */

Can you suggest a better way to do this?

comment:30 Changed 8 months ago by karsten

Resolution: fixed
Status: changed from needs_revision to closed

Okay, I merged and deployed this (with a minor tweak, because we haven't merged #28801 yet). We can always make the copyright notice clearer. Closing. Thanks!
