Opened 14 months ago
Closed 8 months ago
#26035 closed enhancement (fixed)
Streamline sample quantile types used in the various modules
Reported by: | karsten | Owned by: | karsten |
---|---|---|---|
Priority: | Medium | Milestone: | |
Component: | Metrics/Statistics | Version: | |
Severity: | Normal | Keywords: | |
Cc: | metrics-team | Actual Points: | |
Parent ID: | Points: | ||
Reviewer: | irl | Sponsor: | Sponsor13 |
Description
While documenting how to reproduce our various statistics, I noticed that we're using different methods/formulas for computing sample quantiles, that is, the median, quartiles, percentiles, and so on. Ideally, we would settle on one method and use that everywhere. The benefit is easier documentation and reproducibility.
Here is a (probably still incomplete) list of graphs for which we calculate quantiles (with the tool written in parentheses):
- Relay users: Median and inter-quartile range of ratios in censorship detector (Python, possibly Java soon)
- Advertised bandwidth distribution: Percentiles, including the unusual 0-th percentile (Java) and median (R)
- Advertised bandwidth of n-th fastest relays: Median (R)
- Fraction of connections used uni-/bidirectionally: Quartiles (Java)
- Time to download files over Tor: Quartiles (PostgreSQL)
- Unique .onion addresses (version 2 only): Quartiles for weighted inter-quartile mean (Java)
- Onion-service traffic (versions 2 and 3): Quartiles for weighted inter-quartile mean (Java)
There exist surprisingly many ways for computing quantiles. I found the following links to be quite helpful:
- https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
- https://www.rdocumentation.org/packages/stats/versions/3.5.0/topics/quantile
Looking at the lists, we should probably pick two types: one discontinuous (R-1
to R-3
) and one continuous type (R-4
to R-9
). And ideally, we'd pick types that are either the defaults in the tools we're using or that we can easily select to use in those tools.
Going through our tools:
- PostgreSQL has two functions,
PERCENTILE_CONT
andPERCENTILE_DISC
, of which we already use the first. I did some experiments with a quite large sample set and found thatPERCENTILE_CONT
produces the exact same output asR-7
andPERCENTILE_DISC
must be eitherR-1
orR-2
. A math person might be able to say whether it'sR-1
orR-2
by looking at the PostgreSQL source code. And maybe that person would be able to confirm theR-7
part, too. It seems like we don't have the choice of using other types than these in PosrtgreSQL, though, or at least not easily. - R has support for all nine types. After all, they're named after this language. It seems like
R-7
is the default type. - Java with Apache Commons Math has support for all nine types,
R-1
toR-9
. And in theory, the two types we need shouldn't be terribly hard to re-implement, in case we want to avoid putting in this not-exactly-tiny library as dependency. - Python with SciPy/Numpy probably has support for some types, but I guess we're not planning to keep our Python code anyway, so this doesn't really matter.
Whee, long ticket. Thoughts?
Child Tickets
Change History (27)
comment:1 Changed 14 months ago by
comment:2 Changed 14 months ago by
Thanks for this thoughtful response! I have just a few more thoughts on this:
I'm not entirely sure how that pseudocode handles percentile
values of 0.0 and 1.0. Is val()
0-indexed or 1-indexed? In either case, we'll end up outside of values with one of the two percentiles. (PostgreSQL's PERCENTILE_CONT
does support percentiles 0.0 and 1.0.)
I also looked more closely at types R-1
and R-2
and figured that PostgreSQL's PERCENTILE_DISC
is R-1
, because it does not produce any averages. So, PostgreSQL implements R-1
and R-7
.
(By the way, when I refer to them as R-x
, that's mainly to simplify our discussion here. I'm happy to specify them with their formulas and only mention that they are what R defines as type x in their quantile()
function.)
Regarding R's median()
, function, that produces the same result as R-2
and R-7
, right?
I wonder if, for the sake of simplicity, we should avoid using PERCENTILE_DISC
(which we're not using yet, AFAIK) and only use PERCENTILE_CONT
and R's median()
. That is, use R-7
everywhere.
I do agree that interpolation between two integers representing user numbers doesn't make as much sense. But we can always truncate or round results, if we believe that integers are less confusing.
(I could imagine that if we were to compute percentiles of truly discrete variables like the number of tires mounted on trucks, we wouldn't want to return 7, but only actual sample values. I don't think that we need to worry about that here.)
Regarding Apache Commons Math, we're not using that yet, and I don't feel strongly about adding it as dependency or implementing this quite simple function ourselves, say, in metrics-lib. Worth adding tests, I guess.
Regarding Python, I'm amending my statement above a little bit. It's true that we're going to replace our last remaining Python code. Still, if we want to make our numbers reproducible, we'll have to accept that many of our users will want to reproduce them using Python. We should at least take a brief look how this would work.
So, your possible steps make sense. Is this something you'd like to work on?
comment:3 Changed 14 months ago by
Owner: | changed from metrics-team to iwakeh |
---|---|
Status: | new → accepted |
Accepting.
Next steps:
- find decisions and answers for the few open questions.
- identify work packages and create sub-tickets.
- implement.
comment:4 Changed 14 months ago by
This turned out to be longer than intended:
a) Advertised bandwidth distribution:
- With the notation in comment:1 the Java code uses essentially
result = val(floor((N-1)*percentile))
about here. - R code takes the median of the Java calculated percentiles.
b) Advertised bandwidth of n-th fastest relays uses the resulting data from a).
c) Fraction of connections used uni-/bidirectionally: Uses Java and calculates result = val(floor(N*percentile))
for the three percentiles .25, .5, and .75.
d) Time to download files over Tor: uses percentile_cont
e) Unique .onion addresses (version 2 only): The code doesn't seem to calculate quartiles, rather checks that the interval is contained in the 25% to 75% interval of the fraction sum. Hmm, what am I missing here?
f) Onion-service traffic (versions 2 and 3): same as e).
In total, the Java calculations a) and c) use a discrete version of median calculation and differ in 'slight index shifting'.
The calculations in e) and f) are not really 'quartiles'.
The remaining calculations use R's median and postgresql percentile_cont, where R's standard median is calculated (in pseude code) as
first = floor(percentile * N); second = ceil(percentile * N); /* if first==second the value of first is used. */ if (first==second) { result = val(first); } else { /* if first and second differ, take the average. */ result = val(first) + (0.5 * (val(second) - val(first))); }
So R's standard median is the same as 50%-percentile of type R-2 and also coincides with 50%-percentile of type R-7.
There is a variety there. The discrete types are easier to compute (when trying to reproduce the results for example). Introducing the interpolation (or continuous) type in Java would mean to complicate the current code a little, but could be done w/o commons-math.
Of course, the two calculations in a) and c) should be the same, but that's only a minor change and not related to the choice of percentile calculation.
I: R-1
If we decide to use the discrete R-1 throughout. If so, we'd need to
- replace percentile_cont by percentile_dics, and
- replace R's median function throughout by the 50%-percentile type R-1 provided by utility function named
metricsmedian
.
II: R-7
If we decide to use of the proportionate interpolation method R-7 throughout, there are these work packages:
- implement a simple utility interpolation function for Java (similar to postgresql's approach)
- and make use of it in a) and c).
- replace R's median function throughout by the 50%-percentile type R-7 provided by utility function named
metricsmedian
.
In both cases the calculations e) and f) stay as they are, but need more documentation.
PS: (Trucks often have a spare tire mounted somewhere, and in some countries they use three wheeled trucks ;-)
comment:5 follow-ups: 6 12 Changed 14 months ago by
Thanks, very useful! Let me first try to answer the open questions:
- What's up with a) and c) using slightly different percentile implementations? The reason is that we're including the 0th (minimum) and 100th percentile (maximum) in a) which we're not in c). It's totally possible that what we're using right now for a) is a terrible hack. Maybe we should instead use the formula for c) in a) and handle percentile 0 or 100 as a special case. Whatever the other implementations do.
- What's up with e) and f) not being quartiles? What we're doing there is that we're computing the weighted quartiles. And again, it might be that it's a hack that we should rewrite. The goal should be to implement a weighted trimmed mean. The technical report probably has a better definition. What we cannot do, though, is use the exact same percentile definition as we're using for the other places.
- I think you left out the Python code that is our current censorship detector. Which is fine, as I see how we could change that code to match what we're doing elsewhere.
So, I guess the decision we need to make is whether we want to use R-1 or R-7 everywhere, right?
I'm slightly leaning towards R-7 here.
One reason is that, if we used R-1, we couldn't use R's default median()
anymore, because that interpolates. I found a non-interpolating median implementation in Python, called median_low (or median_high). And I think the Tor daemon uses a low median for some things related to directory authority voting. But I believe it's not the standard.
So, if we use R-7, we should have good tool support.
Except for Java where we'd have to implement something ourselves, which would also have to handle special cases 0 and 100.
By the way, do you feel strongly about avoiding Apache Commons Math? We'd only have to add it to metrics-web, and it would save us half a day of writing code and testing it. After all, we also rely on libraries for things like base64 encoding, which is not rocket science to implement ourselves. We wouldn't have to add it to the metrics-web .war file!
P.S.: Did I write something about trucks? I meant insect legs! Unless those have a spare leg mounted somewhere, too, in which case I'll think even harder about a good example. ;)
comment:6 follow-up: 7 Changed 14 months ago by
Replying to karsten:
By the way, do you feel strongly about avoiding Apache Commons Math? We'd only have to add it to metrics-web, and it would save us half a day of writing code and testing it. After all, we also rely on libraries for things like base64 encoding, which is not rocket science to implement ourselves. We wouldn't have to add it to the metrics-web .war file!
Related aspect: When we rewrite the censorship detector (#21588), we'll have to compute mean and standard deviation of approximately 50 sample values. We'll probably want to use a library for that, too.
comment:7 Changed 14 months ago by
Replying to karsten:
Replying to karsten:
By the way, do you feel strongly about avoiding Apache Commons Math? We'd only have to add it to metrics-web, and it would save us half a day of writing code and testing it. After all, we also rely on libraries for things like base64 encoding, which is not rocket science to implement ourselves. We wouldn't have to add it to the metrics-web .war file!
Related aspect: When we rewrite the censorship detector (#21588), we'll have to compute mean and standard deviation of approximately 50 sample values. We'll probably want to use a library for that, too.
Sure, commons-math might also be useful for future additions of new metrics and estimates.
So, let's settle on adding the dependency.
comment:8 follow-up: 9 Changed 14 months ago by
Sounds good!
(Did you also see my longer comment above the one that you replied to?)
comment:9 Changed 14 months ago by
Replying to karsten:
Sounds good!
(Did you also see my longer comment above the one that you replied to?)
Yes; still optimizing the reply length :-)
comment:10 Changed 14 months ago by
Heh, okay. I'm setting priority to high, just in case it helps with prioritizing tickets. Thanks!
comment:11 Changed 14 months ago by
Priority: | Medium → High |
---|
comment:12 Changed 14 months ago by
Replying to karsten:
Leaving out all commons-math related, b/c we agreed to use it.
Thanks, very useful! Let me first try to answer the open questions:
- What's up with a) and c) using slightly different percentile implementations? The reason is that we're including the 0th (minimum) and 100th percentile (maximum) in a) which we're not in c). It's totally possible that what we're using right now for a) is a terrible hack. Maybe we should instead use the formula for c) in a) and handle percentile 0 or 100 as a special case. Whatever the other implementations do.
Well, c) would fail on 1.0
, but that wouldn't occur b/c only quartiles are computed. This ought to be fixed and both implementations will be the same except for the edge cases.
- What's up with e) and f) not being quartiles? What we're doing there is that we're computing the weighted quartiles. And again, it might be that it's a hack that we should rewrite. The goal should be to implement a weighted trimmed mean. The technical report probably has a better definition. What we cannot do, though, is use the exact same percentile definition as we're using for the other places.
Well, I wouldn't call .25 times a value (fraction sum in the code) a quartile, and the code calculates the weighed mean of all intervals contained in [sumFraction * 0.25, sumFraction * 0.75]
. So, nothing to be done here.
...
I'm slightly leaning towards R-7 here.
I don't feel strongly about this.
...
Except for Java where we'd have to implement something ourselves, which would also have to handle special cases 0 and 100.
Yes the minimum and maximum need to be coded.
...
P.S.: Did I write something about trucks? I meant insect legs! Unless those have a spare leg mounted somewhere, too, in which case I'll think even harder about a good example. ;)
Well, for insects the leg number is fixed to six, unless they loose a leg and live on later. Might be best to stick to the values at hand ;-)
So, I implement the changes decided in this and the previous comments.
comment:13 Changed 14 months ago by
Sounds like we agree on the approach. (Except that we'll want to take another look at e) and f) when the others are done.) Looking forward to reviewing code! If, for any reason, you prefer to work on something else instead, I'm happy to help here. Thanks!
comment:14 follow-up: 15 Changed 14 months ago by
Owner: | changed from iwakeh to karsten |
---|---|
Status: | accepted → assigned |
Taking you up on your offer from comment:13, so I can concentrate on reviews and tickets of CollecTor.
comment:15 follow-up: 16 Changed 14 months ago by
Status: | assigned → needs_review |
---|
Replying to iwakeh:
Taking you up on your offer from comment:13, so I can concentrate on reviews and tickets of CollecTor.
Alright, happy to implement this change.
Please review my task-26035 branch with three commits:
- 4f92894 changes how we're computing median and inter-quartile range in the censorship detector code, which is still written in Python. I tested the change by running on our user number estimates. I found that it changes 159 of 2447 days in our data (6.5%) and leaves the remaining days entirely unchanged. This also makes sense: with a slightly different median and inter-quartile range we either include a value or exclude it as outlier. I'd say we cannot conclude that one of the implementations is correct and the other is not. The new implementation will simply be more consistent throughout our code base.
- 2685c78 makes the same change to our advertised bandwidth statistics. Obviously, this changes results a bit, because we're now interpolating between actually reported advertised bandwidths rather than returning a value that was actually reported by one of the relays. Still, for the sake of consistency throughout our code base, we should switch.
- f9c24ca makes the third change in this series, this time to the connbidirect module. The change is quite significant in years 2011 and 2012 where we had just a handful of relays reporting these statistics. Then it does make a difference whether we're interpolating or not. Same argument in favor of doing it now.
I'm currently re-processing the descriptor archive for updated advbwdist statistics (second commit above). Re-doing the clients and connbidirect statistics using the updated code is much simpler. I hope to be ready to deploy the change in the next few days. Ideally, we'd be done with the review process by then, too. Thanks in advance!
comment:16 Changed 14 months ago by
Status: | needs_review → needs_revision |
---|
Replying to karsten:
Replying to iwakeh:
Taking you up on your offer from comment:13, so I can concentrate on reviews and tickets of CollecTor.
Alright, happy to implement this change.
Thanks!
Please review my task-26035 branch with three commits:
- 4f92894 changes how we're computing median and inter-quartile range in the censorship detector code, which is still written in Python. I tested the change by running on our user number estimates. I found that it changes 159 of 2447 days in our data (6.5%) and leaves the remaining days entirely unchanged. This also makes sense: with a slightly different median and inter-quartile range we either include a value or exclude it as outlier. I'd say we cannot conclude that one of the implementations is correct and the other is not. The new implementation will simply be more consistent throughout our code base.
This looks fine and will make it easier to transfer this into Java later.
- 2685c78 makes the same change to our advertised bandwidth statistics. Obviously, this changes results a bit, because we're now interpolating between actually reported advertised bandwidths rather than returning a value that was actually reported by one of the relays. Still, for the sake of consistency throughout our code base, we should switch.
- f9c24ca makes the third change in this series, this time to the connbidirect module. The change is quite significant in years 2011 and 2012 where we had just a handful of relays reporting these statistics. Then it does make a difference whether we're interpolating or not. Same argument in favor of doing it now.
The advbwdist module has a new static method, which should be made more visible and thus facilitate re-use as well as testing.
Of course, for re-use it needs to be made more generic and maybe also placed in a different class (maybe **.stats.Utiliy
).
Remarks & suggestions in no particular order:
- the sorting step in advbwdist changes an input parameter, which is bad practice.
- commons-math Percentile class doesn't require the input data to be sorted. (The javadoc comment only talks about sorting in order to explain what will happen for edge cases.)
- Maybe rather use
doubleValue()
instead of casting a Number sub-type to a primitive. - Casting of percentile results could be performed by the caller, which could guarantee that there are only values of for example type short entered (see connbiderect). Or, provide special utility methods that re-use code internally.
- connbidirect uses similar code as advbwdist for almost the same computations. The input fraction list also get changed by the unnecessary sorting step (this might not matter in that case, but still)
- The Java re-implementation of the python detector will also benefit from a percentile function.
- The percentile input parameter
int[] percentiles
could be changed toint ... percentiles
.
Encapsulation and testability of this type of functionality that is needed throughout the code is essential and will also make documentation now and in future much easier.
The functionality should especially be tested b/c of the large impact such changes have, i.e., re-computation of everything. This should be revised.
comment:17 Changed 14 months ago by
Thanks for the review and remarks/suggestions! I'll revise the Java patches over the weekend or early next week.
comment:18 Changed 14 months ago by
Priority: | High → Medium |
---|
It seems unlikely that I'll get to this before the end of the month. De-prioritizing in order to focus on the tickets that we should really do in the remaining three days of May. This one can still happen in the first days of June.
comment:19 Changed 9 months ago by
Status: | needs_revision → accepted |
---|
Alright, I finally worked on a revision of my earlier patch series. I'd really like to get these fixes in, so that we're using the same definition of percentile everywhere in our code.
So, I went through all of iwakeh's remarks and suggestions and implemented most of them.
The suggestion that I did not implement is generalize and share the percentile computation code, for several reasons:
- While I think that sharing code between modules will be useful, I'd rather want to approach that top-down rather than bottom-up from this rather small method.
- These two methods are similar, but different. It's certainly possible to generalize them, but this has so far delayed this patch for too long.
- We're talking about 15 to 25 lines of code here, and the hard parts happen in the Apache Commons Math library. Things might be different if we implemented our own percentile logic, but that's not the case here.
Instead I went for separating the percentile computation code in both modules into their own methods and added unit tests.
irl, please review my task-26035-2 branch, which contains the rebased patches discussed above plus three new commits with unit test fixes and implementations of iwakeh's remarks and suggestions.
comment:20 Changed 9 months ago by
Status: | accepted → needs_review |
---|
comment:21 Changed 9 months ago by
As this is a larger review, I'd like to look at this after the current sponsor19 tasks, unless we think this is an immediate priority. If anyone else would like to do an initial review for now, then please do.
comment:23 Changed 9 months ago by
Reviewer: | → irl |
---|
comment:24 follow-up: 25 Changed 9 months ago by
Status: | needs_review → merge_ready |
---|
I think this is ready to go, but, we should create a new ticket for generalising the computePercentiles function with generics.
comment:25 follow-up: 26 Changed 9 months ago by
Replying to irl:
I think this is ready to go, [...]
Thanks for the review! Pushed to master, deploying now, step by step. Leaving this ticket open until everything's deployed.
[...] but, we should create a new ticket for generalising the computePercentiles function with generics.
I'll create a slightly different ticket before closing this one: we should share more code between modules, and a generic computePercentiles()
would be one piece of code to consider here. I wouldn't want to merge a patch that starts with this part of the code. I'd want us to approach this from a top-down perspective where we generalize similar functionality and use it in all modules to reduce the overall amount of code and likelihood of bugs.
comment:26 follow-up: 27 Changed 9 months ago by
Replying to karsten:
Replying to irl:
I think this is ready to go, [...]
Thanks for the review! Pushed to master, deploying now, step by step. Leaving this ticket open until everything's deployed.
Everything's deployed now.
[...] but, we should create a new ticket for generalising the computePercentiles function with generics.
I'll create a slightly different ticket before closing this one: we should share more code between modules, and a generic
computePercentiles()
would be one piece of code to consider here. I wouldn't want to merge a patch that starts with this part of the code. I'd want us to approach this from a top-down perspective where we generalize similar functionality and use it in all modules to reduce the overall amount of code and likelihood of bugs.
Still not closing until I get around to doing this.
comment:27 Changed 8 months ago by
Resolution: | → fixed |
---|---|
Status: | merge_ready → closed |
Replying to karsten:
Replying to karsten:
Replying to irl:
[...] but, we should create a new ticket for generalising the computePercentiles function with generics.
I'll create a slightly different ticket before closing this one: we should share more code between modules, and a generic
computePercentiles()
would be one piece of code to consider here. I wouldn't want to merge a patch that starts with this part of the code. I'd want us to approach this from a top-down perspective where we generalize similar functionality and use it in all modules to reduce the overall amount of code and likelihood of bugs.
Still not closing until I get around to doing this.
Alright, #28342 now exists with many more ideas on this topic. Closing. Thanks!
Long description, true.
Let's look at the postgresql implementation (using source of 9.6.8, but assuming it is not subject to frequent change, ./src/backend/utils/adt/orderedsetaggs.c).
The computation of percentile_* as C-ish pseudo-code (which could easily be used for implementation in Java or python):
For values, where fractions make sense, e.g. seconds for onionperf results, the interpolation or continuous method could be used and for all else, e.g. user numbers etc., the simpler calculation (refered to as discontinuous) could be used.
The classification R-x is not really a DIN and I wouldn't rely on an implementation without checking the source.
Thus, using our own implementation in Java might be less trouble and smaller dependency counts/space used. Python is not used anymore soon, and R is only used for the median and either could document the median calculation in R or implement the 0.5-percentile.
Possible steps: