Don't trust "bridge-ips" blindly for user number estimates
Reported by: karsten
I think I found a bug in the user number estimates that led to the confusion on #13171.
When I developed the algorithm for estimating user numbers, bridges only reported how many directory requests they responded to ("dirreq-v3-resp"), but not how these directory requests were distributed to countries ("dirreq-v3-reqs"). What they did report was how many different IP addresses by country connected to the bridge ("bridge-ips"). The goal back then was to provide better user numbers per country, so I put in the assumption that the geographic distributions of directory responses and connecting IP addresses would be roughly the same. And I think that assumption is still valid for most cases.
However, the meek version before the #13171 fix broke this assumption. Here's an example from a meek bridge that didn't have this fix yet (descriptor digest 462a2bcc..):
  extra-info UtahMeekBridge 88F745840F47CE0C6A4FE61D827950B06F9E4534
  published 2015-12-09 22:53:48
  dirreq-v3-resp ok=17656,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=6160,busy=0
  bridge-ips de=16,cn=8,us=8
It's rather unlikely that 17656 responses were sent back to 32 or fewer IP addresses. Still, following the assumption above, we're saying that half of those 17656 responses were sent back to Germany and one quarter each to China and the U.S.A., and that seems dangerously wrong.
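To make the arithmetic concrete, here is a minimal Python sketch of that proportional split, applied to the numbers from the descriptor above. The function names `parse_counts` and `split_by_shares` are made up for illustration; this is not the actual metrics code.

```python
def parse_counts(value):
    """Parse a comma-separated 'key=count' list such as 'de=16,cn=8,us=8'."""
    counts = {}
    for item in value.split(","):
        key, _, number = item.partition("=")
        counts[key] = int(number)
    return counts


def split_by_shares(total, counts):
    """Distribute `total` across keys in proportion to `counts`."""
    denominator = sum(counts.values())
    return {key: total * count / denominator for key, count in counts.items()}


# Values taken from the UtahMeekBridge descriptor quoted above.
bridge_ips = parse_counts("de=16,cn=8,us=8")
responses_by_country = split_by_shares(17656, bridge_ips)
print(responses_by_country)
# {'de': 8828.0, 'cn': 4414.0, 'us': 4414.0}
```

So 8828 responses would be attributed to Germany based on just 16 reported IP addresses.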
I'm going to attach a scatter plot in a minute, dirreq-resp-by-bridge-ips-2016-01-27.png, that plots the numbers of "dirreq-v3-resp ok=..." against "bridge-ips" for statistics reported between December 1, 2015 and last week. The two meek bridges 88F7.. and AA03.. stand out quite a bit there as clusters close to the y axis.
I have a fix in mind that comes in several parts. The first part would be to ignore all statistics where a bridge reports, say, 10 or more directory requests per unique IP address. That would remove all dots to the left of the dashed line in the graph.
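A rough sketch of such a plausibility check could look like this. The function name and the `max_per_ip` parameter are hypothetical, and I'm assuming the "ok" count of "dirreq-v3-resp" is the number to compare against the sum of all "bridge-ips" values.

```python
def is_plausible(ok_responses, bridge_ips, max_per_ip=10):
    """Return True if fewer than `max_per_ip` responses per unique IP were reported.

    `bridge_ips` maps country codes to reported unique IP address counts.
    Assumption: a bridge reporting zero IP addresses never passes the check.
    """
    total_ips = sum(bridge_ips.values())
    return ok_responses < max_per_ip * total_ips


# UtahMeekBridge: 17656 responses for 32 reported IP addresses.
print(17656 / 32)                                            # 551.75
print(is_plausible(17656, {"de": 16, "cn": 8, "us": 8}))     # False
```

The threshold of 10 is the "say, 10" from above and would need tuning against the scatter plot.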
The second part of the fix would be to stop combining "dirreq-v3-resp" and "bridge-ips" numbers and instead use the reported distribution of directory requests by country ("dirreq-v3-reqs"), which was not available 3.5 years ago. Starting roughly 2 years ago, though, more and more bridges have been publishing these statistics.
Here's a descriptor (fe171d40..) that was published last week by the same bridge as above, now named MeekGoogle, after it received the meek-specific #13171 fix:
  extra-info MeekGoogle 88F745840F47CE0C6A4FE61D827950B06F9E4534
  published 2016-01-22 13:11:10
  dirreq-v3-reqs us=7200,ru=1576,de=1520,[..],cn=88,[..]
  dirreq-v3-resp ok=22016,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=6016,busy=0
  bridge-ips us=3016,ru=632,gb=536,de=528,[..],cn=40,[..]
  bridge-ip-versions v4=8752,v6=64
  bridge-ip-transports <OR>=8,meek=8808
I'm attaching a second scatter plot, dirreq-resp-by-dirreq-reqs-2016-01-27.png, that compares the numbers of "dirreq-v3-resp ok=..." to "dirreq-v3-reqs". The correlation is close to linear, which makes sense, because the number of directory requests should roughly match the number of directory responses. I think we can make the user number estimates a bit more accurate by making this switch. We would still fall back to "bridge-ips" if "dirreq-v3-reqs" is empty, but that would mostly affect older statistics.
Part three of the plan would be to remove the "bridge-ips" line entirely from little-t-tor, because we wouldn't use it anymore. It's worth noting that we'd lose the ability to filter out meek bridges that don't have the #13171 fix and that don't report usable "dirreq-v3-reqs" statistics. Or rather, we wouldn't spot future meek-like bridges affected by a similar bug.
Here's why. The first bridge descriptor above also contained a "dirreq-v3-reqs" line that I left out before:
  extra-info UtahMeekBridge 88F745840F47CE0C6A4FE61D827950B06F9E4534
  published 2015-12-09 22:53:48
  dirreq-v3-resp ok=17656,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=6160,busy=0
  dirreq-v3-reqs us=17648,cn=8
  bridge-ips de=16,cn=8,us=8
We wouldn't be able to filter out this bridge without the "bridge-ips" line. We would have to assume that the vast majority of requests to this bridge came from the U.S.A., and a tiny minority from China.
I think this is acceptable, because the purpose of statistics shouldn't be to validate the correctness of other statistics.
To summarize my plan, here's what I'd like to do:
- If a bridge reports both a "dirreq-v3-resp" and a "bridge-ips" line, check whether the number of "ok" responses is smaller than 10 times the total number of reported IP addresses; if not, ignore the directory-request statistics reported by this bridge.
- If a bridge only reports a "bridge-ips" line and no "dirreq-v3-reqs" line, assume that the country distributions are the same, which is what we're doing right now.
- If a bridge reports a "dirreq-v3-reqs" line, use that for user number estimates and ignore the "bridge-ips" line in case it's present.
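Put together, the three rules could be sketched roughly as follows. This is only an illustration under my own assumptions, not the actual implementation: the function name `country_shares`, its parameters, and the choice to return `None` for ignored statistics are all hypothetical.

```python
def country_shares(resp_ok, bridge_ips=None, dirreq_v3_reqs=None, max_per_ip=10):
    """Return a dict of per-country fractions, or None if the stats are ignored.

    resp_ok:         the "ok" count from the "dirreq-v3-resp" line
    bridge_ips:      country -> unique IP addresses, or None if not reported
    dirreq_v3_reqs:  country -> directory requests, or None if not reported
    """
    # Rule 1: if "bridge-ips" is present, require fewer than `max_per_ip`
    # responses per reported IP address; otherwise ignore this bridge's stats.
    if bridge_ips and resp_ok >= max_per_ip * sum(bridge_ips.values()):
        return None

    # Rule 3 takes precedence over rule 2: prefer "dirreq-v3-reqs" when
    # present, and only fall back to "bridge-ips" when it is missing or empty.
    source = dirreq_v3_reqs or bridge_ips
    if not source:
        return None
    total = sum(source.values())
    if total == 0:
        return None
    return {country: count / total for country, count in source.items()}


# UtahMeekBridge with both lines: filtered out by rule 1.
print(country_shares(17656,
                     bridge_ips={"de": 16, "cn": 8, "us": 8},
                     dirreq_v3_reqs={"us": 17648, "cn": 8}))   # None

# Same bridge if "bridge-ips" were removed: nearly everything goes to "us".
print(country_shares(17656, dirreq_v3_reqs={"us": 17648, "cn": 8}))
```

The second call shows the trade-off described above: without "bridge-ips" there is nothing left to filter on, and the "dirreq-v3-reqs" distribution is taken at face value.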
Hope this report was not too confusing. Feedback much appreciated.