Opened 16 months ago

Last modified 15 months ago

#18167 new defect

Don't trust "bridge-ips" blindly for user number estimates

Reported by: karsten
Owned by:
Priority: Medium
Milestone:
Component: Metrics/Metrics website
Version:
Severity: Major
Keywords: meek
Cc: dcf
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

I think I found a bug in the user number estimates that led to the confusion on #13171.

When I developed the algorithm for estimating user numbers, bridges only reported how many directory requests they responded to ("dirreq-v3-resp"), but not how these directory requests were distributed to countries ("dirreq-v3-reqs"). What they did report was how many different IP addresses by country connected to the bridge ("bridge-ips"). The goal back then was to provide better user numbers per country, so I put in the assumption that the geographic distributions of directory responses and connecting IP addresses would be roughly the same. And I think that assumption is still valid for most cases.

However, the meek version before the #13171 fix broke this assumption. Here's an example from a meek bridge that didn't have this fix yet (descriptor digest 462a2bcc..):

extra-info UtahMeekBridge 88F745840F47CE0C6A4FE61D827950B06F9E4534
published 2015-12-09 22:53:48
dirreq-v3-resp ok=17656,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=6160,busy=0
bridge-ips de=16,cn=8,us=8

It's rather unlikely that 17656 responses were sent back to 32 or fewer IP addresses. Still, following the assumption above, we're saying that half of those 17656 responses were sent back to Germany and one quarter each to China and the U.S.A., and that seems dangerously wrong.
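To make the arithmetic concrete, here is a minimal sketch of the apportioning that the assumption implies. The function name `apportion` is made up for illustration and is not taken from the actual estimation code.

```python
def apportion(total_responses, ips_by_country):
    """Split total_responses across countries in proportion to bridge-ips counts."""
    total_ips = sum(ips_by_country.values())
    return {cc: total_responses * n / total_ips
            for cc, n in ips_by_country.items()}

# Numbers from the UtahMeekBridge descriptor above.
bridge_ips = {"de": 16, "cn": 8, "us": 8}
estimate = apportion(17656, bridge_ips)
print(estimate)
```

With these inputs, Germany is credited with 8828 responses and China and the U.S.A. with 4414 each, although only a few dozen addresses were seen at all.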

I'm going to attach a scatter plot in a minute, dirreq-resp-by-bridge-ips-2016-01-27.png, that puts the numbers of "dirreq-v3-resp ok=..." and "bridge-ips" in relation for statistics reported between December 1, 2015 and last week. The two meek bridges 88F7.. and AA03.. stand out quite a bit there as clusters close to the y axis.

I have a fix in several parts in mind. The first part would be to ignore all statistics where each unique IP address would have had to make, say, 10 or more directory requests on average. That would remove all dots to the left of the dashed line in the graph.
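A rough sketch of such a check follows. The name `is_implausible` and the parameter `max_per_ip` are hypothetical, and the threshold of 10 is only the "say, 10" from above. For the later MeekGoogle descriptor the "bridge-ips" line is elided, so the sketch uses 8816 (the sum of its "bridge-ip-versions" counts) as the address total, which is an assumption.

```python
def is_implausible(ok_responses, unique_ips, max_per_ip=10):
    """True if each reported IP address would have made max_per_ip
    or more directory requests on average."""
    if unique_ips == 0:
        # Responses without any reported address cannot be explained at all.
        return ok_responses > 0
    return ok_responses >= max_per_ip * unique_ips

# UtahMeekBridge, 2015-12-09: 17656 responses, 16 + 8 + 8 = 32 addresses.
old_flagged = is_implausible(17656, 32)

# MeekGoogle, 2016-01-22: 22016 responses; 8816 addresses is an assumed
# total taken from bridge-ip-versions (v4=8752, v6=64).
new_flagged = is_implausible(22016, 8752 + 64)

print(old_flagged, new_flagged)
```

Under these assumptions the pre-fix descriptor would be dropped and the post-fix descriptor would be kept.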

The second part of the fix would be to stop combining "dirreq-v3-resp" and "bridge-ips" numbers and instead use the reported distributions of directory requests to countries ("dirreq-v3-reqs"), which were not available 3.5 years ago. Starting roughly 2 years ago, however, these statistics have been published by more and more bridges.

Here's a descriptor (fe171d40..) that was published last week by the same bridge as above, now named MeekGoogle, after it had received the meek-specific #13171 fix:

extra-info MeekGoogle 88F745840F47CE0C6A4FE61D827950B06F9E4534
published 2016-01-22 13:11:10
dirreq-v3-reqs us=7200,ru=1576,de=1520,[..],cn=88,[..]
dirreq-v3-resp ok=22016,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=6016,busy=0
bridge-ips us=3016,ru=632,gb=536,de=528,[..],cn=40,[..]
bridge-ip-versions v4=8752,v6=64
bridge-ip-transports <OR>=8,meek=8808

I'm attaching a second scatter plot, dirreq-resp-by-dirreq-reqs-2016-01-27.png, that compares the numbers of "dirreq-v3-resp ok=..." to "dirreq-v3-reqs". The correlation is close to linear, which makes sense, because the number of directory requests should roughly match the number of directory responses. I think we can make the user number estimates a bit more accurate by making this switch. We would still fall back to "bridge-ips" if "dirreq-v3-reqs" is empty, but that would mostly affect older statistics.
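As an illustration of the comparison behind that plot, the following sketch parses two lines of the older UtahMeekBridge descriptor (including its "dirreq-v3-reqs" line, us=17648,cn=8) and compares the sum of per-country requests with the "ok" responses. `parse_counts` is a hypothetical helper, not part of any Tor or Metrics library, and it simply skips elided items such as "[..]".

```python
def parse_counts(line):
    """Parse 'keyword k1=v1,k2=v2,...' into (keyword, {k: int(v)}).
    Items without a numeric value (for example '[..]') are skipped."""
    keyword, _, rest = line.strip().partition(" ")
    counts = {}
    for item in rest.split(","):
        key, sep, value = item.partition("=")
        if sep and value.isdigit():
            counts[key] = int(value)
    return keyword, counts

_, resp = parse_counts(
    "dirreq-v3-resp ok=17656,not-enough-sigs=0,unavailable=0,"
    "not-found=0,not-modified=6160,busy=0")
_, reqs = parse_counts("dirreq-v3-reqs us=17648,cn=8")

total_reqs = sum(reqs.values())
print(resp["ok"], total_reqs)
```

For this descriptor the two totals agree exactly, which is consistent with the near-linear relation described above.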

Part three of the plan would be to remove the "bridge-ips" line entirely from little-t-tor, because we wouldn't use it anymore. It's worth noting that we'd lose the ability to filter out meek bridges that don't have the #13171 fix and that don't report usable "dirreq-v3-reqs" statistics. Or rather, we wouldn't spot future meek-like bridges affected by a similar bug.

Here's why. The first bridge descriptor above also contained a "dirreq-v3-reqs" line that I left out before:

extra-info UtahMeekBridge 88F745840F47CE0C6A4FE61D827950B06F9E4534
published 2015-12-09 22:53:48
dirreq-v3-resp ok=17656,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=6160,busy=0
dirreq-v3-reqs us=17648,cn=8
bridge-ips de=16,cn=8,us=8

We wouldn't be able to filter out this bridge without the "bridge-ips" line. We would have to assume that the vast majority of requests to this bridge came from the U.S.A., and a tiny minority from China.
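The difference between the two lines can be put in numbers with a short sketch. `shares` is a made-up helper name, and the inputs are copied from the descriptor above.

```python
def shares(counts):
    """Turn absolute per-country counts into fractions of the total."""
    total = sum(counts.values())
    return {cc: n / total for cc, n in counts.items()}

by_ips = shares({"de": 16, "cn": 8, "us": 8})   # from "bridge-ips"
by_reqs = shares({"us": 17648, "cn": 8})        # from "dirreq-v3-reqs"

print(by_ips["us"], by_reqs["us"])
```

Going by "bridge-ips", a quarter of this bridge's users would be placed in the U.S.A.; going by "dirreq-v3-reqs", it would be more than 99.9%.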

I think this is acceptable, because the purpose of statistics shouldn't be to validate the correctness of other statistics.

To summarize my plan, here's what I'd like to do:

  1. If a bridge reports both a "dirreq-v3-resp" and a "bridge-ips" line, check if the first number is smaller than 10 times the second number; if not, ignore these directory-request statistics reported by this bridge.
  2. If a bridge only reports a "bridge-ips" line and no "dirreq-v3-reqs" line, assume that the country distributions are the same, which is what we're doing right now.
  3. If a bridge reports a "dirreq-v3-reqs" line, use that for user number estimates and ignore the "bridge-ips" line in case it's present.
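The three steps could be sketched roughly as follows. This is only an illustration of the plan as written here, not the code that runs on Metrics; the function and parameter names are hypothetical, and the inputs are assumed to be already-parsed per-country dictionaries.

```python
def country_distribution(ok_responses, bridge_ips, dirreq_reqs, max_per_ip=10):
    """Return per-country fractions for one bridge's statistics,
    or None if the statistics should be ignored."""
    def shares(counts):
        total = sum(counts.values())
        return {cc: n / total for cc, n in counts.items()}

    # Step 1: drop statistics where the responses are not smaller than
    # max_per_ip times the number of reported addresses.
    if bridge_ips and ok_responses >= max_per_ip * sum(bridge_ips.values()):
        return None
    # Step 3: prefer the reported per-country directory requests.
    if dirreq_reqs:
        return shares(dirreq_reqs)
    # Step 2: otherwise fall back to the distribution of addresses.
    if bridge_ips:
        return shares(bridge_ips)
    return None

# The pre-fix UtahMeekBridge descriptor is rejected by step 1.
utah = country_distribution(17656, {"de": 16, "cn": 8, "us": 8},
                            {"us": 17648, "cn": 8})
# Made-up numbers: both lines present, step 3 wins.
both = country_distribution(100, {"de": 16, "us": 16}, {"us": 75, "de": 25})
# Made-up numbers: only "bridge-ips" present, step 2 applies.
only_ips = country_distribution(100, {"de": 16, "us": 16}, {})
print(utah, both, only_ips)
```

Leaving out the first `if` block gives the variant that consists of steps 2 and 3 only.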

Hope this report was not too confusing. Feedback much appreciated.

Attachments (4)

dirreq-resp-by-bridge-ips-2016-01-27.png (53.9 KB) - added by karsten 16 months ago.
dirreq-resp-by-dirreq-reqs-2016-01-27.png (39.0 KB) - added by karsten 16 months ago.
meek-clients-2016-02-01.png (54.4 KB) - added by karsten 16 months ago.
meek-bridges-2016-02-02.png (41.2 KB) - added by karsten 16 months ago.

Change History (10)

comment:1 Changed 16 months ago by dcf

I'm going to attach a scatter plot in a minute, dirreq-resp-by-bridge-ips-2016-01-27.png, that puts the numbers of "dirreq-v3-resp ok=..." and "bridge-ips" in relation for statistics reported between December 1, 2015 and last week. The two meek bridges 88F7.. and AA03.. stand out quite a bit there as clusters close to the y axis.


I'm attaching a second scatter plot, dirreq-resp-by-dirreq-reqs-2016-01-27.png, that compares the numbers of "dirreq-v3-resp ok=..." to "dirreq-v3-reqs". The correlation is close to linear, which makes sense, because the number of directory requests should roughly match the number of directory responses.


comment:2 Changed 16 months ago by dcf

To summarize my plan, here's what I'd like to do:

  1. If a bridge reports both a "dirreq-v3-resp" and a "bridge-ips" line, check if the first number is smaller than 10 times the second number; if not, ignore these directory-request statistics reported by this bridge.
  2. If a bridge only reports a "bridge-ips" line and no "dirreq-v3-reqs" line, assume that the country distributions are the same, which is what we're doing right now.
  3. If a bridge reports a "dirreq-v3-reqs" line, use that for user number estimates and ignore the "bridge-ips" line in case it's present.

Thanks for looking into this. That explains my confusion. I assumed that it was already using "dirreq-v3-reqs", not "bridge-ips". I.e. I thought (3) was already in effect.

I don't think we should do (1). The old "dirreq-v3-resp" and "dirreq-v3-reqs" numbers are correct; it's just that they are being wrongly apportioned to countries. They still count the total accurately, I believe. We knew back then what the consequences would be: that meek users would be wrongly counted as being mostly from the U.S. We expected that, after merging #13171, the count of U.S. users would go down and other countries would go up. Ignoring those counts would mean ignoring around 20% of bridge users through 2015.

Agreed on (2) and (3).

comment:3 follow-up: Changed 16 months ago by karsten

Right, I agree that we shouldn't do (1) but only (2) and (3). The following graph shows how this change would affect bridge users from China and the U.S. and bridge users using meek. Does this graph look plausible to you?


comment:4 in reply to: ↑ 3 ; follow-up: Changed 16 months ago by dcf

Replying to karsten:

Right, I agree that we shouldn't do (1) but only (2) and (3). The following graph shows how this change would affect bridge users from China and the U.S. and bridge users using meek. Does this graph look plausible to you?

Yes, it looks plausible to me.

I noticed how it smoothed out some of the wild fluctuations in cn in Dec 2015. I wonder if this is because when counting "bridge-ips", the cn count would semi-randomly switch between 0/n and 1/n of meek users.

comment:5 in reply to: ↑ 4 Changed 16 months ago by karsten

Replying to dcf:

Replying to karsten:

Right, I agree that we shouldn't do (1) but only (2) and (3). The following graph shows how this change would affect bridge users from China and the U.S. and bridge users using meek. Does this graph look plausible to you?

Yes, it looks plausible to me.

Great, thanks for looking!

I'm trying to make another change or two to user number estimates (#8786, #18203) before re-running the estimation algorithm on years of data. That means it could take a few more weeks until the numbers on Metrics are updated.

I noticed how it smoothed out some of the wild fluctuations in cn in Dec 2015. I wonder if this is because when counting "bridge-ips", the cn count would semi-randomly switch between 0/n and 1/n of meek users.

I think so, yes. See the following graph with fractions of "bridge-ips" being resolved to cn:


I think that explains the fluctuations.

comment:6 Changed 15 months ago by dcf

  • Keywords meek added