Opened 8 years ago

Closed 7 years ago

#5807 closed enhancement (implemented)

Propose better bridge usage statistics

Reported by: karsten
Owned by: karsten
Priority: Medium
Milestone:
Component: Metrics/Analysis
Version:
Severity:
Keywords: SponsorF20120701 tor-bridge
Cc: arma
Actual Points:
Parent ID: #3261
Points: 14
Reviewer:
Sponsor:

Description

In #3261 I found that the way bridges report their statistics to the bridge authority is not the reason why our bridge usage numbers are so unreliable. Rather, the way we derive user numbers from unique IP addresses is totally broken. We should try an approach that's similar to how we count directory requests on directory mirrors.

There are a few substeps necessary to come up with better bridge usage statistics:

  • Write a proposal for new bridge statistics based on counting directory requests per day and country. (4 points)
  • Implement the proposal, get it into a mergeable state, and test it on three of our own bridges that don't publish the new statistics to the bridge authority yet. (4 points)
  • Evaluate whether the new statistics can improve our user number estimates, and if so, enable reporting to the bridge authority and prepare deployment on all new bridges. (6 points)

Attachments (3)

bridge-dirreq-stats.png (102.0 KB) - added by karsten 8 years ago.
Estimated bridge users based on directory request statistics
bridge-dirreq-stats-2012-10-02.png (104.8 KB) - added by karsten 7 years ago.
counting-bridge-users.pdf (189.4 KB) - added by karsten 7 years ago.
Tech report: Counting daily bridge users (DRAFT)


Change History (15)

comment:1 Changed 8 years ago by karsten

Cc: nickm added
Component: Analysis → Tor Bridge
Priority: major → normal
Status: new → needs_review
Type: task → enhancement

I wrote some code to record network status requests on bridges similar to how we record them on directory mirrors. The patch is really tiny, because we can re-use the existing dirreq stats and bridge stats code.

Here's the suggested new keyword line for dir-spec.txt that would come after "bridge-ips":

"bridge-v3-reqs" CC=N,CC=N,... NL
    [At most once.]

    List of mappings from two-letter country codes to the number of
    requests for v3 network statuses from that country as seen by the
    bridge, rounded up to the nearest multiple of 8. Only those requests
    are counted that the directory can answer with a 200 OK status code.
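To illustrate the proposed line format, here is a minimal Python sketch that writes and reads such a line. It is not the Tor patch: the function names are hypothetical, and both the ordering of entries (by count, then country code) and the omission of countries with zero requests are assumptions that the spec text above does not state.

```python
def round_up_to_multiple(n, multiple=8):
    """Round a non-negative count up to the nearest multiple (8 per the spec text)."""
    if n <= 0:
        return 0
    return ((n + multiple - 1) // multiple) * multiple


def format_bridge_v3_reqs(counts):
    """Build a "bridge-v3-reqs" line from a {country_code: request_count} dict.

    Assumptions (not from the spec text): zero counts are omitted and entries
    are sorted by descending count, then by country code.
    """
    rounded = [(cc, round_up_to_multiple(n)) for cc, n in counts.items() if n > 0]
    rounded.sort(key=lambda item: (-item[1], item[0]))
    return "bridge-v3-reqs " + ",".join(f"{cc}={n}" for cc, n in rounded)


def parse_bridge_v3_reqs(line):
    """Parse a "bridge-v3-reqs" line back into a {country_code: count} dict."""
    keyword, _, rest = line.strip().partition(" ")
    if keyword != "bridge-v3-reqs":
        raise ValueError("not a bridge-v3-reqs line")
    result = {}
    for part in rest.split(","):
        if not part:
            continue  # an empty list of mappings is allowed here
        cc, _, count = part.partition("=")
        result[cc] = int(count)
    return result


if __name__ == "__main__":
    # Made-up request counts, for illustration only.
    line = format_bridge_v3_reqs({"sy": 13, "de": 8, "us": 1})
    print(line)
    print(parse_bridge_v3_reqs(line))
```

Rounding up means that even a single request from a country is reported as 8, so small counts are obscured rather than published exactly.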

Does this change need a proposal? (That's what the needs_review flag is for, not for reviewing the actual patch.) The patch should go into 0.2.4.x. I'm going to clean up the patch and test it for a while on my own bridges before asking to merge it.

comment:2 Changed 8 years ago by nickm

Needs a proposal; it's a spec change AND it involves publishing more statistics. We should NOT be rubber-stamping that kind of thing. The proposal can be pretty short, though.

comment:3 Changed 8 years ago by karsten

Status: needs_review → accepted

Okay, will write a proposal and post it to tor-dev. Thanks!

comment:4 Changed 8 years ago by karsten

Yesterday I realized that we might consider #5824 as a feature. If bridges report relay-only statistics, that includes directory request statistics, too. That's exactly what we're looking for here except that we already have the data and don't have to start collecting it.

I tweaked the bridge descriptor sanitizer to leave "dirreq-*" lines in and let it process the April 2012 tarball once again. The result is that 44% of bridge descriptors already contain directory request statistics. For comparison, our current bridge usage statistics are based on data contained in 57% of bridge descriptors.

Here's the first catch: the dirreq stats in bridge descriptors are not broken down by country. To be precise, the "dirreq-v3-reqs" lines are all empty, because that's how we tried to avoid recording directory request statistics on bridges. But there are lines like "ok=40,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=0,busy=0" that tell us that we sent responses to 40 directory requests. We can try to fix the missing by-country statistics by calculating totals from directory requests and breaking them down by country based on unique IP numbers. I'm optimistic that the result wouldn't be too far off. It should be much more reliable than statistics based on unique IP numbers only.
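A rough Python sketch of that workaround: take the "ok" count from the response line and split it across countries in proportion to the unique IP counts. The function names and the example IP counts are hypothetical, and a strictly proportional split is an assumption about how the breakdown would be done.

```python
def parse_dirreq_resp(value):
    """Parse a list like "ok=40,not-found=0,..." into a {status: count} dict."""
    result = {}
    for part in value.split(","):
        key, _, count = part.partition("=")
        result[key] = int(count)
    return result


def split_by_country(total_requests, unique_ips_by_country):
    """Distribute a request total across countries by their share of unique IPs.

    Assumption: requests per country are proportional to unique IP addresses
    per country. Returns an empty dict if no unique IPs were reported.
    """
    ip_total = sum(unique_ips_by_country.values())
    if ip_total == 0:
        return {}
    return {
        cc: total_requests * ips / ip_total
        for cc, ips in unique_ips_by_country.items()
    }


if __name__ == "__main__":
    resp = parse_dirreq_resp(
        "ok=40,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=0,busy=0"
    )
    # Made-up unique IP counts per country, for illustration only.
    unique_ips = {"sy": 24, "ir": 8, "us": 8}
    print(split_by_country(resp["ok"], unique_ips))
```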

The second catch is that we need to discuss whether we can use the data that we scrubbed from bridge descriptors so far. I'm going to work on a quick analysis of the April 2012 data today to evaluate whether they're useful or not. If they are, we should do the tor-dev dance like we do for #5684 and discuss whether to stop sanitizing dirreq stats in bridge descriptors.

If we actually can use dirreq stats from bridges, that's going to save us at least 1 year of waiting until a large enough number of 0.2.4.x bridges report the statistics proposed earlier in this ticket.

Changed 8 years ago by karsten

Attachment: bridge-dirreq-stats.png added

Estimated bridge users based on directory request statistics

comment:5 Changed 8 years ago by karsten

Here are the results from my quick analysis. The good news is that the new statistics look great for estimating daily bridge users. The bad news is that we may only have 1/10 as many bridge users as we thought.

[Graph: Estimated bridge users based on directory request statistics (attachment bridge-dirreq-stats.png)]

There are five lines in the graph:

1) Reported directory request responses is the total number of directory requests (v3 network statuses) that bridges report to have responded to.

2) Fraction of bridges reporting directory request statistics is the number of bridges including "dirreq-stats-end" lines in their descriptors divided by the average number of running bridges on the same day.

3) Estimated directory request responses is the number from 1) divided by the number in 2). This is the number we'd expect if all bridges reported directory request statistics.

4) Estimated daily bridge users from all countries is the number from 3) divided by 10. The 10 comes from the assumption that the average bridge client downloads 10 v3 network statuses per day, which is the same assumption as for directly connecting clients.

5) Estimated daily bridge users from Syria is the number from 4) multiplied by the fraction of users from .sy that we learn from unique IP addresses seen at bridges. I chose .sy, because that's apparently the country with most bridge users these days. The approach would work for all other countries, too.
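The five lines above amount to a short calculation, sketched here in Python. The function and parameter names are hypothetical and the input numbers are made up; only the arithmetic follows the steps listed above.

```python
def estimate_daily_bridge_users(
    reported_responses,      # line 1: responses reported by bridges
    reporting_bridges,       # bridges with "dirreq-stats-end" lines
    running_bridges,         # average number of running bridges that day
    country_ip_fraction,     # share of unique IPs from one country (0..1)
    requests_per_user=10,    # assumed v3 network status downloads per client per day
):
    """Return the intermediate and final values for lines 2 to 5."""
    if reporting_bridges <= 0 or running_bridges <= 0:
        raise ValueError("bridge counts must be positive")

    # Line 2: fraction of bridges reporting directory request statistics.
    fraction_reporting = reporting_bridges / running_bridges

    # Line 3: what we would expect if all bridges reported statistics.
    estimated_responses = reported_responses / fraction_reporting

    # Line 4: estimated daily users from all countries.
    users_all = estimated_responses / requests_per_user

    # Line 5: estimated daily users from a single country.
    users_country = users_all * country_ip_fraction

    return {
        "fraction_reporting": fraction_reporting,
        "estimated_responses": estimated_responses,
        "users_all": users_all,
        "users_country": users_country,
    }


if __name__ == "__main__":
    # Made-up inputs, for illustration only.
    print(estimate_daily_bridge_users(50000, 500, 1000, 0.25))
```

Note that the per-country figure still depends on unique IP addresses, but only for the relative share, not for the absolute number.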

So, I think it makes sense to look more into dirreq statistics reported by bridges before adding the "bridge-v3-reqs" line as suggested above.

The next step will be to move the discussion of keeping dirreq statistics in sanitized bridge descriptors to the tor-dev mailing list. I guess I should follow up on my proposal 201 mail for that.

comment:6 Changed 7 years ago by karsten

Keywords: SponsorF20120701 added
Milestone: Sponsor F: July 1, 2012

Switching from using milestones to keywords for sponsor deliverables. See #6365 for details.

comment:7 Changed 7 years ago by nickm

Milestone: Tor: unspecified

comment:8 Changed 7 years ago by nickm

Keywords: tor-bridge added

comment:9 Changed 7 years ago by nickm

Component: Tor Bridge → Tor

Changed 7 years ago by karsten

Attachment: bridge-dirreq-stats-2012-10-02.png added

comment:10 Changed 7 years ago by karsten

Cc: arma added; nickm removed
Component: Tor → Analysis
Milestone: Tor: unspecified

Here are the results from a more detailed analysis. This analysis covers all data we have, which means September 2011 until now. Bridges did not report directory requests to the bridge authority before that.


See the comment above for explanations of lines 1 to 5. The new line 2a shows the number of consensuses published on a given day. I discovered that the spikes in directory requests (line 1) coincide with missing consensuses on the same day. This makes sense, because clients would request consensuses more often if they cannot get a recent enough consensus. We'll have to discard those days from the estimated bridge user numbers. This also applies to direct user estimates, by the way.
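Discarding those days could look roughly like the following Python sketch. The names are hypothetical, and the threshold of 24 consensuses per day (one per hour) is an assumption; the comment above does not say how many missing consensuses should disqualify a day.

```python
# Assumption: one consensus per hour, so a complete day has 24 of them.
EXPECTED_CONSENSUSES_PER_DAY = 24


def filter_estimates(estimates, consensuses_per_day,
                     expected=EXPECTED_CONSENSUSES_PER_DAY):
    """Keep only estimates for days with a full set of consensuses.

    estimates:           {date_string: estimated_users}
    consensuses_per_day: {date_string: number_of_consensuses_published}

    Days without a consensus count are dropped too, since we cannot tell
    whether their request numbers are inflated.
    """
    return {
        day: users
        for day, users in estimates.items()
        if consensuses_per_day.get(day, 0) >= expected
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    estimates = {"2012-09-01": 1000.0, "2012-09-02": 4200.0, "2012-09-03": 1100.0}
    consensuses = {"2012-09-01": 24, "2012-09-02": 17, "2012-09-03": 24}
    print(filter_estimates(estimates, consensuses))
```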

So, line 4 is the new estimate for daily bridge users from all countries and line 5 is the estimate for a single country, here Syria. As I said, the new numbers are much smaller than the old numbers, but I have much more confidence in this estimation method. The old approach is wrong with respect to absolute numbers and was only deployed back in 2009 to observe relative changes over time.

The next step will be to a) write down the new algorithm in a tech report and b) implement it on the metrics website (probably along with the old graphs for the moment, not replacing them yet).

Changed 7 years ago by karsten

Attachment: counting-bridge-users.pdf added

Tech report: Counting daily bridge users (DRAFT)

comment:11 Changed 7 years ago by karsten

I just attached a draft version of the tech report on "Counting daily bridge users" that I'm planning to publish as output of this ticket. Comments are highly appreciated, especially if I can still incorporate feedback in the final version that I'm planning to publish on October 31.

This completes step a) from my previous comment, but step b) requires much more new code than expected and is out of scope for the November deadline.

comment:12 Changed 7 years ago by karsten

Resolution: implemented
Status: accepted → closed

Published the final version of the "Counting daily bridge users" report. That concludes this ticket. Closing.
