In #3261 (moved) I found that the way bridges report their statistics to the bridge authority is not the reason why our bridge usage numbers are so unreliable. What is totally broken is the way we derive user numbers from unique IP addresses. We should try an approach similar to how we count directory requests on directory mirrors.
There are a few substeps necessary to come up with better bridge usage statistics:
Write a proposal for new bridge statistics based on counting directory requests per day and country. (4 points)
Implement the proposal, get it into a mergeable state, and test it on three of our own bridges that don't publish the new statistics to the bridge authority yet. (4 points)
Evaluate whether the new statistics can improve our user number estimates, and if so, enable reporting to the bridge authority and prepare deployment on all new bridges. (6 points)
I wrote some code to record network status requests on bridges, similar to how we record them on directory mirrors. The patch is really tiny, because we can re-use the existing dirreq stats and bridge stats code.
Here's the suggested new keyword line for dir-spec.txt that would come after "bridge-ips":
"bridge-v3-reqs" CC=N,CC=N,... NL [At most once.] List of mappings from two-letter country codes to the number of requests for v3 network statuses from that country as seen by the bridge, rounded up to the nearest multiple of 8. Only those requests are counted that the directory can answer with a 200 OK status code.
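A minimal sketch of how such a line could be built from raw per-country counts, to illustrate the "rounded up to the nearest multiple of 8" rule. This is not the actual patch: the function names are hypothetical, and sorting entries by country code is an assumption made only to keep the output deterministic.

```python
def round_up_to_multiple(n, multiple=8):
    """Round a non-negative count up to the nearest multiple of `multiple`."""
    if n <= 0:
        return 0
    return ((n + multiple - 1) // multiple) * multiple


def format_bridge_v3_reqs(counts):
    """Build a "bridge-v3-reqs" line from a {country_code: request_count} dict.

    Hypothetical helper: entries are sorted by country code (an assumption),
    and countries without any counted request are left out.
    """
    parts = [
        f"{cc}={round_up_to_multiple(n)}"
        for cc, n in sorted(counts.items())
        if n > 0
    ]
    return "bridge-v3-reqs " + ",".join(parts)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(format_bridge_v3_reqs({"sy": 13, "de": 8, "us": 1}))
```

Rounding up means that even a single request from a country shows up as 8, so the exact count is never published.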
Does this change need a proposal? (That's what the needs_review flag is for, not for reviewing the actual patch.) The patch should go into 0.2.4.x. I'm going to clean up the patch and test it for a while on my own bridges before asking to merge it.
Trac: Priority: major to normal Status: new to needs_review Type: task to enhancement Cc: N/A to nickm Component: Analysis to Tor Bridge
Needs a proposal; it's a spec change AND it involves publishing more statistics. We should NOT be rubber-stamping that kind of thing. The proposal can be pretty short, though.
Yesterday I realized that we might consider #5824 (moved) as a feature. If bridges report relay-only statistics, that includes directory request statistics, too. That's exactly what we're looking for here except that we already have the data and don't have to start collecting it.
I tweaked the bridge descriptor sanitizer to leave "dirreq-*" lines in and let it process the April 2012 tarball once again. The result is that 44% of bridge descriptors already contain directory request statistics. For comparison, our current bridge usage statistics are based on data contained in 57% of bridge descriptors.
Here's the first catch: the dirreq stats in bridge descriptors are not broken down by country. To be precise, the "dirreq-v3-reqs" lines are all empty, because that's how we tried to not record directory request statistics on bridges. But there are lines like "ok=40,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=0,busy=0" that tell us that we sent responses to 40 directory requests. We can try to fix the missing by-country statistics by calculating totals from directory requests and breaking them down by country based on unique IP numbers. I'm optimistic that the result wouldn't be too far off. It should be much more reliable than statistics based on unique IP numbers only.
The second catch is that we need to discuss whether we can use the data that we have scrubbed from bridge descriptors so far. I'm going to work on a quick analysis of the April 2012 data today to evaluate whether they're useful or not. If they are, we should do the tor-dev dance like we do for #5684 (moved) and discuss whether to stop sanitizing dirreq stats in bridge descriptors.
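A rough sketch of that idea, assuming the response counts come as the comma-separated key=value list quoted above and the unique IP numbers come as a per-country mapping. The function names and the per-country IP counts are made up for illustration.

```python
def parse_response_counts(value):
    """Parse a list like "ok=40,not-found=0" into a {status: count} dict."""
    return {key: int(count)
            for key, count in (item.split("=") for item in value.split(","))}


def split_by_country(total_ok, unique_ips):
    """Distribute `total_ok` responses over countries, proportionally to
    each country's share of unique IP addresses (hypothetical helper)."""
    total_ips = sum(unique_ips.values())
    if total_ips == 0:
        return {}
    return {cc: total_ok * ips / total_ips for cc, ips in unique_ips.items()}


if __name__ == "__main__":
    line = ("ok=40,not-enough-sigs=0,unavailable=0,"
            "not-found=0,not-modified=0,busy=0")
    responses = parse_response_counts(line)
    # Unique IP counts per country are made-up example values.
    print(split_by_country(responses["ok"], {"sy": 24, "de": 8}))
```

The breakdown inherits whatever bias the unique-IP counts have between countries; only the total comes from directory requests.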
If we actually can use dirreq stats from bridges, that's going to save us at least 1 year of waiting until a large enough number of 0.2.4.x bridges report the statistics proposed earlier in this ticket.
Here are the results from my quick analysis. Good news is that the new statistics look great for estimating daily bridge users. Bad news is that we may only have 1/10 as many bridge users as we thought.
There are five lines in the graph:
1) Reported directory request responses is the total number of directory requests (v3 network statuses) that bridges report to have responded to.
2) Fraction of bridges reporting directory request statistics is the number of bridges including "dirreq-stats-end" lines in their descriptors divided by the average number of running bridges on the same day.
3) Estimated directory request responses is the number from 1) divided by the number from 2). This is the number we'd expect if all bridges reported directory request statistics.
4) Estimated daily bridge users from all countries is the number from 3) divided by 10. The 10 comes from the assumption that the average bridge client downloads 10 v3 network statuses per day, which is the same assumption as for directly connecting clients.
5) Estimated daily bridge users from Syria is the number from 4) multiplied by the fraction of users from .sy that we learn from unique IP addresses seen at bridges. I chose .sy, because that's apparently the country with the most bridge users these days. The approach would work for all other countries, too.
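The five lines above can be written down as a short calculation. This is only a sketch of the arithmetic, not the code used to produce the graph; the function name, the parameter names, and the input numbers are made up for illustration.

```python
def estimate_daily_bridge_users(reported_responses, reporting_bridges,
                                running_bridges, country_ip_fraction,
                                requests_per_user=10):
    """Return lines 2) to 5) for a single day.

    reported_responses:  line 1), responses reported by bridges
    reporting_bridges:   bridges with "dirreq-stats-end" in their descriptors
    running_bridges:     average number of running bridges on that day
    country_ip_fraction: one country's share of unique IP addresses
    requests_per_user:   assumed v3 network status downloads per client per day
    """
    fraction_reporting = reporting_bridges / running_bridges      # line 2)
    estimated_responses = reported_responses / fraction_reporting  # line 3)
    users_all = estimated_responses / requests_per_user            # line 4)
    users_country = users_all * country_ip_fraction                # line 5)
    return fraction_reporting, estimated_responses, users_all, users_country


if __name__ == "__main__":
    # All input numbers below are invented example values.
    print(estimate_daily_bridge_users(50000, 500, 1000, 0.25))
```

Note that line 3) assumes bridges that report statistics see about as many requests as bridges that don't.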
So, I think it makes sense to look more into dirreq statistics reported by bridges before adding the "bridge-v3-reqs" line as suggested above.
The next step will be to move the discussion of keeping dirreq statistics in sanitized bridge descriptors to the tor-dev mailing list. I guess I should follow up my proposal 201 mail for that.
Here are the results from a more detailed analysis. This analysis covers all data we have, which means September 2011 until now. Bridges did not report directory requests to the bridge authority before that.
See the comment above for explanations of lines 1 to 5. The new line 2a shows the number of consensuses published on a given day. I discovered that the spikes in directory requests (line 1) coincide with missing consensuses on the same day. This makes sense, because clients would request consensuses more often if they cannot get a recent enough consensus. We'll have to discard those days from the estimated bridge user numbers. This also applies to direct user estimates, by the way.
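Discarding those days could look roughly like the following sketch. It assumes one consensus per hour, so 24 per day counts as complete; the threshold, the function name, and the example numbers are assumptions, not something decided in this ticket.

```python
EXPECTED_CONSENSUSES_PER_DAY = 24  # assumption: one consensus per hour


def discard_incomplete_days(estimates, consensus_counts,
                            expected=EXPECTED_CONSENSUSES_PER_DAY):
    """Keep only estimates for days with the expected number of consensuses.

    estimates:        {date_string: estimated_users}
    consensus_counts: {date_string: consensuses published that day} (line 2a)
    Days missing from consensus_counts are treated as having no consensuses.
    """
    return {day: users for day, users in estimates.items()
            if consensus_counts.get(day, 0) >= expected}


if __name__ == "__main__":
    # Invented example values: the second day has missing consensuses
    # and a suspicious spike in estimated users.
    estimates = {"2012-04-01": 4000.0, "2012-04-02": 9500.0}
    consensus_counts = {"2012-04-01": 24, "2012-04-02": 19}
    print(discard_incomplete_days(estimates, consensus_counts))
```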
So, line 4 is the new estimate for daily bridge users from all countries and line 5 is the estimate for a single country, here Syria. As I said, the new numbers are much smaller than the old numbers, but I have much more confidence in this estimation method. The old approach is wrong with respect to absolute numbers and was only deployed back in 2009 to observe relative changes over time.
The next step will be to a) write down the new algorithm in a tech report and b) implement it on the metrics website (probably along with the old graphs for the moment, not replacing them yet).
Trac: Component: Tor to Analysis Cc: nickm to arma Milestone: Tor: unspecified to N/A
I just attached a draft version of the tech report on "Counting daily bridge users" that I'm planning to publish as output of this ticket. Comments are highly appreciated, especially if I can still incorporate feedback in the final version that I'm planning to publish on October 31.
This completes step a) from my previous comment, but step b) requires much more new code than expected and is out of scope for the November deadline.