Write BridgeDB metrics parser and analyse existing data

Trac:
Parent Ticket: #31274 (moved)

added component::metrics/statistics owner::phw parent::31274 points::2 priority::medium resolution::fixed s30-o21a1 severity::normal sponsor::30-must status::closed type::task labels

Trac:

Requests per bridge type.

Trac:

Requests per bridge type (without Tor over HTTPS).

Trac:

Requests per distribution mechanism.

Trac:

Requests per distribution mechanism (without Tor over HTTPS).

Trac:

(Un)successful requests over email.

Trac:

(Un)successful requests over moat.

Trac:

(Un)successful requests without Tor over HTTPS.

Trac:

(Un)successful requests with Tor over HTTPS.

Below are a bunch of diagrams that I created from our BridgeDB usage metrics. The scripts that I used to create these diagrams are available here: https://dip.torproject.org/phw/bridgedb-metrics-visualisation

Note that none of the diagrams below contains vanilla bridges because of a bug (#32203 (moved)). (In fact, I discovered this bug by glancing at these visualisations and wondering why there are no vanilla bridges.)

== Requests per bridge type

The first diagram suggests that ScrambleSuit and obfs3 are relatively popular but once we remove HTTPS requests coming from Tor (which we believe to be almost entirely bots), basically only obfs4 remains. Note that this visualisation is an aggregate over all distribution mechanisms. An average of 7,500 obfs4 requests per day is a surprisingly large number to me. There may very well be more bots among these requests.

== Requests per distribution mechanism

Again, we can see a significant difference between all requests and requests without bots. With bots removed, moat is our most active distribution mechanism. Note that in moat it's difficult to tell apart a user from a bot request, so this visualisation is assuming that all requests are from users, which is probably false.

== (Un)successful requests per distribution mechanism

[[Image((un)successful-email.png, 600px)]]

Most users (assuming that these requests are in fact from users) succeed at getting bridges over email. The drop in mid-October was because of a bug in BridgeDB (#32105 (moved)).

[[Image((un)successful-moat.png, 600px)]]

Moat is looking worse. A sizable fraction of requests are unsuccessful, presumably because of the difficulty of our CAPTCHA.

[[Image((un)successful-nontor-https.png, 600px)]]

The situation is similar for HTTPS requests excluding requests over Tor (which we believe are bots).

[[Image((un)successful-tor-https.png, 600px)]]

HTTPS requests over Tor only are surprisingly successful. These bots are likely using a classifier that solves our CAPTCHAs.
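A minimal sketch of how the success fractions behind these four diagrams could be computed, assuming the per-distributor success and failure counts have already been extracted from the metrics. The function name and the numbers are made up for illustration and are not taken from the visualisation scripts.

```python
from collections import defaultdict


def success_rates(counts):
    """Return the fraction of successful requests per distributor.

    `counts` maps (distributor, outcome) to a request count, where
    outcome is either "success" or "fail".
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for (distributor, outcome), n in counts.items():
        totals[distributor] += n
        if outcome == "success":
            successes[distributor] += n
    # Skip distributors without any requests to avoid dividing by zero.
    return {d: successes[d] / totals[d] for d in totals if totals[d] > 0}


# Hypothetical numbers, not real BridgeDB data.
example = {
    ("email", "success"): 90,
    ("email", "fail"): 10,
    ("moat", "success"): 60,
    ("moat", "fail"): 40,
}
rates = success_rates(example)
```

Note that a high success fraction says nothing about who is asking: as with the Tor-over-HTTPS case above, it may simply reflect bots that solve CAPTCHAs well.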

Trac:
Cc: phw to phw, karsten, metrics-team

phw, you asked about adding any of these graphs to the metrics website. It does involve some coding and some documenting, but I think that having these graphs would be very useful, so let's talk.

I guess my first question would be how many graphs you'd like to see on the metrics website. Ideally, we would be able to compress these graphs above into one or two. Unlike these one-off graphs the ones on the metrics website would be somewhat configurable with the parameters we're using for other graphs (or new ones, if really needed). So, can you think about your one or two most important graphs for the metrics website?

Replying to karsten:

> phw, you asked about adding any of these graphs to the metrics website. It does involve some coding and some documenting, but I think that having these graphs would be very useful, so let's talk.

How can we help with the coding or documenting?

> I guess my first question would be how many graphs you'd like to see on the metrics website. Ideally, we would be able to compress these graphs above into one or two. Unlike these one-off graphs the ones on the metrics website would be somewhat configurable with the parameters we're using for other graphs (or new ones, if really needed). So, can you think about your one or two most important graphs for the metrics website?

I would choose "requests per bridge type" and "requests per distribution mechanism" (and possibly exclude "Tor requests over HTTPS" or, even better, make it configurable):

I started working on adding these two graphs to Tor Metrics today, and I'm planning to have something to review by end of the day.

Two questions:

1. Do you mind if we change the ticket component to Metrics/Statistics? In that case the ticket would be resolved as soon as two new graphs appear on Tor Metrics. But if you want to keep this ticket for writing your own parser and analyzing results, maybe we should open a new ticket. Up to you!
2. You graphed the binned numbers, that is, numbers rounded up to multiples of 10. In other graphs we're subtracting bin_size/2 as an approximation to the original number. That's a choice, though, and I'd understand if you'd rather want to graph exactly what's in the statistics. I guess my question is: did you intentionally not subtract anything?

Will be back here (or on another ticket) in a few hours with code and sample graphs.

Okay, I finished a first patch that processes BridgeDB metrics once per day to produce a .csv file and that adds two graphs to Tor Metrics. Can you please take a look at that patch, not regarding the Java/R code, but regarding user-facing documentation of the two new graphs? In particular, please take a look at the TODOs in that patch. (irl, I'll ask you to review a revised branch for the code portions once the documentation parts are all set.)

By the way, while reading your code, I found that you're only looking at BridgeDB metrics files in CollecTor's recent/ directory. There's currently a (minor) bug in CollecTor where we never remove files from that directory. I'm going to fix that at some point, and then your script will only provide the latest three files. A possible fix would be to also process files in CollecTor's archive/ directory. Not sure how much of an issue that is when these graphs exist on Tor Metrics, but I thought I should let you know.

Trac:
Status: assigned to needs_review

Replying to karsten:

> Do you mind if we change the ticket component to Metrics/Statistics? In that case the ticket would be resolved as soon as two new graphs appear on Tor Metrics. But if you want to keep this ticket for writing your own parser and analyzing results, maybe we should open a new ticket. Up to you!

We already have a parser and took a look at some data, so we're done from the anti-censorship team's side. I changed the component to Metrics/Statistics.

> You graphed the binned numbers, that is, numbers rounded up to multiples of 10. In other graphs we're subtracting bin_size/2 as an approximation to the original number. That's a choice, though, and I'd understand if you'd rather want to graph exactly what's in the statistics. I guess my question is: did you intentionally not subtract anything?

Oh, good point. Let's subtract bin_size/2 here as well, to make the diagrams consistent with Metrics's other diagrams, and more representative of the original number.
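For reference, a minimal sketch of that adjustment, assuming a bin size of 10 and counts that are rounded up to the next multiple of the bin size. `bin_up` and `unbin` are hypothetical names, and leaving a reported 0 untouched is an assumption of this sketch rather than something settled in this ticket.

```python
def bin_up(count, bin_size=10):
    """Round a non-negative count up to the next multiple of bin_size."""
    return -(-count // bin_size) * bin_size


def unbin(reported, bin_size=10):
    """Approximate the original count behind a binned, rounded-up value.

    Any original count in (reported - bin_size, reported] ends up in the
    same bin, so the middle of that range serves as the estimate.
    """
    if reported <= 0:
        # Assumption: nothing was counted, so there is nothing to subtract.
        return 0.0
    return reported - bin_size / 2


# Original counts 1 through 10 are all reported as 10 and estimated as 5.
reported = [bin_up(n) for n in (1, 4, 10, 11)]
estimates = [unbin(r) for r in reported]
```

The estimate is off by at most half a bin for any single value, which matters for rarely requested transports but hardly at all for daily totals in the thousands.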

Trac:
Component: Circumvention/BridgeDB to Metrics/Statistics

Thanks for your work on this!

Replying to karsten:

> Okay, I finished a first patch that processes BridgeDB metrics once per day to produce a .csv file and that adds two graphs to Tor Metrics. Can you please take a look at that patch, not regarding the Java/R code, but regarding user-facing documentation of the two new graphs? In particular, please take a look at the TODOs in that patch. (irl, I'll ask you to review a revised branch for the code portions once the documentation parts are all set.)

Commit 93f2500c:

For bridgedb-transport, I would change the title to:

"BridgeDB requests for each bridge type"

...and the description to:

"<p>This graph shows the number of BridgeDB requests for each bridge type. BridgeDB requests over Tor and unsuccessful requests (e.g., invalid emails or incorrect CAPTCHAs) are not included in these numbers.</p>"

For bridgedb-distribution, I would change the title to:

"BridgeDB requests for each distribution method"

...and the description to:

"<p>This graph shows the number of BridgeDB requests for each distribution method. HTTPS requests over Tor and unsuccessful requests (e.g., invalid emails or incorrect CAPTCHAs) are not included in these numbers.</p>"

Here are my changes to commit 93f2500c:

<h3 id="bridgedb-stats" class="hover">BridgeDB requests
<a href="#bridgedb-stats" class="anchor">#</a>
</h3>

<p>BridgeDB metrics contain aggregated information about requests to the
BridgeDB service.  BridgeDB keeps track of each request per distribution method
(HTTPS, moat, email), per bridge type (e.g., vanilla or obfs4) per country code
or email provider (e.g., "ru" or "gmail") per request success ("success" or
"fail"). Every 24 hours, BridgeDB writes these metrics to disk and then begins
a new measurement interval.</p>

<p>The following description applies to the following graph and tables:</p>

<ul>
<li>BridgeDB requests by bridge type<a href="/bridgedb-transport.html" class="btn btn-primary btn-xs"><i class="fa fa-chevron-right" aria-hidden="true"></i> graph</a></li>
<li>BridgeDB requests by distribution <a href="/bridgedb-distribution.html" class="btn btn-primary btn-xs"><i class="fa fa-chevron-right" aria-hidden="true"></i> graph</a></li>
</ul>

<h4>Step 1: Parse BridgeDB metrics to obtain reported request numbers</h4>

<p>Obtain BridgeDB metrics from <a href="/collector.html#type-bridgedb-metrics">CollecTor</a>.
Refer to the <a href="https://gitweb.torproject.org/bridgedb.git/tree/doc/bridgedb-metrics-spec.txt">BridgeDB metrics specification</a> for details on the descriptor format.</p>

<h4>Step 2: Skip requests coming in over Tor exits</h4>

<p>Skip any request counts with <code>zz</code> as their
<code>CC/EMAIL</code> metrics key part.  We use the <code>zz</code> pseudo
country code for requests originating from Tor exit relays.  We're discarding
these requests because <a href="https://bugs.torproject.org/32117">bots use the
Tor network to crawl BridgeDB</a> and including bot requests would provide a
false sense of how users interact with BridgeDB.  Note that BridgeDB maintains
a separate distribution pool for requests coming from Tor exit relays.</p>

<h4>Step 3: Aggregate requests by date, distribution method, and bridge type</h4>

<p>BridgeDB metrics contain request numbers broken down by distribution method,
bridge type, and a few more dimensions.  For our purposes we only care about
total request numbers by date and either distribution method or bridge type.
We're using request sums by these three dimensions as aggregates.  As date
we're using the date of the BridgeDB metrics interval end.  If we encounter
more than one BridgeDB metrics interval end on the same UTC date (which
shouldn't be possible with an interval length of 24 hours), we arbitrarily keep
whichever we process first.</p>

</div>

<div class="container">
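The three steps documented above can be sketched as follows. This is only an illustration, not the Java code from the patch: the function name and sample data are made up, and the sketch assumes that each count line reads `bridgedb-metric-count KEY COUNT` with a five-part, dot-separated key (distributor, transport, CC/EMAIL, outcome, reserved) and that the `bridgedb-metrics-end` line precedes the counts. Please check the BridgeDB metrics specification before relying on that layout.

```python
from collections import defaultdict

# Made-up sample descriptor; the numbers are not real BridgeDB data.
SAMPLE = """\
bridgedb-metrics-end 2019-09-17 00:33:44 (86400 s)
bridgedb-metrics-version 1
bridgedb-metric-count https.obfs4.zz.success.none 30
bridgedb-metric-count https.obfs4.de.success.none 10
bridgedb-metric-count moat.obfs4.ru.fail.none 20
bridgedb-metric-count email.vanilla.gmail.success.none 10
"""


def aggregate(descriptor):
    """Sum request counts by (date, distributor) and by (date, transport)."""
    by_distributor = defaultdict(int)
    by_transport = defaultdict(int)
    date = None
    for line in descriptor.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "bridgedb-metrics-end":
            # Step 3 uses the date of the interval end.
            date = parts[1]
        elif parts[0] == "bridgedb-metric-count" and len(parts) == 3:
            fields = parts[1].split(".")
            if len(fields) != 5:
                continue  # Not the key layout this sketch assumes.
            distributor, transport, cc_or_email = fields[:3]
            if cc_or_email == "zz":
                continue  # Step 2: skip requests coming in over Tor exits.
            count = int(parts[2])
            by_distributor[(date, distributor)] += count
            by_transport[(date, transport)] += count
    return dict(by_distributor), dict(by_transport)


by_distributor, by_transport = aggregate(SAMPLE)
```

The sketch handles a single descriptor; keeping only the first interval seen per UTC date, as described in step 3, would be an extra check in whatever loop feeds descriptors to this function.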

I wasn't sure what the TODO ("If we're supposed to "unbin" numbers, this is probably where we should say that.") meant, so I deleted the line. Is this about the bin_size/2 modification you mentioned above?

In commit 93f2500c, I would replace "transport" with "bridge type" (because we include vanilla, which is technically the absence of a transport protocol) and "distribution" with "distribution method". I would also change:

<li><b>transport:</b> Name of the pluggable transport protocol, which includes <code>"obfs2"</code>, <code>"obfs3"</code>, <code>"obfs4"</code>, <code>"scramblesuit"</code>, and <code>"fte"</code>, and which will change in the future.</li>

to

<li><b>transport:</b> Name of the bridge type, which includes <code>"vanilla"</code>, <code>"obfs2"</code>, <code>"obfs3"</code>, <code>"obfs4"</code>, <code>"scramblesuit"</code>, and <code>"fte"</code>, and which will change in the future.</li>

We may want to change the column's name to something like "bridge_type" but I think it's also ok to keep it.

> By the way, while reading your code, I found that you're only looking at BridgeDB metrics files in CollecTor's recent/ directory. There's currently a (minor) bug in CollecTor where we never remove files from that directory. I'm going to fix that at some point, and then your script will only provide the latest three files. A possible fix would be to also process files in CollecTor's archive/ directory. Not sure how much of an issue that is when these graphs exist on Tor Metrics, but I thought I should let you know.

Thanks, I appreciate the warning!

Trac:
Status: needs_review to needs_revision

Thanks for your comments! A couple questions before I continue putting your suggestions in:

- The change from "transport" to "bridge type" is a tricky one. We're using "transport" where you'd use "bridge type" for other graphs, like this one. I guess part of the rationale was that even with the default OR protocol, messages are being transported between client and bridge, just not using one of the pluggable transports but using the built-in one. But we've been using this term for almost a decade now, so it's well possible that it's merely the result of me convincing myself that it must be a good idea. That being said, part of me would prefer keeping the "transport" term here. However, if you think this is too confusing, let's consider making this change. What do you prefer?
- "Distribution" vs. "distribution method" is another case like this. We're not using this term on Tor Metrics yet. But we should be sure to pick the term that is most common in the Tor universe. I vaguely recall the term "distributor" from when I wrote the BridgeDB spec, but that term might be too technical for this purpose. Anyway, what's the term you would want us to use here?
- Regarding your suggestions to graph titles, keep in mind that short titles work better on the website (and on graph images) than long ones. At the same time they need to be meaningful in the context of all other graphs/tables on the Metrics website.
- You wrote in several places that "unsuccessful requests (e.g., invalid emails or incorrect CAPTCHAs) are not included in these numbers." But is this true in your code and in mine? What would I have to do to filter out these requests?
- The "unbin" part is about the -bin_size/2, yes. I already made that change in the code locally, and we should document it.

Would you want to provide your changes as a Git-formatted patch (or Git branch) based on mine? That would make it a bit easier for me to include your changes (and harder to miss good suggestions).

I'll work more on this tomorrow. Thanks!

Replying to karsten:

> The change from "transport" to "bridge type" is a tricky one. We're using "transport" where you'd use "bridge type" for other graphs, like this one. I guess part of the rationale was that even with the default OR protocol, messages are being transported between client and bridge, just not using one of the pluggable transports but using the built-in one. But we've been using this term for almost a decade now, so it's well possible that it's merely the result of me convincing myself that it must be a good idea. That being said, part of me would prefer keeping the "transport" term here. However, if you think this is too confusing, let's consider making this change. What do you prefer?

Using "transport" is fine – especially because we should be consistent with the graphs that are already present.

> "Distribution" vs. "distribution method" is another case like this. We're not using this term on Tor Metrics yet. But we should be sure to pick the term that is most common in the Tor universe. I vaguely recall the term "distributor" from when I wrote the BridgeDB spec, but that term might be too technical for this purpose. Anyway, what's the term you would want us to use here?

I like "distributor" more than "distribution" because it's more specific and suggests that there's a component that actively distributes bridges.

> Regarding your suggestions to graph titles, keep in mind that short titles work better on the website (and on graph images) than long ones. At the same time they need to be meaningful in the context of all other graphs/tables on the Metrics website.

Gotcha! Just to be clear: do you think that "BridgeDB requests for each bridge type" and/or "BridgeDB requests for each distributor" is too long or otherwise problematic?

> You wrote in several places that "unsuccessful requests (e.g., invalid emails or incorrect CAPTCHAs) are not included in these numbers." But is this true in your code and in mine? What would I have to do to filter out these requests?

Oops, you're right. The two diagrams we want to include contain both successful and unsuccessful requests. I think that's ok as long as we mention it in the description, which I intend to do.

> The "unbin" part is about the -bin_size/2, yes. I already made that change in the code locally, and we should document it.

Ok, no problem.

> Would you want to provide your changes as a Git-formatted patch (or Git branch) based on mine? That would make it a bit easier for me to include your changes (and harder to miss good suggestions).

Sure, no problem!

Trac:
Sponsor: N/A to Sponsor30-must

I pushed a squash commit where I made a few changes and tried to include most of your suggestions. Please take a look if you can.

A few comments on that:

- I kept the term "transport" and used "distributor" instead of "distribution (method)".
- The current graph names are "BridgeDB requests by requested transport" and "BridgeDB requests by distributor". If we can keep the "by" instead of "for each", I'd prefer that. I put in the "requested" to avoid confusion with a hypothetical statistic on requests coming in via different transports. If you think that it would be clear enough without the "requested" part, we can take that out, too.
- There are two TODOs left in the sources. If you have suggestions for useful text there, I'd gladly put that in. Otherwise I can make something up.

Regarding code review, I don't think we'll get one before the holidays. But this code is separate from other code, so I'd say it's fine to deploy it anyway. Worst thing that can happen is that the numbers aren't correct, but I think that you would be the best person to notice that. The code review by irl would then happen in January, and if I screwed up something, I'd fix that then.

Trac:
Status: needs_revision to needs_review

Trac:
0001-Elaborate-on-total-request-numbers-and-bin-size.patch

metrics-web patch

Replying to karsten:

> I pushed a squash commit where I made a few changes and tried to include most of your suggestions. Please take a look if you can.

Thanks, looks good to me.

> The current graph names are "BridgeDB requests by requested transport" and "BridgeDB requests by distributor". If we can keep the "by" instead of "for each", I'd prefer that. I put in the "requested" to avoid confusion with a hypothetical statistic on requests coming in via different transports. If you think that it would be clear enough without the "requested" part, we can take that out, too.

I'm fine with "by" instead of "for each". And "requested transport" is a good idea, now that I think about it.

> There are two TODOs left in the sources. If you have suggestions for useful text there, I'd gladly put that in. Otherwise I can make something up.

Here's a patch to address these two TODOs.

> Regarding code review, I don't think we'll get one before the holidays. But this code is separate from other code, so I'd say it's fine to deploy it anyway. Worst thing that can happen is that the numbers aren't correct, but I think that you would be the best person to notice that. The code review by irl would then happen in January, and if I screwed up something, I'd fix that then.

Sounds good and thanks again for your help with this!

Trac:
Status: needs_review to needs_revision

Thanks! I applied your patch, squashed commits, pushed them to master, and deployed the CSV-file-generating code to the server. It'll run overnight (to update all CSV files, not just this one), and tomorrow morning I'll check how that went and also deploy the website update. If that all works well, the only remaining part is the code review that will happen in January.

All changes are now deployed:

https://metrics.torproject.org/bridgedb-transport.html

https://metrics.torproject.org/bridgedb-distributor.html

Please let me know if anything looks wrong!

Leaving this ticket open for the code review of commit 1e056fc in January.

Trac:
Status: needs_revision to needs_review

lgtm

Trac:
Status: needs_review to merge_ready

Glad to hear! The patch is already merged and deployed, so there's nothing else left to do here. Closing. Thanks!

Trac:
Status: merge_ready to closed
Resolution: N/A to fixed

closed

changed time estimate to 16h

mentioned in issue #32781 (moved)

mentioned in issue #31274 (moved)
