Opened 11 months ago

Closed 8 months ago

#32135 closed task (fixed)

Write BridgeDB metrics parser and analyse existing data

Reported by: phw
Owned by: phw
Priority: Medium
Milestone:
Component: Metrics/Statistics
Version:
Severity: Normal
Keywords: s30-o21a1
Cc: phw, karsten, metrics-team
Actual Points:
Parent ID: #31274
Points: 2
Reviewer:
Sponsor: Sponsor30-must

Description

We've been collecting BridgeDB usage metrics (#9316) for a while now. It's time to write a parser for our metrics format, run it over our data, and extract insight from it. Questions that come to mind are:

  • How stable are our request patterns?
  • What's the ranking of PT popularity?
  • How many (un)successful requests are there broken down by distributor?
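For reference, here is a minimal sketch of what parsing a single metrics line might look like. The line layout (`bridgedb-metric-count`, then a dot-separated key, then a count) and the order of the key parts are assumptions and should be checked against the BridgeDB metrics specification. `MetricCount` and `parse_metric_count` are hypothetical names, not part of BridgeDB or of the scripts used for this ticket.

```python
from typing import NamedTuple, Optional


class MetricCount(NamedTuple):
    distributor: str   # e.g. "https", "moat", "email"
    transport: str     # e.g. "obfs4", "vanilla"
    cc_or_email: str   # country code or email provider, e.g. "ru", "gmail"
    success: bool      # True for "success", False for "fail"
    count: int         # binned request count, exactly as reported


def parse_metric_count(line: str) -> Optional[MetricCount]:
    """Parse one metric-count line; return None for any other line.

    Assumed layout (verify against the spec):
        bridgedb-metric-count DISTRIBUTOR.TRANSPORT.CC_OR_EMAIL.OUTCOME[.MORE] COUNT
    """
    parts = line.split()
    if len(parts) != 3 or parts[0] != "bridgedb-metric-count":
        return None
    key_parts = parts[1].split(".")
    if len(key_parts) < 4:
        raise ValueError(f"unexpected metric key: {parts[1]!r}")
    distributor, transport, cc_or_email, outcome = key_parts[:4]
    if outcome not in ("success", "fail"):
        raise ValueError(f"unexpected outcome: {outcome!r}")
    return MetricCount(distributor, transport, cc_or_email,
                       outcome == "success", int(parts[2]))
```

With records in this shape, all three questions above reduce to grouping and summing over different fields.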

Attachments (9)

bridge-types.png (96.6 KB) - added by phw 11 months ago.
Requests per bridge types.
bridge-types-nontor.png (95.3 KB) - added by phw 11 months ago.
Requests per bridge types (without Tor over HTTPS).
distribution-mechanisms.png (60.2 KB) - added by phw 11 months ago.
Requests per distribution mechanisms.
distribution-mechanisms-notor.png (67.6 KB) - added by phw 11 months ago.
Requests per distribution mechanisms (without Tor over HTTPS).
(un)successful-email.png (66.9 KB) - added by phw 11 months ago.
(Un)successful requests over email.
(un)successful-moat.png (85.9 KB) - added by phw 11 months ago.
(Un)successful requests over moat.
(un)successful-nontor-https.png (82.5 KB) - added by phw 11 months ago.
(Un)successful requests without Tor over HTTPS.
(un)successful-tor-https.png (61.2 KB) - added by phw 11 months ago.
(Un)successful requests with Tor over HTTPS.
0001-Elaborate-on-total-request-numbers-and-bin-size.patch (1.8 KB) - added by phw 9 months ago.
metrics-web patch

Change History (26)

Changed 11 months ago by phw

Attachment: bridge-types.png added

Requests per bridge types.

Changed 11 months ago by phw

Attachment: bridge-types-nontor.png added

Requests per bridge types (without Tor over HTTPS).

Changed 11 months ago by phw

Attachment: distribution-mechanisms.png added

Requests per distribution mechanisms.

Changed 11 months ago by phw

Attachment: distribution-mechanisms-notor.png added

Requests per distribution mechanisms (without Tor over HTTPS).

Changed 11 months ago by phw

Attachment: (un)successful-email.png added

(Un)successful requests over email.

Changed 11 months ago by phw

Attachment: (un)successful-moat.png added

(Un)successful requests over moat.

Changed 11 months ago by phw

Attachment: (un)successful-nontor-https.png added

(Un)successful requests without Tor over HTTPS.

Changed 11 months ago by phw

Attachment: (un)successful-tor-https.png added

(Un)successful requests with Tor over HTTPS.

comment:1 Changed 11 months ago by phw

Below are a bunch of diagrams that I created from our BridgeDB usage metrics. The scripts that I used to create these diagrams are available here: https://dip.torproject.org/phw/bridgedb-metrics-visualisation

Note that none of the diagrams below contains vanilla bridges because of a bug (#32203). (In fact, I discovered this bug by glancing at these visualisations and wondering why there are no vanilla bridges.)

Requests per bridge types

Requests per bridge types.

Requests per bridge types (without Tor over HTTPS).

The first diagram suggests that ScrambleSuit and obfs3 are relatively popular, but once we remove HTTPS requests coming from Tor (which we believe to be almost entirely bots), basically only obfs4 remains. Note that this visualisation is an aggregate over all distribution mechanisms. An average of ~7,500 obfs4 requests per day is a surprisingly large number to me. There may very well be more bots among these requests.

Requests per distribution mechanism

Requests per distribution mechanisms.

Requests per distribution mechanisms (without Tor over HTTPS).

Again, we can see a significant difference between all requests and requests without bots. With bots removed, moat is our most active distribution mechanism. Note that in moat it's difficult to tell a user request apart from a bot request, so this visualisation assumes that all requests are from users, which is probably false.

(Un)successful requests per distribution mechanism

(Un)successful requests over email.

Most users (assuming that these requests are in fact from users) succeed at getting bridges over email. The drop in mid-October was because of a bug in BridgeDB (#32105).

(Un)successful requests over moat.

Moat is looking worse. A sizable fraction of requests are unsuccessful, presumably because of the difficulty of our CAPTCHA.

(Un)successful requests without Tor over HTTPS.

The situation is similar for HTTPS requests excluding requests over Tor (which we believe are bots).

(Un)successful requests with Tor over HTTPS.

HTTPS requests over Tor only are surprisingly successful. These bots are likely using a classifier that solves our CAPTCHAs.
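As a rough sketch of how the (un)successful breakdown can be computed: sum the reported counts per distributor, once over all requests and once over successful ones only. The function name and the input shape, an iterable of `(distributor, success, count)` tuples, are assumptions for illustration and not taken from the visualisation scripts linked above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_share(counts: Iterable[Tuple[str, bool, int]]) -> Dict[str, float]:
    """Return the fraction of successful requests per distributor."""
    ok: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for distributor, success, count in counts:
        total[distributor] += count
        if success:
            ok[distributor] += count
    # Distributors without any requests are left out to avoid dividing by zero.
    return {d: ok[d] / total[d] for d in total if total[d] > 0}
```

Because the reported counts are binned, these shares are approximations, and the error grows for distributors with few requests.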

comment:2 Changed 10 months ago by karsten

Cc: karsten metrics-team added

comment:3 Changed 10 months ago by karsten

phw, you asked about adding any of these graphs to the metrics website. It does involve some coding and some documenting, but I think that having these graphs would be very useful, so let's talk.

I guess my first question would be how many graphs you'd like to see on the metrics website. Ideally, we would be able to compress these graphs above into one or two. Unlike these one-off graphs the ones on the metrics website would be somewhat configurable with the parameters we're using for other graphs (or new ones, if really needed). So, can you think about your one or two most important graphs for the metrics website?

comment:4 in reply to:  3 Changed 10 months ago by phw

Replying to karsten:

phw, you asked about adding any of these graphs to the metrics website. It does involve some coding and some documenting, but I think that having these graphs would be very useful, so let's talk.


How can we help with the coding or documenting?

I guess my first question would be how many graphs you'd like to see on the metrics website. Ideally, we would be able to compress these graphs above into one or two. Unlike these one-off graphs the ones on the metrics website would be somewhat configurable with the parameters we're using for other graphs (or new ones, if really needed). So, can you think about your one or two most important graphs for the metrics website?


I would choose "requests per bridge type" and "requests per distribution mechanism" (and possibly exclude "Tor requests over HTTPS" or, even better, make it configurable):

Requests per bridge types (without Tor over HTTPS).
Requests per distribution mechanisms (without Tor over HTTPS).

comment:5 Changed 9 months ago by karsten

I started working on adding these two graphs to Tor Metrics today, and I'm planning to have something to review by end of the day.

Two questions:

  • Do you mind if we change the ticket component to Metrics/Statistics? In that case the ticket would be resolved as soon as two new graphs appear on Tor Metrics. But if you want to keep this ticket for writing your own parser and analyzing results, maybe we should open a new ticket. Up to you!
  • You graphed the binned numbers, that is, numbers rounded up to multiples of 10. In other graphs we're subtracting bin_size/2 as an approximation to the original number. That's a choice, though, and I'd understand if you'd rather want to graph exactly what's in the statistics. I guess my question is: did you intentionally not subtract anything?
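To make the second question concrete, here is a small sketch of the "unbinning" approximation. It assumes a bin size of 10, as mentioned above, and that a reported value of 0 means no requests. The zero handling is an assumption of this sketch and not something confirmed for the Tor Metrics code.

```python
def unbin(reported: int, bin_size: int = 10) -> float:
    """Approximate the original count behind a rounded-up, binned count.

    A reported value of 10 stands for an original count between 1 and 10,
    so subtracting bin_size / 2 lands near the middle of that range.
    """
    if reported <= 0:
        # Assumption: a reported zero means no requests, so nothing to subtract.
        return 0.0
    return reported - bin_size / 2
```

For example, `unbin(10)` gives `5.0` and `unbin(30)` gives `25.0`. Summing many binned counts without this correction overestimates the total by roughly `bin_size / 2` per non-zero entry.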

Will be back here (or on another ticket) in a few hours with code and sample graphs.

comment:6 Changed 9 months ago by karsten

Status: assigned → needs_review

Okay, I finished a first patch that processes BridgeDB metrics once per day to produce a .csv file and that adds two graphs to Tor Metrics. Can you please take a look at that patch, not regarding the Java/R code, but regarding user-facing documentation of the two new graphs? In particular, please take a look at the TODOs in that patch. (irl, I'll ask you to review a revised branch for the code portions once the documentation parts are all set.)

By the way, while reading your code, I found that you're only looking at BridgeDB metrics files in CollecTor's recent/ directory. There's currently a (minor) bug in CollecTor where we never remove files from that directory. I'm going to fix that at some point, and then your script will only provide the latest three files. A possible fix would be to also process files in CollecTor's archive/ directory. Not sure how much of an issue that is when these graphs exist on Tor Metrics, but I thought I should let you know.

comment:7 in reply to:  5 Changed 9 months ago by phw

Component: Circumvention/BridgeDB → Metrics/Statistics

Replying to karsten:

  • Do you mind if we change the ticket component to Metrics/Statistics? In that case the ticket would be resolved as soon as two new graphs appear on Tor Metrics. But if you want to keep this ticket for writing your own parser and analyzing results, maybe we should open a new ticket. Up to you!


We already have a parser and took a look at some data, so we're done from the anti-censorship team's side. I changed the component to Metrics/Statistics.

  • You graphed the binned numbers, that is, numbers rounded up to multiples of 10. In other graphs we're subtracting bin_size/2 as an approximation to the original number. That's a choice, though, and I'd understand if you'd rather want to graph exactly what's in the statistics. I guess my question is: did you intentionally not subtract anything?


Oh, good point. Let's subtract bin_size/2 here as well, to make the diagrams consistent with Metrics's other diagrams, and more representative of the original number.

comment:8 in reply to:  6 Changed 9 months ago by phw

Status: needs_review → needs_revision

Thanks for your work on this!

Replying to karsten:

Okay, I finished a first patch that processes BridgeDB metrics once per day to produce a .csv file and that adds two graphs to Tor Metrics. Can you please take a look at that patch, not regarding the Java/R code, but regarding user-facing documentation of the two new graphs? In particular, please take a look at the TODOs in that patch. (irl, I'll ask you to review a revised branch for the code portions once the documentation parts are all set.)


Commit 93f2500c:

For bridgedb-transport, I would change the title to:

"BridgeDB requests for each bridge type"

...and the description to:

"<p>This graph shows the number of BridgeDB requests for each bridge type. BridgeDB requests over Tor and unsuccessful requests (e.g., invalid emails or incorrect CAPTCHAs) are not included in these numbers.</p>"

For bridgedb-distribution, I would change the title to:

"BridgeDB requests for each distribution method"

...and the description to:

"<p>This graph shows the number of BridgeDB requests for each distribution method. HTTPS requests over Tor and unsuccessful requests (e.g., invalid emails or incorrect CAPTCHAs) are not included in these numbers.</p>"

Here are my changes to commit 93f2500c:

<h3 id="bridgedb-stats" class="hover">BridgeDB requests
<a href="#bridgedb-stats" class="anchor">#</a>
</h3>

<p>BridgeDB metrics contain aggregated information about requests to the
BridgeDB service.  BridgeDB keeps track of each request per distribution method
(HTTPS, moat, email), per bridge type (e.g., vanilla or obfs4), per country code
or email provider (e.g., "ru" or "gmail"), and per request success ("success" or
"fail"). Every 24 hours, BridgeDB writes these metrics to disk and then begins
a new measurement interval.</p>

<p>The following description applies to the following graphs:</p>

<ul>
<li>BridgeDB requests by bridge type <a href="/bridgedb-transport.html" class="btn btn-primary btn-xs"><i class="fa fa-chevron-right" aria-hidden="true"></i> graph</a></li>
<li>BridgeDB requests by distribution <a href="/bridgedb-distribution.html" class="btn btn-primary btn-xs"><i class="fa fa-chevron-right" aria-hidden="true"></i> graph</a></li>
</ul>

<h4>Step 1: Parse BridgeDB metrics to obtain reported request numbers</h4>

<p>Obtain BridgeDB metrics from <a href="/collector.html#type-bridgedb-metrics">CollecTor</a>.
Refer to the <a href="https://gitweb.torproject.org/bridgedb.git/tree/doc/bridgedb-metrics-spec.txt">BridgeDB metrics specification</a> for details on the descriptor format.</p>

<h4>Step 2: Skip requests coming in over Tor exits</h4>

<p>Skip any request counts with <code>zz</code> as their
<code>CC/EMAIL</code> metrics key part.  We use the <code>zz</code> pseudo
country code for requests originating from Tor exit relays.  We're discarding
these requests because <a href="https://bugs.torproject.org/32117">bots use the
Tor network to crawl BridgeDB</a> and including bot requests would provide a
false sense of how users interact with BridgeDB.  Note that BridgeDB maintains
a separate distribution pool for requests coming from Tor exit relays.</p>

<h4>Step 3: Aggregate requests by date, distribution method, and bridge type</h4>

<p>BridgeDB metrics contain request numbers broken down by distribution method,
bridge type, and a few more dimensions.  For our purposes we only care about
total request numbers by date and either distribution method or bridge type.
We're using request sums by these three dimensions as aggregates.  As date
we're using the date of the BridgeDB metrics interval end.  If we encounter
more than one BridgeDB metrics interval end on the same UTC date (which
shouldn't be possible with an interval length of 24 hours), we arbitrarily keep
whichever we process first.</p>

</div>

<div class="container">
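The filtering and aggregation described in steps 2 and 3 of the snippet above can be sketched roughly as follows. This is an illustration and not the code from the metrics-web patch. It assumes the metrics have already been parsed into `(date, distributor, transport, cc_or_email, count)` tuples, and it does not handle the "more than one interval end on the same UTC date" case.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (date, distributor, transport, cc_or_email, count)
Record = Tuple[str, str, str, str, int]
Totals = Dict[Tuple[str, str], int]


def aggregate(records: Iterable[Record]) -> Tuple[Totals, Totals]:
    """Sum request counts by (date, distributor) and by (date, transport)."""
    by_distributor: Totals = defaultdict(int)
    by_transport: Totals = defaultdict(int)
    for date, distributor, transport, cc_or_email, count in records:
        if cc_or_email == "zz":
            # Step 2: skip requests that came in over Tor exit relays.
            continue
        # Step 3: keep only the dimensions needed for the two graphs.
        by_distributor[(date, distributor)] += count
        by_transport[(date, transport)] += count
    return dict(by_distributor), dict(by_transport)
```

Keeping two separate totals mirrors the two graphs: each one needs the date plus only one of the two remaining dimensions.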

I wasn't sure what "TODO If we're supposed to 'unbin' numbers, this is probably where we should say that." meant, so I deleted the line. Is this about the bin_size/2 modification you mentioned above?

In commit 93f2500c, I would replace "transport" with "bridge type" (because we include vanilla, which is technically the absence of a transport protocol) and "distribution" with "distribution method". I would also change:

<li><b>transport:</b> Name of the pluggable transport protocol, which includes <code>"obfs2"</code>, <code>"obfs3"</code>, <code>"obfs4"</code>, <code>"scramblesuit"</code>, and <code>"fte"</code>, and which will change in the future.</li>

to

<li><b>transport:</b> Name of the bridge type, which includes <code>"vanilla"</code>, <code>"obfs2"</code>, <code>"obfs3"</code>, <code>"obfs4"</code>, <code>"scramblesuit"</code>, and <code>"fte"</code>, and which will change in the future.</li>

We may want to change the column's name to something like "bridge_type", but I think it's also ok to keep it.

By the way, while reading your code, I found that you're only looking at BridgeDB metrics files in CollecTor's recent/ directory. There's currently a (minor) bug in CollecTor where we never remove files from that directory. I'm going to fix that at some point, and then your script will only provide the latest three files. A possible fix would be to also process files in CollecTor's archive/ directory. Not sure how much of an issue that is when these graphs exist on Tor Metrics, but I thought I should let you know.


Thanks, I appreciate the warning!

Last edited 9 months ago by phw

comment:9 Changed 9 months ago by karsten

Thanks for your comments! A couple questions before I continue putting your suggestions in:

  • The change from "transport" to "bridge type" is a tricky one. We're using "transport" where you'd use "bridge type" for other graphs, like this one. I guess part of the rationale was that even with the default OR protocol, messages are being transported between client and bridge, just not using one of the pluggable transports but using the built-in one. But we've been using this term for almost a decade now, so it's well possible that it's merely the result of me convincing myself that it must be a good idea. That being said, part of me would prefer keeping the "transport" term here. However, if you think this is too confusing, let's consider making this change. What do you prefer?
  • "Distribution" vs. "distribution method" is another case like this. We're not using this term on Tor Metrics yet. But we should be sure to pick the term that is most common in the Tor universe. I vaguely recall the term "distributor" from when I wrote the BridgeDB spec, but that term might be too technical for this purpose. Anyway, what's the term you would want us to use here?
  • Regarding your suggestions to graph titles, keep in mind that short titles work better on the website (and on graph images) than long ones. At the same time they need to be meaningful in the context of all other graphs/tables on the Metrics website.
  • You wrote in several places that "unsuccessful requests (e.g., invalid emails or incorrect CAPTCHAs) are not included in these numbers." But is this true in your code and in mine? What would I have to do to filter out these requests?
  • The "unbin" part is about the -bin_size/2, yes. I already made that change in the code locally, and we should document it.

Would you want to provide your changes as a Git-formatted patch (or Git branch) based on mine? That would make it a bit easier for me to include your changes (and harder to miss out good suggestions).

I'll work more on this tomorrow. Thanks!

comment:10 in reply to:  9 Changed 9 months ago by phw

Replying to karsten:

  • The change from "transport" to "bridge type" is a tricky one. We're using "transport" where you'd use "bridge type" for other graphs, like this one. I guess part of the rationale was that even with the default OR protocol, messages are being transported between client and bridge, just not using one of the pluggable transports but using the built-in one. But we've been using this term for almost a decade now, so it's well possible that it's merely the result of me convincing myself that it must be a good idea. That being said, part of me would prefer keeping the "transport" term here. However, if you think this is too confusing, let's consider making this change. What do you prefer?


Using "transport" is fine – especially because we should be consistent with the graphs that are already present.

  • "Distribution" vs. "distribution method" is another case like this. We're not using this term on Tor Metrics yet. But we should be sure to pick the term that is most common in the Tor universe. I vaguely recall the term "distributor" from when I wrote the BridgeDB spec, but that term might be too technical for this purpose. Anyway, what's the term you would want us to use here?


I like "distributor" more than "distribution" because it's more specific and suggests that there's a component that actively distributes bridges.

  • Regarding your suggestions to graph titles, keep in mind that short titles work better on the website (and on graph images) than long ones. At the same time they need to be meaningful in the context of all other graphs/tables on the Metrics website.


Gotcha! Just to be clear: do you think that "BridgeDB requests for each bridge type" and/or "BridgeDB requests for each distributor" is too long or otherwise problematic?

  • You wrote in several places that "unsuccessful requests (e.g., invalid emails or incorrect CAPTCHAs) are not included in these numbers." But is this true in your code and in mine? What would I have to do to filter out these requests?


Oops, you're right. The two diagrams we want to include contain both successful and unsuccessful requests. I think that's ok as long as we mention it in the description, which I intend to do.

  • The "unbin" part is about the -bin_size/2, yes. I already made that change in the code locally, and we should document it.


Ok, no problem.

Would you want to provide your changes as a Git-formatted patch (or Git branch) based on mine? That would make it a bit easier for me to include your changes (and harder to miss out good suggestions).


Sure, no problem!

comment:11 Changed 9 months ago by phw

Sponsor: Sponsor30-must

comment:12 Changed 9 months ago by karsten

Status: needs_revision → needs_review

I pushed a squash commit where I made a few changes and tried to include most of your suggestions. Please take a look if you can.

A few comments on that:

  • I kept the term "transport" and used "distributor" instead of "distribution (method)".
  • The current graph names are "BridgeDB requests by requested transport" and "BridgeDB requests by distributor". If we can keep the "by" instead of "for each", I'd prefer that. I put in the "requested" to avoid confusion with a hypothetical statistic on requests coming in via different transports. If you think that it would be clear enough without the "requested" part, we can take that out, too.
  • There are two TODOs left in the sources. If you have suggestions for useful text there, I'd gladly put that in. Otherwise I can make something up.

Regarding code review, I don't think we'll get one before the holidays. But this code is separate from other code, so I'd say it's fine to deploy it anyway. Worst thing that can happen is that the numbers aren't correct, but I think that you would be the best person to notice that. The code review by irl would then happen in January, and if I screwed up something, I'd fix that then.

Changed 9 months ago by phw

Attachment: 0001-Elaborate-on-total-request-numbers-and-bin-size.patch added

metrics-web patch

comment:13 in reply to:  12 Changed 9 months ago by phw

Status: needs_review → needs_revision

Replying to karsten:

I pushed a squash commit where I made a few changes and tried to include most of your suggestions. Please take a look if you can.


Thanks, looks good to me.

  • The current graph names are "BridgeDB requests by requested transport" and "BridgeDB requests by distributor". If we can keep the "by" instead of "for each", I'd prefer that. I put in the "requested" to avoid confusion with a hypothetical statistic on requests coming in via different transports. If you think that it would be clear enough without the "requested" part, we can take that out, too.


I'm fine with "by" instead of "for each". And "requested transport" is a good idea, now that I think about it.

  • There are two TODOs left in the sources. If you have suggestions for useful text there, I'd gladly put that in. Otherwise I can make something up.


Here's a patch to address these two TODOs.

Regarding code review, I don't think we'll get one before the holidays. But this code is separate from other code, so I'd say it's fine to deploy it anyway. Worst thing that can happen is that the numbers aren't correct, but I think that you would be the best person to notice that. The code review by irl would then happen in January, and if I screwed up something, I'd fix that then.


Sounds good and thanks again for your help with this!

comment:14 Changed 9 months ago by karsten

Thanks! I applied your patch, squashed commits, pushed them to master, and deployed the CSV-file-generating code to the server. It'll run overnight (to update all CSV files, not just this one), and tomorrow morning I'll check how that went and also deploy the website update. If that all works well, the only remaining part is the code review that will happen in January.

comment:15 Changed 9 months ago by karsten

Status: needs_revision → needs_review

All changes are now deployed:

https://metrics.torproject.org/bridgedb-transport.html

https://metrics.torproject.org/bridgedb-distributor.html

Please let me know if anything looks wrong!

Leaving this ticket open for the code review of commit 1e056fc in January.

comment:16 Changed 8 months ago by irl

Status: needs_review → merge_ready

lgtm

comment:17 Changed 8 months ago by karsten

Resolution: fixed
Status: merge_ready → closed

Glad to hear! The patch is already merged and deployed, so there's nothing else left to do here. Closing. Thanks!
