Opened 6 years ago

Last modified 3 years ago

#10218 assigned enhancement

Provide "users-per-transport-per-country" statistics for obfsbridges

Reported by: asn
Owned by: joelanders
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-bridge, tor-pt, bridgedb-dist
Cc: wfn, yawning, isis, mrphs, tim@…, dcf
Actual Points:
Parent ID:
Points: 3
Reviewer:
Sponsor: SponsorS-can

Description

In our bridge-stats file we currently have bridge-ips which tracks the number of connections per-country, and bridge-ip-transports which counts the number of connections per-transport.

Still, these two data points don't allow you to infer the users-per-transport-per-country, which would give us useful information when a transport is blocked in a specific jurisdiction, etc.

gamambel today suggested that we add such functionality, which seems like a marvelous idea.

wfn and grindhold seem interested in coding this. As discussed on IRC, the interesting functions here are geoip_get_transport_history(), geoip_format_bridge_stats(), and validate_bridge_stats().

We should also define a nice format for this new line in bridge-stats. A not-so-nice format is:
bridge-ip-transports-per-country cn::obfs2:42,obfs3:46 ir::obfs2:10,obfs3:666
We might be able to find a better one (or even one that is already used somewhere else in tor).
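
To make the strawman format above concrete, here is a minimal Python sketch that writes and reads such a line. It is only an illustration of the "not nice" format from this description, not code from tor; the function names format_line and parse_line are made up for this example.

```python
KEYWORD = "bridge-ip-transports-per-country"

def format_line(counts):
    """Turn {country: {transport: count}} into one bridge-stats line."""
    parts = []
    for cc in sorted(counts):
        per_pt = ",".join(f"{pt}:{n}" for pt, n in sorted(counts[cc].items()))
        parts.append(f"{cc}::{per_pt}")
    return KEYWORD + " " + " ".join(parts)

def parse_line(line):
    """Inverse of format_line; raises ValueError on an unexpected keyword."""
    keyword, *entries = line.split()
    if keyword != KEYWORD:
        raise ValueError("unexpected keyword: " + keyword)
    counts = {}
    for entry in entries:
        cc, _, rest = entry.partition("::")
        counts[cc] = {}
        for item in rest.split(","):
            # Split on the last colon so the count is always the final field.
            pt, _, n = item.rpartition(":")
            counts[cc][pt] = int(n)
    return counts
```

The double colon between country and transports is what makes this format awkward: the parser has to treat "::" and ":" differently within the same entry.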

Child Tickets

Attachments (9)

ptcc.R (1.9 KB) - added by dcf 4 years ago.
ptcc-obfs3.png (13.6 KB) - added by dcf 4 years ago.
ptcc-meek.png (13.4 KB) - added by dcf 4 years ago.
ptcc-fte.png (13.6 KB) - added by dcf 4 years ago.
ptcc-time.R (2.3 KB) - added by dcf 4 years ago.
ptcc-time-obfs3.png (42.6 KB) - added by dcf 4 years ago.
ptcc-time-meek.png (33.0 KB) - added by dcf 4 years ago.
ptcc-time-fte.png (45.8 KB) - added by dcf 4 years ago.
ptcc-time-obfs4.png (52.8 KB) - added by dcf 4 years ago.


Change History (43)

comment:1 Changed 6 years ago by asn

Keywords: tor-bridge tor-pt added; `tor-bridge` `tor-pt` removed

comment:2 Changed 6 years ago by karsten

Status: new → needs_information

This may seem like a marvelous idea from a data perspective. But I'm less thrilled about the potential privacy implications. Knowing the number of users in a smaller country using a not so common transport is a lot of information, compared to knowing the total number of users in that country and the total number of users of that transport. I'm leaning towards not knowing these details even if that prevents us from knowing about blocked transports in specific jurisdictions. Changing to needs_information until we're sure whether we want to build this or not.

Here's an alternative approach: we could compare by-country distributions of bridges offering different sets of transports; if the bridges offering only obfs2 hardly see any Chinese users, but the obfs2+obfs3 bridges do, we'll learn something. The good news is that we don't need any new data for this, which would take another Tor release and 6--12 months to show up in descriptors. The bad news is that this may not be as straightforward to build. Related to #8462. If wfn and/or grindhold want to help with this, great!
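
A rough sketch of the comparison described above, assuming we already have, per bridge, the set of transports it offers and its per-country user counts. The input shape and the function name are assumptions for illustration only.

```python
from collections import defaultdict

def country_shares_by_transport_set(bridges):
    """bridges: iterable of (transports, {country: users}) pairs.

    Groups bridges by the exact set of transports they offer and returns,
    for each group, the share of users coming from each country.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for transports, per_country in bridges:
        group = frozenset(transports)
        for cc, n in per_country.items():
            sums[group][cc] += n
    shares = {}
    for group, per_country in sums.items():
        total = sum(per_country.values())
        if total > 0:
            shares[group] = {cc: n / total for cc, n in per_country.items()}
    return shares
```

If the share of "cn" is close to zero for the obfs2-only group but substantial for the obfs2+obfs3 group, that hints at obfs2 being blocked there, without needing any new statistics from bridges.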

comment:3 Changed 6 years ago by mo

I understand the concerns. But I don't really see how this differs from configuring our bridges to support only a single pluggable transport each in order to get such data. The plan was that *all* our bridges support all current pluggable transports. We're talking 2000 stable bridges here.

comment:4 Changed 6 years ago by karsten

The idea was not to explicitly configure bridges to support only a subset of transports. This will happen naturally as transports come and go. Compare this to the different Tor versions run by relays. Ideally, we'd want all relays to run the most recent stable version, but that's not what happens in reality. The idea was to make use of delayed transport upgrade behavior to learn something interesting here.

comment:5 Changed 6 years ago by nickm

Milestone: Tor: 0.2.???

comment:6 Changed 6 years ago by wfn

Cc: wfn added

comment:7 Changed 5 years ago by yawning

Cc: yawning added

comment:8 Changed 5 years ago by asn

We should start thinking about this again. Maybe try to apply the obfuscation of the HS statistics in these statistics (#13987).

Without these statistics it's very hard to evaluate how well our PTs are doing in countries like China. It's also hard to evaluate bridge distribution mechanisms in a country.


comment:9 Changed 5 years ago by isis

Cc: isis added
Keywords: bridgedb-dist added

I would also find great use in these statistics as far as BridgeDB is concerned, and I think that this is a wonderful idea.

comment:10 Changed 5 years ago by ailanthus

I agree that these statistics would be very very useful--a chart that I could show human rights groups, funders, and activists would be great.

comment:11 Changed 5 years ago by mrphs

Cc: mrphs added

comment:12 Changed 5 years ago by joelanders

Owner: set to joelanders
Status: needs_information → assigned

Talked to asn on IRC about this; would like to take a shot at it in the next week.
I've had a look at the interesting functions, and am trying to digest some of the Differential Privacy stuff.
Think I can manage, once I figure out the smartlists and strmaps.

It's probably not worth the trouble to try to combine geoip_get_transport_history(), geoip_get_client_history(), and this new function?
We'll be doing a HT_FOREACH(ent, clientmap, &client_history) {...} in a few places.

And y'all'll need to figure out the binning / noise parameters for me :)
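
For reference while those parameters are undecided, here is a small Python sketch of the general "bin, then add noise" idea, in the spirit of what comment:8 suggests borrowing from the hidden-service statistics. The parameter names and default values (bin_size, delta_f, epsilon) are placeholders picked for this example, not decided values, and this is not the tor implementation.

```python
import random

def round_up_to_bin(count, bin_size):
    """Round a non-negative count up to the next multiple of bin_size."""
    return -(-count // bin_size) * bin_size

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def obfuscate(count, bin_size=8, delta_f=8, epsilon=0.3, rng=random):
    """Bin the count, then add integer-rounded Laplace noise.

    All three parameters are hypothetical defaults for illustration.
    """
    binned = round_up_to_bin(count, bin_size)
    return binned + round(laplace_noise(delta_f / epsilon, rng))
```

Note that the result can be negative for small counts; whoever consumes the reported values would have to expect that.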

comment:13 Changed 4 years ago by mrphs

Hello friends, any update on this?

This data (ideally after being turned into a nice graph on metrics) will help us understand how many users we have in countries where using PTs is the only way to access the Tor network. It would also give us a better understanding of how they're censoring the internet and whether specific PTs get blocked or probed.

comment:14 Changed 4 years ago by joelanders

(I got a job right after my last message, have since quit it. I'll take a look at this again, try to page it back in.)

comment:15 Changed 4 years ago by isis

Keywords: 028-triage added
Milestone: Tor: 0.2.??? → Tor: 0.2.8.x-final
Points: medium
Type: task → enhancement

I'm happy to help or to take this ticket.

comment:16 in reply to:  15 Changed 4 years ago by mrphs

Severity: Normal

Replying to isis:

I'm happy to help or to take this ticket.

Yay! \o/ - It's absolutely not my place to say anything about this. I'm just one of the people who are interested in the result. But I don't think much (or any) work has been done on this front so far, other than this ticket. And since the sooner we have it assigned, the sooner it gets in tor and the sooner we'll have the result, may I ask if anyone has any objection if isis takes this ticket? (hopefully she's still interested :)

comment:17 Changed 4 years ago by tsammut

Cc: tim@… added

comment:18 Changed 4 years ago by joelanders

Ach ja, you better take it; I'm being all flighty.

comment:19 Changed 4 years ago by asn

Another way to cheaply (and partially) bypass the privacy issue here would be to say "Only report statistics from countries with more than N=500 users of $PT" where PT can be obfs3, obfs4, meek, etc. We can also tweak the threshold N to be 1000 or more if needed.

This way we ensure that only very popular pluggable transports will be displayed in the stats. And because the number of users is (arguably) that large, this should not reveal information about individual users. We can also do this on top of the regular noise that we would add.

While this is definitely not ideal, this statistic can be useful for us because it will tell us which PTs are popular in which countries. If a PT is popular and it stops being in the list of statistics (because the number of users dropped) we still learn that a censorship event is happening.

As we come to understand this statistic better, we can then add more noise or decrease the threshold N accordingly.
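
The threshold idea can be sketched in a few lines of Python. The data layout ({country: {transport: users}}) and the helper names are assumptions made for this example; N=500 is the value floated above.

```python
def apply_threshold(counts, n_min=500):
    """Keep only (country, transport) pairs with more than n_min users."""
    kept = {}
    for cc, per_pt in counts.items():
        popular = {pt: n for pt, n in per_pt.items() if n > n_min}
        if popular:
            kept[cc] = popular
    return kept

def dropped_pairs(previous, current):
    """Pairs reported in the previous period but missing from the current one."""
    before = {(cc, pt) for cc, per_pt in previous.items() for pt in per_pt}
    after = {(cc, pt) for cc, per_pt in current.items() for pt in per_pt}
    return sorted(before - after)
```

A pair showing up in dropped_pairs() is the "it stops being in the list" signal: we don't learn the new count, only that it fell to N or below.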

comment:20 Changed 4 years ago by karsten

Here's something else we could do to get a first estimate of users per transport and country that doesn't require adding new statistics.

It turns out that most large bridges (4 out of 5 on February 1, 2016) only see noteworthy usage via a single transport or have requests via one transport dominating the others in numbers (74% on the 5th large bridge on February 1, 2016). Example:

A72D5DB45D9DE4B244D3F6C4AD22A66F40BF5B87,bridge,responses,,<OR>,,2016-02-01 00:00:00,2016-02-02 00:00:00,4.3
A72D5DB45D9DE4B244D3F6C4AD22A66F40BF5B87,bridge,responses,,obfs3,,2016-02-01 00:00:00,2016-02-02 00:00:00,26892.1
73D8FF840444F84EC50DD755FBAD44CF1F0DE28B,bridge,responses,,<OR>,,2016-02-01 00:00:00,2016-02-02 00:00:00,4.3
73D8FF840444F84EC50DD755FBAD44CF1F0DE28B,bridge,responses,,obfs3,,2016-02-01 00:00:00,2016-02-02 00:00:00,26787.2
88F745840F47CE0C6A4FE61D827950B06F9E4534,bridge,responses,,meek,,2016-02-01 00:00:00,2016-02-02 00:00:00,22049.6
3E0908F131AC417C48DDD835D78FB6887F4CD126,bridge,responses,,<OR>,,2016-02-01 00:00:00,2016-02-01 17:52:31,8.3
3E0908F131AC417C48DDD835D78FB6887F4CD126,bridge,responses,,obfs3,,2016-02-01 00:00:00,2016-02-01 17:52:31,15245.3
3E0908F131AC417C48DDD835D78FB6887F4CD126,bridge,responses,,obfs4,,2016-02-01 00:00:00,2016-02-01 17:52:31,4764.3
3E0908F131AC417C48DDD835D78FB6887F4CD126,bridge,responses,,scramblesuit,,2016-02-01 00:00:00,2016-02-01 17:52:31,476.2
AA033EEB61601B2B7312D89B62AAA23DC3ED8A34,bridge,responses,,<OR>,,2016-02-01 00:00:00,2016-02-02 00:00:00,10.6
AA033EEB61601B2B7312D89B62AAA23DC3ED8A34,bridge,responses,,meek,,2016-02-01 00:00:00,2016-02-02 00:00:00,19024.7

The four bridges with a single transport are easy. The distribution of requests by country exactly matches the distribution by country and transport. Done.

The fifth bridge with multiple transports is trickier. We could assume that the distribution by country is the same for all transports, that is, if a fraction CC (in [0..1]) of requests came from a given country and a fraction PT (also in [0..1]) of requests came in via a given transport, a fraction CC * PT of requests can be attributed to that country and transport. But that assumption may be wrong. What we could also do as a first approximation is find a lower and upper bound of users by country and transport. The lower bound would probably be defined as something like max(0, PT + CC - 1) (not just 0, to account for cases where CC > 1 - PT) and the upper bound as min(PT, CC), even though I could be convinced that other formulas are even more correct.
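
The formulas above, written out as a small Python sketch. The obfs3 share is computed from the fifth bridge's rows in the example; the country share of 0.5 is a made-up value purely for illustration.

```python
def bounds(cc, pt):
    """Lower and upper bound on the fraction of requests that came from a
    given country AND via a given transport, given only the two marginals."""
    lower = max(0.0, pt + cc - 1.0)
    upper = min(pt, cc)
    return lower, upper

def independent_estimate(cc, pt):
    """Point estimate if country and transport were independent."""
    return cc * pt

# Fifth bridge on 2016-02-01: <OR>, obfs3, obfs4, scramblesuit responses.
responses = {"<OR>": 8.3, "obfs3": 15245.3, "obfs4": 4764.3, "scramblesuit": 476.2}
pt_obfs3 = responses["obfs3"] / sum(responses.values())

# Hypothetical: half of this bridge's requests come from one country.
cc_example = 0.5
lower, upper = bounds(cc_example, pt_obfs3)
estimate = independent_estimate(cc_example, pt_obfs3)
```

Multiplying the bounds by the bridge's total responses turns the fractions back into absolute numbers, and the independent estimate always falls between the two bounds.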

My guess is that this approximation would provide us with some insights about actual usage and about requirements for better statistics. The best part is that all required data is already available, we just need to look at it.

The bad news is that I don't have the time before the dev meeting to run this analysis, and I can't even say whether the weeks after the dev meeting will be any better. That's why I uploaded the data for somebody else to do the analysis and publish results here. Any takers?

comment:21 Changed 4 years ago by dcf

Cc: dcf added

Changed 4 years ago by dcf

Attachment: ptcc.R added

Changed 4 years ago by dcf

Attachment: ptcc-obfs3.png added

Changed 4 years ago by dcf

Attachment: ptcc-meek.png added

Changed 4 years ago by dcf

Attachment: ptcc-fte.png added

comment:22 in reply to:  20 Changed 4 years ago by dcf

Replying to karsten:

The bad news is that I don't have the time before the dev meeting to run this analysis, and I can't even say whether the weeks after the dev meeting will be any better. That's why I uploaded the data for somebody else to do the analysis and publish results here. Any takers?

Here is my try: attachment:ptcc.R.

The blue lines show the lower and upper bounds according to the formulas in comment:20. The heavy gray lines go from 0 to the total number of responses for the transport across all bridges.

obfs3

The obfs3 graph shows that the ranges can actually be fairly tight.


meek

In the meek graph, the ranges are super small (almost too small to see); in other words we know the number very precisely. That's because all the existing meek bridges only run meek and nothing else (just a negligible number of <OR> connections).


fte

Most of the lower bounds for FTE bottom out at 0 because PT + CC < 1.


There are a couple of anomalies if you look at the full graphs, caused by rows with duplicate fingerprints. There are just a few of them. I didn't bother to clean them up. For example,

5AFE2BF54983490EDA216813BE12DF1CE98E763A,bridge,responses,,,,2016-02-01 00:00:00,2016-02-01 03:38:47,1.8
5AFE2BF54983490EDA216813BE12DF1CE98E763A,bridge,responses,,,,2016-02-01 11:41:28,2016-02-02 00:00:00,14.4
5AFE2BF54983490EDA216813BE12DF1CE98E763A,bridge,responses,us,,,2016-02-01 00:00:00,2016-02-01 03:38:47,0.9
5AFE2BF54983490EDA216813BE12DF1CE98E763A,bridge,responses,us,,,2016-02-01 11:41:28,2016-02-02 00:00:00,4.8

comment:23 Changed 4 years ago by karsten

That's fantastic, dcf! Thanks for doing this analysis!

Is it easy for you to re-run the analysis with more data? I just uploaded some more data from February 1 to around February 12 that I already had around here. And I could extract more data from the database if you think it would be useful. Just let me know how many weeks or months you can process, and I'll get the data for you. Who knows, maybe this approach works much better or much worse for February 1, 2016 than for other days.

Regarding those rows with duplicate fingerprints, note that they contain different timestamps. I think the best thing would be to sum up all values coming from the same bridge. Let me know if that doesn't make sense, and I'll think harder what to do there.
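
A sketch of that summing step in Python, using the duplicate rows quoted in comment:22. The column names are my reading of the CSV layout and should be treated as an assumption, as should the choice to group by the calendar day of the interval start.

```python
import csv
from collections import defaultdict

# Assumed column order of the uploaded CSV rows.
FIELDS = ["fingerprint", "node", "metric", "country", "transport",
          "version", "stats_start", "stats_end", "val"]

def sum_by_bridge(lines):
    """Sum val over rows that differ only in their stats interval."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines, fieldnames=FIELDS):
        day = row["stats_start"][:10]  # assumption: group by start date
        key = (row["fingerprint"], day, row["metric"],
               row["country"], row["transport"], row["version"])
        totals[key] += float(row["val"])
    return dict(totals)

rows = """\
5AFE2BF54983490EDA216813BE12DF1CE98E763A,bridge,responses,,,,2016-02-01 00:00:00,2016-02-01 03:38:47,1.8
5AFE2BF54983490EDA216813BE12DF1CE98E763A,bridge,responses,,,,2016-02-01 11:41:28,2016-02-02 00:00:00,14.4
5AFE2BF54983490EDA216813BE12DF1CE98E763A,bridge,responses,us,,,2016-02-01 00:00:00,2016-02-01 03:38:47,0.9
5AFE2BF54983490EDA216813BE12DF1CE98E763A,bridge,responses,us,,,2016-02-01 11:41:28,2016-02-02 00:00:00,4.8
""".splitlines()

merged = sum_by_bridge(rows)
```

With these four rows the result has two entries, one for the bridge's total and one for "us", each the sum of the two partial intervals.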

Changed 4 years ago by dcf

Attachment: ptcc-time.R added

Changed 4 years ago by dcf

Attachment: ptcc-time-obfs3.png added

Changed 4 years ago by dcf

Attachment: ptcc-time-meek.png added

Changed 4 years ago by dcf

Attachment: ptcc-time-fte.png added

comment:24 in reply to:  23 Changed 4 years ago by dcf

Replying to karsten:

Is it easy for you to re-run the analysis with more data? I just uploaded some more data from February 1 to around February 12 that I already had around here. And I could extract more data from the database if you think it would be useful. Just let me know how many weeks or months you can process, and I'll get the data for you. Who knows, maybe this approach works much better or much worse for February 1, 2016 than for other days.

I edited the script to handle multiple days: attachment:ptcc-time.R. The heavy bars show the lower and upper bounds as before. The connecting lines go through the middle of the bars.

I limited these graphs to 6 countries. It took about 1 minute to run on merged-2016-02.csv.xz. If I did not limit the countries, it would take up about 75% of my RAM and take a long time (I didn't wait for it to finish). I suspect it's because of the giant expand.grid the code does with all combinations of date, country, and transport. Handling more bulk data will probably require a cleverer approach.
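
One possible "cleverer approach", sketched in Python rather than R: only visit the (date, country, transport) combinations that actually occur on some bridge, instead of building the full cross product. This is not a port of ptcc-time.R. The input tuple layout is an assumption, as is the reading that a row with empty country and empty transport is the bridge's daily total; rows with a version set would need to be filtered out beforehand.

```python
from collections import defaultdict

def bounds_by_day(rows):
    """rows: iterable of (day, fingerprint, country, transport, responses).

    Returns {(day, country, transport): (lower, upper)} in absolute
    responses, summed over bridges, using the comment:20 formulas.
    """
    total = defaultdict(float)
    by_cc = defaultdict(dict)
    by_pt = defaultdict(dict)
    for day, fp, cc, pt, val in rows:
        key = (day, fp)
        if cc and not pt:
            by_cc[key][cc] = by_cc[key].get(cc, 0.0) + val
        elif pt and not cc:
            by_pt[key][pt] = by_pt[key].get(pt, 0.0) + val
        elif not cc and not pt:
            total[key] += val
    out = defaultdict(lambda: [0.0, 0.0])
    for key, n in total.items():
        day = key[0]
        for cc, c in by_cc.get(key, {}).items():
            for pt, p in by_pt.get(key, {}).items():
                # n * max(0, p/n + c/n - 1) and n * min(p/n, c/n)
                out[(day, cc, pt)][0] += max(0.0, c + p - n)
                out[(day, cc, pt)][1] += min(c, p)
    return {k: tuple(v) for k, v in out.items()}
```

Memory then scales with the number of observed per-bridge combinations rather than with dates times countries times transports.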

obfs3: attachment:ptcc-time-obfs3.png

meek: attachment:ptcc-time-meek.png

fte: attachment:ptcc-time-fte.png

Changed 4 years ago by dcf

Attachment: ptcc-time-obfs4.png added

comment:25 Changed 4 years ago by nickm

Milestone: Tor: 0.2.8.x-final → Tor: 0.2.9.x-final

These seem like features, or like other stuff unlikely to be possible this month. Bumping them to 0.2.9

comment:26 Changed 4 years ago by nickm

Sponsor: SponsorS-can

Tagging these bridge- and PT- items as S-can.

comment:27 Changed 4 years ago by nickm

Keywords: tor-bridge tor-pt bridgedb-dist 028-triage → tor-bridge, tor-pt, bridgedb-dist, 028-triage

comment:28 Changed 4 years ago by isabela

Points: medium → 3

comment:29 Changed 3 years ago by karsten

I just created #19544 for adding graphs on bridge users by country and transport to Tor Metrics. The discussion of adding new stats to the tor daemon should probably stay here, but the discussion of graphing existing data should go to that new ticket.

comment:30 Changed 3 years ago by nickm

Keywords: nickm-deferred-20161005 added
Milestone: Tor: 0.2.9.x-final → Tor: 0.3.0.x-final

Deferring big/risky-feature things (even the ones I really love!) to 0.3.0. Please argue if I'm wrong.

comment:31 Changed 3 years ago by dgoulet

Keywords: triage-out-030-201612 added
Milestone: Tor: 0.3.0.x-final → Tor: unspecified

Triaged out on December 2016 from 030 to Unspecified.

comment:32 Changed 3 years ago by nickm

Keywords: nickm-deferred-20161005 removed

comment:33 Changed 3 years ago by nickm

Keywords: triage-out-030-201612 removed

comment:34 Changed 3 years ago by nickm

Keywords: 028-triage removed