Opened 2 years ago

Closed 2 years ago

#26868 closed defect (not a bug)

How does metrics get bridge statistics at a granularity of 1 user?

Reported by: teor
Owned by: metrics-team
Priority: Medium
Component: Metrics/Statistics
Severity: Normal

Description

Bridge pluggable transport statistics are rounded up to the nearest multiple of 8 users.

So how do the graphs end up with a granularity of 1 user?
https://metrics.torproject.org/userstats-bridge-transport.html?start=2018-04-14&end=2018-07-20&transport=obfs2&transport=websocket&transport=fte&transport=scramblesuit&transport=snowflake

Change History (9)

comment:1 Changed 2 years ago by teor

For example, see:

bridge-ip-transports <OR>=8,obfs3=8,obfs4=8,scramblesuit=8
...
bridge-ip-transports <OR>=16,obfs3=9040,obfs4=74736

in https://collector.torproject.org/recent/bridge-descriptors/extra-infos/2018-07-16-01-09-00-extra-infos
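
To check the two lines quoted above, here is a minimal Python sketch that parses a bridge-ip-transports line and confirms that every count is a multiple of 8. The function name parse_bridge_ip_transports is made up for illustration and is not part of any Tor or Metrics library.

```python
def parse_bridge_ip_transports(line):
    """Parse a 'bridge-ip-transports' line into a {transport: count} dict.

    Hypothetical helper for illustration only; not Tor Metrics code.
    """
    keyword, _, rest = line.strip().partition(" ")
    if keyword != "bridge-ip-transports":
        raise ValueError("not a bridge-ip-transports line")
    counts = {}
    for item in rest.split(","):
        if not item:
            continue
        name, _, value = item.partition("=")
        counts[name] = int(value)
    return counts


# The two example lines quoted in this comment.
small = parse_bridge_ip_transports(
    "bridge-ip-transports <OR>=8,obfs3=8,obfs4=8,scramblesuit=8")
large = parse_bridge_ip_transports(
    "bridge-ip-transports <OR>=16,obfs3=9040,obfs4=74736")

# Every reported count in both lines is a multiple of 8.
all_multiples_of_8 = all(
    v % 8 == 0 for v in list(small.values()) + list(large.values()))
```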

comment:2 Changed 2 years ago by arma

I can't tell from the graph you gave us that there are any lines with granularity of 1. Here's one where it's easier to see:
https://metrics.torproject.org/userstats-bridge-transport.html?start=2018-04-14&end=2018-07-20&transport=websocket&transport=snowflake

I look forward to hearing from Karsten or Iain on this one, but here is a reason why it could make sense:

If you have a multiple of 8 that is much higher than 8, then it makes a lot of sense to estimate that you're in the middle of your top multiple (i.e. if it says 80 figure 76, if it says 88 figure 84, etc). But if it says 8, it's probably not the case that you had 4 users. I could see an argument for betting that most of the 8's represent one user, and aggregating them that way.
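
The guess above can be sketched in Python as follows. This assumes that bridges round counts up to the next multiple of 8, and it only illustrates the "middle of the top multiple" heuristic from this comment; it is not a description of what Metrics actually does. All function names are hypothetical.

```python
BIN = 8  # assumed bin size used by bridges when reporting counts


def round_up_to_bin(true_count, bin_size=BIN):
    """Round a true count up to the next multiple of bin_size (assumed behavior)."""
    if true_count <= 0:
        return 0
    return -(-true_count // bin_size) * bin_size


def possible_range(reported, bin_size=BIN):
    """Return the (low, high) range of true counts that would report this value."""
    if reported == 0:
        return (0, 0)
    return (reported - bin_size + 1, reported)


def midpoint_guess(reported, bin_size=BIN):
    """Guess the middle of the top bin, e.g. 80 -> 76 and 88 -> 84."""
    return max(reported - bin_size // 2, 0)
```

For a reported 8, midpoint_guess returns 4, while possible_range(8) is (1, 8). That is the case where the midpoint is least convincing, which is why betting on a single user might be defensible there.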

comment:3 in reply to:  2 Changed 2 years ago by teor

Replying to arma:

> I can't tell from the graph you gave us that there are any lines with granularity of 1. Here's one where it's easier to see:
> https://metrics.torproject.org/userstats-bridge-transport.html?start=2018-04-14&end=2018-07-20&transport=websocket&transport=snowflake
>
> I look forward to hearing from Karsten or Iain on this one, but here is a reason why it could make sense:
>
> If you have a multiple of 8 that is much higher than 8, then it makes a lot of sense to estimate that you're in the middle of your top multiple (i.e. if it says 80 figure 76, if it says 88 figure 84, etc). But if it says 8, it's probably not the case that you had 4 users. I could see an argument for betting that most of the 8's represent one user, and aggregating them that way.

Ah, but the snowflake graph shows 0, 1, 2, 3, 4, 5 and 6 users, not just 0, 4, and 8.

comment:4 Changed 2 years ago by karsten

We're soon going to publish a specification for reproducing the data behind all our graphs (#26857), which should answer this question. You're pretty much our target audience for that document, so I'd appreciate it if you could wait a few more days for it to be available and then let us know whether this question is sufficiently answered and what other questions remain.

Though if you really need a faster answer, I'll write something here. Let me know!

comment:5 Changed 2 years ago by teor

A few days is fine. (I doubt it's a security issue, because the underlying data is definitely rounded to 8, even for snowflake.)

@type bridge-extra-info 1.3
extra-info flakey 5481936581E23D2D178105D44DB6915AB06BFB7F
master-key-ed25519 cefgVl0wPLbQew8dIXs79HPheEaiETs24HPp0FXNWAI
published 2018-07-16 23:55:23
write-history 2018-07-16 17:36:08 (86400 s) 1419033600,589701120,467485696
read-history 2018-07-16 17:36:08 (86400 s) 1443484672,605205504,493749248
dirreq-write-history 2018-07-16 17:36:08 (86400 s) 10368000,11821056,4459520
dirreq-read-history 2018-07-16 17:36:08 (86400 s) 86016,832512,69632
geoip-db-digest FF83AD73DE7672C77EDF8888F4B241642C7C90F7
geoip6-db-digest B1CDBFEB7C88F82EF3B5289CAFEED1321FA4693F
dirreq-stats-end 2018-07-16 16:26:00 (86400 s)
dirreq-v3-ips ma=8
dirreq-v3-reqs ma=8
dirreq-v3-resp ok=8,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=8,busy=0
dirreq-v3-direct-dl complete=0,timeout=0,running=0
dirreq-v3-tunneled-dl complete=8,timeout=0,running=0
transport snowflake
bridge-stats-end 2018-07-16 16:26:22 (86400 s)
bridge-ips ma=8
bridge-ip-versions v4=8,v6=0
bridge-ip-transports snowflake=8
router-digest-sha256 g0Kd3B3uT2ImV8DUALqvEGu9oJciwTfVlh+d6CfWpk8
router-digest 4D3E56AE7C3388EAC2245FB56E3DB27B7BF25120

comment:6 Changed 2 years ago by karsten

Please take a look at the "Bridge users" section of this shiny new Tor Metrics page: https://metrics.torproject.org/reproducible-metrics.html#bridge-users

Any feedback on that page would be highly appreciated, though probably in a new ticket or new thread on the metrics-team@ mailing list.

If the question asked in this ticket remains unanswered, please let us know here, and we'll see if we can explain it better either on that page or here.

comment:7 Changed 2 years ago by teor

So, I believe the answer to my question is:

"We approximate directory request numbers by multiplying the fraction of unique IP addresses from a given country, transport, or IP version with the total number of successful requests."

But I think there are two missing steps:

  • Metrics appears to round/truncate/ceiling client numbers to the nearest integer
  • You say that you "Skip dates where frac is smaller than 10% and hence too low for a robust estimate"
    • are the snowflake bridges less than 10% of total bridge usage? That could be why their numbers vary so much.
    • how do you calculate 10% of bridge usage? (Bridges don't have bandwidth, so do you use unique IP addresses?)

comment:8 in reply to:  7 ; Changed 2 years ago by karsten

Replying to teor:

> So, I believe the answer to my question is:
>
> "We approximate directory request numbers by multiplying the fraction of unique IP addresses from a given country, transport, or IP version with the total number of successful requests."

That would produce smaller numbers than 8, too.

Another answer is this part: "Split observations to the covered UTC dates by assuming a linear distribution of observations."

We'd have to look at the raw data to say which one is the better answer. But I assume your question is mostly answered by knowing that the small numbers are not present in the original data.
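
As an illustration of the "linear distribution" step, here is a minimal Python sketch that splits one reported value across the UTC dates its interval covers, in proportion to the seconds falling on each date. The function name split_by_utc_date is hypothetical, naive datetimes are treated as UTC, and this is not the actual Metrics code.

```python
from datetime import datetime, timedelta


def split_by_utc_date(end, interval_seconds, value):
    """Split value over the UTC dates covered by [end - interval, end).

    Each date gets a share proportional to the seconds of the interval
    that fall on it. Naive datetimes are assumed to be in UTC.
    """
    start = end - timedelta(seconds=interval_seconds)
    shares = {}
    cursor = start
    while cursor < end:
        next_midnight = (datetime(cursor.year, cursor.month, cursor.day)
                         + timedelta(days=1))
        segment_end = min(next_midnight, end)
        seconds = (segment_end - cursor).total_seconds()
        shares[cursor.date().isoformat()] = value * seconds / interval_seconds
        cursor = segment_end
    return shares


# Interval taken from the snowflake descriptor earlier in this ticket:
# bridge-stats-end 2018-07-16 16:26:22 (86400 s), with a reported value of 8.
shares = split_by_utc_date(datetime(2018, 7, 16, 16, 26, 22), 86400, 8)
```

With that interval, the reported 8 is divided between 2018-07-15 and 2018-07-16, so neither date is assigned a multiple of 8.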

> But I think there are two missing steps:
>
>   • Metrics appears to round/truncate/ceiling client numbers to the nearest integer

Right, we're using integer truncation here. We should probably document that under Step 4 of the Relay users section.
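
To show how the multiplication and the integer truncation can produce values below 8 at a granularity of 1, here is a minimal Python sketch. The input numbers are invented, the function name approximate_requests is hypothetical, and the sketch leaves out the other steps of the specification (such as frac), so treat it as an illustration rather than the real computation.

```python
def approximate_requests(ips_by_transport, total_successful_requests):
    """Multiply each transport's fraction of unique IPs by total requests."""
    total_ips = sum(ips_by_transport.values())
    if total_ips == 0:
        return {name: 0.0 for name in ips_by_transport}
    return {
        name: total_successful_requests * ips / total_ips
        for name, ips in ips_by_transport.items()
    }


# Invented example: all reported IP counts are multiples of 8.
reported_ips = {"<OR>": 16, "obfs4": 8, "snowflake": 8}
estimates = approximate_requests(reported_ips, total_successful_requests=10)

# Integer truncation drops the fractional part.
truncated = {name: int(value) for name, value in estimates.items()}
```

Here the inputs are all multiples of 8, yet the truncated results are 5, 2, and 2.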

>   • You say that you "Skip dates where frac is smaller than 10% and hence too low for a robust estimate"
>     • are the snowflake bridges less than 10% of total bridge usage? That could be why their numbers vary so much.
>     • how do you calculate 10% of bridge usage? (Bridges don't have bandwidth, so do you use unique IP addresses?)

Wait, no, frac is the "estimated fraction of reported directory-request statistics". It is unrelated to snowflake in particular and refers to all bridge usage. The formula for computing frac is specified in Step 3 of the Relay users section.

Please let me know if this makes more sense now, and if not, how we can improve it. Thanks!

comment:9 in reply to:  8 Changed 2 years ago by teor

Resolution: not a bug
Status: new → closed

Replying to karsten:

> Replying to teor:
>
> > ...
> >
> >   • You say that you "Skip dates where frac is smaller than 10% and hence too low for a robust estimate"
> >     • are the snowflake bridges less than 10% of total bridge usage? That could be why their numbers vary so much.
> >     • how do you calculate 10% of bridge usage? (Bridges don't have bandwidth, so do you use unique IP addresses?)
>
> Wait, no, frac is the "estimated fraction of reported directory-request statistics". It is unrelated to snowflake in particular and refers to all bridge usage. The formula for computing frac is specified in Step 3 of the Relay users section.
>
> Please let me know if this makes more sense now, and if not, how we can improve it. Thanks!

Ok, that makes sense. Thank you for explaining!
