Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#19544 closed enhancement (fixed)

Add graph on bridge users by country and transport

Reported by: karsten Owned by: karsten
Priority: Medium Milestone: Metrics 1.0.0
Component: Metrics/Website Version:
Severity: Normal Keywords:
Cc: dcf, iwakeh, mrphs, dgoulet Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

The following idea came up in the discussion to provide "users-per-transport-per-country" statistics for obfsbridges. This ticket is about graphing existing data, whereas the discussion of reporting new data will continue on #10218. Quoting a bit from that ticket to have enough context here:

It turns out that most large bridges (4 out of 5 on February 1, 2016) only see noteworthy usage via a single transport or have requests via one transport dominating the others in numbers (74% on the 5th large bridge on February 1, 2016).

We could assume that the distribution by country is the same for all transports, that is, if a fraction CC (in [0..1]) of requests came from a given country and a fraction PT (also in [0..1]) of requests came in via a given transport, a fraction CC * PT of requests can be attributed to that country and transport. But that assumption may be wrong.

What we could also do as first approximation is find a lower and upper bound of users by country and transport. The lower bound would probably be defined as something like max(0, PT + CC - 1) (not just 0 to account for cases where CC > 1 - PT) and the upper bound as min(PT, CC), even though I could be convinced that other formulas are even more correct.
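To make the three candidate formulas concrete, here is a minimal sketch in Python. It is not the code behind the graph; the function name `bounds` and the example shares are assumptions made for illustration only.

```python
def bounds(cc, pt):
    """Return (low, independent, high) for the share of requests that came
    from one country AND used one transport, given only the two shares."""
    if not (0.0 <= cc <= 1.0 and 0.0 <= pt <= 1.0):
        raise ValueError("cc and pt must be fractions in [0, 1]")
    low = max(0.0, pt + cc - 1.0)  # overlap is forced once the shares sum past 1
    mid = cc * pt                  # only valid if country and transport are independent
    high = min(pt, cc)             # the joint share cannot exceed either share
    return low, mid, high


# Hypothetical bridge: 80% of requests from one country, 74% via one transport.
low, mid, high = bounds(0.80, 0.74)
print(f"low={low:.2f} independent={mid:.3f} high={high:.2f}")
```

With two small shares (say 0.10 and 0.20) the lower bound collapses to 0, which is why the max(0, ...) is needed.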

dcf kindly graphed responses by country and transport on #10218 here and here, indicating that this approach may produce actually useful results.

The next step was to perform these calculations in the database and transform the number of responses into estimated user numbers. I finally found time to work on that step. Here's a graph on Tor Metrics which is still "hidden" under "Advanced" until I'm more confident that it's doing the right thing.

https://metrics.torproject.org/userstats-bridge-combined.html

Example (image link to that graph, may look different over time):

https://metrics.torproject.org/userstats-bridge-combined.png

Next steps:

  • Become more confident in the particular math and code behind this graph. Once that's done, move the graph to "Basic" so that people will find it. I'm attaching a branch in a minute.
  • Make the user interface better. For example, we could also graph top countries by transport, not just top transports by all countries or top transports in a given country. Maybe we can graph other things using this data as well.
  • Make the raw data available. There's a .csv file behind this graph, but I didn't put that on Tor Metrics yet, because we might have to change the data format and lack a versioning system to do that. I'm putting up a snapshot of that file (36.4M) for review.

Change History (10)

comment:1 Changed 3 years ago by karsten

Status: new → needs_review

Please review my branch task-19544 (which is optimistically based on https://gitweb.torproject.org/karsten/metrics-web.git/log/?h=task-19540 and will be rebased once that branch gets merged).

comment:2 Changed 3 years ago by mrphs

Cc: mrphs added

comment:3 Changed 3 years ago by iwakeh

Looks fine.

Using the reported numbers for deriving (a sample of) a joint distribution of a request being from a certain country and using a certain transport, the bounds follow from basic (discrete) probability theory:

  • The upper bound follows from the monotonicity of a probability measure.
  • The lower bound can also be easily derived using the probability axioms and the given condition.
  • The 'middle' is based on the assumption that country of origin is independent of the request's type of transport, which doesn't have to be the case (as mentioned in the description). This number depicts just one class of special distributions, where the exact value could be computed. Thus, it's good to not display it (at least I didn't see it in the graphs). Is there a reason why it is computed?

Without additional information the bounds cannot be improved.
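For readers who want the steps spelled out, the argument can be sketched as follows. The event names A and B are introduced here for illustration; CC and PT are the shares from the description. These are the classical Fréchet inequalities.

```latex
% A: the request comes from the given country, P(A) = CC
% B: the request uses the given transport,     P(B) = PT
\begin{align*}
A \cap B \subseteq A,\ A \cap B \subseteq B
  &\;\Rightarrow\; P(A \cap B) \le \min\bigl(P(A), P(B)\bigr)
  && \text{(monotonicity)} \\
P(A \cup B) = P(A) + P(B) - P(A \cap B) \le 1
  &\;\Rightarrow\; P(A \cap B) \ge \max\bigl(0,\ P(A) + P(B) - 1\bigr)
  && \text{(additivity, } P \ge 0\text{)} \\
A,\ B \text{ independent}
  &\;\Rightarrow\; P(A \cap B) = P(A)\,P(B)
  && \text{(the `middle')}
\end{align*}
```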

comment:4 Changed 3 years ago by karsten

Owner: set to karsten
Status: needs_review → assigned

Thanks for this review! It's good to hear that the approach is correct, at least for upper and lower bounds.

The middle estimate could go away if you think it's more likely to confuse people than provide any new insights. It was helpful to confirm that the sum of users by country and transport is still the same as the sum of users by either of the two. But that's about it. That value is not graphed, and if you fear that people would graph it and conclude it must be accurate, then I'd rather not want to provide it.
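The consistency check mentioned here can be illustrated with a small sketch. It assumes the middle estimate is C * T / S per bridge (the independence assumption from the description); the counts and variable names are made up and this is not the code from the branch.

```python
# Hypothetical response counts reported by one bridge on one day.
by_country = {"de": 600.0, "us": 300.0, "fr": 100.0}
by_transport = {"obfs4": 740.0, "meek": 160.0, "obfs3": 100.0}
total = sum(by_country.values())  # by_transport sums to the same total

# Middle estimate under independence: country count times transport share.
mid = {
    (c, t): cv * tv / total
    for c, cv in by_country.items()
    for t, tv in by_transport.items()
}

# Summing the estimate over all transports should give back the country count,
# and summing over everything should give back the total.
per_country = {c: sum(mid[(c, t)] for t in by_transport) for c in by_country}
print(per_country, sum(mid.values()))
```

The same check does not hold for the bounds: summed over all combinations, the lower bounds can fall below the total and the upper bounds can exceed it.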

Putting this ticket back to assigned. Thanks again!

comment:5 Changed 3 years ago by karsten

Cc: dgoulet added
Status: assigned → needs_review

Please review my branch task-19544-2 (had to rebase to master) which takes out the "middle" part, simplifies the graph a bit, and adds a data page. This branch is currently deployed for testing with the new graph page (select a country to start the magic) and new data page.

comment:6 Changed 3 years ago by iwakeh

Feedback for web-pages:
While looking at the data page, which zooms in really nicely, I noticed a high peak in mid-June 2015 for
US and for All; the same days for the randomly chosen countries es, fr, it, ge, de, eg do not show the peak.
Is there an explanation for the peak in the US numbers?

Regarding the data page:
The variables 'country', 'total', 'transport' used in the formulas should be renamed, i.e., given names that differ from the column names, and defined. The CSV columns 'country' and 'transport' contain strings and 'total' is not defined.

(R and SQL review follows)

comment:7 in reply to:  6 Changed 3 years ago by karsten

Replying to iwakeh:

Feedback for web-pages:
While looking at the data page, which zooms in really nicely, I noticed a high peak in mid-June 2015 for
US and for All; the same days for the randomly chosen countries es, fr, it, ge, de, eg do not show the peak.
Is there an explanation for the peak in the US numbers?

There's no good explanation yet, but I started looking into that in #16555.

Regarding the data page:
The variables 'country', 'total', 'transport' used in the formulas should be renamed, i.e., given names that differ from the column names, and defined. The CSV columns 'country' and 'transport' contain strings and 'total' is not defined.

Right, how about these definitions?

low: Lower bound of users by country and transport, calculated as max(0, C + T - S): if the number of users from a given country C plus the number of users using a given transport T exceeds the total number of users from all countries and transports S, there must be users from that country and transport. And if that is not the case, 0 is the lower limit.

high: Upper bound of users by country and transport, calculated as min(C, T): there cannot be more users by country and transport than there are users from that country or users of that transport.

(R and SQL review follows)

Cool!

comment:8 Changed 3 years ago by iwakeh

It is a lot of code ...
Hope the following makes some sense.

SQL

Still looks fine.

Just two typos, and I'd really like to have explicit column aliases in the select-stmts and group-by-clauses. The latter improves readability and safeguards in case of later column reordering. The former makes the SQL independent of postgresql's default column naming scheme. The following diff is just to give an example as I didn't run it on a db:

diff --git a/modules/clients/init-userstats.sql b/modules/clients/init-userstats.sql
index 9d39c5e..a369575 100644
--- a/modules/clients/init-userstats.sql
+++ b/modules/clients/init-userstats.sql
@@ -373,7 +373,7 @@ BEGIN
           WHERE id = last_id;
 
       -- If the new entry ends right when an existing entry starts and
-      -- there's no gap between when the previousl processed entry ends
+      -- there's no gap between when the previously processed entry ends
       -- and when the new entry starts, merge the new entry with the other
       -- two entries.  This happens by deleting the previous entry and
       -- expanding the subsequent entry to cover all three entries.
@@ -538,7 +538,7 @@ BEGIN
     WHERE aggregated.date = aggregated_bytes_responses.date
     AND aggregated.node = aggregated_bytes_responses.node;
 
-  -- Update results based on notes reporting responses but no bytes.
+  -- Update results based on nodes reporting responses but no bytes.
   UPDATE aggregated
     SET nrh = aggregated_responses_bytes.nrh
     FROM (
@@ -582,7 +582,7 @@ BEGIN
   -- meant to re-compute this query multiple times.
   CREATE TEMPORARY TABLE update2 AS
     SELECT fingerprint, country, transport,
-           DATE(stats_start), SUM(val) AS val
+           DATE(stats_start) as stats_date, SUM(val) AS val
     FROM merged
     WHERE node = 'bridge'
     AND metric = 'responses'
@@ -590,7 +590,7 @@ BEGIN
     -- Note: Comment out the following condition to initialize table!
     AND DATE(stats_start) IN (
         SELECT DISTINCT DATE(stats_start) FROM imported)
-    GROUP BY 1, 2, 3, 4;
+    GROUP BY fingerprint, country, transport, stats_date;
 
   -- Delete all entries from the combined table that we're about to
   -- re-compute.
@@ -603,7 +603,8 @@ BEGIN
   -- transport.  These response numbers will later be transformed into
   -- user number estimates in the combined view.
   INSERT INTO combined_country_transport
-    SELECT country.date, country.country, transport.transport,
+    SELECT country.date as date, country.country as country,
+           transport.transport as transport,
            SUM(GREATEST(0, transport.val + country.val - total.val))
              AS low,
            SUM(LEAST(transport.val, country.val)) AS high
@@ -623,7 +624,7 @@ BEGIN
     AND country.fingerprint = transport.fingerprint
     AND country.fingerprint = total.fingerprint
     AND transport.fingerprint = total.fingerprint
-    GROUP BY 1, 2, 3;
+    GROUP BY date, country, transport;
 
   -- We're done combining new data.
   RAISE NOTICE '% Finishing combine step.', timeofday();

The following comments, on statements not touched by this change, seem confusing to me unless I remove the 'of seconds'.
Does that make sense?

diff --git a/modules/clients/init-userstats.sql b/modules/clients/init-userstats.sql
index 9d39c5e..66edbcc 100644
--- a/modules/clients/init-userstats.sql
+++ b/modules/clients/init-userstats.sql
@@ -119,7 +119,7 @@ CREATE TABLE aggregated (
   -- tech report.
   rrx DOUBLE PRECISION NOT NULL DEFAULT 0,
 
-  -- Total number of seconds of nodes reporting responses, possibly broken
+  -- Total number of nodes reporting responses, possibly broken
   -- down by country, transport, or version if either of them is not ''.
   -- This would be referred to as n(R) in the tech report, though it's not
   -- used there.
@@ -128,7 +128,7 @@ CREATE TABLE aggregated (
   -- Total number of reported bytes.  See h(H) in the tech report.
   hh DOUBLE PRECISION NOT NULL DEFAULT 0,
 
-  -- Total number of seconds of nodes in the status.  See n(N) in the tech
+  -- Total number of nodes in the status.  See n(N) in the tech
   -- report.
   nn DOUBLE PRECISION NOT NULL DEFAULT 0,
 
@@ -136,11 +136,11 @@ CREATE TABLE aggregated (
   -- bytes.  See h(R intersect H) in the tech report.
   hrh DOUBLE PRECISION NOT NULL DEFAULT 0,
 
-  -- Number of seconds of nodes reporting bytes.  See n(H) in the tech
+  -- Number of nodes reporting bytes.  See n(H) in the tech
   -- report.
   nh DOUBLE PRECISION NOT NULL DEFAULT 0,
 
-  -- Number of seconds of nodes reporting responses but no bytes.  See
+  -- Number of nodes reporting responses but no bytes.  See
   -- n(R \ H) in the tech report.
   nrh DOUBLE PRECISION NOT NULL DEFAULT 0
 );

R

Why are the top three transports decided by the /difference/ of upper and lower bound?
Or did I misread the script?
Here is the excerpt from the R script:

    a <- aggregate(list(mid = u$high - u$low),
                   by = list(transport = u$transport), FUN = sum)
    a <- a[order(a$mid, decreasing = TRUE)[1:top], ]

Shouldn't it be highest upper bound and ties decided by highest lower bound?
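To show how the two orderings can disagree, here is a rough sketch in Python (not the R script itself). The function names and all numbers are hypothetical; `top_by_range` mimics the excerpt, `top_by_high` implements the ordering suggested in the question above.

```python
from collections import defaultdict


def _totals(rows):
    """Sum low and high per transport; rows are (transport, low, high)."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for transport, low, high in rows:
        totals[transport][0] += low
        totals[transport][1] += high
    return totals


def top_by_range(rows, top=3):
    """Ordering as in the excerpt: largest summed (high - low) first."""
    totals = _totals(rows)
    ranked = sorted(totals, key=lambda t: totals[t][1] - totals[t][0], reverse=True)
    return ranked[:top]


def top_by_high(rows, top=3):
    """Suggested ordering: highest upper bound first, ties by lower bound."""
    totals = _totals(rows)
    ranked = sorted(totals, key=lambda t: (totals[t][1], totals[t][0]), reverse=True)
    return ranked[:top]


# Hypothetical bounds: the first transport has a very narrow range.
rows = [("meek", 900, 910), ("obfs4", 400, 1200), ("obfs3", 100, 600), ("fte", 0, 300)]
print(top_by_range(rows))  # narrow-range transport drops out
print(top_by_high(rows))   # narrow-range transport is kept
```

In this made-up data the transport with the most precise estimate is exactly the one that the range-based ordering leaves out.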

Data explanation

(reply to comment 7) Maybe something like the following (which can be improved):

low: Lower bound of users by country and transport, calculated as sum over all bridges having reports for the given country and transport, i.e., the sum of M(b), where for each bridge b define M(b) := max(0, C(b) + T(b) - S(b)) using the following definitions:

  • C(b) is the number of users from a given country reported by b
  • T(b) is the number of users using a given transport reported by b
  • S(b) is the total number of responses reported by b

Reasoning: If the sum C(b) + T(b) exceeds the total number of users from all countries and transports S(b), there must be users from that country and transport. And if that is not the case, 0 is the lower limit.

high: Upper bound of users by country and transport, calculated as sum over all bridges having reports for the given country and transport, i.e., the sum of m(b), where for each bridge b define m(b):=min(C(b), T(b)) where we use the definitions from low (above).
Reasoning: there cannot be more users by country and transport than there are users by either of the two numbers.
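As a worked example of these two definitions, the per-bridge sums could be sketched like this in Python. It mirrors the SUM(GREATEST(...)) and SUM(LEAST(...)) expressions in the SQL diff above, but the function name `combined_bounds` and the three bridges are invented for illustration.

```python
def combined_bounds(bridges):
    """Sum per-bridge bounds for one country/transport pair.

    bridges: list of (c, t, s) tuples, one per bridge b, where
      c = C(b), t = T(b), s = S(b) as defined above.
    """
    low = sum(max(0, c + t - s) for c, t, s in bridges)   # sum of M(b)
    high = sum(min(c, t) for c, t, s in bridges)          # sum of m(b)
    return low, high


# Three hypothetical bridges reporting for the same country and transport.
bridges = [(80, 74, 100), (10, 50, 200), (30, 30, 30)]
low, high = combined_bounds(bridges)
print(low, high)
```

Note that the bounds are taken per bridge and then summed; applying max and min once to the summed C, T and S would generally give a wider range.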

comment:9 Changed 3 years ago by karsten

Resolution: fixed
Status: needs_review → closed

Great review! You spotted a major bug in ordering transports by user numbers that would favor transports with a large range between low and high bound over transports with a larger mean of low and high bound. To give a concrete example, meek has been omitted in graphs, even though we had very precise estimates for meek users. Fixed now!

Regarding the "seconds of nodes" phrase, these are really the number of seconds that nodes were running on a given day. The reason is that a node running for only an hour or two should be counted less in estimates than a node running the whole day. I'm inclined to leave this text unchanged for now, because we are going to rewrite that code anyway to rely less on single reports and to include confidence intervals.
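A tiny illustration of the "seconds of nodes" idea, with made-up uptimes. This only shows the unit; it is not how init-userstats.sql computes its columns, and `node_seconds` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def node_seconds(uptimes):
    """Sum the seconds that nodes were running on one day.

    uptimes: list of (start, end) offsets in seconds within that day.
    """
    return sum(end - start for start, end in uptimes)


# Hypothetical day: one bridge up all day, one bridge up for two hours.
uptimes = [(0, SECONDS_PER_DAY), (3600, 3 * 3600)]
n = node_seconds(uptimes)
print(n, n / SECONDS_PER_DAY)  # the second bridge adds only 1/12 of a full node
```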

Pushed --squash commits to task-19544-2, so that you can see what I changed. Then squashed and reordered commits and pushed to master.

I think that concludes this ticket. We can always add more graphs in separate tickets as needed, but I think we're good for now. Closing. Many thanks!

comment:10 Changed 3 years ago by iwakeh

Milestone: Metrics 1.0.0