Add graph on bridge users by country and transport

changed milestone to %Metrics 1.0.0

added component::metrics/website milestone::Metrics 1.0.0 owner::karsten priority::medium resolution::fixed severity::normal status::closed type::enhancement labels

Please review my branch task-19544 (which is optimistically based on [https://gitweb.torproject.org/karsten/metrics-web.git/log/?h=task-19540] and will be rebased once that branch gets merged).

Trac:
Status: new to needs_review

Trac:
Cc: dcf, iwakeh to dcf, iwakeh, mrphs

Looks fine.

Using the reported numbers for deriving (a sample of) a joint distribution of a request being from a certain country and using a certain transport, the bounds follow from basic (discrete) probability theory:

The upper bound follows from the monotonicity of a probability measure.
The lower bound can also be easily derived from the probability axioms and the given condition.
The 'middle' is based on the assumption that country of origin is independent of the request's type of transport, which doesn't have to be the case (as mentioned in the description). This number depicts just one class of special distributions, where the exact value could be computed. Thus, it's good to not display it (at least I didn't see it in the graphs). Is there a reason why it is computed?

Without additional information the bounds cannot be improved.
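
For illustration, a minimal Python sketch of the bounds discussed above, plus the 'middle' value under the independence assumption. The function names and the example numbers are made up for this sketch and are not taken from the branch.

```python
def joint_bounds(country, transport, total):
    """Bounds on the number of users that match both a country and a transport.

    country:   users reported for the given country (C)
    transport: users reported for the given transport (T)
    total:     users reported over all countries and transports (S)
    """
    low = max(0, country + transport - total)   # lower bound
    high = min(country, transport)              # upper bound (monotonicity)
    return low, high


def independent_estimate(country, transport, total):
    """'Middle' value, valid only if country and transport are independent."""
    return country * transport / total if total else 0.0


low, high = joint_bounds(80, 50, 100)
mid = independent_estimate(80, 50, 100)
print(low, mid, high)  # 30 40.0 50
```

The middle value always lies between the two bounds, but it is exact only for the special class of distributions where the two attributes are independent.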

Thanks for this review! It's good to hear that the approach is correct, at least for upper and lower bounds.

The middle estimate could go away if you think it's more likely to confuse people than to provide any new insights. It was helpful to confirm that the sum of users by country and transport is still the same as the sum of users by either of the two. But that's about it. That value is not graphed, and if you fear that people would graph it and conclude it must be accurate, then I'd rather not provide it.

Putting this ticket back to assigned. Thanks again!

Trac:
Owner: N/A to karsten
Status: needs_review to assigned

Please review my branch task-19544-2 (had to rebase to master) which takes out the "middle" part, simplifies the graph a bit, and adds a data page. This branch is currently deployed for testing with the new graph page (select a country to start the magic) and new data page.

Trac:
Status: assigned to needs_review
Cc: dcf, iwakeh, mrphs to dcf, iwakeh, mrphs, dgoulet

Feedback for web pages: While looking at the data page, which zooms in really nicely, I noticed a high peak in mid-June 2015 for US and for All; the same days for randomly chosen countries (es, fr, it, ge, de, eg) do not show the peak. Is there an explanation for the peak in the US numbers?

Regarding the data page: The variables 'country', 'total', 'transport' used in the formulas should be renamed, i.e., have names other than the column names, and be defined. The CSV columns 'country' and 'transport' contain strings, and 'total' is not defined.

(R and SQL review follows)

Replying to iwakeh:

> Feedback for web pages: While looking at the data page, which zooms in really nicely, I noticed a high peak in mid-June 2015 for US and for All; the same days for randomly chosen countries (es, fr, it, ge, de, eg) do not show the peak. Is there an explanation for the peak in the US numbers?

There's no good explanation yet, but I started looking into that in #16555 (moved).

> Regarding the data page: The variables 'country', 'total', 'transport' used in the formulas should be renamed, i.e., have names other than the column names, and be defined. The CSV columns 'country' and 'transport' contain strings, and 'total' is not defined.

Right, how about these definitions?

low: Lower bound of users by country and transport, calculated as max(0, C + T - S): if the sum of the number of users from a given country C and the number of users using a given transport T exceeds the total number of users from all countries and transports S, there must be users from that country and transport. And if that is not the case, 0 is the lower limit.

high: Upper bound of users by country and transport, calculated as min(C, T): there cannot be more users by country and transport than there are users by either of the two numbers.

> (R and SQL review follows)

Cool!

It is a lot of code ... Hope the following makes some sense.

=== SQL

Still looks fine.

Just two typos, and I'd really like to have explicit column aliases in the select-stmts and group-by-clauses. The latter improves readability and safeguards in case of later column reordering. The former makes the SQL independent of PostgreSQL's default column naming scheme. The following diff is just to give an example, as I didn't run it on a database:

diff --git a/modules/clients/init-userstats.sql b/modules/clients/init-userstats.sql
index 9d39c5e..a369575 100644
--- a/modules/clients/init-userstats.sql
+++ b/modules/clients/init-userstats.sql
@@ -373,7 +373,7 @@ BEGIN
           WHERE id = last_id;
 
       -- If the new entry ends right when an existing entry starts and
-      -- there's no gap between when the previousl processed entry ends
+      -- there's no gap between when the previously processed entry ends
       -- and when the new entry starts, merge the new entry with the other
       -- two entries.  This happens by deleting the previous entry and
       -- expanding the subsequent entry to cover all three entries.
@@ -538,7 +538,7 @@ BEGIN
     WHERE aggregated.date = aggregated_bytes_responses.date
     AND aggregated.node = aggregated_bytes_responses.node;
 
-  -- Update results based on notes reporting responses but no bytes.
+  -- Update results based on nodes reporting responses but no bytes.
   UPDATE aggregated
     SET nrh = aggregated_responses_bytes.nrh
     FROM (
@@ -582,7 +582,7 @@ BEGIN
   -- meant to re-compute this query multiple times.
   CREATE TEMPORARY TABLE update2 AS
     SELECT fingerprint, country, transport,
-           DATE(stats_start), SUM(val) AS val
+           DATE(stats_start) as stats_date, SUM(val) AS val
     FROM merged
     WHERE node = 'bridge'
     AND metric = 'responses'
@@ -590,7 +590,7 @@ BEGIN
     -- Note: Comment out the following condition to initialize table!
     AND DATE(stats_start) IN (
         SELECT DISTINCT DATE(stats_start) FROM imported)
-    GROUP BY 1, 2, 3, 4;
+    GROUP BY fingerprint, country, transport, stats_date;
 
   -- Delete all entries from the combined table that we're about to
   -- re-compute.
@@ -603,7 +603,8 @@ BEGIN
   -- transport.  These response numbers will later be transformed into
   -- user number estimates in the combined view.
   INSERT INTO combined_country_transport
-    SELECT country.date, country.country, transport.transport,
+    SELECT country.date as date, country.country as country,
+           transport.transport as transport,
            SUM(GREATEST(0, transport.val + country.val - total.val))
              AS low,
            SUM(LEAST(transport.val, country.val)) AS high
@@ -623,7 +624,7 @@ BEGIN
     AND country.fingerprint = transport.fingerprint
     AND country.fingerprint = total.fingerprint
     AND transport.fingerprint = total.fingerprint
-    GROUP BY 1, 2, 3;
+    GROUP BY date, country, transport;
 
   -- We're done combining new data.
   RAISE NOTICE '% Finishing combine step.', timeofday();

The following comments for stmts not touched by this change seem confusing, unless I remove the 'of seconds'. Does that make sense?

diff --git a/modules/clients/init-userstats.sql b/modules/clients/init-userstats.sql
index 9d39c5e..66edbcc 100644
--- a/modules/clients/init-userstats.sql
+++ b/modules/clients/init-userstats.sql
@@ -119,7 +119,7 @@ CREATE TABLE aggregated (
   -- tech report.
   rrx DOUBLE PRECISION NOT NULL DEFAULT 0,
 
-  -- Total number of seconds of nodes reporting responses, possibly broken
+  -- Total number of nodes reporting responses, possibly broken
   -- down by country, transport, or version if either of them is not ''.
   -- This would be referred to as n(R) in the tech report, though it's not
   -- used there.
@@ -128,7 +128,7 @@ CREATE TABLE aggregated (
   -- Total number of reported bytes.  See h(H) in the tech report.
   hh DOUBLE PRECISION NOT NULL DEFAULT 0,
 
-  -- Total number of seconds of nodes in the status.  See n(N) in the tech
+  -- Total number of nodes in the status.  See n(N) in the tech
   -- report.
   nn DOUBLE PRECISION NOT NULL DEFAULT 0,
 
@@ -136,11 +136,11 @@ CREATE TABLE aggregated (
   -- bytes.  See h(R intersect H) in the tech report.
   hrh DOUBLE PRECISION NOT NULL DEFAULT 0,
 
-  -- Number of seconds of nodes reporting bytes.  See n(H) in the tech
+  -- Number of nodes reporting bytes.  See n(H) in the tech
   -- report.
   nh DOUBLE PRECISION NOT NULL DEFAULT 0,
 
-  -- Number of seconds of nodes reporting responses but no bytes.  See
+  -- Number of nodes reporting responses but no bytes.  See
   -- n(R \ H) in the tech report.
   nrh DOUBLE PRECISION NOT NULL DEFAULT 0
 );

=== R

Why are the top three transports decided by the /difference/ of upper and lower bound? Or did I misread the script? Here is the excerpt from the R script:

    a <- aggregate(list(mid = u$high - u$low),
                   by = list(transport = u$transport), FUN = sum)
    a <- a[order(a$mid, decreasing = TRUE)[1:top], ]

Shouldn't it be highest upper bound and ties decided by highest lower bound?

=== Data explanation (reply to comment 7)

Maybe something like the following (which can be improved):

low: Lower bound of users by country and transport, calculated as sum over all bridges having reports for the given country and transport, i.e., the sum of M(b), where for each bridge b define M(b) := max(0, C(b) + T(b) - S(b)) using the following definitions:

C(b) is the number of users from a given country reported by b
T(b) is the number of users using a given transport reported by b
S(b) is the total number of responses reported by b

Reasoning: If the sum C(b) + T(b) exceeds the total number of users from all countries and transports S(b), there must be users from that country and transport. And if that is not the case, 0 is the lower limit.

high: Upper bound of users by country and transport, calculated as sum over all bridges having reports for the given country and transport, i.e., the sum of m(b), where for each bridge b define m(b) := min(C(b), T(b)), using the definitions from low (above).

Reasoning: there cannot be more users by country and transport than there are users by either of the two numbers.
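
The per-bridge definitions above could be sketched in Python as follows. This is only an illustration: the function name and the three example bridges are invented, and each tuple stands for the hypothetical values (C(b), T(b), S(b)) reported by one bridge.

```python
def combined_bounds(reports):
    """Sum per-bridge bounds over all bridges.

    reports: iterable of (C, T, S) tuples, one per bridge b, where
             C = users from the given country reported by b,
             T = users of the given transport reported by b,
             S = total reported by b.
    """
    reports = list(reports)
    low = sum(max(0, c + t - s) for c, t, s in reports)   # sum of M(b)
    high = sum(min(c, t) for c, t, s in reports)          # sum of m(b)
    return low, high


# Three made-up bridges.
reports = [(80, 50, 100), (10, 20, 100), (40, 40, 40)]
print(combined_bounds(reports))  # (70, 100)

# For comparison: applying the same formulas once to the summed-up numbers.
C = sum(r[0] for r in reports)   # 130
T = sum(r[1] for r in reports)   # 110
S = sum(r[2] for r in reports)   # 240
print(max(0, C + T - S), min(C, T))  # 0 110
```

With these example numbers, summing the bounds per bridge gives a narrower interval (70 to 100) than applying the formulas to the aggregated numbers (0 to 110), which is one reason to spell out the per-bridge sum in the definitions.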

Great review! You spotted a major bug in ordering transports by user numbers that would favor transports with a large range between low and high bound over transports with a larger mean of low and high bound. To give a concrete example, meek has been omitted in graphs, even though we had very precise estimates for meek users. Fixed now!
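
To illustrate the effect with made-up numbers (these are not real estimates, and the helper below is a hypothetical Python sketch rather than the R code in the branch): ranking by the range between bounds drops a precisely estimated transport, while ranking by the mean of the bounds keeps it.

```python
def top_transports(bounds, top, key):
    """Return the names of the `top` transports, ranked by key(low, high)."""
    ranked = sorted(bounds, key=lambda name: key(*bounds[name]), reverse=True)
    return ranked[:top]


# Invented (low, high) bounds per transport.
bounds = {
    "obfs4": (100, 300),
    "meek": (150, 160),   # very precise estimate, small range
    "obfs3": (0, 120),
    "fte": (0, 50),
}

# Old behavior: rank by the difference of upper and lower bound.
by_range = top_transports(bounds, 2, lambda low, high: high - low)
# Alternative: rank by the mean of lower and upper bound.
by_mean = top_transports(bounds, 2, lambda low, high: (low + high) / 2)

print(by_range)  # ['obfs4', 'obfs3']
print(by_mean)   # ['obfs4', 'meek']
```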

Regarding the "seconds of nodes" phrase, these are really the number of seconds that nodes were running on a given day. The reason is that a node running for only an hour or two should be counted less in estimates than a node running the whole day. I'm inclined to leave this text unchanged for now, because we are going to rewrite that code anyway to rely less on single reports and to include confidence intervals.

Pushed --squash commits to task-19544-2, so that you can see what I changed. Then squashed and reordered commits and pushed to master.

I think that concludes this ticket. We can always add more graphs in separate tickets as needed, but I think we're good for now. Closing. Many thanks!

Trac:
Resolution: N/A to fixed
Status: needs_review to closed

Trac:
Milestone: N/A to Metrics 1.0.0
