Opened 8 years ago

Closed 8 years ago

#3338 closed task (implemented)

Why do our user graphs have these huge spikes?

Reported by: arma Owned by:
Priority: Medium Milestone:
Component: Metrics/Analysis Version:
Severity: Keywords:
Cc: karsten Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

There were three usage spikes in April and one at the end of May. They happen in many countries at once. So my assumption is that they're errors in our user counting algorithm.

The ones in April happened around the same time as #2704 blew up, so we thought they were related. The one at the end of May shouldn't be related.

So what is changing in the measurements we gather that makes it spike?

Child Tickets

Attachments (2)

daily-users.pdf (25.3 KB) - added by karsten 8 years ago.
Graph showing the possible reason for spikes in user numbers
dir-bytes-estimate.pdf (34.0 KB) - added by karsten 8 years ago.
Estimated vs. extrapolated directory bytes


Change History (20)

comment:1 Changed 8 years ago by arma

< arma> karsten: can you remind me in a few sentences how we conclude the number of users on our user graphs? i remember we had lots of approaches to choose from, but i don't remember which we use now.

< karsten> arma: Here's how we estimate daily users: We add up directory requests for network status consensuses to all directory mirrors that report these numbers to us (which is an increasing number of relays with 0.2.3.x).
< karsten> We also estimate how many bytes these directory mirrors, as well as all directory mirrors in the network, have spent on answering directory requests.
< karsten> We can then extrapolate the total number of directory requests in the network. We divide this number by 10, assuming that every client makes 10 directory requests per day. That number is our daily user estimate.
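The three steps karsten describes can be sketched as follows. This is an illustrative sketch, not Tor's actual metrics code: the function and constant names are made up, and the reporting fraction is taken as a given input (comment 3 explains how it is derived from directory bytes).

```python
# Illustrative sketch of the daily-user estimate described above.
# Names are hypothetical; only the arithmetic follows the comment.

REQUESTS_PER_CLIENT_PER_DAY = 10  # assumption stated in the comment

def estimate_daily_users(reported_requests, reporting_fraction):
    """Extrapolate consensus downloads reported by directory mirrors
    to the whole network, then divide by the assumed number of
    directory requests each client makes per day."""
    total_requests = reported_requests / reporting_fraction
    return total_requests / REQUESTS_PER_CLIENT_PER_DAY

# e.g. 500,000 reported requests from mirrors covering 10% of dir
# traffic extrapolates to 5,000,000 requests, i.e. ~500,000 users
print(estimate_daily_users(500_000, 0.10))
```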

comment:2 Changed 8 years ago by arma

Karsten: how does the 'estimate how many bytes these directory mirrors have spent on answering directory requests' figure in? I can see how to do step 1 (add up directory requests) and step 3 (extrapolate to total directory requests) and you're done. I guess I could also see how to do step 2 and step 3 and you're done. How do you combine steps 1 and 2?

comment:3 Changed 8 years ago by karsten

We use the directory bytes to decide what fraction of directory requests have been reported to us. If 100 out of 1000 directory mirrors report directory request statistics to us, we don't know if these directory mirrors saw 10% or 5% or 20% of all directory requests in the network. But we can use the directory bytes of a) the directory mirrors reporting directory request statistics and b) all directory mirrors in the network to estimate what fraction of directory requests we have seen.

Note that we're talking about estimated directory bytes here, because not all relays report that number to us yet. We estimate the number of written directory bytes by subtracting the total read bytes from the total written bytes. On page 7 of the relevant tech report we write that we assume that the difference between total written and total read bytes on directory mirrors is to a large extent the result of answering small directory requests with large directory objects. We observed that relays that don’t mirror the directory write more bytes than they read, too, but the difference between written and read bytes is much smaller than on directory mirrors. We weight the bytes written by directory mirrors with the quotient of read and written bytes on relays that don’t mirror the directory in order to account for non-directory related factors. We then subtract the number of bytes read by directory mirrors and obtain an estimate of directory bytes written by directory mirrors.
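The byte estimate from the tech report excerpt, and the fraction from comment 3, can be sketched like this. This is a sketch under the stated assumptions with hypothetical names, not the actual metrics code.

```python
# Hedged sketch of the directory-bytes estimate described above.
# All names are illustrative.

def estimate_dir_bytes(written_dm, read_dm, written_other, read_other):
    """Estimate directory bytes written by directory mirrors.

    Written bytes on mirrors are first scaled by the read/written
    quotient observed on non-mirror relays (to discount write
    overhead unrelated to directory traffic), then the mirrors'
    read bytes are subtracted."""
    quotient = read_other / written_other
    return written_dm * quotient - read_dm

def reporting_fraction(est_bytes_reporting, est_bytes_all):
    """Fraction of directory traffic covered by the mirrors that
    report directory-request statistics (comment 3)."""
    return est_bytes_reporting / est_bytes_all
```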

comment:4 Changed 8 years ago by arma

Ok. So the questions are, for those three spikes in April:

  • Did the difference between reported read bytes and write bytes go up a lot?
  • Did the reported number of dir requests go up a lot?
  • Did the set of reporting directories (either of dir requests or of dir bytes) change dramatically?

It would be good to know which parameter was the one that changed a lot, to better understand how to detect spikes in our output that may not reflect spikes in actual usage.

comment:5 Changed 8 years ago by arma

https://metrics.torproject.org/network.html?graph=dirbytes&start=2010-12-01&end=2011-06-23&dpi=300#dirbytes
makes it look like the total number of bytes spent on handling dir info didn't change much.

How can we double the number of users without increasing the amount of total dir info we give out?

Changed 8 years ago by karsten

Attachment: daily-users.pdf added

Graph showing the possible reason for spikes in user numbers

comment:6 Changed 8 years ago by karsten

I had a look at the data, and it seems that our estimate of the fraction of directory bytes seen by the directory mirrors that reported directory requests is slightly broken. The reported request numbers are quite stable, but every time there's a peak in the user number, our estimate of observed directory bytes drops below 10%. I just attached a graph that shows this problem; the graph code is in the metrics-tasks repository.

How do we fix this? We should make more relays upgrade to 0.2.2.x, which reports directory requests by default. Or we should find a better estimate for directory bytes than our current one.

comment:7 in reply to:  6 ; Changed 8 years ago by arma

Replying to karsten:

I had a look at the data, and it seems that our estimate of the fraction of directory bytes seen by the directory mirrors that reported directory requests is slightly broken. The reported request numbers are quite stable, but every time there's a peak in the user number, our estimate of observed directory bytes drops below 10%. I just attached a graph that shows this problem; the graph code is in the metrics-tasks repository.

Nice!

How do we fix this? We should make more relays upgrade to 0.2.2.x, which reports directory requests by default. Or we should find a better estimate for directory bytes than our current one.

This one I'm confused by.

#!/usr/bin/perl
for(<>) {
  if (/^r .* (\d+)$/) {
    $dirport = $1;
  } elsif (/^v Tor (.*?)[- ]/) {
    $versionstring = $1;
  } elsif ($dirport != 0 and /^w Bandwidth=(.*)/) {
    $bandwidth = $1;
    if ($versionstring le "0.2.2.0") {
      print "$versionstring (old): $bandwidth\n";
      $old += $bandwidth;
    } else {
      print "$versionstring: $bandwidth\n";
      $new += $bandwidth;
    }
  }
}

print "Total old is $old, total new is $new\n";

Piping my cached-consensus file into that script yields

Total old is 1419334, total new is 2603078

So about 65% of the directory mirrors by weight are running 0.2.2 already.

I guess you mean 0.2.3? In which case only 358701/(358701+3663711)= 9% are running it.

comment:8 in reply to:  7 Changed 8 years ago by arma

Replying to arma:

I guess you mean 0.2.3? In which case only 358701/(358701+3663711)= 9% are running it.

There are a handful of fast directory mirrors running 0.2.2.x right now without publishing dir stats. I could mail them and ask them to turn on the dir stats. That's probably better than asking them to switch to 0.2.3.x.

The instructions there are just to add these two lines to your torrc, right?

DirReqStatistics 1
ExtraInfoStatistics 1

And it should work fine for recent 0.2.2 versions?

comment:9 Changed 8 years ago by arma

In further poking, the fractions that relays are advertising assume that all clients use the weights in the consensus, which only started happening in mid-0.2.2. So all the 0.2.1 clients are fetching directory stuff according to different weights -- from before the WEIGHT_FOR_DIR rule existed.

For example, here are the most popular directory mirrors for my Tor client that's running master:

11965430.000000 nononononon
12127490.000000 Merav
12439200.000000 blutmagie4
14261280.000000 blutmagie3
15700000.000000 rueckgrat
16100000.000000 CCN1
16608960.000000 rockhall
17300000.000000 Tor2YitCC
17718560.000000 greyunknown
18900000.000000 Lifuka
19400000.000000 c00psTOR2
20600000.000000 Tor3YitCC
21445940.000000 trusted
23800000.000000 c00psTOR
24900000.000000 pink
27600000.000000 Pandora15
29900000.000000 Pandora24
30200000.000000 Pandora
36000000.000000 Pandora20
36600000.000000 Pandora23
36700000.000000 Pandora22
37600000.000000 Pandora17
51000000.000000 oilsrv1

But, for example, trusted has a weight of 44819472 for my 0.2.1 client. I haven't done enough scripting to actually compare, but I think some of these numbers will be quite different. And even in the case where the weights are the same (when the relay doesn't have the Guard or the Exit flag), the *rest* of the weights will be different.
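The bandwidth-weighted selection being probed here can be illustrated with a generic sketch. This is not Tor's routerlist.c logic, just the underlying accumulate-and-threshold technique, with hypothetical names; it shows why changing the weight rule changes which mirrors clients hit most often.

```python
# Generic sketch of bandwidth-weighted mirror selection (illustrative,
# not Tor's actual code): pick each mirror with probability
# proportional to its weight by walking a running sum.
import random

def pick_mirror(mirrors, weights, rng=random):
    """Pick one mirror with probability proportional to its weight."""
    total = sum(weights)
    r = rng.uniform(0, total)
    upto = 0.0
    for mirror, w in zip(mirrors, weights):
        upto += w
        if r <= upto:
            return mirror
    return mirrors[-1]  # guard against r landing exactly on total
```

With this scheme, a client version that applies different weights to the same relays produces a visibly different distribution of directory requests, which is the effect being measured above.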

comment:10 Changed 8 years ago by arma

For posterity, here were my hacks.

For master:

diff --git a/src/or/routerlist.c b/src/or/routerlist.c
index f711282..b304255 100644
--- a/src/or/routerlist.c
+++ b/src/or/routerlist.c
@@ -1812,6 +1812,10 @@ smartlist_choose_node_by_bandwidth_weights(smartlist_t *s
     weighted_bw += weight*this_bw;
     if (is_me)
       sl_last_weighted_bw_of_me = weight*this_bw;
+    if (rule == WEIGHT_FOR_DIR)
+      log_debug(LD_CIRC, "Foo: %lf %s",
+                weight*this_bw,
+                node_get_nickname(node));
   } SMARTLIST_FOREACH_END(node);

   /* XXXX023 this is a kludge to expose these values. */

and for maint-0.2.1:

diff --git a/src/or/routerlist.c b/src/or/routerlist.c
index be7ba0e..32588e9 100644
--- a/src/or/routerlist.c
+++ b/src/or/routerlist.c
@@ -1593,6 +1593,7 @@ smartlist_choose_by_bandwidth(smartlist_t *sl, bandwidth_w
       is_guard = status->is_possible_guard;
       if (status->has_bandwidth) {
         this_bw = kb_to_bytes(status->bandwidth);
+        log_debug(LD_CIRC, "Foo1: %d %d %s", this_bw, i, status->nickname);
       } else { /* guess */
         /* XXX022 once consensuses always list bandwidths, we can take
          * this guessing business out. -RD */
@@ -1724,6 +1725,8 @@ smartlist_choose_by_bandwidth(smartlist_t *sl, bandwidth_w
       total_bw += bw;
       if (i == (unsigned) me_idx)
         sl_last_weighted_bw_of_me = bw;
+      if (statuses)
+        log_debug(LD_CIRC, "Foo2: %d %d %d", bandwidths[i], bw, i);
     }
   }

comment:11 Changed 8 years ago by arma

If we had a full-time metrics researcher, one workaround would be to calculate what dirreq-share each relay would report using the old formula, and plot that curve too.

Even doing a couple of data points by hand (some where your red dots are, and some not) would give us a better intuition here.

It seems very weird to me that the fraction of the network we're seeing would drop by half, yet the number of requests we're seeing would remain steady. So I'm trying to come up with reasons why the fraction of the network didn't actually drop.
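The arithmetic behind that suspicion: with reported requests held constant, the extrapolated user count moves inversely with the estimated fraction, so a fraction that (wrongly) halves doubles the user estimate. A quick sketch with made-up numbers:

```python
# Made-up numbers showing how a drop in the estimated fraction
# inflates the user estimate while reported requests stay steady.
reported_requests = 500_000

users_normal = reported_requests / 0.10 / 10  # fraction = 10%
users_spike = reported_requests / 0.05 / 10   # fraction drops to 5%

print(users_normal, users_spike)  # the spike value is twice the normal one
```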

comment:12 in reply to:  9 Changed 8 years ago by Sebastian

Replying to arma:

In further poking, the fractions that relays are advertising are assuming that all clients use the weights in the consensus, which only started happening in mid 0.2.2. So all the 0.2.1 clients are fetching directory stuff according to different weights -- before the WEIGHT_FOR_DIR rule existed.

Another idea for why we see spikes would be that a change in weight isn't immediately reflected in load, because (0.2.2.x) users keep using the old weight until they fetch a new consensus. So if there is a sharp drop in weight, the user number might be heavily inflated.

comment:13 Changed 8 years ago by karsten

I think there's some confusion in the comments about dirreq-share lines and how clients on different Tor major versions use different weights for picking a directory mirror. This is all irrelevant here! Our current user number estimate is based on estimated written directory bytes, not on probabilities for clients to pick directory mirrors. These estimated directory byte histories are, in theory, far more reliable and useful for adding up the observations from multiple directories than the dirreq-share values that we used before.

The suspected reason why our estimated user numbers have some false values in the past few months is that our directory bytes estimate is slightly broken. We're estimating directory bytes based on the difference between written and read total bytes. This approach worked okay when developing the user number estimate, but apparently it's not fail-safe. We're using this estimate as opposed to extrapolating the recently introduced directory byte metric, because we have bandwidth data for the past few years. Even if we switched to the reported directory bytes now, we'd only have user numbers for the past few months.

I'm going to re-run the comparison of extrapolated directory bytes and our estimate based on the difference between written and read total bytes.

Also, I was wrong that relays need to upgrade to 0.2.2.x to report directory stats by default. It's 0.2.3.x that they need to upgrade to. Relays on 0.2.2.x would have to add DirReqStatistics 1 and ExtraInfoStatistics 1 to their torrc to report directory stats.

Changed 8 years ago by karsten

Attachment: dir-bytes-estimate.pdf added

Estimated vs. extrapolated directory bytes

comment:14 Changed 8 years ago by karsten

It looks like the directory bytes estimate isn't that bad. The problem is just that we don't have enough directory mirrors reporting directory-request statistics to us. See the attached graph dir-bytes-estimate.pdf for the details.

The blue line shows the extrapolated directory byte metric for all directory mirrors. This line should be pretty reliable except for the first few days in August 2010 when only a few directory mirrors reported directory bytes. But we cannot use this line, because it only reaches back to August 2010 and we want user statistics since August 2009.

The red line is our directory byte estimate based on the difference between written and read total bytes. I would have expected a bigger distance from the blue line or a much higher volatility. But great, in theory, the estimate still works fine.

Now, the green line is the same estimate, but only for directory mirrors reporting directory request statistics. We use the quotient of the red and green line to compute our "fraction" in the other graph. I marked the same four points with purple dots when our user number skyrocketed.

My conclusion would be that there need to be more and faster directory mirrors reporting directory request statistics than on those four days. If we reach around 1/3 or 1/2, bandwidth-wise, of directory mirrors reporting directory request statistics, we should do okay.

comment:15 in reply to:  14 Changed 8 years ago by arma

Cc: karsten added

Replying to karsten:

My conclusion would be that there need to be more and faster directory mirrors reporting directory request statistics than on those four days. If we reach around 1/3 or 1/2, bandwidth-wise, of directory mirrors reporting directory request statistics, we should do okay.

Is there an easy way for us to track this fraction over time? I want to a) know where the fraction is now, and b) figure out if we should try to make it default in 0.2.2, or just harass the fast relays into setting it manually.

comment:16 Changed 8 years ago by karsten

I just added a new CSV file for this: https://metrics.torproject.org/csv/dirreq-stats.csv

comment:17 Changed 8 years ago by arma

I opened #3951 as a followup, to reduce the future chances of wrongness in our graphs.

comment:18 Changed 8 years ago by arma

Resolution: implemented
Status: new → closed

Closing this one, on the theory that the original question is now 'solved'. Thanks!

Note: See TracTickets for help on using tickets.