There were three usage spikes in April and one at the end of May. They happen in many countries at once, so my assumption is that they're errors in our user-counting algorithm.
The ones in April happened around the same time that #2704 blew up, so we thought they were related. The one at the end of May shouldn't be related.
So what is changing in the measurements we gather that makes it spike?
< arma> karsten: can you remind me in a few sentences how we conclude the number of users on our user graphs? i remember we had lots of approaches to choose from, but i don't remember which we use now.
< karsten> arma: Here's how we estimate daily users: We add up directory requests for network status consensuses to all directory mirrors that report these numbers to us (which is an increasing number of relays with 0.2.3.x).
< karsten> We also estimate how many bytes these directory mirrors, as well as all directory mirrors in the network, have spent on answering directory requests.
< karsten> We can then extrapolate the total number of directory requests in the network. We divide this number by 10, assuming that every client makes 10 directory requests per day. That number is our daily user estimate.
Karsten: how does the 'estimate how many bytes these directory mirrors have spent on answering directory requests' figure in? I can see how to do step 1 (add up directory requests) and step 3 (extrapolate to total directory requests) and you're done. I guess I could also see how to do step 2 and step 3 and you're done. How do you combine steps 1 and 2?
We use the directory bytes to decide what fraction of directory requests have been reported to us. If 100 out of 1000 directory mirrors report directory request statistics to us, we don't know if these directory mirrors saw 10% or 5% or 20% of all directory requests in the network. But we can use the directory bytes of a) the directory mirrors reporting directory request statistics and b) all directory mirrors in the network to estimate what fraction of directory requests we have seen.
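The pipeline described in the last few comments can be sketched roughly as follows. This is only an illustration of the arithmetic; the function name and the numbers are mine, not from the real metrics code, which works on per-day totals from relay extra-info descriptors.

```python
# Illustrative sketch of the daily-user estimate described above.

REQUESTS_PER_CLIENT_PER_DAY = 10  # assumption stated in the IRC log above

def estimate_daily_users(reported_requests, dir_bytes_reporting, dir_bytes_all):
    """Extrapolate total directory requests from the reporting subset.

    reported_requests:   consensus downloads counted by mirrors that
                         report directory-request statistics
    dir_bytes_reporting: directory bytes written by those same mirrors
    dir_bytes_all:       directory bytes written by all directory mirrors
    """
    # Fraction of the network's directory traffic we actually observed.
    fraction = dir_bytes_reporting / dir_bytes_all
    # Scale the reported requests up to a network-wide total.
    total_requests = reported_requests / fraction
    # Assume every client makes 10 directory requests per day.
    return total_requests / REQUESTS_PER_CLIENT_PER_DAY

# Example: mirrors covering 10% of directory bytes counted 400,000
# requests, so the estimate is 400,000 / 0.10 / 10 = 400,000 users.
print(estimate_daily_users(400_000, 1_000_000, 10_000_000))
```

Note how the byte fraction, not the raw request count, carries all the extrapolation: if the fraction is wrong, the user number is wrong by the same factor.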
Note that we're talking about estimated directory bytes here, because not all relays report that number to us yet. We estimate the number of written directory bytes by subtracting the total read bytes from the total written bytes. On page 7 of the relevant tech report we write that we assume that the difference between total written and total read bytes on directory mirrors is to a large extent the result of answering small directory requests with large directory objects. We observed that relays that don’t mirror the directory write more bytes than they read, too, but the difference between written and read bytes is much smaller than on directory mirrors. We weight the bytes written by directory mirrors with the quotient of read and written bytes on relays that don’t mirror the directory in order to account for non-directory related factors. We then subtract the number of bytes read by directory mirrors and obtain an estimate of directory bytes written by directory mirrors.
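A literal reading of that paragraph gives the following sketch. The variable names are mine, not the tech report's, and this only shows the shape of the calculation:

```python
def estimate_dir_bytes_written(written_mirror, read_mirror,
                               written_nonmirror, read_nonmirror):
    """Estimate directory bytes written by directory mirrors,
    as described in the tech report paragraph above.

    Non-mirrors also write more than they read, so we first scale the
    mirrors' written bytes by the read-to-written quotient observed on
    non-mirrors to account for non-directory factors, then subtract the
    bytes the mirrors read.  A relay whose written/read ratio matches
    the non-mirrors' yields zero by construction.
    """
    nondir_quotient = read_nonmirror / written_nonmirror
    return written_mirror * nondir_quotient - read_mirror
```

For example, with non-mirrors writing 110 bytes for every 100 read, a mirror that wrote 150 and read 100 would be credited with 150 * (100/110) - 100, or about 36 directory bytes.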
Ok. So the questions are, for those three spikes in April:
Did the difference between reported read bytes and write bytes go up a lot?
Did the reported number of dir requests go up a lot?
Did the set of reporting directories (either of dir requests or of dir bytes) change dramatically?
It would be good to know which parameter was the one that changed a lot, to better understand how to detect spikes in our output that may not reflect spikes in actual usage.
I had a look at the data, and it seems that our estimate of the fraction of directory bytes written by the directory mirrors that reported directory requests is slightly broken. The reported request numbers are quite stable, but every time there's a peak in the user number, our estimate of observed directory bytes drops below 10%. I just attached a graph which shows this problem; the graph code is in the metrics-tasks repository.
How do we fix this? We should get more relays to upgrade to 0.2.2.x, which reports directory requests by default. Or we should find a better estimate for directory bytes than our current one.
> I had a look at the data, and it seems that our estimate of the fraction of directory bytes written by the directory mirrors that reported directory requests is slightly broken. The reported request numbers are quite stable, but every time there's a peak in the user number, our estimate of observed directory bytes drops below 10%. I just attached a graph which shows this problem; the graph code is in the metrics-tasks repository.
Nice!
> How do we fix this? We should get more relays to upgrade to 0.2.2.x, which reports directory requests by default. Or we should find a better estimate for directory bytes than our current one.
This one I'm confused by.
for (<>) {
    if (/^r .* (\d+)$/) {
        $dirport = $1;
    } elsif (/^v Tor (.*?)[- ]/) {
        $versionstring = $1;
    } elsif ($dirport != 0 and /^w Bandwidth=(.*)/) {
        $bandwidth = $1;
        if ($versionstring le "0.2.2.0") {
            print "$versionstring (old): $bandwidth\n";
            $old += $bandwidth;
        } else {
            print "$versionstring: $bandwidth\n";
            $new += $bandwidth;
        }
    }
}
print "Total old is $old, total new is $new\n";
Piping my cached-consensus file into that script yields
Total old is 1419334, total new is 2603078
So about 65% of the directory mirrors by weight are running 0.2.2 already.
I guess you mean 0.2.3? In which case only 358701/(358701+3663711)= 9% are running it.
> I guess you mean 0.2.3? In which case only 358701/(358701+3663711) = 9% are running it.
There are a handful of fast directory mirrors running 0.2.2.x right now without publishing dir stats. I could mail them and ask them to turn on the dir stats. That's probably better than asking them to switch to 0.2.3.x.
The instructions there are just to add these two lines to your torrc, right?
DirReqStatistics 1
ExtraInfoStatistics 1
And it should work fine for recent 0.2.2 versions?
In further poking: the fractions that relays are advertising assume that all clients use the weights in the consensus, which only started happening in mid-0.2.2. So all the 0.2.1 clients are fetching directory stuff according to different weights -- from before the WEIGHT_FOR_DIR rule existed.
For example, here are the most popular directory mirrors for my Tor client that's running master:
But, for example, trusted has a weight of 44819472 for my 0.2.1 client. I haven't done enough scripting to actually compare, but I think some of these numbers will be quite different. And even in the case where the weights are the same (when the relay doesn't have the Guard or the Exit flag), the rest of the weights will be different.
If we had a full-time metrics researcher, one workaround would be to calculate what dirreq-share each relay would have under the old formula, and plot that curve too.
Even doing a couple of data points by hand (some where your red dots are, and some not) would give us a better intuition here.
It seems very weird to me that the fraction of the network we're seeing would drop by half, yet the number of requests we're seeing would remain steady. So I'm trying to come up with reasons why the fraction of the network didn't actually drop.
> In further poking: the fractions that relays are advertising assume that all clients use the weights in the consensus, which only started happening in mid-0.2.2. So all the 0.2.1 clients are fetching directory stuff according to different weights -- from before the WEIGHT_FOR_DIR rule existed.
Another idea for why we see spikes: when the weight changes, that change isn't immediately reflected in load, because (0.2.2.x) users still have the old weight until they get a new consensus. So if there is a sharp drop in weight, the user number might be heavily inflated.
I think there's some confusion in the comments about dirreq-share lines and how clients on different Tor major versions use different weights for picking a directory mirror. This is all irrelevant here! Our current user number estimate is based on estimated written directory bytes, not on probabilities for clients to pick directory mirrors. These estimated directory byte histories are, in theory, far more reliable and useful for adding up the observations from multiple directories than the dirreq-share values that we used before.
The suspected reason why our estimated user numbers have some false values in the past few months is that our directory bytes estimate is slightly broken. We're estimating directory bytes based on the difference between written and read total bytes. This approach worked okay when developing the user number estimate, but apparently it's not fail-safe. We're using this estimate as opposed to extrapolating the recently introduced directory byte metric, because we have bandwidth data for the past few years. Even if we switched to the reported directory bytes now, we'd only have user numbers for the past few months.
I'm going to re-run the comparison of extrapolated directory bytes and our estimate based on the difference between written and read total bytes.
Also, I was wrong that relays need to upgrade to 0.2.2.x to report directory stats by default. It's 0.2.3.x that they need to upgrade to. Relays on 0.2.2.x would have to add DirReqStatistics 1 and ExtraInfoStatistics 1 to their torrc to report directory stats.
It looks like the directory bytes estimate isn't that bad. The problem is just that we don't have enough directory mirrors reporting directory-request statistics to us. See the attached graph dir-bytes-estimate.pdf for the details.
The blue line shows the extrapolated directory byte metric for all directory mirrors. This line should be pretty reliable except for the first few days in August 2010 when only a few directory mirrors reported directory bytes. But we cannot use this line, because it only reaches back to August 2010 and we want user statistics since August 2009.
The red line is our directory byte estimate based on the difference between written and read total bytes. I would have expected a bigger distance from the blue line or a much higher volatility. But great, in theory, the estimate still works fine.
Now, the green line is the same estimate, but only for directory mirrors reporting directory request statistics. We use the quotient of the red and green lines to compute our "fraction" in the other graph. I marked with purple dots the same four points where our user number skyrocketed.
My conclusion would be that there need to be more and faster directory mirrors reporting directory request statistics than on those four days. If we reach around 1/3 or 1/2, bandwidth-wise, of directory mirrors reporting directory request statistics, we should do okay.
> My conclusion would be that there need to be more and faster directory mirrors reporting directory request statistics than on those four days. If we reach around 1/3 or 1/2, bandwidth-wise, of directory mirrors reporting directory request statistics, we should do okay.
Is there an easy way for us to track this fraction over time? I want to a) know where the fraction is now, and b) figure out if we should try to make it default in 0.2.2, or just harass the fast relays into setting it manually.
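One way to track the bandwidth-weighted fraction over time would be to walk the archived consensuses and check, per relay, whether a matching extra-info descriptor contains dirreq stats. The sketch below assumes a made-up input format (a list of date, bandwidth, reports-stats tuples that parsing would produce); it is not the actual metrics tooling.

```python
from collections import defaultdict

def fraction_reporting_by_day(mirrors):
    """Bandwidth-weighted fraction of directory mirrors reporting
    dirreq stats, keyed by consensus date.

    mirrors: iterable of (date, consensus_bandwidth, reports_dirreq_stats),
    a hypothetical format standing in for parsed consensus + extra-info data.
    """
    totals = defaultdict(lambda: [0, 0])  # date -> [reporting_bw, total_bw]
    for date, bandwidth, reports_stats in mirrors:
        totals[date][1] += bandwidth
        if reports_stats:
            totals[date][0] += bandwidth
    return {date: reporting / total
            for date, (reporting, total) in sorted(totals.items())}

# Toy data: two mirrors per day, one of which reports dirreq stats.
sample = [
    ("2011-06-01", 500, True),
    ("2011-06-01", 1500, False),
    ("2011-06-02", 900, True),
    ("2011-06-02", 1100, False),
]
print(fraction_reporting_by_day(sample))
# → {'2011-06-01': 0.25, '2011-06-02': 0.45}
```

Plotting that dictionary over a few months of consensuses would answer both (a) where the fraction is now and (b) how far it is from the 1/3 to 1/2 target mentioned above.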