Opened 9 months ago

Closed 8 months ago

#28305 closed defect (fixed)

Include client numbers even if we think we got reports from more than 100% of all relays

Reported by: karsten
Owned by: karsten
Priority: High
Milestone:
Component: Metrics/Statistics
Version:
Severity: Normal
Keywords:
Cc: metrics-team
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor: SponsorV-can

Description

The estimated fraction of reported user statistics from relays has reached 100% and even gone slightly beyond, to 100.294% on 2018-10-27 and 100.046% on 2018-10-28.

The effect is that we're excluding days when this happened from statistics, because we never thought this was possible:

WHERE a.frac BETWEEN 0.1 AND 1.0

However, I think this is most likely a rounding error somewhere, not a general issue with the approach. Stated differently, it seems wrong to include a number with a fraction of reported statistics of 99.9% but not one where that fraction is 100.1%.

I suggest that we drop the upper limit and change the line above to:

WHERE a.frac >= 0.1
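
A quick sketch of the effect in Python (using the two frac values above, plus a made-up value for an ordinary day; the real filter is the SQL WHERE clause quoted above):

  # frac values from 2018-10-27 and 2018-10-28 as reported above; 2018-10-26 is made up.
  fracs = {"2018-10-26": 0.987, "2018-10-27": 1.00294, "2018-10-28": 1.00046}

  old_kept = {d for d, f in fracs.items() if 0.1 <= f <= 1.0}  # BETWEEN 0.1 AND 1.0
  new_kept = {d for d, f in fracs.items() if f >= 0.1}         # frac >= 0.1

  print(sorted(new_kept - old_kept))  # ['2018-10-27', '2018-10-28'] would no longer be dropped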

We'll be replacing these statistics by PrivCount in the medium term anyway.

However, simply excluding data points doesn't seem like an intuitive solution.

Thoughts?

Child Tickets

Attachments (3)

frac-raw-2018-11-28.png (96.0 KB) - added by karsten 8 months ago.
userstats-relay-country-all-2018-09-01-2018-11-30-off.png (22.1 KB) - added by karsten 8 months ago.
userstats-relay-country-all-2018-09-04-2018-12-03-off.png (22.9 KB) - added by karsten 8 months ago.


Change History (13)

comment:1 Changed 9 months ago by teor

I wonder if it's more than a rounding error.

How is frac calculated?

I wonder if you're choosing a single point in time to calculate the consensus weight fraction.

What happens if a relay that submits statistics is offline at that point in time, but submits its statistics later, when it comes back online?

If that relay has a weight N, and 100% of relays submit, then your frac calculation will be:

     100        the relays that submitted statistics
  ---------
  (100 - N)     the relays that were online at the point in time you chose

When it should be:

     100        the relays that submitted statistics
  ---------
     100        all the relays that could have submitted statistics
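
As a rough numeric sketch of that ratio (toy Python with made-up weights; the real computation uses consensus weights and reported byte histories rather than this simple model):

  # Toy model: total relay weight is 100; a relay with weight N is offline at the
  # snapshot used for the denominator, but still submits its statistics later.
  N = 5

  submitted = 100               # weight of relays whose statistics we received
  online_at_snapshot = 100 - N  # weight of relays counted as online at the chosen point in time

  frac = submitted / online_at_snapshot
  print(frac)  # 1.0526... -- greater than 1, even though exactly 100% of relays reported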

I can't find the codebase or the metrics statistics explanation to check. (I tried searching for the code online and looking through the metrics wiki pages.)

Edit: formatting

Last edited 9 months ago by teor

comment:2 Changed 9 months ago by karsten

You'll find a description/specification of how frac is calculated here: https://metrics.torproject.org/reproducible-metrics.html#relay-users

Maybe rounding error was not the right term. In fact, I believe it might be a situation like the one you're describing. I can extract the variable values going into the frac formula; maybe one of them is responsible for getting us above 100%.

However, we should carefully consider whether we want to change that formula or rather not touch it until we have PrivCount as replacement. If we think the frac value isn't going to grow much beyond 100%, we could just accept that inaccuracy and live with it. If we think it's going to grow towards, say, 150%, I agree that we'll have to do something.

comment:3 in reply to:  2 Changed 9 months ago by teor

Sponsor: SponsorV-can

Replying to karsten:

You'll find a description/specification of how frac is calculated here: https://metrics.torproject.org/reproducible-metrics.html#relay-users

Maybe rounding error was not the right term. In fact, I believe it might be a situation like the one you're describing. I can extract the variable values going into the frac formula; maybe one of them is responsible for getting us above 100%.

I wonder if changing the bandwidth interval to 24 hours revealed this issue?

For servers which report 24-hour intervals, I think that:

  • h(R^H) is usually equal to h(H)
  • n(H) is usually 24
  • n(R\H) is usually 0
  • n(N) can be slightly less than 24, if a relay was unreachable or misconfigured, but didn't go down
Therefore, frac can be slightly more than 1.
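
To make that concrete, here is a rough single-relay sketch in Python, assuming frac has the form frac = (h(R^H) * n(H) + h(H) * n(R\H)) / (h(H) * n(N)) as described on the linked reproducible-metrics page (hours instead of seconds, and an arbitrary byte count, purely for illustration):

  h_H   = 1_000_000  # h(H): written directory bytes reported by the relay (arbitrary)
  h_RH  = h_H        # h(R^H): same bytes, since the relay also reported dirreq stats
  n_H   = 24         # n(H): hours covered by the written-bytes history
  n_RmH = 0          # n(R\H): no time with dirreq stats but without written-bytes stats
  n_N   = 23         # n(N): listed in only 23 of 24 consensuses that day

  frac = (h_RH * n_H + h_H * n_RmH) / (h_H * n_N)
  print(frac)  # 24/23 = 1.043..., slightly more than 1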

However, we should carefully consider whether we want to change that formula or rather not touch it until we have PrivCount as replacement. If we think the frac value isn't going to grow much beyond 100%, we could just accept that inaccuracy and live with it. If we think it's going to grow towards, say, 150%, I agree that we'll have to do something.

I think a similar analysis applies to PrivCount: if a relay is up for the whole day, then it will report statistics using PrivCount. But if that relay is dropped from some consensuses due to reachability, then our idea of the average number of running relays will be too low.

We won't see this bug until almost all relays are running PrivCount. But let's avoid re-implementing this bug in PrivCount if we can.

What can PrivCount do to avoid introducing a similar bug?

comment:4 in reply to:  description Changed 8 months ago by irl

Replying to karsten:

I suggest that we drop the upper limit and change the line above to:

WHERE a.frac >= 0.1

This sounds like a reasonable thing to do in this case.

Changed 8 months ago by karsten

Attachment: frac-raw-2018-11-28.png added

comment:5 Changed 8 months ago by karsten

Owner: changed from metrics-team to karsten
Status: new → accepted

I think I now know what's going on: some relays report written directory byte statistics for times when they were hardly listed in consensuses.

Here's a graph with all variables going into the frac formula, plus intermediate products, and finally the frac value:

[Attachment: frac-raw-2018-11-28.png]

Note the red arrow. At this point n(H) grows larger than n(N). That's an issue. By definition, a relay cannot report written directory bytes statistics for a longer time than it's online.

I also looked at random relay 002B024E24A30F113982FCB17DFE05B6F38C0C79 that had a larger n(H) value than n(N) value on 2018-10-28:

  • This relay was listed in 3 out of 24 consensuses on 2018-10-28 (19:00, 20:00, and 21:00). As a result, we count this relay with n(N) = 10800 (we're using seconds internally, not hours).
  • The same relay published an extra-info descriptor on 2018-10-31 at 09:28:04 with the following line: dirreq-write-history 2018-10-30 08:04:04 (86400 s) 0,0. We count this as n(H) = 57356 on 2018-10-28.

A possible mitigation (other than the one I suggested above) could be to replace n(H) with n(N^H) in the frac formula. This would mean that we'd cap the amount of time for which a relay reported written directory bytes to the amount of time it was listed in the consensus.
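
As a quick sketch of that capping for the relay above (Python; approximating the interval intersection n(N^H) by a simple minimum, whereas the real fix would intersect the actual consensus and history intervals):

  # Values for relay 002B024E... on 2018-10-28 from above, in seconds.
  n_N = 10_800  # listed in 3 of 24 consensuses
  n_H = 57_356  # time covered by its dirreq-write-history on that day

  # Proposed mitigation: use n(N^H) instead of n(H), so a relay can never
  # contribute more reported time than listed time.
  n_NH = min(n_N, n_H)
  print(n_NH)  # 10800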

I'm currently dumping and downloading the database to try this out at home. However, I'm afraid that deploying this fix is going to be much more expensive than making the simple fix suggested above. I'll report here what I find out.

comment:6 in reply to: 5 Changed 8 months ago by teor

Replying to karsten:

I think I now know what's going on: some relays report written directory byte statistics for times when they were hardly listed in consensuses.

Here's a graph ...

Note the red arrow. At this point n(H) grows larger than n(N). That's an issue. By definition, a relay cannot report written directory bytes statistics for a longer time than it's online.

But relays that aren't listed in the consensus can still be acting as relays.

Here are a few scenarios where that happens:

  • the relay's IPv4 address is unreachable from a majority of directory authorities, but some clients (with old consensuses) can still reach it
  • the relay's IPv4 address has changed, and the authorities haven't checked the new address, but the relay is still reachable on the old address cached at some clients
  • the same scenarios with IPv6, but there are only 6/9 authorities that check and vote on IPv6
  • the relay is configured as a bridge by some clients, but it publishes descriptors as a relay

If a relay drops in and out of the consensus every few hours, there will always be some clients with a consensus containing that relay.

I also looked at random relay 002B024E24A30F113982FCB17DFE05B6F38C0C79 that had a larger n(H) value than n(N) value on 2018-10-28:

  • This relay was listed in 3 out of 24 consensuses on 2018-10-28 (19:00, 20:00, and 21:00). As a result, we count this relay with n(N) = 10800 (we're using seconds internally, not hours).
  • The same relay published an extra-info descriptor on 2018-10-31 at 09:28:04 with the following line: dirreq-write-history 2018-10-30 08:04:04 (86400 s) 0,0. We count this as n(H) = 57356 on 2018-10-28.

A possible mitigation (other than the one I suggested above) could be to replace n(H) with n(N^H) in the frac formula. This would mean that we'd cap the amount of time for which a relay reported written directory bytes to the amount of time it was listed in the consensus.

This seems like a reasonable approach: if the relay is listed in the consensus for n(N^H) seconds, then we should weight its bandwidth using that number of seconds.

I'm currently dumping and downloading the database to try this out at home. However, I'm afraid that deploying this fix is going to be much more expensive than making the simple fix suggested above. I'll report here what I find out.

I'm not sure if it will make much of a difference long-term: relays that drop out of the consensus should have low bandwidth weights, and therefore low bandwidths. (Except when the network is unstable, or there are fewer than 3 bandwidth authorities.)

comment:7 in reply to:  6 Changed 8 months ago by karsten

Replying to teor:

Replying to karsten:

Note the red arrow. At this point n(H) grows larger than n(N). That's an issue. By definition, a relay cannot report written directory bytes statistics for a longer time than it's online.

But relays that aren't listed in the consensus can still be acting as relays.

You're right, there are cases where this is possible. These are just cases we did not consider in the original design of the frac formula.

A possible mitigation (other than the one I suggested above) could be to replace n(H) with n(N^H) in the frac formula. This would mean that we'd cap the amount of time for which a relay reported written directory bytes to the amount of time it was listed in the consensus.

This seems like a reasonable approach: if the relay is listed in the consensus for n(N^H) seconds, then we should weight its bandwidth using that number of seconds.

Oh, you're raising another important point here: speaking in formula terms, if we replace n(H) with n(N^H) we'll also have to replace h(H) with h(N^H).

Similarly, we'll have to replace h(R^H) with h(R^H^N) and n(R\H) with n(R^N\H).
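
Spelled out, the substitution would look roughly like this (same notation as above, just for reference):

  # current:   frac = (h(R^H)   * n(H)   + h(H)   * n(R\H))   / (h(H)   * n(N))
  # adjusted:  frac = (h(R^H^N) * n(N^H) + h(N^H) * n(R^N\H)) / (h(N^H) * n(N))
  #
  # Every statistics-based term gets clipped to the time the relay was actually
  # listed in the consensus.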

Hmmmm. I'm less optimistic now that changing the frac formula is a good idea. It seems like too big a change to make, and we're not even sure that the result will be more accurate.

I'm currently dumping and downloading the database to try this out at home. However, I'm afraid that deploying this fix is going to be much more expensive than making the simple fix suggested above. I'll report here what I find out.

I'm not sure if it will make much of a difference long-term: relays that drop out of the consensus should have low bandwidth weights, and therefore low bandwidths. (Except when the network is unstable, or there are fewer than 3 bandwidth authorities.)

Agreed.

Let's make the change I suggested above, in a slightly modified way:

-WHERE a.frac BETWEEN 0.1 AND 1.0
+WHERE a.frac BETWEEN 0.1 AND 1.1

The reason for accepting frac values between 1.0 and 1.1 is that, as discussed here, there can be relays reporting statistics that temporarily didn't make it into the consensus.

The reason for not giving up on the upper bound is that, as the graph above shows, there are still single days over the years when frac suddenly went up to 1.2, 1.5, or even 2.0. We should continue excluding these data points. Therefore we should use 1.1 as the new upper bound.

How does this sound?

comment:8 Changed 8 months ago by teor

Seems good to me.

comment:9 Changed 8 months ago by karsten

Status: accepted → merge_ready

Great! Deployed this change. It will take half a day for the change to become visible. For comparison, here's the latest graph still with gaps:

[Attachment: userstats-relay-country-all-2018-09-01-2018-11-30-off.png]

comment:10 Changed 8 months ago by karsten

Resolution: fixed
Status: merge_ready → closed

And the gaps are gone:

[Attachment: userstats-relay-country-all-2018-09-04-2018-12-03-off.png]

Closing. Thanks!
