Take descriptor upload overlap into account when estimating version 3 onion address counts

added component::metrics/statistics owner::metrics-team parent::23126 priority::medium severity::normal status::new type::enhancement labels

I'll have to dive deeper into this topic, but here are some quick thoughts:

I don't think we're including anything from v3 in these statistics, but we'd have to ask asn and dgoulet to be certain.
I believe we're taking descriptor overlap periods into account for v2. See Section 5, "Extrapolating network totals" of the linked report: "As an approximation, we assume that a hidden service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different hidden-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica." And later in that section we say how this is just an approximation.
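To make the arithmetic in that quote explicit, here is a minimal sketch. The constant and function names are made up for illustration, and the division at the end assumes that every directory reports its count, which is a simplification and not the extrapolation the report actually performs.

```python
# Numbers taken from the quoted approximation in the report.
REPLICAS = 2                 # two replicas per descriptor
HSDIRS_PER_REPLICA = 3       # each replica is stored on three directories
DESCRIPTOR_IDS_PER_DAY = 2   # one identifier change per 24 hours


def directories_per_service():
    """Directories assumed to see one service within a 24-hour period."""
    return REPLICAS * HSDIRS_PER_REPLICA * DESCRIPTOR_IDS_PER_DAY


def estimate_unique_services(total_reported_addresses):
    """Hypothetical helper: divide the summed per-directory counts by the
    number of directories that each service is assumed to reach."""
    return total_reported_addresses / directories_per_service()


if __name__ == "__main__":
    print(directories_per_service())
    print(estimate_unique_services(120000))
```

Any service that reaches more or fewer than twelve directories in the measured period skews this estimate, which is what the overlap discussion in this ticket is about.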

Do you think there's a defect in the v2 code?

And, independent of that question, is there anything in particular that we should keep in mind when extending this code to v3?

Thanks!

Replying to karsten:

I'll have to dive deeper into this topic, but here are some quick thoughts:

I don't think we're including anything from v3 in these statistics, but we'd have to ask asn and dgoulet to be certain.

No, we're not. And perhaps we will end up collecting them using PrivCount in Tor.

I believe we're taking descriptor overlap periods into account for v2. See Section 5, "Extrapolating network totals" of the linked report: "As an approximation, we assume that a hidden service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different hidden-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica." And later in that section we say how this is just an approximation.

Do you think there's a defect in the v2 code?

Yes. In each 24-hour period, there is a 1-hour overlap where descriptors are posted to the current and next HSDirs. So services with addresses that correspond to the first or last hour (initial bytes 00-0B and F4-FF) can be seen at 6 or 18 directories, not 12. But this probably balances out over time.

This is how I fixed it in experimental PrivCount (there might be bugs): https://github.com/privcount/privcount/pull/423/commits/4f1fb9191c9f3c5dc0ccbfe43c2b021a213a0c78

I also wonder if you need to account for the 1-2 hour delay between a consensus being produced, and clients downloading and using it. But the variance is probably small.

And, independent of that question, is there anything in particular that we should keep in mind when extending this code to v3?

There is an overlap for 12 hours per day, from when the client receives the 0000 consensus, for 36 hours (that is, approximately 0100-0200 for 36 hours)
The hash ring changes every 24 hours based on the SRV
You need the ed25519 relay ids from descriptors to calculate the hash ring (they're not in the consensus)
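One way to read the first point is sketched below, under stated assumptions: a new set of descriptors starts every 24 hours, each set stays live for 36 hours, and the start is offset by about one hour from 0000 to allow for consensus download. The names and the one-hour offset are illustrative choices, not values taken from the Tor code.

```python
PERIOD_HOURS = 24        # a new set of descriptors starts every 24 hours
LIFETIME_HOURS = 36      # assumed: each set stays live for 36 hours
START_OFFSET_HOURS = 1   # assumed: clients use the 0000 consensus from about 0100


def live_periods(hour):
    """Return the indices of periods whose descriptors are live at `hour`."""
    first = (hour - START_OFFSET_HOURS - LIFETIME_HOURS) // PERIOD_HOURS
    last = (hour - START_OFFSET_HOURS) // PERIOD_HOURS
    live = []
    for k in range(first, last + 1):
        start = k * PERIOD_HOURS + START_OFFSET_HOURS
        if start <= hour < start + LIFETIME_HOURS:
            live.append(k)
    return live


def overlap_hours_per_day(day=2):
    """Count the hours of one day during which two periods are live at once."""
    hours = range(day * PERIOD_HOURS, (day + 1) * PERIOD_HOURS)
    return sum(1 for h in hours if len(live_periods(h)) == 2)


if __name__ == "__main__":
    print(overlap_hours_per_day())
```

Under these assumptions, two sets of descriptors are live for half of every day, so a v3 estimate cannot reuse the v2 constant of twelve directories as is.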

There are a few more minor things that affect v2 and v3. I added a list to experimental PrivCount's position weights script: https://github.com/privcount/privcount/pull/423/commits/e4d5786469b12781a10b1c875d9228d65a17b2d9#diff-a5cebcf3ce45960e58426e68588e82e1R41

Let's think about all the v3 stats in the same place: this ticket is for metrics, the parent #23126 (moved) is for Core Tor.

Trac:
Parent: N/A to #23126 (moved)

Moving all tickets to Metrics/Statistics that are more related to the data-aggregating modules rather than the website parts of metrics-web.

Trac:
Component: Metrics/Website to Metrics/Statistics

Trac:
Cc: N/A to asn

Trac:
Keywords: N/A deleted, metrics-2018 added

Finally, I got it. (I didn't spend the whole 2 years thinking about this, but when I started looking at this ticket again this morning, it took me a while to understand the bug...)

The situation is slightly different from your description, because statistics are not collected from 00:00 UTC but from whenever a relay starts collecting them. Your general statement that we're accounting for descriptor upload overlap incorrectly is correct, though.

My current thought is to document this inaccuracy rather than changing the code. It's a known inaccuracy of roughly 1/24 = 4.2% of absolute numbers. But it doesn't affect relative changes over time. I don't think that changing the code and reprocessing the statistics is worth the effort, not least because we would then have to explain why the numbers have changed.
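A rough sketch of where the 1/24 comes from, assuming a simplified model: a relay's statistics interval lasts 24 hours, a service changes descriptor identifiers once every 24 hours at some offset into that interval, and descriptors for the next identifier are uploaded one hour early (the 1-hour overlap mentioned above). All names are illustrative.

```python
from fractions import Fraction

PERIOD_MIN = 24 * 60   # identifiers change once every 24 hours
OVERLAP_MIN = 60       # assumed: next identifier is uploaded 1 hour early
DIRS_PER_ID = 2 * 3    # two replicas, three directories each


def identifier_sets_seen(rotation_offset_min):
    """Count identifier periods whose upload span touches the relay's
    statistics interval [0, PERIOD_MIN), for a service that rotates
    `rotation_offset_min` minutes into that interval."""
    count = 0
    for k in (-2, -1, 0, 1, 2):
        upload_start = rotation_offset_min + k * PERIOD_MIN - OVERLAP_MIN
        upload_end = rotation_offset_min + (k + 1) * PERIOD_MIN
        if upload_start < PERIOD_MIN and upload_end > 0:
            count += 1
    return count


def fraction_seen_by_three_sets():
    """Share of rotation offsets (sampled mid-minute) yielding three sets."""
    offsets = [m + 0.5 for m in range(PERIOD_MIN)]
    three = sum(1 for o in offsets if identifier_sets_seen(o) == 3)
    return Fraction(three, PERIOD_MIN)


if __name__ == "__main__":
    print(fraction_seen_by_three_sets())
    print(DIRS_PER_ID * identifier_sets_seen(30))
    print(DIRS_PER_ID * identifier_sets_seen(720))
```

In this model, only services that rotate within the first hour of a relay's interval are seen by three sets of directories; all others are seen by two, as the approximation assumes.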

Here's how we could document this on the Reproducible Metrics page:

As an approximation, we assume that an onion service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different onion-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica.

To be clear, this approximation is not entirely accurate. For example, the descriptors of roughly 1/24 of services are seen by 3 rather than 2 sets of onion-service directories, when a service changes descriptor identifiers once at the beginning of a relay's statistics interval and once again towards the end. In some cases, the two replicas or the descriptors with changed descriptor identifiers could have been stored to the same directory. As another example, onion-service directories might have joined or left the network, and other directories might have become responsible for storing a descriptor, in which case they also include that .onion address in their statistics. However, for the subsequent analysis, we assume that none of these cases affects results substantially.

What do you think about this change?

I also agree that we should keep this in mind when we work on v3 stats. We should keep this ticket open, turn it into an enhancement, and update the summary a bit to make it clear that the remaining work is just for v3.

Trac:
Status: new to needs_review
Keywords: metrics-2018 deleted, N/A added

I think it's a good idea to document small inaccuracies.

I also believe that churn (on both the service and HSDir sides) is likely to outweigh the impact of this inaccuracy.

Alright, I added that sentence to the Reproducible Metrics page. And I changed the ticket summary and turned the ticket into an enhancement for the remaining version 3 work. I guess the next step here is to wait for #23126 (moved) to be implemented.

Trac:
Summary: Onion address counts ignore descriptor upload overlap to Take descriptor upload overlap into account when estimating version 3 onion address counts
Type: defect to enhancement
Status: needs_review to new
