We ignore descriptor upload overlap periods (they're not even mentioned in the paper).
During an overlap period, descriptors are published to twice as many
HSDirs (v2 & v3). If we ignore this, we will double-count:
v2: uploads and unique onion addresses and descriptor ids for 1 hour per day,
v3: uploads and unique descriptor ids for 12 hours per day.
I'm not sure that assuming descriptors are seen by 2 sets of HSDirs per day covers this. In v2, a descriptor is actually seen by 3 sets of HSDirs with probability 1/24, namely when the service address starts with 00-0B (assuming stats are collected from 00:00 UTC and relay clocks are accurate).
In v3, descriptors are seen by 2 HSDirs for half the day (starting at 00:00 plus the typical client consensus download delay of 1-2 hours) and by 1 HSDir for the other half. Not that we measure v3 yet.
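The 1/24 figure for v2 can be checked with a small model. The sketch below is only an illustration, not the metrics code: it assumes a 24-hour statistics interval, one descriptor-id rotation per service every 24 hours, and uploads for the next period starting 1 hour before each rotation. The function names are hypothetical.

```python
DAY = 24 * 3600   # length of a statistics interval and of a v2 descriptor-id period
OVERLAP = 3600    # assumption: next-period descriptors are uploaded 1 hour early


def descriptor_id_sets_seen(rotation_offset: int) -> int:
    """Count descriptor-id periods uploaded at some point during [0, DAY).

    rotation_offset is the time in seconds from the start of the statistics
    interval to the service's descriptor-id rotation (0 <= offset < DAY).
    The period starting at rotation k is uploaded from OVERLAP seconds
    before that rotation until the following rotation.
    """
    if not 0 <= rotation_offset < DAY:
        raise ValueError("rotation_offset must be within one day")
    seen = 0
    for k in range(-2, 3):
        upload_start = rotation_offset + k * DAY - OVERLAP
        upload_end = rotation_offset + (k + 1) * DAY
        # Half-open upload window [upload_start, upload_end) vs. interval [0, DAY)
        if upload_start < DAY and upload_end > 0:
            seen += 1
    return seen


def fraction_seen_by_three_sets(step: int = 60) -> float:
    """Share of rotation offsets (sampled every `step` seconds) seen under 3 ids."""
    offsets = range(0, DAY, step)
    three = sum(1 for r in offsets if descriptor_id_sets_seen(r) == 3)
    return three / len(offsets)
```

Under these assumptions, a service that rotates within the first hour of the interval is seen under three descriptor ids, and every other service under two, so `fraction_seen_by_three_sets()` comes out close to 1/24.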
I'll have to dive deeper into this topic, but here are some quick thoughts:
I don't think we're including anything from v3 in these statistics, but we'd have to ask asn and dgoulet to be certain.
I believe we're taking descriptor overlap periods into account for v2. See Section 5, "Extrapolating network totals" of the linked report: "As an approximation, we assume that a hidden service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different hidden-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica." And later in that section we say how this is just an approximation.
Do you think there's a defect in the v2 code?
And, independent of that question, is there anything in particular that we should keep in mind when extending this code to v3?
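For reference, the arithmetic behind the "twelve directories" approximation quoted from Section 5 can be written out as a minimal sketch. The function and parameter names are hypothetical and do not come from the report or the metrics code.

```python
def directories_per_day(replicas: int = 2,
                        hsdirs_per_replica: int = 3,
                        descriptor_ids_per_day: int = 2) -> int:
    """Number of directories a v2 descriptor is stored to in 24 hours.

    Defaults follow the approximation in the report: two replicas, three
    directories per replica, and two descriptor identifiers per replica
    within a 24-hour period.
    """
    return replicas * hsdirs_per_replica * descriptor_ids_per_day
```

With the defaults this gives 12. Passing `descriptor_ids_per_day=3` gives 18, which is the case of a service seen under three descriptor identifiers within one statistics interval.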
> I'll have to dive deeper into this topic, but here are some quick thoughts:
> I don't think we're including anything from v3 in these statistics, but we'd have to ask asn and dgoulet to be certain.
No, we're not. And perhaps we will end up collecting them using PrivCount in Tor.
> I believe we're taking descriptor overlap periods into account for v2. See Section 5, "Extrapolating network totals" of the linked report: "As an approximation, we assume that a hidden service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different hidden-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica." And later in that section we say how this is just an approximation.
> Do you think there's a defect in the v2 code?
Yes. In each 24-hour period, there is a 1-hour overlap where descriptors are posted to the current and next HSDirs. So services with addresses that correspond to the first or last hour (initial bytes 00-0B and F4-FF) can be seen at 6 or 18 directories, not 12. But this probably balances out over time.
I also wonder if you need to account for the 1-2 hour delay between a consensus being produced, and clients downloading and using it. But the variance is probably small.
> And, independent of that question, is there anything in particular that we should keep in mind when extending this code to v3?
There is an overlap for 12 hours per day, from when the client receives the 00:00 consensus, for 36 hours (that is, from approximately 01:00-02:00 for 36 hours).
The hash ring changes every 24 hours based on the SRV
You need the ed25519 relay ids from descriptors to calculate the hash ring (they're not in the consensus)
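As a starting point for the v3 work, the time period that determines when the hash ring position changes can be computed as sketched below. This assumes the default values as I understand them from rend-spec-v3 (a 1440-minute time period and a rotation offset of 12 hours); neither constant is stated in this ticket, and the function name is hypothetical.

```python
from datetime import datetime, timezone

TIME_PERIOD_MINUTES = 1440        # assumed default time period length
ROTATION_OFFSET_MINUTES = 12 * 60  # assumed default: periods rotate at 12:00 UTC


def v3_time_period(now: datetime) -> int:
    """Return the v3 time period number for a timezone-aware datetime."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    minutes_since_epoch = int(now.timestamp()) // 60
    return (minutes_since_epoch - ROTATION_OFFSET_MINUTES) // TIME_PERIOD_MINUTES


# Usage: the period number changes at 12:00 UTC, not at midnight.
before = v3_time_period(datetime(2016, 4, 13, 11, 15, 1, tzinfo=timezone.utc))
after = v3_time_period(datetime(2016, 4, 13, 12, 0, 0, tzinfo=timezone.utc))
```

This covers only the time period number. Computing the hash ring itself additionally needs the SRV and the ed25519 relay ids mentioned above, which this sketch does not attempt.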
Finally, I got it. (I didn't spend the whole 2 years thinking about this, but when I started looking at this ticket again this morning it took me a while to understand the bug...)
The situation is slightly different from your description, because statistics are not collected from 00:00 UTC but from whenever a relay starts collecting them. Your general statement that we're accounting for descriptor upload overlap incorrectly is correct, though.
My current thought is to document this inaccuracy rather than changing the code. It's a known inaccuracy of roughly 1/24 = 4.2% of absolute numbers, but it doesn't affect relative changes over time. I don't think that changing the code and reprocessing the statistics is worth the effort, especially since we would then have to explain why the numbers have changed.
As an approximation, we assume that an onion service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different onion-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica.
To be clear, this approximation is not entirely accurate. For example, the descriptors of roughly 1/24 of services are seen by 3 rather than 2 sets of onion-service directories, when a service changes descriptor identifiers once at the beginning of a relay's statistics interval and once again towards the end. In some cases, the two replicas or the descriptors with changed descriptor identifiers could have been stored to the same directory. As another example, onion-service directories might have joined or left the network, and other directories might have become responsible for storing a descriptor and therefore also include that .onion address in their statistics. However, for the subsequent analysis, we assume that none of these cases affects results substantially.
What do you think about this change?
I also agree that we should keep this in mind when we work on v3 stats. We should keep this ticket open, turn it into an enhancement, and update the summary a bit to make it clear that the remaining work is just for v3.
Trac: Status: new to needs_review; Keywords: metrics-2018 deleted, N/A added
Alright, I added that sentence to the Reproducible Metrics page. I also changed the ticket summary and turned the ticket into an enhancement for the remaining version 3 work. I guess the next step here is to wait for #23126 (moved) to be implemented.
Trac: Summary: "Onion address counts ignore descriptor upload overlap" to "Take descriptor upload overlap into account when estimating version 3 onion address counts"; Type: defect to enhancement; Status: needs_review to new