Opened 3 years ago

Last modified 5 months ago

#23367 new enhancement

Take descriptor upload overlap into account when estimating version 3 onion address counts

Reported by: teor
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Statistics
Version:
Severity: Normal
Keywords:
Cc: asn
Actual Points:
Parent ID: #23126
Points:
Reviewer:
Sponsor:

Description

Based on this tor metrics paper:
https://research.torproject.org/techreports/extrapolating-hidserv-stats-2015-01-31.pdf

We ignore descriptor upload overlap periods (they're not even mentioned in the paper).

During an overlap period, descriptors are published to twice as many
HSDirs (v2 & v3). If we ignore this, we will double-count:

  • v2: uploads and unique onion addresses and descriptor ids for 1 hour per day,
  • v3: uploads and unique descriptor ids for 12 hours per day.

I'm not sure if assuming that descriptors are seen by 2 sets of HSDirs per day covers this, because in v2 they are actually seen by 3 sets of HSDirs with probability 1/24, when the service address starts with 00-0B (assuming stats are collected from 00:00 UTC and relay clocks are accurate).

And in v3, for half the day (00:00 + typical client consensus download delay of 1-2 hours) they are seen by 2 HSDirs, and half the day they are seen by 1 HSDir. Not that we measure v3 yet.
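As a rough illustration of the averages described above, the time-weighted number of HSDir sets that see a descriptor per day can be sketched as follows. The function name and its parameters are hypothetical (they are not taken from any Tor or metrics code), and the sketch ignores the client consensus download delay and relay clock skew.

```python
from fractions import Fraction


def mean_hsdir_sets(overlap_hours, sets_outside_overlap=1,
                    sets_during_overlap=2, day_hours=24):
    """Average number of HSDir sets seeing a descriptor over one day.

    Hypothetical helper: weights the number of sets by the fraction of
    the day spent inside and outside the overlap period.
    """
    overlap = Fraction(overlap_hours, day_hours)
    return (sets_outside_overlap * (1 - overlap)
            + sets_during_overlap * overlap)


# v3 as described above: 2 sets for half the day, 1 set for the other half.
print(mean_hsdir_sets(12))        # 3/2
# v2 as described above: 3 sets instead of 2 with weight 1/24.
print(mean_hsdir_sets(1, 2, 3))   # 49/24
```

Under these assumptions the v3 average is 1.5 sets per day rather than a whole number, which is why a fixed "2 sets per day" assumption may not carry over from v2.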

Change History (9)

comment:1 Changed 3 years ago by karsten

I'll have to dive deeper into this topic, but here are some quick thoughts:

  • I don't think we're including anything from v3 in these statistics, but we'd have to ask asn and dgoulet to be certain.
  • I believe we're taking descriptor overlap periods into account for v2. See Section 5, "Extrapolating network totals" of the linked report: "As an approximation, we assume that a hidden service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different hidden-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica." And later in that section we say how this is just an approximation.
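The "twelve directories" figure quoted from the report is a product of three factors. A minimal sketch of that arithmetic follows; the constant and function names are made up for illustration, and the division step is a simplification that assumes observations from all directories have been summed (the report's actual extrapolation works from the fraction of the hash ring that reporting relays cover).

```python
# Factors from the quoted approximation (names are hypothetical).
REPLICAS = 2                 # two replicas per descriptor
HSDIRS_PER_REPLICA = 3       # each replica stored on three directories
DESCRIPTOR_IDS_PER_DAY = 2   # identifiers change once per 24 hours


def directories_per_day(replicas=REPLICAS,
                        hsdirs_per_replica=HSDIRS_PER_REPLICA,
                        ids_per_day=DESCRIPTOR_IDS_PER_DAY):
    """Number of directories assumed to see one service in 24 hours."""
    return replicas * hsdirs_per_replica * ids_per_day


def estimate_unique_services(summed_observations):
    """Simplified: divide network-wide observations by the multiplier."""
    return summed_observations / directories_per_day()


print(directories_per_day())            # 12
print(estimate_unique_services(1200))   # 100.0
```

If a service is actually seen by 6 or 18 directories during a statistics interval, the fixed divisor of 12 under- or over-counts that service accordingly.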

Do you think there's a defect in the v2 code?

And, independent of that question, is there anything in particular that we should keep in mind when extending this code to v3?

Thanks!

comment:2 in reply to:  1 Changed 3 years ago by teor

Replying to karsten:

I'll have to dive deeper into this topic, but here are some quick thoughts:

  • I don't think we're including anything from v3 in these statistics, but we'd have to ask asn and dgoulet to be certain.

No, we're not. And perhaps we will end up collecting them using PrivCount in Tor.

  • I believe we're taking descriptor overlap periods into account for v2. See Section 5, "Extrapolating network totals" of the linked report: "As an approximation, we assume that a hidden service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different hidden-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica." And later in that section we say how this is just an approximation.

Do you think there's a defect in the v2 code?

Yes. In each 24-hour period, there is a 1-hour overlap where descriptors are posted to the current and next HSDirs. So services with addresses that correspond to the first or last hour (initial bytes 00-0B and F4-FF) can be seen at 6 or 18 directories, not 12. But this probably balances out over time.
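To show why the initial address byte matters here, the sketch below computes when a v2 service rotates its descriptor identifiers. It assumes the time-period formula as I recall it from the v2 rendezvous specification, time-period = (current-time + permanent-id-byte * 86400 / 256) / 86400; that formula is not stated in this ticket, so treat it as an assumption. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def v2_time_period(now, first_id_byte):
    """Time period number for a v2 service (assumed rend-spec formula)."""
    return (now + first_id_byte * SECONDS_PER_DAY // 256) // SECONDS_PER_DAY


def rotation_second_of_day(first_id_byte):
    """Second of the UTC day at which the time period number changes."""
    offset = first_id_byte * SECONDS_PER_DAY // 256
    return (SECONDS_PER_DAY - offset) % SECONDS_PER_DAY


# A service whose permanent id starts with 0x00 rotates at 00:00 UTC,
# one starting with 0x80 rotates at 12:00 UTC.
print(rotation_second_of_day(0x00))   # 0
print(rotation_second_of_day(0x80))   # 43200
```

Under that assumption, rotation times are spread evenly over the day by the first byte, so only services whose rotation falls close to the start or end of a statistics interval are affected.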

This is how I fixed it in experimental PrivCount (there might be bugs):
https://github.com/privcount/privcount/pull/423/commits/4f1fb9191c9f3c5dc0ccbfe43c2b021a213a0c78

I also wonder if you need to account for the 1-2 hour delay between a consensus being produced, and clients downloading and using it. But the variance is probably small.

And, independent of that question, is there anything in particular that we should keep in mind when extending this code to v3?

  • There is an overlap for 12 hours per day, from when the client receives the 0000 consensus, for 36 hours (that is, approximately 0100-0200 for 36 hours)
  • The hash ring changes every 24 hours based on the SRV
  • You need the ed25519 relay ids from descriptors to calculate the hash ring (they're not in the consensus)

There are a few more minor things that affect v2 and v3. I added a list to experimental PrivCount's position weights script:
https://github.com/privcount/privcount/pull/423/commits/e4d5786469b12781a10b1c875d9228d65a17b2d9#diff-a5cebcf3ce45960e58426e68588e82e1R41

comment:3 Changed 3 years ago by teor

Parent ID: #23126

Let's think about all the v3 stats in the same place: this ticket is for metrics, the parent #23126 is for Core Tor.

comment:4 Changed 3 years ago by karsten

Component: Metrics/Website → Metrics/Statistics

Moving all tickets to Metrics/Statistics that are more related to the data-aggregating modules than to the website parts of metrics-web.

comment:5 Changed 3 years ago by asn

Cc: asn added

comment:6 Changed 3 years ago by karsten

Keywords: metrics-2018 added

comment:7 Changed 5 months ago by karsten

Keywords: metrics-2018 removed
Status: new → needs_review

Finally, I got it. (I didn't spend the whole 2 years thinking about this, but when I started looking at this ticket again this morning it took me a while to understand the bug...)

The situation is slightly different from your description, because statistics are not collected from 00:00 UTC but from whenever a relay starts collecting them. Your general statement that we're accounting for descriptor upload overlap incorrectly is right, though.

My current thought is to document this inaccuracy rather than changing the code. It's a known inaccuracy of roughly 1/24 = 4.2% of absolute numbers. But it doesn't affect relative changes over time. I don't think that changing the code and reprocessing the statistics is worth the effort, especially since we would then also have to explain why the numbers have changed.
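The point about relative changes can be checked with a small sketch: if every estimate is off by the same constant factor, ratios between two days are unchanged. The 1/24 figure is the one quoted above; modelling it as a constant multiplicative bias is an assumption of this sketch, and the function names are hypothetical.

```python
from fractions import Fraction

BIAS = Fraction(1, 24)  # inaccuracy quoted above, as a fraction


def biased(true_count, bias=BIAS):
    """Estimate inflated by a constant relative bias (assumed model)."""
    return Fraction(true_count) * (1 + bias)


def relative_change(old, new):
    """Relative change from old to new, as an exact fraction."""
    return (Fraction(new) - Fraction(old)) / Fraction(old)


print(round(float(BIAS) * 100, 1))                    # 4.2
print(relative_change(1000, 1100))                    # 1/10
print(relative_change(biased(1000), biased(1100)))    # 1/10
```

This only holds as long as the bias really is constant over time; if the share of affected services changed, relative changes would shift as well.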

Here's how we could document this on the Reproducible Metrics page:

As an approximation, we assume that an onion service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different onion-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica.

To be clear, this approximation is not entirely accurate. For example, the descriptors of roughly 1/24 of services are seen by 3 rather than 2 sets of onion-service directories, when a service changes descriptor identifiers once at the beginning of a relay's statistics interval and once again towards the end. In some cases, the two replicas or the descriptors with changed descriptor identifiers could have been stored to the same directory. As another example, onion-service directories might have joined or left the network and other directories might have become responsible for storing a descriptor which also include that .onion address in their statistics. However, for the subsequent analysis, we assume that none of these cases affects results substantially.

What do you think about this change?

I also agree that we should keep this in mind when we work on v3 stats. We should keep this ticket open, turn it into an enhancement, and update the summary a bit to make it clear that the remaining work is just for v3.

comment:8 Changed 5 months ago by teor

I think it's a good idea to document small inaccuracies.

I also believe that churn (on both the service and HSDir sides) is likely to outweigh the impact of this inaccuracy.

comment:9 Changed 5 months ago by karsten

Status: needs_review → new
Summary: Onion address counts ignore descriptor upload overlap → Take descriptor upload overlap into account when estimating version 3 onion address counts
Type: defect → enhancement

Alright, I added that sentence to the Reproducible Metrics page. And I changed the ticket summary and turned the ticket into an enhancement for the remaining version 3 work. I guess the next step here is to wait for #23126 to be implemented.
