Opened 2 years ago

Last modified 2 years ago

#23367 new defect

Onion address counts ignore descriptor upload overlap

Reported by: teor
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Statistics
Version:
Severity: Normal
Keywords: metrics-2018
Cc: asn
Actual Points:
Parent ID: #23126
Points:
Reviewer:
Sponsor:

Description

Based on this Tor Metrics technical report:
https://research.torproject.org/techreports/extrapolating-hidserv-stats-2015-01-31.pdf

We ignore descriptor upload overlap periods (they're not even mentioned in the paper).

During an overlap period, descriptors are published to twice as many HSDirs (v2 & v3). If we ignore this, we will double-count:

  • v2: uploads and unique onion addresses and descriptor ids for 1 hour per day,
  • v3: uploads and unique descriptor ids for 12 hours per day.

I'm not sure if assuming that descriptors are seen by 2 sets of HSDirs per day covers this, because in v2 they are actually seen by 3 sets of HSDirs with probability 1/24, when the service address starts with 00-0B (assuming stats are collected from 00:00 UTC and relay clocks are accurate).
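A rough way to check the 1/24 figure is to count, for each possible first byte of the permanent ID, how many distinct descriptor-ID periods a service uploads during one 24-hour measurement window. The sketch below assumes the v2 time-period formula as I read it in rend-spec-v2 (`time-period = (current-time + permanent-id-byte * 86400 / 256) / 86400`) and a 1-hour overlap in which the next period's descriptors are also uploaded. The function names are made up for illustration, and relay clocks are assumed to be accurate.

```python
from collections import Counter

PERIOD = 24 * 60 * 60   # v2 descriptor ID validity, in seconds
OVERLAP = 60 * 60       # assumed overlap before each rotation, in seconds


def time_period(now, first_id_byte):
    """Time period number for a service whose permanent ID starts with first_id_byte."""
    return (now + first_id_byte * PERIOD // 256) // PERIOD


def seconds_valid(now, first_id_byte):
    """Seconds until this service's descriptor IDs rotate."""
    return PERIOD - (now + first_id_byte * PERIOD // 256) % PERIOD


def periods_uploaded(start, end, first_id_byte):
    """Count distinct time periods uploaded during [start, end)."""
    first = time_period(start, first_id_byte)
    last = time_period(end - 1, first_id_byte)
    # In the overlap, the next period's descriptors are uploaded as well.
    if seconds_valid(end - 1, first_id_byte) < OVERLAP:
        last += 1
    return last - first + 1


def histogram(start=0, length=PERIOD):
    """Map 'number of periods seen' to 'number of first-byte values'."""
    return Counter(periods_uploaded(start, start + length, b) for b in range(256))


if __name__ == "__main__":
    for sets, count in sorted(histogram().items()):
        print(f"{sets} sets of HSDirs: {count}/256 first-byte values")
```

With a window that starts at 00:00 UTC, the share of first-byte values in the three-set case comes out close to 1/24 under these assumptions. Which byte values those are depends on the formula and on where the window starts.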

And in v3, for half the day (starting at 00:00 plus the typical client consensus download delay of 1-2 hours) they are seen by 2 sets of HSDirs, and for the other half they are seen by 1 set. Not that we measure v3 yet.

Change History (6)

comment:1 Changed 2 years ago by karsten

I'll have to dive deeper into this topic, but here are some quick thoughts:

  • I don't think we're including anything from v3 in these statistics, but we'd have to ask asn and dgoulet to be certain.
  • I believe we're taking descriptor overlap periods into account for v2. See Section 5, "Extrapolating network totals" of the linked report: "As an approximation, we assume that a hidden service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different hidden-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica." And later in that section we say how this is just an approximation.

Do you think there's a defect in the v2 code?

And, independent of that question, is there anything in particular that we should keep in mind when extending this code to v3?

Thanks!

comment:2 in reply to:  1 Changed 2 years ago by teor

Replying to karsten:

I'll have to dive deeper into this topic, but here are some quick thoughts:

  • I don't think we're including anything from v3 in these statistics, but we'd have to ask asn and dgoulet to be certain.

No, we're not. And perhaps we will end up collecting them using PrivCount in Tor.

  • I believe we're taking descriptor overlap periods into account for v2. See Section 5, "Extrapolating network totals" of the linked report: "As an approximation, we assume that a hidden service publishes its descriptor to twelve directories over a 24-hour period: the service stores two replicas per descriptor using different descriptor identifiers, both descriptor replicas get stored to three different hidden-service directories each, and the service changes descriptor identifiers once every 24 hours which leads to two different descriptor identifiers per replica." And later in that section we say how this is just an approximation.

Do you think there's a defect in the v2 code?

Yes. In each 24-hour period, there is a 1-hour overlap where descriptors are posted to the current and next HSDirs. So services with addresses that correspond to the first or last hour (initial bytes 00-0B and F4-FF) can be seen at 6 or 18 directories, not 12. But this probably balances out over time.
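To make the 6/12/18 arithmetic explicit: the report's twelve comes from 2 replicas times 3 HSDirs per replica times 2 sets of descriptor IDs per day. A minimal sketch, with a helper name made up for this comment, and assuming the sets of HSDirs never coincide:

```python
REPLICAS = 2            # descriptor replicas per v2 service
HSDIRS_PER_REPLICA = 3  # directories each replica is stored on


def directories_seen(descriptor_id_sets):
    """Directories that see a service, given how many sets of descriptor IDs it used."""
    return REPLICAS * HSDIRS_PER_REPLICA * descriptor_id_sets


if __name__ == "__main__":
    for sets in (1, 2, 3):
        print(sets, "set(s) of descriptor IDs ->", directories_seen(sets), "directories")
```

An extrapolation that always divides by 12 would therefore overweight or underweight services that used 1 or 3 sets within the measurement window.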

This is how I fixed it in experimental PrivCount (there might be bugs):
https://github.com/privcount/privcount/pull/423/commits/4f1fb9191c9f3c5dc0ccbfe43c2b021a213a0c78

I also wonder if you need to account for the 1-2 hour delay between a consensus being produced, and clients downloading and using it. But the variance is probably small.

And, independent of that question, is there anything in particular that we should keep in mind when extending this code to v3?

  • There is an overlap for 12 hours per day, from when the client receives the 0000 consensus, for 36 hours (that is, approximately 0100-0200 for 36 hours)
  • The hash ring changes every 24 hours based on the SRV
  • You need the ed25519 relay ids from descriptors to calculate the hash ring (they're not in the consensus)
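For the hash ring, here is a minimal sketch of how relay positions might be computed, going by my reading of rend-spec-v3 (`hsdir_index = H("node-idx" | node_identity | shared_random_value | INT_8(period_num) | INT_8(period_length))`, with H being SHA3-256 and INT_8 an 8-byte big-endian integer). The function names, the dummy relay ids, and the period numbers are placeholders, and the formula should be checked against the spec before relying on it.

```python
import hashlib
import struct

PERIOD_LENGTH = 1440  # time period length in minutes (assumed default)


def hsdir_index(ed25519_id, srv, period_num, period_length=PERIOD_LENGTH):
    """Position of one relay on the v3 hash ring (assumed formula, see above)."""
    data = (b"node-idx" + ed25519_id + srv
            + struct.pack(">Q", period_num)
            + struct.pack(">Q", period_length))
    return hashlib.sha3_256(data).digest()


def hash_ring(ed25519_ids, srv, period_num):
    """Relay ids sorted by ring position for one SRV and time period."""
    return sorted(ed25519_ids, key=lambda rid: hsdir_index(rid, srv, period_num))


if __name__ == "__main__":
    relays = [bytes([i]) * 32 for i in range(5)]  # dummy 32-byte ed25519 ids
    srv_today, srv_tomorrow = b"\x01" * 32, b"\x02" * 32
    # Compare the relay order under two different SRVs and time periods.
    print(hash_ring(relays, srv_today, 17000) == hash_ring(relays, srv_tomorrow, 17001))
```

Because the SRV and the period number both feed into every relay's index, the statistics code would need the SRV and the ed25519 ids that were current for each period it extrapolates.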

There are a few more minor things that affect v2 and v3. I added a list to experimental PrivCount's position weights script:
https://github.com/privcount/privcount/pull/423/commits/e4d5786469b12781a10b1c875d9228d65a17b2d9#diff-a5cebcf3ce45960e58426e68588e82e1R41

comment:3 Changed 2 years ago by teor

Parent ID: #23126

Let's think about all the v3 stats in the same place: this ticket is for metrics, the parent #23126 is for Core Tor.

comment:4 Changed 2 years ago by karsten

Component: Metrics/Website → Metrics/Statistics

Moving to Metrics/Statistics all tickets that are related more to the data-aggregating modules than to the website parts of metrics-web.

comment:5 Changed 2 years ago by asn

Cc: asn added

comment:6 Changed 2 years ago by karsten

Keywords: metrics-2018 added