A given hidden service maps to six hsdirs (hidden service directories) at any given time, in terms of where it will publish its hsdesc (hidden service descriptor).
But due to churn in relays, and different perspectives on the network (since clients and the hidden service likely don't have the very latest consensus), there's a chance that the client will choose an hsdir that doesn't have the hsdesc.
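For concreteness, here is a rough sketch (Python, simplified; the exact descriptor-ID derivation from rend-spec is left out) of how the six responsible HSDirs fall out of a consensus: for each of the two descriptor replicas, take the three HSDir-flagged relays whose fingerprints follow the descriptor ID on the hash ring. The `descriptor_ids` argument (one ID per replica, assumed to be encoded the same way as the relay fingerprints) and the router attributes are assumptions about the caller, not an existing API.

```python
import bisect

def responsible_hsdirs(descriptor_ids, routers, spread=3):
    """Sketch: for each descriptor ID (one per replica), pick the `spread`
    HSDir-flagged relays whose fingerprints follow it on the hash ring.
    With two replicas and spread=3 this yields the six HSDirs."""
    # The ring is simply the sorted list of HSDir fingerprints.
    ring = sorted(r.fingerprint.lower() for r in routers if 'HSDir' in r.flags)
    chosen = set()
    for desc_id in descriptor_ids:
        i = bisect.bisect_right(ring, desc_id.lower())
        for j in range(spread):
            chosen.add(ring[(i + j) % len(ring)])
    return chosen
```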
It would be neat to look at the last few consensuses, and compute the union of all the hsdirs that might be in use, and then see which ones actually have the hsdesc.
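A minimal sketch of that check, reusing the hypothetical `responsible_hsdirs()` helper above; `have_descriptor(hsdir, onion)` is a placeholder for whatever probe we use to ask a given HSDir for the descriptor, not an existing function:

```python
def possibly_responsible(consensuses, descriptor_ids):
    """Union of every HSDir that any of the last few consensuses considers
    responsible for this service (consensuses: iterable of router lists)."""
    union = set()
    for routers in consensuses:
        union |= responsible_hsdirs(descriptor_ids, routers)
    return union

def presence_report(consensuses, descriptor_ids, onion, have_descriptor):
    """Map each possibly-responsible HSDir to whether it actually serves
    the descriptor right now."""
    return {hsdir: have_descriptor(hsdir, onion)
            for hsdir in possibly_responsible(consensuses, descriptor_ids)}
```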
That answer will give us a handle on whether things are going as we expect, and also give us a sense of whether the number "six" is overkill or underkill or what.
An alternate, perhaps parallel, approach would be to look at the historical metrics data involving relay churn (easy to see in the consensus) and relay restarts (where uptime resets) and calculate what the answer ought to be. And if our prediction is very different from reality, figure out why.
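As a starting point for that prediction, churn between consecutive consensuses is easy to compute once the HSDir sets have been extracted from the archived consensuses; a small sketch (the input format is an assumption):

```python
def hsdir_churn_rates(hsdir_sets):
    """Given an ordered list of HSDir fingerprint sets, one per hourly
    consensus (oldest first), return the fraction of HSDirs that had
    disappeared by the following consensus."""
    return [len(prev - cur) / len(prev)
            for prev, cur in zip(hsdir_sets, hsdir_sets[1:]) if prev]
```

From those per-hour rates we could estimate how often a client working from a slightly stale consensus should pick an HSDir that no longer has (or never received) the descriptor, and compare that prediction against what we actually observe.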
Also, running this program and/or analysis over time will tell us when the situation changes. (Right now we have no real way of noticing that, for example, network churn has increased to the point that hidden service descriptor fetches are failing more often.)
Another useful outcome would be to compare the current design to the following alternate design: hidden services publish to n hsdirs, and clients fetch from among k<n of them. In that design, hidden services overpublish to a few extra relays "on either side", to reduce the chance that relay churn will result in a relay that clients fetch from but that is unable to handle the request.
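To make the comparison concrete, here is a toy Monte Carlo sketch (Python; all parameters are invented for illustration, churn is assumed independent per relay, and the real replica/descriptor-ID machinery is ignored): the service publishes to `n_publish` relays starting `pad_before` positions before its point on the ring, the network churns, and a client with a newer consensus fetches from the `k_fetch` relays that are now responsible.

```python
import bisect
import random

def simulate_design(n_relays=3000, churn=0.02, n_publish=6, pad_before=0,
                    k_fetch=6, trials=10_000, rng=random):
    """Returns (average fraction of the client's k_fetch choices that lack
    the descriptor, fraction of trials where none of them has it)."""
    miss_frac, total_fail = 0.0, 0
    for _ in range(trials):
        ring = sorted(rng.random() for _ in range(n_relays))
        point = rng.random()
        i = bisect.bisect_right(ring, point)
        # Overpublish "on either side": start pad_before positions early.
        published = {ring[(i - pad_before + j) % n_relays]
                     for j in range(n_publish)}
        # Churn: some relays leave, an equal number of new ones join.
        survivors = [r for r in ring if rng.random() >= churn]
        new_ring = sorted(survivors +
                          [rng.random() for _ in range(n_relays - len(survivors))])
        i = bisect.bisect_right(new_ring, point)
        fetched = [new_ring[(i + j) % len(new_ring)] for j in range(k_fetch)]
        misses = sum(1 for r in fetched if r not in published)
        miss_frac += misses / k_fetch
        total_fail += (misses == k_fetch)
    return miss_frac / trials, total_fail / trials
```

Comparing, say, `simulate_design()` against `simulate_design(n_publish=8, pad_before=1)` would show whether the extra copies buy much at a given churn rate.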
I think the main question that remains to be answered here is #13208 (moved):
> What's the average number of hsdir fetches before we get the hsdesc?
> That is, how many HSDirs will a client try on average before they
> manage to fetch the descriptor? If the answer is close to 1.0, it
> means that node churn does not affect the hash ring much (and in
> theory we could even decrease the number of HSDirs per HS).
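As a rough back-of-the-envelope (assuming, strongly, that each of the six HSDirs misses the descriptor independently with the same probability p):

```python
def expected_fetches(p, k=6):
    """Expected HSDir fetches until the first success, assuming each of the
    k responsible HSDirs independently lacks the descriptor with probability
    p and the client gives up after k attempts: E = sum_{i=0}^{k-1} p**i."""
    return sum(p ** i for i in range(k))
```

So even a few percent of per-HSDir misses would keep the expectation very close to 1.0.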
Since this is an experiment, it might make sense to have our HS
health tool download consensuses like a normal client, instead of
always using the latest one.
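One cheap way to approximate that without implementing the full dir-spec download schedule (this is a deliberate simplification, not the real client algorithm): pretend each simulated client last refreshed its consensus at a random point in the recent past and is still using that one.

```python
import random

def consensus_in_use(consensuses_by_hour, now_hour, max_age_hours=3, rng=random):
    """Rough stand-in for "like a normal client": pick a consensus up to
    max_age_hours old instead of always the newest one.  The real refresh
    schedule lives in dir-spec; this is only good enough for the experiment."""
    return consensuses_by_hour[now_hour - rng.randint(0, max_age_hours)]
```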
If we wanted to learn the actual value here, we could do #15962 (moved) and
actually measure the churn effect using historical metrics data.
> Since this is an experiment, it might make sense to have our HS
> health tool download consensuses like a normal client, instead of
> always using the latest one.
If you're interested in normal-client performance, maybe Torperf is the better tool. I could provide you with logs from three Torperf instances that have been running for a couple of weeks now. Let me know.
Ok, I will close this ticket, but first here are some conclusions about this and possible future work. I'm attaching to this ticket the raw results, taken from May 29th, 2015 to June 14th, 2016. You can find the CSV file specification in https://gitlab.com/hs-health/hs-health/blob/master/analyze-csv.py#L93
This experiment showed us a few things. With a client always using the latest consensus, here are the results for the 6 stable .onion addresses we've monitored (output from analyze-csv.py).
Log health.csv period is from 29 May 2015 16:36:03 to 15 Jun 2016 00:08:15 (9175 hours) --> 2.721% failed fetch (3958/145435). On average, once we fail to fetch on a specific HSDir, the descriptor was missing for 01:14:31 (4471 seconds).

[+] wlupld3ptjvsgwqw.onion
    3.35% of failed fetch (913/27270) for an average time of 01:29:09 (5349 seconds)
    After first fail on an HSDir, we have 7.55 failed attempt(s) before success
    Churn happened 1.319% of the time (121 times)

[+] 3g2upl4pq6kufc4m.onion
    1.80% of failed fetch (524/29099) for an average time of 00:50:19 (3019 seconds)
    After first fail on an HSDir, we have 3.94 failed attempt(s) before success
    Churn happened 1.450% of the time (133 times)

[+] agorahooawayyfoe.onion
    5.07% of failed fetch (596/11744) for an average time of 01:21:02 (4862 seconds)
    After first fail on an HSDir, we have 6.77 failed attempt(s) before success
    Churn happened 0.959% of the time (88 times)

[+] 4cjw6cwpeaeppfqz.onion
    3.11% of failed fetch (886/28495) for an average time of 01:28:32 (5312 seconds)
    After first fail on an HSDir, we have 7.38 failed attempt(s) before success
    Churn happened 1.308% of the time (120 times)

[+] zti6p7h6spbtx5xr.onion
    3.05% of failed fetch (497/16289) for an average time of 01:18:47 (4727 seconds)
    After first fail on an HSDir, we have 6.54 failed attempt(s) before success
    Churn happened 0.828% of the time (76 times)

[+] facebookcorewwwi.onion
    1.90% of failed fetch (542/28580) for an average time of 01:04:11 (3851 seconds)
    After first fail on an HSDir, we have 4.93 failed attempt(s) before success
    Churn happened 1.199% of the time (110 times)
As we can see, it's pretty stable. The churn rate is very low and only ever affects a single HSDir out of the set of 6 (see the .csv results; it's not printed in the output). On average, a client with the latest consensus will fail to fetch the descriptor from one HSDir out of the six ~2.71% of the time.
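Plugging that ~2.71% per-HSDir miss rate into the expected-fetches estimate sketched earlier (and assuming, optimistically, independence between the six HSDirs) gives roughly 1 + 0.0271 + 0.0271^2 + ... ≈ 1.03 fetches on average, which suggests the answer to the #13208 question is indeed very close to 1.0.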
The number of fetches varies because, unfortunately, the tool is not entirely "stable": it sometimes crashed, and for some periods of time we went without fetching some .onion addresses while others were still running (Python threading is ... something ...).
Conclusion
This experiment is not ideal, as it only considers the latest consensus on the client side, which is not really how things work in practice. An improved version of this tool would basically run 12 clients, each with a consensus from a different hour, spanning 12 hours, and then use those to try to fetch the descriptor and note down churn and failures.
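A rough skeleton of what that could look like (the helpers `load_consensus()`, `responsible_hsdirs()` and `fetch_descriptor()` are hypothetical stand-ins for the real plumbing):

```python
import threading

def client_worker(age_hours, onions, results):
    """One simulated client pinned to a consensus that is age_hours old."""
    routers = load_consensus(age_hours)              # hypothetical loader
    for onion, descriptor_ids in onions.items():
        for hsdir in responsible_hsdirs(descriptor_ids, routers):
            ok = fetch_descriptor(hsdir, onion)      # hypothetical probe
            results.append((age_hours, onion, hsdir, ok))

def run_round(onions):
    """Run 12 clients in parallel, one per consensus hour over the last 12 hours."""
    results, threads = [], []
    for age in range(12):
        t = threading.Thread(target=client_worker, args=(age, onions, results))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results
```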
One key aspect of this tool is that once a fetch failed, it went into "recover mode", retrying every 15 minutes until the descriptor could be fetched again, which gives us the interesting statistics of how many attempts fail before success and how much time is spent waiting until success. This gets a bit more complicated with clients using different consensuses, because they need to update their consensus at some point, and deciding which consensus to update to (the latest, or 2 hours in the past, or ...) might affect the results, but it also creates LOTS of cases to test.
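The recover mode itself is just a retry loop; a minimal sketch of the bookkeeping (again with a placeholder `fetch_descriptor()` probe):

```python
import time

def recover(hsdir, onion, retry_seconds=15 * 60):
    """After a failed fetch, retry every 15 minutes until the descriptor can
    be fetched again; report how many retries failed and how long it took."""
    failed_attempts, start = 0, time.time()
    while True:
        time.sleep(retry_seconds)
        if fetch_descriptor(hsdir, onion):   # placeholder probe
            return failed_attempts, time.time() - start
        failed_attempts += 1
```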
A simpler but, I think, better version of this tool would, instead of taking the latest consensus all the time, simply use the normal tor client behavior and monitor the .onion with it. However, the HS client-side behavior has changed across tor stable versions and might change again, so this should be done for each maintained tor version, which would also show us any regression or performance improvement between them.
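For that variant, stem already exposes the normal client path, so something along these lines (a sketch; it assumes a stock tor client of whichever maintained version we're testing is listening on control port 9051) could drive the measurements:

```python
from stem.control import Controller

ONIONS = ['3g2upl4pq6kufc4m', 'facebookcorewwwi']  # two of the monitored services

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    for onion in ONIONS:
        try:
            desc = controller.get_hidden_service_descriptor(onion)
            print(onion, 'descriptor fetched (%d bytes)' % len(str(desc)))
        except Exception as exc:
            print(onion, 'fetch failed:', exc)
```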
Load-on-the-network considerations: it is all fun and well, but if we decide to improve this tool (or rewrite a new one), we should consider how much load it puts on the network. HS fetches aren't that heavy, but if you multiply this by 12 clients times 6 HSDirs and then run it every X minutes, let's not forget what it can do to the Guard in front.
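As a back-of-the-envelope with illustrative numbers: 12 clients × 6 monitored .onion addresses × 6 HSDirs is 432 descriptor fetches per round; run a round every 15 minutes and the Guard in front sees on the order of 1,700 extra fetches (and the circuits behind them) per hour, before counting any recover-mode retries.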
So, all things considered, there is much more room for improvement with this tool, and the results could be useful to have on our metrics website, but we need to make it a bit wiser and, at the very least, change it with 3.
Trac: Status: assigned to closed; Reviewer: N/A to N/A; Resolution: N/A to fixed; Severity: N/A to Normal