A given hidden service maps to six hsdirs (hidden service directories) at any given time, in terms of where it will publish its hsdesc (hidden service descriptor).
But due to churn in relays, and different perspectives on the network (since clients and the hidden service likely don't have the very latest consensus), there's a chance that the client will choose an hsdir that doesn't have the hsdesc.
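For concreteness, here is a rough sketch (Python, simplified; the exact descriptor-ID derivation from rend-spec is left out) of how the six responsible HSDirs fall out of a consensus: for each of the two descriptor replicas, take the three HSDir-flagged relays whose fingerprints follow the descriptor ID on the hash ring. The `descriptor_ids` argument (one ID per replica, assumed to be encoded the same way as the relay fingerprints) and the router attributes are assumptions about the caller, not an existing API.

```python
import bisect

def responsible_hsdirs(descriptor_ids, routers, spread=3):
    """Sketch: for each descriptor ID (one per replica), pick the `spread`
    HSDir-flagged relays whose fingerprints follow it on the hash ring.
    With two replicas and spread=3 this yields the six HSDirs."""
    # The ring is simply the sorted list of HSDir fingerprints.
    ring = sorted(r.fingerprint.lower() for r in routers if 'HSDir' in r.flags)
    chosen = set()
    for desc_id in descriptor_ids:
        i = bisect.bisect_right(ring, desc_id.lower())
        for j in range(spread):
            chosen.add(ring[(i + j) % len(ring)])
    return chosen
```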
It would be neat to look at the last few consensuses, and compute the union of all the hsdirs that might be in use, and then see which ones actually have the hsdesc.
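A minimal sketch of that check, reusing the hypothetical `responsible_hsdirs()` helper above; `have_descriptor(hsdir, onion)` is a placeholder for whatever probe we use to ask a given HSDir for the descriptor, not an existing function:

```python
def possibly_responsible(consensuses, descriptor_ids):
    """Union of every HSDir that any of the last few consensuses considers
    responsible for this service (consensuses: iterable of router lists)."""
    union = set()
    for routers in consensuses:
        union |= responsible_hsdirs(descriptor_ids, routers)
    return union

def presence_report(consensuses, descriptor_ids, onion, have_descriptor):
    """Map each possibly-responsible HSDir to whether it actually serves
    the descriptor right now."""
    return {hsdir: have_descriptor(hsdir, onion)
            for hsdir in possibly_responsible(consensuses, descriptor_ids)}
```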
That answer will give us a handle on whether things are going as we expect, and also give us a sense of whether the number "six" is overkill or underkill or what.
An alternate, perhaps parallel, approach would be to look at the historical metrics data involving relay churn (easy to see in the consensus) and relay restarts (where uptime resets) and calculate what the answer ought to be. And if our prediction is very different from reality, figure out why.
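As a starting point for that prediction, churn between consecutive consensuses is easy to compute once the HSDir sets have been extracted from the archived consensuses; a small sketch (the input format is an assumption):

```python
def hsdir_churn_rates(hsdir_sets):
    """Given an ordered list of HSDir fingerprint sets, one per hourly
    consensus (oldest first), return the fraction of HSDirs that had
    disappeared by the following consensus."""
    return [len(prev - cur) / len(prev)
            for prev, cur in zip(hsdir_sets, hsdir_sets[1:]) if prev]
```

From those per-hour rates we could estimate how often a client working from a slightly stale consensus should pick an HSDir that no longer has (or never received) the descriptor, and compare that prediction against what we actually observe.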
Also, running this program and/or analysis over time will tell us when the situation changes. (Right now we have no real way of noticing that, for example, network churn has increased to the point that hidden service descriptor fetches are failing more often.)
Another useful outcome would be to compare the current design to the following alternate design: hidden services publish to n hsdirs, and clients fetch from among k<n of them. In that design, hidden services overpublish to a few extra relays "on either side", to reduce the chance that relay churn will result in a relay that clients fetch from but that is unable to handle the request.
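To make the comparison concrete, here is a toy Monte Carlo sketch (Python; all parameters are invented for illustration, churn is assumed independent per relay, and the real replica/descriptor-ID machinery is ignored): the service publishes to `n_publish` relays starting `pad_before` positions before its point on the ring, the network churns, and a client with a newer consensus fetches from the `k_fetch` relays that are now responsible.

```python
import bisect
import random

def simulate_design(n_relays=3000, churn=0.02, n_publish=6, pad_before=0,
                    k_fetch=6, trials=10_000, rng=random):
    """Returns (average fraction of the client's k_fetch choices that lack
    the descriptor, fraction of trials where none of them has it)."""
    miss_frac, total_fail = 0.0, 0
    for _ in range(trials):
        ring = sorted(rng.random() for _ in range(n_relays))
        point = rng.random()
        i = bisect.bisect_right(ring, point)
        # Overpublish "on either side": start pad_before positions early.
        published = {ring[(i - pad_before + j) % n_relays]
                     for j in range(n_publish)}
        # Churn: some relays leave, an equal number of new ones join.
        survivors = [r for r in ring if rng.random() >= churn]
        new_ring = sorted(survivors +
                          [rng.random() for _ in range(n_relays - len(survivors))])
        i = bisect.bisect_right(new_ring, point)
        fetched = [new_ring[(i + j) % len(new_ring)] for j in range(k_fetch)]
        misses = sum(1 for r in fetched if r not in published)
        miss_frac += misses / k_fetch
        total_fail += (misses == k_fetch)
    return miss_frac / trials, total_fail / trials
```

Comparing, say, `simulate_design()` against `simulate_design(n_publish=8, pad_before=1)` would show whether the extra copies buy much at a given churn rate.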
I think the main question that remains to be answered here is #13208 (moved):
> What's the average number of hsdir fetches before we get the hsdesc?
> That is, how many HSDirs will a client try on average before they
> manage to fetch the descriptor? If the answer is close to 1.0, it
> means that node churn does not affect the hash ring much (and in
> theory we could even decrease the number of HSDirs per HS).
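As a rough back-of-the-envelope (assuming, strongly, that each of the six HSDirs misses the descriptor independently with the same probability p):

```python
def expected_fetches(p, k=6):
    """Expected HSDir fetches until the first success, assuming each of the
    k responsible HSDirs independently lacks the descriptor with probability
    p and the client gives up after k attempts: E = sum_{i=0}^{k-1} p**i."""
    return sum(p ** i for i in range(k))
```

So even a few percent of per-HSDir misses would keep the expectation very close to 1.0.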
Since this is an experiment, it might make sense to have our HS
health tool download consensuses like a normal client, instead of
always using the latest one.
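One cheap way to approximate that without implementing the full dir-spec download schedule (this is a deliberate simplification, not the real client algorithm): pretend each simulated client last refreshed its consensus at a random point in the recent past and is still using that one.

```python
import random

def consensus_in_use(consensuses_by_hour, now_hour, max_age_hours=3, rng=random):
    """Rough stand-in for "like a normal client": pick a consensus up to
    max_age_hours old instead of always the newest one.  The real refresh
    schedule lives in dir-spec; this is only good enough for the experiment."""
    return consensuses_by_hour[now_hour - rng.randint(0, max_age_hours)]
```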
If we wanted to learn the actual value here, we could do #15962 (moved) and
actually measure the churn effect using historical metrics data.
> Since this is an experiment, it might make sense to have our HS
> health tool download consensuses like a normal client, instead of
> always using the latest one.
If you're interested in normal-client performance, maybe Torperf is the better tool. I could provide you with logs from three Torperf instances that have been running for a couple of weeks now. Let me know.
Ok, I will close this ticket, but first here are some conclusions about this and possible future work. I'm attaching to this ticket the raw results, taken from May 29th, 2015 to June 14th, 2016. You can find the CSV file specification in https://gitlab.com/hs-health/hs-health/blob/master/analyze-csv.py#L93
This experiment showed us a few things. With a client always using the latest consensus, here are the results for the 6 stable .onion addresses we've monitored (output from analyze-csv.py).
Log health.csv period is from 29 May 2015 16:36:03 to 15 Jun 2016 00:08:15 (9175 hours) --> 2.721% failed fetch (3958/145435). On average, once we fail to fetch on a specific HSDir, the descriptor was missing for 01:14:31 (4471 seconds).

[+] wlupld3ptjvsgwqw.onion
    3.35% of failed fetch (913/27270) for an average time of 01:29:09 (5349 seconds)
    After first fail on an HSDir, we have 7.55 failed attempt(s) before success
    Churn happened 1.319% of the time (121 times)

[+] 3g2upl4pq6kufc4m.onion
    1.80% of failed fetch (524/29099) for an average time of 00:50:19 (3019 seconds)
    After first fail on an HSDir, we have 3.94 failed attempt(s) before success
    Churn happened 1.450% of the time (133 times)

[+] agorahooawayyfoe.onion
    5.07% of failed fetch (596/11744) for an average time of 01:21:02 (4862 seconds)
    After first fail on an HSDir, we have 6.77 failed attempt(s) before success
    Churn happened 0.959% of the time (88 times)

[+] 4cjw6cwpeaeppfqz.onion
    3.11% of failed fetch (886/28495) for an average time of 01:28:32 (5312 seconds)
    After first fail on an HSDir, we have 7.38 failed attempt(s) before success
    Churn happened 1.308% of the time (120 times)

[+] zti6p7h6spbtx5xr.onion
    3.05% of failed fetch (497/16289) for an average time of 01:18:47 (4727 seconds)
    After first fail on an HSDir, we have 6.54 failed attempt(s) before success
    Churn happened 0.828% of the time (76 times)

[+] facebookcorewwwi.onion
    1.90% of failed fetch (542/28580) for an average time of 01:04:11 (3851 seconds)
    After first fail on an HSDir, we have 4.93 failed attempt(s) before success
    Churn happened 1.199% of the time (110 times)
As we can see, it's pretty stable. The churn rate is very low and only ever affects a single HSDir out of the set of 6 (see the .csv results; it's not printed in the output). On average, a client with the latest consensus will fail to fetch the descriptor from one HSDir out of the six ~2.71% of the time.
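Plugging that ~2.71% per-HSDir miss rate into the expected-fetches estimate sketched earlier (and assuming, optimistically, independence between the six HSDirs) gives roughly 1 + 0.0271 + 0.0271^2 + ... ≈ 1.03 fetches on average, which suggests the answer to the #13208 question is indeed very close to 1.0.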
The number of fetches varies because, unfortunately, the tool is not entirely "stable": it sometimes crashed, and for some periods of time we went without fetching some .onion addresses while others were still running (Python threading is ... something ...).
Conclusion
This experiment is not ideal, as it only considers the latest consensus on the client side, which is not really how things work in practice. An improved version of this tool would basically run 12 clients, each with a consensus from a different hour, spanning 12 hours, and then use those to try to fetch the descriptor and note down churn and failures.
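A rough skeleton of what that could look like (the helpers `load_consensus()`, `responsible_hsdirs()` and `fetch_descriptor()` are hypothetical stand-ins for the real plumbing):

```python
import threading

def client_worker(age_hours, onions, results):
    """One simulated client pinned to a consensus that is age_hours old."""
    routers = load_consensus(age_hours)              # hypothetical loader
    for onion, descriptor_ids in onions.items():
        for hsdir in responsible_hsdirs(descriptor_ids, routers):
            ok = fetch_descriptor(hsdir, onion)      # hypothetical probe
            results.append((age_hours, onion, hsdir, ok))

def run_round(onions):
    """Run 12 clients in parallel, one per consensus hour over the last 12 hours."""
    results, threads = [], []
    for age in range(12):
        t = threading.Thread(target=client_worker, args=(age, onions, results))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results
```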
One key aspect of this tool is that once a fetch failed, it went into "recover mode", retrying every 15 minutes until the descriptor could be fetched again, which gives us the interesting statistics of how many attempts fail before success and how much time is spent waiting until success. This gets a bit more complicated with clients using different consensuses, because they need to update their consensus at some point, and deciding which consensus to update to (the latest, or 2 hours in the past, or ...) might affect the results, but it also creates LOTS of cases to test.
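The recover mode itself is just a retry loop; a minimal sketch of the bookkeeping (again with a placeholder `fetch_descriptor()` probe):

```python
import time

def recover(hsdir, onion, retry_seconds=15 * 60):
    """After a failed fetch, retry every 15 minutes until the descriptor can
    be fetched again; report how many retries failed and how long it took."""
    failed_attempts, start = 0, time.time()
    while True:
        time.sleep(retry_seconds)
        if fetch_descriptor(hsdir, onion):   # placeholder probe
            return failed_attempts, time.time() - start
        failed_attempts += 1
```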
A simpler but, I think, better version of this tool would, instead of taking the latest consensus all the time, simply use the normal tor client behavior and monitor the .onion with it. However, the HS client-side behavior has changed across tor stable versions and might change again, so this should be done for each maintained tor version, which would also show us any regression or performance improvement between them.
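For that variant, stem already exposes the normal client path, so something along these lines (a sketch; it assumes a stock tor client of whichever maintained version we're testing is listening on control port 9051) could drive the measurements:

```python
from stem.control import Controller

ONIONS = ['3g2upl4pq6kufc4m', 'facebookcorewwwi']  # two of the monitored services

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    for onion in ONIONS:
        try:
            desc = controller.get_hidden_service_descriptor(onion)
            print(onion, 'descriptor fetched (%d bytes)' % len(str(desc)))
        except Exception as exc:
            print(onion, 'fetch failed:', exc)
```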
Load-on-the-network considerations: it is all fun and well, but if we decide to improve this tool (or rewrite a new one), we should consider how much load it puts on the network. HS fetches aren't that heavy, but if you multiply this by 12 clients times 6 HSDirs and then run it every X minutes, let's not forget what it can do to the Guard in front.
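As a back-of-the-envelope with illustrative numbers: 12 clients × 6 monitored .onion addresses × 6 HSDirs is 432 descriptor fetches per round; run a round every 15 minutes and the Guard in front sees on the order of 1,700 extra fetches (and the circuits behind them) per hour, before counting any recover-mode retries.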
So, all things considered, there is much more room for improvement with this tool, and the results could be useful to have on our metrics website, but we need to make it a bit wiser and, at the very least, change it with 3.
Trac: Status: assigned to closed; Reviewer: N/A to N/A; Resolution: N/A to fixed; Severity: N/A to Normal