We've been getting lots of reports about poor reachability of onion services (e.g. #28730), in particular the v3 ones.
We need a tool that we can use to evaluate and monitor the health of onion services. We should use it to verify how reachable and stable onions are, and also as a benchmark for how their stability changes over time.
A relevant ticket here is #13209, which we can leverage in the future.
One way to write such a tool is to give it an onion service address. The tool fetches the service's descriptor from every HSDir, then introduces itself at every intro point and makes sure that rendezvous can occur. It then repeats this over time to find reachability issues.
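A minimal sketch of that check loop, for illustration only. The names `check_onion`, `fetch_descriptor`, and `rendezvous_via` are hypothetical: the two callables stand in for whatever mechanism (control port, tracing, something else) ends up doing the real work, and nothing here is an existing tor or stem API.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class HealthReport:
    onion_address: str
    hsdirs_ok: List[str] = field(default_factory=list)
    hsdirs_failed: List[str] = field(default_factory=list)
    intro_points_ok: List[str] = field(default_factory=list)
    intro_points_failed: List[str] = field(default_factory=list)

    @property
    def reachable(self) -> bool:
        # Assumption: "reachable" means at least one HSDir served the
        # descriptor and at least one intro point led to a rendezvous.
        return bool(self.hsdirs_ok) and bool(self.intro_points_ok)


def check_onion(
    onion_address: str,
    hsdirs: Iterable[str],
    fetch_descriptor: Callable[[str, str], Iterable[str]],
    rendezvous_via: Callable[[str, str], bool],
) -> HealthReport:
    """Probe every HSDir, then every intro point found in the descriptors.

    fetch_descriptor(hsdir, address) is a hypothetical callable returning
    the intro points listed in the descriptor, or raising on failure.
    rendezvous_via(intro_point, address) is a hypothetical callable
    returning True if the rendezvous completed.
    """
    report = HealthReport(onion_address)
    intro_points = set()

    for hsdir in hsdirs:
        try:
            intro_points.update(fetch_descriptor(hsdir, onion_address))
            report.hsdirs_ok.append(hsdir)
        except Exception:
            report.hsdirs_failed.append(hsdir)

    for intro_point in sorted(intro_points):
        try:
            ok = rendezvous_via(intro_point, onion_address)
        except Exception:
            ok = False
        if ok:
            report.intro_points_ok.append(intro_point)
        else:
            report.intro_points_failed.append(intro_point)

    return report
```

Monitoring over time would then just mean calling `check_onion` on a schedule and keeping the reports.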
If there are experiments that I could run (from controller code) to help with understanding this, let me know. I can use the txtorcon documentation hidden-services -- there's a v2 and a v3 one -- as test subjects.
Yes, I want this tool: we need it as a building block for debugging and nagios-style monitoring.
I'd suggest one slight change to the original description by asn: step one is to make the tool that does the checking, and then step two, separately, would be to run the tool on some sort of schedule. By separating these, anybody who says "why isn't my onion service working" can grab the tool and find out. Heck, somebody might even set up a web page that runs it for people.
There are two parts of Onion Service reachability: a) the ability of the Tor network to communicate its HSDesc reliably and to carry out the introduction/rendezvous procedure when a user tries to reach the Onion Service, and b) the ability of the server software on the Onion Service side to listen properly for incoming connections and respond to requests. We care very much about the former, but do we care about the latter? I think not really, as that is the responsibility of whoever is running the Onion Service, and they can use tools like Nagios to monitor things on their side. Also it is trivial to just torify curl .... Any comments on this?
Part a) from above can be further split into: a1) the ability of HSDirs to reliably inform the user about the (latest) HSDesc, and a2) the ability of the Tor network to establish the final circuit between the user and the Onion Service. I suppose we want to measure both of these and log some metrics about them? That would be timing information, as well as success/failure for each try. We also want to detect cases where the Tor network fails to perform any of the connection establishment steps.
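To make the "timing plus success/failure for each try" idea concrete, here is a small sketch. `StepResult` and `run_step` are made-up names for illustration; the `action` callable is a placeholder for a real a1 or a2 probe, which this snippet does not implement.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    step: str             # e.g. "a1: descriptor fetch" (label is free-form)
    success: bool
    seconds: float        # elapsed time for this try
    error: Optional[str] = None


def run_step(
    step: str,
    action: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> StepResult:
    """Run one probe, recording how long it took and whether it raised."""
    start = clock()
    try:
        action()
    except Exception as exc:
        # Keep the exception type and message so a failure can be
        # classified later, not just counted.
        return StepResult(step, False, clock() - start,
                          f"{type(exc).__name__}: {exc}")
    return StepResult(step, True, clock() - start)
```

The `clock` parameter is only there so the timing can be faked in tests; by default it uses a monotonic clock, which is not affected by wall-clock adjustments.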
Do we want this to be based on stem? Can we currently do introduction/rendezvous stuff with Tor Control Port and get progress information that is fine-grained enough for this tool? Are there things we need to implement for Tor Control interface (beyond making HSFETCH support v3 descriptors) to make it ready?
What would be the UI/API of such a tool? Do we want JSON output for easier integration with other stuff? Do we want some API over HTTP?
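As a strawman for the JSON option, something like the following could work. The field names (`onion_address`, `healthy`, `steps`) and the function `render_report` are assumptions made up for this sketch, not an agreed format.

```python
import json
from typing import Dict, List


def render_report(onion_address: str, steps: List[Dict]) -> str:
    """Serialize one health check run as a JSON document.

    Each entry in `steps` is assumed to be a dict with at least a
    boolean "success" key; any other keys are passed through as-is.
    """
    report = {
        "onion_address": onion_address,
        # An empty run is not reported as healthy.
        "healthy": bool(steps) and all(s["success"] for s in steps),
        "steps": steps,
    }
    return json.dumps(report, indent=2, sort_keys=True)
```

Output in this shape could be consumed by a nagios-style check or served unchanged by an HTTP endpoint, so the two questions do not have to be decided together.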
Trac: Status: new to needs_information Type: defect to project
Agreed, this health assessment tool should be entirely about the "within the Tor protocol" side of things. People can use nagios or whatever to make sure that their service is running well -- but only if the onion protocols are reliable and consistent.
Yes. It's not just timing, and not just success/failure, but another piece is trying to identify what went wrong if one of the steps went wrong.
Basing it on stem is fine with me. I think the answer might be "no" for whether all of this stuff is exposed properly in the control protocol though. I think we've had tickets about extending the control protocol in that direction open for a very long time. Or maybe nobody even made the tickets.
I imagine the first way of using the tool would be that we, the developers of the tool, run it consistently against some known-stable onion service. The goal would be to look for patterns in the failures. So the better the tool can be at identifying where the failure is and why the failure is, the more useful it will be. And then the second use of the tool would be when people say their onion service isn't working right -- we can tell them to run the tool and see what it says. Then I could imagine a third way, which is somebody sets up a web interface to run the tool on behalf of anybody who interacts with the website. Then it would become the sort of thing all sorts of people could easily run. But, one step at a time -- let's start with that 'first way'. :)
Trac: Keywords: tor-hs reachability research deleted, tor-hs reachability research network-health added Summary: Write tool for onion service health assesment to Write tool for onion service health assessment
I run two v3 onion services that fulfill the criteria and could give us relevant statistics for our questions. Commenting here so I get the updates on this ticket; I would be thrilled to use a tool like this to monitor their health and report back here.
I have no other way to know about their health: all I can do is check the Tor daemon's log file for relevant messages or try connecting to them myself. But this "solution" is obviously orders of magnitude less reliable and exact than a tool that will check the health against all HSDirs and try introduction to all IPs.
This has nothing to do with s27 O1A1.1. It is instrumental but the activity should not depend on this.
I did build that tool based on tor HS tracing, but it is not upstream nor ready for upstream. Heck, it might live its life outside of mainline tor, who knows.
At this point, the general idea of stable tracepoints in tor is being discussed so merging this tool upstream depends on the decisions coming out of the discussions with the network team.
For now, this lives outside of tor and hopefully one day, the tracing part will be put upstream. In the meantime, the rest is out of tree.