Write tool for onion service health assessment

changed milestone to %Tor: unspecified

added component::core tor/tor milestone::Tor: unspecified network-health network-team-roadmap-september owner::dgoulet parent::30200 points::7 priority::medium reachability research resolution::fixed severity::normal sponsor::27-must status::closed tor-hs type::project labels

FWIW, I've received a bug report related to this feature too: https://github.com/meejah/txtorcon/issues/327

If there are experiments that I could run (from controller code) to help with understanding this, let me know. I can use the txtorcon documentation hidden services (there's a v2 and a v3 one) as test subjects.

Yes, I want this tool: we need it as a building block to debugging and nagios-style monitoring.

I'd suggest one slight change to the original description by asn: step one is to make the tool that does the checking, and then step two, separately, would be to run the tool on some sort of schedule. By separating these, anybody who says "why isn't my onion service working" can grab the tool and find out. Heck, somebody might even set up a web page that runs it for people.

I assume using HSFETCH would be at the core of such a tool, but HSFETCH doesn't work with v3 onions. (That would be: https://trac.torproject.org/projects/tor/ticket/25417)

Trac:
Cc: N/A to rl1987

Some questions/thoughts about this:

There are two parts of Onion Service reachability: a) the ability of the Tor network to communicate its HSDesc reliably and do the introduction/rendezvous procedure when a user tries to reach the Onion Service, and b) the ability of the server software on the Onion Service side to properly listen for incoming connections and respond to requests. We care very much about the former, but do we care about the latter? I think not really, as that is the responsibility of whoever is running the Onion Service, and they can use tools like Nagios to monitor things on their side. Also, it is trivial to just torify curl .... Any comments on this?
Part a) from above can be further split into: a1) the ability of HSDirs to reliably inform the user about the (latest) HSDesc and a2) the ability of the Tor network to establish the final circuit between the user and the Onion Service. I suppose we want to measure both of these, and log some metrics about them? That would be timing information, as well as success/failure for each try. We also want to detect cases of the Tor network failing to perform any of the connection establishment steps.
Do we want this to be based on stem? Can we currently do introduction/rendezvous stuff with the Tor Control Port and get progress information that is fine-grained enough for this tool? Are there things we need to implement for the Tor Control interface (beyond making HSFETCH support v3 descriptors) to make it ready?
What would be the UI/API of such a tool? Do we want JSON output for easier integration with other stuff? Do we want some API over HTTP?
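To make the questions above a bit more concrete, here is a minimal sketch of what per-step timing plus success/failure with JSON output could look like. It is only an illustration, not a proposed design: the step names, the `StepResult` record, the placeholder step functions, and the `placeholder.onion` address are all assumptions, and the steps do not talk to tor at all. A real tool would replace them with control-port calls, if the control protocol exposes enough detail.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    step: str                    # hypothetical step name, e.g. "fetch_descriptor"
    ok: bool                     # True if the step finished without raising
    seconds: float               # wall-clock duration of the step
    error: Optional[str] = None  # short description of the failure, if any


def run_steps(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run each (name, callable) in order, timing it; stop at the first failure."""
    results = []
    for name, func in steps:
        start = time.monotonic()
        try:
            func()
            results.append(StepResult(name, True, time.monotonic() - start))
        except Exception as exc:
            elapsed = time.monotonic() - start
            results.append(
                StepResult(name, False, elapsed, f"{type(exc).__name__}: {exc}")
            )
            break
    return results


def to_json(address: str, results: List[StepResult]) -> str:
    """Serialize one check as JSON so other tools can consume it."""
    report = {
        "address": address,
        "reachable": bool(results) and all(r.ok for r in results),
        "steps": [asdict(r) for r in results],
    }
    return json.dumps(report, indent=2)


# Placeholder steps (assumptions): a real tool would fetch the HSDesc (a1)
# and build the rendezvous circuit (a2) here.
def fetch_descriptor() -> None:
    pass


def establish_circuit() -> None:
    raise TimeoutError("rendezvous timed out")


results = run_steps([
    ("fetch_descriptor", fetch_descriptor),
    ("establish_circuit", establish_circuit),
])
print(to_json("placeholder.onion", results))
```

Stopping at the first failed step keeps the report focused on where the attempt broke, which seems to match the a1/a2 split above.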

Trac:
Status: new to needs_information
Type: defect to project

Agreed, this health assessment tool should be entirely about the "within the Tor protocol" side of things. People can use nagios or whatever to make sure that their service is running well -- but only if the onion protocols are reliable and consistent.
Yes. It's not just timing, and not just success/failure: another piece is trying to identify what went wrong when one of the steps fails.
Basing it on stem is fine with me. I think the answer might be "no" for whether all of this stuff is exposed properly in the control protocol though. I think we've had tickets about extending the control protocol in that direction open for a very long time. Or maybe nobody even made the tickets.
I imagine the first way of using the tool would be that we, the developers of the tool, run it consistently against some known-stable onion service. The goal would be to look for patterns in the failures. So the better the tool can be at identifying where the failure is and why the failure is, the more useful it will be. And then the second use of the tool would be when people say their onion service isn't working right -- we can tell them to run the tool and see what it says. Then I could imagine a third way, which is somebody sets up a web interface to run the tool on behalf of anybody who interacts with the website. Then it would become the sort of thing all sorts of people could easily run. But, one step at a time -- let's start with that 'first way'. :)
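As a rough illustration of "looking for patterns in the failures" across repeated runs, here is a small sketch that counts failures by step and reason. The record layout (`failed_step`, `reason`) and the sample runs are made up for the example; they are not the output of any existing tool.

```python
from collections import Counter
from typing import Dict, List, Optional


def summarize(runs: List[Dict[str, Optional[str]]]) -> dict:
    """Count failures per (step, reason) over many runs of a health check.

    Each run is a dict with 'failed_step' (None when the run succeeded)
    and 'reason' (None when the run succeeded).
    """
    total = len(runs)
    failures = [r for r in runs if r["failed_step"] is not None]
    by_cause = Counter((r["failed_step"], r["reason"]) for r in failures)
    return {
        "total": total,
        "failed": len(failures),
        "failure_rate": len(failures) / total if total else 0.0,
        # Most frequent causes first, so recurring patterns stand out.
        "by_cause": by_cause.most_common(),
    }


# Made-up sample data, only to show the shape of the input.
sample_runs = [
    {"failed_step": None, "reason": None},
    {"failed_step": "fetch_descriptor", "reason": "no HSDir responded"},
    {"failed_step": "establish_circuit", "reason": "timeout"},
    {"failed_step": "establish_circuit", "reason": "timeout"},
]

summary = summarize(sample_runs)
for (step, reason), count in summary["by_cause"]:
    print(f"{count}x {step}: {reason}")
```

The more precisely each run can name the failing step and the reason, the more useful this kind of tally becomes.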

Trac:
Keywords: tor-hs reachability research deleted, tor-hs reachability research network-health added
Summary: Write tool for onion service health assesment to Write tool for onion service health assessment

I run two v3 onion services that fulfill the criteria and could give us relevant statistics for our questions. Commenting here so I get the updates on this ticket. I would be thrilled to use a tool like this to monitor their health and report back here.

I have no other way to know about their health. All I can do is check the Tor daemon for relevant messages in the log file or try connecting to them myself. But this "solution" is obviously orders of magnitude less reliable and exact than a tool that will check the health against all HSDirs and try introduction to all IPs.

Trac:
Cc: rl1987 to rl1987, s7r

Trac:
Cc: rl1987, s7r to rl1987, s7r, metrics-team

Trac:
Sponsor: N/A to Sponsor27-can

Trac:
Points: N/A to 23
Sponsor: Sponsor27-can to Sponsor27-must

Trac:
Parent: N/A to #29995 (moved)

Add keyword for tickets in the network team roadmap.

Trac:
Keywords: tor-hs reachability research network-health deleted, research, tor-hs, reachability, network-team-roadmap-2019-Q1Q2, network-health added

Trac:
Keywords: network-team-roadmap-2019-Q1Q2 deleted, network-team-roadmap-september added

Points changed at the Stockholm meeting.

Trac:
Points: 23 to 7
Status: needs_information to assigned
Owner: N/A to dgoulet

This has nothing to do with s27 O1A1.1. It is instrumental but the activity should not depend on this.

I did build that tool based on tor HS tracing but it is not upstream nor ready for upstream. Heck, it might live its life outside of mainline tor, who knows.

Trac:
Parent: #29995 (moved) to #30200 (moved)

At this point, the general idea of stable tracepoints in tor is being discussed so merging this tool upstream depends on the decisions coming out of the discussions with the network team.

For now, this lives outside of tor and hopefully one day, the tracing part will be put upstream. In the meantime, the rest is out of tree.

For reference, tor tracing is here: https://gitweb.torproject.org/user/dgoulet/tor.git/?h=lttng-hs

The scripts to analyze the traces and output useful data are here until we find a better place:

https://people.torproject.org/~dgoulet/volatile/analyze-client.py
https://people.torproject.org/~dgoulet/volatile/analyze-service.py

Closing this as "Done" since the work has been done but upstream merge requires more discussions. But for the sponsored work, it is considered done.

Trac:
Status: assigned to closed
Resolution: N/A to fixed

closed

changed time estimate to 56h

mentioned in issue #28967 (moved)

mentioned in issue #28968 (moved)

mentioned in issue #28969 (moved)

mentioned in issue #30200 (moved)

moved to tpo/core/tor#28841 (closed)
