Opened 9 months ago

Last modified 8 weeks ago

#28841 assigned project

Write tool for onion service health assessment

Reported by: asn
Owned by: dgoulet
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-hs, reachability, research, network-health, network-team-roadmap-september
Cc: rl1987, s7r, metrics-team
Actual Points:
Parent ID: #29995
Points: 7
Reviewer:
Sponsor: Sponsor27-must

Description

We've been getting lots of reports of poor onion service reachability (e.g. #28730), particularly for v3 services.

We need a tool that we can use to evaluate and monitor the health of onion services. We should use it to verify how reachable and stable onions are, and also as a benchmark for how their stability changes over time.

A relevant ticket here is #13209 which we can leverage in the future.

One way to write such a tool: given an onion service, the tool fetches its descriptor from every HSDir, introduces itself to all the intro points, and makes sure that rendezvous can occur. It then monitors this over time to find reachability issues.
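
The flow described above could be sketched roughly as follows. This is only an illustration, not a design: `fetch_descriptor`, `introduce`, and `rendezvous` are hypothetical callables standing in for whatever mechanism the tool ends up using (control port, stem, or something else), the `intro_points` key is an assumed descriptor layout, and the three steps are treated as independent for simplicity.

```python
from typing import Callable, Dict, Iterable, List


def check_onion(
    onion: str,
    hsdirs: Iterable[str],
    fetch_descriptor: Callable[[str, str], dict],
    introduce: Callable[[str, str], None],
    rendezvous: Callable[[str], None],
) -> List[Dict[str, str]]:
    """Run each reachability step once and record a per-step outcome.

    All three callables are hypothetical hooks; each is expected to
    raise an exception on failure.
    """
    results: List[Dict[str, str]] = []
    intro_points = set()

    def attempt(step: str, target: str, action: Callable[[], object]):
        # Record success or failure instead of aborting the whole check.
        try:
            value = action()
        except Exception as exc:
            results.append(
                {"step": step, "target": target, "status": "failed", "error": str(exc)}
            )
            return None
        results.append({"step": step, "target": target, "status": "ok"})
        return value

    # Step 1: fetch the descriptor from every HSDir.
    for hsdir in hsdirs:
        desc = attempt("hsdir_fetch", hsdir, lambda: fetch_descriptor(onion, hsdir))
        if desc:
            intro_points.update(desc.get("intro_points", []))

    # Step 2: introduce to every intro point seen in any descriptor.
    for intro_point in sorted(intro_points):
        attempt("introduce", intro_point, lambda: introduce(onion, intro_point))

    # Step 3: make sure rendezvous can occur.
    attempt("rendezvous", onion, lambda: rendezvous(onion))
    return results
```

Recording every step rather than stopping at the first error keeps the output useful for spotting which HSDir or intro point is misbehaving.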

Child Tickets

Ticket | Status | Owner | Summary | Component
#18620 | needs_revision | str4d | HSFORGET command to clear cached client state for a HS | Core Tor/Tor
#25417 | closed | neel | HSFETCH support for v3 Hidden Services | Core Tor/Tor
#28967 | new | | Tor control command to connect to Onion Service | Core Tor/Tor
#28968 | new | | Onion Service v2 connection status update event | Core Tor/Tor
#28969 | new | | Onion Service v3 connection status update event | Core Tor/Tor

Change History (14)

comment:1 Changed 9 months ago by meejah

FWIW, I've received a bug-report related to this feature too: https://github.com/meejah/txtorcon/issues/327

If there are experiments that I could run (from controller code) to help with understanding this, let me know. I can use the txtorcon documentation hidden services (there's a v2 one and a v3 one) as test subjects.

comment:2 Changed 9 months ago by arma

Yes, I want this tool: we need it as a building block for debugging and nagios-style monitoring.

I'd suggest one slight change to the original description by asn: step one is to make the tool that does the checking, and then step two, separately, would be to run the tool on some sort of schedule. By separating these, anybody who says "why isn't my onion service working" can grab the tool and find out. Heck, somebody might even set up a web page that runs it for people.
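
To illustrate the separation suggested here, the scheduling part can be a thin wrapper around a one-shot check. This is a minimal sketch; `check` is a hypothetical callable that performs a single assessment and returns its result, and `run_on_schedule` is an illustrative name, not an existing tool.

```python
import time
from typing import Callable, List


def run_on_schedule(
    check: Callable[[], dict],
    interval_seconds: float,
    runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call a one-shot check repeatedly, pausing between runs.

    The check itself knows nothing about scheduling, so it can also be
    run by hand by anyone debugging their own onion service.
    """
    history = []
    for i in range(runs):
        history.append(check())
        # No need to wait after the final run.
        if i < runs - 1:
            sleep(interval_seconds)
    return history
```

In practice the scheduling could just as well be left to cron or a monitoring system; the point is only that the check stays usable on its own.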

comment:3 Changed 9 months ago by meejah

I assume using HSFETCH would be at the core of such a tool, but that doesn't work with v3 onions. (That would be: https://trac.torproject.org/projects/tor/ticket/25417)
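
For reference, a controller would need to build an HSFETCH command line along these lines. This is a sketch based on my reading of the control protocol syntax (`HSFETCH` followed by an address and optional `SERVER=` arguments); `build_hsfetch` is a hypothetical helper, and whether tor accepts the 56-character v3 form depends on the running version having the support tracked in the ticket linked above.

```python
import re
from typing import Iterable

# v2 addresses are 16 base32 characters, v3 addresses are 56.
_ONION_RE = re.compile(r"(?:[a-z2-7]{16}|[a-z2-7]{56})")


def build_hsfetch(address: str, servers: Iterable[str] = ()) -> str:
    """Build an HSFETCH control-port command line for an onion address.

    `servers` are passed through unchanged as SERVER= arguments; no
    validation of their format is attempted here.
    """
    addr = address.lower()
    if addr.endswith(".onion"):
        addr = addr[: -len(".onion")]
    if not _ONION_RE.fullmatch(addr):
        raise ValueError(f"not a v2 or v3 onion address: {address!r}")
    parts = ["HSFETCH", addr]
    parts.extend(f"SERVER={server}" for server in servers)
    return " ".join(parts)
```

With stem, a string like this could be sent through `Controller.msg()`. The reply only acknowledges the request; the outcome arrives asynchronously as HS_DESC and HS_DESC_CONTENT events.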

comment:4 Changed 9 months ago by rl1987

Cc: rl1987 added

comment:5 Changed 9 months ago by rl1987

Status: new → needs_information
Type: defect → project

Some questions/thoughts about this:

  1. There are two parts to Onion Service reachability: a) the ability of the Tor network to communicate its HSDesc reliably and carry out the introduction/rendezvous procedure when a user tries to reach the Onion Service, and b) the ability of the server software on the Onion Service side to listen properly for incoming connections and respond to requests. We care very much about the former, but do we care about the latter? I think not really, as that is the responsibility of whoever is running the Onion Service, and they can use tools like Nagios to monitor things on their side. Also, it is trivial to just torify curl .... Any comments on this?
  2. Part a) from above can be further split into: a1) the ability of HSDirs to reliably inform the user about the (latest) HSDesc, and a2) the ability of the Tor network to establish the final circuit between the user and the Onion Service. I suppose we want to measure both of these, and log some metrics about them? That would be timing information, as well as success/failure for each try. We also want to detect cases of the Tor network failing to perform any of the connection establishment steps.
  3. Do we want this to be based on stem? Can we currently do the introduction/rendezvous steps through the Tor Control Port and get progress information that is fine-grained enough for this tool? Are there things we need to implement in the Tor Control interface (beyond making HSFETCH support v3 descriptors) to make it ready?
  4. What would be the UI/API of such a tool? Do we want JSON output for easier integration with other stuff? Do we want some API over HTTP?
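
As a starting point for discussion on points 2 and 4, per-step timing and success/failure could be captured and emitted as JSON roughly like this. The field names and the `timed_step` helper are hypothetical, chosen only for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StepResult:
    step: str                 # e.g. "hsdir_fetch", "introduce", "rendezvous"
    target: str               # which HSDir or intro point was tried
    success: bool
    elapsed_seconds: float
    error: Optional[str] = None


def timed_step(
    step: str,
    target: str,
    action: Callable[[], object],
    clock: Callable[[], float] = time.monotonic,
) -> StepResult:
    """Run one step, recording how long it took and whether it raised."""
    start = clock()
    try:
        action()
    except Exception as exc:
        return StepResult(step, target, False, clock() - start, str(exc))
    return StepResult(step, target, True, clock() - start)


def to_json(results: Iterable[StepResult]) -> str:
    """Serialize results as a JSON array for other tools to consume."""
    return json.dumps([asdict(r) for r in results], indent=2)
```

A flat list of records like this is easy to feed into other tooling, and an HTTP API could later return the same structure.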

comment:6 Changed 9 months ago by arma

Keywords: network-health added
Summary: Write tool for onion service health assesment → Write tool for onion service health assessment
  1. Agreed, this health assessment tool should be entirely about the "within the Tor protocol" side of things. People can use nagios or whatever to make sure that their service is running well -- but only if the onion protocols are reliable and consistent.
  2. Yes. It's not just timing, and not just success/failure, but another piece is trying to identify what went wrong if one of the steps went wrong.
  3. Basing it on stem is fine with me. I think the answer might be "no" for whether all of this stuff is exposed properly in the control protocol though. I think we've had tickets about extending the control protocol in that direction open for a very long time. Or maybe nobody even made the tickets.
  4. I imagine the first way of using the tool would be that we, the developers of the tool, run it consistently against some known-stable onion service. The goal would be to look for patterns in the failures. So the better the tool can be at identifying where the failure is and why it happened, the more useful it will be. And then the second use of the tool would be when people say their onion service isn't working right -- we can tell them to run the tool and see what it says. Then I could imagine a third way, which is somebody sets up a web interface to run the tool on behalf of anybody who interacts with the website. Then it would become the sort of thing all sorts of people could easily run. But, one step at a time -- let's start with that 'first way'. :)
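
For the 'first way', looking for patterns could start as simply as counting which step fails and with what error across many runs. The sketch below assumes each run is a list of per-step dictionaries with `step`, `status`, and an optional `error` key; that record format is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def failure_patterns(
    runs: Iterable[List[Dict[str, str]]],
) -> List[Tuple[Tuple[str, str], int]]:
    """Count failures per (step, error) pair across runs, most common first."""
    counts: Counter = Counter()
    for run in runs:
        for record in run:
            if record.get("status") == "failed":
                key = (record["step"], record.get("error", "unknown"))
                counts[key] += 1
    return counts.most_common()
```

If one (step, error) pair dominates the output, that is a hint about where in the protocol to look first.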

comment:7 Changed 9 months ago by s7r

Cc: s7r added

I run two v3 onion services that fulfill the criteria and could give us relevant statistics for our questions. I'm commenting here so that I get updates on this ticket. I would be thrilled to use a tool like this to monitor their health and report back here.

I have no other way to know about their health, all I can do is check the Tor daemon for relevant messages in the log file or try connecting to them myself. But this "solution" is obviously orders of magnitude less reliable and exact than a tool that will check the health against all HSDirs and try introduction to all IPs.

comment:8 Changed 7 months ago by irl

Cc: metrics-team added

comment:9 Changed 5 months ago by asn

Sponsor: Sponsor27-can

comment:10 Changed 5 months ago by asn

Points: 23
Sponsor: Sponsor27-can → Sponsor27-must

comment:11 Changed 5 months ago by asn

Parent ID: #29995

comment:12 Changed 5 months ago by gaba

Keywords: network-team-roadmap-2019-Q1Q2 added

Add keyword for tickets in the network team roadmap.

comment:13 Changed 2 months ago by gaba

Keywords: network-team-roadmap-september added; network-team-roadmap-2019-Q1Q2 removed

comment:14 Changed 8 weeks ago by dgoulet

Owner: set to dgoulet
Points: 23 → 7
Status: needs_information → assigned

Points changed at the Stockholm meeting.
