Opened 7 months ago

Closed 5 months ago

#21056 closed defect (fixed)

Could not pick one of the responsible hidden service directories, because we requested them all recently without success.

Reported by: joeyh
Owned by:
Priority: High
Milestone: Tor: 0.3.1.x-final
Component: Core Tor/Tor
Version: Tor: 0.2.9.8
Severity: Normal
Keywords: tor-hs
Cc: s7r
Actual Points:
Parent ID: #17242
Points:
Reviewer:
Sponsor:

Description

I am seeing connections to hidden services reliably fail, with this in the log:

Dec 21 13:50:16.000 [info] connection_ap_handshake_rewrite_and_attach(): Got a hidden service request for ID '[scrubbed]'
Dec 21 13:50:16.000 [info] connection_ap_handshake_rewrite_and_attach(): Unknown descriptor [scrubbed]. Fetching.
Dec 21 13:50:16.000 [debug] rend_client_refetch_v2_renddesc(): Fetching v2 rendezvous descriptor for service [scrubbed]
Dec 21 13:50:16.000 [info] pick_hsdir(): Could not pick one of the responsible hidden service directories, because we requested them all recently without success.
Dec 21 13:50:16.000 [info] pick_hsdir(): Could not pick one of the responsible hidden service directories, because we requested them all recently without success.
Dec 21 13:50:16.000 [info] fetch_v2_desc_by_addr(): Could not pick one of the responsible hidden service directories to fetch descriptors, because we already tried them all unsuccessfully.

In my test case, I have 2 hidden services running on a machine, both started for the first time within the past 5-10 minutes. On the same machine, I concurrently open two SOCKS connections, one to each hidden service. Reliably, one of the SOCKS connections succeeds and the other fails. After it starts failing, retries using that onion address continue to fail for several minutes.

It kind of looks like one SOCKS request is breaking the other. This seems similar to #16501 and #15937, but I am pretty sure I am only making 2 SOCKS connections, to two different onion addresses.

The attached log shows this occurring; after a while the bad state cleared, and a retry to the hidden service it had been unable to resolve succeeded.
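A reproduction along these lines can be sketched in Go with golang.org/x/net/proxy (the onion addresses and the SocksPort below are placeholders, not taken from the report):

// Two concurrent SOCKS connections through the local Tor SocksPort,
// one per hidden service. The onion addresses are placeholders.
package main

import (
	"fmt"
	"sync"

	"golang.org/x/net/proxy"
)

func main() {
	onions := []string{
		"aaaaaaaaaaaaaaaa.onion:80", // placeholder for hidden service 1
		"bbbbbbbbbbbbbbbb.onion:80", // placeholder for hidden service 2
	}
	var wg sync.WaitGroup
	for _, addr := range onions {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			dialer, err := proxy.SOCKS5("tcp", "127.0.0.1:9050", nil, proxy.Direct)
			if err != nil {
				fmt.Printf("%s: dialer: %v\n", addr, err)
				return
			}
			// With this bug, one of the two concurrent dials reliably fails.
			conn, err := dialer.Dial("tcp", addr)
			if err != nil {
				fmt.Printf("%s: FAILED: %v\n", addr, err)
				return
			}
			conn.Close()
			fmt.Printf("%s: OK\n", addr)
		}(addr)
	}
	wg.Wait()
}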

Child Tickets

Attachments (1)

debug.log.gz (125.3 KB) - added by joeyh 7 months ago.


Change History (13)

Changed 7 months ago by joeyh

comment:1 Changed 7 months ago by dgoulet

  • Component changed from - Select a component to Core Tor/Tor
  • Keywords tor-hs added
  • Milestone set to Tor: 0.3.0.x-final

comment:2 Changed 7 months ago by s7r

I confirm this happens frequently on git master with onion services that are otherwise up and reachable. It only happens when you open more than one concurrent SOCKS request (either to different hidden services, or to the same hidden service with isolation).

I won't attach another debug log because this is very easy to reproduce: just open at least 2 or 3 SOCKS requests simultaneously to the same up-and-running hidden service, but use a different SOCKS username/password for each request so they are isolated. The hidden service becomes unreachable immediately, because Tor thinks it has tried all the HSDirs without getting a valid descriptor. It works normally if you make just one request at a time, but it's clearly a bug.

I can also confirm this wasn't happening in previous releases (#15937, which led to #18564). I used the same script to open many concurrent rendezvous circuits to the same onion service when stress-testing OnionBalance back then, and I didn't experience this bug.
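A hypothetical Go version of such a script (the onion address is a placeholder; a distinct SOCKS5 username/password per connection makes Tor put each one on an isolated circuit under IsolateSOCKSAuth):

// Hypothetical equivalent of the test script: N concurrent SOCKS5
// connections to the same onion address, each with unique credentials
// so Tor isolates them onto separate circuits.
package main

import (
	"fmt"
	"sync"

	"golang.org/x/net/proxy"
)

func main() {
	const target = "xxxxxxxxxxxxxxxx.onion:80" // placeholder onion address
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Unique username/password per request => isolated circuit.
			auth := &proxy.Auth{User: fmt.Sprintf("u%d", i), Password: "x"}
			dialer, err := proxy.SOCKS5("tcp", "127.0.0.1:9050", auth, proxy.Direct)
			if err != nil {
				fmt.Printf("request %d: %v\n", i, err)
				return
			}
			conn, err := dialer.Dial("tcp", target)
			if err != nil {
				fmt.Printf("request %d: FAILED: %v\n", i, err)
				return
			}
			conn.Close()
			fmt.Printf("request %d: OK\n", i)
		}(i)
	}
	wg.Wait()
}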

Last edited 7 months ago by s7r

comment:3 Changed 7 months ago by s7r

  • Cc s7r added
  • Priority changed from Medium to High

comment:4 Changed 7 months ago by asn

Looking at the logs, it seems this happens over and over:

Dec 21 13:48:55.000 [debug] connection_dir_client_reached_eof(): Received response from directory server '62.210.109.48:9001': 200 "OK" (purpose: 18)
Dec 21 13:48:55.000 [debug] connection_dir_client_reached_eof(): Time on received directory is within tolerance; we are -1 seconds skewed. (That's okay.)
Dec 21 13:48:55.000 [info] connection_dir_client_reached_eof(): Received rendezvous descriptor (size 3253, status 200 ("OK"))
Dec 21 13:48:55.000 [info] rend_cache_store_v2_desc_as_client(): Service descriptor with service ID [scrubbed] has no usable intro points. Discarding it.
Dec 21 13:48:55.000 [warn] Fetching v2 rendezvous descriptor failed. Retrying at another directory.

So it seems we go to all the HSDirs and don't accept any of the descriptors because there were no usable intro points... and that only happens with concurrent SOCKS requests? Hm...

comment:5 Changed 7 months ago by s7r

I see exactly the same thing: the descriptor is fetched from the HSDir with status 200 OK, but Tor thinks all the intro points are unusable:

[info] rend_cache_store_v2_desc_as_client(): Service descriptor with service ID [scrubbed] has no usable intro points. Discarding it.

This happens with descriptors fetched from the entire HSDir set, and once all have been tried, the onion service becomes unavailable to that client (while working perfectly fine for others).

The thing is, it's not a strict rule that this happens only with multiple concurrent SOCKS requests; it also happens (rarely, but it does) with a single SOCKS request. Sometimes when opening multiple SOCKS requests, some fail and some work, until Tor thinks it has tried the last HSDir unsuccessfully; after that, all requests fail, even when sent one at a time. Here are some numbers. Note that concurrent requests are always isolated, the remote onion I connect to is always the same, and the batches were run consecutively.

Result of 20 total requests, in batches of 5 simultaneously (14 OK, 6 FAILED):

Status code distribution:
  [200] 14 responses

Error distribution:
  [2]   Get http://[scrubbed].onion: proxy: SOCKS5 proxy at 127.0.0.1:9150 failed to connect: TTL expired
  [1]   Get http://[scrubbed].onion: proxy: SOCKS5 proxy at 127.0.0.1:9150 failed to connect: host unreachable
  [3]   Get http://[scrubbed].onion: proxy: SOCKS5 proxy at 127.0.0.1:9150 failed to connect: host unreachable

Result of 100 total requests, in batches of 5 simultaneously (18 OK, 82 FAILED):

Status code distribution:
  [200] 18 responses

Error distribution:
  [75]  Get http://[scrubbed].onion: proxy: SOCKS5 proxy at 127.0.0.1:9150 failed to connect: host unreachable
  [1]   Get http://[scrubbed].onion: proxy: SOCKS5 proxy at 127.0.0.1:9150 failed to connect: host unreachable
  [1]   Get http://[scrubbed].onion: proxy: SOCKS5 proxy at 127.0.0.1:9150 failed to connect: host unreachable
  [5]   Get http://[scrubbed].onion: proxy: SOCKS5 proxy at 127.0.0.1:9150 failed to connect: TTL expired

Result of total number of 50 requests, one request at a time (49 OK, 1 FAILED):

Status code distribution:
  [200] 49 responses

Error distribution:
  [1]   Get http://[scrubbed].onion: proxy: SOCKS5 proxy at 127.0.0.1:9150 failed to connect: TTL expired

Repeating total of 50 requests, one at a time (26 OK, 24 FAILED):

Status code distribution:
  [200] 26 responses

Error distribution:
  [24]  Get http://[scrubbed].onion: proxy: SOCKS5 proxy at 127.0.0.1:9150 failed to connect: host unreachable

Total of 100 requests, in batches of 20 simultaneously (100 FAILED):

Error distribution:
  [100] Get http://[scrubbed].onion: proxy: SOCKS5 proxy at 127.0.0.1:9150 failed to connect: host unreachable

If you make 6 or more requests simultaneously, you are almost guaranteed 100% failed connections. With 5 or fewer simultaneously, it depends on your luck, I guess: maybe 20% fail, maybe 60%, maybe 100%.

comment:6 Changed 7 months ago by s7r

And after failed requests, an onion hostname enters some kind of down-cache and Tor won't try to connect to it any more; but if you enter a different onion hostname (and don't request too many concurrent connections), it will sometimes connect and sometimes just hang for an unknown time (I always interrupted it with ctrl+c; the hang is more than 5 minutes in any case). The Tor client instance doesn't die and doesn't log anything while hanging.

So we have a strange bug on our hands.

comment:7 Changed 7 months ago by dgoulet

  • Status changed from new to needs_information

I've made a patch in #20307 that fixes what's causing the "unusable introduction points" issue we also see in the logs from that experiment.

The second thing I see is that with multiple SOCKS requests (let's say 6) to the *same* .onion, each of them triggers a descriptor fetch from a _different_ HSDir. Thus all 6 HSDirs get flagged as "recently looked up", won't be usable again for some minutes, and we end up with the warning we see.

Now, I'm not sure this is ideal, as it is quite redundant and consumes network resources, but having, say, 2 parallel fetches on different replicas could be good for performance, giving us more chances to get the descriptor faster. We do the same for intro points: we open parallel intro circuits, and when one opens, we close the rest.
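Tor itself is C, but the failure mode and the obvious guard are easy to model. In the illustrative Go sketch below (every name is made up; none of this is Tor's actual code), concurrent requests for the same descriptor piggyback on the in-flight fetch instead of each consuming another responsible HSDir:

// Illustrative sketch only, not Tor's actual code: models how six
// back-to-back requests for one descriptor would exhaust all six
// responsible HSDirs, and how a pending-fetch check avoids that.
package main

import "fmt"

type client struct {
	pending       map[string]bool     // descriptor IDs with a fetch in flight
	recentlyTried map[string][]string // descriptor ID -> HSDirs already asked
}

// fetchDesc is called once per SOCKS request that needs the descriptor.
func (c *client) fetchDesc(descID string, hsdirs []string) {
	if c.pending[descID] {
		// A fetch is already in flight: wait on it instead of
		// burning another responsible HSDir.
		fmt.Println(descID, "-> fetch already pending, waiting on it")
		return
	}
	for _, dir := range hsdirs {
		if !contains(c.recentlyTried[descID], dir) {
			c.pending[descID] = true
			c.recentlyTried[descID] = append(c.recentlyTried[descID], dir)
			fmt.Println(descID, "-> fetching from", dir)
			return
		}
	}
	// Without the pending check above, six requests land here after
	// six fetches, which is exactly the warning in this ticket.
	fmt.Println(descID, "-> all responsible HSDirs tried recently")
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

func main() {
	c := &client{pending: map[string]bool{}, recentlyTried: map[string][]string{}}
	dirs := []string{"d1", "d2", "d3", "d4", "d5", "d6"}
	for i := 0; i < 6; i++ { // six SOCKS requests arriving at once
		c.fetchDesc("desc-abc", dirs)
	}
}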

comment:8 Changed 7 months ago by s7r

Tested the patch and it works much better. I was able to open over 150 concurrent isolated SOCKS requests to the same hidden service and got a 100% connection success rate every time.

comment:9 Changed 7 months ago by s7r

dgoulet: I saw in the other ticket that the patch fixes this one only partially -- are there any other remaining issues (besides the idea of opening parallel circuits to HSDirs)? With the patch things are indeed incomparably better, but if you make, say, 30 concurrent isolated SOCKS requests to the same hidden service, and repeat that about 10 times consecutively as each batch finishes, it eventually leads to the warning seen above, that we could not get a valid v2 descriptor.

comment:10 Changed 6 months ago by dgoulet

  • Milestone changed from Tor: 0.3.0.x-final to Tor: 0.3.1.x-final

Ok, #20307 has been merged and is awaiting a possible backport to 029, so this ticket should be partially fixed.

The other issue is the fact that a client can hit every HSDir at once with 6 SOCKS connections. This would require much more engineering to fix, and I'm honestly reluctant to poke at it for v2; I'd rather put our effort into making it right for v3 (prop224). Thoughts?

I'll defer to 031 for now and make sure we do NOT implement this behavior for prop224 (#20657).

comment:11 Changed 6 months ago by s7r

Hmm, you mean 6 circuits to fetch descriptors, one from each HSDir, launched for each (single) SOCKS request to a hidden service?

I also think this could be discussed for v3 only. However, it might be overkill to query all the HSDirs for each SOCKS connection; querying 2 or 3 HSDirs at once and closing the remaining requests once one succeeds could be sufficient.
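Sketched in Go (fetchFrom and everything else here are hypothetical stand-ins, not Tor code), that "first success wins" pattern looks like this:

// Sketch of "query a few HSDirs in parallel, keep the first success,
// cancel the rest". fetchFrom is a hypothetical stand-in for a real
// descriptor download.
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// fetchFrom pretends to download a descriptor from one HSDir.
func fetchFrom(ctx context.Context, hsdir string) (string, error) {
	select {
	case <-time.After(time.Duration(rand.Intn(300)) * time.Millisecond):
		return "descriptor-from-" + hsdir, nil
	case <-ctx.Done():
		return "", ctx.Err() // cancelled: another fetch won the race
	}
}

// fetchFirst launches one fetch per HSDir and returns the first success;
// the deferred cancel tears down the remaining fetches.
func fetchFirst(hsdirs []string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	results := make(chan string, len(hsdirs)) // buffered: losers never block
	for _, dir := range hsdirs {
		go func(dir string) {
			if desc, err := fetchFrom(ctx, dir); err == nil {
				results <- desc
			}
		}(dir)
	}
	select {
	case desc := <-results:
		return desc, nil
	case <-ctx.Done():
		return "", errors.New("no HSDir answered in time")
	}
}

func main() {
	desc, err := fetchFirst([]string{"d1", "d2", "d3"}) // 2-3 dirs, not all 6
	if err != nil {
		fmt.Println("failed:", err)
		return
	}
	fmt.Println("got", desc)
}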

comment:12 Changed 5 months ago by dgoulet

  • Parent ID set to #17242
  • Resolution set to fixed
  • Status changed from needs_information to closed

This improved behavior will be part of #17242, so I'm setting it as the parent and closing. If we want to put some effort into improving the client fetch behavior of the v2 system, let's open a ticket for that with some action items.
