Opened 13 months ago

Last modified 6 months ago

#23764 needs_revision defect

hs-v3: No live consensus on client with a bridge

Reported by: dgoulet
Owned by: dgoulet
Priority: High
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-hs, prop224, 034-triage-20180328, 034-removed-20180328
Cc: starlight@…
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Today someone came into the v3 testing hub IRC channel who couldn't use v3 onions at all.

Turns out that this log kept happening for any v3 address:

[info] hs_client_refetch_hsdesc(): Can't fetch descriptor for service [scrubbed] because we are missing a live consensus. Stalling connection.

But their tor never got a live consensus. We could see it was trying to get one from its bridge:

[info] Received http status code 304 ("Not modified") from server 'BRIDGE_IP' while fetching consensus directory.

Sooooo, somehow the bridge has a consensus that it thinks is live enough to use, but when the client gets it, the client doesn't think it is live. I can imagine clock skew between the client and the bridge could be causing this?

Thus, this makes me question the use of a "live consensus" in the HS v3 subsystem. v2 doesn't check for that at all: it only cares whether tor has completed a circuit, and then it uses the consensus even if it is not live.

Maybe the client side could simply use whatever consensus tor thinks it can use, and we hope that it is enough to reach the service?
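The clock-skew hypothesis can be illustrated with a minimal sketch. This is not tor's code: `is_live` is a hypothetical helper, the timestamps are made-up second counts, and the three-hour validity window and four-hour skew are assumptions. It only shows how the same consensus can look live to the bridge and not live to a client whose clock runs ahead.

```python
def is_live(valid_after, valid_until, now):
    """Return True when `now` falls inside the consensus validity window."""
    return valid_after <= now <= valid_until


HOUR = 3600

# Hypothetical consensus, valid for an assumed three hours.
valid_after = 100 * HOUR
valid_until = valid_after + 3 * HOUR

bridge_now = valid_after + 1 * HOUR  # bridge clock is inside the window
client_now = bridge_now + 4 * HOUR   # client clock runs 4 hours ahead (assumed skew)

bridge_sees_live = is_live(valid_after, valid_until, bridge_now)
client_sees_live = is_live(valid_after, valid_until, client_now)
print(bridge_sees_live, client_sees_live)  # True False
```

In this model the bridge has no newer document to offer, which would be consistent with the 304 "Not modified" reply, while the client keeps rejecting the one it already holds.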

Change History (11)

comment:1 Changed 12 months ago by starlight

Cc: starlight@… added

I experience this here periodically with a long-lived non-HS client.

Jul  4 01:35:55 Tor[]: Tor 0.2.9.11 (git-aa8950022562be76) running on Linux with {elided}
Jul  4 01:35:55 Tor[]: Read configuration file "/home/tor/torrc".
Jul  4 01:35:55 Tor[]: Opening Socks listener on {elided}
Jul  4 01:35:55 Tor[]: Opening Control listener on {elided}
Jul  4 01:35:55 Tor[]: Parsing GEOIP IPv4 file ./geoip.
Jul  4 01:35:56 Tor[]: Parsing GEOIP IPv6 file ./geoip6.
Jul  4 01:35:56 Tor[]: Bootstrapped 0%: Starting
Jul  4 01:35:59 Tor[]: new bridge descriptor '{elided}' (cached): ${elided} at {elided}
Jul  4 01:35:59 Tor[]: Bootstrapped 80%: Connecting to the Tor network
Jul  4 01:36:00 Tor[]: Bootstrapped 85%: Finishing handshake with first hop
Jul  4 01:36:00 Tor[]: Bootstrapped 90%: Establishing a Tor circuit
Jul  4 01:36:00 Tor[]: Tor has successfully opened a circuit. Looks like client functionality is working.
Jul  4 01:36:00 Tor[]: Bootstrapped 100%: Done

Aug 22 01:23:55 Tor[]: new bridge descriptor '{elided}' (fresh): ${elided} at {elided}
Aug 24 00:19:38 Tor[]: Delaying directory fetches: No running bridges
Aug 24 01:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 01:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 02:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:9131. Giving up. (waiting for circuit)
Aug 24 02:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 02:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Aug 24 02:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 02:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 03:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 03:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:9131. Giving up. (waiting for circuit)
Aug 24 03:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Aug 24 04:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 04:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:9131. Giving up. (waiting for circuit)
Aug 24 04:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Aug 24 04:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 04:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 05:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 05:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Aug 24 05:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 05:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 05:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:9131. Giving up. (waiting for circuit)
Aug 24 06:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:9131. Giving up. (waiting for circuit)
Aug 24 06:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 06:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Aug 24 07:03:16 Tor[]: Catching signal TERM, exiting cleanly.

comment:2 Changed 12 months ago by asn

Suggested fixes here:

a) If you can't seem to get a live consensus from your bridge, but you can get a reasonably live one, then build circuit to fallback dirs and try to get a live consensus.

b) If we can't get a live consensus for a while, attempt to connect to v3 with reasonably live consensus. Or the opposite.

comment:3 Changed 12 months ago by dgoulet

Owner: set to dgoulet
Status: changed from new to accepted

comment:4 in reply to:  2 Changed 12 months ago by dgoulet

Status: changed from accepted to needs_information

Replying to asn:

> Suggested fixes here:
>
> a) If you can't seem to get a live consensus from your bridge, but you can get a reasonably live one, then build circuit to fallback dirs and try to get a live consensus.

Yes, definitely, but that is probably a broader bug in tor itself and not in HS. I'll investigate this a bit more and open a ticket for it. So it is not part of this fix.

> b) If we can't get a live consensus for a while, attempt to connect to v3 with reasonably live consensus. Or the opposite.

Ok here is my thought on this. I think to simplify things, the HS client should always check for a reasonably live consensus and here is why.

1) We have a live consensus, then networkstatus_get_reasonably_live_consensus() function returns it.

2) We do not have a live consensus which means that tor is trying to download one.

I see three scenarios that can lead to failures (and maybe more but my point will get clear with those):

2.1) Client has a skewed clock so the consensus we'll receive is never live BUT chances are that the consensus it has is actually live.
2.2) The dir cache/auth simply can't provide a newer one so no live.
2.3) We are waiting to download a consensus because we've already tried many times and we are backing off for now. In other words, we are stalling a bit so no live.
2.4) We wait for it but won't be sure to have a live one.

In any case, we want to give the HS client a chance and thus use a reasonably live consensus if one is available. The alternative is to consider 2.4) reason enough for the client to stall, so that it has a better chance of computing a hashring in which at least one HSDir works out.

I'm unsure here, I think it would add much complexity to the code for checking if we have an inflight download consensus and when we might have failed already 2 times to fetch a live one so "ok let's use reasonably live".

A reasonably live consensus allows a maximum skew of 24h, and I don't think a tor client will often be stuck using one, since the bootstrap process downloads a consensus. And if for some reason the client doesn't think it is live, chances are that it is live or close to live. The only scenario I see where that might not be true is if the dirauths failed for many hours to create a consensus or the client can't reach a dircache, but then I believe the dirinfo will also not be enough to continue anyway.
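The difference between "live" and "reasonably live" can be sketched as follows. This is a simplified model, not tor's implementation: the helper names and the constant `REASONABLY_LIVE_SECONDS` are hypothetical, and applying the 24h tolerance on both sides of the validity window is an assumption.

```python
REASONABLY_LIVE_SECONDS = 24 * 60 * 60  # the 24h tolerance discussed above
HOUR = 3600


def is_live(valid_after, valid_until, now):
    """Strict check: `now` must be inside the validity window."""
    return valid_after <= now <= valid_until


def is_reasonably_live(valid_after, valid_until, now):
    """Relaxed check: tolerate up to 24h outside the window (assumed on both sides)."""
    return (valid_after - REASONABLY_LIVE_SECONDS
            <= now
            <= valid_until + REASONABLY_LIVE_SECONDS)


# Hypothetical consensus with an assumed three-hour validity window.
valid_after = 100 * HOUR
valid_until = valid_after + 3 * HOUR

slightly_late = valid_until + 4 * HOUR   # 4h past expiry
far_too_late = valid_until + 30 * HOUR   # 30h past expiry

print(is_live(valid_after, valid_until, slightly_late))             # False
print(is_reasonably_live(valid_after, valid_until, slightly_late))  # True
print(is_reasonably_live(valid_after, valid_until, far_too_late))   # False
```

Anything that passes the strict check also passes the relaxed one, which matches case 1) above: a live consensus is also returned by the reasonably-live lookup.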

THAT being said with this WALL of text, here is my proposed fix: we use networkstatus_get_latest_consensus() which is btw what the hs_get_responsible_hsdirs() function uses.

We get a consensus at bootstrap, else we don't bootstrap. We'll never use a consensus that is older than 24h (reasonably live), and our dirinfo will never be usable if we have a consensus that is too old. So, let's just use the latest consensus we have, which is basically the best effort of tor? And we know that the service will be accessible with a consensus up to 24h in the past because of this 24h overlap period.

Fix is simple but many things need to change in the test to not MOCK the live consensus function. So before I do that, I would like a second opinion.

comment:5 Changed 12 months ago by asn

Hmm, I took another look here and drew the various diagrams, and I did not find any cases where your suggestion causes problems. However, it's been a while since I last thought carefully about these reachability diagrams and I'm afraid I'm forgetting something. I also remember a conscious decision to require a live consensus, but perhaps that was invalidated when we switched to 24-hour overlap periods...

I think your suggested change makes sense, but if we are to do it I think we need to implement the new cases in test_reachability() and test_client_service_hsdir_set_sync() since we are basically creating a few extra scenarios here that we should make sure we are handling gracefully. Also, I'd like to test the branch a few days on my tor browser using a bridge to make sure that nothing goes weird before merging upstream.

Cheers!

comment:6 Changed 12 months ago by asn

BTW here is an example of a thing that might go bad (and the suggested tests above would not catch):

get_voting_interval() and hs_get_time_period_num() and hs_in_period_between_tp_and_srv() all use networkstatus_get_live_consensus() and are used both by clients and services in multiple places. If a live consensus is not found those functions have an alternative behavior, and if we do the suggested change here we could be using this alternative behavior more frequently. We should make sure that this won't cause any reachability issues.

Last edited 12 months ago by asn

comment:7 in reply to:  6 Changed 11 months ago by dgoulet

Milestone: changed from Tor: 0.3.2.x-final to Tor: 0.3.3.x-final
Status: changed from needs_information to needs_revision

Replying to asn:

> BTW here is an example of a thing that might go bad (and the suggested tests above would not catch):
>
> get_voting_interval() and hs_get_time_period_num() and hs_in_period_between_tp_and_srv() all use networkstatus_get_live_consensus() and are used both by clients and services in multiple places. If a live consensus is not found those functions have an alternative behavior, and if we do the suggested change here we could be using this alternative behavior more frequently. We should make sure that this won't cause any reachability issues.

Good points!

I did an investigation and shared random is the problem. We bound the HS and SR subsystems tightly together with those functions. Dirauths MUST use the live consensus for their computation of intervals and time periods, so we can't change get_voting_interval() to use the "latest consensus". It could have consequences on the SR (although I really wonder if a dirauth can operate without a live consensus...)

To pull this off, we would need a function from which we can control which ns object we pass to the series of functions that we need in the HS subsystem from the SR subsystem or tell the function that we do not care about live.

For hs_get_time_period_num(), it is OK to stay like this using a live consensus because that function returns the time period tor thinks it is in and it should be either a live consensus or just wing it with the current time which is what we want if we have no live but we'll still try to reach the HS.
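The fallback described here, live consensus time if available and local clock otherwise, can be sketched like this. It is a model rather than tor's code: `time_period_num` and its parameters are hypothetical names, and the 1440-minute period length and 12-hour rotation offset are assumed defaults taken from the v3 rendezvous spec.

```python
TIME_PERIOD_LENGTH_MIN = 1440  # assumed default: one day, in minutes
ROTATION_OFFSET_MIN = 12 * 60  # assumed offset: 12 voting periods of 60 minutes


def time_period_num(live_valid_after, wall_clock_now):
    """Time period number from the live consensus if there is one, else the local clock.

    Both arguments are seconds since the epoch; pass None for
    `live_valid_after` when no live consensus is available.
    """
    now = live_valid_after if live_valid_after is not None else wall_clock_now
    minutes_since_epoch = now // 60
    return (minutes_since_epoch - ROTATION_OFFSET_MIN) // TIME_PERIOD_LENGTH_MIN


# No live consensus: wing it with the local clock (2016-04-13 11:15:01 UTC).
tp_from_clock = time_period_num(None, 1460546101)

# Live consensus with valid_after at 11:00:00 UTC on the same day.
tp_from_consensus = time_period_num(1460545200, 1460546101)

print(tp_from_clock, tp_from_consensus)  # 16903 16903
```

With a roughly correct clock both paths land in the same time period. A client with a badly skewed clock and no live consensus could compute a different number than the service.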

For the service, it ONLY uploads descriptors if it has a live consensus, this is enforced in should_service_upload_descriptor() so no way to go around. It means that once a client gets a descriptor, it is for sure a live consensus that the service has.

That being said:

  • hs_in_period_between_tp_and_srv() is only used by the service so that is fine to keep using live consensus requirement.
  • The problem, as you pointed out, is the use of functions from the SR subsystem, whose live consensus requirement we can't change.

In conclusion, we need a set of functions that do not require a live consensus and that only the HS client would use, while the SR would keep using something that requires the live consensus.
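One possible shape for such a split is a single helper that is told whether liveness matters. This is only a sketch under stated assumptions: the `Consensus` tuple, `voting_interval`, the `require_live` flag and the 3600-second fallback are all hypothetical and do not exist in tor under these names.

```python
from collections import namedtuple

# Hypothetical stand-in for a consensus object; only the fields used here.
Consensus = namedtuple("Consensus", ["valid_after", "fresh_until", "valid_until"])

DEFAULT_VOTING_INTERVAL = 3600  # assumed fallback when no usable consensus exists


def voting_interval(consensus, now, require_live):
    """Return the voting interval in seconds.

    With require_live=True (SR / dirauth use) a consensus outside its
    validity window is ignored. With require_live=False (HS client use)
    whatever consensus we hold is used.
    """
    if consensus is None:
        return DEFAULT_VOTING_INTERVAL
    live = consensus.valid_after <= now <= consensus.valid_until
    if require_live and not live:
        return DEFAULT_VOTING_INTERVAL
    return consensus.fresh_until - consensus.valid_after


# Expired consensus with a 30-minute voting interval (made-up values).
stale = Consensus(valid_after=0, fresh_until=1800, valid_until=5400)
now = 10000

sr_interval = voting_interval(stale, now, require_live=True)
hs_client_interval = voting_interval(stale, now, require_live=False)
print(sr_interval, hs_client_interval)  # 3600 1800
```

The SR path keeps its strict behavior, while the HS client path gets a value derived from the consensus it actually holds instead of the fallback.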

I'm unsure about 032 for this. I think we might be OK to postpone this until we get a sense of the impact on reachability/usability. Who knows, maybe 99.9% of tor clients work with a live consensus, and maybe it is good for them to wait for it.

comment:8 Changed 9 months ago by dgoulet

Milestone: changed from Tor: 0.3.3.x-final to Tor: 0.3.4.x-final

Move 033 ticket I own to 034

comment:9 Changed 7 months ago by nickm

Keywords: 034-triage-20180328 added

comment:10 Changed 7 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:11 Changed 6 months ago by nickm

Milestone: changed from Tor: 0.3.4.x-final to Tor: unspecified

These needs_revision tickets, tagged with 034-removed-*, are no longer in scope for 0.3.4. We can reconsider any of them if somebody does the necessary revision.
