Today someone came into the v3 testing hub IRC channel who couldn't use v3 onions at all.
It turns out that this log message kept appearing for any v3 address:
[info] hs_client_refetch_hsdesc(): Can't fetch descriptor for service [scrubbed] because we are missing a live consensus. Stalling connection.
But their tor never got a live consensus. We could see that it was trying to fetch one from its bridge:
[info] Received http status code 304 ("Not modified") from server 'BRIDGE_IP' while fetching consensus directory.
Sooooo, somehow the bridge has a consensus that it thinks is live enough to use, but when the client gets it, the client doesn't consider it live. I can imagine clock skew between the client and the bridge causing this?
This makes me question the requirement for a "live consensus" in the HS v3 subsystem. v2 doesn't check for that at all; it only cares whether tor has completed a circuit, and then it uses the consensus even if it isn't live.
Maybe the client side could just use whatever consensus tor thinks is usable, and hope that it is enough to reach the service?
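To make the clock-skew theory concrete: roughly speaking, a consensus is "live" when the local clock falls inside its validity window. Illustration only, not tor's exact code; the networkstatus_t fields are real, the helper is made up:

```c
/* Illustration only, not tor's actual liveness check. */
static int
consensus_looks_live(const networkstatus_t *ns, time_t now)
{
  /* valid_after/valid_until come from the consensus itself, but "now" is
   * the local clock.  A skewed client clock can make this check fail for
   * the very same consensus the bridge considers live. */
  return ns->valid_after <= now && now <= ns->valid_until;
}
```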
I experience this periodically with a long-lived non-HS client.
Jul 4 01:35:55 Tor[]: Tor 0.2.9.11 (git-aa8950022562be76) running on Linux with {elided}
Jul 4 01:35:55 Tor[]: Read configuration file "/home/tor/torrc".
Jul 4 01:35:55 Tor[]: Opening Socks listener on {elided}
Jul 4 01:35:55 Tor[]: Opening Control listener on {elided}
Jul 4 01:35:55 Tor[]: Parsing GEOIP IPv4 file ./geoip.
Jul 4 01:35:56 Tor[]: Parsing GEOIP IPv6 file ./geoip6.
Jul 4 01:35:56 Tor[]: Bootstrapped 0%: Starting
Jul 4 01:35:59 Tor[]: new bridge descriptor '{elided}' (cached): ${elided} at {elided}
Jul 4 01:35:59 Tor[]: Bootstrapped 80%: Connecting to the Tor network
Jul 4 01:36:00 Tor[]: Bootstrapped 85%: Finishing handshake with first hop
Jul 4 01:36:00 Tor[]: Bootstrapped 90%: Establishing a Tor circuit
Jul 4 01:36:00 Tor[]: Tor has successfully opened a circuit. Looks like client functionality is working.
Jul 4 01:36:00 Tor[]: Bootstrapped 100%: Done
Aug 22 01:23:55 Tor[]: new bridge descriptor '{elided}' (fresh): ${elided} at {elided}
Aug 24 00:19:38 Tor[]: Delaying directory fetches: No running bridges
Aug 24 01:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 01:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 02:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:9131. Giving up. (waiting for circuit)
Aug 24 02:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 02:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Aug 24 02:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 02:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 03:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 03:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:9131. Giving up. (waiting for circuit)
Aug 24 03:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Aug 24 04:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 04:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:9131. Giving up. (waiting for circuit)
Aug 24 04:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Aug 24 04:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 04:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 05:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 05:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Aug 24 05:05:01 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 05:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 05:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:9131. Giving up. (waiting for circuit)
Aug 24 06:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:9131. Giving up. (waiting for circuit)
Aug 24 06:05:00 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Aug 24 06:05:02 Tor[]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Aug 24 07:03:16 Tor[]: Catching signal TERM, exiting cleanly.
a) If you can't seem to get a live consensus from your bridge, but you can get a reasonably live one, then build a circuit to the fallback dirs and try to get a live consensus.
b) If we can't get a live consensus for a while, attempt to connect to v3 with a reasonably live consensus. Or the opposite.
a) If you can't seem to get a live consensus from your bridge, but you can get a reasonably live one, then build a circuit to the fallback dirs and try to get a live consensus.
Yes, definitely, but that is probably a broader bug in tor itself rather than in HS. I'll investigate this a bit more and open a ticket for it, so it's not part of this fix.
b) If we can't get a live consensus for a while, attempt to connect to v3 with a reasonably live consensus. Or the opposite.
OK, here is my thinking on this. To simplify things, I think the HS client should always check for a reasonably live consensus, and here is why.
1) We have a live consensus: the networkstatus_get_reasonably_live_consensus() function returns it.
2) We do not have a live consensus, which means tor is trying to download one.
I see a few scenarios here that can lead to failures (maybe more, but these will make my point clear):
2.1) The client has a skewed clock, so the consensus it receives is never considered live, BUT chances are that the consensus it has is actually live.
2.2) The dir cache/authority simply can't provide a newer one, so no live consensus.
2.3) We are waiting to download a consensus because we've already tried many times and we are backing off for now. In other words, we are stalling a bit, so no live consensus.
2.4) We wait for it, but we can't be sure we'll get a live one.
In any of these cases, we want to give the HS client a chance and thus use a reasonably live consensus if one is available. Unless we consider 2.4) reason enough for the client to stall, so that it has a better chance of computing a hashring for which at least one HSDir will work out.
I'm unsure here; I think it would add a lot of complexity to the code to check whether we have an in-flight consensus download, and whether we've already failed, say, twice to fetch a live one before deciding "OK, let's use a reasonably live one".
A reasonably live consensus allows at most 24h of skew, and I think a tor client will rarely be stuck using one that old, since the bootstrap process downloads a consensus anyway. And if for some reason the client doesn't think its consensus is live, chances are that it actually is live, or close to it. The only scenario I see where that might not be true is if the dirauths failed for many hours to create a consensus, or the client can't reach a dir cache, but then I believe the dirinfo would also not be enough to continue anyway.
THAT being said, after this WALL of text, here is my proposed fix: use networkstatus_get_latest_consensus(), which is by the way what the hs_get_responsible_hsdirs() function already uses.
We get a consensus at bootstrap, or else we don't bootstrap. We'll never use a consensus older than 24h (reasonably live), and our dirinfo will never be usable if our consensus is too old. So let's just use the latest consensus we have, which is basically tor's best effort? And we know the service will be reachable with a consensus up to 24h in the past because of the 24h overlap period.
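Roughly what I have in mind, as a sketch: the networkstatus_* call is real tor API, but the helper itself and its name are made up here just to show the idea.

```c
/* Sketch only: the kind of check I'd like the HS client to do instead of
 * requiring a strictly live consensus.  Hypothetical helper. */
static const networkstatus_t *
hs_client_usable_consensus(void)
{
  /* Best effort: whatever consensus tor currently holds.  This is already
   * what hs_get_responsible_hsdirs() relies on internally. */
  const networkstatus_t *ns = networkstatus_get_latest_consensus();
  if (ns == NULL) {
    /* Truly nothing to work with: the client has to keep stalling. */
    return NULL;
  }
  /* Thanks to the 24h overlap period, a consensus up to ~24h old should
   * still let us compute a hashring that contains at least one HSDir
   * holding the descriptor. */
  return ns;
}
```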
The fix is simple, but many things need to change in the tests so that they no longer MOCK the live consensus function. So before I do that, I would like a second opinion.
Hmm, I took another look here and drew the various diagrams, and I did not find any cases where your suggestion causes problems. However, it's been a while since I last carefully thought about these reachability diagrams and I'm afraid I'm forgetting something. I'm also wary because I remember a conscious decision to require a live consensus, but perhaps that was invalidated when we switched to 24-hour overlap periods...
I think your suggested change makes sense, but if we are to do it, I think we need to implement the new cases in test_reachability() and test_client_service_hsdir_set_sync(), since we are basically creating a few extra scenarios here that we should make sure we handle gracefully. Also, I'd like to test the branch for a few days on my Tor Browser using a bridge, to make sure that nothing goes weird before merging upstream.
BTW, here is an example of a thing that might go bad (and that the suggested tests above would not catch):
get_voting_interval(), hs_get_time_period_num() and hs_in_period_between_tp_and_srv() all use networkstatus_get_live_consensus() and are used both by clients and services in multiple places. If a live consensus is not found, those functions have an alternative behavior, and with the suggested change here we could be hitting that alternative behavior more frequently. We should make sure that this won't cause any reachability issues.
Good points!
I did an investigation, and shared random (SR) is the problem. We have tightly coupled the HS and SR subsystems through those functions. Dirauths MUST use the live consensus for their computation of intervals and the time period, so we can't change get_voting_interval() to use the "latest consensus". It could have consequences on the SR (maybe, although I really wonder whether a dirauth can even operate without a live consensus...).
To pull this off, we would need a way to control which ns object we pass to the SR-subsystem functions that the HS subsystem needs, or a way to tell those functions that we do not care about liveness.
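Something along these lines, as a rough sketch: get_voting_interval() is the real function; the *_from_consensus() variant, its name and its wiring are invented here just to show the idea.

```c
/* Hypothetical split: let the caller decide which consensus to use. */
static int
voting_interval_from_consensus(const networkstatus_t *ns, int fallback)
{
  if (ns) {
    /* The voting interval is encoded in the consensus validity window. */
    return (int)(ns->fresh_until - ns->valid_after);
  }
  /* No consensus given: use whatever fallback the caller wants. */
  return fallback;
}
```

The SR/dirauth side would keep feeding it networkstatus_get_live_consensus(), while the HS client could pass the latest (or reasonably live) consensus it has.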
For hs_get_time_period_num(), it is OK to stay like this, using a live consensus, because that function returns the time period tor thinks it is in: it should use either a live consensus or just wing it with the current time, which is what we want if we have no live consensus but still want to try to reach the HS.
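For context, the fallback behavior I'm referring to is roughly this (paraphrased, not a verbatim excerpt from hs_common.c; the helper and its name are mine):

```c
/* Paraphrase of the fallback: anchor on the live consensus valid_after if
 * we have one, otherwise "wing it" with the local clock. */
static time_t
hs_reference_time(void)
{
  const networkstatus_t *ns =
    networkstatus_get_live_consensus(approx_time());
  return ns ? ns->valid_after : approx_time();
}
```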
For the service, it ONLY uploads descriptors if it has a live consensus; this is enforced in should_service_upload_descriptor(), so there is no way around it. It means that once a client gets a descriptor, the service that built it had, for sure, a live consensus.
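Paraphrasing the constraint (the actual enforcement is in should_service_upload_descriptor() in hs_service.c; the helper below is hypothetical, not a verbatim excerpt):

```c
/* No live consensus, no descriptor upload.  Consequence: any descriptor a
 * client manages to fetch was built by a service that had a live
 * consensus at upload time. */
static int
service_can_upload_now(time_t now)
{
  return networkstatus_get_live_consensus(now) != NULL;
}
```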
That being said:
hs_in_period_between_tp_and_srv() is only used by the service, so it is fine to keep the live consensus requirement there.
The problem, as you pointed out, is the use of functions from the SR subsystem, whose requirement we can't change.
In conclusion, we need a set of functions that do not require a live consensus and that only the HS client would use, while the SR subsystem keeps using something that does require a live consensus.
I'm unsure about 0.3.2 for this. I think we are OK to postpone it until we get a sense of the impact on reachability/usability, and who knows, maybe 99.9% of tor clients do get a live consensus, and maybe it is actually good for them to wait for it.
Trac: Status: needs_information to needs_revision; Milestone: Tor: 0.3.2.x-final to Tor: 0.3.3.x-final
These needs_revision tickets, tagged with 034-removed-*, are no longer in scope for 0.3.4. We can reconsider any of them if somebody does the necessary revision.
Trac: Milestone: Tor: 0.3.4.x-final to Tor: unspecified
When we merge #24661 (moved), the HS subsystem will be the only remaining client code that requires a live consensus. I think we could do this ticket in Sponsor 8, if you want.
(Otherwise, clients will bootstrap with clock skew, but they won't be able to use v3 onion services.)
I'm gonna go out on a limb here and say that this is a bit "out of scope" in some ways, or just too complicated for s8 at this stage.
I've gone over the thread above (which is kind of old; things have changed a bit since then), and what I can say is that the changes would need to happen in many places and thus would require us to considerably expand our reachability unit testing.
First, in can_client_refetch_desc(), to let the client try to download a descriptor without a live consensus.
The second big part would be in hs_get_responsible_hsdirs(), which also requires a live consensus but is also used by the service... so some splitting would be needed.
Then finally, maybe make hs_get_time_period_num() fall back to the "latest consensus" instead of approx_time() if a live consensus can't be found. The idea here is that the whole subsystem has to use the same time source. Having code paths that mix the "latest consensus valid_after" time with approx_time() is a recipe for reachability issues.
We've had so many timing issues over the years, and we ended up realizing that whatever we use, the entire subsystem needs to use the same time source. In theory, right now, the "live consensus valid_after" should be used across the board. Part of me thinks we would benefit from an "HS time source" that is updated every time we get a new consensus and that the HS subsystem exclusively uses.
Trac: Sponsor: Sponsor8-can to N/A; Status: needs_revision to new
I'm gonna go out on a limb here and say that this is a bit "out of scope" in some ways, or just too complicated for s8 at this stage.
I agree; I don't think we can get this done in a few weeks, but we should do it eventually, because Tor clients can now bootstrap and use exits with a reasonably live consensus (or a skewed clock), but they can't use v3 onion services.
I've gone over the thread above (which is kind of old; things have changed a bit since then), and what I can say is that the changes would need to happen in many places and thus would require us to considerably expand our reachability unit testing.
First, in can_client_refetch_desc(), to let the client try to download a descriptor without a live consensus.
The second big part would be in hs_get_responsible_hsdirs(), which also requires a live consensus but is also used by the service... so some splitting would be needed.
No, services should also work with a reasonably live consensus. Otherwise, people running services on small devices with skewed clocks will be sad.
Then finally, maybe make hs_get_time_period_num() fall back to the "latest consensus" instead of approx_time() if a live consensus can't be found. The idea here is that the whole subsystem has to use the same time source. Having code paths that mix the "latest consensus valid_after" time with approx_time() is a recipe for reachability issues.
We've had so many timing issues over the years, and we ended up realizing that whatever we use, the entire subsystem needs to use the same time source. In theory, right now, the "live consensus valid_after" should be used across the board. Part of me thinks we would benefit from an "HS time source" that is updated every time we get a new consensus and that the HS subsystem exclusively uses.
Sounds like we need a module that handles onion service time.
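Very rough sketch of what such a module could look like. Everything below (names, wiring) is hypothetical; only valid_after and approx_time() are real.

```c
/* Hypothetical "HS time source": one time reference for the whole HS
 * subsystem, refreshed whenever a new consensus arrives, so client and
 * service code never mix consensus time with wall-clock time. */
static time_t hs_time_anchor = 0;

/* To be called from the "new consensus" event. */
void
hs_time_source_note_new_consensus(const networkstatus_t *ns)
{
  hs_time_anchor = ns->valid_after;
}

/* Every HS code path would ask this instead of choosing between live
 * consensus, latest consensus or approx_time() on its own. */
time_t
hs_time_source_now(void)
{
  return hs_time_anchor ? hs_time_anchor : approx_time();
}
```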
The second big part would be in hs_get_responsible_hsdirs(), which also requires a live consensus but is also used by the service... so some splitting would be needed.
No, services should also work with a reasonably live consensus. Otherwise, people running services on small devices with skewed clocks will be sad.
To build such a feature, I would open two tickets, one for the client side and one for the service side. They behave in very different ways, and on the service side the big piece of work would be unit testing, by upgrading our critical reachability test cases.
A few weeks ago, asn suggested I look at these commits for this ticket:
<+asn> teor: perhaps a useful commit to see while you are checking out code/spec is 2520ee34c6d1b5eb83a6c3ffdaf1e8b3013b619f
<+asn> teor: wrt hsv3 and live consensus
<+asn> teor: also 9e900d1db7c8c9e164b5b14d5cdd4099c1ce45f0
<+asn> teor: and b89d2fa1db2379bffd2e2b4c851c3facc57b6ed8