Opened 18 months ago

Last modified 5 months ago

#21969 new defect

We're missing descriptors for some of our primary entry guards

Reported by: asn Owned by: asn
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.3.0.6
Severity: Normal Keywords: tor-guard, tor-bridge, tor-client, tbb-usability-website, tbb-needs, 034-triage-20180328, 034-removed-20180328
Cc: catalyst, isis, bmeson, mrphs Actual Points:
Parent ID: Points: 1.5
Reviewer: Sponsor: SponsorV

Description

Seems like there are still a few cases where bootstrapping will stall because of:

I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards 

It seems to happen mainly with bridges. For me it occurred with the latest master (b081a7ed2) when I was switching between using private bridges and the default TBB bridges. I have also heard reports from s7r about this happening.

We last changed this behavior in #21415, when we made Tor require that we have the descriptor of 2 bridges before we start building circuits.

Child Tickets

Ticket | Status | Owner | Summary | Component
#22325 | closed | | "we're missing descriptors for some of our primary entry guards" while using bridges | Core Tor/Tor
#23234 | closed | | Possible problem with bootstrapping logic (Problem bootstrapping. Stuck at 53%: Loading relay descriptors. (No route to host; NOROUTE; count 7; recommendation warn) | Core Tor/Tor
#23620 | new | | Tor lies about "Optimistically trying directory fetches again" | Core Tor/Tor
#23670 | closed | asn | Say how many primary entry guards we're missing descriptors for | Core Tor/Tor
#23671 | closed | asn | Say how many other nodes we're missing descriptors for | Core Tor/Tor
#23817 | closed | | Tor re-tries directory mirrors that it knows are missing microdescriptors | Core Tor/Tor
#23862 | closed | asn | Tor only updates guard state after a consensus if it has enough directory info | Core Tor/Tor
#23863 | new | | When our directory guards don't have each others' microdescs, we should try an authority or fallback | Core Tor/Tor
#23985 | closed | nickm | If less than 15 missing mds, Tor will delay md download for 10 mins | Core Tor/Tor
#23989 | closed | nickm | entry_guards_update_all() will pretend to update primaries even if sampled set is empty | Core Tor/Tor
#24511 | closed | | TorBrowser 7.5 a8 takes multiple minutes to connect | Core Tor/Tor
#24654 | new | tbb-team | Tor Browser 7.5a9 timed out waiting for circuit | Applications/Tor Browser
#25347 | needs_revision | asn | Tor keeps on trying the same overloaded guard over and over | Core Tor/Tor

Attachments (8)

tordebug.20170504.log (182.9 KB) - added by catalyst 18 months ago.
21969_primary_guards_descriptors_initial_bootstrap.txt (7.1 KB) - added by s7r 17 months ago.
21969_first_primary_guard_descriptor_inexistent.txt (243.4 KB) - added by s7r 17 months ago.
info.log.tgz (1.5 MB) - added by dgoulet 13 months ago.
OnionsTimeouts.png (22.8 KB) - added by bmeson 13 months ago.
info.log.2.tgz (2.0 MB) - added by dgoulet 13 months ago.
brokenHSindebug.tar.gz (2.7 MB) - added by bmeson 13 months ago.
info.log.3.tgz (613.9 KB) - added by dgoulet 12 months ago.

Change History (98)

comment:1 Changed 18 months ago by s7r

Right. I am getting it every 2 days or so. The server is on a 1 Gbps link in an ISP datacenter, so a slow network connection is out of the question.

I am running 0.3.1.0-alpha-dev (git-b1c7e5d8c0a648b1+f7afd80) with default guard context (no bridges).

This Tor instance hosts a hidden service, and when this happens it can no longer build new rend circuits. However, let's think about it from another perspective: maybe the bug is actually fixed in terms of the logic of requesting descriptors, and what's left is a race condition in how they are processed after being downloaded / refreshed. I am saying this because, based on the timestamps, at exactly the same time or immediately after (~1 second) it complains that it is missing descriptors for some of our primary entry guards, it comes back to reality and says We now have enough directory information to build circuits. And after more time (minutes - is this normal?) it confirms that it successfully opened a circuit. This is logged twice, every time, with the very same timestamp (is this also normal?)

Check timestamps in these last events:

Apr 12 04:26:10 vm16 Tor[978]: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Apr 12 04:26:10 vm16 Tor[978]: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Apr 12 04:26:10 vm16 Tor[978]: We now have enough directory information to build circuits.
Apr 12 04:27:19 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 12 04:27:19 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 17 12:04:16 vm16 Tor[978]: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Apr 17 12:04:16 vm16 Tor[978]: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Apr 17 12:04:16 vm16 Tor[978]: We now have enough directory information to build circuits.
Apr 17 12:05:24 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 17 12:05:24 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 19 05:38:18 vm16 Tor[978]: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Apr 19 05:38:18 vm16 Tor[978]: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Apr 19 05:38:18 vm16 Tor[978]: We now have enough directory information to build circuits.
Apr 19 05:42:12 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 19 05:42:12 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 21 00:48:20 vm16 Tor[978]: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Apr 21 00:48:20 vm16 Tor[978]: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Apr 21 00:48:21 vm16 Tor[978]: We now have enough directory information to build circuits.
Apr 21 00:49:26 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 21 00:49:26 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.

asn: should we allow a time period for Tor to process new descriptors upon refresh? On the order of seconds, so slower computers or computers on slower connections don't get these warnings. During that time, the previous descriptors can still be used. This also requires ensuring that we try to refresh guard descriptors N minutes before the ones we already have expire, so that we actually have this time window to handle it.
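
A minimal sketch of that grace-period idea, offered as a suggestion rather than a patch: record when a primary guard's descriptor first goes missing, and only treat it as "missing" (for warning and circuit-blocking purposes) once it has been missing longer than a short window after the refresh. The field and constant names below are assumptions made for the example, not Tor's actual code.

#include <time.h>

/* Illustrative only: how long after a refresh we tolerate a missing
 * primary-guard descriptor before we warn and stop building circuits. */
#define DESC_MISSING_GRACE_PERIOD 10  /* seconds */

static int
desc_counts_as_missing(time_t missing_since, time_t now)
{
  if (missing_since == 0)
    return 0;  /* we have the descriptor; nothing is missing */
  /* Within the grace window we keep using the previous descriptor and
   * stay quiet; after it, we report the descriptor as missing. */
  return (now - missing_since) > DESC_MISSING_GRACE_PERIOD;
}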

comment:2 Changed 18 months ago by catalyst

Cc: catalyst added

comment:3 Changed 18 months ago by alecmuffett

Clipped from my comments on tor-dev:

I am running a cluster of 12 "worker" tor daemons (0.3.0.6) across 6 machines, under Onionbalance.

So far, 2x of the worker-onion daemons have gone "stale" - are not publishing descriptors, are not reachable from the tor network - even though their Log.NOTICE-level logfiles show nothing weird. I've sent SIGUSR2 to one of them to enable more verbose logging.

The only notable message is: Diagnostic for issue 8387: Found 2 one-hop circuits more than 1800 seconds old! Logging 2... - but all of the workers are doing that occasionally, whereas only 2x of them are stale.

Also: in each case where there is a stale tor-daemon on a machine, the other daemon, in a separate directory, is not stale - so, it can't be a connectivity issue.

I did egrep '(for.some.of.our|now.have.enough)' */tor.log on the fleet of workers; it turns out that all 12x of the worker daemons have at various points issued the following messages:

Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards

...and...

I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards.

HOWEVER the 2x STALE daemons have not issued a subsequent:

We now have enough directory information to build circuits

...so it appears that they, in particular, have given up trying to fetch directory information - a state which has persisted for several days now.

Sample tor.conf follows

# -*- conf -*-
# eotk (c) 2017 Alec Muffett

# template note: here we use TOR_DIR not PROJECT_DIR because of the
# relocation of Tor directories under `softmap`
DataDirectory /home/pi/eotk/projects.d/projname.d/hs-2.d
ControlPort unix:/home/pi/eotk/projects.d/projname.d/hs-2.d/tor-control.sock
PidFile /home/pi/eotk/projects.d/projname.d/hs-2.d/tor.pid
Log notice file /home/pi/eotk/projects.d/projname.d/hs-2.d/tor.log
SafeLogging 1
HeartbeatPeriod 60 minutes
LongLivedPorts 80,443
RunAsDaemon 1

# use single onions
SocksPort 0 # have to disable this for single onions
HiddenServiceSingleHopMode 1 # yep, i want single onions
HiddenServiceNonAnonymousMode 1 # yes, really, honest, i swear

# softmap
HiddenServiceDir /home/pi/eotk/projects.d/projname.d/hs-2.d
HiddenServicePort 80 unix:/home/pi/eotk/projects.d/projname.d/hs-2.d/port-80.sock
HiddenServicePort 443 unix:/home/pi/eotk/projects.d/projname.d/hs-2.d/port-443.sock
HiddenServiceNumIntroductionPoints 3

comment:4 Changed 18 months ago by nickm

Priority: Medium → High

comment:5 Changed 18 months ago by asn

We need to do something smart about this problem...

I was going to suggest we just drop the non-fatal assert of #21415 and only check for the descriptor of the first primary guard (instead of the first two), but I actually feel I don't understand the issue well enough.

Perhaps some log messages that show the names of the guards that are missing descriptors would be helpful. Also, I'm wondering why Alec's tor is not making any effort to fetch the missing descriptors, even over multiple hours. Maybe we should also log our attempts to fetch the descriptors that are missing?

Another theory: Since primary guards come from CONFIRMED_GUARDS, there is no guarantee that they are currently up and running. Could it be that Alec's primary guard was offline and off the consensus at that time, so Alec's tor couldn't even fetch the needed descriptor and proceed?

Changed 18 months ago by catalyst

Attachment: tordebug.20170504.log added

comment:6 Changed 18 months ago by catalyst

I have noticed a few problems that I think are likely related.

I attached a log of a failed bootstrap with obfs4 (Tor Browser 7.0a3 with the default obfs4 bridge configuration). It appears to have gotten stuck because none of the selected guards was reachable when trying to download the consensus. (The guards look like they are selected prior to determining their reachability, which I vaguely recall is a countermeasure against some kinds of attacks, but it is potentially a usability problem during bootstrap.)

Note that two of the guards have the same fingerprint (A832D176ECD5C7C6B58825AE22FC4C90FA249637), and seem to be multiple ports on the same IP according to the TBB-generated torrc. Maybe we should prevent this during guard selection?

comment:7 Changed 18 months ago by isis

Cc: isis added
Keywords: tor-bridge added

comment:8 Changed 18 months ago by alecmuffett

The log messages in my case are:

May 03 08:58:45.000 [notice] Heartbeat: Tor's uptime is 12:00 hours, with 6 circuits open. I've sent 2.28 MB and received 9.59 MB.
May 03 09:21:47.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
May 03 09:21:47.000 [notice] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
May 03 09:21:48.000 [notice] We now have enough directory information to build circuits.
May 03 09:58:45.000 [notice] Heartbeat: Tor's uptime is 13:00 hours, with 7 circuits open. I've sent 2.44 MB and received 10.39 MB.
May 03 10:11:48.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
May 03 10:11:48.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
May 03 10:58:45.000 [notice] Heartbeat: Tor's uptime is 14:00 hours, with 6 circuits open. I've sent 2.54 MB and received 10.49 MB.

...and then later...

May 04 17:25:50.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
May 04 17:25:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
May 04 17:58:48.000 [notice] Heartbeat: Tor's uptime is 1 day 21:00 hours, with 3 circuits open. I've sent 6.51 MB and received 25.41 MB.

...with no subsequent We now have enough...

At the risk of sounding facile, what about replacing the "some of our primary" message with an N-out-of-M fraction?
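
A rough sketch of what such an N-out-of-M message could look like, using Tor's smartlist and logging helpers; guard_has_descriptor() is an assumed helper here, and this is not the patch that eventually landed via the child tickets above.

/* Count how many primary guards lack a cached descriptor and say N/M
 * instead of "some". Assumes Tor's internal headers (smartlist, logging,
 * entry_guard_t); guard_has_descriptor() is an assumed helper. */
static void
log_missing_primary_guard_descs(const smartlist_t *primary_guards)
{
  int n_missing = 0;
  SMARTLIST_FOREACH_BEGIN(primary_guards, const entry_guard_t *, guard) {
    if (!guard_has_descriptor(guard))
      n_missing++;
  } SMARTLIST_FOREACH_END(guard);
  log_notice(LD_GUARD, "We're missing descriptors for %d/%d of our "
             "primary entry guards.",
             n_missing, smartlist_len(primary_guards));
}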

comment:9 in reply to:  6 ; Changed 18 months ago by teor

Replying to catalyst:

...
I attached a log of a failed bootstrap with obfs4 (Tor Browser 7.0a3 with the default obfs4 bridge configuration). It appears to have gotten stuck because none of the selected guards was reachable when trying to download the consensus.

IMO, the right fix here is to fall back to a fallback directory mirror. It's what they are there for.

...
Note that two of the guards have the same fingerprint (A832D176ECD5C7C6B58825AE22FC4C90FA249637), and seem to be multiple ports on the same IP according to the TBB-generated torrc. Maybe we should prevent this during guard selection?

Unless, of course, you are using a restricted guard set or bridges, where we should... just try another guard?

Also, maybe we want tor to try an IPv4 and an IPv6 address for your bridge if you only have one dual-stack bridge?

comment:10 in reply to:  9 ; Changed 18 months ago by catalyst

Replying to teor:

IMO, the right fix here is to fall back to a fallback directory mirror. It's what they are there for.

I'm not sure I understand what this means. If a working bridge is a directory mirror, why is that not sufficient? Or maybe all of the (non-guard) working bridges fail to be directory mirrors?

comment:11 in reply to:  10 ; Changed 18 months ago by teor

Replying to catalyst:

Replying to teor:

IMO, the right fix here is to fall back to a fallback directory mirror. It's what they are there for.

This comment was about unrestricted entry guard settings, not bridges.

I'm not sure I understand what this means. If a working bridge is a directory mirror, why is that not sufficient? Or maybe all of the (non-guard) working bridges fail to be directory mirrors?

All bridges are directory mirrors.

But there will always be a tradeoff when all our primary guards aren't working between:

  • fail (and be more secure), and
  • try another guard/bridge/fallback/directory authority (and be more usable).

Looks like we set the bar too low. Or there is another bug.

comment:12 in reply to:  11 Changed 18 months ago by catalyst

Replying to teor:

All bridges are directory mirrors.

But there will always be a tradeoff when all our primary guards aren't working between:

  • fail (and be more secure), and
  • try another guard/bridge/fallback/directory authority (and be more usable).

Looks like we set the bar too low. Or there is another bug.

I think maybe we should be more lenient about trying other guards when we have no (or very limited) reachability information (e.g., during bootstrapping from zero state) than when we have fresh reachability information, especially when bridges are involved.

comment:13 Changed 18 months ago by s7r

asn: you could be right with the second theory. Maybe the descriptors that need to be fetched are not even available, because the guard is not listed as running in the current consensus. This is based on a log analysis I just did:

Over a period of ~1 hour and 15 minutes (20:40 - 21:55), it starts like this (this instance is using the default Guard context, no bridges):

May 11 20:40:54 electrum Tor[7669]: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
May 11 20:40:54 electrum Tor[7669]: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards

Immediately after, over 500 messages like this one (with 1-3 seconds between them):

May 11 20:41:01 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:04 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:07 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:10 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:44 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:47 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:49 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:51 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.

[...] and so on, until:

May 11 21:54:55 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 21:54:57 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 21:54:59 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 21:55:04 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 21:55:05 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 21:55:53 electrum Tor[7669]: We now have enough directory information to build circuits.
May 11 21:55:54 electrum Tor[7669]: Tor has successfully opened a circuit. Looks like client functionality is working.
May 11 21:55:54 electrum Tor[7669]: Tor has successfully opened a circuit. Looks like client functionality is working.

Then it stopped complaining about failing the first hop of a rendezvous circuit. The first hop is obviously the guard.

There was a clear cutoff of over 1 hour in this case, just because one of the descriptors was missing. I don't know if the first primary guard was in that consensus as running or not.

To make an informed decision, let's log some more information in a custom branch:

  • all primary guards in their order (first, second, etc)
  • the primary guard(s) for which we are missing descriptors only
  • the status of all primary guards in the current consensus (running / not running)
  • attempts to fetch the missing descriptors and status of this operation (success / failure).

Maybe something else?

comment:14 Changed 18 months ago by arma

Nick asked if we could defer this one to 0.3.2, and based on my current understanding of the ticket, my answer was "no, this looks like a serious bug that needs attention".

comment:15 in reply to:  13 Changed 18 months ago by asn

Replying to s7r:

asn: you could be right with the second theory. Maybe the descriptors that need to be fetched are not even available, because the guard is not listed as running in the current consensus. This is based on a log analysis I just did:

For the record, the theory is that since the primary guards are just an ordered subset of guards that have been used in the past, there is no guarantee that they are currently online or in the consensus. Hence, this can occur naturally if your primary guards or bridges are offline.

I'm not sure if this is the only possible way that our logic can introduce big bootstrapping delays, but it seems like a plausible one.

Perhaps to address this for the non-bridge case, we should count a primary guard descriptor as missing (for the purposes of guard_selection_have_enough_dir_info_to_build_circuits()) only if that guard is present in the consensus. For the case of bridges, it might make sense to relax or disable our dir info requirements completely.

comment:16 Changed 18 months ago by nickm

Owner: set to asn
Status: new → assigned

Reassigning with permission; thanks!

comment:17 in reply to:  13 Changed 17 months ago by asn

Replying to s7r:

To make an informed decision, let's log some more information in a custom branch:

  • all primary guards in their order (first, second, etc)
  • the primary guard(s) for which we are missing descriptors only
  • the status of all primary guards in the current consensus (running / not running)
  • attempts to fetch the missing descriptors and status of this operation (success / failure).

Please check my branch bug21969_debug which adds the logs you asked for and will help debug the non-bridge case!

The branch does all the above items except the last one (log guard desc fetches) because that was pretty hard to do properly, and would require non-trivial hacking time. I think the current logging branch has enough info to help us understand if the theory from comment:15 is what's causing trouble in the non-bridge case.

comment:18 Changed 17 months ago by s7r

OK, switching to that branch on some machines as of now; I will get back with comments.

This will help us understand the problem from comment:13, where the cutoff time is > ~1 hour (one consensus period). What do you think we should do about the problem described in comment:1, where it complains only for a few seconds (much less than an entire consensus period)? I think it might be because:

  • we have many relays in the consensus (over 7000), the number is growing, and it takes more and more time for clients to download and process descriptors, therefore we get a few seconds of lag here
  • we have no feature to prioritize downloading the descriptors we are interested in (primary guards, when/if they are available); maybe we could even try simultaneous downloads from all 3 DirGuards at once and close the remaining connections once one finishes.

comment:19 Changed 17 months ago by s7r

Thinking some more about this, maybe we can forget about descriptor priority downloads, because in the long term we won't need them and they will just complicate things for us. If I recall correctly (please confirm), at the moment we are trying to make all relays directory servers using BEGIN_DIR on the ORPort, which means Guard == DirGuard, so clients can first of all download the guard descriptor directly from the guard itself, and this whole dance is avoided.

comment:20 Changed 17 months ago by s7r

OK asn, running with your branch bug21969_debug. I have attached the initial bootstrap log; as you can see, at 62% bootstrapped it has the descriptors for primary guards 1 and 3 but not for 2. Also, the descriptor for the first primary guard wasn't downloaded in the first ~40% of bootstrap - there is no priority logic.

Let's give it some time and I will let you know what it says when there's that cutoff.

comment:21 Changed 17 months ago by alecmuffett

I'm back after some time away from this; I can confirm the situation never heals - from May 6th, when one daemon started showing the issue, through May 17th, it is still complaining that I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards

Can I be of assistance with testing ideas?

comment:22 in reply to:  19 ; Changed 17 months ago by teor

Replying to s7r:

Thinking some more about this, maybe we can forget about descriptor priority downloads, because in the long term we won't need them and they will just complicate things for us. If I recall correctly (please confirm), at the moment we are trying to make all relays directory servers using BEGIN_DIR on the ORPort, which means Guard == DirGuard, so clients can first of all download the guard descriptor directly from the guard itself, and this whole dance is avoided.

Operators can explicitly disable DirCache, and can also disable it by setting various other options (like AccountingMax), or by having low RAM or bandwidth. Also, DirCache was only introduced in 0.2.8, and we support relays back to 0.2.4.

So while it is true that most guards are DirCaches, not all guards will be, even in the future.

Also, this might enable an attack/issue where a guard posts one descriptor to the directory authorities, and another to its clients. (This is avoided by using microdescriptors, because their hashes are in the consensus.)
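
For context on that last point, a small sketch of the check that microdescriptors make possible: the consensus pins each microdescriptor by its SHA-256 digest, so a guard or cache cannot serve a client a different descriptor than the one the authorities voted on. The digest helpers are from Tor's crypto layer; the surrounding glue is illustrative.

/* Return nonzero iff the downloaded microdescriptor body matches the
 * SHA-256 digest listed for it in the consensus. Assumes Tor's crypto
 * headers; the function itself is just an illustration. */
static int
md_matches_consensus(const char *md_body, size_t md_len,
                     const uint8_t *digest256_from_consensus)
{
  uint8_t d[DIGEST256_LEN];
  if (crypto_digest256((char *)d, md_body, md_len, DIGEST_SHA256) < 0)
    return 0;
  return tor_memeq(d, digest256_from_consensus, DIGEST256_LEN);
}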

comment:23 in reply to:  22 ; Changed 17 months ago by s7r

Replying to teor:

Operators can explicitly disable DirCache, and can also disable it by setting various other options (like AccountingMax), or by having low RAM or bandwidth. Also, DirCache was only introduced in 0.2.8, and we support relays back to 0.2.4.

So while it is true that most guards are DirCaches, not all guards will be, even in the future.

Also, this might enable an attack/issue where a guard posts one descriptor to the directory authorities, and another to its clients. (This is avoided by using microdescriptors, because their hashes are in the consensus.)

Thanks for the feedback, teor. In this case we need to think about logic where, first of all, a client fetches the descriptors of the sampled guards (primary guards first, then down the list of all the guards we have ever connected to, until we have their descriptors). After that we download / refresh the remaining few thousand descriptors.

comment:24 in reply to:  23 Changed 17 months ago by asn

Replying to s7r:

Replying to teor:

Operators can explicitly disable DirCache, and can also disable it by setting various other options (like AccountingMax), or by having low RAM or bandwidth. Also, DirCache was only introduced in 0.2.8, and we support relays back to 0.2.4.

So while it is true that most guards are DirCaches, not all guards will be, even in the future.

Also, this might enable an attack/issue where a guard posts one descriptor to the directory authorities, and another to its clients. (This is avoided by using microdescriptors, because their hashes are in the consensus.)

Thanks for the feedback, teor. In this case we need to think about logic where, first of all, a client fetches the descriptors of the sampled guards (primary guards first, then down the list of all the guards we have ever connected to, until we have their descriptors). After that we download / refresh the remaining few thousand descriptors.

Personally, I'm not so concerned about the case where we fetch the primary guard descriptors in random order, and sometimes it takes 1-3 seconds longer to get a descriptor (as in your logs attachment:21969_primary_guards_descriptors_initial_bootstrap.txt).

I'm more concerned about pathological cases like comment:15 that significantly delay bootstrapping. That can happen in the bridge case if primary bridges are down, or in the non-bridge case if primary guards are down.

Here is a suggested hotfix for the above pathological case of guard_selection_have_enough_dir_info_to_build_circuits():

If we are inspecting a non-bridge primary guard that is missing 
from the current consensus, don't count it as missing a descriptor
for the purposes of that function.

If we are inspecting a primary bridge that's missing a descriptor
for more than N=7 seconds, don't count it as missing a descriptor
for the purposes of that function.

The idea here is that if a primary guard is missing from the consensus, we might never get its descriptor. Also, if that primary guard is a bridge and the bridge is offline, we will never get its descriptor either, so don't bother blocking on it.

FWIW, IMO the time-sensitive logic in the second rule is kind of flaky and might delay bootstrapping. Here is a saner alternative, but I think it will be a bigger hassle to implement:

If we are inspecting a primary bridge that's missing a descriptor
and we have failed to connect to that bridge, don't count it as
missing a descriptor for the purposes of that function.

Also, we should strongly consider engraving these rules (and the general bootstrapping blocking logic) into torspec.
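
A rough sketch of how those two rules might read inside guard_selection_have_enough_dir_info_to_build_circuits(), with everything below (the is_bridge / currently_listed / desc_missing_since fields, guard_has_descriptor(), and N=7) treated as illustrative assumptions rather than a patch:

/* Should we keep blocking circuit building on this primary guard's
 * missing descriptor? Sketch of the proposed rules, not real code. */
static int
primary_guard_blocks_on_descriptor(const entry_guard_t *guard, time_t now)
{
  if (guard_has_descriptor(guard))
    return 0;                 /* nothing missing, nothing to block on */

  if (!guard->is_bridge) {
    /* Rule 1: a non-bridge primary guard that is not in the current
     * consensus may never get a descriptor - don't block on it. */
    if (!guard->currently_listed)
      return 0;
  } else {
    /* Rule 2: a primary bridge whose descriptor has been missing for
     * more than N seconds is probably unreachable - don't block on it. */
    if (now - guard->desc_missing_since > 7)
      return 0;
  }
  return 1;                   /* still waiting on this guard's descriptor */
}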

Last edited 17 months ago by asn

comment:25 Changed 17 months ago by asn

Discussed comment:24 with Nick on IRC. He points out that primary guards need to be listed on the consensus, and hence the top half of the suggested comment:24 fix is not valid anymore. Nick suggests that to fix the bridge case of comment:24 we can check the is_reachable field of the bridge guard, and ignore it if it's not reachable and missing a descriptor.

Still waiting for pathological logs from s7r/alec to learn more.

comment:26 Changed 17 months ago by catalyst

It might be helpful to look at doc/NewGuardAlgorithmTesting in case any information there is still relevant. Maybe do the pre/post comparison on the prop271 effects on bootstrap times, if that isn't already recorded somewhere else.

comment:27 Changed 17 months ago by s7r

OK getting close.

So I have manually edited my state file and added a guard I control as the first primary guard, with confirmed_idx=0. It was running just fine with no complaints. At some point, I configured PublishServerDescriptor 0 in the guard's torrc so that it would stop publishing descriptors. The client instance running the debug branch did not complain immediately, because it still had a valid descriptor, and the first primary guard, while no longer willing to publish descriptors, was still running and accepting connections.

You can see in the log file that after the descriptor expired, it started to panic and never recovered, even after ~3 consensuses. Also, the cutoff did happen: many, many complaints about not being able to build the first hop of a rendezvous circuit (bad).

Here is what the state file says about the primary guard we are talking about at the time of the cutoff:

Guard in=default rsa_id=0C2E555864546F1B3DDC4BE71573B8CC5DD898EE nickname=AmnesicProsecutor sampled_on=2017-05-13T20:25:19 sampled_by=0.3.1.0-alpha-dev listed=1 confirmed_on=2017-05-16T01:45:08 confirmed_idx=0 pb_use_attempts=69.854044 pb_use_successes=69.854044 pb_circ_attempts=234.410156 pb_circ_successes=231.287109 pb_successful_circuits_closed=229.161133 pb_collapsed_circuits=2.125977 pb_timeouts=8.331055

FILTERED_GUARDS requires that IS_LISTED is true -> apparently this is not enforced. It will insanely keep looking for the nonexistent descriptor even though the guard is not listed and also not running, according to the debug branch log.

I have also discovered some other things while testing this, not directly related, but I'll open a separate ticket so this one won't get too messy to follow.

Last edited 17 months ago by s7r

comment:28 in reply to:  27 Changed 17 months ago by asn

Replying to s7r:

OK getting close.

So I have manually edited my state file and added a guard I control as the first primary guard, with confirmed_idx=0. It was running just fine with no complaints. At some point, I configured PublishServerDescriptor 0 in the guard's torrc so that it would stop publishing descriptors. The client instance running the debug branch did not complain immediately, because it still had a valid descriptor, and the first primary guard, while no longer willing to publish descriptors, was still running and accepting connections.

Thanks for the logs; however, I'm not sure that PublishServerDescriptor 0 is the right methodology to reproduce this bug. Doing that will make your relay do the weird thing of not being in the consensus but still accepting connections. I'm not sure how that behavior influences the guard algorithm exactly, and it might complicate the analysis here. I think the easiest test would be to just shut down your guard relay normally, and check whether the guard is still in the primary guard list when it exits the consensus.

comment:29 in reply to:  25 Changed 17 months ago by asn

Replying to asn:

Discussed comment:24 with Nick on IRC. He points out that primary guards need to be listed on the consensus, and hence the top half of the suggested comment:24 fix is not valid anymore. Nick suggests that to fix the bridge case of comment:24 we can check the is_reachable field of the bridge guard, and ignore it if it's not reachable and missing a descriptor.

Still waiting for pathological logs from s7r/alec to learn more.

I think the bridges-case approach where we check is_reachable before counting a guard towards the "needs descriptor" block is possible. However, with the current codebase it's not as easy as checking is_reachable at the time of guard_selection_have_enough_dir_info_to_build_circuits().

That's because we don't actually notify the guard subsystem when a bridge descriptor fetch fails. And that's because in launch_direct_bridge_descriptor_fetch() we never set the guard_state on the directory request, so entry_guard_chan_failed() doesn't update the guard state when the failure happens.

Seems like the correct bridge-side fix here involves the following two steps (a rough sketch follows the list):
1) Add guard_state to bridge descriptor directory request in launch_direct_bridge_descriptor_fetch().
2) Don't count unreachable primary bridges towards the guard_selection_have_enough_dir_info_to_build_circuits() descriptor block limit.
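
A sketch of step (1), assuming the directory_request_t API and that directory_request_set_guard_state() is the same setter used for other guard-tracked directory fetches; get_guard_state_for_bridge_desc_fetch() is a placeholder name, and the real setup calls in launch_direct_bridge_descriptor_fetch() are elided:

/* Illustration only: attach a guard_state to the bridge descriptor fetch
 * so that a failed fetch reaches entry_guard_chan_failed(). */
static void
launch_bridge_desc_fetch_sketch(bridge_info_t *bridge)
{
  directory_request_t *req =
    directory_request_new(DIR_PURPOSE_FETCH_SERVERDESC);
  /* ... existing setup: bridge address/port, identity digest, resource ... */

  /* The missing piece: hand the request a guard state for this bridge. */
  circuit_guard_state_t *guard_state =
    get_guard_state_for_bridge_desc_fetch(bridge);  /* placeholder helper */
  if (guard_state)
    directory_request_set_guard_state(req, guard_state);

  directory_initiate_request(req);
  directory_request_free(req);
}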

Please see branch bug21969_bridges for the bridge-side fix here. Setting this to needs_review for now.

Last edited 17 months ago by asn

comment:30 Changed 17 months ago by asn

Status: assigned → needs_review

comment:31 Changed 17 months ago by nickm

I want to take this in 0.3.1.1-alpha if possible, but it needs a changes file, and it needs to be based on maint-0.3.0 for a possible backport if it works in 0.3.1.1-alpha.

comment:32 Changed 17 months ago by asn

OK, I have a 0.3.0 branch on bug21969_bridges_030 and a fresh force-pushed 0.3.1 branch on bug21969_bridges. Both of them should contain a changes file.

BTW the whole problem ended up being the missing guard state, so I didn't actually have to do part (2) of comment:29, since we already ignore unreachable bridges:

  SMARTLIST_FOREACH_BEGIN(gs->primary_entry_guards, entry_guard_t *, guard) {
    entry_guard_consider_retry(guard);
    if (guard->is_reachable == GUARD_REACHABLE_NO)
      continue;
    n_considered++;
...

comment:33 Changed 17 months ago by nickm

Thank you, and thanks for making it even smaller!

I've made a new child ticket (#22325) for tracking the bridge bug in particular. In that ticket, I've merged this patch and marked it for backport to 0.3.0. I'll leave this one open for the non-bridge-using case.

comment:34 Changed 17 months ago by nickm

Status: needs_review → assigned

Putting back in "assigned" to track the non-bridge-using case.

comment:35 Changed 17 months ago by cypherpunks

Keywords: tor-client added
Version: Tor: 0.3.0.6

Just started my TBB 7.0a4, exited yesterday, and got:

Tor NOTICE: Bootstrapped 100%: Done 
Tor NOTICE: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards 
Tor NOTICE: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards 
Tor NOTICE: New control connection opened from 127.0.0.1. 
Tor NOTICE: New control connection opened from 127.0.0.1. 
Tor NOTICE: We now have enough directory information to build circuits. 

It looks like Tor is using old info during bootstrap. (No bridges, FWIW.)

comment:36 Changed 15 months ago by cypherpunks

Same here with a system-wide Tor instance on Arch Linux x86_64, not using bridges.
Three hours after bootstrapping and running smoothly, Tor suddenly lost all circuits. I decided to wait and see whether it would reconnect by itself, but it didn't. After restarting Tor, everything went back to normal.

Jul 24 10:17:13.122 [notice] Tor 0.3.0.9 (git-22b3bf094e327093) running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.0f and Zlib 1.2.11.
Jul 24 10:17:13.122 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
Jul 24 10:17:13.122 [notice] Read configuration file "/etc/tor/torrc".
Jul 24 10:17:13.125 [notice] Opening Socks listener on 127.0.0.1:9050
Jul 24 10:17:13.000 [notice] {GENERAL} Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Jul 24 10:17:13.000 [notice] {GENERAL} Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Jul 24 10:17:13.000 [notice] {CONTROL} Bootstrapped 0%: Starting
Jul 24 10:17:13.000 [notice] {GUARD} Starting with guard context "default"
Jul 24 10:17:13.000 [notice] {CONTROL} Bootstrapped 80%: Connecting to the Tor network
Jul 24 10:17:14.000 [notice] {CONTROL} Bootstrapped 85%: Finishing handshake with first hop
Jul 24 10:17:14.000 [notice] {CONTROL} Bootstrapped 90%: Establishing a Tor circuit
Jul 24 10:17:14.000 [notice] {GENERAL} Tor has successfully opened a circuit. Looks like client functionality is working.
Jul 24 10:17:14.000 [notice] {CONTROL} Bootstrapped 100%: Done
Jul 24 13:21:14.000 [notice] {DIR} Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Jul 24 13:21:14.000 [notice] {DIR} I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Jul 24 13:21:26.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:30.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:30.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:48.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:48.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:48.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:48.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:48.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.

[ca. 200 identical messages deleted]

Jul 24 13:39:30.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:39:30.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:39:34.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:39:34.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:39:58.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:39:58.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:40:36.000 [notice] {GENERAL} Interrupt: exiting cleanly.

comment:37 Changed 13 months ago by asn

Milestone: Tor: 0.3.1.x-final → Tor: 0.3.3.x-final
Status: assigned → needs_information

Deferring this to 0.3.3 since we are seeing fewer instances of this these days. Let's reexamine when we learn more.

comment:38 Changed 13 months ago by cypherpunks

Wow! The Tor client failed for the first time since 2014 for me! Without any reason! Uptime was over 5 days. Unmodified TBB; this is the same as #23620 (probably a duplicate), but on Windows. Normal browsing halted suddenly, and the logs were the same as in comment:36, starting from [notice] {DIR}.
FWIW: Tor had 5 connections to the same addr:9001, but no network activity. Terminating those connections didn't have any effect; Tor didn't start making new connections.

Last edited 13 months ago by cypherpunks

comment:39 Changed 13 months ago by asn

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.2.x-final

comment:40 Changed 13 months ago by asn

We merged #23671 and #23670 to master today, and hopefully those new logs will help us debug this issue better. If you are hunting for this bug, please upgrade to latest master :) Thanks!

comment:41 Changed 13 months ago by mquin

Having updated our hidden service to a recent nightly we're seeing:

Oct 05 06:25:03.000 [notice] Tor 0.3.2.2-alpha-dev (git-55e5446f8dfdb860+e87771b42) opening new log file.
Oct 05 07:31:49.000 [notice] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for 1/2 of our primary entry guards (total microdescriptors: 6622/6679).

and lots of "Giving up launching first hop of circuit"

Existing connections to the hidden service are maintained but new ones time out

comment:42 Changed 13 months ago by dgoulet

My v3 onion service got stuck for a while because of this bug. Attached are the info logs I have, which could be useful.

Changed 13 months ago by dgoulet

Attachment: info.log.tgz added

comment:43 Changed 13 months ago by mrphs

Cc: bmeson mrphs added

Changed 13 months ago by bmeson

Attachment: OnionsTimeouts.png added

comment:44 Changed 13 months ago by bmeson

Thanks @mrphs for helping us debug this further. This bug is biting us pretty badly, both because we depend on Authenticated Hidden Services for administering servers and because we use regular Hidden Services to serve content. It may not be related, but according to Metrics it looks like Hidden Services are timing out more, so I added the graph to this ticket anyway. For now, we are restarting Tor and hoping it comes back quickly.

comment:45 Changed 13 months ago by dgoulet

Priority: High → Immediate
Severity: Normal → Blocker

comment:46 Changed 13 months ago by asn

Announcement to people who encounter this bug: If this destroys your Tor experience, please downgrade to Tor-0.2.9 for now. We are working on this issue and hope to have fixes ASAP. Thanks :)

comment:47 Changed 13 months ago by asn

Note: Based on a few logs I've been looking at (including info.log.tgz from this ticket), this We're missing descriptors for some of our primary entry guards event seems to start occurring after the download of a new consensus:

Oct 10 03:23:07.000 [info] handle_response_fetch_consensus(): Received consensus directory (body size 2009950) from server '178.62.60.37:443'
Oct 10 03:23:08.000 [info] A consensus needs 5 good signatures from recognized authorities for us to accept it. This one has 8 (dannenberg tor26 longclaw maatuska moria1 dizum gabelmoo Faravahar).
...
Oct 10 03:23:09.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Oct 10 03:23:09.000 [notice] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards

Changed 13 months ago by dgoulet

Attachment: info.log.2.tgz added

comment:48 Changed 13 months ago by asn

Also, if anyone has info logs on this issue, please attach them here. It's an edge case, so it's not easy to reproduce, and logs are useful!

comment:49 Changed 13 months ago by s7r

In order to reproduce the case where the hidden service becomes unavailable and does not recover:

  • have a relay with Guard flag that you control, take its fingerprint;
  • edit the state file of the Tor client instance where the hidden service is hosted and change the confirmed_idx=0 fingerprint with the fingerprint of your own Guard;
  • shut down the guard relay temporarily (do not touch its fingerprint / identity);
  • wait for the valid descriptor for this guard that is in use on the Tor client instance to expire;

Changed 13 months ago by bmeson

Attachment: brokenHSindebug.tar.gz added

comment:50 Changed 13 months ago by bmeson

I am uploading (brokenHSindebug.tar.gz) a few levels of debug output from what we have been analyzing. Hopefully this will give you a good level of debug logging; I think the torinfo one has the best level of detail. I also attempted to reproduce this on a local install and noticed a weird message in the logs: "tor_mmap_file(): Could not open "/var/lib/tor/cached-extrainfo" for mmap(): no such file". So I did a hard reinstall of Tor from the repositories (tor version 0.3.1.7~trusty) and added an additional tor-reinstall debug log (I was also unable to access those hidden services). Not sure if that's helpful, but the logs are there.

comment:51 Changed 12 months ago by asn

Notes: Related tickets #23862, #23863, #23817.

comment:52 Changed 12 months ago by asn

Sponsor: SponsorU → SponsorV

comment:53 Changed 12 months ago by dgoulet

Last night, my service stalled, probably due to this bug. Attached are the info logs.

I confirmed that the client and the service were looking for the same blinded key on the same hashring, but the descriptor couldn't be found, so somehow the service had stopped uploading.

Changed 12 months ago by dgoulet

Attachment: info.log.3.tgz added

comment:54 Changed 12 months ago by s7r

Tested this again, on the latest git master 0.3.3.0-alpha-dev (git-332d6bf9010fff1c). Things are a lot better:

  1. If I manually edit the state file and substitute the guard with confirmed_idx=0 with a fingerprint that is not even in the current consensus, Tor will mark it as unlisted_since and proceed normally, moving to the guard with confirmed_idx=1 and successfully building circuits, without any complaints in the log file. This is the expected behavior, and it works in both scenarios:

a) when DataDirectory folder contains current consensus and microdescs;
b) when DataDirectory folder only contains the state file (cached consensus and microdescs deleted on purpose)

  2. If I manually edit the state file and substitute all 3 confirmed guards with fingerprints that do not exist, Tor will mark them as unlisted_since and move on to other listed guards that are available, without any complaints in the log file. This also works as expected in both scenarios described above.

In previous tests, this was enough to get into the state where it never stops complaining about missing descriptors for some of our primary guards and being unable to build circuits.

Now I need to find a way to somehow test with a guard that appears in the consensus but for which a valid descriptor can't be found.

comment:55 Changed 11 months ago by cypherpunks

Cc: cypherpunks added

comment:56 Changed 11 months ago by cypherpunks

FWIW I still encounter this with Tor 0.3.2.6-a

comment:57 Changed 11 months ago by teor

But how long is your Tor in that state?
Some of the changes in the alpha should make it harder to get stuck in that state, and faster to get out.

comment:58 Changed 11 months ago by cypherpunks

Holy shit! It is much worse in the latest stable (0.3.1.9)! It got stuck right after the restart of the browser! The first circuit is working (to allow me to post this comment :)), but any other URL times out. Waiting 30 minutes is not enough; all attempts to connect end with Tor NOTICE: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.

comment:59 Changed 11 months ago by teor

It looks like you've encountered #23620 - Tor lies about "Optimistically trying directory fetches again".
We think it's a symptom of some of the other issues listed here.

None of the fixes for this issue have been backported to 0.3.1.9 yet. So I'm not surprised you're seeing this. It can happen by chance any time you're running Tor Browser.

comment:60 in reply to:  57 Changed 10 months ago by cypherpunks

Replying to teor:

But how long is your Tor in that state?

More than 10 minutes, in which case I just have to restart the browser.

comment:61 Changed 10 months ago by asn

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final

A lot of work has been done here as part of 0.3.2. Let's move it to 0.3.3 for the remaining work.

comment:62 Changed 10 months ago by cstest

Jan 06 16:22:12.000 [notice] Received reload signal (hup). Reloading config and resetting internal state.
Jan 06 16:22:12.000 [notice] Read configuration file "/etc/tor/tor2.cfg".
Jan 06 16:22:12.000 [notice] Tor 0.3.1.9 (git-df96a13e9155c7bf) opening log file.
Jan 06 16:26:09.000 [notice] Received reload signal (hup). Reloading config and resetting internal state.
Jan 06 16:26:09.000 [notice] Read configuration file "/etc/tor/tor2.cfg".
Jan 06 16:26:09.000 [notice] Tor 0.3.1.9 (git-df96a13e9155c7bf) opening log file.
Jan 06 16:28:01.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Jan 06 16:28:01.000 [notice] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Jan 06 16:28:01.000 [warn] Requested exit point '$D885444EED4522249A1BD29A250E612889C451F8' is not known. Closing.
Jan 06 16:28:11.000 [warn] Giving up launching first hop of circuit to rendezvous point ..........
Jan 06 16:28:11.000 [notice] Received reload signal (hup). Reloading config and resetting internal state.
Jan 06 16:28:11.000 [notice] Read configuration file "/etc/tor/tor2.cfg".
Jan 06 16:28:12.000 [notice] Tor 0.3.1.9 (git-df96a13e9155c7bf) opening log file.
Jan 06 16:28:12.000 [warn] Giving up launching first hop of circuit to rendezvous point ............
Jan 06 16:28:12.000 [warn] Giving up launching first hop of circuit to rendezvous point ............
Jan 06 16:28:12.000 [warn] Giving up launching first hop of circuit to rendezvous point ............
........

Waited for an hour for Tor to repair itself, but it just kept throwing the 'giving up' warning. The server was running 250 domains, not DDoSed, with up to 200 users connected.

The first time this problem occurred was a couple of days ago, on a Tor instance running only 1 domain, but that one had been online for days.

comment:63 Changed 10 months ago by teor

None of the fixes for this issue have been backported to 0.3.1.9 yet. So I'm not surprised you're seeing this.
Please upgrade to 0.3.2 if you need these fixes urgently. (There should be a stable 0.3.2 release early next week.)

comment:64 Changed 9 months ago by asn

FWIW, my experience with my hsv3 shows that the situation here has improved a lot!
It usually takes just 1 second to completely recover from a decay of mds and a We're missing descriptors for 1/2 of our primary entry guards situation (this happens naturally every 6-7 days or so).

comment:65 in reply to:  57 ; Changed 9 months ago by cypherpunks

Replying to teor:

But how long is your Tor in that state?
Some of the changes in the alpha should make it harder to get stuck in that state, and faster to get out.

As you can see in comment:58, it is a state of some circuits, not of Tor as a whole. And unfortunately, Tor 0.3.2.9 sometimes can't get out of it after hibernation, so web apps can't be resumed (they time out). The log shows only a lot of

Tor NOTICE: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)

And also it showed

Tor NOTICE: Tor has not observed any network activity for the past 121 seconds. Disabling circuit build timeout recording. 
Tor NOTICE: Tor now sees network activity. Restoring circuit build timeout recording. Network was down for 296 seconds during 19 circuit attempts. 

without any reason (there were no network problems, unless it meant the Tor network).

comment:66 Changed 9 months ago by asn

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final

comment:67 in reply to:  65 ; Changed 9 months ago by cypherpunks

Replying to cypherpunks:

As you can see in comment:58, it is a state of some circuits, not of Tor as a whole. And unfortunately, Tor 0.3.2.9 sometimes can't get out of it after hibernation, so web apps can't be resumed (they time out). The log shows only a lot of

Tor NOTICE: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)

And also it showed

Tor NOTICE: Tor has not observed any network activity for the past 121 seconds. Disabling circuit build timeout recording. 
Tor NOTICE: Tor now sees network activity. Restoring circuit build timeout recording. Network was down for 296 seconds during 19 circuit attempts. 

without any reason (there were no network problems, unless it meant the Tor network).

I confirm; I have Ricochet running with Tor 0.3.3.1-a and it's unable to bootstrap back after hibernation, even after leaving it there for 6 hours.

comment:68 in reply to:  67 Changed 9 months ago by cypherpunks

Replying to cypherpunks:

I confirm; I have Ricochet running with Tor 0.3.3.1-a and it's unable to bootstrap back after hibernation, even after leaving it there for 6 hours.

The situation is so bad that if I don't quit and restart Ricochet then it's always stuck.

comment:69 Changed 9 months ago by asn

Hello, do you guys have any info logs for these issues by any chance? That would be really useful!

comment:70 in reply to:  67 Changed 9 months ago by arma

Replying to cypherpunks:

I have Ricochet running with Tor 0.3.3.1-a and it's unable to bootstrap back after hibernation, even after leaving it there for 6 hours.

Well, Ricochet has its own bug for that situation:
https://github.com/ricochet-im/ricochet/issues/245

So we should be careful not to confuse these two issues.

comment:71 in reply to:  69 ; Changed 9 months ago by cypherpunks

Replying to asn:

Hello, do you guys have any info logs for these issues by any chance? That would be really useful!

[notice] Your system clock just jumped 3902 seconds forward; assuming established circuits no longer work.
[notice] Tried for 3915 seconds to get a connection to [scrubbed]:9878. Giving up. (waiting for rendezvous desc)
[notice] Tried for 3915 seconds to get a connection to [scrubbed]:9878. Giving up. (waiting for rendezvous desc)
[notice] Delaying directory fetches: No running bridges

comment:72 in reply to:  71 ; Changed 9 months ago by asn

Replying to cypherpunks:

Replying to asn:

Hello, do you guys have any info logs for these issues by any chance? That would be really useful!

[notice] Your system clock just jumped 3902 seconds forward; assuming established circuits no longer work.
[notice] Tried for 3915 seconds to get a connection to [scrubbed]:9878. Giving up. (waiting for rendezvous desc)
[notice] Tried for 3915 seconds to get a connection to [scrubbed]:9878. Giving up. (waiting for rendezvous desc)
[notice] Delaying directory fetches: No running bridges

Thanks for the logs. Unfortunately, they are not info level which kinda makes it harder.

Some questions so that I understand the situation a bit better (too many cypherpunks in this ticket). Did you suspend your box for like an hour (3902 seconds?). Also, are you using bridges? I guess they were up at that time, and Tor was confused? Is this a ricochet setup?

comment:73 in reply to:  72 Changed 8 months ago by cypherpunks

Replying to asn:

Some questions so that I understand the situation a bit better (too many cypherpunks in this ticket).

Just to clarify: the cpunk from comment:65 is a different person with a different setup, I'm the one using Ricochet.

Did you suspend your box for like an hour (3902 seconds?). Also, are you using bridges? I guess they were up at that time, and Tor was confused? Is this a ricochet setup?

Yes to all of your questions. This happens with Tor Browser with bridges as well, but there it gets back to normal after a few seconds. So it seems that arma was right (yet again): it's a problem on the Ricochet side. Sorry then. :(

comment:74 Changed 8 months ago by teor

I use Ricochet on macOS with the embedded 0.2.8 Tor client. I can confirm that Ricochet occasionally fails to recover from a suspend/restore, even using the legacy guard code. (The latest guard code was introduced in 0.3.0.)

comment:75 Changed 8 months ago by cypherpunks

Cc: cypherpunks removed
Keywords: tbb-usability-website tbb-needs added
Milestone: Tor: 0.3.4.x-final → Tor: 0.3.3.x-final
Status: needs_information → new

comment:61 is correct. If you say tor gets out of that state in seconds now, then it should clear its leftovers (stalled circuits/state/etc) right thereafter.

comment:76 Changed 8 months ago by cypherpunks

By the way, today after resuming from hibernation with Tor Browser 7.5 I experienced the same situation (I'm using bridges), where the connection wouldn't resume. It's very rare indeed, but it still happens sometimes.

comment:77 in reply to:  76 Changed 8 months ago by cypherpunks

Replying to cypherpunks:

I'm using bridges

Then reopen #22325 and paste info level logs there.

comment:78 Changed 8 months ago by nickm

Keywords: 033-maybe-must added

Mark some tickets as possibly belonging in 033-must.

comment:79 in reply to:  75 Changed 8 months ago by asn

Replying to cypherpunks:

comment:61 is correct. If you say tor gets out of that state in seconds now, then it should clear its leftovers (stalled circuits/state/etc) right thereafter.

Hello, can you please explain exactly what leftover state you are talking about? I don't see any important state that should be cleared, and hence I'm suggesting we defer any such thing to release 0.3.4, since 0.3.3 is about to get closed.

comment:80 in reply to:  75 Changed 8 months ago by asn

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final

Replying to cypherpunks:

comment:61 is correct. If you say tor gets out of that state in seconds now, then it should clear its leftovers (stalled circuits/state/etc) right thereafter.

I'm moving this back to 0.3.4 because I doubt that cleaning the leftovers is urgent enough for 0.3.3.

Please let us know which leftovers you want cleaned and let's open a new ticket for that. Thanks :)

comment:81 Changed 8 months ago by nickm

Keywords: 033-maybe-must removed

comment:82 Changed 8 months ago by s7r

I have just been hit by this on TBB 7.5. It is using the default guard context, no bridges. I am not sure whether it is this or #23863. It is running but basically useless: it cannot load any web pages, and all other apps using this Tor instance's SocksPort are also disconnected. It has been in this state for about 4 hours; I want to leave it a little longer to see if it recovers. I get this:

2/24/2018 18:20:34 PM.800 [NOTICE] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for 1/2 of our primary entry guards (total microdescriptors: 6051/6077). 
2/24/2018 18:20:34 PM.800 [NOTICE] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for 1/2 of our primary entry guards (total microdescriptors: 6051/6077). 

and even after:

2/24/2018 18:20:35 PM.200 [NOTICE] We now have enough directory information to build circuits.

It is still not working and is not able to load any web page:

2/24/2018 19:03:18 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)

comment:83 Changed 8 months ago by teor

Please don't post the same logs on multiple tickets; it makes it hard to track the responses.

Here are all the related tickets:

When Tor can't download microdescriptors (#21969), maybe it should try authorities or fallbacks (#23863), before it runs out of microdesc retries (#24113). But even after Tor has the microdescs it needs, it sometimes doesn't start building circuits again. Instead, it is in state "waiting for circuit" (#25347).

comment:84 Changed 8 months ago by teor

I think we just need to fix #25347 in 0.3.3.

comment:85 Changed 8 months ago by s7r

I agree, teor, especially because the hang occurs even long after Tor says it has enough directory info to build circuits and even claims to have opened one:

2/24/2018 19:47:29 PM.300 [NOTICE] Tor has successfully opened a circuit. Looks like client functionality is working. 
2/24/2018 20:50:07 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit) 

Not even a full restart will make it heal. I'll post logs and details in #25347

comment:86 Changed 7 months ago by nickm

Keywords: 034-triage-20180328 added

comment:87 Changed 7 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:88 Changed 7 months ago by nickm

Milestone: Tor: 0.3.4.x-final → Tor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:89 Changed 5 months ago by arma

Is this bug actually still open? Or did we fix it?

comment:90 Changed 5 months ago by teor

Priority: Immediate → Medium
Severity: Blocker → Normal

The urgent parts have been fixed. But we still have some work to do.
