Opened 5 months ago

Last modified 2 months ago

#21969 assigned defect

We're missing descriptors for some of our primary entry guards

Reported by: asn
Owned by: asn
Priority: High
Milestone: Tor: 0.3.1.x-final
Component: Core Tor/Tor
Version: Tor: 0.3.0.6
Severity: Normal
Keywords: tor-guard, tor-bridge, tor-client
Cc: catalyst, isis
Actual Points:
Parent ID:
Points: 1.5
Reviewer:
Sponsor: SponsorU

Description

Seems like there are still a few cases where bootstrapping will stall because of:

I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards 

It seems to happen mainly with bridges. For me it occurred with latest master (b081a7ed2) when I was switching between using private bridges and the default TBB bridges. I also heard reports from s7r about this happening.

We last changed this behavior in #21415, when we made Tor require that we have the descriptors of 2 bridges before we start building circuits.

Child Tickets

Ticket | Status | Owner | Summary | Component
#22325 | closed | | "we're missing descriptors for some of our primary entry guards" while using bridges | Core Tor/Tor

Attachments (3)

tordebug.20170504.log (182.9 KB) - added by catalyst 5 months ago.
21969_primary_guards_descriptors_initial_bootstrap.txt (7.1 KB) - added by s7r 4 months ago.
21969_first_primary_guard_descriptor_inexistent.txt (243.4 KB) - added by s7r 4 months ago.


Change History (39)

comment:1 Changed 5 months ago by s7r

Right. I am getting it every 2 days or so. The server is on a 1 Gbps link at an ISP datacenter, so a slow network connection is out of the question.

I am running 0.3.1.0-alpha-dev (git-b1c7e5d8c0a648b1+f7afd80) with default guard context (no bridges).

This Tor instance hosts a hidden service, and when this happens it can no longer build new rend circuits. However, let's think about it from another perspective: maybe the bug is actually fixed in terms of the logic of requested descriptors, but there's a race condition in how they are processed after being downloaded/refreshed. I am saying this because, based on the timestamps, at exactly the same time or immediately after (~1 second) it complains that it is missing descriptors for some of our primary entry guards, it comes back to reality and says We now have enough directory information to build circuits. And after more time (minutes; is this normal?) it confirms that it successfully opened a circuit. This is logged twice, every time, with the very same timestamp (is this also normal?)

Check timestamps in these last events:

Apr 12 04:26:10 vm16 Tor[978]: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Apr 12 04:26:10 vm16 Tor[978]: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Apr 12 04:26:10 vm16 Tor[978]: We now have enough directory information to build circuits.
Apr 12 04:27:19 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 12 04:27:19 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 17 12:04:16 vm16 Tor[978]: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Apr 17 12:04:16 vm16 Tor[978]: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Apr 17 12:04:16 vm16 Tor[978]: We now have enough directory information to build circuits.
Apr 17 12:05:24 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 17 12:05:24 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 19 05:38:18 vm16 Tor[978]: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Apr 19 05:38:18 vm16 Tor[978]: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Apr 19 05:38:18 vm16 Tor[978]: We now have enough directory information to build circuits.
Apr 19 05:42:12 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 19 05:42:12 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 21 00:48:20 vm16 Tor[978]: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Apr 21 00:48:20 vm16 Tor[978]: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Apr 21 00:48:21 vm16 Tor[978]: We now have enough directory information to build circuits.
Apr 21 00:49:26 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.
Apr 21 00:49:26 vm16 Tor[978]: Tor has successfully opened a circuit. Looks like client functionality is working.

asn: should we allow a time period for Tor to process new descriptors upon refresh? A few seconds, so slower computers or computers on slower connections don't get these warnings? During that time, previous descriptors can be used. This also requires ensuring we try to refresh descriptors for guards N minutes before the ones we already have expire, so we have this time period to handle it.

comment:2 Changed 5 months ago by catalyst

Cc: catalyst added

comment:3 Changed 5 months ago by alecmuffett

Clipped from my comments on tor-dev:

I am running a cluster of 12 "worker" tor daemons (0.3.0.6) across 6 machines, under Onionbalance.

So far, 2x of the worker-onion daemons have gone "stale" (not publishing descriptors, not reachable from the tor network) even though their Log.NOTICE-level logfiles show nothing weird. I've sent SIGUSR2 to one of them to enable more verbose logging.

The only notable message is: Diagnostic for issue 8387: Found 2 one-hop circuits more than 1800 seconds old! Logging 2... - but all of the workers are doing that occasionally, whereas only 2x of them are stale.

Also: in each case where there is a stale tor-daemon on a machine, the other daemon, in a separate directory, is not stale - so, it can't be a connectivity issue.

I did egrep '(for.some.of.our|now.have.enough)' */tor.log on the fleet of workers; it turns out that all 12x of the worker daemons have at various points issued the following messages:

Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards

...and...

I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards.

HOWEVER the 2x STALE daemons have not issued a subsequent:

We now have enough directory information to build circuits

...so it appears that they, in particular, have given up trying to fetch directory information - a state which has persisted for several days now.

Sample tor.conf follows

# -*- conf -*-
# eotk (c) 2017 Alec Muffett

# template note: here we use TOR_DIR not PROJECT_DIR because of the
# relocation of Tor directories under `softmap`
DataDirectory /home/pi/eotk/projects.d/projname.d/hs-2.d
ControlPort unix:/home/pi/eotk/projects.d/projname.d/hs-2.d/tor-control.sock
PidFile /home/pi/eotk/projects.d/projname.d/hs-2.d/tor.pid
Log notice file /home/pi/eotk/projects.d/projname.d/hs-2.d/tor.log
SafeLogging 1
HeartbeatPeriod 60 minutes
LongLivedPorts 80,443
RunAsDaemon 1

# use single onions
SocksPort 0 # have to disable this for single onions
HiddenServiceSingleHopMode 1 # yep, i want single onions
HiddenServiceNonAnonymousMode 1 # yes, really, honest, i swear

# softmap
HiddenServiceDir /home/pi/eotk/projects.d/projname.d/hs-2.d
HiddenServicePort 80 unix:/home/pi/eotk/projects.d/projname.d/hs-2.d/port-80.sock
HiddenServicePort 443 unix:/home/pi/eotk/projects.d/projname.d/hs-2.d/port-443.sock
HiddenServiceNumIntroductionPoints 3

comment:4 Changed 5 months ago by nickm

Priority: Medium → High

comment:5 Changed 5 months ago by asn

We need to do something smart about this problem...

I was going to suggest we just drop the non-fatal assert of #21415 and only check for the descriptor of the first primary guard (instead of the first two), but I actually feel I don't understand the issue well enough.

Perhaps some log messages that show the names of the guards that are missing descriptors would be helpful. Also, I'm wondering why Alec's tor is not making any effort to fetch the missing descriptors, even over multiple hours. Maybe we should also log our attempts to fetch the missing descriptors?

Another theory: Since primary guards come from CONFIRMED_GUARDS, there is no guarantee that they are currently up and running. Could it be that Alec's primary guard was offline and off the consensus at that time, so Alec's tor couldn't even fetch the needed descriptor and proceed?

Changed 5 months ago by catalyst

Attachment: tordebug.20170504.log added

comment:6 Changed 5 months ago by catalyst

I have noticed a few problems that I think are likely related.

I attached a log of a failed bootstrap with obfs4 (Tor Browser 7.0a3 with the default obfs4 bridge configuration). It appears to have gotten stuck because none of the selected guards was reachable when trying to download the consensus. (The guards look like they are selected prior to determining their reachability, which I vaguely recall is a countermeasure against some kinds of attacks, but it is potentially a usability problem during bootstrap.)

Note that two of the guards have the same fingerprint (A832D176ECD5C7C6B58825AE22FC4C90FA249637), and seem to be multiple ports on the same IP according to the TBB-generated torrc. Maybe we should prevent this during guard selection?

comment:7 Changed 5 months ago by isis

Cc: isis added
Keywords: tor-bridge added

comment:8 Changed 5 months ago by alecmuffett

The log messages in my case are:

May 03 08:58:45.000 [notice] Heartbeat: Tor's uptime is 12:00 hours, with 6 circuits open. I've sent 2.28 MB and received 9.59 MB.
May 03 09:21:47.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
May 03 09:21:47.000 [notice] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
May 03 09:21:48.000 [notice] We now have enough directory information to build circuits.
May 03 09:58:45.000 [notice] Heartbeat: Tor's uptime is 13:00 hours, with 7 circuits open. I've sent 2.44 MB and received 10.39 MB.
May 03 10:11:48.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
May 03 10:11:48.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
May 03 10:58:45.000 [notice] Heartbeat: Tor's uptime is 14:00 hours, with 6 circuits open. I've sent 2.54 MB and received 10.49 MB.

...and then later...

May 04 17:25:50.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
May 04 17:25:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
May 04 17:58:48.000 [notice] Heartbeat: Tor's uptime is 1 day 21:00 hours, with 3 circuits open. I've sent 6.51 MB and received 25.41 MB.

...with no subsequent We now have enough...

At the risk of sounding facile, what about replacing the some of our primary message with an N-out-of-M fraction?

comment:9 in reply to: 6 Changed 5 months ago by teor

Replying to catalyst:

...
I attached a log of a failed bootstrap with obfs4 (Tor Browser 7.0a3 with the default obfs4 bridge configuration). It appears to have gotten stuck because none of the selected guards was reachable when trying to download the consensus.

IMO, the right fix here is to fall back to a fallback directory mirror. It's what they are there for.

...
Note that two of the guards have the same fingerprint (A832D176ECD5C7C6B58825AE22FC4C90FA249637), and seem to be multiple ports on the same IP according to the TBB-generated torrc. Maybe we should prevent this during guard selection?

Unless, of course, you are using a restricted guard set or bridges, where we should... just try another guard?

Also, maybe we want tor to try an IPv4 and an IPv6 address for your bridge if you only have one dual-stack bridge?

comment:10 in reply to: 9 Changed 5 months ago by catalyst

Replying to teor:

IMO, the right fix here is to fall back to a fallback directory mirror. It's what they are there for.

I'm not sure I understand what this means. If a working bridge is a directory mirror, why is that not sufficient? Or maybe all of the (non-guard) working bridges fail to be directory mirrors?

comment:11 in reply to: 10 Changed 5 months ago by teor

Replying to catalyst:

Replying to teor:

IMO, the right fix here is to fall back to a fallback directory mirror. It's what they are there for.

This comment was about unrestricted entry guard settings, not bridges.

I'm not sure I understand what this means. If a working bridge is a directory mirror, why is that not sufficient? Or maybe all of the (non-guard) working bridges fail to be directory mirrors?

All bridges are directory mirrors.

But there will always be a tradeoff when all our primary guards aren't working between:

  • fail (and be more secure), and
  • try another guard/bridge/fallback/directory authority (and be more usable).

Looks like we set the bar too low. Or there is another bug.

comment:12 in reply to:  11 Changed 5 months ago by catalyst

Replying to teor:

All bridges are directory mirrors.

But there will always be a tradeoff when all our primary guards aren't working between:

  • fail (and be more secure), and
  • try another guard/bridge/fallback/directory authority (and be more usable).

Looks like we set the bar too low. Or there is another bug.

I think maybe we should be more lenient about trying other guards when we have no (or very limited) reachability information (e.g., during bootstrapping from zero state) than when we have fresh reachability information, especially when bridges are involved.

comment:13 Changed 4 months ago by s7r

asn: you could be right with the second theory. Maybe the descriptors that need to be fetched are not even available, because the guard is not listed as running in the current consensus. This is based on a log analysis I just did:

Over a period of ~1 hour and 15 minutes (20:40 - 21:55), it starts like this (this instance is using the default Guard context, no bridges):

May 11 20:40:54 electrum Tor[7669]: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
May 11 20:40:54 electrum Tor[7669]: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards

Immediately after, over 500 messages like this one (with 1-3 seconds time between them):

May 11 20:41:01 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:04 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:07 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:10 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:44 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:47 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:49 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 20:41:51 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.

[...] and so on until:

May 11 21:54:55 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 21:54:57 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 21:54:59 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 21:55:04 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 21:55:05 electrum Tor[7669]: Giving up launching first hop of circuit to rendezvous point [scrubbed] for service yzgq5m227nhsi33j.
May 11 21:55:53 electrum Tor[7669]: We now have enough directory information to build circuits.
May 11 21:55:54 electrum Tor[7669]: Tor has successfully opened a circuit. Looks like client functionality is working.
May 11 21:55:54 electrum Tor[7669]: Tor has successfully opened a circuit. Looks like client functionality is working.

And then it stopped complaining about failing the first hop of a rendezvous circuit. The first hop is obviously the guard.

There was a clear cutoff of over 1 hour in this case, just because one of the descriptors was missing. I don't know whether the first primary guard was listed as running in that consensus or not.

To make an informed decision, let's log some more information in a custom branch:

  • all primary guards in their order (first, second, etc)
  • the primary guard(s) for which we are missing descriptors only
  • the status of all primary guards in the current consensus (running / not running)
  • attempts to fetch the missing descriptors and status of this operation (success / failure).

Maybe something else?

comment:14 Changed 4 months ago by arma

Nick asked if we could defer this one to 0.3.2, and based on my current understanding of the ticket, my answer was "no, this looks like a serious bug that needs attention".

comment:15 in reply to:  13 Changed 4 months ago by asn

Replying to s7r:

asn: you could be right with the second theory. Maybe the descriptors that need to be fetched are not even available, because the guard is not listed as running in the current consensus. This is based on a log analysis I just did:

For the record, the theory is that since the primary guards are just an ordered subset of guards that have been used in the past, there is no guarantee that they are currently online or in the consensus. Hence, this can occur naturally if your primary guards or bridges are offline.

I'm not sure if this is the only possible way that our logic can introduce big bootstrapping delays, but it seems like a plausible one.

Perhaps to address this for the non-bridge case, we should count a primary guard's descriptor as missing (for the purposes of guard_selection_have_enough_dir_info_to_build_circuits()) only if that guard is present in the consensus. For the bridge case, it might make sense to relax or disable our dir info requirements completely.
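A minimal, self-contained sketch of that counting rule follows. The struct and function names are invented stand-ins, not Tor's actual guard code: a primary guard with no descriptor only counts toward the "can't build circuits" blocker if it is actually listed in the current consensus.

```c
/* Hypothetical sketch of the rule above; illustrative types only,
 * not Tor's real data structures. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
  bool in_consensus;    /* is this guard listed in the current consensus? */
  bool have_descriptor; /* do we already have its descriptor? */
} sketch_guard_t;

/* Count how many primary guards should block circuit building: only
 * guards that are listed in the consensus but whose descriptor we
 * still lack. An unlisted guard may never get a descriptor fetched,
 * so it is not counted. */
static size_t
count_blocking_missing_descriptors(const sketch_guard_t *primary, size_t n)
{
  size_t missing = 0;
  for (size_t i = 0; i < n; i++) {
    if (primary[i].in_consensus && !primary[i].have_descriptor)
      missing++;
  }
  return missing;
}
```

With this rule, a primary guard that has dropped out of the consensus stops stalling bootstrap, which matches the theory above about offline primary guards.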

comment:16 Changed 4 months ago by nickm

Owner: set to asn
Status: new → assigned

Reassigning with permission; thanks!

comment:17 in reply to:  13 Changed 4 months ago by asn

Replying to s7r:

To make an informed decision, let's log some more information in a custom branch:

  • all primary guards in their order (first, second, etc)
  • the primary guard(s) for which we are missing descriptors only
  • the status of all primary guards in the current consensus (running / not running)
  • attempts to fetch the missing descriptors and status of this operation (success / failure).

Please check my branch bug21969_debug which adds the logs you asked for and will help debug the non-bridge case!

The branch does all the above items except the last one (log guard desc fetches) because that was pretty hard to do properly, and would require non-trivial hacking time. I think the current logging branch has enough info to help us understand if the theory from comment:15 is what's causing trouble in the non-bridge case.

comment:18 Changed 4 months ago by s7r

OK, switching to that branch on some machines as of now; will get back with comments.

This will help us understand the problem from comment:13, where the cutoff time is > ~1 hour (one consensus period). What do you think we should do about the problem described in comment:1, where it complains for only a few seconds (much, much less than an entire consensus period)? I think it might be because:

  • we have many relays in the consensus (over 7000), the number is growing, and it takes more and more time for clients to download and process descriptors, therefore having a few seconds of lag here
  • we have no feature to download the descriptors we are interested in (primary guards, when/if they are available) with priority; maybe even try simultaneous downloads from all 3 DirGuards at once and close the remaining connections once one finishes.
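The second bullet, prioritized descriptor downloads, could look roughly like this: when queueing descriptor fetches, sort the queue so that primary-guard descriptors are requested first. This is an illustrative sketch with invented types, not Tor's actual download scheduler.

```c
/* Illustrative sketch: order pending descriptor fetches so that
 * primary-guard descriptors are requested first. Types are made up. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
  int relay_id;          /* which relay's descriptor we want */
  bool is_primary_guard; /* is this one of our primary guards? */
} pending_fetch_t;

/* qsort comparator: primary-guard fetches sort before the rest. */
static int
primary_guards_first(const void *a, const void *b)
{
  const pending_fetch_t *fa = a;
  const pending_fetch_t *fb = b;
  return (int)fb->is_primary_guard - (int)fa->is_primary_guard;
}

static void
prioritize_fetches(pending_fetch_t *fetches, size_t n)
{
  qsort(fetches, n, sizeof(*fetches), primary_guards_first);
}
```

Note that qsort is not stable, so the order among non-primary fetches is unspecified; only the primary-first property matters here.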

comment:19 Changed 4 months ago by s7r

Thinking some more about this, maybe we can forget about priority descriptor downloads, because in the longer term we won't need them and they will just complicate things for us. If I recall correctly (please confirm), at the moment we are trying to make all relays directory servers using BEGIN_DIR on the ORPort, which means Guard == DirGuard, so clients can, first of all, download the guard descriptor directly from the guard itself, and this whole dance is avoided.

comment:20 Changed 4 months ago by s7r

OK asn, cooking with your branch bug21969_debug. I have attached the initial bootstrap log; as you can see, at 62% bootstrapped it has the descriptors for primary guards 1 and 3 but not for 2. Also, the descriptor for the first primary guard wasn't downloaded in the first ~40% of bootstrap: no priority logic.

Let's give it some time and I will let you know what it says when there's that cutoff.

comment:21 Changed 4 months ago by alecmuffett

I'm back after some time away from this; I can confirm the situation never heals. From May 6th, when one daemon started showing the issue, to May 17th, it is still complaining that I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards.

Can I be of assistance with testing ideas?

comment:22 in reply to: 19 Changed 4 months ago by teor

Replying to s7r:

Thinking some more about this, maybe we can forget about priority descriptor downloads, because in the longer term we won't need them and they will just complicate things for us. If I recall correctly (please confirm), at the moment we are trying to make all relays directory servers using BEGIN_DIR on the ORPort, which means Guard == DirGuard, so clients can, first of all, download the guard descriptor directly from the guard itself, and this whole dance is avoided.

Operators can explicitly disable DirCache, and can also disable it by setting various other options (like AccountingMax), or by having low RAM or bandwidth. Also, DirCache was only introduced in 0.2.8, and we support relays back to 0.2.4.

So while it is true that most guards are DirCaches, not all guards will be, even in the future.

Also, this might enable an attack/issue where a guard posts one descriptor to the directory authorities, and another to its clients. (This is avoided by using microdescriptors, because their hashes are in the consensus.)

comment:23 in reply to: 22 Changed 4 months ago by s7r

Replying to teor:

Operators can explicitly disable DirCache, and can also disable it by setting various other options (like AccountingMax), or by having low RAM or bandwidth. Also, DirCache was only introduced in 0.2.8, and we support relays back to 0.2.4.

So while it is true that most guards are DirCaches, not all guards will be, even in the future.

Also, this might enable an attack/issue where a guard posts one descriptor to the directory authorities, and another to its clients. (This is avoided by using microdescriptors, because their hashes are in the consensus.)

Thanks for the feedback, teor. In this case we need to think about logic where, first of all, a client fetches the descriptors of the sampled guards (primary first, then down the list of all the guards we have ever connected to, until we have their descriptors). After that, we download/refresh the rest of the few thousand descriptors.

comment:24 in reply to:  23 Changed 4 months ago by asn

Replying to s7r:

Replying to teor:

Operators can explicitly disable DirCache, and can also disable it by setting various other options (like AccountingMax), or by having low RAM or bandwidth. Also, DirCache was only introduced in 0.2.8, and we support relays back to 0.2.4.

So while it is true that most guards are DirCaches, not all guards will be, even in the future.

Also, this might enable an attack/issue where a guard posts one descriptor to the directory authorities, and another to its clients. (This is avoided by using microdescriptors, because their hashes are in the consensus.)

Thanks for the feedback, teor. In this case we need to think about logic where, first of all, a client fetches the descriptors of the sampled guards (primary first, then down the list of all the guards we have ever connected to, until we have their descriptors). After that, we download/refresh the rest of the few thousand descriptors.

Personally, I'm not so concerned about the case where we fetch the primary guard descriptors in random order and it sometimes takes 1-3 seconds more to get a descriptor (as in your logs attachment:21969_primary_guards_descriptors_initial_bootstrap.txt).

I'm more concerned about pathological cases like comment:15 that significantly delay bootstrapping. That can happen in the bridge case if primary bridges are down, or in the non-bridge case if primary guards are down.

Here is a suggested hotfix for the above pathological case of guard_selection_have_enough_dir_info_to_build_circuits():

If we are inspecting a non-bridge primary guard that is missing 
from the current consensus, don't count it as missing a descriptor
for the purposes of that function.

If we are inspecting a primary bridge that's missing a descriptor
for more than N=7 seconds, don't count it as missing a descriptor
for the purposes of that function.

The idea here is that if a primary guard is missing from the consensus, we might never get its descriptor. Also, if that primary guard is a bridge and the bridge is offline, we will never get its descriptor either, so don't bother blocking on it.

FWIW, IMO the time-sensitive logic in the second rule is kinda flaky and might delay bootstrapping. Here is a saner alternative, but I think it will be a bigger hassle to implement:

If we are inspecting a primary bridge that's missing a descriptor
and we have failed to connect to that bridge, don't count it as
missing a descriptor for the purposes of that function.

Also, we should strongly consider engraving these rules (and the general bootstrapping blocking logic) into torspec.
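Putting the two suggested rules together, the check might look roughly like the sketch below. The function name mirrors the intent of guard_selection_have_enough_dir_info_to_build_circuits() mentioned above, but the types, fields, and the hard-coded N=7 are all hypothetical.

```c
/* Sketch of the two suggested rules; invented types, not Tor's
 * actual implementation. */
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define BRIDGE_DESC_WAIT_SECS 7 /* the N=7 seconds from the suggestion */

typedef struct {
  bool is_bridge;
  bool in_consensus;         /* only meaningful for non-bridge guards */
  bool have_descriptor;
  time_t desc_missing_since; /* when we first noticed the descriptor missing */
} sketch_guard_t;

/* Should this primary guard count as "missing a descriptor" for the
 * purposes of the dir-info check? */
static bool
counts_as_missing(const sketch_guard_t *g, time_t now)
{
  if (g->have_descriptor)
    return false;
  if (!g->is_bridge && !g->in_consensus)
    return false; /* rule 1: an unlisted non-bridge guard doesn't block */
  if (g->is_bridge && now - g->desc_missing_since > BRIDGE_DESC_WAIT_SECS)
    return false; /* rule 2: stop waiting on a silent bridge after N secs */
  return true;
}
```

The time-based second rule is the one the comment above calls flaky; the connection-failure variant would replace the timeout test with a "have we failed to connect to this bridge" flag.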

Last edited 4 months ago by asn

comment:25 Changed 4 months ago by asn

Discussed comment:24 with Nick on IRC. He points out that primary guards need to be listed in the consensus, and hence the top half of the suggested comment:24 fix is no longer valid. Nick suggests that, to fix the bridge case of comment:24, we can check the is_reachable field of the bridge guard, and ignore it if it's not reachable and missing a descriptor.

Still waiting for pathological logs from s7r/alec to learn more.

comment:26 Changed 4 months ago by catalyst

It might be helpful to look at doc/NewGuardAlgorithmTesting in case any information there is still relevant. Maybe do the pre/post comparison of the prop271 effects on bootstrap times, if that isn't already recorded somewhere else.

comment:27 Changed 4 months ago by s7r

OK getting close.

So I have manually edited my state file and added a guard I control as the first primary guard, with confirmed_idx=0. It was running just fine with no complaints. At some point, I configured PublishServerDescriptor 0 in the guard's torrc so that it would stop publishing descriptors. The client instance running the debug branch did not complain instantly, because it had a descriptor that was still valid, and the first primary guard, while no longer publishing descriptors, was still running and accepting connections.

You can see in the log file that after the descriptor expired, it started to panic and never recovered, even after ~3 consensuses. Also, the cutoff did happen: many, many complaints about not being able to do the first hop of a rendezvous circuit (bad).

Here is what the state file says about the primary guard we are talking about at the time of the cutoff:

Guard in=default rsa_id=0C2E555864546F1B3DDC4BE71573B8CC5DD898EE nickname=AmnesicProsecutor sampled_on=2017-05-13T20:25:19 sampled_by=0.3.1.0-alpha-dev listed=1 confirmed_on=2017-05-16T01:45:08 confirmed_idx=0 pb_use_attempts=69.854044 pb_use_successes=69.854044 pb_circ_attempts=234.410156 pb_circ_successes=231.287109 pb_successful_circuits_closed=229.161133 pb_collapsed_circuits=2.125977 pb_timeouts=8.331055

FILTERED_GUARDS requires that IS_LISTED is true; this is apparently not enforced. It will insanely look for the nonexistent descriptor even though the guard is not listed, and also not running according to the debug branch log.

I have also discovered some other things while testing this, not directly related, but I'll open a separate ticket so this one won't get too messy to follow.

Last edited 4 months ago by s7r

comment:28 in reply to:  27 Changed 4 months ago by asn

Replying to s7r:

OK getting close.

So I have manually edited my state file and added a guard I control as the first primary guard, with confirmed_idx=0. It was running just fine with no complaints. At some point, I configured PublishServerDescriptor 0 in the guard's torrc so that it would stop publishing descriptors. The client instance running the debug branch did not complain instantly, because it had a descriptor that was still valid, and the first primary guard, while no longer publishing descriptors, was still running and accepting connections.

Thanks for the logs; however, I'm not sure PublishServerDescriptor 0 is the right methodology to reproduce this bug. Doing that makes your relay do the weird thing of not being in the consensus but still accepting connections. I'm not sure how that behavior influences the guard algorithm exactly, and it might complicate the analysis here. I think the easiest thing to do would be to just shut down your guard relay like normal, and check whether the guard is still in the primary guard list when it exits the consensus.

comment:29 in reply to:  25 Changed 4 months ago by asn

Replying to asn:

Discussed comment:24 with Nick on IRC. He points out that primary guards need to be listed in the consensus, and hence the top half of the suggested comment:24 fix is no longer valid. Nick suggests that, to fix the bridge case of comment:24, we can check the is_reachable field of the bridge guard, and ignore it if it's not reachable and missing a descriptor.

Still waiting for pathological logs from s7r/alec to learn more.

I think the bridges-case approach where we check is_reachable before counting a guard towards the "needs descriptor" block is possible. However, with the current codebase it's not as easy as checking is_reachable at the time of guard_selection_have_enough_dir_info_to_build_circuits().

That's because we don't actually notify the guard subsystem when a bridge descriptor fetch fails. And that's because in launch_direct_bridge_descriptor_fetch() we never set the guard_state on the directory request, so entry_guard_chan_failed() doesn't update the guard state when the failure happens.

Seems like the correct bridges-side fix here involves the following two steps:
1) Add guard_state to bridge descriptor directory request in launch_direct_bridge_descriptor_fetch().
2) Don't count unreachable primary bridges towards the guard_selection_have_enough_dir_info_to_build_circuits() descriptor block limit.

Please see branch bug21969_bridges for the bridge-side fix here. Setting this to needs_review for now.

Last edited 4 months ago by asn (previous) (diff)

comment:30 Changed 4 months ago by asn

Status: assigned → needs_review

comment:31 Changed 4 months ago by nickm

I want to take this in 0.3.1.1-alpha if possible, but it needs a changes file, and it needs to be based on maint-0.3.0 for a possible backport if it works in 0.3.1.1-alpha.

comment:32 Changed 4 months ago by asn

OK, I have an 0.3.0 branch on bug21969_bridges_030 and a fresh force-pushed 0.3.1 branch on bug21969_bridges. Both of them should contain a changes file.

BTW the whole problem ended up being the missing guard state, so I didn't actually have to do part (2) of comment:29, since we already ignore unreachable bridges:

  SMARTLIST_FOREACH_BEGIN(gs->primary_entry_guards, entry_guard_t *, guard) {
    entry_guard_consider_retry(guard);
    if (guard->is_reachable == GUARD_REACHABLE_NO)
      continue;
    n_considered++;
...

comment:33 Changed 4 months ago by nickm

Thank you, and thanks for making it even smaller!

I've made a new child ticket (#22325) for tracking the bridge bug in particular. In that ticket, I've merged this patch and marked it for backport to 0.3.0. I'll leave this one open for the non-bridge-using case.

comment:34 Changed 4 months ago by nickm

Status: needs_review → assigned

Putting back in "assigned" to track the non-bridge-using case.

comment:35 Changed 4 months ago by cypherpunks

Keywords: tor-client added
Version: Tor: 0.3.0.6

Just started my TBB 7.0a4, which I had exited yesterday, and got:

Tor NOTICE: Bootstrapped 100%: Done 
Tor NOTICE: Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards 
Tor NOTICE: I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards 
Tor NOTICE: New control connection opened from 127.0.0.1. 
Tor NOTICE: New control connection opened from 127.0.0.1. 
Tor NOTICE: We now have enough directory information to build circuits. 

It looks like Tor is using old info during bootstrap. (No bridges, FWIW.)

comment:36 Changed 2 months ago by cypherpunks

Same here with system-wide Tor instance on Arch Linux x86_64, not using bridges.
Three hours after bootstrapping and running smoothly, Tor suddenly lost all circuits. I decided to wait and see if it would reconnect by itself, but it didn't. After restarting Tor, everything went back to normal.

Jul 24 10:17:13.122 [notice] Tor 0.3.0.9 (git-22b3bf094e327093) running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.0f and Zlib 1.2.11.
Jul 24 10:17:13.122 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
Jul 24 10:17:13.122 [notice] Read configuration file "/etc/tor/torrc".
Jul 24 10:17:13.125 [notice] Opening Socks listener on 127.0.0.1:9050
Jul 24 10:17:13.000 [notice] {GENERAL} Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Jul 24 10:17:13.000 [notice] {GENERAL} Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Jul 24 10:17:13.000 [notice] {CONTROL} Bootstrapped 0%: Starting
Jul 24 10:17:13.000 [notice] {GUARD} Starting with guard context "default"
Jul 24 10:17:13.000 [notice] {CONTROL} Bootstrapped 80%: Connecting to the Tor network
Jul 24 10:17:14.000 [notice] {CONTROL} Bootstrapped 85%: Finishing handshake with first hop
Jul 24 10:17:14.000 [notice] {CONTROL} Bootstrapped 90%: Establishing a Tor circuit
Jul 24 10:17:14.000 [notice] {GENERAL} Tor has successfully opened a circuit. Looks like client functionality is working.
Jul 24 10:17:14.000 [notice] {CONTROL} Bootstrapped 100%: Done
Jul 24 13:21:14.000 [notice] {DIR} Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for some of our primary entry guards
Jul 24 13:21:14.000 [notice] {DIR} I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Jul 24 13:21:26.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:30.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:30.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:31.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:48.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:48.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:48.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:48.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:21:48.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.

[ca. 200 identical messages deleted]

Jul 24 13:39:30.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:39:30.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:39:34.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:39:34.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:39:58.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:39:58.000 [notice] {DIR,APP} Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
Jul 24 13:40:36.000 [notice] {GENERAL} Interrupt: exiting cleanly.