Opened 3 years ago

Closed 2 years ago

Last modified 2 years ago

#16825 closed defect (fixed)

client with explicit EntryNodes, no cached-* files never finds its entrynodes

Reported by: starlight Owned by:
Priority: Medium Milestone: Tor: 0.2.8.x-final
Component: Core Tor/Tor Version: Tor: 0.2.6.10
Severity: Normal Keywords: guards, 028-triage
Cc: Actual Points:
Parent ID: Points: medium
Reviewer: Sponsor: SponsorU

Description

Client-only 'tor' with explicit guard
nodes configured will stall forever at
Bootstrapped 80% or 85% on cold-start
with cached-* files removed.

Shutdown and fresh start of daemon
required to clear state.

Child Tickets

Attachments (1)

logfile_scrub.txt (269.3 KB) - added by starlight 3 years ago.
[info] level log showing problem

Download all attachments as: .zip

Change History (18)

Changed 3 years ago by starlight

Attachment: logfile_scrub.txt added

[info] level log showing problem

comment:1 Changed 3 years ago by nickm

Keywords: SponsorU guards added
Milestone: Tor: 0.2.7.x-final

sounds like a bug; should look at this for 0.2.7

comment:2 Changed 3 years ago by starlight

more detail:

1) state file must also be deleted, i.e. total from-scratch cold start

2) bug does not occur with UseMicrodescriptors 0

3) example torrc that consistently produces issue in case
some parameter nuance is at work

AvoidDiskWrites 1
PidFile /home/user/tor_test/tor.pid
DataDirectory /home/user/tor_test
GeoIPFile ./geoip
Log notice file ./logfile
LogTimeGranularity 1
RunAsDaemon 1
NumCPUs 1
#UseMicrodescriptors 0
MaxClientCircuitsPending 128
NumEntryGuards 2
GuardLifetime 47335104
EntryNodes <A>,<B>
SocksPort 10.10.10.10:9150
ControlPort 10.10.10.10:9151
HashedControlPassword xxxx
SocksPolicy accept 10.10.10.11:*
SocksPolicy reject *:*

comment:3 Changed 3 years ago by nickm

Keywords: PostFreeze027 added

I'd merge patches for these for 0.2.7 if they come in on time. In some cases, that will require figuring out an as-yet-unsolved bugs.

comment:4 Changed 3 years ago by nickm

Keywords: PostFreeze027 removed
Milestone: Tor: 0.2.7.x-finalTor: 0.2.8.x-final

Moving these tickets into 0.2.8. Not expecting to take patches for any into 0.2.7 at this late date. As usual, please say something if you disagree! :)

comment:5 Changed 3 years ago by nickm

Keywords: 028-triage added

comment:6 Changed 3 years ago by nickm

Keywords: SponsorU removed
Sponsor: SponsorU

Bulk-replace SponsorU keyword with SponsorU field.

comment:7 Changed 3 years ago by nickm

Points: medium

comment:8 Changed 2 years ago by nickm

Milestone: Tor: 0.2.8.x-finalTor: 0.2.9.x-final

Throw most 0.2.8 "NEW" tickets into 0.2.9. I expect that many of them will subsequently get triaged out.

comment:9 Changed 2 years ago by arma

Severity: Normal
Summary: client with explicit guards, no cached-* files stalls forever at Bootstrapped 80%client with explicit EntryNodes, no cached-* files never finds its entrynodes

Confirmed.

You can experience it yourself, with a much simpler torrc than the above. Delete your cached-* files (I didn't need to mess with my state file), pick a relay in the current consensus that has the Guard flag and is around, e.g. Mozilla1, and then run your Tor with
src/or/tor entrynodes Mozilla1

Tor will do all the directory bootstrapping stuff, but it will never make any 3-hop circuits.

What's supposed to happen: Tor starts up, fetches the consensus, fetches the authority certificates, and fetches some microdescs. Each receipt of new dir info calls directory_info_has_arrived(), and the idea is that once directory_info_has_arrived() decides that router_have_minimum_dir_info(), then it will call entry_guards_compute_status(), which will call entry_nodes_should_be_added(), which will set should_add_entry_nodes to 1. Then the next time we call choose_random_entry_impl() to pick the first hop of a new circuit, we remember to call entry_guards_set_from_config() to populate our guard list first.

Except we forgot to call directory_info_has_arrived() for incoming microdescs! What the heck. I wonder what else is broken here.

Here is a working fix:

diff --git a/src/or/directory.c b/src/or/directory.c
index 6f8edb5..0242829 100644
--- a/src/or/directory.c
+++ b/src/or/directory.c           
@@ -1991,6 +1991,8 @@ connection_dir_client_reached_eof(dir_connection_t *conn)
       }
       control_event_bootstrap(BOOTSTRAP_STATUS_LOADING_DESCRIPTORS,
                               count_loading_descriptors_progress());
+      if (mds && smartlist_len(mds))
+        directory_info_has_arrived(now, 0);
       SMARTLIST_FOREACH(which, char *, cp, tor_free(cp));
       smartlist_free(which);
       smartlist_free(mds);

(Probably we want to put that control_event_bootstrap inside the "if mds" clause too, but I didn't want to go overboard on the patch quite yet.)

One downside of this patch is that bootstrapping gets noisier, since it reconsiders whether it needs to do things each time it learns new microdescs -- which is the point of the function after all.

comment:10 Changed 2 years ago by arma

Context: Once upon a time, directory_info_has_arrived() was the spider in the middle of the directory fetching web: every time we made any progress at bootstrapping, we could call this function, and it would evaluate the situation, learn what new stuff just came in, and dispatch the next step in response.

I was going to say:
Over time, this functionality got spread throughout the source code, so when we had a guess about what step ought to happen next, we called it directly. We still also called directory_info_has_arrived() most of the time, but it lost the central dispatch role that it once had. but upon looking in more detail, I think we didn't actually decentralize it that much.

Then I was going to say It looks like router_dir_info_changed() is playing a lot of the role that the directory_info_has_arrived dispatcher used to play. but upon looking in more detail, its primary role is just to force us to re-evaluate whether we have enough dir info, next time we ask.

So I'm going to stop working on the new ticket I was going to open, entitled Use directory_info_has_arrived() consistently, or get rid of it, and suggest that we should plan to use directory_info_has_arrived more consistently.

comment:11 Changed 2 years ago by arma

Status: newneeds_review

There's a patch that could be reviewed. We might even want it in 0.2.8, since we're still early in the alpha process.

The only side effect of my patch I believe is that bootstrapping is now noisier again, since we log the output from thinking about minimum-dir-info each time we process another batch of microdescs. We could imagine passing an argument to directory_info_has_arrived to indicate whether you think it should be quiet in response to this new arrival, and then we could get our old behavior with respect to noisiness and the new behavior with respect to notifying the guard subsystem when we have a consensus here and ready.

Oh, and if we wanted to be totally wild, we could consider moving the call to entry_guards_compute_status(options, now) so it only happens when we receive a new consensus of some flavor. Since I think that's the only sort of dir info that we want to actually change our behavior from? I think?

comment:12 in reply to:  11 Changed 2 years ago by cypherpunks

Replying to arma:

We might even want it in 0.2.8, since we're still early in the alpha process.

That sounds like a good idea. I've had mysterious hangs during startup that got resolved by restarting TBB. This also seems to occur if the cached consensus data is stale (3 days old).

comment:13 Changed 2 years ago by nickm

Milestone: Tor: 0.2.9.x-finalTor: 0.2.8.x-final

ok, marking patch for review in 0.2.8

comment:14 Changed 2 years ago by arma

I made a bug16825 branch in my arma, which does all of the above. Now there is something more concrete to review.

comment:15 Changed 2 years ago by arma

There remains a mildly confusing log message during bootstrap and probably every time we get new dir info once bootstrapped:

Feb 22 03:03:10.791 [notice] 1 entries in guards

It looks like this went in during commit 2797fd8f, which went into 0.2.3.2-alpha as part of #2798.

I think maybe we want to leave this message in place though, as it's maybe useful to expert users who set EntryNodes. (Sorry, it is still the case that if you set EntryNodes, you're an expert user.)

comment:16 Changed 2 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Okay, that looks simple enough. Merging!

comment:17 Changed 2 years ago by starlight

Experienced today a recurrence of a variation of this bug. I believe #17147 also applies or is the same issue.

Running 2.6.10 with the patch from this bug report applied in client-only bridge mode. Image compiled with ASAN+UBSAN. Uptime 72 days 12 hours.

Got

Jun 23 06:23:57 Tor[]: new bridge descriptor 'omitted' (fresh): $[omitted] at [omitted]
Jun 23 06:23:58 Tor[]: Delaying directory fetches: No running bridges

and Tor client refused to build any circuits.

Put it in full debug logging and switched to a different bridge and it started working fine. Then switched back to the normal bridge and the client daemon returned to the inoperative state. Have debug log of this event as well.

My belief is this bug / #17147 results from a logic flaw where micro-descriptors or full-descriptors are corrupted when converted to the internal binary structure representation.

Killed the client daemon with a SEGV and successfully captured a fully gdb-usable core image. Restarted and it came back running perfectly.

If anyone has interest in trying to unravel the state of this core and/or examining the log file please let me know and the materials will be made available. On the other hand I understand that the guard/descriptor logic will shortly be rewritten, so perhaps there is no point in chasing such a rarely exhibited bug.

Note: See TracTickets for help on using tickets.