Opened 12 years ago

Closed 3 years ago

#675 closed defect (worksforme)

update_consensus_networkstatus_downloads() & router_pick_directory_server() rarely called

Reported by: jasemandude Owned by:
Priority: High Milestone: Tor: unspecified
Component: Core Tor/Tor Version: 0.2.0.25-rc
Severity: Normal Keywords: tor-client
Cc: jasemandude, arma, nickm Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by nickm)

Possible solution to #648

Machine was left on for several hours with TOR running but no dial up network, then dial up network was brought
back up. TOR spends half an hour in a loop where it can't find nodes. Throughout this time, using TORbutton to
switch off proxies in Firefox allowed me to browse the web with no problems, so I know the network is working fine.

After waiting 30 minutes, killing and relaunching TOR seemed to fix the problem, though from the logs below it looks
like TOR had finally made some circuits just before I killed it.

Taking a closer look at the logs, I can see it wasn't able to make circuits for a whole 30 minutes until it called
update_consensus_networkstatus_downloads() and router_pick_directory_server() - then all of a sudden everything
started to work.

Are these functions rarely called because of bandwidth issues? Could some logic be added so that if we get
"[Warning] No available nodes when trying to choose node. Failing." then these functions get called?

Here are the log messages:

Apr 26 18:28:46.820 [Info] circuit_get_open_circ_or_launch(): No safe circuit (purpose 5) ready for edge connection; delaying.
Apr 26 18:28:46.822 [Info] circuit_get_open_circ_or_launch(): No safe circuit (purpose 5) ready for edge connection; delaying.
Apr 26 18:28:46.824 [Info] circuit_predict_and_launch_new(): Have 0 clean circs (0 internal), need another exit circ.
Apr 26 18:28:46.825 [Info] choose_good_exit_server_general(): Found 74 servers that might support 11/12 pending connections.
Apr 26 18:28:46.828 [Info] choose_good_exit_server_general(): Chose exit server 'oemloi'
Apr 26 18:28:46.830 [Info] router_choose_random_node(): We couldn't find any live, guard routers; falling back to list of all routers.
Apr 26 18:28:46.832 [Warning] No available nodes when trying to choose node. Failing.
Apr 26 18:28:46.834 [Warning] No available nodes when trying to choose node. Failing.
Apr 26 18:28:46.837 [Info] router_choose_random_node(): We couldn't find any live, guard routers; falling back to list of all routers.
Apr 26 18:28:46.839 [Warning] No available nodes when trying to choose node. Failing.
Apr 26 18:28:46.842 [Warning] No available nodes when trying to choose node. Failing.
Apr 26 18:28:46.844 [Info] router_choose_random_node(): We couldn't find any live, guard routers; falling back to list of all routers.
Apr 26 18:28:46.847 [Warning] No available nodes when trying to choose node. Failing.
Apr 26 18:28:46.850 [Warning] No available nodes when trying to choose node. Failing.
Apr 26 18:28:46.853 [Warning] Failed to find node for hop 0 of our path. Discarding this circuit.
Apr 26 18:28:46.856 [Info] onion_populate_cpath(): Generating cpath hop failed.
Apr 26 18:28:47.860 [Info] choose_good_exit_server_general(): Found 74 servers that might support 11/12 pending connections.
Apr 26 18:28:47.864 [Info] choose_good_exit_server_general(): Chose exit server 'gashmish'
Apr 26 18:28:47.899 [Info] router_choose_random_node(): We couldn't find any live, guard routers; falling back to list of all routers.
Apr 26 18:28:47.903 [Warning] No available nodes when trying to choose node. Failing.

[... and so on for 30 minutes, still no nodes ...]

Apr 26 18:58:16.130 [Warning] No available nodes when trying to choose node. Failing.
Apr 26 18:58:16.177 [Warning] Failed to find node for hop 0 of our path. Discarding this circuit.
Apr 26 18:58:16.219 [Info] onion_populate_cpath(): Generating cpath hop failed.
Apr 26 18:58:16.266 [Info] circuit_get_open_circ_or_launch(): No safe circuit (purpose 5) ready for edge connection; delaying.

[... but suddenly ...]

Apr 26 18:58:43.588 [Info] circuit_get_open_circ_or_launch(): No safe circuit (purpose 5) ready for edge connection; delaying.
Apr 26 18:58:43.994 [Info] circuit_predict_and_launch_new(): Have 0 clean circs (0 internal), need another exit circ.Apr 26 18:58:44.258 [Info] update_consensus_router_descriptor_downloads(): 0 router descriptors downloadable. 0 delayed; 1782 present (0 of those were in old_routers); 0 would_reject; 391 wouldnt_use; 0 in progress.
Apr 26 18:58:44.329 [Info] routerlist_remove_old_routers(): We have 2211 live routers and 0 old router descriptors. At most 2173 must be retained because of networkstatuses.
Apr 26 18:58:44.398 [Info] update_consensus_networkstatus_downloads(): Launching networkstatus consensus download.
Apr 26 18:58:44.467 [Info] router_pick_directory_server(): No reachable router entries for dirservers. Trying them all again.
Apr 26 18:58:44.567 [Debug] directory_initiate_command(): anonymized 0, use_begindir 1.
Apr 26 18:58:44.640 [Debug] directory_initiate_command(): Initiating consensus network-status fetch
Apr 26 18:58:44.709 [Info] connection_ap_make_link(): Making internal anonymized tunnel to [scrubbed]:443 ...
Apr 26 18:58:44.778 [Debug] connection_add(): new conn type Socks, socket -1, n_conns 11.
Apr 26 18:58:44.847 [Info] circuit_get_open_circ_or_launch(): No safe circuit (purpose 5) ready for edge connection; delaying.
Apr 26 18:58:44.916 [Info] connection_ap_make_link(): ... application connection created and linked.
Apr 26 18:58:44.985 [Debug] connection_add(): new conn type Directory, socket -1, n_conns 12.
Apr 26 18:58:45.054 [Info] circuit_get_open_circ_or_launch(): No safe circuit (purpose 5) ready for edge connection; delaying.
Apr 26 18:58:45.124 [Info] circuit_get_open_circ_or_launch(): No safe circuit (purpose 5) ready for edge connection; delaying.

Apr 26 18:58:45.893 [Info] circuit_predict_and_launch_new(): Have 0 clean circs (0 internal), need another exit circ.
Apr 26 18:58:45.964 [Debug] new_route_len(): Chosen route length 3 (2211 routers available).
Apr 26 18:58:46.034 [Info] choose_good_exit_server_general(): Found 119 servers that might support 6/7 pending connections.
Apr 26 18:58:46.105 [Debug] smartlist_choose_by_bandwidth(): Total weighted bw = 17053612, exit bw = 23966189, nonexit bw = 1614269, exit

weight = 1.000000 (for exit == 1), guard bw = 19886442, nonguard bw = 5694016, guard weight = 0.571224 (for guard == 0)Apr 26 18:58:46.176 [Info] choose_good_exit_server_general(): Chose exit server 'dotplex1'

Apr 26 18:58:46.247 [Debug] onion_extend_cpath(): Path is 0 long; we want 3
Apr 26 18:58:46.320 [Info] router_choose_random_node(): We couldn't find any live, guard routers; falling back to list of all routers.Apr 26 18:58:46.391 [Info] add_an_entry_guard(): Chose 'dannenberg' as new entry guard.
Apr 26 18:58:46.462 [Info] log_entry_guards(): BrainwashEducation (up made-contact),GuyMontag (up made-contact) [...]

Apr 26 18:58:46.540 [Debug] onion_extend_cpath(): Chose router dannenberg for hop 1 (exit is dotplex1)
Apr 26 18:58:46.621 [Debug] onion_extend_cpath(): Path is 1 long; we want 3
Apr 26 18:58:46.698 [Debug] choose_good_middle_server(): Contemplating intermediate hop: random choice.
Apr 26 18:58:46.774 [Debug] smartlist_choose_by_bandwidth(): Total weighted bw = 70934573, exit bw = 23971981, nonexit bw = 107505582, ex
it weight = 0.000000 (for exit == 0), guard bw = 114082699, nonguard bw = 17394864, guard weight = 0.615841 (for guard == 0)
Apr 26 18:58:46.851 [Debug] onion_extend_cpath(): Chose router RentalSponge for hop 2 (exit is dotplex1)
Apr 26 18:58:46.928 [Debug] onion_extend_cpath(): Path is 2 long; we want 3
Apr 26 18:58:47.006 [Debug] onion_extend_cpath(): Chose router dotplex1 for hop 3 (exit is dotplex1)
Apr 26 18:58:47.084 [Debug] onion_extend_cpath(): Path is complete: 3 steps long

[Automatically added by flyspray2trac: Operating System: OSX 10.5 Leopard]

Child Tickets

Change History (18)

comment:1 Changed 12 years ago by nickm

From IRC, so it doesn't get lost:

17:14 < nickm> let's see. update_networkstatus_downloads() gets called once a m
inute, which should call update_consensus_networkstatus_downloads unless should_
delay_dir_fetches() says yes.
17:15 < nickm> [for the record, run_scheduled_events is officially an ugly thing
.]

(Also, it seems JediDruid's system clock doesn't jump around a lot...)
17:32 < nickm> so, there are two possibilities here. one is that we aren't down
loading consensuses often enough. Another is that downloading a consensus trigg
ers some "hey , maybe we're up after all!" code, and we should be willing to con
sider ourselves up more often.

comment:2 Changed 12 years ago by nickm

I think the latter case is probably what's going on. We have decided (since the network was down) that all the
routers are unreachable. It's not till our next scheduled consensus download arrives that we mark everything up
again.

comment:3 Changed 12 years ago by arma

Yes: downloading the consensus marks a lot of routerstatuses Running again,
which means we try things again.

You should be able to reproduce this bug very easily. Start your Tor client,
then go offline for 8 hours (say, disconnect your laptop from wireless and
go to sleep). When you return, you will find a stream of "can't pick a circuit,
0 nodes available" complaints, and once you turn the net back on, it will
recover when the consensus fetch interval arrives and not before.

So one approach would be to look at the "I just got a socks request, optimistically
trying network again" code and teach that to include fetching a consensus now
too. That's probably the simplest approach and the best backport candidate, if
in fact this is the problem.

It might be that the 'optimistically retrying' code isn't triggering because
Tor no longer knows how to recognize when it appears to have gone offline. In
which case that would be the key bug to fix. In particular...

Perhaps this is related to switching to encrypted dir conns! That is, maybe
the dir conn never fails in a way we notice to call all_directories_have_failed,
because it's trying to establish a one-hop circuit and all the nodes are down
so it can't choose any paths and it never even gets to the point of realizing
the dir conn didn't reach its destination.

Another approach would be that if we failed to fetch a consensus at our last
attempt, and the reason is that no nodes appear up, then we should retry more
often until it works. After all, it doesn't cost much bandwidth to fail to
fetch a consensus.

comment:4 Changed 12 years ago by arma

Ok. The issue was actually that were *weren't* realizing we had gone offline.

Back when we were using v2 directory stuff, we had a "num_running" count in
the update_router_have_minimum_dir_info() check. That would keep track of
how many relays we still believed were running, regardless of how many the
networkstatuses claim ought to be running. And when that count got down to 1,
we knew we wouldn't be able to build a circuit, so we would stop trying.

But with the advent of v3 directories, we stopped checking our local opinions,
and just believed the consensus every time. And of course the last consensus
we had listed it as up, so we were repeatedly surprised that it didn't work.

I've put the first version of the fix in as r14970. It may require a few
revisions to get right.

comment:5 Changed 12 years ago by jasemandude

Wow!!! Got r15087 by SVN and compiled. Laptop was off overnight. Switched on this morning, fired up the dial-up 3g
network and had a circuit in just over a minute. Thanks for sorting this one! Your efforts are very much appreciated.

comment:6 Changed 12 years ago by arma

I'm glad it worked for you.

Unfortunately, I didn't actually fix the bug. The version you tried was after
I took out the above bugfix. :)

I spent a few days poking at it over the weekend. I am beginning to think the
best plan is to leave the 0.2.0.x release alone for now, and keep experimenting
with how to fix it more thoroughly in 0.2.1.x, and then backport when/if we find
a good answer.

comment:7 Changed 11 years ago by jasemandude

Yes, you're right - it was just a fluke. I've had it get stuck since I posted the last comment. I'm using r15376 and it still gets stuck :-(

comment:8 Changed 11 years ago by nickm

(note to self: Roger said he'd spend 10 minutes looking through the code to figure out which bits of it are
likely implicated, as a start to figuring out the real right thing to do for this. Bug him if he hasn't by tomorrow.)

comment:9 Changed 11 years ago by arma

One tip here is that update_router_have_minimum_dir_info() looks for
num_present < 2. (num_present is the number of *usable* routers in the
routerstatus for which we have the right descriptor. So if you're not
running you're not usable and therefore not present.) But we're going
to fail to find 3 suitable nodes for a circuit if there are only 2
usable present descriptors remaining, so we will never mark either of
them down. And it could even be more than 2, if the few we have aren't
the right assortment of guard and exit.

And if we never mark the last few down, then we never realize that we
no longer have minimum dir info, so a) we keep on trying to build circuits
and failing and complaining, again and again, forever; and b) the

if (!want_onehop && !router_have_minimum_dir_info()) {

line in circuit_get_open_circ_or_launch() never triggers, so we don't
call the routerlist_retry_directory_downloads() function when a new
socks request arrives.

comment:10 Changed 9 years ago by nickm

Milestone: 0.2.1.x-finalTor: 0.2.2.x-final

comment:11 Changed 9 years ago by nickm

Description: modified (diff)

Just upping the required num_present value up to 3 wouldn't help completely: we could have 3 routers that we weren't able to use in some particular circuit for some other reason.

So perhaps we should just make the choose_random_node() code do something in the case when it says "No available nodes when trying to choose node. Failing" that makes us willing to retry downloads.

We don't want to have it always just make router_have_minimum_dir_info() return 0, because although we lack the directory information we need to build that one circuit, we don't necessarily lack enough directory info to build _any_ multihop circuits.

Perhaps when choose_random_node fails on a _general_ circuit we should say "no, not enough directory info"? Or perhaps we should split router_have_minimum_dir_info() into two functions, one of which means "try really hard to get more dir info" and another of which means "don't even bother trying circuits"?

comment:12 Changed 9 years ago by arma

See #1136 for a related bug. Maybe even the same bug.

comment:13 Changed 9 years ago by arma

Priority: majorminor

Triage: this can wait until after 0.2.2.x (alas)

comment:14 Changed 9 years ago by arma

Milestone: Tor: 0.2.2.x-finalTor: 0.2.3.x-final
Priority: minormajor

This is a serious usability bug, but we don't have the programmerpower to look at everything, so delaying this one some more. :(

comment:15 Changed 8 years ago by nickm

Milestone: Tor: 0.2.3.x-finalTor: unspecified

comment:16 Changed 7 years ago by nickm

Keywords: tor-client added

comment:17 Changed 7 years ago by nickm

Component: Tor ClientTor

comment:18 Changed 3 years ago by nickm

Cc: jasemandude,arma,nickmjasemandude, arma, nickm
Resolution: Noneworksforme
Severity: Normal
Status: newclosed

update_router_have_min_dir_info() has been completely revised in the last 5 years -- at least twice, I think. We should open a new ticket if there's still an issue here.

Note: See TracTickets for help on using tickets.