Opened 4 years ago

Last modified 2 years ago

#17147 needs_information defect

long-running client path-selection not seeing some (fast) exit nodes

Reported by: starlight Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.2.6.10
Severity: Normal Keywords: tor-client path-selection needs-diagnosis
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

I noticed that a couple of fast new exit relays
were never being chosen, which is strange
because with an exit probability of over 1%
that should happen once in awhile.

Then used setconf ExitNodes= to force one
of them and the client balked, saying

All routers are down or won't exit -- choosing a doomed exit at random.
No specified exit routers seem to be running: can't choose an exit.
Failed to choose an exit server

Then restarted the client and it worked.

Client was running uninterrupted for 10 days when this occurred.

torrc specifies UseMicrodescriptors 0 (due to various
other misbehaviors related to running with micro-descriptors
enabled.

This appears to be an area in need of QA work.

Child Tickets

Attachments (2)

logfile_scrub (127.8 KB) - added by starlight 4 years ago.
trace
logfile110_scrub (160.9 KB) - added by starlight 4 years ago.
trace of <Path Empty> variation

Download all attachments as: .zip

Change History (14)

comment:1 Changed 4 years ago by starlight

Should mention that EntryNodes is also set in torrc.

comment:2 Changed 4 years ago by teor

Milestone: Tor: 0.2.???
Status: newneeds_information

What is EntryNodes set to?
Is StrictNodes set?

comment:3 Changed 4 years ago by starlight

What is EntryNodes set to?

Not willing to reveal my guard. Is a single high-quality, long-running guard running 0.2.6.10.

Is StrictNodes set?

No, but don't have ExcludeNodes set so what difference would it make?

Issue is that for days a new high-weight exit was not selected by the normal routing algorithm and the new exit was refused as a choice when an attempt was made to force it.

Should have traced this when it happened but it so surprised me I didn't think of it. Have a script to enable debug tracing for the next time I see it, but I turned UseMicrodescriptors back on since its misbehavior bothers me less than this issue. Can put back UseMicrodescriptors 0 and try to reproduce with debug tracing if desired.

comment:4 Changed 4 years ago by starlight

Think might have identified the problem.

Looks like client-mode gets confused when the exit policy for a node changes.

before

BabylonNetwork00 Ivn7UYzogAgbMQG3C52F1b1NeWc p reject 22,25,109-110,119,143,563,4899,6881-6889,18353,23574,31826

after

BabylonNetwork00 Ivn7UYzogAgbMQG3C52F1b1NeWc p accept 20-21,23,43,53,79-81,88,110,143,194,220,389,443,464-465,531,543-544,554,563,587,636,706,749,873,902-904,981,989-995,1194,1220,1293,1500,1533,1677,1723,1755,1863,2082-2083,2086-2087,2095-2096,2102-2104,3128,3389,3690,4321,4643,5050,5190,5222-5223,5228,5900,6660-6669,6679,6697,8000,8008,8074,8080,8082,8087-8088,8332-8333,8443,8888,9418,9999-10000,11371,19294,19638,50002,64738

First tried testing port 110, which is excluded by the old policy and
accepted by the new policy. This led to <path fail> displays in Vidalia
But then also ran into port 80 failing as well. Trace of this is attached
to the ticket.

Was running with UseMicrodescriptors=0 and StrictNodes=1 where the latter
was set because the client would grab another node when BabylonNetwork00
failed. Think StrictNodes=0 was in effect with the port 110 fail but
not certain.

The exit policy change is quite recent so any other long-running client
might exhibit the problem at the preset time.

Have had

WarnPlaintextPorts ""
WarnUnsafeSocks 0

at all times.

Changed 4 years ago by starlight

Attachment: logfile_scrub added

trace

comment:5 Changed 4 years ago by starlight

Port 110 failure reproduced, with debug trace. See <Path Empty> in Vidalia. StrictNodes=0

Changed 4 years ago by starlight

Attachment: logfile110_scrub added

trace of <Path Empty> variation

comment:6 Changed 4 years ago by starlight

Looked through the second trace and it does not say
much of interest except

[debug] circuit_remove_handled_ports(): Port 110 is not handled.
[info] circuit_predict_and_launch_new(): Have 0 clean circs (0 internal), need another exit circ.
[debug] new_route_len(): Chosen route length 3 (6588/7334 routers suitable).
[info] choose_good_exit_server_general(): Found 0 servers that might support 0/0 pending connections.
[info] choose_good_exit_server_general(): We couldn't find any live, fast routers; falling back to list of all routers.
[info] choose_good_exit_server_general(): Found 0 servers that might support 0/0 pending connections.
[notice] All routers are down or won't exit -- choosing a doomed exit at random.
[debug] circuit_remove_handled_ports(): Port 110 is not handled.
[info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as exit
[info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as exit
[warn] No specified exit routers seem to be running: can't choose an exit.
[warn] Failed to choose an exit server

Spent a few minutes looking at the code and it seems
that ascii exit policies are converted into a binary
format for use by the path selection logic.

So--just guessing--perhaps the binary version
of an exit policy is not updated when the ascii form
changes in the published consensus and/or descriptor?

comment:7 Changed 4 years ago by starlight

Not so simple this one.

Today looked at a few more relays where the exit
policy changed and every one worked fine in the
long-running client--but BabylonNetwork00 remains
completely unusable.

Interestingly BabylonNetwork02 underwent the exact
same exit policy change a few hours after BabylonNetwork00
and is working as expected.

A notable difference is that BabylonNetwork00 had
a descriptor change the hour before the descriptor
change where the new exit policy was delivered,
where BabylonNetwork02 had no descriptor changes
for a awhile before the policy change.

BabylonNetwork00
2015-10-09-11-10-09-server-descriptors -- old exit policy
2015-10-09-12-10-23-server-descriptors -- new exit policy

BabylonNetwork02
2015-10-10-04-09-34-server-descriptors -- old exit policy
2015-10-10-16-09-19-server-descriptors -- new exit policy

Causes me to wonder if this problem is tied up with
some difference between processing from
cached-descriptors and cached-descriptors.new
so I'll look for that going forward.

If someone with a long-running client, especially
one with UseMicrodescriptors=0 in effect could
try exiting through the above two nodes to
see if the problem reproduces that would be
helpful.

comment:8 Changed 3 years ago by teor

Milestone: Tor: 0.2.???Tor: 0.3.???

Milestone renamed

comment:9 Changed 3 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.???Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:10 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:11 Changed 2 years ago by nickm

Keywords: tor-client path-selection needs-diagnosis added
Severity: Normal

comment:12 Changed 2 years ago by starlight

Running 0.2.9.11 client lately. I can attempt to reproduce this again if it seems relevant and if this version is current enough w/r/t path selection code to be useful.

Reasonably sure the last approach I came up with could work, i.e. set UseMicrodescriptors=0, clear caches and restart. Then awk out a list of exits with exit policies. After awhile repeat and look for cases as described above. I recall thinking some problem exists in the code that translates policy descriptors to internal structures used by the path selection code. Even wrote that into the ticket it seems

Note: See TracTickets for help on using tickets.