Reachability Tests aren't conducted if there are no exit nodes

changed milestone to %Tor: 0.2.6.x-final

added 026-deferrable chutney component::core tor/tor lorax milestone::Tor: 0.2.6.x-final owner::teor parent::14034 priority::medium resolution::fixed status::closed test-network tor-relay type::defect version::tor 0.2.6.1-alpha labels

Marking as should-fix-eventually (0.2.???); I'd take a clean patch to fix this if somebody writes one.

Trac:
Milestone: N/A to Tor: 0.2.???
Keywords: N/A deleted, tor-relay test-network lorax added

Around 1-10% of chutney runs show an error like this on my OS X (multiprocessor) system, and chutney's transmission test fails. This condition can persist for well over 30 minutes, when a normal run of this chutney net succeeds within 18 seconds.

This appears to be exacerbated by:

the existing shorter intervals set for TestingTorNetwork
the custom shorter intervals set in the chutney templates
larger numbers of tor processes
in particular, larger numbers of authorities in the test network
background system load (e.g. compilation)

In this particular case, it looks like a race condition around chutney-launched tor processes causes this issue.

If it would help, I can provide logs, or try to produce a chutney net configuration with a larger failure rate.

#13787 (moved) may be a duplicate of this. nickm to confirm.

Trac:
Cc: N/A to teor, nickm

I believe that an appropriate fix for this issue is to extend router_have_minimum_dir_info to take a parameter dir_info_purpose indicating what the dir info would be used for. (Or, perhaps, a set of flags for guard/middle/exit. Have to look into this.)

This short-circuits the chicken or egg issue by splitting the checks into internal and external. Internal can succeed, then activate conditions (exit ports), allowing external to succeed.
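As a rough illustration of the chicken-and-egg problem described above, here is a toy model in C. It is only a sketch of my reading of this thread, not Tor's code: the struct, the function names, and the three-relay threshold are all hypothetical. The network starts with three listed authorities and one unlisted exit relay, and that relay only gets listed after it has been allowed to run its reachability self-test.

```c
#include <stdbool.h>

/* Toy network state; all names here are hypothetical, not Tor's. */
typedef struct {
  int listed_relays;       /* relays present in the current consensus */
  int listed_exits;        /* relays the consensus lists as exits */
  bool exit_relay_tested;  /* has the unlisted exit relay self-tested? */
} toy_net_t;

/* Run one bootstrap round. require_exits=true models a check that
 * insists on exits before any reachability test may run;
 * require_exits=false models a check split into internal and exit. */
static void
toy_bootstrap_round(toy_net_t *net, bool require_exits)
{
  bool internal_ok = net->listed_relays >= 3;  /* toy threshold */
  bool exit_ok = internal_ok && net->listed_exits > 0;
  bool may_test = require_exits ? exit_ok : internal_ok;

  if (may_test)
    net->exit_relay_tested = true;

  /* Assumption: once the relay has tested itself, the next consensus
   * lists it as an exit. */
  if (net->exit_relay_tested && net->listed_exits == 0) {
    net->listed_relays += 1;
    net->listed_exits = 1;
  }
}

/* Return true if the toy network ever lists an exit. */
static bool
toy_network_gets_exit(bool require_exits, int max_rounds)
{
  toy_net_t net = { 3, 0, false };  /* 3 authorities, no exits yet */
  for (int i = 0; i < max_rounds; i++)
    toy_bootstrap_round(&net, require_exits);
  return net.listed_exits > 0;
}
```

In this model, toy_network_gets_exit(true, n) stays false for any n, while toy_network_gets_exit(false, 1) succeeds after a single round.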

(Alternately, we can force exits using the new TestingDirAuthVoteExit in tor 0.2.6, which works sometimes, but not always.)

Trac:
Owner: N/A to teor
Summary: Reachability Tests aren't conducted if there are no exit nodes to `
Status: new to assigned

Oops, stray keypress in the title field.

Trac:
Summary: ` to Reachability Tests aren't conducted if there are no exit nodes

After attempting to test my proposed changes, I believe there are multiple race conditions in the network bootstrap that cause intermittent failures.

However, the chicken-and-egg exit issue covered by this bug produces reproducible failures (I believe it to be the cause of #13161 (moved) and one of the potential causes for #13787 (moved)).

In order to simplify testing, I have created a chutney config that (AFAIK) contains the smallest possible/reasonable Tor network: 3 authorities, 1 exit, 1 client.

Branch: basic-min Repository: https://github.com/teor2345/chutney.git

Nick, would you like to merge the chutney branch?

I will be testing my changes against this minimal config in order to eliminate intermittent failures from the more complex, rarer race conditions.

Success Criteria: The old (95%) and new code (99%) both succeed as long as TestingDirAuthVoteExit is turned on.

The old code fails (0%) when TestingDirAuthVoteExit is turned off. (See #13161 (moved).) The new code should reliably (95%) bootstrap with TestingDirAuthVoteExit turned off.

I'll get back to you after a few hundred test runs...

Trac:
Keywords: tor-relay test-network lorax deleted, tor-relay test-network lorax chutney added

I've posted the draft tor changes to:

Branch: bug13718-stop-req-exits-for-or-conns Repository: https://github.com/teor2345/tor.git

The branch contains two commits:

ignore exits when checking min dir info for internal connections (includes detailed log messages). This is the maximally compatible change that could be back-ported. Reported BOOTSTRAP_STATUS values try to look as much like the old version as possible. (Some duplicate events may be generated.)
split BOOTSTRAP_STATUS into INTERNAL and EXIT stages. This changes the values and number of events the controller will receive. This helps in determining whether we're hanging waiting for internal or exit paths. But it isn't necessary to back-port it.

I'll attach my continuous testing script, which could go in chutney or tor, if it would be useful. (Which one, Nick?)

I'm currently testing the failure rate of this code on OS X (i386 & x86_64). Can others test on Linux & Windows?

This also probably needs some simple unit tests. Not quite sure how to write those.

Trac:
Version: N/A to Tor: 0.2.6.1-alpha

Trac:
continuous-test-network.sh

Continuously run chutney until data transmission fails. Good for intermittent errors.

Merged the chutney patch; will review the other one on the bus.

Thanks! Here are some initial thoughts:

42e4c18236068984c027ec1d737b34595ada8ace:

I kinda want an enum for the argument to router_have_minimum_dir_info(), rather than a boolean. It seems like it would be clearer that way. Or possibly, there should be two wrappers around it: have_minimum_dir_info_for_exit_circ(), have_minimum_dir_info_for_internal_circ().
The documentation for status_out in compute_frac_paths_available needs to be explicit about allocating a new string (by convention).
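The enum and two-wrapper options from the first point could be sketched roughly as below. This is an assumption-laden illustration, not the patch itself: the wrapper names come from the comment above, but the state struct, the enum, the toy_ core function, and all signatures are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical snapshot of what we know; Tor computes this differently. */
typedef struct {
  bool have_internal_paths;  /* enough info to build internal paths */
  bool have_exit_paths;      /* enough info to build exit paths */
} toy_dir_state_t;

/* Hypothetical enum replacing a bare boolean argument. */
typedef enum {
  MIN_DIR_INFO_INTERNAL,
  MIN_DIR_INFO_EXIT
} min_dir_info_purpose_t;

/* Core check: the purpose argument says what the info will be used for. */
static bool
toy_router_have_minimum_dir_info(const toy_dir_state_t *s,
                                 min_dir_info_purpose_t purpose)
{
  switch (purpose) {
    case MIN_DIR_INFO_INTERNAL:
      return s->have_internal_paths;
    case MIN_DIR_INFO_EXIT:
      return s->have_internal_paths && s->have_exit_paths;
  }
  return false;  /* unknown purpose: be conservative */
}

/* Thin wrappers so call sites read clearly without naming the enum. */
static bool
have_minimum_dir_info_for_exit_circ(const toy_dir_state_t *s)
{
  return toy_router_have_minimum_dir_info(s, MIN_DIR_INFO_EXIT);
}

static bool
have_minimum_dir_info_for_internal_circ(const toy_dir_state_t *s)
{
  return toy_router_have_minimum_dir_info(s, MIN_DIR_INFO_INTERNAL);
}
```

Either form makes each call site state which kind of circuit it cares about, which a bare true/false argument does not.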

440b10ec29d19459376d380bdd659fc8c9d5bb26

Need to tweak messages in bootstrap_status_to_string to make them a little more human-comprehensible, or users will wonder what they mean.
Do we need corresponding control-spec changes to document these statuses?
This needs a changes/ file too.

Happy to make these changes, Nick.

I've now seen the statuses pop up when launching TorBrowser using this build, so I understand the need for comprehensibility.

> I kinda want an enum for the argument to router_have_minimum_dir_info(), rather than a boolean. It seems like it would be clearer that way. Or possibly, there should be two wrappers around it: have_minimum_dir_info_for_exit_circ(), have_minimum_dir_info_for_internal_circ().

Is there the possibility of needing to calculate weights for guard, middle, and exit nodes in arbitrary combinations? (i.e. before choosing a guard node, ensure minimum guard bandwidth) If so, we could use a set of bit-shift flags.

If not, I'm happy to set up an enum with the two current values of Exit and Internal, and possibly an aliased value for those circumstances where we want a default option.

We may also need to update the status/enough-dir-info GETINFO control event. Should we add status/enough-dir-info/exit and status/enough-dir-info/internal, defaulting status/enough-dir-info to exit for backwards compatibility?

I also wonder about the impact of changing the invocation of circuit_build_needed_circs() so that it runs when we know we have internal circuits, rather than waiting for exit circuits.

Should we split it into internal and exit versions? If so, which types of circuits go in each category?

Replying to teor:

> Happy to make these changes, Nick.
>
> I've now seen the statuses pop up when launching TorBrowser using this build, so I understand the need for comprehensibility.
>
> > I kinda want an enum for the argument to router_have_minimum_dir_info(), rather than a boolean. It seems like it would be clearer that way. Or possibly, there should be two wrappers around it: have_minimum_dir_info_for_exit_circ(), have_minimum_dir_info_for_internal_circ().
>
> Is there the possibility of needing to calculate weights for guard, middle, and exit nodes in arbitrary combinations? (i.e. before choosing a guard node, ensure minimum guard bandwidth) If so, we could use a set of bit-shift flags.

I don't think so. It would be likelier to have to calculate weights for different kinds of circuits, I imagine.

> If not, I'm happy to set up an enum with the two current values of Exit and Internal, and possibly an aliased value for those circumstances where we want a default option.

Sounds good.

> We may also need to update the status/enough-dir-info GETINFO control event. Should we add status/enough-dir-info/exit and status/enough-dir-info/internal, defaulting status/enough-dir-info to exit for backwards compatibility?

Sounds fine, though it could be a separate ticket.

> I also wonder about the impact of changing the invocation of circuit_build_needed_circs() so that it runs when we know we have internal circuits, rather than waiting for exit circuits. Should we split it into internal and exit versions? If so, which types of circuits go in each category?

That's an interesting question, but it sounds like a separate ticket. Generally, anything that is a predicted circuit, or anything that might carry user traffic, is an exit circuit. Anything else is an internal circuit.
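The rule of thumb in the reply above can be written down as a tiny C sketch. The enum, the function name, and the two boolean inputs are hypothetical; Tor's real circuit purposes are more fine-grained, and the reply itself only says "generally".

```c
#include <stdbool.h>

/* Hypothetical two-way classification; not Tor's circuit purposes. */
typedef enum {
  TOY_CIRC_INTERNAL,
  TOY_CIRC_EXIT
} toy_circ_class_t;

/* Predicted circuits, and anything that might carry user traffic,
 * count as exit circuits; everything else counts as internal. */
static toy_circ_class_t
toy_classify_circuit(bool is_predicted, bool may_carry_user_traffic)
{
  if (is_predicted || may_carry_user_traffic)
    return TOY_CIRC_EXIT;
  return TOY_CIRC_INTERNAL;
}
```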

Split off #13813 (moved) for internal and exit sub-events to the status/enough-dir-info GETINFO control event.

Split off #13814 (moved) for building HS IP and other internal needed circuits earlier, once we can build internal paths.

Occasionally, the CPU load on my test machine will increase (or some other condition affecting the scheduler will occur), and a bootstrap race condition will cause the test to fail 50-100% of the time for a few hours. Then it will start working again.

The commands run are exactly the same each time. I'll be excluding these results from the tests, because they happen with or without the changes.

Perhaps lengthening some of the default intervals chutney uses would solve this?

Split this issue into #13823 (moved)

Replying to teor:

> I believe that an appropriate fix for this issue is to extend router_have_minimum_dir_info to take a parameter dir_info_purpose indicating what the dir info would be used for. (Or, perhaps, a set of flags for guard/middle/exit. Have to look into this.)

Another simpler hack might be to say that you don't have to think about whether you know about enough exits if there aren't any exits in the consensus you have.
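A minimal sketch of that simpler hack, assuming a counts-based check. Tor's real check works on path fractions rather than raw counts, and every name below is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical counts derived from the consensus and our descriptors. */
typedef struct {
  int exits_in_consensus;       /* relays the consensus lists as exits */
  int exits_with_descriptors;   /* of those, how many we can use now */
  int relays_with_descriptors;  /* all relays we can use now */
} toy_dir_counts_t;

/* Return true if we may proceed. If the consensus lists no exits at
 * all, the exit requirement is skipped: there is nothing to wait for. */
static bool
toy_have_enough_dir_info(const toy_dir_counts_t *c, int min_relays)
{
  if (c->relays_with_descriptors < min_relays)
    return false;
  if (c->exits_in_consensus == 0)
    return true;  /* the hack: no exits exist, so do not require any */
  return c->exits_with_descriptors > 0;
}
```

Compared with a purpose parameter, this leaves every caller unchanged, but it cannot tell internal and exit readiness apart once the consensus does list exits.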

Testing this patch appears to have revealed another bug where chutney-run tor authorities don't flag anything as an Exit (fixed in #13161 (moved) with TestingDirAuthVoteExit, perhaps related to #11264 (moved)).

Perhaps we should test with:

TestingDirAuthVoteExit *
AssumeReachable 0

This will avoid the issues in #13161 (moved) / #11264 (moved), while still testing the reachability bootstrapping concerns of the OP.

I have logged this as a separate issue #13839 (moved), where the policy is accept * but no Exit flag is assigned.

too long; didn't read

target function: consider_testing_reachability

call site #1: directory_info_has_arrived

call site #2: run_scheduled_events (and call site #3)

call site #4: circuit_testing_opened
