Opened 4 years ago

Closed 4 years ago

Last modified 10 months ago

#13718 closed defect (fixed)

Reachability Tests aren't conducted if there are no exit nodes

Reported by: tom Owned by: teor
Priority: Medium Milestone: Tor: 0.2.6.x-final
Component: Core Tor/Tor Version: Tor: 0.2.6.1-alpha
Severity: Keywords: tor-relay test-network lorax chutney 026-deferrable
Cc: nickm, dgoulet, tom Actual Points:
Parent ID: #14034 Points:
Reviewer: Sponsor:

Description

Context:

On 22 October 2014 05:48, Roger Dingledine <arma@…> wrote:

What I had to do was make one of my Directory Authorities an exit -
this let the other nodes start building circuits through the
authorities and upload descriptors.

This part seems surprising to me -- directory authorities always publish
their dirport whether they've found it reachable or not, and relays
publish their descriptors directly to the dirport of each directory
authority (not through the Tor network).

So maybe there's a bug that you aren't describing, or maybe you are
misunderstanding what you saw?

See also https://trac.torproject.org/projects/tor/ticket/11973

Another problem I ran into was that nodes couldn't conduct
reachability tests when I had exits that were only using the Reduced
Exit Policy - because it doesn't list the ORPort/DirPort! (I was
using nonstandard ports actually, but indeed the reduced exit policy
does not include 9001 or 9030.) Looking at the current consensus,
there are 40 exits that exit to all ports, and 400-something exits
that use the ReducedExitPolicy. It seems like 9001 and 9030 should
probably be added to that for reachability tests?

The reachability tests for the ORPort involve extending the circuit to
the ORPort -- which doesn't use an exit stream. So your relays should
have been able to find themselves reachable, and published a descriptor,
even with no exit relays in the network.

Okay, so the behavior I saw, and reproduced, is that reachability tests didn't succeed (and therefore descriptors weren't uploaded) when there were no exits. I think I may have figured out why, but there are some internals I haven't completely figured out. I'm going to lay out what I think and then the parts I'm not completely sure about.

First off, you're (obviously) correct about me misunderstanding extending the circuit via an Exit stream, that's not necessary. But still, I think the lack of Exits stopped the reachability tests from succeeding.

too long; didn't read

I don't think reachability tests happen when there are no Exit nodes because of a quirk in the bootstrapping process, where we never think we have a minimum of directory information.

target function: consider_testing_reachability

A reachability test is conducted from consider_testing_reachability (I think it's only conducted from here? Although maybe there's other situations it could happen..?) consider_testing_reachability is called from circuit_send_next_onion_skin, circuit_testing_opened, run_scheduled_events, and directory_info_has_arrived.

call site #1: directory_info_has_arrived

This is called very frequently on router startup. But consider_testing_reachability will not be called if router_have_minimum_dir_info returns false:

void directory_info_has_arrived(time_t now, int from_cache)
{ //...
  if (!router_have_minimum_dir_info()) {
    //...
    return;
  } else { /* ... */ }

  if (server_mode(options) && !net_is_disabled() && !from_cache &&
      (can_complete_circuit || !any_predicted_circuits(now)))
    consider_testing_reachability(1, 1);
}

router_have_minimum_dir_info returns the static variable have_min_dir_info. This variable is only set to 1 in update_router_have_minimum_dir_info and then only if there are Exits! Specifically, we will trigger paths < get_frac_paths_needed_for_circs(options,consensus) because we have 0% of the Exit Bandwidth, as shown by this error message:

Nov 09 22:10:26.000 [notice] I learned some more directory information, but not enough to build a circuit: We need more descriptors: we have 5/5, and can only build 0% of likely paths. (We have 100% of guards bw, 100% of midpoint bw, and 0% of exit bw.)
update_router_have_minimum_dir_info(void)
{	//...
    char *status = NULL;
    int num_present=0, num_usable=0;
    double paths = compute_frac_paths_available(consensus, options, now,
                                                &num_present, &num_usable,
                                                &status);

    if (paths < get_frac_paths_needed_for_circs(options,consensus)) {
      tor_snprintf(dir_info_status, sizeof(dir_info_status),
                   "We need more %sdescriptors: we have %d/%d, and "
                   "can only build %d%% of likely paths. (We have %s.)",
                   using_md?"micro":"", num_present, num_usable,
                   (int)(paths*100), status);
      //...
      res = 0;
      goto done;
    }
   res = 1;
  }

 done:
  if (res && !have_min_dir_info) { /* ... */ }
  if (!res && have_min_dir_info) {
    int quiet = directory_too_idle_to_fetch_descriptors(options, now);
    tor_log(quiet ? LOG_INFO : LOG_NOTICE, LD_DIR,
        "Our directory information is no longer up-to-date "
        "enough to build circuits: %s", dir_info_status);

    /* a) make us log when we next complete a circuit, so we know when Tor
     * is back up and usable, and b) disable some activities that Tor
     * should only do while circuits are working, like reachability tests
     * and fetching bridge descriptors only over circuits. */
    can_complete_circuit = 0;

    control_event_client_status(LOG_NOTICE, "NOT_ENOUGH_DIR_INFO");
  }
  have_min_dir_info = res;
}

(The exact source line is in frac_nodes_with_descriptors, called by compute_frac_paths_available:)

/** For all nodes in <b>sl</b>, return the fraction of those nodes, weighted
 * by their weighted bandwidths with rule <b>rule</b>, for which we have
 * descriptors. */
double
frac_nodes_with_descriptors(const smartlist_t *sl,
                            bandwidth_weight_rule_t rule)
{
  //...
  if (smartlist_len(sl) == 0)
    return 0.0;

This prevents reachability from occurring from directory_info_has_arrived.

call site #2: run_scheduled_events (and call site #3)

There's a litany of conditions to call consider_testing_reachability from run_scheduled_events. In particular, there's can_complete_circuit

if (time_to_check_descriptor < now && !options->DisableNetwork) {
    //...
    /* also, check religiously for reachability, if it's within the first
     * 20 minutes of our uptime. */
    if (is_server &&
        (can_complete_circuit || !any_predicted_circuits(now)) &&
        !we_are_hibernating()) {
      if (stats_n_seconds_working < TIMEOUT_UNTIL_UNREACHABILITY_COMPLAINT) {
        consider_testing_reachability(1, dirport_reachability_count==0);

can_complete_circuit is only set in circuit_send_next_onion_skin, but then only if a circuit is built and it is not circ->build_state->onehop_tunnel. I _think_ this means the circuit is a full circuit, complete with Exit. Right?

int circuit_send_next_onion_skin(origin_circuit_t *circ)
{ //...
  if (circ->cpath->state == CPATH_STATE_CLOSED) {
    // ...
  } else {
    //...
    hop = onion_next_hop_in_cpath(circ->cpath);
    if (!hop) {
      //...
      if (!can_complete_circuit && !circ->build_state->onehop_tunnel) {
        can_complete_circuit=1;
        /* FFFF Log a count of known routers here */
        log_notice(LD_GENERAL,
            "Tor has successfully opened a circuit. "
            "Looks like client functionality is working.");
        //...
        if (server_mode(options) && !check_whether_orport_reachable()) {
          inform_testing_reachability();
          consider_testing_reachability(1, 1);

This is also the third place consider_testing_reachability is called - there is only one left:

call site #4: circuit_testing_opened

/** A testing circuit has completed. Take whatever stats we want.
 * Noticing reachability is taken care of in onionskin_answer(),
 * so there's no need to record anything here. But if we still want
 * to do the bandwidth test, and we now have enough testing circuits
 * open, do it.
 */
static void
circuit_testing_opened(origin_circuit_t *circ)
{
  if (have_performed_bandwidth_test ||
      !check_whether_orport_reachable()) {
    /* either we've already done everything we want with testing circuits,
     * or this testing circuit became open due to a fluke, e.g. we picked
     * a last hop where we already had the connection open due to an
     * outgoing local circuit. */
    circuit_mark_for_close(TO_CIRCUIT(circ), END_CIRC_AT_ORIGIN);
  } else if (circuit_enough_testing_circs()) {
    router_perform_bandwidth_test(NUM_PARALLEL_TESTING_CIRCS, time(NULL));
    have_performed_bandwidth_test = 1;
  } else
    consider_testing_reachability(1, 0);
}

But... as far as I can tell - a testing circuit is only used for two things: conducting a reachability test and conducting a bandwidth self-test. The only place a bandwidth self-test is called is inside circuit_testing_opened. So this call of consider_testing_reachability is a chicken or the egg problem.

Child Tickets

TicketStatusOwnerSummaryComponent
#13787closedteorchutney doesn't seem to bootstrap with masterCore Tor/Tor
#13814closedteorAvoid building exit circuits unless we have a consensus that can build exit pathsCore Tor/Tor
#13823closedteorchutney intervals are too short for successful bootstrap, particularly under high CPU load on OS XCore Tor/Tor
#13839closedteorExit Policy says accept *:* but relay doesn't have Exit flagCore Tor/Tor
#13924closedteorReachability testing and channel is_local assume private addresses are localCore Tor/Tor
#13963closedteorIf Modified Since delay is too long when V3AuthVotingInterval is very shortCore Tor/Tor

Attachments (9)

continuous-test-network.sh (227 bytes) - added by teor 4 years ago.
Continuously run chutney until data transmission fails. Good for intermittent errors.
torrc (4.6 KB) - added by teor 4 years ago.
authority torrc that enables relays to bootstrap in a network without exits
torrc.2 (2.9 KB) - added by teor 4 years ago.
relay torrc that bootstraps in a network without exits
torrc.authority (3.0 KB) - added by teor 4 years ago.
authority torrc that bootstraps network with no exits to full functionality in 30 seconds
torrc.exit (3.0 KB) - added by teor 4 years ago.
exit torrc that bootstraps network with no exits to full functionality in 30 seconds
torrc.client (1.4 KB) - added by teor 4 years ago.
client torrc that bootstraps network with no exits to full functionality in 30 seconds
torrc.authority.fast (3.8 KB) - added by teor 4 years ago.
authority torrc that bootstraps network to full functionality in 10 seconds
torrc.exit.fast (3.8 KB) - added by teor 4 years ago.
exit torrc that bootstraps network to full functionality in 10 seconds
torrc.client.fast (2.2 KB) - added by teor 4 years ago.
client torrc that bootstraps network to full functionality in 10 seconds

Download all attachments as: .zip

Change History (62)

comment:1 Changed 4 years ago by nickm

Keywords: tor-relay test-network lorax added
Milestone: Tor: 0.2.???

Marking as should-fix-eventually (0.2.???); I'd take a clean patch to fix this if somebody writes one.

comment:2 Changed 4 years ago by teor

Around 1-10% of chutney runs show an error like this on my OS X (multiprocessor) system, and chutney's transmission test fails. This condition can persist for well over 30 minutes, when a normal run of this chutney net succeeds within 18 seconds.

This appears to be exacerbated by:

  • the existing shorter intervals set for TestingTorNetwork
  • the custom shorter intervals set in the chutney templates
  • larger numbers of tor processes
  • in particular, larger numbers of authorities in the test network
  • background system load (e.g. compilation)

In this particular case, it looks like a race condition around chutney-launched tor processes causes this issue.

If it would help, I can provide logs, or try to produce a chutney net configuration with a larger failure rate.

comment:3 Changed 4 years ago by teor

Cc: teor nickm added

#13787 may be a duplicate of this. nickm to confirm.

comment:4 Changed 4 years ago by teor

Owner: set to teor
Status: newassigned
Summary: Reachability Tests aren't conducted if there are no exit nodes`

I believe that an appropriate fix for this issue is to extend router_have_minimum_dir_info to take a parameter dir_info_purpose indicating what the dir info would be used for. (Or, perhaps, a set of flags for guard/middle/exit. Have to look into this.)

This short-circuits the chicken or egg issue by splitting the checks into internal and external. Internal can succeed, then activate conditions (exit ports), allowing external to succeed.

(Alternately, we can force exits using the new TestingDirAuthVoteExit in tor 0.2.6, which works sometimes, but not always.)

comment:5 Changed 4 years ago by teor

Summary: `Reachability Tests aren't conducted if there are no exit nodes

Oops, stray keypress in the title field.

comment:6 Changed 4 years ago by teor

Keywords: chutney added

After attempting to test my proposed changes, I believe there are multiple race conditions in the network bootstrap that cause intermittent failures.

However, the chicken-and-egg exit issue covered by this bug produces reproducible failures (I believe it to be the cause of #13161 and one of the potential causes for #13787).

In order to simplify testing, I have created a chutney config that (AFAIK) contains the smallest possible/reasonable Tor network: 3 authorities, 1 exit, 1 client.

Branch: basic-min
Repository: ​​​​https://github.com/teor2345/chutney.git

Nick, would you like to merge the chutney branch?

I will be testing my changes against this minimal config in order to eliminate intermittent failures from the more complex, rarer race conditions.

Success Criteria:
The old (95%) and new code (99%) both succeed as long as TestingDirAuthVoteExit is turned on.

The old code fails (0%) when TestingDirAuthVoteExit is turned off. (See #13161.)
The new code should reliably (95%) bootstrap with TestingDirAuthVoteExit turned off.

I'll get back to you after a few hundred test runs.....................

comment:7 Changed 4 years ago by teor

Version: Tor: 0.2.6.1-alpha

I've posted the draft tor changes to:

Branch: bug13718-stop-req-exits-for-or-conns
Repository: ​​​​​https://github.com/teor2345/tor.git

The branch contains two commits:

  • ignore exits when checking min dir info for internal connections (includes detailed log messages). This is the maximally compatible change that could be back-ported. Reported BOOTSTRAP_STATUS values try to look as much like the old version as possible. (Some duplicate events may be generated.)
  • split BOOTSTRAP_STATUS into INTERNAL and EXIT stages. This changes the values and number of events the controller will receive. This helps in determining whether we're hanging waiting for internal or exit paths. But it isn't necessary to back-port it.

I'll attach my continuous testing script, which could go in chutney or tor, if it would be useful. (Which one, Nick?)

I'm currently testing the failure rate of this code on OS X (i386 & x86_64), can others test on Linux & Windows?

This also probably needs some simple unit tests. Not quite sure how to write those.

Changed 4 years ago by teor

Attachment: continuous-test-network.sh added

Continuously run chutney until data transmission fails. Good for intermittent errors.

comment:8 Changed 4 years ago by nickm

Merged the chutney patch; will review the other one on the bus.

comment:9 Changed 4 years ago by nickm

Thanks! here are some initial thoughts:

42e4c18236068984c027ec1d737b34595ada8ace:

  • I kinda want an enum for the argument to router_have_minimum_dir_info(), rather than a boolean. It seems like it would be clearer that way. Or possibly, there should be two wrappers around it: have_minimum_dir_info_for_exit_circ(), have_minimum_dir_info_for_internal_circ().
  • The documentation for status_out in compute_frac_paths_available needs to be explicit about allocating a new string (by convention).

440b10ec29d19459376d380bdd659fc8c9d5bb26

  • Need to tweak messages in bootstrap_status_to_string to make them a little more human-comprehensible, or users will wonder what they mean.
  • Do we needs corresponding control-spec changes to document these statuses?
  • This needs a changes/ file too.

comment:10 Changed 4 years ago by teor

Happy to make these changes, Nick.

I've now seen the statuses pop up when launching TorBrowser using this build, so I understand the need to comprehensibility.

I kinda want an enum for the argument to router_have_minimum_dir_info(), rather than a boolean. It seems like it would be clearer that way. Or possibly, there should be two wrappers around it: have_minimum_dir_info_for_exit_circ(),

have_minimum_dir_info_for_internal_circ().

Is there the possibility of needing to calculate weights for guard, middle, and exit nodes in arbitrary combinations? (i.e. before choosing a guard node, ensure minimum guard bandwidth) If so, we could use a set of bit-shift flags.

If not, I'm happy to set up an enum with the two current values of Exit and Internal, and possibly an aliased value for those circumstances where we want a default option.

We may also need to update the status/enough-dir-info GETINFO control event - should we add status/enough-dir-info/exit and status/enough-dir-info/internal (we default status/enough-dir-info to exit for backwards compatibility).

comment:11 Changed 4 years ago by teor

I also wonder about the impact of changing the invocation of circuit_build_needed_circs() so that it runs when we know we have internal circuits, rather than waiting for exit circuits.

Should we split it into internal and exit versions?
If so, which types of circuits go in each category?

comment:12 in reply to:  10 Changed 4 years ago by nickm

Replying to teor:

Happy to make these changes, Nick.

I've now seen the statuses pop up when launching TorBrowser using this build, so I understand the need to comprehensibility.

I kinda want an enum for the argument to router_have_minimum_dir_info(), rather than a boolean. It seems like it would be clearer that way. Or possibly, there should be two wrappers around it: have_minimum_dir_info_for_exit_circ(),

have_minimum_dir_info_for_internal_circ().

Is there the possibility of needing to calculate weights for guard, middle, and exit nodes in arbitrary combinations? (i.e. before choosing a guard node, ensure minimum guard bandwidth) If so, we could use a set of bit-shift flags.

I don't think so. It would be likelier to have to calculate weights for different kinds of circuits, I imagine.

If not, I'm happy to set up an enum with the two current values of Exit and Internal, and possibly an aliased value for those circumstances where we want a default option.

Sounds good.

We may also need to update the status/enough-dir-info GETINFO control event - should we add status/enough-dir-info/exit and status/enough-dir-info/internal (we default status/enough-dir-info to exit for backwards compatibility).

Sounds fine, though it could be a separate ticket.

I also wonder about the impact of changing the invocation of circuit_build_needed_circs() so that it runs when we know we have internal circuits, rather than waiting for exit circuits.
Should we split it into internal and exit versions? If so, which types of circuits go in each category?

That's an interesting question, but it sounds like a separate ticket. Generally, anything that is a predicted circuit, or anything that might carry user traffic, is an exit circuit. Anything else is an internal circuit.

comment:13 Changed 4 years ago by teor

Split off #13813 for internal and exit sub-events to the status/enough-dir-info GETINFO control event.

Split off #13814 for building HS IP and other internal needed circuits earlier, once we can build internal paths.

comment:14 Changed 4 years ago by teor

Occasionally, the CPU load on my test machine will increase (or some other condition affecting the scheduler will occur), and a bootstrap race condition will cause the test to fail 50-100% of the time for a few hours. Then it will start working again.

The commands run are exactly the same each time. I'll be excluding these results from the tests, because they happen with or without the changes.

Perhaps lengthening some of the default intervals chutney uses would solve this?

Split this issue into #13823

comment:15 in reply to:  4 ; Changed 4 years ago by arma

Replying to teor:

I believe that an appropriate fix for this issue is to extend router_have_minimum_dir_info to take a parameter dir_info_purpose indicating what the dir info would be used for. (Or, perhaps, a set of flags for guard/middle/exit. Have to look into this.)

Another simpler hack might be to say that you don't have to think about whether you know about enough exits if there aren't any exits in the consensus you have.

comment:16 Changed 4 years ago by teor

Testing this patch appears to have revealed another bug where chutney-run tor authorities don't flag anything as an Exit (fixed in #13161 with TestingDirAuthVoteExit, perhaps related to #11264).

Perhaps we should test with:
TestingDirAuthVoteExit *
AssumeReachable 0

This will avoid the issues in #13161 / #11264, while still testing the reachability bootstrapping concerns of the OP.

I have logged this as a separate issue #13839, where the policy is accept * but no Exit flag is assigned.

comment:17 in reply to:  15 Changed 4 years ago by arma

Replying to arma:

Replying to teor:

I believe that an appropriate fix for this issue is to extend router_have_minimum_dir_info to take a parameter dir_info_purpose indicating what the dir info would be used for. (Or, perhaps, a set of flags for guard/middle/exit. Have to look into this.)

Another simpler hack might be to say that you don't have to think about whether you know about enough exits if there aren't any exits in the consensus you have.

In particular, I worry about the case where the guard knows what info the client has gotten, and sees the client start building circuits so it knows more about the purpose of these circuits.

So there's still value imo in waiting for circuit-building until we have all the network info that we need for a variety of actions. The bug here is that we have the wrong definition of "all the network info that we need" when the network has no exits. So we should be fixing that definition.

comment:18 Changed 4 years ago by teor

OK, so from reading the code, if the network has no exits:

The following circuits should still be built, and are internal:

  • Hidden Service Circuits (Server, Client, Introduction Point)
  • Socks Proxy Circuits (when connected to HSs)

The following circuits should be built, but aren't currently configured as internal:

  • Circuit Build Timeout Circuits

We could conditionally configure these as internal if there are no exits in the consensus.

The following circuits can never be built (and we shouldn't try, as it produces lots of errors):

  • Exit Circuits
  • Socks Proxy Circuits (when connected to Exits)

I'll change the current patch to:

  • if there are no exits in the consensus: build internal circuits as soon as we have enough info for internal circuits
  • if there are exits in the consensus: delay building internal circuits and build exit circuits when we have enough info for exit circuits

This involves a simple change in the modified router_have_minimum_dir_info to check for exits in the consensus.

comment:19 Changed 4 years ago by teor

This will, however, cause issues if we have a consensus that has exits, but we can't get (enough of) their descriptors. But this is no worse than the current behaviour.

comment:20 in reply to:  9 Changed 4 years ago by teor

To test this patch, I need to generate at least two consensuses:

  • one with no exits, which the exits can use to determine their own reachability based on internal paths
  • one with exits appropriately flagged after self-testing.

But MIN_VOTE_INTERVAL is 5 minutes, so I've defined MIN_VOTE_INTERVAL_TESTING 10. I've patched tor to use MIN_VOTE_INTERVAL_TESTING during testing. See #13823 for details.

comment:21 in reply to:  19 Changed 4 years ago by arma

Replying to teor:

This will, however, cause issues if we have a consensus that has exits, but we can't get (enough of) their descriptors. But this is no worse than the current behaviour.

I think I would call this part a feature: it reduces the impact from attacks by your guard to trickle information to you and then see when you start acting.

comment:22 Changed 4 years ago by nickm

Keywords: 026-deferrable added
Milestone: Tor: 0.2.???Tor: 0.2.6.x-final

I think there's got to be *some* version of this we can do in 0.2.6

comment:23 Changed 4 years ago by tom

I'm not able to do a code review right now, but I did run your branch on commit 440b10ec29d19459376d380bdd659fc8c9d5bb26 and stuck a relay into my no-exit network. The notices log is:

Dec 08 19:25:43.000 [notice] Bootstrapped 0%: Starting
Dec 08 19:25:43.000 [notice] This version of Tor (0.2.6.1-alpha-dev) is newer than any recommended version, according to the directory authorities. Recommended versions are: 0.2.4.23,0.2.4.24,0.2.5.6-alpha,0.2.5.7-rc,0.2.5.8-rc
Dec 08 19:25:43.000 [notice] Bootstrapped 35%: Asking for relay descriptors
Dec 08 19:25:43.000 [notice] We now have enough directory information to build internal circuits.
Dec 08 19:25:43.000 [notice] Bootstrapped 50%: Connecting to the Tor network (internal)
Dec 08 19:25:43.000 [notice] Bootstrapped 55%: Finishing handshake with first hop (internal)
Dec 08 19:25:43.000 [notice] We weren't able to find support for all of the TLS ciphersuites that we wanted to advertise. This won't hurt security, but it might make your Tor (if run as a client) more easy for censors to block.
Dec 08 19:25:43.000 [notice] To correct this, use a version of OpenSSL built with none of its ciphers disabled.
Dec 08 19:25:43.000 [notice] Bootstrapped 70%: Loading relay descriptors (internal)
Dec 08 19:29:48.000 [notice] I learned some more directory information, but not enough to build an exit circuit: We need more descriptors: we have 6/6, and can only build 0% of likely exit paths. (We have 100% of guards bw, 100% of midpoint bw, and 0% of exit bw = 0% of exit path bw.)

Previously I never got past the 50% mark, so that appears to be a good sign.

comment:24 Changed 4 years ago by teor

Thanks for these results, tom.

I am still working on an updated version of this branch, and can get it to 100% bootstrap.

But data transmission fails due to the following:

  • The authorities upload their descriptors, but appear to fail reachability self-testing (and therefore don't allow anything to exit? I need to confirm this).
  • The exit relay(s) appear to fail reachability self-testing (and therefore descriptor upload, so they don't appear in the consensus).
  • Even if reachability was working, I would also need the consensus to run again to pick up the changed descriptors - see #13823.
  • The client(s) complain they don't have any usable exits, so they fail to transmit data.

I will look into this within the next few days, and tweak a few things, then post an updated version of the code for comments (and, if I've made no progress, a plea for help).

comment:25 Changed 4 years ago by teor

I have made changes in #13823 that successfully has authorities create a consensus every 10 seconds in a chutney network. This should work as long as the clocks are strictly synchronised.

For comparison, the src/test/test-network.sh script allows 18 seconds for chutney to launch and do its tests, which is two of these consensuses.

comment:26 Changed 4 years ago by teor

The way tor determines reachability is broken for test, internal, and local networks.

When we're on a local address and DirAllowPrivateAddresses is 1, tor should check whether we're connecting to our own digest, or another router's.

Split off as #13924.

comment:27 Changed 4 years ago by teor

From #13823:
A relay doesn't re-publish its descriptor until up to 60 seconds elapses.
In a testing tor network, it should upload immediately when the ORPort or DirPort changes.

comment:28 Changed 4 years ago by teor

In a TestingTorNetwork, when TestingAuthDirTimeToLearnReachability is much lower than its normal value of 30 minutes, bootstrap will happen much more reliably if we test reachability at a proportionally faster rate.

Split off as #13929.

comment:29 Changed 4 years ago by teor

A relay with AssumeReachable 0 now makes it into the consensus after around 30-40 seconds.
The minimum time should be (each consensus is 10 seconds apart with a 4s lead-up):

  • 1 consensus to determine available relays, excluding this one (4s, initial consensus)
  • some time to download descriptors, and make an initial connection
  • some time to make a connection to ourselves for ORPort self-testing
  • some time to rebuild and upload descriptor
  • 1 consensus to determine available relays, including this one (4s-14s)

So if we can spend up to 18 seconds simply building consensuses, 30-40 seconds is a fair amount of elapsed time to complete this whole process.

This probably means we should increase the default src/test/test-network.sh bootstrap time to 40 seconds.

comment:30 Changed 4 years ago by teor

A relay with AssumeReachable 0 now makes it into the consensus after around 30-40 seconds, even without using TestingDirAuthVoteExit (from #13161). This means that it correctly:

  • determines that no exits are available in the consensus
  • continues to bootstrap with internal paths only
  • successfully self-tests reachability with an internal path

I'll attach authority and relay torrcs that exhibit this behaviour with my new code.

I still haven't solved #13839 - the authorities don't flag anything as an exit (at least until around 30 minutes later?). But I think I'll leave that for another time.

Now I'll review my changes and group them into commits with appropriate changes/* files. This could take me a week or so.

Last edited 4 years ago by teor (previous) (diff)

Changed 4 years ago by teor

Attachment: torrc added

authority torrc that enables relays to bootstrap in a network without exits

Changed 4 years ago by teor

Attachment: torrc.2 added

relay torrc that bootstraps in a network without exits

comment:31 Changed 4 years ago by teor

Cc: dgoulet added

comment:32 Changed 4 years ago by teor

#13839 can be resolved using TestingMinExitFlagThreshold 0, which causes authorities to ignore advertised / measured bandwidth when assigning the Exit flag.

A minor fix is required to router_is_active() to ignore measured bandwidth when assigning the Active flag (which is required for the Exit flag).

After this change:

An exit node with AssumeReachable 0 now makes it into the consensus after around 30-40 seconds. Clients then use this updated consensus from 40-60 seconds (the old consensus must expire first). The network uses TestingMinExitFlagThreshold 0 on the authorities, rather than TestingDirAuthVoteExit (from #13161).

This means that the exit node correctly: (20s-30s)

  • determines that no Exits are available in the consensus
  • continues to bootstrap with internal paths only
  • successfully self-tests reachability with an internal path
  • posts its descriptor to the authorities

And the authorities correctly: (30s-40s)

  • assign the Exit flag
  • distribute an updated consensus containing the Exit node

And the client correctly: (40s-60s)

  • requests the newest consensus
    • see #13963 for a fix that reduces this request time from 3 minutes to half the consensus interval, in networks with a low consensus interval.
  • determines that Exits are now available in the new consensus
  • starts building external paths

I'll attach authority, relay and client torrcs that exhibit this behaviour with the patches listed above.

Now I'll review my changes and group them into commits with appropriate changes/* files. This could take me a few days.

Last edited 4 years ago by teor (previous) (diff)

comment:33 Changed 4 years ago by teor

I now have the bootstrap process down to 30 seconds with changes to the TestingServerDownloadSchedule, TestingClientDownloadSchedule, and MIN_INITIAL_VOTE_INTERVAL_TESTING. I have defined MIN_INITIAL_VOTE_INTERVAL_TESTING as half of MIN_VOTE_INTERVAL_TESTING, as there is no previous consensus to clash with.

This should make dgoulet happy.

Updated attachments to follow.

Changed 4 years ago by teor

Attachment: torrc.authority added

authority torrc that bootstraps network with no exits to full functionality in 30 seconds

Changed 4 years ago by teor

Attachment: torrc.exit added

exit torrc that bootstraps network with no exits to full functionality in 30 seconds

Changed 4 years ago by teor

Attachment: torrc.client added

client torrc that bootstraps network with no exits to full functionality in 30 seconds

comment:34 Changed 4 years ago by teor

See also #13976, which would vastly simplify the configuration required to get rapid tor/chutney bootstraps to work.

comment:35 Changed 4 years ago by teor

Attaching example fast torrc files that will allow a chutney network to bootstrap in 8-10 seconds, as we can skip reachability testing, and give the Exit flag to everything.
Of course, these will be available in the corresponding chutney patch.
(Which I am working on right now.)

Changed 4 years ago by teor

Attachment: torrc.authority.fast added

authority torrc that bootstraps network to full functionality in 10 seconds

Changed 4 years ago by teor

Attachment: torrc.exit.fast added

exit torrc that bootstraps network to full functionality in 10 seconds

Changed 4 years ago by teor

Attachment: torrc.client.fast added

client torrc that bootstraps network to full functionality in 10 seconds

comment:36 Changed 4 years ago by teor

Cc: tom added
Status: assignedneeds_review

These changes to tor are included in commits in:

Bugs: #13718, #13814, maybe #13787, #13839, #13924, #13823, #13929, #13963
Branch: bug13718-fast-bootstrap
Note: There are 5 branches that start with bug13718, please choose the right one.
Repository: ​​​​​​​https://github.com/teor2345/tor.git

The corresponding changes to chutney are in:

Bugs: #13823
Branch: bug13823-fast-bootstrap
Repository: ​​​​​​​https://github.com/teor2345/chutney.git

The corresponding changes to various torspec documents are in:

Bugs: #13814
Branch: bug13814-no-exits-internal-circuits
Repository: ​​​​​​​https://github.com/teor2345/torspec.git

comment:37 Changed 4 years ago by teor

This may also have resolved #13935.

comment:38 Changed 4 years ago by nickm

Wow, there are a lot of tickets here now. I'm going to use this ticket, #13823, and #13814 for discussion of the branches.

Branch bug13718-fast-bootstrap:

  • 9c3be06e51cacfbe472042d17120805dd0a262bf is huge, and I'm not sure it takes the approach that Roger suggests on this ticket. I think Roger's idea was that if there are no exits running, we should treat our fraction of exit bandwidth as 100% rather than 0% when deciding whether we can build circuits. Roger, is that right?
  • 0fccbba33a9db171bfd7a0009f786d661719e22c looks okay.
  • de97197e4150c00292525299f4bbd3738abebe86 has me confused a bit. The DirAllowPrivateAddresses option is only about what directory authorities should do, but this patch seems to affect code paths that all relays and clients run. It also changes the semantics of the is_local flag. The documentation of that option and that flag need to change; and maybe the flag needs to get renamed if we're going to change the semantics here. (It's a bad idea to have a flag or function named "is_local" that really means "is local and local addresses are not allowed"). Also, what other effects does changing is_local have?
  • c9fe1c7ed8893a00aa9896838fa33af45711e389 contains unrelated changes in routerlist.c. Not a big deal, but next time, please use a separate commit for stuff like that.
  • 50bd973291d865fd9aa5ab46fbcfcaa943d9f575 seems like it could get refactored into a separate function, and that function could get some tests? I'm a little uncertain about poking this code without tests; it hasn't been touched in a long time.
  • The other commits look fine to me.

comment:39 Changed 4 years ago by teor

Notes for 9c3be06e51cacfbe472042d17120805dd0a262bf:

<teor> The minimal impact change is to treat "0 descriptors from 0 exits" as 1.0 (unlike guard and middle, where 0 from 0 is treated as 0.0).
<nickm> I think that the case where no exits exist is weird enough that it's okay to treat that special, as opposed to complicating cases that more people will encounter

Also split off "don't build doomed circuits" into another commit.

Last edited 4 years ago by teor (previous) (diff)

comment:40 Changed 4 years ago by teor

18:42] <teor> There are two scenarios:
[18:44] <teor> We can either give the test network a big push and say: "assume everything is reachable, a guard, and an exit" - this case bootstraps in 8-10 seconds
[18:45] <teor> Or we can leave the relays to bootstrap "naturally", and join a subsequent consensus once they have tested their reachability
[18:45] <teor> The second case is the issue we're trying to fix in the bug
[18:45] <nickm> ah
[18:45] <teor> The first case is a quick boot
[18:45] <teor> The second case is a comprehensive test
[18:45] <teor> (it takes around 25-30 seconds)
[18:46] <nickm> So, I can see the benefit of that, but I don't htink the right approach is to say that no connection is local. Instead maybe we should just say that connections to ourself are always nonlocal? Or something like that?
[18:48] <teor> When DirAllowPrivateAddresses is 0 on the authorities, the current code makes sense
[18:48] <teor> if I run two nodes on the one IP, they shouldn't declare themselves reachable just because they can connect to each other on 127.0.0.1
[18:50] <teor> (the current code is even stricter - if EnforceDistinctSubnets is 1, we have to connect to something outside our /24 before we're reachable)
[18:51] <nickm> right
[18:53] <nickm> Hm.
[18:53] <nickm> I see
[18:53] <teor> Another option is to disable the !channel_is_local() check when TestingTorNetwork is true
[18:53] <teor> This would leave the is_local flag alone, and just ignire it at the other end
[18:53] <nickm> That is one option, and maybe a good one.
[18:53] <teor> it's much cleaner
[18:53] <nickm> The other option is to rename is_local so that it reflects what it means in both cases.
[18:54] <nickm> I think that disabling that check, or adding a new function that does (TestingTorNetwork [or] channel_is_local()) is the right thing
[18:54] <nickm> at least, it sounds simple
[18:54] <teor> it is simple
[18:55] <teor> circuitbuild.c line 1381 is the only line that needs to change

Last edited 4 years ago by teor (previous) (diff)

comment:41 Changed 4 years ago by teor

<teor> So, nickm, if you wanted to merge everything but 9c3be06e51cacfbe472042d17120805dd0a262bf - Avoid building exit circuits, and de97197e4150c00292525299f4bbd3738abebe86 - Fix Reachability when DirAllowPrivateAddresses is 1, and 50bd973291d865fd9aa5ab46fbcfcaa943d9f575 - Increase reachability testing rate in faster Tor networks, I will make the fixes to the ones we've held back
<nickm> teor: can you make me a quick branch with the stuff I should merge today?
<teor> interactive rebase is my friend here

Today's changes to tor are included in commits in:

Bugs: #13823, #13963, #13839 (ignore bandwidth requirement)
Excluding: #13718 (and #13787), #13814, #13839 (reachability), #13924, #13929
Branch: bug13718-consensus-interval
Note: There are a number of branches that start with bug13718, please choose the right one.
Repository: ​​​​​​​​https://github.com/teor2345/tor.git

comment:42 Changed 4 years ago by teor

<teor> when #13718 is finished, you should be able to run a hidden service in a tor network without any exits.
<nickm> neat
<dgoulet> interesting
<teor> yeah, not sure what the implications are for the global tor network, but hs only need internal paths (we should use this as a way of testing #13718)

comment:43 Changed 4 years ago by teor

Hi Nick,

I agree, the first commit 9c3be06e51cacfbe472042d17120805dd0a262bf in branch bug13718-fast-bootstrap was huge. It included some mistaken commits, and didn't implement these changes in an obvious manner. ("Maintainability nightmare", anyone?) I completely revised this part of the patch, and split it into 4 significant-sized commits (and 3 minor tweaks to comments or debug statements - happy to make these a separate branch also).

I've also decided to leave #13928 & #13929 for later. They're not required for the fast bootstrap case (10s). They might be required for the comprehensive bootstrap case (30s) when we're not using TestingDirAuthVoteGuard * TestingDirAuthVoteExit *. (Although I really need to look into the guard case more.)

These changes to tor are included in commits in:

Bugs: #13718 (and maybe #13787), #13814, #13924
Branch: no-exit-bootstrap
Note: I got confused by all the branches starting with bug13718, so I picked an easier name.
Repository: ​​​​​​​https://github.com/teor2345/tor.git

The corresponding changes to chutney are in:

Bugs: #13823 (required for testing #13718 and friends)
Branch: bug13823-fast-bootstrap
Repository: ​​​​​​​https://github.com/teor2345/chutney.git

The corresponding changes to various torspec documents are in:

Bugs: #13814
Branch: bug13814-no-exits-internal-circuits
Repository: ​​​​​​​https://github.com/teor2345/torspec.git

comment:44 Changed 4 years ago by teor

dgoulet, can you test the chutney changes along with these updated tor changes?

I expect the fast (10s) case to work, and perhaps the comprehensive case to fail / be very slow, as it may require #13929.

comment:45 Changed 4 years ago by teor

Parent ID: #14034

comment:46 Changed 4 years ago by nickm

Hi! Everything below is something I can do when merging; I'd just like your feedback before I do so.

4a41976753c9e916acf12d6156be5fab7f534f07

  • The parenthesis bug me a little on the conditional. Maybe put the whole inequality on a single line. (minor)

fc92ba6b9c9d85c546b94096e6d9bf76a3f38e47

  • the message should probably be a notice again (minor)

f93b46adac531406e1cc13668ceecd3e5fca650a

  • The expression for setting have_path is pretty complex; I'd like to turn that into a function that uses a set of ifs. (minor)

Do you agree with these changes? If so I think I can go ahead and merge and tweak.

comment:47 in reply to:  46 Changed 4 years ago by teor

Replying to nickm:

Hi! Everything below is something I can do when merging; I'd just like your feedback before I do so.

4a41976753c9e916acf12d6156be5fab7f534f07

  • The parenthesis bug me a little on the conditional. Maybe put the whole inequality on a single line. (minor)

Oh yes, it looks ugly, doesn't it?
(It's funny how you look at your own work differently after a few days...)

Please rearrange as you see fit.

fc92ba6b9c9d85c546b94096e6d9bf76a3f38e47

  • the message should probably be a notice again (minor)

Happy to keep it as a notice. I bumped it down to info for being spammy. Then I implemented the suppression, which I tried to adjust to keep it from being too annoying on the public tor network. What do you think of the 60 second suppression interval?
(It applies to each kind of message separately, so the user can get a "we don't have enough descriptors" message, followed shortly by a message as soon as we do have enough descriptors.)

Also, the message tends to appear more often on test or private networks - where users are less likely to be alarmed by it anyway.

f93b46adac531406e1cc13668ceecd3e5fca650a

  • The expression for setting have_path is pretty complex; I'd like to turn that into a function that uses a set of ifs. (minor)

Yes, this makes sense since I used exactly the same code twice. And boolean expressions embedded within a ternary operator just look confusing to me now, even though I'm sure I understood what they did at the time.

Do you agree with these changes? If so I think I can go ahead and merge and tweak.

These sound like good changes - happy to make the code more readable. (At the time, I was more focused on chopping away at it until it did what dgoulet, tom and I wanted.)

I am looking forward to a working, quick-bootstrapping tor/chutney combination.

Thanks for your work reviewing all these commits and suggesting changes - I understand it's a sizeable set of changes that you've reviewed multiple times.

I'm really happy we managed to get such a dramatic improvement in the load times (and in time for 2.6.1, as long as nothing seriously breaks during alpha testing).

I also think it's interesting the bootstrap time now matches the original implementation of src/test/test-network.sh, which gives chutney 18 seconds to configure and bootstrap. Makes me suspect there was a version of tor that bootstrapped within about 10 seconds, and then something broke.

comment:48 Changed 4 years ago by teor

I've also been reminded by karsten that my changes files are straying from the standard format. That's what I get for repeatedly copying my own changes files - errors and quirks accumulate and proliferate. I'll revert to standard form in future branches.

comment:49 Changed 4 years ago by nickm

Woo. Tweaked, squashed, merged and closed. Thanks!

comment:50 Changed 4 years ago by teor

Nick, did the chutney and torspec changes make it in as well?

The corresponding changes to chutney are in:

Bugs: #13823 (required for testing #13718 and friends)
Branch: bug13823-fast-bootstrap
Repository: ​​​​​​​​https://github.com/teor2345/chutney.git

The corresponding changes to various torspec documents are in:

Bugs: #13814
Branch: bug13814-no-exits-internal-circuits
Repository: ​​​​​​​​https://github.com/teor2345/torspec.git

Merging these would let us close #13823 and #13718.

comment:51 Changed 4 years ago by nickm

will do; working on the changelogs now though

comment:52 Changed 4 years ago by teor

Resolution: fixed
Status: needs_reviewclosed

With the merge of the chutney and torspec changes in #13718, this issue is now resolved.

comment:53 Changed 10 months ago by teor

Cc: teor removed

Remove useless CC

Note: See TracTickets for help on using tickets.