Opened 7 years ago

Closed 7 years ago

#8367 closed defect (fixed)

Regression in git master: Can't bootstrap with bridges

Reported by: nickm
Owned by:
Priority: Very High
Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: tor-client bridges
Cc: arma, athena
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

I was trying to track down an issue in stegotorus when I found that current Tor master has a problem bootstrapping. I'm using this configuration:

bridge XX:995
bridge XX:443
Bridge XX:5251

CircuitBuildTimeout 60

UseBridges 1

(With real bridges, of course.)

After bisecting, it appears that d7089ff228227259137b5a8bc32d0764a0ad4155 is at fault. Oops -- that's one of mine. :p

The failure mode here is that we get up to 50% bootstrapped, then:

Feb 28 16:58:46.000 [notice] Bootstrapped 20%: Asking for networkstatus consensus.
Feb 28 16:58:46.000 [notice] Bootstrapped 50%: Loading relay descriptors.
Feb 28 16:58:47.000 [notice] Learned fingerprint XXX for bridge XX:5251.
Feb 28 16:58:47.000 [notice] Learned fingerprint XXX for bridge XX:443.

and then no further.

Any insights into what I screwed up in d7089ff228227259137b5a8bc32d0764a0ad4155?

Change History (10)

comment:1 Changed 7 years ago by nickm

If I revert that commit, it works okay again. So it's something in there...

comment:2 Changed 7 years ago by nickm

Priority: changed from major to critical
Status: changed from new to needs_review

Ah. choose_random_dirguard() calls choose_random_entry_impl with its for_directory argument set to true. That tells it to only use nodes that have is_dir_cache set to 1. But the bridges' node_ts apparently don't have is_dir_cache set to 1.
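The filtering described here can be illustrated with a minimal sketch. This is not Tor's code: toy_guard_t and toy_filter_guards() are hypothetical names, and the struct keeps only the one field that matters for this bug. It only shows why a list made up entirely of bridges with is_dir_cache == 0 comes out empty when for_directory is set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for a guard entry (not Tor's struct). */
typedef struct {
  const char *nickname;
  unsigned is_dir_cache : 1;
} toy_guard_t;

/* Store in `out` pointers to the guards that pass the filter and return how
 * many were kept. When for_directory is nonzero, any guard whose
 * is_dir_cache flag is 0 is skipped. `out` must have room for n pointers. */
size_t toy_filter_guards(const toy_guard_t *guards, size_t n,
                         int for_directory, const toy_guard_t **out)
{
  size_t kept = 0;
  assert(out != NULL);
  for (size_t i = 0; i < n; ++i) {
    if (for_directory && !guards[i].is_dir_cache)
      continue; /* the step that drops every bridge in this ticket */
    out[kept++] = &guards[i];
  }
  return kept;
}
```

With two bridges that both have is_dir_cache == 0, a call with for_directory set keeps nothing, while the same call with for_directory cleared keeps both.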

I have an easy fix in "bug8367" in my public repository. It doesn't need a changes file, since this isn't in any released version. Please review; we shouldn't put out 0.2.4.11-alpha until this is fixed.

(Can anybody else reproduce the bug?)

comment:3 Changed 7 years ago by dcf

I reproduce using master 9bc05c30d7b035766e89209e1075ee1bc66ccd4e and flashproxy-client:

Feb 28 16:54:49.000 [notice] Bootstrapped 5%: Connecting to directory server.
Feb 28 16:54:49.000 [notice] Bootstrapped 10%: Finishing handshake with directory server.
Feb 28 16:54:51.000 [notice] Learned fingerprint 86FA348B038B6A04F2F50135BF84BB74EF63485B for bridge 0.0.1.0:1 (with transport 'websocket').
Feb 28 16:54:51.000 [notice] Bootstrapped 15%: Establishing an encrypted directory connection.
Feb 28 16:54:52.000 [notice] Bootstrapped 20%: Asking for networkstatus consensus.
Feb 28 16:54:53.000 [notice] Bootstrapped 50%: Loading relay descriptors.
Feb 28 16:54:54.000 [notice] Bridge '3VXRyxz67OeRoqHn' has both an IPv4 and an IPv6 address.  Will prefer using its IPv4 address (0.0.1.0:1).
Feb 28 16:54:54.000 [notice] new bridge descriptor '3VXRyxz67OeRoqHn' (fresh): $86FA348B038B6A04F2F50135BF84BB74EF63485B~3VXRyxz67OeRoqHn at 0.0.1.0
Feb 28 16:54:54.000 [notice] Ignoring directory request, since no bridge nodes are available yet.
Feb 28 16:54:54.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Feb 28 16:55:48.000 [notice] Ignoring directory request, since no bridge nodes are available yet.
Feb 28 16:55:52.000 [notice] No circuits are opened. Relaxed timeout for a circuit with channel state open to 60000ms. However, it appears the circuit has timed out anyway. 0 guards are live.

And "bug8367" seems to fix it.

comment:4 Changed 7 years ago by andrea

I have a repro on this with 9bc05c30d7b035766e89209e1075ee1bc66ccd4e as bridge and client; testing nickm's bug8367 as client next.

comment:5 Changed 7 years ago by andrea

Hmm, I'm not managing to get bug8367 to work, though. This is with eea3115a5d9f58aea164776facaee9b44d2a16f0:

Mar 01 01:15:10.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Mar 01 01:15:10.000 [notice] We were built to run on a 64-bit CPU, with OpenSSL 1.0.1 or later, but with a version of OpenSSL that apparently lacks accelerated support for the NIST P-224 and P-256 groups. Building openssl with such support (using the enable-ec_nistp_64_gcc_128 option when configuring it) would make ECDH much faster.
Mar 01 01:15:11.000 [notice] Bootstrapped 5%: Connecting to directory server.
Mar 01 01:15:11.000 [notice] Bootstrapped 10%: Finishing handshake with directory server.
Mar 01 01:15:11.000 [notice] Bootstrapped 15%: Establishing an encrypted directory connection.
Mar 01 01:15:11.000 [notice] Bootstrapped 20%: Asking for networkstatus consensus.
Mar 01 01:15:11.000 [notice] Bootstrapped 50%: Loading relay descriptors.
Mar 01 01:15:11.000 [notice] new bridge descriptor 'AthenaTestBridge' (fresh): $CF86C2158900A988D1AD38B5D17ADDECF125B85C~AthenaTestBridge at 205.179.18.4
Mar 01 01:15:11.000 [notice] Ignoring directory request, since no bridge nodes are available yet.
Mar 01 01:15:11.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Mar 01 01:16:11.000 [notice] No circuits are opened. Relaxed timeout for a circuit with channel state open to 60000ms. However, it appears the circuit has timed out anyway. 0 guards are live.
Mar 01 01:16:12.000 [notice] Ignoring directory request, since no bridge nodes are available yet.
Mar 01 01:17:13.000 [notice] Ignoring directory request, since no bridge nodes are available yet.
Mar 01 01:18:14.000 [notice] Ignoring directory request, since no bridge nodes are available yet.
Mar 01 01:19:15.000 [notice] Ignoring directory request, since no bridge nodes are available yet.

comment:6 Changed 7 years ago by andrea

It looks like when choose_random_entry_impl() gets down to this bit, node_sl_choose_by_bandwidth() is always returning NULL:

   if (entry_list_is_constrained(options)) {
     /* We need to weight by bandwidth, because our bridges or entryguards
      * were not already selected proportional to their bandwidth. */
     node = node_sl_choose_by_bandwidth(live_entry_guards, WEIGHT_FOR_GUARD);
   } else {

Before that point node has a pointer to the bridge node.

comment:7 Changed 7 years ago by andrea

Right, okay: live_entry_guards is empty because the node was eliminated in the loop for having is_dir_cache == 0.
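As a rough illustration of why an empty candidate list ends in a NULL result, here is a toy bandwidth-weighted chooser. It is only a sketch under stated assumptions: toy_node_t and toy_choose_by_bandwidth() are hypothetical, the `roll` argument stands in for a random number so the behaviour is deterministic, and the uniform fallback for an all-zero list is a choice made for this example rather than a description of node_sl_choose_by_bandwidth().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node record (not Tor's node_t). */
typedef struct {
  const char *nickname;
  uint32_t bandwidth;
} toy_node_t;

/* Pick a node with probability proportional to its bandwidth.
 * `roll` is a caller-supplied number standing in for an RNG draw.
 * Returns NULL when the list is empty: there is nothing to choose from. */
const toy_node_t *toy_choose_by_bandwidth(const toy_node_t *const *nodes,
                                          size_t n, uint64_t roll)
{
  uint64_t total = 0;

  if (n == 0)
    return NULL;
  for (size_t i = 0; i < n; ++i)
    total += nodes[i]->bandwidth;
  if (total == 0)
    return nodes[roll % n]; /* toy fallback: uniform pick */

  roll %= total;
  for (size_t i = 0; i < n; ++i) {
    if (roll < nodes[i]->bandwidth)
      return nodes[i];
    roll -= nodes[i]->bandwidth;
  }
  return NULL; /* not reached: roll < total guarantees a hit above */
}
```

Once the earlier filter has removed the only bridge, the list passed in has length 0, so the result is NULL no matter what `node` pointed to before the call.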

comment:8 Changed 7 years ago by andrea

At the point in the loop in choose_random_entry_impl() that checks is_dir_cache, I have this:

(gdb) print *entry
$4 = {nickname = "AthenaTestBridge\000\000\000",
  identity = "\317\206\302\025\211\000\251\210\321\255\070\265\321z\335\354\361%\270\\", chosen_on_date = 1361581336,
  chosen_by_version = 0xdbef10 "0.2.4.10-alpha-dev", made_contact = 1, can_retry = 0, path_bias_noticed = 0,
  path_bias_warned = 0, path_bias_extreme = 0, path_bias_disabled = 0, path_bias_use_noticed = 0,
  path_bias_use_extreme = 0, is_dir_cache = 0, bad_since = 0, unreachable_since = 0, last_attempted = 0,
  circ_attempts = 0, circ_successes = 0, successful_circuits_closed = 0, collapsed_circuits = 0,
  unusable_circuits = 0, timeouts = 0, use_attempts = 0, use_successes = 0}

comment:9 Changed 7 years ago by andrea

Okay, the problem was that in add_an_entry_guard(), nickm's patch set is_dir_cache only in the case where the guard is already in the list and entry_guard_get_by_id_digest() returns non-NULL, but not in the similar code a few lines below, where the guard isn't found and the chosen node gets added. I surmise that dcf's test of nickm's branch worked because he already had some cached consensus data, so it called entry_guard_set_status() via entry_guards_compute_status() and directory_info_has_arrived(). That is not possible on a clean bootstrap if this bug prevents ever making contact with a directory server.

Anyway, the fix is easy: modify the other case in add_an_entry_guard() as in nickm's version. See the bug8367 branch in my repository, which works starting ex nihilo for me.
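The shape of that fix can be sketched as follows. This is a simplified model, not the actual patch: toy_guard_list_t, toy_guard_find() and toy_add_guard() are hypothetical, and how the real code derives the flag's value is not shown in this ticket, so here the caller simply passes it in. The point is only that the flag is assigned on both the "already in the list" path and the "newly added" path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_GUARDS 8

/* Hypothetical guard entry and fixed-size guard list (not Tor's types). */
typedef struct {
  char id[16];
  unsigned is_dir_cache : 1;
} toy_guard_t;

typedef struct {
  toy_guard_t guards[TOY_MAX_GUARDS];
  size_t n;
} toy_guard_list_t;

/* Return the guard with this id, or NULL if it is not in the list. */
static toy_guard_t *toy_guard_find(toy_guard_list_t *list, const char *id)
{
  for (size_t i = 0; i < list->n; ++i) {
    if (strcmp(list->guards[i].id, id) == 0)
      return &list->guards[i];
  }
  return NULL;
}

/* Add a guard, or refresh it if already present. Returns NULL when full. */
toy_guard_t *toy_add_guard(toy_guard_list_t *list, const char *id,
                           int node_is_dir_cache)
{
  toy_guard_t *g;
  assert(list != NULL && id != NULL);

  g = toy_guard_find(list, id);
  if (g) {
    /* Existing guard: the path the first patch already covered. */
    g->is_dir_cache = node_is_dir_cache ? 1 : 0;
    return g;
  }
  if (list->n == TOY_MAX_GUARDS)
    return NULL;

  g = &list->guards[list->n++];
  memset(g, 0, sizeof(*g));
  snprintf(g->id, sizeof(g->id), "%s", id);
  /* New guard: the path that was missed, so the flag stayed 0. */
  g->is_dir_cache = node_is_dir_cache ? 1 : 0;
  return g;
}
```

Starting from an empty list, as on a clean bootstrap, the first call takes the "new guard" path, which is why leaving the assignment out there kept the flag at 0.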

comment:10 Changed 7 years ago by nickm

Resolution: fixed
Status: changed from needs_review to closed

Thanks for tracking that down! I've tweaked the commit message slightly and merged to maint-0.2.4 and master.
