Opened 3 months ago

Closed 7 weeks ago

Last modified 2 weeks ago

#21415 closed defect (fixed)

tor_bug_occurred_: Bug: src/or/entrynodes.c:1845: select_entry_guard_for_circuit: Non-fatal assertion !(!guard_has_descriptor(guard)) failed.

Reported by: cypherpunks
Owned by: nickm
Priority: Medium
Milestone: Tor: 0.3.0.x-final
Component: Core Tor/Tor
Version: Tor: 0.3.0.3-alpha
Severity: Normal
Keywords: 030-backport
Cc: isis
Actual Points:
Parent ID:
Points:
Reviewer: asn
Sponsor:

Description

Seen while using public obfs4 bridges.

Feb 08 12:13:40.000 [warn] tor_bug_occurred_: Bug: src/or/entrynodes.c:1845: select_entry_guard_for_circuit: Non-fatal assertion !(!guard_has_descriptor(guard)) failed. (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug: Non-fatal assertion !(!guard_has_descriptor(guard)) failed in select_entry_guard_for_circuit at src/or/entrynodes.c:1845. Stack trace: (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     0   tor                                 0x000000010f4e3f98 log_backtrace + 73 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     1   tor                                 0x000000010f4f852b tor_bug_occurred_ + 268 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     2   tor                                 0x000000010f40f587 entry_guard_pick_for_circuit + 307 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     3   tor                                 0x000000010f4732b5 guards_choose_guard + 138 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     4   tor                                 0x000000010f400c16 circuit_establish_circuit + 2261 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     5   tor                                 0x000000010f411f1b circuit_build_needed_circs + 751 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     6   tor                                 0x000000010f47f4ef second_elapsed_callback + 811 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     7   libevent-2.1.6.dylib                0x000000010f641127 event_process_active_single_queue + 1262 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     8   libevent-2.1.6.dylib                0x000000010f63d9d6 event_base_loop + 1189 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     9   tor                                 0x000000010f47eed4 do_main_loop + 1118 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     10  tor                                 0x000000010f4810cd tor_main + 235 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     11  tor                                 0x000000010f3e9775 main + 25 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 12:13:40.000 [warn] Bug:     12  libdyld.dylib                       0x00007fffbc194255 start + 1 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)

Child Tickets

Attachments (2)

info.log (79.9 KB) - added by teor 3 months ago.
Log from 004r in hs-ipv6 network
info.2.log (144.9 KB) - added by teor 3 months ago.
log from 008br in bridges+hs on Linux


Change History (33)

comment:1 Changed 3 months ago by nickm

  • Milestone set to Tor: 0.3.0.x-final

comment:2 Changed 3 months ago by cypherpunks

Reproduced. Posting complete log this time.

Feb 08 13:36:10.759 [notice] Tor 0.3.0.3-alpha (git-bb2ea3642d54ff03) running on Darwin with Libevent 2.1.8-stable, OpenSSL 1.0.2k and Zlib 1.2.11.
Feb 08 13:36:10.760 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
Feb 08 13:36:10.760 [notice] This version is not a stable Tor release. Expect more bugs than usual.
Feb 08 13:36:10.760 [notice] Read configuration file "/Users/tor/.torrc".
Feb 08 13:36:10.765 [notice] Opening Socks listener on 127.0.0.1:9050
Feb 08 13:36:10.000 [notice] Parsing GEOIP IPv4 file /.../tor/geoip.
Feb 08 13:36:10.000 [notice] Parsing GEOIP IPv6 file /.../tor/geoip6.
Feb 08 13:36:10.000 [notice] Bootstrapped 0%: Starting
Feb 08 13:36:11.000 [notice] Starting with guard context "bridges"
Feb 08 13:36:11.000 [notice] new bridge descriptor '...
Feb 08 13:36:11.000 [notice] new bridge descriptor '...
Feb 08 13:36:11.000 [notice] new bridge descriptor '...
Feb 08 13:36:11.000 [notice] Delaying directory fetches: Pluggable transport proxies still configuring
Feb 08 13:36:12.000 [notice] Ignoring directory request, since no bridge nodes are available yet.
Feb 08 13:36:12.000 [notice] Bootstrapped 80%: Connecting to the Tor network
Feb 08 13:36:12.000 [notice] Bootstrapped 85%: Finishing handshake with first hop
Feb 08 13:36:xx.000 [warn] Proxy Client: unable to connect to ... ("general SOCKS server failure")
Feb 08 13:36:xx.000 [warn] Proxy Client: unable to connect to ... ("general SOCKS server failure")
Feb 08 13:37:xx.000 [notice] Ignoring directory request, since no bridge nodes are available yet.
Feb 08 13:37:xx.000 [warn] Proxy Client: unable to connect to ... ("general SOCKS server failure")
Feb 08 13:37:xx.000 [warn] Proxy Client: unable to connect to ... ("general SOCKS server failure")
Feb 08 13:37:xx.000 [warn] Proxy Client: unable to connect to ... ("general SOCKS server failure")
Feb 08 13:37:xx.000 [warn] Proxy Client: unable to connect to ... ("general SOCKS server failure")
Feb 08 13:37:30.000 [warn] tor_bug_occurred_: Bug: src/or/entrynodes.c:1845: select_entry_guard_for_circuit: Non-fatal assertion !(!guard_has_descriptor(guard)) failed. (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug: Non-fatal assertion !(!guard_has_descriptor(guard)) failed in select_entry_guard_for_circuit at src/or/entrynodes.c:1845. Stack trace: (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     0   tor                                 0x0000000103a98f98 log_backtrace + 73 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     1   tor                                 0x0000000103aad52b tor_bug_occurred_ + 268 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     2   tor                                 0x0000000103a24587 entry_guard_pick_for_circuit + 307 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     3   tor                                 0x0000000103a282b5 guards_choose_guard + 138 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     4   tor                                 0x00000001039b5c16 circuit_establish_circuit + 2261 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     5   tor                                 0x00000001039c6f1b circuit_build_needed_circs + 751 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     6   tor                                 0x0000000103a344ef second_elapsed_callback + 811 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     7   libevent-2.1.6.dylib                0x0000000103bf6127 event_process_active_single_queue + 1262 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     8   libevent-2.1.6.dylib                0x0000000103bf29d6 event_base_loop + 1189 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     9   tor                                 0x0000000103a33ed4 do_main_loop + 1118 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     10  tor                                 0x0000000103a360cd tor_main + 235 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     11  tor                                 0x000000010399e775 main + 25 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:30.000 [warn] Bug:     12  libdyld.dylib                       0x00007fffbc194255 start + 1 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] tor_bug_occurred_: Bug: src/or/entrynodes.c:1845: select_entry_guard_for_circuit: Non-fatal assertion !(!guard_has_descriptor(guard)) failed. (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug: Non-fatal assertion !(!guard_has_descriptor(guard)) failed in select_entry_guard_for_circuit at src/or/entrynodes.c:1845. Stack trace: (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     0   tor                                 0x0000000103a98f98 log_backtrace + 73 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     1   tor                                 0x0000000103aad52b tor_bug_occurred_ + 268 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     2   tor                                 0x0000000103a24587 entry_guard_pick_for_circuit + 307 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     3   tor                                 0x0000000103a282b5 guards_choose_guard + 138 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     4   tor                                 0x00000001039b5c16 circuit_establish_circuit + 2261 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     5   tor                                 0x00000001039c6f1b circuit_build_needed_circs + 751 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     6   tor                                 0x0000000103a344ef second_elapsed_callback + 811 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     7   libevent-2.1.6.dylib                0x0000000103bf6127 event_process_active_single_queue + 1262 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     8   libevent-2.1.6.dylib                0x0000000103bf29d6 event_base_loop + 1189 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     9   tor                                 0x0000000103a33ed4 do_main_loop + 1118 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     10  tor                                 0x0000000103a360cd tor_main + 235 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     11  tor                                 0x000000010399e775 main + 25 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Bug:     12  libdyld.dylib                       0x00007fffbc194255 start + 1 (on Tor 0.3.0.3-alpha bb2ea3642d54ff03)
Feb 08 13:37:31.000 [warn] Failed to find node for hop 0 of our path. Discarding this circuit.
Feb 08 13:37:xx.000 [warn] Proxy Client: unable to connect to ... ("general SOCKS server failure")
Feb 08 13:37:45.000 [notice] Interrupt: exiting cleanly.

comment:3 Changed 3 months ago by nickm

  • Owner set to nickm
  • Status changed from new to assigned

comment:4 follow-up: Changed 3 months ago by nickm

The warning seems to be coming from here, in select_entry_guard_for_circuit()...

  SMARTLIST_FOREACH_BEGIN(gs->primary_entry_guards, entry_guard_t *, guard) {
    entry_guard_consider_retry(guard);
    if (! entry_guard_obeys_restriction(guard, rst))
      continue;
    if (guard->is_reachable != GUARD_REACHABLE_NO) {
      if (need_descriptor && BUG(!guard_has_descriptor(guard))) {
        continue;
      }

And this, in turn, looks like a problem with our 21242 code. We're not supposed to be calling select_entry_guard_for_circuit() with need_descriptor set while guard_selection_have_enough_dir_info_to_build_circuits() is false, right?

Though hm, that function only checks to make sure that the first num_primary possibly-reachable guards all have descriptors. If enough of them seem down, there's a decent chance we'll end up in a position where we'd need to update our 'can build circuits' flag.

If that's so, then the right fix here is probably to remove the BUG() wrapper from the check.
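
For illustration, a minimal sketch of what that could look like, against the loop quoted above (the log call and its message are illustrative guesses, not the actual patch):

  SMARTLIST_FOREACH_BEGIN(gs->primary_entry_guards, entry_guard_t *, guard) {
    entry_guard_consider_retry(guard);
    if (! entry_guard_obeys_restriction(guard, rst))
      continue;
    if (guard->is_reachable != GUARD_REACHABLE_NO) {
      /* Was: if (need_descriptor && BUG(!guard_has_descriptor(guard))).
       * Without BUG(), a missing descriptor becomes an expected, recoverable
       * condition instead of producing a warning with a stack trace. */
      if (need_descriptor && !guard_has_descriptor(guard)) {
        log_info(LD_GUARD, "Skipping primary guard: no descriptor yet.");
        continue;
      }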

comment:5 in reply to: ↑ 4 Changed 3 months ago by asn

Replying to nickm:

The warning seems to be coming from here, in select_entry_guard_for_circuit()...

  SMARTLIST_FOREACH_BEGIN(gs->primary_entry_guards, entry_guard_t *, guard) {
    entry_guard_consider_retry(guard);
    if (! entry_guard_obeys_restriction(guard, rst))
      continue;
    if (guard->is_reachable != GUARD_REACHABLE_NO) {
      if (need_descriptor && BUG(!guard_has_descriptor(guard))) {
        continue;
      }

And this, in turn, looks like a problem with our 21242 code. We're not supposed to be calling select_entry_guard_for_circuit() with need_descriptor set while guard_selection_have_enough_dir_info_to_build_circuits() is false, right?

Though hm, that function only checks to make sure that the first num_primary possibly-reachable guards all have descriptors. If enough of them seem down, there's a decent chance we'll end up in a position where we'd need to update our 'can build circuits' flag.

If that's so, then the right fix here is probably to remove the BUG() wrapper from the check.

Hmm, it's an interesting point that guard_selection_have_enough_dir_info_to_build_circuits() only checks for descriptors of reachable guards.

This means that if one of our primary guards was considered unreachable (see the Proxy Client errors in comment:2), we would still get past guard_selection_have_enough_dir_info_to_build_circuits() even if we don't have its descriptor. Then after a few minutes, when that primary guard gets marked for retry and we still don't have its descriptor, we could potentially trigger the non-fatal assert.

I'm still a bit hazy on the code flow here, and on whether the above is possible, or whether there are alternative problems here.

Your fix suggestion seems plausible here, but I'd like some more confidence that we have found the right bug. BTW note that the bug reporter is reproducing this bug using obfs4 bridges, so IIUC we first call select_entry_guard_for_circuit() with need_descriptor==False (to get its descriptor) and then again with need_descriptor==True (to actually do traffic).

comment:6 Changed 3 months ago by teor

I'm seeing this intermittently on master in a relay on the hs-ipv6 chutney network.
I'll attach the info-level log.

Changed 3 months ago by teor

Log from 004r in hs-ipv6 network

comment:7 Changed 3 months ago by teor

In #21371 I updated chutney so make test-network will show you any unexpected warnings. (Unexpected warnings don't cause the tests in make test-network-all to fail, so you'd have to grep its logs to find any bugs.)

comment:8 Changed 3 months ago by teor

I am also seeing this in 008br in the bridges+hs network.

Changed 3 months ago by teor

log from 008br in bridges+hs on Linux

comment:9 Changed 3 months ago by teor

Also 011h. I won't include the log this time.
This seems to happen around 5-10% of the time.
I can capture a debug log if needed.

comment:10 follow-up: Changed 3 months ago by nickm

  • Status changed from assigned to needs_review

bug21415_030 is the fix I suggested above, of which asn said:

Your fix suggestion seems plausible here, but I'd like some more confidence that we have found the right bug.

Asn: Based on Teor's logs, do you think this is the right fix?

comment:11 Changed 3 months ago by nickm

  • Keywords review-group-16 added

comment:12 Changed 3 months ago by nickm

  • Reviewer set to asn

comment:13 in reply to: ↑ 10 Changed 3 months ago by asn

Replying to nickm:

bug21415_030 is the fix I suggested above, of which asn said:

Your fix suggestion seems plausible here, but I'd like some more confidence that we have found the right bug.

Asn: Based on Teor's logs, do you think this is the right fix?

Ugh, I've been digging in the logs for at least an hour trying to verify our theory, but it doesn't seem to apply straightforwardly.

Here is the primary guard list from info.2.log:

Feb 27 05:21:23.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Feb 27 05:21:23.000 [info] entry_guards_update_primary():   1/3: test004r ($A92594E93FEE5D83D12D977460B1DA47916B28A6) (confirmed)
Feb 27 05:21:23.000 [info] entry_guards_update_primary():   2/3: test005r ($EF23A3601B2234F32EF6673DDCD4ECE6040327B7) (confirmed)
Feb 27 05:21:23.000 [info] entry_guards_update_primary():   3/3: test006r ($3AF6D56A2B57BE95F4D097035473E06109E426D4)

Here are guard successes for all the primaries:

Feb 27 05:21:29.000 [info] entry_guards_note_guard_success(): Recorded success for primary confirmed guard test004r ($A92594E93FEE5D83D12D977460B1DA47916B28A6)
...
Feb 27 05:21:34.000 [info] entry_guards_note_guard_success(): Recorded success for primary confirmed guard test005r ($EF23A3601B2234F32EF6673DDCD4ECE6040327B7)
...
Feb 27 05:21:34.000 [info] entry_guards_note_guard_success(): Recorded success for primary confirmed guard test006r ($3AF6D56A2B57BE95F4D097035473E06109E426D4)

and then here is the assert a few seconds afterwards:

Feb 27 05:21:40.000 [info] choose_good_exit_server_general(): Chose exit server '$A92594E93FEE5D83D12D977460B1DA47916B28A6~test004r at 127.0.0.1'
Feb 27 05:21:40.000 [info] extend_info_from_node(): Including Ed25519 ID for $A92594E93FEE5D83D12D977460B1DA47916B28A6~test004r at 127.0.0.1
Feb 27 05:21:40.000 [warn] tor_bug_occurred_(): Bug: src/or/entrynodes.c:1845: select_entry_guard_for_circuit: Non-fatal assertion !(!guard_has_descriptor(guard)) failed. (on Tor 0.3.1.0-alpha-dev efa5bbaba07d20d1)
[snip]
Feb 27 05:21:40.000 [info] select_entry_guard_for_circuit(): Selected primary guard test006r ($3AF6D56A2B57BE95F4D097035473E06109E426D4) for circuit.

which I assume was generated because of test005r missing a descriptor, since we chose test004r (our first primary) as an exit node, and then we ended up choosing test006r (our last primary) as the entry node after the assert triggered...

But then how come a circuit to test005r succeeded just 6 seconds before the assert triggered? Why didn't we get the assert at that point (maybe it was a directory circuit?)? And how did we get past entry_guards_have_enough_dir_info_to_build_circuits()? I don't see test005r being marked as unreachable anywhere before that.

Could this be caused by some bridge weirdness? But then how come it was reproduced in the hs-ipv6 network?

Some more digging is required.

comment:14 Changed 3 months ago by asn

I think the bug has been found.

Here it is:

  const int num_primary = get_n_primary_guards_to_use(GUARD_USAGE_TRAFFIC);
...
  SMARTLIST_FOREACH_BEGIN(gs->primary_entry_guards, entry_guard_t *, guard) {
...
    n_considered++;
...
    if (n_considered >= num_primary)
      break;
  } SMARTLIST_FOREACH_END(guard);

In the above, it seems like get_n_primary_guards_to_use() returns 1 (because of DFLT_N_PRIMARY_GUARDS_TO_USE), and then we only actually check the first primary guard for a descriptor and break after that.

This way it's possible that the second primary guard actually has no descriptor, and if for some reason we end up picking it (e.g. because we picked the first primary guard as the exit node) we will get the assert.

What's the right fix here? Maybe to set num_primary to smartlist_len(gs->primary_entry_guards) and test all guards for descriptors before proceeding? What else?
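
For concreteness, the option floated above amounts to roughly a one-line change (a sketch only, not the final patch):

  /* Sketch: consider every primary guard, instead of breaking out after
   * get_n_primary_guards_to_use() of them (which can be as low as 1
   * because of DFLT_N_PRIMARY_GUARDS_TO_USE). */
  const int num_primary = smartlist_len(gs->primary_entry_guards);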

comment:15 follow-up: Changed 3 months ago by nickm

Either that, or just adding 1 to num_primary?

And in this case, should we leave the BUG() check in for now?

comment:16 in reply to: ↑ 15 ; follow-up: Changed 3 months ago by teor

Replying to nickm:

Either that, or just adding 1 to num_primary?

Circuits can be up to 8 hops long in pathological cases:

Having only 2 primary guards with descriptors fails in the scenario where we pick one guard as the exit, and the other as the middle.

Having only 3 primary guards with descriptors fails in the scenario where we cannibalise a circuit, and one guard is the exit, and the others are middles.

Having only 3 to (max_intro_points + 2) = 12 primary guards with descriptors fails in the scenario where a client fails to connect to an intro point, and so it extends to the next intro point repeatedly, up to 10 intro points. Of course, we're only allowed to send 8 RELAY_EARLY cells, so this is bounded by the maximum path length.

Really, we need to check that we have path_length primary guards with descriptors available, or if we don't know the path length at this point, max_path_length. And we should BUG out if the actual path length is greater than this.

And in this case, should we leave the BUG() check in for now?

Oh gosh yes!

comment:17 in reply to: ↑ 16 Changed 3 months ago by teor

Replying to teor:

Replying to nickm:

Either that, or just adding 1 to num_primary?

Circuits can be up to 8 hops long in pathological cases:

Having only 2 primary guards with descriptors fails in the scenario where we pick one guard as the exit, and the other as the middle.

Having only 3 primary guards with descriptors fails in the scenario where we cannibalise a circuit, and one guard is the exit, and the others are middles.

Having only 3 to (max_intro_points + 2) = 12 primary guards with descriptors fails in the scenario where a client fails to connect to an intro point, and so it extends to the next intro point repeatedly, up to 10 intro points. Of course, we're only allowed to send 8 RELAY_EARLY cells, so this is bounded by the maximum path length.

These are wrong: we pick exit, guard, middle(s).

But having only N primary guards with descriptors fails in the scenario where all our primary guards are in the exit's IPv4 /16, or the exit's family.

comment:18 Changed 3 months ago by asn

OK guys please see my branch bug21415.

There is no best solution here, so my branch takes the approach mainly discussed yesterday on IRC, which is to test 2 of the 3 primary guards for a descriptor, and also to remove the BUG() check.

The idea is that we test 2/3 and not 3/3 to reduce the performance penalty at startup (?).

And we remove the BUG() because we know of cases that can leave us without a needed primary guard descriptor if we only check 2/3. If we left the BUG() in, we would get bug reports about these rare cases and we wouldn't know whether each one was actually a bug or just one of the pathological cases we already know about.

Please let me know if you hate my patch and suggest alternatives.
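
In rough outline, that approach could look like the following inside the descriptor check (a sketch; names such as num_primary_to_check and n_missing_descriptors are guesses, and branch bug21415 has the real code):

  /* Sketch: require descriptors for one more primary guard than we would
   * normally use for traffic (2 of the 3 primaries with the defaults),
   * counting missing descriptors instead of BUG()ing on them. */
  int n_missing_descriptors = 0;
  int n_considered = 0;
  const int num_primary_to_check =
    get_n_primary_guards_to_use(GUARD_USAGE_TRAFFIC) + 1;

  SMARTLIST_FOREACH_BEGIN(gs->primary_entry_guards, entry_guard_t *, guard) {
    entry_guard_consider_retry(guard);
    if (guard->is_reachable == GUARD_REACHABLE_NO)
      continue;
    n_considered++;
    if (!guard_has_descriptor(guard))
      n_missing_descriptors++;
    if (n_considered >= num_primary_to_check)
      break;
  } SMARTLIST_FOREACH_END(guard);

  return n_missing_descriptors == 0;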

comment:19 Changed 3 months ago by nickm

  • Resolution set to fixed
  • Status changed from needs_review to closed

Thanks, Asn! Cherry-picked to maint-0.3.0. (Remember, master is now 0.3.1.)

comment:20 Changed 3 months ago by asn

  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening this ticket due to the bridges+ipv6-min integration test failure.

comment:21 Changed 3 months ago by asn

Some recon notes: it seems like the bridges+ipv6-min test is super funky. The client is configured with the two bridges below:

Bridge 127.0.0.1:5003
Bridge [::1]:5003

which is actually the same bridge, but it creates two entry guard entries.

On the other hand, the bridge seems to only be configured with a [::1] listener:

OrPort [::1]:5003 IPv6Only

Then we try to fetch the descriptor of the [::1] bridge but it gets assigned to the 127.0.0.1 bridge after all:

Mar 02 13:33:55.000 [info] connection_dir_client_reached_eof(): Received server info (size 2232) from server '::1:5003'
Mar 02 13:33:55.000 [info] router_load_routers_from_string(): 1 elements to add
Mar 02 13:33:55.000 [notice] Bridge 'test003br' has both an IPv4 and an IPv6 address.  Will prefer using its IPv4 address (127.0.0.1:5003) based on the configured Bridge address.

Then after that, we drop any further descriptors for that bridge, regardless of whether they came via 127.0.0.1 or [::1]:

[info] connection_dir_client_reached_eof(): Received server info (size 2232) from server '127.0.0.1:5003'
[info] router_load_routers_from_string(): 1 elements to add
[info] router_add_to_routerlist(): Dropping descriptor that we already have for router $E033927EA8B93CC81235B185206C667358983BDA~test003br at 127.0.0.1
...
[info] connection_dir_client_reached_eof(): Received server info (size 2232) from server '::1:5003'
[info] router_load_routers_from_string(): 1 elements to add
[info] router_add_to_routerlist(): Dropping descriptor that we already have for router $E033927EA8B93CC81235B185206C667358983BDA~test003br at 127.0.0.1
Last edited 3 months ago by asn

comment:22 follow-up: Changed 3 months ago by asn

Some thoughts forward:

a) Should Tor create two entry guard entries for this bridge at [::1] and 127.0.0.1? I say probably yes.

b) However, we can't expect the bridge to create two descriptors, right? So maybe Tor should accept that bridge descriptor and assign it to both entry guard entries, instead of rejecting it because "we already have" it?

c) If the test is ipv6-only and the bridge is only configured as ipv6, why does the client prefer the ipv4 address of the bridge? That's regarding the log line: [notice] Bridge 'test003br' has both an IPv4 and an IPv6 address. Will prefer using its IPv4 address (127.0.0.1:5003) based on the configured Bridge address.

What else is going weird in this test?

comment:23 Changed 3 months ago by asn

  • Status changed from reopened to needs_review

OK, after about 4 hours of debugging, I present to you a fix in my branch bug21415_testfix. We are dealing with a pretty hairy multi-component part of the codebase, so careful consideration is required.

Here is the commit message that might help you:

The bridges+ipv6-min integration test has a client with bridges:
    Bridge 127.0.0.1:5003
    Bridge [::1]:5003
which got stuck in guard_selection_have_enough_dir_info_to_build_circuits()
because it couldn't find the descriptor of both bridges.

Specifically, the guard_has_descriptor() function could not find the
node_t of the [::1] bridge, because the [::1] bridge had no identity
digest assigned to it.

After further examination, it seems that during fetching the descriptor
for our bridges, we used the CERTS cell to fill the identity digest of
127.0.0.1:5003 properly. However, when we received a CERTS cell from
[::1]:5003 we actually ignored its identity digest because the
learned_router_identity() function was using
get_configured_bridge_by_addr_port_digest() which was returning the
127.0.0.1 bridge instead of the [::1] bridge (because it prioritizes
digest matching over addrport matching).

The fix replaces get_configured_bridge_by_addr_port_digest() with the
recent get_configured_bridge_by_exact_addr_port_digest() function. It
also relaxes the constraints of the
get_configured_bridge_by_exact_addr_port_digest() function by making it
return bridges whose identity digest is not yet known.

By using the _exact_() function, learned_router_identity() actually
fills in the identity digest of the [::1] bridge, which then allows
guard_has_descriptor() to find the right node_t and verify that the
descriptor is there.

FWIW, in the bridges+ipv6-min test both 127.0.0.1 and [::1] bridges
correspond to the same node_t, which I guess makes sense given that it's
actually the same underlying bridge.

Please let me know what you think about this fix. I thought about unintended consequences of the mod in get_configured_bridge_by_exact_addr_port_digest() and learned_router_identity(), but I couldn't find anything bad.
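
As a rough sketch of the learned_router_identity() side of this (approximate code, assuming direct access to the bridge's identity field; the real change is in branch bug21415_testfix):

  /* Look the bridge up by its exact configured addr:port (now also matching
   * bridges whose digest is not yet known), instead of a lookup that prefers
   * digest matches and hands back the 127.0.0.1 entry for a [::1] connection. */
  bridge_info_t *bridge =
    get_configured_bridge_by_exact_addr_port_digest(addr, port, digest);
  if (bridge && tor_digest_is_zero(bridge->identity)) {
    /* Record the identity we just learned from the CERTS cell, so that
     * guard_has_descriptor() can later find this bridge's node_t. */
    memcpy(bridge->identity, digest, DIGEST_LEN);
  }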

comment:24 in reply to: ↑ 22 Changed 3 months ago by teor

Replying to asn:

c) If the test is ipv6-only and the bridge is only configured as ipv6, why does the client prefer the ipv4 address of the bridge? That's regarding [notice] Bridge 'test003br' has both an IPv4 and an IPv6 address. Will prefer using its IPv4 address (127.0.0.1:5003) based on the configured Bridge address.

Sorry, this chutney test was never intended to be *that* evil.

comment:25 follow-up: Changed 3 months ago by nickm

The get_configured_bridge_by_exact_addr_port_digest() change looks safe.

For the learned_router_identity change -- is there some way we might have changed the addr:port arguments to that function before we call it, and they might not be the same as the bridge line any more? I know there are some weird cases with connection addresses in the code, but I'm not sure if they apply to bridges.

General note -- this branch is based on master. When you're writing a branch to go into 0.3.0, it's a little easier for me if you base the branch on maint-0.3.0 instead of on master. "git worktree" can make this really convenient, and let you have separate directories for each branch you're working on.

comment:26 in reply to: ↑ 25 Changed 3 months ago by asn

Replying to nickm:

The get_configured_bridge_by_exact_addr_port_digest() change looks safe.

For the learned_router_identity change -- is there some way we might have changed the addr:port arguments to that function before we call it, and they might not be the same as the bridge line any more? I know there are some weird cases with connection addresses in the code, but I'm not sure if they apply to bridges.

Yes that's a good point and we should be careful.

I used the make -k trick and I audited the places that could be rewriting the addrport of a connection but I didn't find anything relevant for this case...

As an example, connection_or_check_canonicity() does a pretty sneaky rewrite, but it only seems to happen for ORs and in our case we are a client using bridges.

For an alternative confidence-inspiring observation, notice that in learned_router_identity() (the function we modded), a bit after we call get_configured_bridge_by_exact_addr_port_digest() we call find_transport_name_by_bridge_addrport(), which does a very similar operation to the _exact_ function; namely, it walks through the bridges and compares their addrport with the conn addrport. So if find_transport_name_by_bridge_addrport() trusts the connection addrport and the code has been working for a while, that seems to be an argument that our modification is also innocuous.

General note -- this branch is based on master. When you're writing a branch to go into 0.3.0, it's a little easier for me if you base the branch on maint-0.3.0 instead of on master. "git worktree" can make this really convenient, and let you have separate directories for each branch you're working on.

Oops yes, sorry about that. I need to take some time and learn how to use git worktree. Last time I spent 10 minutes but didn't manage to grasp it completely. Will try to do this soon.

Let me know if you want me to rebase the branch to 0.3.0.

comment:27 Changed 3 months ago by nickm

I got it to rebase cleanly as bug21415_testfix_030.

comment:28 Changed 3 months ago by nickm

  • Keywords 030-backport added

Taking bug21415_testfix_030 on master for now; let's backport to 030 if nothing goes wrong.

comment:29 Changed 2 months ago by nickm

  • Keywords review-group-16 removed

comment:30 Changed 7 weeks ago by nickm

  • Resolution set to fixed
  • Status changed from needs_review to closed

Merged to 0.3.0; thanks again, asn!

comment:31 Changed 2 weeks ago by isis

  • Cc isis added