Opened 2 months ago

Last modified 6 weeks ago

#24367 needs_information defect

Changing pluggable transports (during start-up) in Tor Browser is broken

Reported by: gk Owned by:
Priority: Medium Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version: Tor: 0.3.0.1-alpha
Severity: Normal Keywords: 030-backport, 031-backport, regression, tor-bridge-client, s8-errors, tbb-wants
Cc: teor, brade, mcs, nickm, asn, catalyst Actual Points:
Parent ID: Points: 0.5
Reviewer: Sponsor: Sponsor8-can

Description

The fix for #23347 made bridges work again in tor but did not fully fix the problem and introduced a new issue: changing pluggable transports during start-up of Tor Browser is broken. Steps to reproduce are:

1) Start Tor Browser and select e.g. obfs3 as a transport
2) Before finally connecting, click the Cancel button
3) Reconfigure the censorship bypass settings and choose, e.g. obfs4
4) Resume starting. This time you might get an error (indicating that your tor still wants to connect to an obfs3 bridge) and/or just a stalled bootstrap process.

Even worse, after quitting the browser and restarting, the start-up process is still broken: it does not proceed and one only gets

Nov 21 11:55:59.000 [notice] Ignoring directory request, since no bridge nodes are available yet.

messages in the terminal.

The first bad commit is 93a8ed3b83b5f20768562ca2aff4eba7aca667d8.

Child Tickets

Ticket   Status        Owner   Summary                                                              Component
#24392   merge_ready   teor    Ignore cached bridge descriptors until we check if they are running  Core Tor/Tor
#24486   closed                Mark all bridges as up on application activity                       Core Tor/Tor

Change History (41)

comment:1 Changed 2 months ago by gk

Summary: Changes pluggable transports (during start-up) in Tor Browser is broken → Changing pluggable transports (during start-up) in Tor Browser is broken

comment:2 Changed 2 months ago by mcs

Cc: brade mcs added

comment:3 Changed 2 months ago by teor

Cc: nickm asn added
Keywords: regression tor-bridge-client added
Points: 0.5
Status: new → needs_revision
Version: Tor: 0.3.2.1-alpha

The underlying issue here is that every use of any_bridge_descriptors_known() in tor is wrong. Instead, we want to know if num_bridges_usable() > 0.

This is a bug introduced in commit 93a8ed3 in 0.3.2.1-alpha by #23347. It was also a bug in commit eca2a30 in 0.2.0.3-alpha, but we never encountered it before, because we always retried our bridges immediately (and far too often).

My branch bug24367 at https://github.com/teor2345/tor.git passes all of make test-network, but fails some of the guard and microdesc unit tests. This probably means the unit tests were relying on the buggy behaviour, and need to be changed. I might need nickm or asn to help me fix the unit tests, because they wrote them.

I think this branch fixes the issue after quitting the browser. It might also fix the stalled bootstrap or error, but I'm not sure. Can you re-test with this branch?

comment:4 Changed 2 months ago by teor

(Oh, the branch is based on 032, I've pushed a copy to bug24367_032.)

comment:5 Changed 2 months ago by nickm

Owner: set to nickm
Status: needs_revision → accepted

comment:6 Changed 2 months ago by nickm

Status: accepted → needs_revision

Taking ownership of the test fixes.

comment:7 in reply to: 3 Changed 2 months ago by gk

Replying to teor:

I think this branch fixes the issue after quitting the browser. It might also fix the stalled bootstrap or error, but I'm not sure. Can you re-test with this branch?

Neither of those cases is fixed for me. I am still getting the error that Tor was unable to connect to an obfs3 address even though I switched to obfs4 during start-up. And after restarting I am stuck at

Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] Opening Socks listener on 127.0.0.1:9150

So, the only thing that is gone is

[notice] Ignoring directory request, since no bridge nodes are available yet.

I used d7833c9d27feed9e for testing.

comment:8 in reply to: 7 Changed 2 months ago by teor

Status: needs_revision → needs_information

The first bad commit is 93a8ed3b83b5f20768562ca2aff4eba7aca667d8.

I'm not sure this is correct.

Bridges didn't work at all between c21cfd28f4 (#17750) and 93a8ed3b83 (#23347).
So you might want to check if the commit before c21cfd28f4 (63ceadb485) is a bad commit or not.
If it is, your problem goes much further back in Tor, or it is in Tor Browser or Tor Launcher.

Replying to gk:

Replying to teor:

I think this branch fixes the issue after quitting the browser. It might also fix the stalled bootstrap or error, but I'm not sure. Can you re-test with this branch?

Neither of those cases is fixed for me. I am still getting the error that Tor was unable to connect to an obfs3 address even though I switched to obfs4 during start-up. And after restarting I am stuck at

Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] Opening Socks listener on 127.0.0.1:9150

This looks like a Tor Browser or Tor Launcher bug, not a Tor bug: when DisableNetwork is set, Tor is behaving as designed by not making any connections.

Or, there aren't enough log entries here to diagnose why Tor Launcher has set DisableNetwork, and which component is at fault.

So, the only thing that is gone is

[notice] Ignoring directory request, since no bridge nodes are available yet.

I used d7833c9d27feed9e for testing.

Excellent, then we've fixed one bug that was definitely in Tor.

Let's open other tickets for the remainder?

Edit: remove special character

Last edited 2 months ago by teor

comment:9 in reply to: 8 Changed 2 months ago by gk

Status: needs_information → needs_revision

Replying to teor:

The first bad commit is 93a8ed3b83b5f20768562ca2aff4eba7aca667d8.

I'm not sure this is correct.

Bridges didn't work at all between c21cfd28f4 (#17750) and 93a8ed3b83 (#23347).

You are mistaken here. The immediate parent of 93a8ed3b83b5f20768562ca2aff4eba7aca667d8, which is 6370fb77c586e9ad68c7d1ecb95be36137cb9067, is working perfectly fine: I can change the PT during bootstrap and even after restart the new PT is being used without any complaint. Exactly like the current behavior in 0.3.1.8 is.

So you might want to check if the commit before c21cfd28f4 (63ceadb485) is a bad commit or not.
If it is, your problem goes much further back in Tor, or it is in Tor Browser or Tor Launcher.

Replying to gk:

Replying to teor:

I think this branch fixes the issue after quitting the browser. It might also fix the stalled bootstrap or error, but I'm not sure. Can you re-test with this branch?

Neither of those cases is fixed for me. I am still getting the error that Tor was unable to connect to an obfs3 address even though I switched to obfs4 during start-up. And after restarting I am stuck at

Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] Opening Socks listener on 127.0.0.1:9150

This looks like a Tor Browser or Tor Launcher bug, not a Tor bug: when DisableNetwork is set, Tor is behaving as designed by not making any connections.

I am not convinced yet, given that, all things being equal, only 93a8ed3b83b5f20768562ca2aff4eba7aca667d8 is breaking things for us. But let's assume for a minute that this is indeed the case; even then it would not explain why I still get the error when switching the PT during start-up in the first place.

comment:10 in reply to: 9 Changed 2 months ago by teor

Status: needs_revision → needs_information

Replying to gk:

Replying to teor:

The first bad commit is 93a8ed3b83b5f20768562ca2aff4eba7aca667d8.

I'm not sure this is correct.

Bridges didn't work at all between c21cfd28f4 (#17750) and 93a8ed3b83 (#23347).

You are mistaken here. The immediate parent of 93a8ed3b83b5f20768562ca2aff4eba7aca667d8, which is 6370fb77c586e9ad68c7d1ecb95be36137cb9067, is working perfectly fine: I can change the PT during bootstrap and even after restart the new PT is being used without any complaint. Exactly like the current behavior in 0.3.1.8 is.

Oh, you're right, #23347 has a misleading title. Fixed!

...

But I still don't have enough information to diagnose the bug in the logs you supplied for tor commit d7833c9d27feed9e. They tell me that we fixed one bug, but they don't tell me where the next error is.

Perhaps we can work out how to reproduce this bug using a command-line like the one in #23347?

Otherwise, I need someone to provide more detailed logs (info-level?) that tell me *why* Tor Launcher has set DisableNetwork.

About 10 hours ago in #tor-dev IRC, catalyst was saying that Tor Launcher will set DisableNetwork in response to some bootstrap events. (I can't find a bug number for that bug.) Maybe we are seeing an instance of that bug here?

comment:11 in reply to: 10 Changed 2 months ago by gk

Status: needs_information → needs_revision

Replying to teor:

Replying to gk:

Replying to teor:

The first bad commit is 93a8ed3b83b5f20768562ca2aff4eba7aca667d8.

I'm not sure this is correct.

Bridges didn't work at all between c21cfd28f4 (#17750) and 93a8ed3b83 (#23347).

You are mistaken here. The immediate parent of 93a8ed3b83b5f20768562ca2aff4eba7aca667d8, which is 6370fb77c586e9ad68c7d1ecb95be36137cb9067, is working perfectly fine: I can change the PT during bootstrap and even after restart the new PT is being used without any complaint. Exactly like the current behavior in 0.3.1.8 is.

Oh, you're right, #23347 has a misleading title. Fixed!

...

But I still don't have enough information to diagnose the bug in the logs you supplied for tor commit d7833c9d27feed9e. They tell me that we fixed one bug, but they don't tell me where the next error is.

Perhaps we can work out how to reproduce this bug using a command-line like the one in #23347?

I have to think about that a bit harder. I am not sure how to do that right now.

Otherwise, I need someone to provide more detailed logs (info-level?) that tell me *why* Tor Launcher has set DisableNetwork.

Alright. Let's look at the first issue described in steps 1)-4) in the bug description. It might give us some clue about which actual bug(s) remain. I've attached two info-level logs, one with 6370fb77c586e9a (where everything is fine and obfs4 is finally used), and one with d7833c9d27feed9 (where a connection with obfs3 is still attempted). I hope that's already enough for you to explain the problem in that part at least.

About 10 hours ago in #tor-dev IRC, catalyst was saying that Tor Launcher will set DisableNetwork in response to some bootstrap events. (I can't find a bug number for that bug.) Maybe we are seeing an instance of that bug here?

Hm. Not sure, although I am still not convinced that this plays a role here. Maybe mcs/brade have more ideas for that part, though.

Changed 2 months ago by gk

Changed 2 months ago by gk

comment:12 in reply to: 11 Changed 2 months ago by mcs

Replying to gk:

Hm. Not sure, although I am still not convinced that this plays a role here. Maybe mcs/brade have more ideas for that part, though.

In some cases Tor Launcher starts tor with DisableNetwork=1 to allow users to make configuration changes before bootstrapping begins. But I don't think that case is relevant here.

Tor Launcher issues a SETCONF DisableNetwork=1 command via the control port in the following three cases:

  1. After the user clicks Cancel when the bootstrap progress bar is showing.
  2. After the user clicks "Open Settings" when the bootstrap progress bar is showing (note that the new Tor Launcher design that is implemented in Tor Browser 7.5a8 does not include this button).
  3. After a bootstrap error is reported by tor. The reason Tor Launcher does that is that it is about to switch its user interface back to configuration mode, and we do not want bootstrapping to proceed while the user is making adjustments. That only leads to confusion, especially when users switch from using a PT to not using one and vice versa.

For 3, look for mBootstrapErrorOccurred in https://gitweb.torproject.org/tor-launcher.git/tree/README-BOOTSTRAP to see how Tor Launcher detects bootstrap errors. Once the user clicks "Connect" their settings are saved via the control port and Tor Launcher issues a SETCONF DisableNetwork=0 command.

If you want to see the commands that Tor Launcher issues via the control port, use about:config to set both extensions.torlauncher.logmethod and extensions.torlauncher.loglevel to 0. Then run from the command line so you can capture stderr from the firefox process. Look for lines that contain Sending Tor command:.

comment:13 in reply to: 12 Changed 2 months ago by catalyst

Cc: catalyst added
Keywords: s8-errors added
Sponsor: Sponsor8-can

Replying to mcs:

Tor Launcher issues a SETCONF DisableNetwork=1 command via the control port in the following three cases:

  1. After a bootstrap error is reported by tor. The reason Tor Launcher does that is because it is about to switch its user interface back to configuration mode, and we do not want bootstrapping to proceed while the user is making adjustments. That only leads to confusion, especially when users switch from using a PT and not using one and vice-versa.

I recently updated to 7.5a8 (with the new Tor Launcher UI on Linux). I saw the following via control channel when I was doing a clock skew test (client 27.5 hours in the future):

650 STATUS_GENERAL WARN CLOCK_SKEW SKEW=99000 SOURCE=OR:131.188.40.189:443
650 STATUS_CLIENT WARN BOOTSTRAP PROGRESS=40 TAG=loading_keys SUMMARY="Loading authority key certs" WARNING="Clock skew 99000 in NETINFO cell from OR" REASON=CLOCK_SKEW COUNT=3 RECOMMENDATION=warn HOSTID="F2044413DAC2E02E3D6BCF4735A19BCA1DE97281" HOSTADDR="131.188.40.189:443"
650 ORCONN $F2044413DAC2E02E3D6BCF4735A19BCA1DE97281~gabelmoo CONNECTED ID=28
650 STREAM 24 FAILED 0 131.188.40.188.$EBE718E1A49EE229071702964F8DB1F318075FF8.exit:80 REASON=HIBERNATING
650 STREAM 26 FAILED 0 131.188.40.189.$F2044413DAC2E02E3D6BCF4735A19BCA1DE97281.exit:443 REASON=HIBERNATING
650-CONF_CHANGED
650-DisableNetwork=1
650 OK

There were no other control port connections open besides my manual one and Tor Launcher, so I'm pretty sure that SETCONF came from Tor Launcher. For some reason Tor Launcher did not change back to the configuration screen after sending the SETCONF. Maybe at least this symptom is a Tor Launcher bug?

comment:14 in reply to:  13 Changed 2 months ago by catalyst

Replying to catalyst:

I recently updated to 7.5a8 (with the new Tor Launcher UI on Linux). I saw the following via control channel when I was doing a clock skew test (client 27.5 hours in the future):

650 STATUS_GENERAL WARN CLOCK_SKEW SKEW=99000 SOURCE=OR:131.188.40.189:443
650 STATUS_CLIENT WARN BOOTSTRAP PROGRESS=40 TAG=loading_keys SUMMARY="Loading authority key certs" WARNING="Clock skew 99000 in NETINFO cell from OR" REASON=CLOCK_SKEW COUNT=3 RECOMMENDATION=warn HOSTID="F2044413DAC2E02E3D6BCF4735A19BCA1DE97281" HOSTADDR="131.188.40.189:443"
650 ORCONN $F2044413DAC2E02E3D6BCF4735A19BCA1DE97281~gabelmoo CONNECTED ID=28
650 STREAM 24 FAILED 0 131.188.40.188.$EBE718E1A49EE229071702964F8DB1F318075FF8.exit:80 REASON=HIBERNATING
650 STREAM 26 FAILED 0 131.188.40.189.$F2044413DAC2E02E3D6BCF4735A19BCA1DE97281.exit:443 REASON=HIBERNATING
650-CONF_CHANGED
650-DisableNetwork=1
650 OK

There were no other control port connections open besides my manual one and Tor Launcher, so I'm pretty sure that SETCONF came from Tor Launcher. For some reason Tor Launcher did not change back to the configuration screen after sending the SETCONF. Maybe at least this symptom is a Tor Launcher bug?

I haven't been able to reproduce this since then, unfortunately. What I do recall is that it appeared to be the usual bootstrap progress screen, but with the "Copy Tor Log to Clipboard" button showing a warning icon. I don't remember which other buttons were there. Notably missing were the "Tor failed to establish a Tor network connection" text and the REASON text from the bootstrap message.

Changed 2 months ago by catalyst

comment:15 Changed 2 months ago by catalyst

I got it to happen again, this time with the +3.5h time offset. The bootstrap phase message got redrawn in the bold font, but the REASON string isn't shown and there's no error indication. See the screenshot.

[11-22 20:18:29] TorLauncher DBUG: Event response: 650 STATUS_CLIENT WARN BOOTSTRAP PROGRESS=50 TAG=loading_descriptors SUMMARY="Loading relay descriptors" WARNING="Connection refused" REASON=CONNECTREFUSED COUNT=9 RECOMMENDATION=ignore HOSTID="ABCB4965F1FEE193602B50A365425105C889D3F8" HOSTADDR="192.34.63.137:443"
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 WARN Problem bootstrapping. Stuck at 50%: Loading relay descriptors. (Connection refused; CONNECTREFUSED; count 10; recommendation warn; host 9BF04559224F0F1C3C953D641F1744AF0192543A at 5.9.151.241:4223) 
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 WARN 9 connections have failed: 
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 WARN  9 connections died in state connect()ing with SSL state (No SSL object) 
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 STATUS_CLIENT WARN BOOTSTRAP PROGRESS=50 TAG=loading_descriptors SUMMARY="Loading relay descriptors" WARNING="Connection refused" REASON=CONNECTREFUSED COUNT=10 RECOMMENDATION=warn HOSTID="9BF04559224F0F1C3C953D641F1744AF0192543A" HOSTADDR="5.9.151.241:4223"
[11-22 20:18:29] TorLauncher WARN: Tor bootstrap error: [loading_descriptors/CONNECTREFUSED] Loading relay information failed (connection refused - 5.9.151.241:4223).
[11-22 20:18:29] TorLauncher DBUG: Sending Tor command: SETCONF DisableNetwork=1
[11-22 20:18:29] TorLauncher DBUG: Command response: 250 OK
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 NOTICE Closing no-longer-configured Socks listener on 127.0.0.1:9150 
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 NOTICE DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections. 

comment:16 Changed 2 months ago by teor

Owner: nickm deleted
Status: needs_revision → assigned

I'm finding it hard to track the multiple causes of this bug in the same ticket.

I've split the Tor issue with cached bridge descriptors and patch into #24392. I assigned it to nickm for the unit test fixes, and gave it the right state.

I've unassigned this ticket from nickm. We don't have a patch to fix the remaining issues in this ticket, because we don't know what they are. So let's not put it back in "needs_revision", please?

comment:17 Changed 2 months ago by gk

If you look at the log for d7833c9d27feed9 you'll see that the guard list gets updated with obfs4 bridges as is done in the log for 6370fb77c586e9a:

Nov 22 11:37:05.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   1/3: [bridge] ($A832D176ECD5C7C6B58825AE22FC4C90FA249637)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   2/3: [bridge] ($D9A82D2F9C2F65A18407B1D2B764F130847F8B5D)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   3/3: [bridge] ($CDF2E852BF539B82BD10E27E9115A31734E378C2)

So, that part is good. However, contrary to what happens in the 6370fb77c586e9a case, those are not getting used or tried. Rather, the previously used obfs3 bridge is still present:

Nov 22 11:37:06.000 [info] sample_reachable_filtered_entry_guards():   (Selected [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239).)
Nov 22 11:37:06.000 [info] select_entry_guard_for_circuit(): No primary or confirmed guards available. Selected random guard [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) for circuit. Will try other guards before using this circuit.

comment:18 Changed 8 weeks ago by brade

I created #24428 to cover the Tor Launcher issue reported in comment:15.

comment:19 Changed 7 weeks ago by gk

Keywords: tbb-wants added

teor: Are you still working on this bug? It would be really neat if we could get this tor issue fixed for 0.3.2.

comment:20 Changed 7 weeks ago by gk

Status: assigned → needs_information

comment:21 in reply to:  19 Changed 7 weeks ago by teor

Replying to gk:

teor: Are you still working on this bug? It would be really neat if we could get this tor issue fixed for 0.3.2.

I am working with arma and nickm to fix #24392 and #24486. We thought we had a solution, but it needs revising.
I'll probably end up doing the revision. We will probably get this done in 0.3.2.

It seems like there might be other bugs that contribute to this issue. I don't know whether they are in Tor or Tor Launcher. I don't understand all the details.

I don't intend to work on this issue beyond fixing #24392 and #24486.
I might help review or write short patches if someone can identify a specific bug in Tor.
A reproducible series of steps that only involve tor (not Tor Launcher) would be really helpful here.
(This is volunteer work for me, and I don't feel I know enough about Tor / Tor Launcher interactions to be helpful.)

comment:22 Changed 6 weeks ago by teor

It is likely that this change in #24392 fixes at least part of this issue, but that ticket still needs revision:

     case DL_SCHED_BRIDGE:
       /* A bridge client downloading bridge descriptors */
-      if (options->UseBridges && any_bridge_descriptors_known()) {
+      if (options->UseBridges && num_bridges_usable() > 0) {
         /* A bridge client with one or more running bridges */
         return options->TestingBridgeDownloadSchedule;
       } else {
         /* A bridge client with no running bridges */
         return options->TestingBridgeBootstrapDownloadSchedule;
       }

Before the change, we would check for cached bridge descriptors, which is definitely wrong. After the change, we check for running bridges, which is the right thing to do. (But it might still need a fix or two to work.)

comment:23 in reply to: 17 Changed 6 weeks ago by teor

I think we might have fixed some more parts of this issue in #24392: we now distinguish between bridges that might be reachable, and bridges that are definitely reachable.

This partially reverts some of the changes in #23347 and #17750:

  • we only delay bridge descriptor fetches when we have a bridge that is definitely running,
  • we only delay directory fetches when all of our bridges are definitely not running,
  • we keep on retrying directory downloads every time we receive a new bridge descriptor, until we definitely have a reachable bridge.

Please re-test my branch bug24367_032 from github.

Replying to gk:

If you look at the log for d7833c9d27feed9 you'll see that the guard list gets updated with obfs4 bridges as is done in the log for 6370fb77c586e9a:

Nov 22 11:37:05.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   1/3: [bridge] ($A832D176ECD5C7C6B58825AE22FC4C90FA249637)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   2/3: [bridge] ($D9A82D2F9C2F65A18407B1D2B764F130847F8B5D)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   3/3: [bridge] ($CDF2E852BF539B82BD10E27E9115A31734E378C2)

So, that part is good. However, contrary to what happens in the 6370fb77c586e9a case, those are not getting used or tried. Rather, the previously used obfs3 bridge is still present:

Nov 22 11:37:06.000 [info] sample_reachable_filtered_entry_guards():   (Selected [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239).)
Nov 22 11:37:06.000 [info] select_entry_guard_for_circuit(): No primary or confirmed guards available. Selected random guard [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) for circuit. Will try other guards before using this circuit.

This looks like a bug in sample_reachable_filtered_entry_guards(). It looks like we are keeping the old bridge (A09D536DD1752D542E1FBB3C9CE4449D51298239) as a possible non-primary non-confirmed guard, even when it is no longer configured as a bridge. That's not good at all. We should be discarding it entirely.

I'm going to leave this to someone else to fix.

comment:24 Changed 6 weeks ago by teor

Keywords: 030-backport 031-backport added
Version: Tor: 0.3.2.1-alpha → Tor: 0.3.0.1-alpha

(I am no longer convinced that all the issues in this bug are restricted to 0.3.2, we should consider backporting these changes to 0.3.0 as a precautionary measure.)

comment:25 in reply to:  23 Changed 6 weeks ago by gk

Replying to teor:

I think we might have fixed some more parts of this issue in #24392: we now distinguish between bridges that might be reachable, and bridges that are definitely reachable.

This partially reverts some of the changes in #23347 and #17750:

  • we only delay bridge descriptor fetches when we have a bridge that is definitely running,
  • we only delay directory fetches when all of our bridges are definitely not running,
  • we keep on retrying directory downloads every time we receive a new bridge descriptor, until we definitely have a reachable bridge.

Please re-test my branch bug24367_032 from github.

I did. Here is the good news: it seems this branch fixes the "even worse" part in my comment:description.

Replying to gk:

If you look at the log for d7833c9d27feed9 you'll see that the guard list gets updated with obfs4 bridges as is done in the log for 6370fb77c586e9a:

Nov 22 11:37:05.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   1/3: [bridge] ($A832D176ECD5C7C6B58825AE22FC4C90FA249637)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   2/3: [bridge] ($D9A82D2F9C2F65A18407B1D2B764F130847F8B5D)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   3/3: [bridge] ($CDF2E852BF539B82BD10E27E9115A31734E378C2)

So, that part is good. However, contrary to what happens in the 6370fb77c586e9a case, those are not getting used or tried. Rather, the previously used obfs3 bridge is still present:

Nov 22 11:37:06.000 [info] sample_reachable_filtered_entry_guards():   (Selected [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239).)
Nov 22 11:37:06.000 [info] select_entry_guard_for_circuit(): No primary or confirmed guards available. Selected random guard [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) for circuit. Will try other guards before using this circuit.

This looks like a bug in sample_reachable_filtered_entry_guards(). It looks like we are keeping the old bridge (A09D536DD1752D542E1FBB3C9CE4449D51298239) as a possible non-primary non-confirmed guard, even when it is no longer configured as a bridge. That's not good at all. We should be discarding it entirely.

I'm going to leave this to someone else to fix.

I wonder if that's the reason for the issue outlined in steps 1)-4) in my comment:description as that one is not resolved yet with your bug24367_032 branch.

Last edited 6 weeks ago by gk

comment:26 Changed 6 weeks ago by gk

We debugged this a bit on IRC today and it turned out that a crucial part in this bug seems to be that one of the obfs3 and one of the obfs4 bridges share the same ID. So, if one chooses an obfs3 bridge Tor Launcher issues a SETCONF command like

[12-12 16:21:14] TorLauncher DBUG: Sending Tor command: SETCONF 
UseBridges=1 Bridge="obfs3 109.105.109.163:38980 
1E05F577A0EC0213F971D81BF4D86A9E4E8229ED" Bridge="obfs3 
109.105.109.163:47779 4C331FA9B3D1D6D8FB0D8FBBF0C259C360D97E6A" 
Bridge="obfs3 83.212.101.3:80 A09D536DD1752D542E1FBB3C9CE4449D51298239"
Bridge="obfs3 169.229.59.75:46328 AF9F66B7B04F8FF6F32D455F05135250A16543C9" 
Bridge="obfs3 169.229.59.74:31493 AF9F66B7B04F8FF6F32D455F05135250A16543C9"

When pressing Cancel during the bootstrap process and selecting the obfs4 option instead, Tor Launcher sends a similar SETCONF command. However, later on we easily get a bootstrap error like

Dec 12 17:21:35.000 [warn] Problem bootstrapping. Stuck at 85%: 
Finishing handshake with first hop. (DONE; DONE; count 1; 
recommendation warn; host A09D536DD1752D542E1FBB3C9CE4449D51298239 at 
83.212.101.3:80)

The ID A09D536DD1752D542E1FBB3C9CE4449D51298239 is used by one of the obfs4 bridges as well (the IP is the same but the port is different: 50002 in this case). It seems things work much better if I remove the obfs3 bridge with A09D536DD1752D542E1FBB3C9CE4449D51298239 before starting. I did some tests and I never ran into the bootstrap error. FWIW it is visible as well if one starts with obfs4 first and switches to obfs3, although it seems the error does not pop up outright (as in the obfs3 -> obfs4 switch) but becomes visible once I press Cancel in Tor Launcher after the bootstrap process is stuck.

According to Nick the issue seems to be that Tor Browser will have a guard entry for the bridge it was using before but it seems the lookup which is done by fingerprint _and_ address is not working as expected.

So, that's the error case in 4) in my comment:description. I am not sure about the stalling part yet. My current plan is to open a new ticket once all open issues related to that one are fixed/merged and I can still come up with a scenario to reproduce a stalled bootstrap process.

comment:27 Changed 6 weeks ago by nickm

I've been staring at the logs, and I have a new idea:

Nov 22 11:37:06.000 [info] TLS error while handshaking with [scrubbed]: wrong version number (in SSL routines:ssl3_get_record:SSLv3/TLS write client hello)

This is an error I might expect to see if we had a mismatched pluggable transport: if we weren't using one when we were supposed to, or if we were using the wrong one and the transport didn't notice.

I've made a new branch ideas_for_24367 to try to track down this possibility. It doesn't make any changes: it just adds new logs to 0.3.2 to keep track of our transport plans, and the bridge addresses we think our guards have. Could you generate info-level logs for the failing case using this branch?

comment:28 Changed 6 weeks ago by gk

Added.

comment:29 Changed 6 weeks ago by nickm

Okay! Let's look through these logs to see what happens to our favorite bridge, A09D536DD1752D542E1FBB3C9CE4449D51298239, at 83.212.101.3.

Dec 14 11:46:18.000 [info] get_transport_by_bridge_addrport(): Looked up transport for 83.212.101.3:80: found obfs3.
...
Dec 14 11:46:19.000 [info] make_guard_confirmed(): Marking [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) at 83.212.101.3:80 as a confirmed guard (index 0)
Dec 14 11:46:19.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Dec 14 11:46:19.000 [info] entry_guards_update_primary():   1/3: [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) at 83.212.101.3:80 (confirmed)
Dec 14 11:46:19.000 [info] entry_guards_update_primary():   2/3: [bridge] ($4C331FA9B3D1D6D8FB0D8FBBF0C259C360D97E6A) at 109.105.109.163:47779
Dec 14 11:46:19.000 [info] entry_guards_update_primary():   3/3: [bridge] ($1E05F577A0EC0213F971D81BF4D86A9E4E8229ED) at 109.105.109.163:38980
...
Dec 14 11:46:19.000 [notice] new bridge descriptor 'LeifEricson' (fresh): $A09D536DD1752D542E1FBB3C9CE4449D51298239~LeifEricson at 83.212.101.3
... (time passes; tor bootstraps; we change the options) ...
Dec 14 11:46:31.000 [info] options_act(): Changed to using entry guards or bridges, or changed preferred or excluded node lists. Abandoning previous circuits.
Dec 14 11:46:31.000 [info] sampled_guards_update_from_consensus(): Updating sampled guard status based on received consensus.
...
Dec 14 11:46:31.000 [info] sampled_guards_update_from_consensus(): Sampled guard [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) at 83.212.101.3:80 is now unlisted.
...
Dec 14 11:46:31.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Dec 14 11:46:31.000 [info] entry_guards_update_primary():   1/3: [bridge] ($91A6354697E6B02A386312F68D82CF86824D3606) at 85.31.186.26:443
Dec 14 11:46:31.000 [info] entry_guards_update_primary():   2/3: [bridge] ($C73ADBAC8ADFDBF0FC0F3F4E8091C0107D093716) at 154.35.22.9:12166
Dec 14 11:46:31.000 [info] entry_guards_update_primary():   3/3: [bridge] ($A832D176ECD5C7C6B58825AE22FC4C90FA249637) at 154.35.22.11:80
...
Dec 14 11:46:33.000 [info] sample_reachable_filtered_entry_guards(): Trying to sample a reachable guard: We know of 20 in the USABLE_FILTERED set.
Dec 14 11:46:33.000 [info] sample_reachable_filtered_entry_guards():   (After filters [17], we have 1 guards to consider.)
Dec 14 11:46:33.000 [info] sample_reachable_filtered_entry_guards():   (Selected [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) at 83.212.101.3:50002.)
Dec 14 11:46:33.000 [info] select_entry_guard_for_circuit(): No primary or confirmed guards available. Selected random guard [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) at 83.212.101.3:50002 for circuit. Will try other guards before using this circuit.
Dec 14 11:46:33.000 [info] get_transport_by_bridge_addrport(): Looked up transport for 83.212.101.3:80: no such bridge found.
...
Dec 14 11:46:33.000 [info] connection_or_note_state_when_broken(): Connection died in state 'handshaking (TLS) with SSL state SSLv3/TLS write client hello in HANDSHAKE'
Dec 14 11:46:33.000 [warn] Problem bootstrapping. Stuck at 85%: Finishing handshake with first hop. (DONE; DONE; count 1; recommendation warn; host A09D536DD1752D542E1FBB3C9CE4449D51298239 at 83.212.101.3:80)

So, two interesting things here:

  1. Why did we pick LeifEricson for our connection the second time, after we had chosen 3 other primary guards? The line "After filters [17], we have 1 guards to consider." is suggestive here. Note that 17 is hex here, so we are setting every one of SAMPLE_EXCLUDE_CONFIRMED, SAMPLE_EXCLUDE_PRIMARY, SAMPLE_EXCLUDE_PENDING, and SAMPLE_EXCLUDE_NO_DESCRIPTOR. This means we're calling sample_reachable_filtered_entry_guards() from select_entry_guard_for_circuit() with GUARD_USAGE_TRAFFIC set. But we have no descriptors for any of our primary guards, so we're not using any of them. Whoops!
  2. When we connect to LeifEricson, we decide that it doesn't have a transport. But look at what's happening here: we say that we've selected 83.212.101.3:50002 for the circuit, but when we look up the transport, we log "Looked up transport for 83.212.101.3:80". So we're looking up the wrong transport entirely.

comment:30 Changed 6 weeks ago by nickm

  1. Why did we pick LeifEricson for our connection the second time, after we had chosen 3 other primary guards?

So, what should have happened here is that we should have tried to fetch our primary guards' descriptors before building circuits, rather than falling back to an entry this far down in the guard set. The logic that's supposed to stop that is in guard_selection_get_err_str_if_dir_info_missing(). But also, I don't see us launching requests for any of the new primary guards! Something's going wrong there.

  2. So we're looking up the wrong transport entirely.

I think this is an artifact of looking at conn->addr instead of conn->real_addr.

comment:31 Changed 6 weeks ago by nickm

I think this is an artifact of looking at conn->addr instead of conn->real_addr.

Wait. In this case, addr and real_addr will be the same. The problem is that there is no real_port, maybe?

comment:32 Changed 6 weeks ago by nickm

But also, I don't see us launching requests for any of the new primary guards! Something's going wrong there.

Ah, we are sort-of-trying. But the first time we try, at 11:46:32, we delay dir fetches because "pt proxies still configuring", and 2 seconds later, at 11:46:34, we delay dir fetches because DisableNetwork is set. So there's no real chance to fetch better descriptors.

comment:33 Changed 6 weeks ago by nickm

So, I have two theories now based on this log information: one is something I had suspected before, and one is something I hadn't realized.

All of the fixes are in a branch 24367_some_more_030 in my public repository. But instead of testing that one, I recommend that you test 24367_example_merge_032, which also has the log improvements I made before, as well as being merged to 0.3.2 in order to make sure Teor's fixes are in.

comment:34 Changed 6 weeks ago by nickm

(These branches are a little ugly; I should revise and comment more before review)

comment:35 Changed 6 weeks ago by gk

The particular error seems to be gone but now I am stuck at

Dec 14 18:04:18.000 [notice] Opening Socks listener on 127.0.0.1:9150
Dec 14 18:04:20.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for 2/2 of our primary entry guards (total microdescriptors: 6502/6502).

Attached is the debug log.

comment:36 Changed 6 weeks ago by teor

After the bridges and PT are changed, this is what happens:

Dec 14 19:48:33.000 [info] register_client_proxy(): Successfully registered transport obfs2
Dec 14 19:48:33.000 [info] register_client_proxy(): Successfully registered transport obfs3
Dec 14 19:48:33.000 [info] register_client_proxy(): Successfully registered transport obfs4
Dec 14 19:48:33.000 [info] register_client_proxy(): Successfully registered transport scramblesuit
Dec 14 19:48:34.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for 2/2 of our primary entry guards (total microdescriptors: 6474/6486).
Dec 14 19:48:34.000 [info] launch_descriptor_downloads(): Launching 1 request for 12 microdescs, 12 at a time
Dec 14 19:48:34.000 [info] select_entry_guard_for_circuit(): Selected primary guard [bridge] ($00DC6C4FA49A65BD1472993CF6730D54F11E0DBB) at 154.35.22.12:80 for circuit.
Dec 14 19:48:34.000 [notice] Ignoring directory request, since no bridge nodes are available yet.

Notice how we're missing bridge descriptors, but we try to fetch microdescriptors?

So all that code asn wrote to fetch microdescs harder isn't going to help here. Instead, we need to try bridges harder when we are missing bridge descriptors. Maybe we should start by resetting their download statuses?

Also, we should think about whether this code works when we are not using microdescriptors. Does it specifically trigger a microdesc fetch? Or is it just triggering descriptor fetches?

Note: See TracTickets for help on using tickets.