Opened 2 months ago

Last modified 6 weeks ago

#24367 needs_information defect

Changing pluggable transports (during start-up) in Tor Browser is broken

Reported by: gk Owned by:
Priority: Medium Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version: Tor: 0.3.0.1-alpha
Severity: Normal Keywords: 030-backport, 031-backport, regression, tor-bridge-client, s8-errors, tbb-wants
Cc: teor, brade, mcs, nickm, asn, catalyst Actual Points:
Parent ID: Points: 0.5
Reviewer: Sponsor: Sponsor8-can

Description

The fix for #23347 made bridges work again in tor but did not fully fix the problem and introduced a new issue: changing pluggable transports during start-up of Tor Browser is broken. Steps to reproduce are:

1) Start Tor Browser and select e.g. obfs3 as a transport
2) Before finally connecting, click the Cancel button
3) Reconfigure the censorship bypass settings and choose, e.g. obfs4
4) Resume starting. This time you might get an error (indicating that your tor still wants to connect to an obfs3 bridge) and/or just a stalled bootstrap process.

Even worse, after quitting the browser and restarting, the start-up process is still broken: it does not proceed and one only gets

Nov 21 11:55:59.000 [notice] Ignoring directory request, since no bridge nodes are available yet.

messages in the terminal.

The first bad commit is 93a8ed3b83b5f20768562ca2aff4eba7aca667d8.

Child Tickets

Ticket   Status        Owner   Summary                                                              Component
#24392   merge_ready   teor    Ignore cached bridge descriptors until we check if they are running  Core Tor/Tor
#24486   closed                Mark all bridges as up on application activity                       Core Tor/Tor

Change History (41)

comment:1 Changed 2 months ago by gk

Summary: Changes pluggable transports (during start-up) in Tor Browser is broken → Changing pluggable transports (during start-up) in Tor Browser is broken

comment:2 Changed 2 months ago by mcs

Cc: brade mcs added

comment:3 Changed 2 months ago by teor

Cc: nickm asn added
Keywords: regression tor-bridge-client added
Points: 0.5
Status: new → needs_revision
Version: Tor: 0.3.2.1-alpha

The underlying issue here is that every use of any_bridge_descriptors_known() in tor is wrong. Instead, we want to know if num_bridges_usable() > 0.

This is a bug introduced in commit 93a8ed3 in 0.3.2.1-alpha by #23347. It was also a bug in commit eca2a30 in 0.2.0.3-alpha, but we never encountered it before, because we always retried our bridges immediately (and far too often).

My branch bug24367 at https://github.com/teor2345/tor.git passes all of make test-network, but fails some of the guard and microdesc unit tests. This probably means the unit tests were relying on the buggy behaviour, and need to be changed. I might need nickm or asn to help me fix the unit tests, because they wrote them.

I think this branch fixes the issue after quitting the browser. It might also fix the stalled bootstrap or error, but I'm not sure. Can you re-test with this branch?

comment:4 Changed 2 months ago by teor

(Oh, the branch is based on 032, I've pushed a copy to bug24367_032.)

comment:5 Changed 2 months ago by nickm

Owner: set to nickm
Status: needs_revision → accepted

comment:6 Changed 2 months ago by nickm

Status: accepted → needs_revision

Taking ownership of the test fixes.

comment:7 in reply to: 3 Changed 2 months ago by gk

Replying to teor:

I think this branch fixes the issue after quitting the browser. It might also fix the stalled bootstrap or error, but I'm not sure. Can you re-test with this branch?

Neither of those cases is fixed for me. I am still getting the error that Tor was unable to connect to an obfs3 address even though I switched to obfs4 during start-up. And after restarting I am stuck at

Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] Opening Socks listener on 127.0.0.1:9150

So, the only thing that is gone is

[notice] Ignoring directory request, since no bridge nodes are available yet.

I used d7833c9d27feed9e for testing.

comment:8 in reply to: 7 Changed 2 months ago by teor

Status: needs_revision → needs_information

The first bad commit is 93a8ed3b83b5f20768562ca2aff4eba7aca667d8.

I'm not sure this is correct.

Bridges didn't work at all between c21cfd28f4 (#17750) and 93a8ed3b83 (#23347).
So you might want to check if the commit before c21cfd28f4 (63ceadb485) is a bad commit or not.
If it is, your problem goes much further back in Tor, or it is in Tor Browser or Tor Launcher.

Replying to gk:

Replying to teor:

I think this branch fixes the issue after quitting the browser. It might also fix the stalled bootstrap or error, but I'm not sure. Can you re-test with this branch?

Neither of those cases is fixed for me. I am still getting the error that Tor was unable to connect to an obfs3 address even though I switched to obfs4 during start-up. And after restarting I am stuck at

Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] Opening Socks listener on 127.0.0.1:9150

This looks like a Tor Browser or Tor Launcher bug, not a Tor bug: when DisableNetwork is set, Tor is behaving as designed by not making any connections.

Or, there aren't enough log entries here to diagnose why Tor Launcher has set DisableNetwork, and which component is at fault.

So, the only thing that is gone is

[notice] Ignoring directory request, since no bridge nodes are available yet.

I used d7833c9d27feed9e for testing.

Excellent, then we've fixed one bug that was definitely in Tor.

Let's open other tickets for the remainder?

Edit: remove special character

Last edited 2 months ago by teor

comment:9 in reply to: 8 Changed 2 months ago by gk

Status: needs_information → needs_revision

Replying to teor:

The first bad commit is 93a8ed3b83b5f20768562ca2aff4eba7aca667d8.

I'm not sure this is correct.

Bridges didn't work at all between c21cfd28f4 (#17750) and 93a8ed3b83 (#23347).

You are mistaken here. The immediate parent of 93a8ed3b83b5f20768562ca2aff4eba7aca667d8, which is 6370fb77c586e9ad68c7d1ecb95be36137cb9067, is working perfectly fine: I can change the PT during bootstrap and even after restart the new PT is being used without any complaint. Exactly like the current behavior in 0.3.1.8 is.

So you might want to check if the commit before c21cfd28f4 (63ceadb485) is a bad commit or not.
If it is, your problem goes much further back in Tor, or it is in Tor Browser or Tor Launcher.

Replying to gk:

Replying to teor:

I think this branch fixes the issue after quitting the browser. It might also fix the stalled bootstrap or error, but I'm not sure. Can you re-test with this branch?

Neither of those cases is fixed for me. I am still getting the error that Tor was unable to connect to an obfs3 address even though I switched to obfs4 during start-up. And after restarting I am stuck at

Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
Nov 22 08:57:04.000 [notice] Opening Socks listener on 127.0.0.1:9150

This looks like a Tor Browser or Tor Launcher bug, not a Tor bug: when DisableNetwork is set, Tor is behaving as designed by not making any connections.

I am not convinced yet, given that, all things being equal, only 93a8ed3b83b5f20768562ca2aff4eba7aca667d8 is breaking things for us. But let's assume for a minute that this is indeed the case; even then it would not explain why I still get the error when switching the PT during start-up in the first place.

comment:10 in reply to: 9 Changed 2 months ago by teor

Status: needs_revision → needs_information

Replying to gk:

Replying to teor:

The first bad commit is 93a8ed3b83b5f20768562ca2aff4eba7aca667d8.

I'm not sure this is correct.

Bridges didn't work at all between c21cfd28f4 (#17750) and 93a8ed3b83 (#23347).

You are mistaken here. The immediate parent of 93a8ed3b83b5f20768562ca2aff4eba7aca667d8, which is 6370fb77c586e9ad68c7d1ecb95be36137cb9067, is working perfectly fine: I can change the PT during bootstrap and even after restart the new PT is being used without any complaint. Exactly like the current behavior in 0.3.1.8 is.

Oh, you're right, #23347 has a misleading title. Fixed!

...

But I still don't have enough information to diagnose the bug in the logs you supplied for tor commit d7833c9d27feed9e. They tell me that we fixed one bug, but they don't tell me where the next error is.

Perhaps we can work out how to reproduce this bug using a command-line like the one in #23347?

Otherwise, I need someone to provide more detailed logs (info-level?) that tell me *why* Tor Launcher has set DisableNetwork.

About 10 hours ago in #tor-dev IRC, catalyst was saying that Tor Launcher will set DisableNetwork in response to some bootstrap events. (I can't find a bug number for that bug.) Maybe we are seeing an instance of that bug here?

comment:11 in reply to: 10 Changed 2 months ago by gk

Status: needs_information → needs_revision

Replying to teor:

Replying to gk:

Replying to teor:

The first bad commit is 93a8ed3b83b5f20768562ca2aff4eba7aca667d8.

I'm not sure this is correct.

Bridges didn't work at all between c21cfd28f4 (#17750) and 93a8ed3b83 (#23347).

You are mistaken here. The immediate parent of 93a8ed3b83b5f20768562ca2aff4eba7aca667d8, which is 6370fb77c586e9ad68c7d1ecb95be36137cb9067, is working perfectly fine: I can change the PT during bootstrap and even after restart the new PT is being used without any complaint. Exactly like the current behavior in 0.3.1.8 is.

Oh, you're right, #23347 has a misleading title. Fixed!

...

But I still don't have enough information to diagnose the bug in the logs you supplied for tor commit d7833c9d27feed9e. They tell me that we fixed one bug, but they don't tell me where the next error is.

Perhaps we can work out how to reproduce this bug using a command-line like the one in #23347?

I have to think about that a bit harder. I am not sure how to do that right now.

Otherwise, I need someone to provide more detailed logs (info-level?) that tell me *why* Tor Launcher has set DisableNetwork.

Alright. Let's look at the first issue described in steps 1)-4) in the bug description. It might give us some clue about which actual bug(s) remain. I've attached two info-level logs, one with 6370fb77c586e9a (where everything is fine and obfs4 is finally used), and one with d7833c9d27feed9 (where a connection with obfs3 is still attempted). I hope that's already enough for you to explain the problem in that part at least.

About 10 hours ago in #tor-dev IRC, catalyst was saying that Tor Launcher will set DisableNetwork in response to some bootstrap events. (I can't find a bug number for that bug.) Maybe we are seeing an instance of that bug here?

Hm. Not sure, although I am still not convinced that this plays a role here. Maybe mcs/brade have more ideas for that part, though.

Changed 2 months ago by gk

Changed 2 months ago by gk

comment:12 in reply to: 11 Changed 2 months ago by mcs

Replying to gk:

Hm. Not sure, although I am still not convinced that this plays a role here. Maybe mcs/brade have more ideas for that part, though.

In some cases Tor Launcher starts tor with DisableNetwork=1 to allow users to make configuration changes before bootstrapping begins. But I don't think that case is relevant here.

Tor Launcher issues a SETCONF DisableNetwork=1 command via the control port in the following three cases:

  1. After the user clicks Cancel when the bootstrap progress bar is showing.
  2. After the user clicks "Open Settings" when the bootstrap progress bar is showing (note that the new Tor Launcher design that is implemented in Tor Browser 7.5a8 does not include this button).
  3. After a bootstrap error is reported by tor. The reason Tor Launcher does that is that it is about to switch its user interface back to configuration mode, and we do not want bootstrapping to proceed while the user is making adjustments. That only leads to confusion, especially when users switch from using a PT to not using one and vice versa.

For 3, look for mBootstrapErrorOccurred in https://gitweb.torproject.org/tor-launcher.git/tree/README-BOOTSTRAP to see how Tor Launcher detects bootstrap errors. Once the user clicks "Connect" their settings are saved via the control port and Tor Launcher issues a SETCONF DisableNetwork=0 command.

If you want to see the commands that Tor Launcher issues via the control port, use about:config to set both extensions.torlauncher.logmethod and extensions.torlauncher.loglevel to 0. Then run from the command line so you can capture stderr from the firefox process. Look for lines that contain Sending Tor command:.

comment:13 in reply to: 12 Changed 2 months ago by catalyst

Cc: catalyst added
Keywords: s8-errors added
Sponsor: Sponsor8-can

Replying to mcs:

Tor Launcher issues a SETCONF DisableNetwork=1 command via the control port in the following three cases:

  1. After a bootstrap error is reported by tor. The reason Tor Launcher does that is because it is about to switch its user interface back to configuration mode, and we do not want bootstrapping to proceed while the user is making adjustments. That only leads to confusion, especially when users switch from using a PT and not using one and vice-versa.

I recently updated to 7.5a8 (with the new Tor Launcher UI on Linux). I saw the following via control channel when I was doing a clock skew test (client 27.5 hours in the future):

650 STATUS_GENERAL WARN CLOCK_SKEW SKEW=99000 SOURCE=OR:131.188.40.189:443
650 STATUS_CLIENT WARN BOOTSTRAP PROGRESS=40 TAG=loading_keys SUMMARY="Loading authority key certs" WARNING="Clock skew 99000 in NETINFO cell from OR" REASON=CLOCK_SKEW COUNT=3 RECOMMENDATION=warn HOSTID="F2044413DAC2E02E3D6BCF4735A19BCA1DE97281" HOSTADDR="131.188.40.189:443"
650 ORCONN $F2044413DAC2E02E3D6BCF4735A19BCA1DE97281~gabelmoo CONNECTED ID=28
650 STREAM 24 FAILED 0 131.188.40.188.$EBE718E1A49EE229071702964F8DB1F318075FF8.exit:80 REASON=HIBERNATING
650 STREAM 26 FAILED 0 131.188.40.189.$F2044413DAC2E02E3D6BCF4735A19BCA1DE97281.exit:443 REASON=HIBERNATING
650-CONF_CHANGED
650-DisableNetwork=1
650 OK

There were no other control port connections open besides my manual one and Tor Launcher, so I'm pretty sure that SETCONF came from Tor Launcher. For some reason Tor Launcher did not change back to the configuration screen after sending the SETCONF. Maybe at least this symptom is a Tor Launcher bug?

comment:14 in reply to:  13 Changed 2 months ago by catalyst

Replying to catalyst:

I recently updated to 7.5a8 (with the new Tor Launcher UI on Linux). I saw the following via control channel when I was doing a clock skew test (client 27.5 hours in the future):

650 STATUS_GENERAL WARN CLOCK_SKEW SKEW=99000 SOURCE=OR:131.188.40.189:443
650 STATUS_CLIENT WARN BOOTSTRAP PROGRESS=40 TAG=loading_keys SUMMARY="Loading authority key certs" WARNING="Clock skew 99000 in NETINFO cell from OR" REASON=CLOCK_SKEW COUNT=3 RECOMMENDATION=warn HOSTID="F2044413DAC2E02E3D6BCF4735A19BCA1DE97281" HOSTADDR="131.188.40.189:443"
650 ORCONN $F2044413DAC2E02E3D6BCF4735A19BCA1DE97281~gabelmoo CONNECTED ID=28
650 STREAM 24 FAILED 0 131.188.40.188.$EBE718E1A49EE229071702964F8DB1F318075FF8.exit:80 REASON=HIBERNATING
650 STREAM 26 FAILED 0 131.188.40.189.$F2044413DAC2E02E3D6BCF4735A19BCA1DE97281.exit:443 REASON=HIBERNATING
650-CONF_CHANGED
650-DisableNetwork=1
650 OK

There were no other control port connections open besides my manual one and Tor Launcher, so I'm pretty sure that SETCONF came from Tor Launcher. For some reason Tor Launcher did not change back to the configuration screen after sending the SETCONF. Maybe at least this symptom is a Tor Launcher bug?

I haven't been able to reproduce this since then, unfortunately. What I do recall is that it appeared to be the usual bootstrap progress screen, but with the "Copy Tor Log to Clipboard" button showing a warning icon. I don't remember which other buttons were there. Notably missing were the "Tor failed to establish a Tor network connection" text and the REASON text from the bootstrap message.

Changed 2 months ago by catalyst

comment:15 Changed 2 months ago by catalyst

I got it to happen again, this time with the +3.5h time offset. The bootstrap phase message got redrawn in the bold font, but the REASON string isn't shown and there's no error indication. See the screenshot.

[11-22 20:18:29] TorLauncher DBUG: Event response: 650 STATUS_CLIENT WARN BOOTSTRAP PROGRESS=50 TAG=loading_descriptors SUMMARY="Loading relay descriptors" WARNING="Connection refused" REASON=CONNECTREFUSED COUNT=9 RECOMMENDATION=ignore HOSTID="ABCB4965F1FEE193602B50A365425105C889D3F8" HOSTADDR="192.34.63.137:443"
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 WARN Problem bootstrapping. Stuck at 50%: Loading relay descriptors. (Connection refused; CONNECTREFUSED; count 10; recommendation warn; host 9BF04559224F0F1C3C953D641F1744AF0192543A at 5.9.151.241:4223) 
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 WARN 9 connections have failed: 
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 WARN  9 connections died in state connect()ing with SSL state (No SSL object) 
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 STATUS_CLIENT WARN BOOTSTRAP PROGRESS=50 TAG=loading_descriptors SUMMARY="Loading relay descriptors" WARNING="Connection refused" REASON=CONNECTREFUSED COUNT=10 RECOMMENDATION=warn HOSTID="9BF04559224F0F1C3C953D641F1744AF0192543A" HOSTADDR="5.9.151.241:4223"
[11-22 20:18:29] TorLauncher WARN: Tor bootstrap error: [loading_descriptors/CONNECTREFUSED] Loading relay information failed (connection refused - 5.9.151.241:4223).
[11-22 20:18:29] TorLauncher DBUG: Sending Tor command: SETCONF DisableNetwork=1
[11-22 20:18:29] TorLauncher DBUG: Command response: 250 OK
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 NOTICE Closing no-longer-configured Socks listener on 127.0.0.1:9150 
[11-22 20:18:29] TorLauncher DBUG: Event response: 650 NOTICE DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections. 

comment:16 Changed 2 months ago by teor

Owner: nickm deleted
Status: needs_revision → assigned

I'm finding it hard to track the multiple causes of this bug in the same ticket.

I've split the Tor issue with cached bridge descriptors and patch into #24392. I assigned it to nickm for the unit test fixes, and gave it the right state.

I've unassigned this ticket from nickm. We don't have a patch to fix the remaining issues in this ticket, because we don't know what they are. So let's not put it back in "needs_revision", please?

comment:17 Changed 2 months ago by gk

If you look at the log for d7833c9d27feed9 you'll see that the guard list gets updated with obfs4 bridges as is done in the log for 6370fb77c586e9a:

Nov 22 11:37:05.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   1/3: [bridge] ($A832D176ECD5C7C6B58825AE22FC4C90FA249637)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   2/3: [bridge] ($D9A82D2F9C2F65A18407B1D2B764F130847F8B5D)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   3/3: [bridge] ($CDF2E852BF539B82BD10E27E9115A31734E378C2)

So, that part is good. However, contrary to what happens in the 6370fb77c586e9a case, those are not getting used or tried. Rather, the previously used obfs3 bridge is still present:

Nov 22 11:37:06.000 [info] sample_reachable_filtered_entry_guards():   (Selected [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239).)
Nov 22 11:37:06.000 [info] select_entry_guard_for_circuit(): No primary or confirmed guards available. Selected random guard [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) for circuit. Will try other guards before using this circuit.

comment:18 Changed 8 weeks ago by brade

I created #24428 to cover the Tor Launcher issue reported in comment:15.

comment:19 Changed 7 weeks ago by gk

Keywords: tbb-wants added

teor: Are you still working on this bug? It would be really neat if we could get this tor issue fixed for 0.3.2.

comment:20 Changed 7 weeks ago by gk

Status: assigned → needs_information

comment:21 in reply to:  19 Changed 7 weeks ago by teor

Replying to gk:

teor: Are you still working on this bug? It would be really neat if we could get this tor issue fixed for 0.3.2.

I am working with arma and nickm to fix #24392 and #24486. We thought we had a solution, but it needs revising.
I'll probably end up doing the revision. We will probably get this done in 0.3.2.

It seems like there might be other bugs that contribute to this issue. I don't know whether they are in Tor or Tor Launcher. I don't understand all the details.

I don't intend to work on this issue beyond fixing #24392 and #24486.
I might help review or write short patches if someone can identify a specific bug in Tor.
A reproducible series of steps that only involve tor (not Tor Launcher) would be really helpful here.
(This is volunteer work for me, and I don't feel I know enough about Tor / Tor Launcher interactions to be helpful.)

comment:22 Changed 6 weeks ago by teor

It is likely that this change in #24392 fixes at least part of this issue, but that ticket still needs revision:

     case DL_SCHED_BRIDGE:
       /* A bridge client downloading bridge descriptors */
-      if (options->UseBridges && any_bridge_descriptors_known()) {
+      if (options->UseBridges && num_bridges_usable() > 0) {
         /* A bridge client with one or more running bridges */
         return options->TestingBridgeDownloadSchedule;
       } else {
         /* A bridge client with no running bridges */
         return options->TestingBridgeBootstrapDownloadSchedule;
       }

Before the change, we would check for cached bridge descriptors, which is definitely wrong. After the change, we check for running bridges, which is the right thing to do. (But it might still need a fix or two to work.)

comment:23 in reply to: 17 Changed 6 weeks ago by teor

I think we might have fixed some more parts of this issue in #24392: we now distinguish between bridges that might be reachable, and bridges that are definitely reachable.

This partially reverts some of the changes in #23347 and #17750:

  • we only delay bridge descriptor fetches when we have a bridge that is definitely running,
  • we only delay directory fetches when all of our bridges are definitely not running,
  • we keep on retrying directory downloads every time we receive a new bridge descriptor, until we definitely have a reachable bridge.

Please re-test my branch bug24367_032 from github.

Replying to gk:

If you look at the log for d7833c9d27feed9 you'll see that the guard list gets updated with obfs4 bridges as is done in the log for 6370fb77c586e9a:

Nov 22 11:37:05.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   1/3: [bridge] ($A832D176ECD5C7C6B58825AE22FC4C90FA249637)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   2/3: [bridge] ($D9A82D2F9C2F65A18407B1D2B764F130847F8B5D)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   3/3: [bridge] ($CDF2E852BF539B82BD10E27E9115A31734E378C2)

So, that part is good. However, contrary to what happens in the 6370fb77c586e9a case, those are not getting used or tried. Rather, the previously used obfs3 bridge is still present:

Nov 22 11:37:06.000 [info] sample_reachable_filtered_entry_guards():   (Selected [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239).)
Nov 22 11:37:06.000 [info] select_entry_guard_for_circuit(): No primary or confirmed guards available. Selected random guard [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) for circuit. Will try other guards before using this circuit.

This looks like a bug in sample_reachable_filtered_entry_guards(). It looks like we are keeping the old bridge (A09D536DD1752D542E1FBB3C9CE4449D51298239) as a possible non-primary non-confirmed guard, even when it is no longer configured as a bridge. That's not good at all. We should be discarding it entirely.

I'm going to leave this to someone else to fix.

comment:24 Changed 6 weeks ago by teor

Keywords: 030-backport 031-backport added
Version: Tor: 0.3.2.1-alpha → Tor: 0.3.0.1-alpha

(I am no longer convinced that all the issues in this bug are restricted to 0.3.2, we should consider backporting these changes to 0.3.0 as a precautionary measure.)

comment:25 in reply to:  23 Changed 6 weeks ago by gk

Replying to teor:

I think we might have fixed some more parts of this issue in #24392: we now distinguish between bridges that might be reachable, and bridges that are definitely reachable.

This partially reverts some of the changes in #23347 and #17750:

  • we only delay bridge descriptor fetches when we have a bridge that is definitely running,
  • we only delay directory fetches when all of our bridges are definitely not running,
  • we keep on retrying directory downloads every time we receive a new bridge descriptor, until we definitely have a reachable bridge.

Please re-test my branch bug24367_032 from github.

I did. Here is the good news: it seems this branch fixes the "even worse" part in my comment:description.

Replying to gk:

If you look at the log for d7833c9d27feed9 you'll see that the guard list gets updated with obfs4 bridges as is done in the log for 6370fb77c586e9a:

Nov 22 11:37:05.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   1/3: [bridge] ($A832D176ECD5C7C6B58825AE22FC4C90FA249637)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   2/3: [bridge] ($D9A82D2F9C2F65A18407B1D2B764F130847F8B5D)
Nov 22 11:37:05.000 [info] entry_guards_update_primary():   3/3: [bridge] ($CDF2E852BF539B82BD10E27E9115A31734E378C2)

So, that part is good. However, contrary to what happens in the 6370fb77c586e9a case, those are not getting used or tried. Rather, the previously used obfs3 bridge is still present:

Nov 22 11:37:06.000 [info] sample_reachable_filtered_entry_guards():   (Selected [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239).)
Nov 22 11:37:06.000 [info] select_entry_guard_for_circuit(): No primary or confirmed guards available. Selected random guard [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) for circuit. Will try other guards before using this circuit.

This looks like a bug in sample_reachable_filtered_entry_guards(). It looks like we are keeping the old bridge (A09D536DD1752D542E1FBB3C9CE4449D51298239) as a possible non-primary non-confirmed guard, even when it is no longer configured as a bridge. That's not good at all. We should be discarding it entirely.

I'm going to leave this to someone else to fix.

I wonder if that's the reason for the issue outlined in steps 1)-4) in my comment:description as that one is not resolved yet with your bug24367_032 branch.

Last edited 6 weeks ago by gk

comment:26 Changed 6 weeks ago by gk

We debugged this a bit on IRC today and it turned out that a crucial part in this bug seems to be that one of the obfs3 and one of the obfs4 bridges share the same ID. So, if one chooses an obfs3 bridge Tor Launcher issues a SETCONF command like

[12-12 16:21:14] TorLauncher DBUG: Sending Tor command: SETCONF 
UseBridges=1 Bridge="obfs3 109.105.109.163:38980 
1E05F577A0EC0213F971D81BF4D86A9E4E8229ED" Bridge="obfs3 
109.105.109.163:47779 4C331FA9B3D1D6D8FB0D8FBBF0C259C360D97E6A" 
Bridge="obfs3 83.212.101.3:80 A09D536DD1752D542E1FBB3C9CE4449D51298239"
Bridge="obfs3 169.229.59.75:46328 AF9F66B7B04F8FF6F32D455F05135250A16543C9" 
Bridge="obfs3 169.229.59.74:31493 AF9F66B7B04F8FF6F32D455F05135250A16543C9"

When pressing Cancel during the bootstrap process and selecting the obfs4 option instead, Tor Launcher sends a similar SETCONF command. However, later on we easily get a bootstrap error like

Dec 12 17:21:35.000 [warn] Problem bootstrapping. Stuck at 85%: 
Finishing handshake with first hop. (DONE; DONE; count 1; 
recommendation warn; host A09D536DD1752D542E1FBB3C9CE4449D51298239 at 
83.212.101.3:80)

The ID A09D536DD1752D542E1FBB3C9CE4449D51298239 is used by one of the obfs4 bridges as well (the IP is the same but the port is different: 50002 in this case). It seems things work much better if I remove the obfs3 bridge with A09D536DD1752D542E1FBB3C9CE4449D51298239 before starting. I did some tests and I never ran into the bootstrap error. FWIW it is visible as well if one starts with obfs4 first and switches to obfs3, although it seems the error does not pop up outright (as in the obfs3 -> obfs4 switch) but becomes visible once I press Cancel in Tor Launcher after the bootstrap process is stuck.

According to Nick the issue seems to be that Tor Browser will have a guard entry for the bridge it was using before but it seems the lookup which is done by fingerprint _and_ address is not working as expected.

So, that's the error case in 4) in my comment:description. I am not sure about the stalling part yet. My current plan is to open a new ticket once all open issues related to that one are fixed/merged and I can still come up with a scenario to reproduce a stalled bootstrap process.

comment:27 Changed 6 weeks ago by nickm

I've been staring at the logs, and I have a new idea:

Nov 22 11:37:06.000 [info] TLS error while handshaking with [scrubbed]: wrong version number (in SSL routines:ssl3_get_record:SSLv3/TLS write client hello)

This is an error I might expect to see if we had a mismatched pluggable transport: if we weren't using one when we were supposed to, or if we were using the wrong one and the transport didn't notice.

I've made a new branch ideas_for_24367 to try to track down this possibility. It doesn't make any changes: it just adds new logs to 0.3.2 to keep track of our transport plans, and the bridge addresses we think our guards have. Could you generate info-level logs for the failing case using this branch?

comment:28 Changed 6 weeks ago by gk

Added.

comment:29 Changed 6 weeks ago by nickm

Okay! Let's look through these logs to see what happens to our favorite bridge, A09D536DD1752D542E1FBB3C9CE4449D51298239, at 83.212.101.3.

Dec 14 11:46:18.000 [info] get_transport_by_bridge_addrport(): Looked up transport for 83.212.101.3:80: found obfs3.
...
Dec 14 11:46:19.000 [info] make_guard_confirmed(): Marking [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) at 83.212.101.3:80 as a confirmed guard (index 0)
Dec 14 11:46:19.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Dec 14 11:46:19.000 [info] entry_guards_update_primary():   1/3: [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) at 83.212.101.3:80 (confirmed)
Dec 14 11:46:19.000 [info] entry_guards_update_primary():   2/3: [bridge] ($4C331FA9B3D1D6D8FB0D8FBBF0C259C360D97E6A) at 109.105.109.163:47779
Dec 14 11:46:19.000 [info] entry_guards_update_primary():   3/3: [bridge] ($1E05F577A0EC0213F971D81BF4D86A9E4E8229ED) at 109.105.109.163:38980
...
Dec 14 11:46:19.000 [notice] new bridge descriptor 'LeifEricson' (fresh): $A09D536DD1752D542E1FBB3C9CE4449D51298239~LeifEricson at 83.212.101.3
... (time passes; tor bootstraps; we change the options) ...
Dec 14 11:46:31.000 [info] options_act(): Changed to using entry guards or bridges, or changed preferred or excluded node lists. Abandoning previous circuits.
Dec 14 11:46:31.000 [info] sampled_guards_update_from_consensus(): Updating sampled guard status based on received consensus.
...
Dec 14 11:46:31.000 [info] sampled_guards_update_from_consensus(): Sampled guard [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) at 83.212.101.3:80 is now unlisted.
...
Dec 14 11:46:31.000 [info] entry_guards_update_primary(): Primary entry guards have changed. New primary guard list is: 
Dec 14 11:46:31.000 [info] entry_guards_update_primary():   1/3: [bridge] ($91A6354697E6B02A386312F68D82CF86824D3606) at 85.31.186.26:443
Dec 14 11:46:31.000 [info] entry_guards_update_primary():   2/3: [bridge] ($C73ADBAC8ADFDBF0FC0F3F4E8091C0107D093716) at 154.35.22.9:12166
Dec 14 11:46:31.000 [info] entry_guards_update_primary():   3/3: [bridge] ($A832D176ECD5C7C6B58825AE22FC4C90FA249637) at 154.35.22.11:80
...
Dec 14 11:46:33.000 [info] sample_reachable_filtered_entry_guards(): Trying to sample a reachable guard: We know of 20 in the USABLE_FILTERED set.
Dec 14 11:46:33.000 [info] sample_reachable_filtered_entry_guards():   (After filters [17], we have 1 guards to consider.)
Dec 14 11:46:33.000 [info] sample_reachable_filtered_entry_guards():   (Selected [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) at 83.212.101.3:50002.)
Dec 14 11:46:33.000 [info] select_entry_guard_for_circuit(): No primary or confirmed guards available. Selected random guard [bridge] ($A09D536DD1752D542E1FBB3C9CE4449D51298239) at 83.212.101.3:50002 for circuit. Will try other guards before using this circuit.
Dec 14 11:46:33.000 [info] get_transport_by_bridge_addrport(): Looked up transport for 83.212.101.3:80: no such bridge found.
...
Dec 14 11:46:33.000 [info] connection_or_note_state_when_broken(): Connection died in state 'handshaking (TLS) with SSL state SSLv3/TLS write client hello in HANDSHAKE'
Dec 14 11:46:33.000 [warn] Problem bootstrapping. Stuck at 85%: Finishing handshake with first hop. (DONE; DONE; count 1; recommendation warn; host A09D536DD1752D542E1FBB3C9CE4449D51298239 at 83.212.101.3:80)

So, two interesting things here:

  1. Why did we pick LeifEricson for our connection the second time, after we had chosen 3 other primary guards? The line "After filters [17], we have 1 guards to consider." is suggestive here. Note that 17 is hex here, so we are setting every one of SAMPLE_EXCLUDE_CONFIRMED, SAMPLE_EXCLUDE_PRIMARY, SAMPLE_EXCLUDE_PENDING, and SAMPLE_EXCLUDE_NO_DESCRIPTOR. This means we're calling sample_reachable_filtered_entry_guards() from select_entry_guard_for_circuit() with GUARD_USAGE_TRAFFIC set. But we have no descriptors for any of our primary guards, so we're not using any of them. Whoops!
  2. When we connect to LeifEricson, we decide that it doesn't have a transport. But look at what's happening here: we say that we've selected 83.212.101.3:50002 for the circuit, but when we look up the transport, we log "Looked up transport for 83.212.101.3:80". So we're looking up the wrong transport entirely.

comment:30 Changed 6 weeks ago by nickm

  1. Why did we pick LeifEricson for our connection the second time, after we had chosen 3 other primary guards?

So, what should have happened here is that we should have tried to fetch our primary guards' descriptors before building circuits, rather than falling back to an entry this far down in the guard set. The logic that's supposed to stop that is in guard_selection_get_err_str_if_dir_info_missing(). But also, I don't see us launching requests for any of the new primary guards! Something's going wrong there.

  2. So we're looking up the wrong transport entirely.

I think this is an artifact of looking at conn->addr instead of conn->real_addr.

comment:31 Changed 6 weeks ago by nickm

I think this is an artifact of looking at conn->addr instead of conn->real_addr.

Wait. In this case, addr and real_addr will be the same. The problem is that there is no real_port, maybe?

comment:32 Changed 6 weeks ago by nickm

But also, I don't see us launching requests for any of the new primary guards! Something's going wrong there.

Ah, we are sort-of-trying. But the first time we try, at 11:46:32, we delay dir fetches because "pt proxies still configuring", and 2 seconds later, at 11:46:34, we delay dir fetches because DisableNetwork is set. So there's no real chance to fetch better descriptors.

comment:33 Changed 6 weeks ago by nickm

So, I have two theories now based on this log information: one is something I had suspected before, and one is something I hadn't realized.

All of the fixes are in a branch 24367_some_more_030 in my public repository. But instead of testing that one, I recommend that you test 24367_example_merge_032, which also has the log improvements I made before, as well as being merged to 0.3.2 in order to make sure Teor's fixes are in.

comment:34 Changed 6 weeks ago by nickm

(These branches are a little ugly; I should revise and comment more before review)

comment:35 Changed 6 weeks ago by gk

The particular error seems to be gone but now I am stuck at

Dec 14 18:04:18.000 [notice] Opening Socks listener on 127.0.0.1:9150
Dec 14 18:04:20.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for 2/2 of our primary entry guards (total microdescriptors: 6502/6502).

Attached is the debug log.

comment:36 Changed 6 weeks ago by teor

After the bridges and PT are changed, this is what happens:

Dec 14 19:48:33.000 [info] register_client_proxy(): Successfully registered transport obfs2
Dec 14 19:48:33.000 [info] register_client_proxy(): Successfully registered transport obfs3
Dec 14 19:48:33.000 [info] register_client_proxy(): Successfully registered transport obfs4
Dec 14 19:48:33.000 [info] register_client_proxy(): Successfully registered transport scramblesuit
Dec 14 19:48:34.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for 2/2 of our primary entry guards (total microdescriptors: 6474/6486).
Dec 14 19:48:34.000 [info] launch_descriptor_downloads(): Launching 1 request for 12 microdescs, 12 at a time
Dec 14 19:48:34.000 [info] select_entry_guard_for_circuit(): Selected primary guard [bridge] ($00DC6C4FA49A65BD1472993CF6730D54F11E0DBB) at 154.35.22.12:80 for circuit.
Dec 14 19:48:34.000 [notice] Ignoring directory request, since no bridge nodes are available yet.

Notice how we're missing bridge descriptors, but we try to fetch microdescriptors?

So all that code asn wrote to fetch microdescs harder isn't going to help here. Instead, we need to try bridges harder when we are missing bridge descriptors. Maybe we should start by resetting their download statuses?

Also, we should think about whether this code works when we are not using microdescriptors. Does it specifically trigger a microdesc fetch? Or is it just triggering descriptor fetches?

Note: See TracTickets for help on using tickets.