Opened 6 years ago

Closed 5 years ago

#11156 closed defect (fixed)

"can't find a pluggable transport proxy" errors when obfsproxy started up

Reported by: asn
Owned by:
Priority: Medium
Milestone: Tor: 0.2.5.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: 024-backport, tor-pt, tor-client, tbb-needs, nickm-review-0254, tbb-tor-backported-3.6b1
Cc: mcs, brade, mikeperry
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by asn)

brade in #10418 gave some reproducible instructions on how to trigger the "can't find a pluggable transport proxy" error, even when obfsproxy has no trouble starting up.

You can find brade's post in comment:40:ticket:10418

We were supposed to connect to bridge '1.3.4.5:12345' using pluggable transport 'obfs3', but we can't find a pluggable transport proxy supporting 'obfs3'. This can happen if you haven't provided a ClientTransportPlugin line, or if your pluggable transport proxy stopped running.

Child Tickets

Attachments (2)

bug_11156.txt (13.2 KB) - added by asn 6 years ago.
bug11156_60s.txt (15.2 KB) - added by asn 6 years ago.


Change History (44)

comment:1 Changed 6 years ago by asn

Description: modified (diff)

comment:2 Changed 6 years ago by asn

After a little bit of debugging, I found out that the errors were thrown because we tried to create an OR connection before the managed proxy had finished configuring (managed proxies are configured with a configuration protocol, which takes a bit of time).

ATM, we are blocking bridge descriptor fetches when PTs haven't finished configuring, by doing:

  if (pt_proxies_configuration_pending())
    return;

in fetch_bridge_descriptors(). But apparently, blocking that function is not sufficient to postpone all OR connections.

For example, the errors in brade's bug were triggered via the following:

...
#12 0x0000555555633c43 in directory_initiate_command (if_modified_since=0, payload_len=0, payload=0x0, resource=0x55555568a2c7 "microdesc", indirection=DIRIND_ONEHOP, router_purpose=0 '\000', dir_purpose=14 '\016', 
    digest=0x555555c5f27c "\240\235Sm\321u-T.\037\273<\234\344D\235Q)\202\071\020\313\031S", dir_port=0, or_port=<optimized out>, _addr=0x7fffffffdfa0, address=<optimized out>) at src/or/directory.c:878
#13 directory_get_from_dirserver (dir_purpose=dir_purpose@entry=14 '\016', router_purpose=router_purpose@entry=0 '\000', resource=<optimized out>, resource@entry=0x55555568a2c7 "microdesc", pds_flags=pds_flags@entry=2)
    at src/or/directory.c:467
#14 0x000055555558c354 in update_consensus_networkstatus_downloads (now=now@entry=1394201439) at src/or/networkstatus.c:767
#15 0x000055555558dd40 in update_networkstatus_downloads (now=1394201439) at src/or/networkstatus.c:906
#16 0x0000555555586c0d in run_scheduled_events (now=1394201439) at src/or/main.c:1468
....

codepath. This means that maybe we should also block entry to update_consensus_networkstatus_downloads() when pt_proxies_configuration_pending() is true. I made a small patch that does that, and it seems to eliminate the error messages in this case.

But how can we be sure that we have blocked all the relevant codepaths that might launch an OR connection before PTs have been configured?
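The gating described in this comment can be sketched as a small model (hypothetical and heavily simplified — the names `proxies`, `fetch_bridge_descriptors_model()`, and the state enum are illustrative stand-ins, not the actual Tor code): each managed proxy starts out "configuring", and any caller that would open an OR connection bails out while at least one proxy is still in that state.

```c
#include <assert.h>

#define MAX_PROXIES 8

enum proxy_state { PROXY_CONFIGURING, PROXY_COMPLETED };

static enum proxy_state proxies[MAX_PROXIES];
static int n_proxies = 0;

/* Returns nonzero if any managed proxy is still mid-configuration. */
static int
pt_proxies_configuration_pending_model(void)
{
  for (int i = 0; i < n_proxies; i++)
    if (proxies[i] == PROXY_CONFIGURING)
      return 1;
  return 0;
}

static int fetches_launched = 0;

/* Mirrors the early-return guard in fetch_bridge_descriptors(). */
static void
fetch_bridge_descriptors_model(void)
{
  if (pt_proxies_configuration_pending_model())
    return;            /* postpone: a PT proxy is not ready yet */
  fetches_launched++;  /* stand-in for launching the real fetch */
}
```

The open question in the comment is exactly that this guard sits in only one of several codepaths that can launch an OR connection.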

comment:3 Changed 6 years ago by asn

FWIW, the simple patch I tried is:

diff --git a/src/or/main.c b/src/or/main.c
index b0529cd..366d583 100644
--- a/src/or/main.c
+++ b/src/or/main.c
@@ -1462,7 +1462,9 @@ run_scheduled_events(time_t now)
   if (time_to_download_networkstatus < now && !options->DisableNetwork) {
     time_to_download_networkstatus =
       now + networkstatus_dl_check_interval(options);
-    update_networkstatus_downloads(now);
+    if (!pt_proxies_configuration_pending()) {
+        update_networkstatus_downloads(now);
+    }
   }

This seems to happen on the second restart, because we've already cached the descriptors of the bridges, so Tor jumps directly to consensus download.

comment:4 Changed 6 years ago by asn

Here is a better patch:

diff --git a/src/or/networkstatus.c b/src/or/networkstatus.c
index 2b0242b..8fad960 100644
--- a/src/or/networkstatus.c
+++ b/src/or/networkstatus.c
@@ -31,6 +31,7 @@
 #include "router.h"
 #include "routerlist.h"
 #include "routerparse.h"
+#include "transports.h"
 
 /** Map from lowercase nickname to identity digest of named server, if any. */
 static strmap_t *named_server_map = NULL;
@@ -888,7 +889,8 @@ update_consensus_networkstatus_fetch_time(time_t now)
 int
 should_delay_dir_fetches(const or_options_t *options)
 {
-  if (options->UseBridges && !any_bridge_descriptors_known()) {
+  if (options->UseBridges &&
+      (!any_bridge_descriptors_known() || pt_proxies_configuration_pending())){
     log_info(LD_DIR, "delaying dir fetches (no running bridges known)");
     return 1;
   }

Nick, opinions? How can we be sure that we are blocking all OR connections till the managed proxies have finished configuring?

comment:5 Changed 6 years ago by brade

Cc: mcs brade added

comment:6 Changed 6 years ago by nickm

Status: new → needs_review

comment:7 Changed 6 years ago by nickm

If I'm not mistaken, on a client, two things can cause us to launch an OR connection: an attempt to fetch directory info, and an attempt to build a circuit for traffic.

I believe that if should_delay_dir_fetches() is true, we won't try to launch any directory fetches.

If we don't have enough directory info, then we won't try to build circuits, because router_have_minimum_dir_info() will return false. Also, if should_delay_dir_fetches() is true, we won't update have_minimum_dir_info in update_have_minimum_dir_info() [*], so even if we do have sufficient directory info in the cache, I believe we won't acknowledge that until should_delay_dir_fetches() returns false.

So the approach above seems sound-ish to me. I'm fine merging this patch if you've tested it, and it comes with a changes file.

[*] We will need to update the notice message and status message that come from update_router_have_minimum_dir_info() when should_delay_dir_fetches() returns true.
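The two gates nickm describes can be modeled like this (a hypothetical simplified sketch — the `_model` names and flag variables are illustrative, not the actual functions in networkstatus.c or nodelist.c): while should_delay_dir_fetches() is true, no directory fetch is launched, and have_minimum_dir_info is never set, so circuit building stays blocked even when a cached consensus would otherwise suffice.

```c
#include <assert.h>

static int use_bridges = 1;
static int bridge_descriptors_known = 0;
static int pt_config_pending = 1;
static int have_minimum_dir_info = 0;

/* Models the patched should_delay_dir_fetches() condition. */
static int
should_delay_dir_fetches_model(void)
{
  return use_bridges &&
         (!bridge_descriptors_known || pt_config_pending);
}

/* Gate 1: no directory fetches while delaying. */
static int
may_launch_dir_fetch(void)
{
  return !should_delay_dir_fetches_model();
}

/* Gate 2: have_minimum_dir_info is never set while delaying, so
 * circuit building stays blocked even with a cached consensus. */
static void
update_have_minimum_dir_info_model(int cached_dir_info_sufficient)
{
  if (should_delay_dir_fetches_model()) {
    have_minimum_dir_info = 0;
    return;
  }
  have_minimum_dir_info = cached_dir_info_sufficient;
}
```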

comment:8 Changed 6 years ago by mikeperry

Keywords: tbb-needs added

comment:9 Changed 6 years ago by asn

OK. How about bug11156 at https://git.torproject.org/user/asn/tor.git?

comment:10 Changed 6 years ago by nickm

Status: needs_review → needs_revision

msg_out needs to be documented.

QOI: msg_out should be set to NULL if we aren't delaying.

What limits the frequency with which we log this message?

Other than that, looks okay.

comment:11 in reply to:  10 Changed 6 years ago by asn

Status: needs_revision → needs_review

Replying to nickm:

msg_out needs to be documented.

QOI: msg_out should be set to NULL if we aren't delaying.

Fixed in fixup commit. Please check the branch again.

What limits the frequency with which we log this message?

Hm, you mean of the log_notice() in update_router_have_minimum_dir_info()? Good question.

IIUC, we normally don't reach the log_notice() if we can't connect to bridges, because we don't have a consensus at all (it will fail the checks above).

If we have a previously cached consensus, we might reach the log_notice(), in which case it will print "Delaying dir fetches: PT proxies are still configuring" until the PTs finish configuring. This usually takes 2 run_scheduled_events() ticks, so that message should not appear too many times. If you want, we can turn it into an info-level log in the PT case (as opposed to the "no bridge descriptors known" case).

comment:12 Changed 6 years ago by nickm

Resolution: fixed
Status: needs_review → closed

looks fine, merged.

(WRT message frequency, I think we can see whether it happens too much in practice, and if so, rate-limit it.)
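If the message did turn out to fire too often in practice, one simple way to rate-limit it would be an interval-based throttle like the following (a hypothetical sketch — Tor has its own rate-limiting helpers, and the `log_ratelim_t` struct and field names here are illustrative):

```c
#include <assert.h>
#include <time.h>

typedef struct {
  time_t last_allowed;   /* when we last let a message through */
  int interval;          /* minimum seconds between messages */
  int suppressed;        /* messages dropped since last_allowed */
} log_ratelim_t;

/* Returns nonzero if the caller may log now; otherwise counts the
 * suppressed message instead of emitting it. */
static int
ratelim_is_allowed(log_ratelim_t *lim, time_t now)
{
  if (now >= lim->last_allowed + lim->interval) {
    lim->last_allowed = now;
    lim->suppressed = 0;
    return 1;
  }
  lim->suppressed++;
  return 0;
}
```

A caller would check `ratelim_is_allowed()` before each log_notice() and could fold the suppressed count into the next allowed message.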

comment:13 Changed 6 years ago by nickm

(TBB folks, did you need an 0.2.4 backport of this? If it doesn't break anything in 0.2.5, it seems simple enough for a backport candidate...)

comment:14 in reply to:  13 Changed 6 years ago by mcs

Cc: mikeperry added

Replying to nickm:

(TBB folks, did you need an 0.2.4 backport of this? If it doesn't break anything in 0.2.5, it seems simple enough for a backport candidate...)

Adding Mike to the CC.

I think we do want this for TBB 3.6, but that is Mike's call. The messages are confusing, which will add to the TBB support burden.

comment:15 Changed 6 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.4.x-final
Resolution: fixed
Status: closed → reopened

comment:16 Changed 6 years ago by nickm

Status: reopened → needs_review

Assuming this doesn't make 0.2.5 explode, it's backportable.

comment:17 Changed 6 years ago by andrea

Keywords: 024-backport added

Changed 6 years ago by asn

Attachment: bug_11156.txt added

comment:18 Changed 6 years ago by asn

I think there might be a bug in my patch here.
My fix doesn't seem to interact well with the RESETCONF control command.

I attached an info-level log from the current tor master when used with OONI.
OONI uses txtorcon which sends the RESETCONF signal.

If you check the logs, you can see that the pluggable transport proxy is configured properly. But at some point the torrc is re-read (there is a second "opening log file" message) and from that point on things go weird.

Specifically, the transport proxies are marked as HUPed (which means that Tor is supposed to examine them in case they need to be restarted), and when Tor reaches should_delay_dir_fetches() it delays the dir fetches, because pt_proxies_configuration_pending() returns true when check_if_restarts_needed is set (because the proxies got HUPed).

Then, for some reason, it hits should_delay_dir_fetches() a few more times in the same second, and then it decides that the bridge is unusable and kills it. It finally reaches the end of run_scheduled_events(), where it runs pt_configure_remaining_proxies() and decides that the proxy doesn't need to be restarted (pt_configure_remaining_proxies(): Nothing changed for managed proxy '/usr/local/bin/obfsproxy' after HUP: not restarting.).

If should_delay_dir_fetches() was run after that, it wouldn't delay and the fetches would run normally. However at that time, the bridge is already dead...

I tried to reproduce this by sending a SIGHUP (so that I could attach a debugger), but it worked fine. I'm wondering why all the should_delay_dir_fetches() calls ran in the same tick.

comment:19 Changed 6 years ago by asn

Long story short, it seems that there might be another 60s bug here.

I managed to reproduce this bug without OONI. You just need to send a well-timed RESETCONF command right after hitting 10% bootstrapping.

I think what happens is that Tor connects to the bridge over the PT and gets the descriptor. But at the same time, RESETCONF is called and the PT proxies are marked for potential restart. From then on, should_delay_dir_fetches() returns true because of the RESETCONF until pt_configure_remaining_proxies() gets called (at the end of the run_scheduled_events() loop). So all the dir fetches triggered by the new bridge descriptor get cancelled.

Finally, when Tor calls update_consensus_networkstatus_downloads() after 60s, it proceeds bootstrapping normally.

Will look more into this.

Changed 6 years ago by asn

Attachment: bug11156_60s.txt added

comment:20 Changed 6 years ago by asn

For example, decreasing the time in networkstatus_dl_check_interval() fixed this for me.

Would it make sense to decrease the time there (to 5 or 10s or so) if we don't have a workable consensus?
If we have a workable consensus, 60s should be fine.

comment:21 Changed 6 years ago by nickm

Okay, this is sounding non-backportable, but the new stuff needs review. (Is the RESETCONF issue so bad that we should just revert the fix and push this into 0.2.6? Did RESETCONF work before we merged this?)

I really don't like "solve the problem by lowering an interval" fixes; those intervals are there to avoid runaway loops. If there's something that needs to get done faster than the "check" interval, the right solution is to have the thing that should trigger a re-check reset the timer for the interval.
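The pattern nickm prefers can be sketched as follows (a hypothetical model, not Tor's actual scheduler — the variable and function names are illustrative): rather than shrinking the check interval globally, the event that makes a re-check useful resets the timer so the next scheduled pass fires immediately.

```c
#include <assert.h>
#include <time.h>

static time_t time_to_check = 0;     /* next scheduled check */
static const int CHECK_INTERVAL = 60;
static int checks_run = 0;

/* Periodic callback: runs the check at most once per interval. */
static void
run_scheduled_events_model(time_t now)
{
  if (time_to_check < now) {
    time_to_check = now + CHECK_INTERVAL;
    checks_run++;                    /* stand-in for the real check */
  }
}

/* Called by whatever event should trigger an early re-check
 * (e.g. "PT proxies just finished configuring"). */
static void
reschedule_check_now(void)
{
  time_to_check = 0;                 /* fire on the next event-loop tick */
}
```

The interval still bounds the steady-state rate, so there is no runaway loop, but the interesting event no longer has to wait out the full 60 seconds.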

comment:22 Changed 6 years ago by nickm

Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final
Status: needs_review → needs_revision

comment:23 Changed 6 years ago by nickm

(Also, is this 60-second delay related to #9229?)

comment:24 in reply to:  21 Changed 6 years ago by asn

Replying to nickm:

Okay, this is sounding non-backportable, but the new stuff needs review. (Is the RESETCONF issue so bad that we should just revert the fix and push this into 0.2.6? Did RESETCONF work before we merged this?)

I'm not sure if we should revert this fix, but it indeed does expose another 60s bug.

FWIW, RESETCONF works currently without any issues. It's just that when you call RESETCONF (or any other command that re-reads the torrc) between the point where you get the bridge descriptor and the point where you need to fetch the other directory documents, Tor gets confused and causes a 60s bug.

This 60s bug is not #9229, or at least the fix of #9229 doesn't fix it. The fix of #9229 asks for a fetch of all directory documents when we receive the first bridge descriptor. In this case, that fetch will get delayed because we have just re-read our torrc (because RESETCONF), and our PT proxies are marked for possible restart (part of the SIGHUP mechanism), and hence should_delay_dir_fetches() returns true.

Did this make it more clear?

I really don't like "solve the problem by lowering an interval" fixes; those intervals are there to avoid runaway loops. If there's something that needs to get done faster than the "check" interval, the right solution is to have the thing that should trigger a re-check reset the timer for the interval.

I see what you are saying.

Then a fix here might be to call routerlist_retry_directory_downloads() when our PTs move from the "might need to be restarted" state to the "restarted/not restarted" state: this happens on every SIGHUP or config re-read (SETCONF/RESETCONF/etc.).

However, if we do this we will end up calling that function needlessly sometimes (actually almost every time, since this bug needs good timing to trigger).

Maybe a better fix would be to call routerlist_retry_directory_downloads() when our PTs move to the "restarted/not restarted" phase, but only if they had delayed a descriptor fetch? This would be a bit messier to code, but I think it's doable.

(PS: Pluggable transport proxies are marked as "might need to be restarted" on every config read, so that we can figure out whether their ClientTransportPlugin/ServerTransportPlugin line changed. If their torrc line changed, Tor will try to restart them in place, to provide a seamless SIGHUP experience. At the end of run_scheduled_events() we run pt_configure_remaining_proxies(), which moves the PT proxies from the "might need to be restarted" state to the "restarted/not restarted" state.)
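The per-proxy lifecycle described in the PS can be sketched as a small state machine (a hypothetical simplified model — the `managed_proxy_model_t` struct and function names are illustrative, not the actual transports.c code):

```c
#include <assert.h>

enum mp_state {
  MP_RUNNING,          /* configured and serving */
  MP_MAYBE_RESTART,    /* config was re-read; line may have changed */
};

typedef struct {
  enum mp_state state;
  int line_changed;    /* did its ClientTransportPlugin line change? */
  int restarted;
} managed_proxy_model_t;

/* Every config read marks proxies for possible restart. */
static void
mark_proxies_on_config_read(managed_proxy_model_t *mp, int line_changed)
{
  mp->state = MP_MAYBE_RESTART;
  mp->line_changed = line_changed;
}

/* pt_configure_remaining_proxies() analogue: resolve the pending state. */
static void
configure_remaining_proxies_model(managed_proxy_model_t *mp)
{
  if (mp->state != MP_MAYBE_RESTART)
    return;
  mp->restarted = mp->line_changed;  /* restart only if the line changed */
  mp->state = MP_RUNNING;
}

/* While any proxy sits in MP_MAYBE_RESTART, configuration counts as
 * pending, which is what makes should_delay_dir_fetches() return true
 * between the config re-read and the end of the scheduler tick. */
static int
configuration_pending_model(const managed_proxy_model_t *mp)
{
  return mp->state == MP_MAYBE_RESTART;
}
```

The race in this ticket is the window in which the proxy sits in the "maybe restart" state even though nothing actually changed.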

comment:25 Changed 6 years ago by nickm

I'm not sure why the right fix here should involve changes to the directory fetch logic at all. Wouldn't it be better to make it so that instead of having PTs go into a "maybe reset this; I'll figure it out later" state, we identify the ones that don't need to be reset immediately, and leave them alone? Would that be super-hard?

I'm also not clear on why we're using got_hup to indicate something that isn't "received a HUP" but is instead "the configuration changed at some point".

comment:26 in reply to:  25 Changed 6 years ago by asn

Replying to nickm:

I'm not sure why the right fix here should involve changes to the directory fetch logic at all. Wouldn't it be better to make it so that instead of having PTs go into a "maybe reset this; I'll figure it out later" state, we identify the ones that don't need to be reset immediately, and leave them alone? Would that be super-hard?

I think this should be doable. It would probably require some non-trivial refactoring of the SIGHUP logic of src/or/transports.c (that part of the code is quite messy).

I can look into this, but it might take me a few days.

I'm also not clear why we're using got_hup to indicate something that isn't "received a HUP" but is instead "the configuration changed at some point"

That's a bad variable name. When I wrote that code, I didn't know much about Tor, and I thought that the only reason the config would be re-read was a SIGHUP.

comment:27 Changed 6 years ago by nickm

Hm, okay. We should also consider the possibility, if the refactoring turns out to be complex, that we should revert this whole business until 0.2.6.

comment:28 Changed 6 years ago by asn

Status: needs_revision → needs_review

Maybe such a big refactoring wasn't required after all. Check out branch bug11156_issue2 in my repo.

I tried the solution of calling pt_configure_remaining_proxies() right after parsing the transport-related parts of torrc. It seems to work nicely in my tests.

I'd like to think a bit more about this, but I think there is no problem with calling pt_configure_remaining_proxies() that early. It doesn't do any network operations, and doesn't depend on anything but stuff in src/or/transports.c.

comment:29 Changed 6 years ago by nickm

This looks plausible and the overall diff looks short-ish, though it'll take a bit of time to trace through all the logic. How much have you tested this? How should we test it?

comment:30 Changed 6 years ago by asn

I tested it using the same OONI test that I originally discovered the bug with. OONI is using txtorcon, and txtorcon sends a RESETCONF to Tor at just the right moment to trigger this bug.

I tested OONI with my new branch and it worked. If I test it with the current master, it takes 60s to bootstrap.

I will try to think of better ways to test this.

comment:31 Changed 6 years ago by nickm

Keywords: nickm-review-0254 added

comment:32 Changed 6 years ago by nickm

Hm. The actual change is only a two-line change in 1a3eb5c72dd0feb43a542ca465c57dd0801ff7cc. The other stuff is just comments and flag names. That's good.

The part I'm trying to figure out is, I'm comparing this patch to the part of run_scheduled_events() where we do:

  if (!net_is_disabled() && pt_proxies_configuration_pending())
    pt_configure_remaining_proxies();

Should this patch also check !net_is_disabled() in config.c?

comment:33 in reply to:  32 Changed 6 years ago by asn

Replying to nickm:

Hm. The actual change is only a two-line change in 1a3eb5c72dd0feb43a542ca465c57dd0801ff7cc. The other stuff is just comments and flag names. That's good.

Yes, the actual changes are not very many.

The part I'm trying to figure out is, I'm comparing this patch to the part of run_scheduled_events() where we do:

  if (!net_is_disabled() && pt_proxies_configuration_pending())
    pt_configure_remaining_proxies();

Should this patch also check !net_is_disabled() in config.c?

Hm, I think it should. Thanks for catching that. I will write a new patch tomorrow.

As I understand it, the !net_is_disabled() check is there so that PT proxies don't start up their listeners as long as DisableNetwork is enabled. This needs to happen in config.c too, since calling pt_configure_remaining_proxies() for the first time will set the managed-proxy environment variables and launch the PT proxy, which will then start up its listeners.
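The guard being discussed can be sketched like this (a hypothetical simplified model — the `_model` names and the flag variables are illustrative stand-ins for the real net_is_disabled()/pt_proxies_configuration_pending() calls): the config.c call site mirrors the one already in run_scheduled_events(), so proxies are never launched (and their listeners never come up) while DisableNetwork is set.

```c
#include <assert.h>

static int network_disabled = 0;
static int pending_proxies = 1;
static int proxies_launched = 0;

static int net_is_disabled_model(void) { return network_disabled; }

static int
pt_proxies_configuration_pending_model(void)
{
  return pending_proxies > 0;
}

/* First-time configuration launches the PT and its listeners. */
static void
pt_configure_remaining_proxies_model(void)
{
  proxies_launched += pending_proxies;
  pending_proxies = 0;
}

/* What the config.c call site would look like with the check added. */
static void
after_config_read_model(void)
{
  if (!net_is_disabled_model() && pt_proxies_configuration_pending_model())
    pt_configure_remaining_proxies_model();
}
```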

comment:34 Changed 6 years ago by asn

Status: needs_review → needs_revision

comment:35 Changed 6 years ago by asn

Status: needs_revision → needs_review

Please pull my branch bug11156_issue2 again.

comment:36 Changed 6 years ago by nickm

Resolution: fixed
Status: needs_review → closed

Okay; squashed and merged to 0.2.5

comment:37 Changed 6 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.4.x-final
Resolution: fixed
Status: closed → reopened

comment:38 Changed 6 years ago by nickm

Status: reopened → needs_review

comment:39 Changed 6 years ago by mikeperry

Keywords: tbb-tor-backported-3.6b1 added

FTR: At asn's suggestion, I took 1a3eb5c72dd0feb43a542ca465c57dd0801ff7cc and 4719a2f5248b8cf6d70daef91fd1cf9fd65628f4 from asn/bug11156_issue2 and added them to https://gitweb.torproject.org/builders/tor-browser-bundle.git/blob/HEAD:/gitian/patches/bug11156.patch. These additions will go out in the next 3.6 release.

Prior to those additions, the patch has been in TBB 3.6 since 3.6-beta-1.

comment:40 Changed 5 years ago by nickm

Recommendation, no backport to 0.2.4. (persuade arma that this is causing huge problems in 0.2.4 if I'm wrong.)

comment:41 Changed 5 years ago by arma

Given that I just found a bug in it (#11654), and that otherwise TBB seems to be happy with its patch, I'm fine with no backport here.

comment:42 Changed 5 years ago by nickm

Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final
Resolution: fixed
Status: needs_review → closed

Okay, no backport to 0.2.4 for these, for stability reasons.
