Opened 6 years ago

Closed 5 years ago

#11156 closed defect (fixed)

"can't find a pluggable transport proxy" errors when obfsproxy started up

Reported by: asn
Owned by:
Priority: Medium
Milestone: Tor: 0.2.5.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: 024-backport, tor-pt, tor-client, tbb-needs, nickm-review-0254, tbb-tor-backported-3.6b1
Cc: mcs, brade, mikeperry
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by asn)

brade in #10418 gave some reproducible instructions on how to trigger the "can't find a pluggable transport proxy" error, even when obfsproxy has no trouble starting up.

You can find brade's post in comment:40:ticket:10418

We were supposed to connect to bridge '1.3.4.5:12345' using pluggable transport 'obfs3', but we can't find a pluggable transport proxy supporting 'obfs3'. This can happen if you haven't provided a ClientTransportPlugin line, or if your pluggable transport proxy stopped running.

Child Tickets

Attachments (2)

bug_11156.txt (13.2 KB) - added by asn 6 years ago.
bug11156_60s.txt (15.2 KB) - added by asn 6 years ago.


Change History (44)

comment:1 Changed 6 years ago by asn

Description: modified (diff)

comment:2 Changed 6 years ago by asn

After a little bit of debugging, I found out that the errors were thrown because we tried to create an OR connection before the managed proxy had finished configuring (managed proxies are configured with a configuration protocol, which takes a bit of time).

ATM, we are blocking bridge descriptor fetches when PTs haven't finished configuring, by doing:

  if (pt_proxies_configuration_pending())
    return;

in fetch_bridge_descriptors(). But apparently, blocking that function is not sufficient to postpone all OR connections.

For example, the errors in brade's bug were triggered via the following:

...
#12 0x0000555555633c43 in directory_initiate_command (if_modified_since=0, payload_len=0, payload=0x0, resource=0x55555568a2c7 "microdesc", indirection=DIRIND_ONEHOP, router_purpose=0 '\000', dir_purpose=14 '\016', 
    digest=0x555555c5f27c "\240\235Sm\321u-T.\037\273<\234\344D\235Q)\202\071\020\313\031S", dir_port=0, or_port=<optimized out>, _addr=0x7fffffffdfa0, address=<optimized out>) at src/or/directory.c:878
#13 directory_get_from_dirserver (dir_purpose=dir_purpose@entry=14 '\016', router_purpose=router_purpose@entry=0 '\000', resource=<optimized out>, resource@entry=0x55555568a2c7 "microdesc", pds_flags=pds_flags@entry=2)
    at src/or/directory.c:467
#14 0x000055555558c354 in update_consensus_networkstatus_downloads (now=now@entry=1394201439) at src/or/networkstatus.c:767
#15 0x000055555558dd40 in update_networkstatus_downloads (now=1394201439) at src/or/networkstatus.c:906
#16 0x0000555555586c0d in run_scheduled_events (now=1394201439) at src/or/main.c:1468
....

codepath. This means that maybe we should also block entry to update_consensus_networkstatus_downloads() when pt_proxies_configuration_pending() is true. I made a small patch that does that, and it seems to eliminate the error messages in this case.

But how can we be sure that we have blocked all the relevant codepaths that might launch an OR connection before PTs have been configured?
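The gating described in this comment can be sketched as a small model (hypothetical and heavily simplified — the names `proxies`, `fetch_bridge_descriptors_model()`, and the state enum are illustrative stand-ins, not the actual Tor code): each managed proxy starts out "configuring", and any caller that would open an OR connection bails out while at least one proxy is still in that state.

```c
#include <assert.h>

#define MAX_PROXIES 8

enum proxy_state { PROXY_CONFIGURING, PROXY_COMPLETED };

static enum proxy_state proxies[MAX_PROXIES];
static int n_proxies = 0;

/* Returns nonzero if any managed proxy is still mid-configuration. */
static int
pt_proxies_configuration_pending_model(void)
{
  for (int i = 0; i < n_proxies; i++)
    if (proxies[i] == PROXY_CONFIGURING)
      return 1;
  return 0;
}

static int fetches_launched = 0;

/* Mirrors the early-return guard in fetch_bridge_descriptors(). */
static void
fetch_bridge_descriptors_model(void)
{
  if (pt_proxies_configuration_pending_model())
    return;            /* postpone: a PT proxy is not ready yet */
  fetches_launched++;  /* stand-in for launching the real fetch */
}
```

The open question in the comment is exactly that this guard sits in only one of several codepaths that can launch an OR connection.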

comment:3 Changed 6 years ago by asn

FWIW, the simple patch I tried is:

diff --git a/src/or/main.c b/src/or/main.c
index b0529cd..366d583 100644
--- a/src/or/main.c
+++ b/src/or/main.c
@@ -1462,7 +1462,9 @@ run_scheduled_events(time_t now)
   if (time_to_download_networkstatus < now && !options->DisableNetwork) {
     time_to_download_networkstatus =
       now + networkstatus_dl_check_interval(options);
-    update_networkstatus_downloads(now);
+    if (!pt_proxies_configuration_pending()) {
+        update_networkstatus_downloads(now);
+    }
   }

This seems to happen on the second restart, because we've already cached the descriptors of the bridges, so Tor jumps directly to consensus download.

comment:4 Changed 6 years ago by asn

Here is a better patch:

diff --git a/src/or/networkstatus.c b/src/or/networkstatus.c
index 2b0242b..8fad960 100644
--- a/src/or/networkstatus.c
+++ b/src/or/networkstatus.c
@@ -31,6 +31,7 @@
 #include "router.h"
 #include "routerlist.h"
 #include "routerparse.h"
+#include "transports.h"
 
 /** Map from lowercase nickname to identity digest of named server, if any. */
 static strmap_t *named_server_map = NULL;
@@ -888,7 +889,8 @@ update_consensus_networkstatus_fetch_time(time_t now)
 int
 should_delay_dir_fetches(const or_options_t *options)
 {
-  if (options->UseBridges && !any_bridge_descriptors_known()) {
+  if (options->UseBridges &&
+      (!any_bridge_descriptors_known() || pt_proxies_configuration_pending())){
     log_info(LD_DIR, "delaying dir fetches (no running bridges known)");
     return 1;
   }

Nick, opinions? How can we be sure that we are blocking all OR connections till the managed proxies have finished configuring?

comment:5 Changed 6 years ago by brade

Cc: mcs brade added

comment:6 Changed 6 years ago by nickm

Status: new → needs_review

comment:7 Changed 6 years ago by nickm

If I'm not mistaken, on a client, two things can cause us to launch an OR connection: an attempt to fetch directory info, and an attempt to build a circuit for traffic.

I believe that if should_delay_dir_fetches() is true, we won't try to launch any directory fetches.

If we don't have enough directory info, then we won't try to build circuits, because router_have_minimum_dir_info() will return false. Also, if should_delay_dir_fetches() is true, we won't update have_minimum_dir_info in update_have_minimum_dir_info() [*], so even if we do have sufficient directory info in the cache, I believe we won't acknowledge that until should_delay_dir_fetches() returns false.

So the approach above seems sound-ish to me. I'm fine merging this patch if you've tested it, and it comes with a changes file.

[*] We will need to update the notice message and status message that come from update_router_have_minimum_dir_info() when should_delay_dir_fetches() returns true.
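The two gates nickm describes can be modeled like this (a hypothetical simplified sketch — the `_model` names and flag variables are illustrative, not the actual functions in networkstatus.c or nodelist.c): while should_delay_dir_fetches() is true, no directory fetch is launched, and have_minimum_dir_info is never set, so circuit building stays blocked even when a cached consensus would otherwise suffice.

```c
#include <assert.h>

static int use_bridges = 1;
static int bridge_descriptors_known = 0;
static int pt_config_pending = 1;
static int have_minimum_dir_info = 0;

/* Models the patched should_delay_dir_fetches() condition. */
static int
should_delay_dir_fetches_model(void)
{
  return use_bridges &&
         (!bridge_descriptors_known || pt_config_pending);
}

/* Gate 1: no directory fetches while delaying. */
static int
may_launch_dir_fetch(void)
{
  return !should_delay_dir_fetches_model();
}

/* Gate 2: have_minimum_dir_info is never set while delaying, so
 * circuit building stays blocked even with a cached consensus. */
static void
update_have_minimum_dir_info_model(int cached_dir_info_sufficient)
{
  if (should_delay_dir_fetches_model()) {
    have_minimum_dir_info = 0;
    return;
  }
  have_minimum_dir_info = cached_dir_info_sufficient;
}
```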

comment:8 Changed 6 years ago by mikeperry

Keywords: tbb-needs added

comment:9 Changed 6 years ago by asn

OK. How about bug11156 at https://git.torproject.org/user/asn/tor.git?

comment:10 Changed 6 years ago by nickm

Status: needs_review → needs_revision

msg_out needs to be documented.

QOI: msg_out should be set to NULL if we aren't delaying.

What limits the frequency with which we log this message?

Other than that, looks okay.

comment:11 in reply to:  10 Changed 6 years ago by asn

Status: needs_revision → needs_review

Replying to nickm:

msg_out needs to be documented.

QOI: msg_out should be set to NULL if we aren't delaying.

Fixed in fixup commit. Please check the branch again.

What limits the frequency with which we log this message?

Hm, you mean of the log_notice() in update_router_have_minimum_dir_info()? Good question.

IIUC, we normally don't reach the log_notice() if we can't connect to bridges, because we don't have a consensus at all (it will fail the checks above).

If we have a previously cached consensus, we might reach the log_notice(), in which case it will print "Delaying dir fetches: PT proxies are still configuring" until the PTs finish configuring. This usually takes 2 run_scheduled_events() ticks, so that message should not appear too many times. If you want, we can turn it into an info-level log in the PT case (as opposed to the "no bridge descriptors known" case).

comment:12 Changed 6 years ago by nickm

Resolution: fixed
Status: needs_review → closed

looks fine, merged.

(WRT message frequency, I think we can see whether it happens too much in practice, and if so, rate-limit it.)
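If the message did turn out to fire too often in practice, one simple way to rate-limit it would be an interval-based throttle like the following (a hypothetical sketch — Tor has its own rate-limiting helpers, and the `log_ratelim_t` struct and field names here are illustrative):

```c
#include <assert.h>
#include <time.h>

typedef struct {
  time_t last_allowed;   /* when we last let a message through */
  int interval;          /* minimum seconds between messages */
  int suppressed;        /* messages dropped since last_allowed */
} log_ratelim_t;

/* Returns nonzero if the caller may log now; otherwise counts the
 * suppressed message instead of emitting it. */
static int
ratelim_is_allowed(log_ratelim_t *lim, time_t now)
{
  if (now >= lim->last_allowed + lim->interval) {
    lim->last_allowed = now;
    lim->suppressed = 0;
    return 1;
  }
  lim->suppressed++;
  return 0;
}
```

A caller would check `ratelim_is_allowed()` before each log_notice() and could fold the suppressed count into the next allowed message.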

comment:13 Changed 6 years ago by nickm

(TBB folks, did you need an 0.2.4 backport of this? If it doesn't break anything in 0.2.5, it seems simple enough for a backport candidate...)

comment:14 in reply to:  13 Changed 6 years ago by mcs

Cc: mikeperry added

Replying to nickm:

(TBB folks, did you need an 0.2.4 backport of this? If it doesn't break anything in 0.2.5, it seems simple enough for a backport candidate...)

Adding Mike to the CC.

I think we do want this for TBB 3.6, but that is Mike's call. The messages are confusing, which will add to the TBB support burden.

comment:15 Changed 6 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.4.x-final
Resolution: fixed
Status: closed → reopened

comment:16 Changed 6 years ago by nickm

Status: reopened → needs_review

Assuming this doesn't make 0.2.5 explode, it's backportable.

comment:17 Changed 6 years ago by andrea

Keywords: 024-backport added

Changed 6 years ago by asn

Attachment: bug_11156.txt added

comment:18 Changed 6 years ago by asn

I think there might be a bug in my patch here.
My fix doesn't seem to interact well with the RESETCONF control command.

I attached an info-level log from the current tor master when used with OONI.
OONI uses txtorcon which sends the RESETCONF signal.

If you check the logs, you can see that the pluggable transport proxy is configured properly. But at some point the torrc is re-read (there is a second "opening log file" message) and from that point on things go weird.

Specifically, the transport proxies are marked as HUPed (which means that Tor is supposed to examine them in case they need to be restarted), and when Tor reaches should_delay_dir_fetches() it delays the dir fetches, because pt_proxies_configuration_pending() returns true when check_if_restarts_needed is set (because the proxies got HUPed).

Then, for some reason, it hits should_delay_dir_fetches() a few more times in the same second, and then it decides that the bridge is unusable and kills it. It finally reaches the end of run_scheduled_events(), where it runs pt_configure_remaining_proxies() and decides that the proxy doesn't need to be restarted (pt_configure_remaining_proxies(): Nothing changed for managed proxy '/usr/local/bin/obfsproxy' after HUP: not restarting.).

If should_delay_dir_fetches() was run after that, it wouldn't delay and the fetches would run normally. However at that time, the bridge is already dead...

I tried to reproduce this by sending a SIGHUP (so that I could attach a debugger), but it worked fine. I'm wondering why all the should_delay_dir_fetches() calls ran in the same tick.

comment:19 Changed 6 years ago by asn

Long story short, it seems that there might be another 60s bug here.

I managed to reproduce this bug without OONI. You just need to send a well-timed RESETCONF command right after hitting 10% bootstrapping.

I think what happens is that Tor connects to the bridge over the PT and gets the descriptor. But at the same time, RESETCONF is called and the PT proxies are marked for potential restart. From then on, should_delay_dir_fetches() returns true because of the RESETCONF until pt_configure_remaining_proxies() gets called (at the end of the run_scheduled_events() loop). So all the dir fetches triggered by the new bridge descriptor get cancelled.

Finally, when Tor calls update_consensus_networkstatus_downloads() after 60s, it proceeds bootstrapping normally.

Will look more into this.

Changed 6 years ago by asn

Attachment: bug11156_60s.txt added

comment:20 Changed 6 years ago by asn

For example, decreasing the time in networkstatus_dl_check_interval() fixed this for me.

Would it make sense to decrease the time there (to 5 or 10s or so) if we don't have a workable consensus?
If we have a workable consensus, 60s should be fine.

comment:21 Changed 6 years ago by nickm

Okay, this is sounding non-backportable, but the new stuff needs review. (Is the RESETCONF issue so bad that we should just revert the fix and push this into 0.2.6? Did RESETCONF work before we merged this?)

I really don't like "solve the problem by lowering an interval" fixes; those intervals are there to avoid runaway loops. If there's something that needs to get done faster than the "check" interval, the right solution is to have the thing that should trigger a re-check reset the timer for the interval.
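The pattern nickm prefers can be sketched as follows (a hypothetical model, not Tor's actual scheduler — the variable and function names are illustrative): rather than shrinking the check interval globally, the event that makes a re-check useful resets the timer so the next scheduled pass fires immediately.

```c
#include <assert.h>
#include <time.h>

static time_t time_to_check = 0;     /* next scheduled check */
static const int CHECK_INTERVAL = 60;
static int checks_run = 0;

/* Periodic callback: runs the check at most once per interval. */
static void
run_scheduled_events_model(time_t now)
{
  if (time_to_check < now) {
    time_to_check = now + CHECK_INTERVAL;
    checks_run++;                    /* stand-in for the real check */
  }
}

/* Called by whatever event should trigger an early re-check
 * (e.g. "PT proxies just finished configuring"). */
static void
reschedule_check_now(void)
{
  time_to_check = 0;                 /* fire on the next event-loop tick */
}
```

The interval still bounds the steady-state rate, so there is no runaway loop, but the interesting event no longer has to wait out the full 60 seconds.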

comment:22 Changed 6 years ago by nickm

Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final
Status: needs_review → needs_revision

comment:23 Changed 6 years ago by nickm

(Also, is this 60-second delay related to #9229?)

comment:24 in reply to:  21 Changed 6 years ago by asn

Replying to nickm:

Okay, this is sounding non-backportable, but the new stuff needs review. (Is the RESETCONF issue so bad that we should just revert the fix and push this into 0.2.6? Did RESETCONF work before we merged this?)

I'm not sure if we should revert this fix, but it indeed does expose another 60s bug.

FWIW, RESETCONF works currently without any issues. It's just that when you call RESETCONF (or any other command that re-reads the torrc) between the point where you get the bridge descriptor and the point where you need to fetch the other directory documents, Tor gets confused and causes a 60s bug.

This 60s bug is not #9229, or at least the fix of #9229 doesn't fix it. The fix of #9229 asks for a fetch of all directory documents when we receive the first bridge descriptor. In this case, that fetch will get delayed because we have just re-read our torrc (because RESETCONF), and our PT proxies are marked for possible restart (part of the SIGHUP mechanism), and hence should_delay_dir_fetches() returns true.

Did this make it more clear?

I really don't like "solve the problem by lowering an interval" fixes; those intervals are there to avoid runaway loops. If there's something that needs to get done faster than the "check" interval, the right solution is to have the thing that should trigger a re-check reset the timer for the interval.

I see what you are saying.

Then a fix here might be to call routerlist_retry_directory_downloads() when our PTs move from the "might need to be restarted" state to the "restarted/not restarted" state: this happens on every SIGHUP or config re-read (SETCONF/RESETCONF/etc.).

However, if we do this we will end up calling that function needlessly sometimes (actually almost every time, since this bug needs good timing to trigger).

Maybe a better fix would be to call routerlist_retry_directory_downloads() when our PTs move to the "restarted/not restarted" phase, but only if they had delayed a descriptor fetch? This would be a bit messier to code, but I think it's doable.

(PS: Pluggable transport proxies are marked as "might need to be restarted" on every config read, so that we can figure out whether their ClientTransportPlugin/ServerTransportPlugin line changed. If their torrc line changed, Tor will try to restart them in place, to provide a seamless SIGHUP experience. At the end of run_scheduled_events() we run pt_configure_remaining_proxies(), which moves the PT proxies from the "might need to be restarted" state to the "restarted/not restarted" state.)
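The per-proxy lifecycle described in the PS can be sketched as a small state machine (a hypothetical simplified model — the `managed_proxy_model_t` struct and function names are illustrative, not the actual transports.c code):

```c
#include <assert.h>

enum mp_state {
  MP_RUNNING,          /* configured and serving */
  MP_MAYBE_RESTART,    /* config was re-read; line may have changed */
};

typedef struct {
  enum mp_state state;
  int line_changed;    /* did its ClientTransportPlugin line change? */
  int restarted;
} managed_proxy_model_t;

/* Every config read marks proxies for possible restart. */
static void
mark_proxies_on_config_read(managed_proxy_model_t *mp, int line_changed)
{
  mp->state = MP_MAYBE_RESTART;
  mp->line_changed = line_changed;
}

/* pt_configure_remaining_proxies() analogue: resolve the pending state. */
static void
configure_remaining_proxies_model(managed_proxy_model_t *mp)
{
  if (mp->state != MP_MAYBE_RESTART)
    return;
  mp->restarted = mp->line_changed;  /* restart only if the line changed */
  mp->state = MP_RUNNING;
}

/* While any proxy sits in MP_MAYBE_RESTART, configuration counts as
 * pending, which is what makes should_delay_dir_fetches() return true
 * between the config re-read and the end of the scheduler tick. */
static int
configuration_pending_model(const managed_proxy_model_t *mp)
{
  return mp->state == MP_MAYBE_RESTART;
}
```

The race in this ticket is the window in which the proxy sits in the "maybe restart" state even though nothing actually changed.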

comment:25 Changed 6 years ago by nickm

I'm not sure why the right fix here should involve changes to the directory fetch logic at all. Wouldn't it be better to make it so that instead of having PTs go into a "maybe reset this; I'll figure it out later" state, we identify the ones that don't need to be reset immediately, and leave them alone? Would that be super-hard?

I'm also not clear on why we're using got_hup to indicate something that isn't "received a HUP" but is instead "the configuration changed at some point".

comment:26 in reply to:  25 Changed 6 years ago by asn

Replying to nickm:

I'm not sure why the right fix here should involve changes to the directory fetch logic at all. Wouldn't it be better to make it so that instead of having PTs go into a "maybe reset this; I'll figure it out later" state, we identify the ones that don't need to be reset immediately, and leave them alone? Would that be super-hard?

I think this should be doable. It would probably require some non-trivial refactoring of the SIGHUP logic of src/or/transports.c (that part of the code is quite messy).

I can look into this, but it might take me a few days.

I'm also not clear why we're using got_hup to indicate something that isn't "received a HUP" but is instead "the configuration changed at some point"

That's a bad variable name. When I wrote that code, I didn't know much about Tor, and I thought that the only reason the config would be re-read was a SIGHUP.

comment:27 Changed 6 years ago by nickm

Hm, okay. We should also consider the possibility, if the refactoring turns out to be complex, that we should revert this whole business until 0.2.6.

comment:28 Changed 6 years ago by asn

Status: needs_revision → needs_review

Maybe such a big refactoring wasn't required after all. Check out branch bug11156_issue2 in my repo.

I tried the solution of calling pt_configure_remaining_proxies() right after parsing the transport-related parts of torrc. It seems to work nicely in my tests.

I'd like to think a bit more about this, but I think there is no problem with calling pt_configure_remaining_proxies() that early. It doesn't do any network operations, and doesn't depend on anything but stuff in src/or/transports.c.

comment:29 Changed 6 years ago by nickm

This looks plausible and the overall diff looks short-ish, though it'll take a bit of time to trace through all the logic. How much have you tested this? How should we test it?

comment:30 Changed 6 years ago by asn

I tested it using the same OONI test that I originally discovered the bug with. OONI is using txtorcon, and txtorcon sends a RESETCONF to Tor at just the right moment to trigger this bug.

I tested OONI with my new branch and it worked. If I test it with the current master, it takes 60s to bootstrap.

I will try to think of better ways to test this.

comment:31 Changed 6 years ago by nickm

Keywords: nickm-review-0254 added

comment:32 Changed 6 years ago by nickm

Hm. The actual change is only a two-line change in 1a3eb5c72dd0feb43a542ca465c57dd0801ff7cc. The other stuff is just comments and flag names. That's good.

The part I'm trying to figure out is, I'm comparing this patch to the part of run_scheduled_events() where we do:

  if (!net_is_disabled() && pt_proxies_configuration_pending())
    pt_configure_remaining_proxies();

Should this patch also check !net_is_disabled() in config.c?

comment:33 in reply to:  32 Changed 6 years ago by asn

Replying to nickm:

Hm. The actual change is only a two-line change in 1a3eb5c72dd0feb43a542ca465c57dd0801ff7cc. The other stuff is just comments and flag names. That's good.

Yes, the actual changes are not very many.

The part I'm trying to figure out is, I'm comparing this patch to the part of run_scheduled_events() where we do:

  if (!net_is_disabled() && pt_proxies_configuration_pending())
    pt_configure_remaining_proxies();

Should this patch also check !net_is_disabled() in config.c?

Hm, I think it should. Thanks for catching that. I will write a new patch tomorrow.

As I understand it, the !net_is_disabled() check is there so that PT proxies don't start up their listeners as long as DisableNetwork is enabled. This needs to happen in config.c too, since calling pt_configure_remaining_proxies() for the first time will set the managed-proxy environment variables and launch the PT proxy, which will then start up its listeners.
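The guard being discussed can be sketched like this (a hypothetical simplified model — the `_model` names and the flag variables are illustrative stand-ins for the real net_is_disabled()/pt_proxies_configuration_pending() calls): the config.c call site mirrors the one already in run_scheduled_events(), so proxies are never launched (and their listeners never come up) while DisableNetwork is set.

```c
#include <assert.h>

static int network_disabled = 0;
static int pending_proxies = 1;
static int proxies_launched = 0;

static int net_is_disabled_model(void) { return network_disabled; }

static int
pt_proxies_configuration_pending_model(void)
{
  return pending_proxies > 0;
}

/* First-time configuration launches the PT and its listeners. */
static void
pt_configure_remaining_proxies_model(void)
{
  proxies_launched += pending_proxies;
  pending_proxies = 0;
}

/* What the config.c call site would look like with the check added. */
static void
after_config_read_model(void)
{
  if (!net_is_disabled_model() && pt_proxies_configuration_pending_model())
    pt_configure_remaining_proxies_model();
}
```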

comment:34 Changed 6 years ago by asn

Status: needs_review → needs_revision

comment:35 Changed 6 years ago by asn

Status: needs_revision → needs_review

Please pull my branch bug11156_issue2 again.

comment:36 Changed 6 years ago by nickm

Resolution: fixed
Status: needs_review → closed

Okay; squashed and merged to 0.2.5

comment:37 Changed 6 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.4.x-final
Resolution: fixed
Status: closed → reopened

comment:38 Changed 6 years ago by nickm

Status: reopened → needs_review

comment:39 Changed 6 years ago by mikeperry

Keywords: tbb-tor-backported-3.6b1 added

FTR: At asn's suggestion, I took 1a3eb5c72dd0feb43a542ca465c57dd0801ff7cc and 4719a2f5248b8cf6d70daef91fd1cf9fd65628f4 from asn/bug11156_issue2 and added them to https://gitweb.torproject.org/builders/tor-browser-bundle.git/blob/HEAD:/gitian/patches/bug11156.patch. These additions will go out in the next 3.6 release.

Prior to those additions, the patch has been in TBB 3.6 since 3.6-beta-1.

comment:40 Changed 5 years ago by nickm

Recommendation, no backport to 0.2.4. (persuade arma that this is causing huge problems in 0.2.4 if I'm wrong.)

comment:41 Changed 5 years ago by arma

Given that I just found a bug in it (#11654), and that otherwise TBB seems to be happy with its patch, I'm fine with no backport here.

comment:42 Changed 5 years ago by nickm

Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final
Resolution: fixed
Status: needs_review → closed

Okay, no backport to 0.2.4 for these, for stability reasons.
