Opened 6 years ago

Closed 5 years ago

Last modified 5 years ago

#9531 closed defect (fixed)

More Torbutton hangs on New Identity control port access

Reported by: mikeperry Owned by: gk
Priority: High Milestone:
Component: Applications/Tor Browser Version:
Severity: Keywords: tbb-usability, tbb-newnym, TorBrowserTeam201408, tbb-torbutton, MikePerry201408R
Cc: gk, mcs, brade, rex_4539 Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by mikeperry)

skruffy is reporting hangs on Torbutton's New Identity. Oddly, the hangs seem to be more common if you are loading a bunch of websites at the time of hitting New Identity, but that is not a full repro case.

The hang actually appears to be happening while attempting to open a control port connection. Perhaps another race introduced by our SOCKS optimistic data hack in nsITransport? Or perhaps our hack for #8642 is insufficient in some cases, and/or during heavy network activity.

Child Tickets

Change History (48)

comment:1 Changed 6 years ago by mikeperry

Description: modified (diff)

comment:2 Changed 6 years ago by cypherpunks

Hang happens exactly after call of "outputStream.writeBytes(auth_cmd, auth_cmd.length)".
Tcpdump shows no one packet, no syn, no connection established.

comment:3 Changed 6 years ago by cypherpunks

Modified torbutton_do_new_identity() so it calls only torbutton_send_ctrl_cmd() and nothing else. Hangs still happens in the same place.

comment:4 Changed 6 years ago by cypherpunks

Socket is never alive during repro case that leads to hangs. Tested with modified torbutton_send_ctrl_cmd() so it waited alive socket after opening stream, like so:

    var input = socket.openInputStream(3, 1, 1); // 3 == OPEN_BLOCKING|OPEN_UNBUFFERED
    var output = socket.openOutputStream(3, 1, 1); // 3 == OPEN_BLOCKING|OPEN_UNBUFFERED

    while (!socket.isAlive()) {}
    torbutton_log(4, "Alive!");

If hangs happens then it never reaches "Alive!" and browser ask about unresponsive script later.

comment:5 Changed 6 years ago by cypherpunks

I got reproducable hang using modified torbutton_send_ctrl_cmd(), on startup without any pages loaded.

    var socketTransportService = Components.classes["@mozilla.org/network/socket-transport-service;1"]
        .getService(Components.interfaces.nsISocketTransportService);
    var sockets = new Array();
    var max=540; // 550 hangs, 545 bypasses
    for(var i = 0; i < max; i++) {
      var s = socketTransportService.createTransport(null, 0, m_tb_control_host, m_tb_control_port, null);


      var input = s.openInputStream(3, 1, 1); // 3 == OPEN_BLOCKING|OPEN_UNBUFFERED
      var output = s.openOutputStream(3, 1, 1); // 3 == OPEN_BLOCKING|OPEN_UNBUFFERED
      sockets.push(s);
    }
    for(var i = 0; i < max-1; i++) {
      sockets[i].close;
    }
    var socket =  sockets[max-2];
    torbutton_log(4, "Browser will hang now!");

Those number for tested box, it's depends platform in some way.
I was unable to find any correlation with local limits like max descriptors or something. Probably missed something.

comment:6 Changed 6 years ago by cypherpunks

var socket = sockets[max-2];

Should be var socket = sockets[max-1];

comment:7 Changed 6 years ago by cypherpunks

Well code is buggy. Ignore it.

comment:8 Changed 6 years ago by cypherpunks

It hangs if socket never closed, as broken code do not close any.

comment:9 Changed 6 years ago by gk

Cc: g.koppen@… added

comment:10 Changed 6 years ago by cypherpunks

Status: newneeds_review

This code helps to avoid conditions that leads to hang if no connection able to create.

--- torbutton.js.original
+++ torbutton.js
@@ -1385,6 +1385,20 @@
   }
 }
 
+function torbutton_process_even_queue() {
+    // Get the current thread.
+    var thread = Components.classes["@mozilla.org/thread-manager;1"].
+                 getService(Components.interfaces.nsIThreadManager).currentThread;
+/*
+Warning: Calling nsIThread.processNextEvent allows network
+and UI events to run which can modify data structures that
+your code isn't expecting to be modified during a synchronous
+method call. This can cause random crashes and other bugs that
+may be hard to find and fix. Consider use asynchronous callbacks instead.
+*/
+    while (thread.processNextEvent(false)) {}
+}
+
 /* The "New Identity" implementation does the following:
  *   1. Disables Javascript and plugins on all tabs
  *   2. Clears state:
@@ -1580,7 +1594,8 @@
   var prefService = Components.classes["@mozilla.org/preferences-service;1"]
       .getService(Components.interfaces.nsIPrefService);
   prefService.savePrefFile(null);
-  
+
+  torbutton_process_even_queue();
   torbutton_log(3, "New Identity: Sending NEWNYM");
 
   // We only support TBB for newnym.

comment:11 Changed 6 years ago by cypherpunks

it was event queue in my brain. typo, but this code works.

comment:12 Changed 6 years ago by gk

Why do you think that warning can be ignored in the current case?

comment:13 in reply to:  12 Changed 6 years ago by cypherpunks

Replying to gk:

Why do you think that warning can be ignored in the current case?

Do you think that warning can't be ignored?

No any modified data should prevent: access control port, open new window, close current window. Before "Sending NEWNYM" it actually assumed everything finished: tabs closed, conns destroyed, prefs synced, etc. With processing all events in queue you can guarantee that.

comment:14 Changed 6 years ago by cypherpunks

The only thing is UI. User can open new tab, or press "NEWNYM" yet while it process queue? or something evil yet.
Ok lets fix browser code instead, it's bug anyway. Or migrate away.

comment:15 Changed 6 years ago by cypherpunks

Status: needs_reviewnew

comment:16 Changed 6 years ago by cypherpunks

comment:17 Changed 6 years ago by gk

Here comes some gdb output as a starting point:

#4  0xb628579a in mozilla::ReentrantMonitor::Wait (this=0x9b2814e0, 
    interval=4294967295)
    at /home/ubuntu/build/tor-browser/obj-i686-pc-linux-gnu/xpcom/build/BlockingResourceBase.cpp:313
#5  0xb4a8f352 in mozilla::ReentrantMonitorAutoEnter::Wait (this=0xbfffb7ac, 
    interval=4294967295) at ../../dist/include/mozilla/ReentrantMonitor.h:192
#6  0xb62caa37 in nsPipeOutputStream::Wait (this=0x9b2814bc)
    at /home/ubuntu/build/tor-browser/xpcom/io/nsPipe3.cpp:980
#7  0xb62cae45 in nsPipeOutputStream::WriteSegments (this=0x9b2814bc, 
    reader=0xb62cafea <nsReadFromRawBuffer(nsIOutputStream*, void*, char*, uint32_t, uint32_t, uint32_t*)>, closure=0x9b5e0be0, count=46, 
    writeCount=0xbfffb898)
    at /home/ubuntu/build/tor-browser/xpcom/io/nsPipe3.cpp:1090
#8  0xb62cb077 in nsPipeOutputStream::Write (this=0x9b2814bc, 
    fromBuf=0x9b5e0be0 "AUTHENTICATE 422f587d45483e5b737b2f7863234042\r\n", 
    bufLen=47, writeCount=0xbfffb898)
    at /home/ubuntu/build/tor-browser/xpcom/io/nsPipe3.cpp:1150
#9  0xb62bc86d in nsBinaryOutputStream::Write (this=0x9b6d1980, 
    aBuf=0x9b5e0be0 "AUTHENTICATE 422f587d45483e5b737b2f7863234042\r\n", 
    aCount=47, aActualBytes=0xbfffb898)
    at /home/ubuntu/build/tor-browser/xpcom/io/nsBinaryStream.cpp:51
#10 0xb62bd0e7 in nsBinaryOutputStream::WriteBytes (this=0x9b6d1980, 
    aString=0x9b5e0be0 "AUTHENTICATE 422f587d45483e5b737b2f7863234042\r\n", 
    aLength=47)
    at /home/ubuntu/build/tor-browser/xpcom/io/nsBinaryStream.cpp:216

It is pretty reliably reproducible in my debug build if I open a bunch of boomkarks (say 5) simultaneously and choose the "New Identity" option immediately afterwards.

Version 0, edited 6 years ago by gk (next)

comment:18 Changed 6 years ago by arma

Keywords: tbb-newnym added

comment:19 Changed 6 years ago by gk

Cc: gk mcs brade added; g.koppen@… removed

comment:20 Changed 6 years ago by gk

Cc: rex_4539 added

#10880 is a duplicate of this one.

comment:21 Changed 6 years ago by gk

Another user reported similar symptoms (beginning with 3.5.2): https://blog.torproject.org/blog/tor-browser-3521-released#comment-47522

comment:22 Changed 6 years ago by gk

Okay, I've filed https://bugzilla.mozilla.org/show_bug.cgi?id=978690 to get things started. Let's see what Mozilla folks are saying.

comment:23 Changed 6 years ago by gk

TL;DR: We can't use blocking sockets the way we are doing that right now. They are responsible for the hangs.

If we only have non-TLS connections everything is fine. The picture changes as soon as TLS connections are processed while we click on New Identity or start the browser. Then the following may happen (this is the case when we start-up; hitting New Identity gives a slightly different stacktrace which boils down to the same problem, though):

#1  0xb772cc5b in pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/i386/i686/../i486/pthread_cond_wait.S:187
#2  0xb71ec899 in PR_WaitCondVar (cvar=0xa1a07d00, timeout=4294967295)
    at /home/ubuntu/build/tor-browser/nsprpub/pr/src/pthreads/ptsynch.c:385
#3  0xb5e53402 in Wait (this=<optimized out>, interval=<optimized out>)
    at ../../../../dist/include/mozilla/CondVar.h:71
#4  Wait (interval=<optimized out>, this=<optimized out>)
    at ../../../../dist/include/mozilla/Monitor.h:47
#5  Wait (interval=<optimized out>, this=<optimized out>)
    at ../../../../dist/include/mozilla/Monitor.h:102
#6  mozilla::psm::SyncRunnableBase::DispatchToMainThreadAndWait (
    this=0xa1a2f6e0)
    at /home/ubuntu/build/tor-browser/security/manager/ssl/src/PSMRunnable.cpp:25
#7  0xb5e45b6d in nsNSSSocketInfo::GetPreviousCert (this=0xab3c8580, 
    _result=0xb28feef4)
    at /home/ubuntu/build/tor-browser/security/manager/ssl/src/nsNSSIOLayer.cpp:427
#8  0xb5e31670 in HandshakeCallback (fd=0xab318280, client_data=0xab3c8580)
    at /home/ubuntu/build/tor-browser/security/manager/ssl/src/nsNSSCallbacks.cpp:901
#9  0xb709f871 in ssl_FinishHandshake (ss=0xfffffe00) at sslsecur.c:141
#10 0xb708cd65 in ssl3_FinishHandshake (ss=0xaa08a000) at ssl3con.c:10637
#11 0xb708e609 in ssl3_AuthCertificateComplete (ss=0xaa08a000, error=0)
    at ssl3con.c:10040
#12 0xb709faec in SSL_AuthCertificateComplete (fd=0xaa0760e0, error=0)
    at sslsecur.c:1566
#13 0xb5e486ea in nsNSSSocketInfo::SetCertVerificationResult (this=0xab3c8580, 
    errorCode=0, errorMessageType=mozilla::psm::PlainErrorMessage)
    at /home/ubuntu/build/tor-browser/security/manager/ssl/src/nsNSSIOLayer.cpp:456
#14 0xb5e53ecf in mozilla::psm::(anonymous namespace)::SSLServerCertVerificationResult::Run (this=0xa1a2f6a0)
    at /home/ubuntu/build/tor-browser/security/manager/ssl/src/SSLServerCertVerification.cpp:1357
#15 0xb6366498 in nsThread::ProcessNextEvent (this=0xb2907a60, mayWait=true, 
    result=0xb28ff1ef)
    at /home/ubuntu/build/tor-browser/xpcom/threads/nsThread.cpp:626
#16 0xb63307a7 in NS_ProcessNextEvent (thread=0xfffffe00, mayWait=true)
    at /home/ubuntu/build/tor-browser/obj-i686-pc-linux-gnu/xpcom/build/nsThreadUtils.cpp:238
#17 0xb543f43e in nsSocketTransportService::Run (this=0xb29079a0)
    at /home/ubuntu/build/tor-browser/netwerk/base/src/nsSocketTransportService2.cpp:648

A SSLServerCertVerificationResult is dispatched to the socket transport thread (see: https://mxr.mozilla.org/mozilla-esr24/source/security/manager/ssl/src/SSLServerCertVerification.cpp#88 for a reason why this has to happen). Processing this event leads to dispatching PreviousCertRunnable to the main thread. Now, if that is processed on the main thread *before* our

    outputStream.writeBytes(command, command.length);

everything is fine. If not we see something like the following in the log:

-1344275648[b722a540]: nsSocketOutputStream::Write [this=9d128794 count=0]
-1344275648[b722a540]:   calling PR_Write [count=0]
-1399850176[acb3b680]: STS dispatch [ab6821f0]
-1220368640[b722a300]: STS dispatch [a9dca700]
-1220368640[b722a300]: STS dispatch [aa76cce0]
-1220368640[b722a300]: STS dispatch [aa76cd00]
-1220368640[b722a300]: creating nsSocketTransport @9d128e00
-1220368640[b722a300]: nsSocketTransport::Init [this=9d128e00 host=127.0.0.1:9151 proxy=:0]
-1220368640[b722a300]: nsSocketTransport::PostEvent [this=9d128e00 type=3 status=0 param=0]
-1220368640[b722a300]: STS dispatch [aa76cde0]
-1220368640[b722a300]: nsSocketTransport::OpenInputStream [this=9d128e00 flags=3]
-1220368640[b722a300]: STS dispatch [aa46c2e8]
-1220368640[b722a300]: nsSocketTransport::PostEvent [this=9d128e00 type=0 status=0 param=0]
-1220368640[b722a300]: STS dispatch [aa76ce00]
-1220368640[b722a300]: nsSocketTransport::OpenOutputStream [this=9d128e00 flags=3]
-1220368640[b722a300]: STS dispatch [aa46c338]
-1220368640[b722a300]: nsSocketTransport::PostEvent [this=9d128e00 type=0 status=0 param=0]
-1220368640[b722a300]: STS dispatch [aa76ce40]
-1220368640[b722a300]: OOO WriteSegments [this=a18ed2ac count=47]
-1220368640[b722a300]: OOO appended new segment
-1220368640[b722a300]: OOO rolling back write cursor 0 bytes
-1220368640[b722a300]: OOO SegmentLen is 1
-1220368640[b722a300]: OOO ReadCount is 1
-1220368640[b722a300]: OOO SegmentLen is now 0 and originalLen is 1
-1220368640[b722a300]: OOO advancing write cursor by 1
-1220368640[b722a300]: OOO pipe output: waiting for space
-1690313920[aa456940]: STS dispatch [aa49a3a0]
-1673528512[aa456640]: STS dispatch [aa48fac0]
-1681921216[aa4567c0]: STS dispatch [aa496ec0]

Calling PR_Write actually leads to dispatching the TLS stuff to the socket thread as explained above (note though that this log documents a hang on "New Identity"). While the TLS code is running on that thread Torbutton creates a new socket transport and is opening an input and an output stream which leads to dispatching things to the socket thread, too. But this one is till blocked by the TLS code, thus, the stream setup is not properly done yet. Nevertheless, execution of Torbutton code is proceeding with

outputStream.writeBytes(command, command.length);

which results in a blocking main thread with a still blocked socket thread: we have a deadlock.

comment:24 Changed 6 years ago by cypherpunks

blocking sockets

I tried to use non blocking for test purpose in that code months ago, without any luck. Connection just never launched. No deadlock and no connection.
You can to try too, maybe it was my wrong, and it will work.

comment:25 Changed 6 years ago by mcs

Great work tracking this down. To some extent this is still a Mozilla problem in that users of the socket transport service need to avoid using it in certain ways.

Does the deadlock still occur if we use buffered sockets? The combination of blocking + unbuffered is unusual. I don't remember why buffering causes problems, but Mike probably does.

comment:26 in reply to:  25 Changed 6 years ago by gk

Replying to mcs:

Does the deadlock still occur if we use buffered sockets? The combination of blocking + unbuffered is unusual. I don't remember why buffering causes problems, but Mike probably does.

https://trac.torproject.org/projects/tor/ticket/8642#comment:1 ? It seems we get buffering for free if we have blocking sockets :) So, the comments in torbuttons.js are wrong. Well, they are not wrong but they just don't match reality.

comment:27 Changed 6 years ago by gk

Keywords: MikePerry201404R added

Delaying the execution with setTimout() and friends does not help as we are still on the same thread and using workers seems not ideal either but I have not tried the latter yet. So, we need probably a smarter approach.

comment:28 Changed 6 years ago by mikeperry

Keywords: MikePerry201405R added; MikePerry201404R removed

What am I supposed to review, test, or investigate here?

comment:29 in reply to:  28 Changed 6 years ago by gk

Replying to mikeperry:

What am I supposed to review, test, or investigate here?

Knowing whether you tested (and dismissed) other options than using blocking sockets for communicating with tor back then when you wrote the code might be helpful. And, of course, any ideas on how to solve this bug are appreciated as well. :)

comment:30 Changed 6 years ago by mikeperry

We went with blocking sockets because New Identity needed to ensure that it actually got a new circuit before allowing the user to do anything else with the browser.

I am wondering if we should try clearing the event queue as cypherpunks suggested. It may be a good idea for a few reasons. We may be able to use https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsIDOMWindowUtils#suppressEventHandling%28%29 to suppress UI events in the meantime, if need be.

Alternatively, we could attempt to do an async New Identity and also use the event suppressor to suppress UI event handling until the New Identity onOutputStreamReady+onInputStream callbacks come back...

comment:31 Changed 6 years ago by mikeperry

Keywords: MikePerry201405R removed

comment:32 in reply to:  30 ; Changed 6 years ago by gk

Replying to mikeperry:

We went with blocking sockets because New Identity needed to ensure that it actually got a new circuit before allowing the user to do anything else with the browser.

I am wondering if we should try clearing the event queue as cypherpunks suggested. It may be a good idea for a few reasons. We may be able to use https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsIDOMWindowUtils#suppressEventHandling%28%29 to suppress UI events in the meantime, if need be.

Yes, sounds reasonable. Alas, this idea is not working for the hang during start-up. There a blank Tor Launcher window is visible for a short amount of time until the usual messages appear giving a weird user experience. So I wonder whether we should separate the two problems and treat them differently: fixing the hang on New Identity with the processNextEvent/suppressEventHandling idea and trying the other one for the hang during start-up.

comment:33 Changed 6 years ago by gk

#12119 is a duplicate.

comment:34 Changed 5 years ago by gk

Keywords: TorBrowserTeam201406 added

comment:35 Changed 5 years ago by gk

Keywords: TorBrowserTeam201407 added; TorBrowserTeam201406 removed

comment:36 Changed 5 years ago by erinn

Component: TorBrowserButtonTor Browser
Keywords: tbb-torbutton added
Owner: changed from mikeperry to tbb-team

comment:37 in reply to:  32 Changed 5 years ago by gk

Keywords: MikePerry201407R added
Status: newneeds_review

Replying to gk:

So I wonder whether we should separate the two problems and treat them differently: fixing the hang on New Identity with the processNextEvent/suppressEventHandling idea and trying the other one for the hang during start-up.

I convinced myself that this might be a good approach and have a fix for this bug in my bug_9531_v2 branch in my public Torbutton repo.

comment:38 Changed 5 years ago by mikeperry

Keywords: TorBrowserTeam201408 MikePerry201408R added; TorBrowserTeam201407 MikePerry201407R removed

comment:39 Changed 5 years ago by mikeperry

Owner: changed from tbb-team to gk
Status: needs_reviewassigned

comment:40 Changed 5 years ago by mikeperry

Status: assignedneeds_review

Trac cannot change assignment without resetting ticket state...

comment:41 Changed 5 years ago by mikeperry

Resolution: fixed
Status: needs_reviewclosed

Ok, this is merged for Torbutton 1.6.12.0, which we'll put in 4.0-alpha-1. TBB 3.6 will stay on Torbutton 1.6.11.1.

I am going to tempt fate and call this fixed.

comment:42 Changed 5 years ago by cypherpunks

Then the following may happen (this is the case when we start-up; hitting New Identity gives a slightly different stacktrace which boils down to the same problem, though)
If not we see something like the following in the log

Was it log for start-up case or for hitting New Identity?

comment:43 Changed 5 years ago by cypherpunks

There a blank Tor Launcher window is visible for a short amount of time until the usual messages appear giving a weird user experience.

Blank windows with your patch?

comment:44 Changed 5 years ago by cypherpunks

I think it's better to place this patch to torbutton_send_ctrl_cmd. Which is called yet by torbutton_local_tor_check on start-up, it's better to exclude any possible hangs could be triggered by Torbutton, and investigate #10804 from scratch.

comment:45 in reply to:  42 Changed 5 years ago by gk

Replying to cypherpunks:

Then the following may happen (this is the case when we start-up; hitting New Identity gives a slightly different stacktrace which boils down to the same problem, though)
If not we see something like the following in the log

Was it log for start-up case or for hitting New Identity?

start-up case. The issue when hitting New Identity shows basically the same log with minor differences. I'd say it is definitely the same underlying bug we hit here.

comment:46 in reply to:  43 Changed 5 years ago by gk

Replying to cypherpunks:

There a blank Tor Launcher window is visible for a short amount of time until the usual messages appear giving a weird user experience.

Blank windows with your patch?

If I apply the patch idea to the Tor Launcher related issue, yes.

comment:47 Changed 5 years ago by cypherpunks

There a blank Tor Launcher window is visible for a short amount of time until the usual messages appear giving a weird user experience.

Tor Launcher could to finish every stuff for progress document before starting socket transport operations.
This specially crafted workaround reducing negative user experience.

diff --git a/src/components/tl-process.js b/src/components/tl-process.js
index cc5d654..31d4e45 100644
--- a/src/components/tl-process.js
+++ b/src/components/tl-process.js
@@ -91,6 +91,7 @@ TorProcessService.prototype =
     if ("profile-after-change" == aTopic)
     {
       this.mObsSvc.addObserver(this, "quit-application-granted", false);
+      this.mObsSvc.addObserver(this, "document-shown", false);
       this.mObsSvc.addObserver(this, kOpenNetworkSettingsTopic, false);
       this.mObsSvc.addObserver(this, kUserQuitTopic, false);
       this.mObsSvc.addObserver(this, kBootstrapStatusTopic, false);
@@ -169,6 +170,17 @@ TorProcessService.prototype =
         }
       }
     }
+    else if ("document-shown" == aTopic)
+    {
+     if (aSubject.getElementById("progressMeter"))
+       ++this.mDocumentProgressShown;
+     else 
+        this.mDocumentProgressShown = 0;
+     if (!this.mDocumentProgressShown || this.mDocumentProgressShown == 3) {
+       this._monitorTorProcessStartup();
+       this.mObsSvc.removeObserver(this, "document-shown");
+     }
+    }
     else if ("timer-callback" == aTopic)
     {
       if (aSubject == this.mControlConnTimer)
@@ -296,6 +308,7 @@ TorProcessService.prototype =
   mRestartWithQuit: false,
   mLastTorWarningPhase: null,
   mLastTorWarningReason: null,
+  mDocumentProgressShown: 0,
 
 
   // Private Methods /////////////////////////////////////////////////////////
@@ -426,7 +439,6 @@ TorProcessService.prototype =
   {
     try
     {
-      this._monitorTorProcessStartup();
 
       var bridgeConfigIsBad = (this._defaultBridgesStatus ==
                                this.kDefaultBridgesStatus_BadConfig);

With more knowledge it could be less kludgy fix.

comment:48 Changed 5 years ago by cypherpunks

Torbutton 1.6.12.3 should be free of such hangs, it draining queue for every attempts to access Tor's control port . Next patches and research of Tor Launcher's hangs going to #10804.

Note: See TracTickets for help on using tickets.