skruffy is reporting hangs on Torbutton's New Identity. Oddly, the hangs seem to be more common if you are loading a bunch of websites at the time of hitting New Identity, but that is not a full repro case.
The hang actually appears to be happening while attempting to open a control port connection. Perhaps another race introduced by our SOCKS optimistic data hack in nsITransport? Or perhaps our hack for #8642 (closed) is insufficient in some cases, and/or during heavy network activity.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
Trac: Description: skruffy is reporting hangs on Torbutton's New Identity. Oddly, the hangs seem to be more common if you are loading a bunch of websites at the time of hitting New Identity, but that is not a full repro case.
The hang actually appears to be happening while attempting to open a control port connection. Perhaps another race introduced by our SOCKS optimistic data hack in nsITransport? Or perhaps our hack for #8462 (moved) is insufficient in some cases, and/or during heavy network activity.
to
skruffy is reporting hangs on Torbutton's New Identity. Oddly, the hangs seem to be more common if you are loading a bunch of websites at the time of hitting New Identity, but that is not a full repro case.
The hang actually appears to be happening while attempting to open a control port connection. Perhaps another race introduced by our SOCKS optimistic data hack in nsITransport? Or perhaps our hack for #8642 (closed) is insufficient in some cases, and/or during heavy network activity.
Hang happens exactly after call of "outputStream.writeBytes(auth_cmd, auth_cmd.length)".
Tcpdump shows no one packet, no syn, no connection established.
Socket is never alive during repro case that leads to hangs. Tested with modified torbutton_send_ctrl_cmd() so it waited alive socket after opening stream, like so:
var input = socket.openInputStream(3, 1, 1); // 3 == OPEN_BLOCKING|OPEN_UNBUFFERED var output = socket.openOutputStream(3, 1, 1); // 3 == OPEN_BLOCKING|OPEN_UNBUFFERED while (!socket.isAlive()) {} torbutton_log(4, "Alive!");
If hangs happens then it never reaches "Alive!" and browser ask about unresponsive script later.
I got reproducable hang using modified torbutton_send_ctrl_cmd(), on startup without any pages loaded.
var socketTransportService = Components.classes["@mozilla.org/network/socket-transport-service;1"] .getService(Components.interfaces.nsISocketTransportService); var sockets = new Array(); var max=540; // 550 hangs, 545 bypasses for(var i = 0; i < max; i++) { var s = socketTransportService.createTransport(null, 0, m_tb_control_host, m_tb_control_port, null); var input = s.openInputStream(3, 1, 1); // 3 == OPEN_BLOCKING|OPEN_UNBUFFERED var output = s.openOutputStream(3, 1, 1); // 3 == OPEN_BLOCKING|OPEN_UNBUFFERED sockets.push(s); } for(var i = 0; i < max-1; i++) { sockets[i].close; } var socket = sockets[max-2]; torbutton_log(4, "Browser will hang now!");
Those number for tested box, it's depends platform in some way.
I was unable to find any correlation with local limits like max descriptors or something. Probably missed something.
This code helps to avoid conditions that leads to hang if no connection able to create.
--- torbutton.js.original+++ torbutton.js@@ -1385,6 +1385,20 @@ } }+function torbutton_process_even_queue() {+ // Get the current thread.+ var thread = Components.classes["@mozilla.org/thread-manager;1"].+ getService(Components.interfaces.nsIThreadManager).currentThread;+/*+Warning: Calling nsIThread.processNextEvent allows network+and UI events to run which can modify data structures that+your code isn't expecting to be modified during a synchronous+method call. This can cause random crashes and other bugs that+may be hard to find and fix. Consider use asynchronous callbacks instead.+*/+ while (thread.processNextEvent(false)) {}+}+ /* The "New Identity" implementation does the following: * 1. Disables Javascript and plugins on all tabs * 2. Clears state:@@ -1580,7 +1594,8 @@ var prefService = Components.classes["@mozilla.org/preferences-service;1"] .getService(Components.interfaces.nsIPrefService); prefService.savePrefFile(null);- ++ torbutton_process_even_queue(); torbutton_log(3, "New Identity: Sending NEWNYM"); // We only support TBB for newnym.
Why do you think that warning can be ignored in the current case?
Do you think that warning can't be ignored?
No any modified data should prevent: access control port, open new window, close current window. Before "Sending NEWNYM" it actually assumed everything finished: tabs closed, conns destroyed, prefs synced, etc. With processing all events in queue you can guarantee that.
The only thing is UI. User can open new tab, or press "NEWNYM" yet while it process queue? or something evil yet.
Ok lets fix browser code instead, it's bug anyway. Or migrate away.
#4 0xb628579a in mozilla::ReentrantMonitor::Wait (this=0x9b2814e0, interval=4294967295) at /home/ubuntu/build/tor-browser/obj-i686-pc-linux-gnu/xpcom/build/BlockingResourceBase.cpp:313#5 0xb4a8f352 in mozilla::ReentrantMonitorAutoEnter::Wait (this=0xbfffb7ac, interval=4294967295) at ../../dist/include/mozilla/ReentrantMonitor.h:192#6 0xb62caa37 in nsPipeOutputStream::Wait (this=0x9b2814bc) at /home/ubuntu/build/tor-browser/xpcom/io/nsPipe3.cpp:980#7 0xb62cae45 in nsPipeOutputStream::WriteSegments (this=0x9b2814bc, reader=0xb62cafea <nsReadFromRawBuffer(nsIOutputStream*, void*, char*, uint32_t, uint32_t, uint32_t*)>, closure=0x9b5e0be0, count=46, writeCount=0xbfffb898) at /home/ubuntu/build/tor-browser/xpcom/io/nsPipe3.cpp:1090#8 0xb62cb077 in nsPipeOutputStream::Write (this=0x9b2814bc, fromBuf=0x9b5e0be0 "AUTHENTICATE 422f587d45483e5b737b2f7863234042\r\n", bufLen=47, writeCount=0xbfffb898) at /home/ubuntu/build/tor-browser/xpcom/io/nsPipe3.cpp:1150#9 0xb62bc86d in nsBinaryOutputStream::Write (this=0x9b6d1980, aBuf=0x9b5e0be0 "AUTHENTICATE 422f587d45483e5b737b2f7863234042\r\n", aCount=47, aActualBytes=0xbfffb898) at /home/ubuntu/build/tor-browser/xpcom/io/nsBinaryStream.cpp:51#10 0xb62bd0e7 in nsBinaryOutputStream::WriteBytes (this=0x9b6d1980, aString=0x9b5e0be0 "AUTHENTICATE 422f587d45483e5b737b2f7863234042\r\n", aLength=47) at /home/ubuntu/build/tor-browser/xpcom/io/nsBinaryStream.cpp:216
It is pretty reliably reproducible in my debug build if I open a bunch of bookmarks (say 5) simultaneously and choose the "New Identity" option immediately afterwards.
TL;DR: We can't use blocking sockets the way we are doing that right now. They are responsible for the hangs.
If we only have non-TLS connections everything is fine. The picture changes as soon as TLS connections are processed while we click on New Identity or start the browser. Then the following may happen (this is the case when we start-up; hitting New Identity gives a slightly different stacktrace which boils down to the same problem, though):
#1 0xb772cc5b in pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/i386/i686/../i486/pthread_cond_wait.S:187#2 0xb71ec899 in PR_WaitCondVar (cvar=0xa1a07d00, timeout=4294967295) at /home/ubuntu/build/tor-browser/nsprpub/pr/src/pthreads/ptsynch.c:385#3 0xb5e53402 in Wait (this=<optimized out>, interval=<optimized out>) at ../../../../dist/include/mozilla/CondVar.h:71#4 Wait (interval=<optimized out>, this=<optimized out>) at ../../../../dist/include/mozilla/Monitor.h:47#5 Wait (interval=<optimized out>, this=<optimized out>) at ../../../../dist/include/mozilla/Monitor.h:102#6 mozilla::psm::SyncRunnableBase::DispatchToMainThreadAndWait ( this=0xa1a2f6e0) at /home/ubuntu/build/tor-browser/security/manager/ssl/src/PSMRunnable.cpp:25#7 0xb5e45b6d in nsNSSSocketInfo::GetPreviousCert (this=0xab3c8580, _result=0xb28feef4) at /home/ubuntu/build/tor-browser/security/manager/ssl/src/nsNSSIOLayer.cpp:427#8 0xb5e31670 in HandshakeCallback (fd=0xab318280, client_data=0xab3c8580) at /home/ubuntu/build/tor-browser/security/manager/ssl/src/nsNSSCallbacks.cpp:901#9 0xb709f871 in ssl_FinishHandshake (ss=0xfffffe00) at sslsecur.c:141#10 0xb708cd65 in ssl3_FinishHandshake (ss=0xaa08a000) at ssl3con.c:10637#11 0xb708e609 in ssl3_AuthCertificateComplete (ss=0xaa08a000, error=0) at ssl3con.c:10040#12 0xb709faec in SSL_AuthCertificateComplete (fd=0xaa0760e0, error=0) at sslsecur.c:1566#13 0xb5e486ea in nsNSSSocketInfo::SetCertVerificationResult (this=0xab3c8580, errorCode=0, errorMessageType=mozilla::psm::PlainErrorMessage) at /home/ubuntu/build/tor-browser/security/manager/ssl/src/nsNSSIOLayer.cpp:456#14 0xb5e53ecf in mozilla::psm::(anonymous namespace)::SSLServerCertVerificationResult::Run (this=0xa1a2f6a0) at /home/ubuntu/build/tor-browser/security/manager/ssl/src/SSLServerCertVerification.cpp:1357#15 0xb6366498 in nsThread::ProcessNextEvent (this=0xb2907a60, mayWait=true, result=0xb28ff1ef) at /home/ubuntu/build/tor-browser/xpcom/threads/nsThread.cpp:626#16 0xb63307a7 in NS_ProcessNextEvent (thread=0xfffffe00, mayWait=true) at /home/ubuntu/build/tor-browser/obj-i686-pc-linux-gnu/xpcom/build/nsThreadUtils.cpp:238#17 0xb543f43e in nsSocketTransportService::Run (this=0xb29079a0) at /home/ubuntu/build/tor-browser/netwerk/base/src/nsSocketTransportService2.cpp:648
Calling PR_Write actually leads to dispatching the TLS stuff to the socket thread as explained above (note though that this log documents a hang on "New Identity"). While the TLS code is running on that thread Torbutton creates a new socket transport and is opening an input and an output stream which leads to dispatching things to the socket thread, too. But this one is till blocked by the TLS code, thus, the stream setup is not properly done yet. Nevertheless, execution of Torbutton code is proceeding with
outputStream.writeBytes(command, command.length);
which results in a blocking main thread with a still blocked socket thread: we have a deadlock.
blocking sockets
I tried to use non blocking for test purpose in that code months ago, without any luck. Connection just never launched. No deadlock and no connection.
You can to try too, maybe it was my wrong, and it will work.
Great work tracking this down. To some extent this is still a Mozilla problem in that users of the socket transport service need to avoid using it in certain ways.
Does the deadlock still occur if we use buffered sockets? The combination of blocking + unbuffered is unusual. I don't remember why buffering causes problems, but Mike probably does.
Does the deadlock still occur if we use buffered sockets? The combination of blocking + unbuffered is unusual. I don't remember why buffering causes problems, but Mike probably does.
Delaying the execution with setTimout() and friends does not help as we are still on the same thread and using workers seems not ideal either but I have not tried the latter yet. So, we need probably a smarter approach.
What am I supposed to review, test, or investigate here?
Knowing whether you tested (and dismissed) other options than using blocking sockets for communicating with tor back then when you wrote the code might be helpful. And, of course, any ideas on how to solve this bug are appreciated as well. :)
We went with blocking sockets because New Identity needed to ensure that it actually got a new circuit before allowing the user to do anything else with the browser.
Alternatively, we could attempt to do an async New Identity and also use the event suppressor to suppress UI event handling until the New Identity onOutputStreamReady+onInputStream callbacks come back...
We went with blocking sockets because New Identity needed to ensure that it actually got a new circuit before allowing the user to do anything else with the browser.
Yes, sounds reasonable. Alas, this idea is not working for the hang during start-up. There a blank Tor Launcher window is visible for a short amount of time until the usual messages appear giving a weird user experience. So I wonder whether we should separate the two problems and treat them differently: fixing the hang on New Identity with the processNextEvent/suppressEventHandling idea and trying the other one for the hang during start-up.
So I wonder whether we should separate the two problems and treat them differently: fixing the hang on New Identity with the processNextEvent/suppressEventHandling idea and trying the other one for the hang during start-up.
I convinced myself that this might be a good approach and have a fix for this bug in my bug_9531_v2 branch in my public Torbutton repo.
Trac: Keywords: N/Adeleted, MikePerry201407R added Status: new to needs_review
Then the following may happen (this is the case when we start-up; hitting New Identity gives a slightly different stacktrace which boils down to the same problem, though)
If not we see something like the following in the log
Was it log for start-up case or for hitting New Identity?
There a blank Tor Launcher window is visible for a short amount of time until the usual messages appear giving a weird user experience.
Blank windows with your patch?
I think it's better to place this patch to torbutton_send_ctrl_cmd. Which is called yet by torbutton_local_tor_check on start-up, it's better to exclude any possible hangs could be triggered by Torbutton, and investigate #10804 (moved) from scratch.
Then the following may happen (this is the case when we start-up; hitting New Identity gives a slightly different stacktrace which boils down to the same problem, though)
If not we see something like the following in the log
Was it log for start-up case or for hitting New Identity?
start-up case. The issue when hitting New Identity shows basically the same log with minor differences. I'd say it is definitely the same underlying bug we hit here.
There a blank Tor Launcher window is visible for a short amount of time until the usual messages appear giving a weird user experience.
Blank windows with your patch?
If I apply the patch idea to the Tor Launcher related issue, yes.
There a blank Tor Launcher window is visible for a short amount of time until the usual messages appear giving a weird user experience.
Tor Launcher could to finish every stuff for progress document before starting socket transport operations.
This specially crafted workaround reducing negative user experience.
Torbutton 1.6.12.3 should be free of such hangs, it draining queue for every attempts to access Tor's control port . Next patches and research of Tor Launcher's hangs going to #10804 (moved).