Compiled tor again to note how it and libevent2 were compiled. This also brought an OpenSSL upgrade to version 1.0.0e. I will update the ticket if the bug appears again.
Trac: Username: jn Status: new to needs_information
So, this message occurs when Tor has sent a question to a cpuworker, and hasn't gotten a response for an insanely long time.
One way to try to fix this, if you know your way around gdb, is to use gdb -p to attach to the tor process after the bug occurs, and look at each thread's stack in turn to see where each one is. If one of them really is stuck (on a lock or something), that should tell you where. Or you could try it after the process has been running for an hour or two: if this happens reliably, then odds are one or more cpuworkers will be stuck by then.
To provoke this message into occurring more quickly, we might lower CPUWORKER_BUSY_TIMEOUT; it is ridiculously high at the moment.
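To make the timeout logic concrete, here is a minimal standalone sketch (made-up names and fields, not the actual cpuworker.c code) of the kind of check that fires this message: a worker that has been busy longer than CPUWORKER_BUSY_TIMEOUT is treated as wedged.

#include <stdio.h>
#include <time.h>

/* The value described above as ridiculously high: 12 hours. */
#define CPUWORKER_BUSY_TIMEOUT (60*60*12)

/* Hypothetical stand-in for a cpuworker connection's bookkeeping. */
struct fake_worker {
  int busy;           /* nonzero while a request is outstanding */
  time_t busy_since;  /* when the request was handed to the worker */
};

/* Return nonzero if the worker should be considered wedged. */
static int
worker_is_wedged(const struct fake_worker *w, time_t now)
{
  return w->busy && (now - w->busy_since) > CPUWORKER_BUSY_TIMEOUT;
}

int
main(void)
{
  /* Pretend a worker has been busy for 13 hours. */
  struct fake_worker w = { 1, time(NULL) - 13*60*60 };
  if (worker_is_wedged(&w, time(NULL)))
    printf("cpuworker looks wedged; give up on it\n");
  return 0;
}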
The problem did not return on DFRI0 after tor was recompiled seven weeks ago. But I got the error again on another relay with a similar OS and similar versions, as noted in #4715 (moved). I will see whether the error returns on that relay after the upgrade to version 0.2.2.35.
I'm running into exactly the same thing here on FreeBSD 9.0-RELEASE.
I have tor-devel-0.2.3.10.a running with libevent2-2.0.16 and compiled both against openssl-1.0.0f (openssl-1.0.0_8).
Has anyone found a solution or workaround for this yet and could give some details?
If more information is required I will provide it.
No solution yet. My plan, when time permits, is to follow nickm's tip to lower CPUWORKER_BUSY_TIMEOUT to, say, one hour and then debug with gdb -p as soon as the problem happens again.
Too bad. I'd hoped for an easy solution by now. I already tried gdb yesterday, although I have to admit I'm not very good with it. I'll take a closer look into the matter as well. I hope one of us will be able to resolve this sooner or later.
I put some work into it and patched cpuworker.c. As nickm wrote, CPUWORKER_BUSY_TIMEOUT seems to be ridiculously high, and gamambel suggested just trying more cpuworkers, so I also set MAX_CPUWORKERS to 32. Of course this doesn't fix the underlying error; it is merely a workaround, but so far it makes a huge difference. Previously the relay only hit about 8 MB/s, and I'm now up to about 60 MB/s (~450-500 Mbit/s), which is pretty good considering it's just a workaround.
Here comes the patch. Just put it in the files directory, recompile, and have fun.
I'll also try finding the actual cause of the problem, but I'll be busy the next two weeks, so that'll have to wait.
less /usr/ports/security/tor-devel/files/patch-src-or-cpuworker.c

--- src/or/cpuworker.c 2012-01-20 17:38:13.000000000 +0100
+++ src/or/cpuworker.c 2012-01-20 19:16:44.000000000 +0100
@@ -24,7 +24,7 @@
 #include "router.h"
 
 /** The maximum number of cpuworker processes we will keep around. */
-#define MAX_CPUWORKERS 16
+#define MAX_CPUWORKERS 32
 
 /** The minimum number of cpuworker processes we will keep around. */
 #define MIN_CPUWORKERS 1
@@ -405,7 +405,7 @@
  * up on it and decide that we have a bug or infinite loop?
  * This value is high because some servers with low memory/cpu
  * sometimes spend an hour or more swapping, and Tor starves. */
-#define CPUWORKER_BUSY_TIMEOUT (60*60*12)
+#define CPUWORKER_BUSY_TIMEOUT (60*2*1)
 
 /** We have a bug that I can't find. Sometimes, very rarely, cpuworkers get
  * stuck in the 'busy' state, even though the cpuworker process thinks of
I can reproduce this on FreeBSD 8.2 within 10-30 minutes, using processes (a single one, in fact).
It seems like the problem is either in tor writing to the socketpair or in the cpuworker reading from it.
Furthermore, adding 9 calls to log_debug() in read_all() seems to mask the bug. This suggests we should think about a possible problem with recv() on a socketpair.
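For context, here is a simplified stand-in for a read_all()-style loop (illustration only, not the actual src/common/util.c code). Because it keeps calling recv() until it has the full count, extra log_debug() calls inside the loop change the timing around each recv() and can plausibly hide a race.

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified sketch of a "read exactly count bytes" helper. */
static ssize_t
read_all_sketch(int fd, char *buf, size_t count, int isSocket)
{
  size_t numread = 0;
  while (numread < count) {
    ssize_t r = isSocket ? recv(fd, buf+numread, count-numread, 0)
                         : read(fd, buf+numread, count-numread);
    if (r < 0 && errno == EINTR)
      continue;                 /* interrupted; try again */
    if (r < 0)
      return -1;                /* real error */
    if (r == 0)
      return (ssize_t)numread;  /* peer closed before we got everything */
    numread += (size_t)r;
  }
  return (ssize_t)numread;
}

int
main(void)
{
  int fds[2];
  char buf[6] = {0};
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
    return 1;
  (void)send(fds[0], "hello", 5, 0);
  if (read_all_sketch(fds[1], buf, 5, 1) == 5)
    printf("got all 5 bytes: %s\n", buf);
  close(fds[0]);
  close(fds[1]);
  return 0;
}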
Linus found a lot of other interesting stuff when investigating this bug. Unfortunately, it doesn't seem to be described here. Linus, can you summarize, or should I try?
It looks like the cpuworker is blocking on recv(fd, buf, 1, 0) in cpuworker_main() even when tor thinks it has written to the socketpair. The worker process then gets 0 from recv() as soon as tor has decided it's dead and closes the socket.
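A tiny standalone demonstration (not Tor code) of the second half of that behaviour: a blocking recv() on one end of a socketpair returns 0 exactly when the other end is closed, which matches the worker only "waking up" once tor has given up and closed its side.

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
  int fds[2];
  char buf[1];
  ssize_t r;
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
    return 1;
  close(fds[0]);                /* "tor" gives up and closes its side */
  r = recv(fds[1], buf, 1, 0);  /* the cpuworker-style one-byte read */
  printf("recv returned %zd (0 means the peer closed)\n", r);
  close(fds[1]);
  return 0;
}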
Reading _connection_write_to_buf_impl() and friends makes me think it would be interesting to see whether a busy relay on Linux compiled without bufferevents could reproduce the bug. The three relays I've seen this on are all compiled without bufferevents due to problems with bufferevents on FreeBSD.
I'm now flushing explicitly (connection_flush()) after writing in assign_onionskin_to_cpuworker() on one of our relays. If this relay hasn't produced a wedged cpuworker by this time tomorrow, we should conclude that tor isn't writing.
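As a generic analog of that experiment (a hedged sketch with made-up names, not Tor's buffer code): requests normally sit in a userspace outbuf until the event loop writes them out, and the experiment adds an explicit flush right after the append so the worker's end sees the bytes immediately.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical userspace output buffer, loosely analogous to a conn outbuf. */
struct outbuf {
  char data[512];
  size_t len;
};

static void
outbuf_add(struct outbuf *b, const void *p, size_t n)
{
  memcpy(b->data + b->len, p, n);  /* no bounds check: sketch only */
  b->len += n;
}

/* The connection_flush()-style step: push everything queued onto the fd now. */
static void
outbuf_flush(struct outbuf *b, int fd)
{
  if (b->len && send(fd, b->data, b->len, 0) == (ssize_t)b->len)
    b->len = 0;
}

int
main(void)
{
  int fds[2];
  struct outbuf ob = { {0}, 0 };
  char req[9];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
    return 1;
  outbuf_add(&ob, "onionskin", 9);  /* queue the request, as before */
  outbuf_flush(&ob, fds[0]);        /* added: don't wait for the event loop */
  if (recv(fds[1], req, sizeof(req), 0) == (ssize_t)sizeof(req))
    printf("worker side received the request immediately\n");
  close(fds[0]);
  close(fds[1]);
  return 0;
}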
Added some log [warn] calls to _connection_write_to_buf_impl() to catch cases where tor doesn't write/flush; see bug4345-debugging in my public repo for details. Running this on another relay.
If I'm reading _connection_write_to_buf_impl() correctly, a non-OR connection will never be flushed (unless zlib is true, which it's not when coming from connection_write_to_buf()) -- ssize_t extra is set only for CONN_TYPE_OR. This looks like a bug -- the comment says /* just try to flush all of it */ but that won't happen.
For our case, fixing this might not help. Will conn->outbuf_flushlen really ever be >= 1<<16?
Should we bother making the flushing work for this case, or just force a flush in assign_onionskin_to_cpuworker()?
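To spell out that reading of _connection_write_to_buf_impl() (a paraphrase with illustrative constants, not the real connection.c): the force-flush amount only ever gets computed on the OR-connection branch, so a cpuworker connection never reaches the "just try to flush all of it" path regardless of how much is queued.

#include <stdio.h>

#define CONN_TYPE_OR        4   /* illustrative values, not Tor's real ones */
#define CONN_TYPE_CPUWORKER 5
#define OR_CONN_HIGHWATER   (1<<16)

static size_t
bytes_to_force_flush(int conn_type, size_t outbuf_flushlen)
{
  /* Per the reading above, only the OR-connection branch ever sets a
   * non-zero "extra"; other connection types never hit the
   * "just try to flush all of it" path. */
  if (conn_type == CONN_TYPE_OR && outbuf_flushlen > OR_CONN_HIGHWATER)
    return outbuf_flushlen - OR_CONN_HIGHWATER;
  return 0;
}

int
main(void)
{
  printf("OR conn, 100000 queued: force-flush %zu bytes\n",
         bytes_to_force_flush(CONN_TYPE_OR, 100000));
  printf("cpuworker conn, 100000 queued: force-flush %zu bytes\n",
         bytes_to_force_flush(CONN_TYPE_CPUWORKER, 100000));
  return 0;
}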
In libevent 2.0.18, Sebastian had " o Properly zero the kevent in kq_setup_kevent() (c2c7b39 Sebastian Hahn)"
If this can't be reproduced any longer, I might be inclined to close it. If you want to verify whether some libevent version fixed it, that might be helpful, but not if doing so will take a long time IMO.
Skruffy says this bug (still present sometimes) is because "Another end of socketpair is nowhere connected. Main thread writes to unix domain socket wasting memory for 12 hours. Reason is missed check for result of spawn_func."
If that's the case then my better_workqueue branch should fix it (see #9682 (moved)). For an 0.2.4 fix, here's a trivial one: bug4345a_024 in my public repository. It's been there since SVN revision 402...
Trac: Status: reopened to needs_review Milestone: Tor: unspecified to Tor: 0.2.4.x-final
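The shape of that trivial fix, sketched with made-up names (this is not the actual bug4345a_024 diff): if spawning the worker fails, close both ends of the socketpair instead of leaving the main thread queueing onionskins to a worker that never existed.

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Stand-in for tor's spawn_func(); pretend the worker failed to start. */
static int
spawn_func_sketch(void (*fn)(void *), void *arg)
{
  (void)fn;
  (void)arg;
  return -1;
}

static void
cpuworker_main_sketch(void *arg)
{
  (void)arg;
}

int
main(void)
{
  int fdarray[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fdarray) < 0)
    return 1;
  if (spawn_func_sketch(cpuworker_main_sketch, (void *)&fdarray[1]) < 0) {
    /* The missed check: without it, the main thread keeps the parent end
     * open and queues work to a worker that never existed. */
    fprintf(stderr, "could not spawn cpuworker; closing both fds\n");
    close(fdarray[0]);
    close(fdarray[1]);
    return 1;
  }
  return 0;
}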
For an 0.2.4 fix, here's a trivial one: bug4345a_024 in my public repository.
spawn_func returns errors for threads in these cases:

if (pthread_create(&thread,NULL,tor_pthread_helper_fn,d))
  return -1;
if (pthread_detach(thread))
  return -1;

If pthread_create() failed, then certainly no thread was launched, but if pthread_detach() failed, the thread may still be running. Is it safe to close the other end of the socketpair if the thread is actually running? spawn_func should detect actual errors and either terminate the failed thread or recover from the failed state.
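One possible answer to that question, sketched with hypothetical names (not the actual compat.c code or the bug4345a_024 branch): treat pthread_create() failure as "no worker" and clean up, but only log on pthread_detach() failure, since at that point the thread is already running.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct spawn_data {
  void (*func)(void *);
  void *arg;
};

/* Hypothetical helper comparable in role to tor_pthread_helper_fn(). */
static void *
thread_trampoline(void *arg)
{
  struct spawn_data *d = arg;
  d->func(d->arg);
  return NULL;
}

/* Sketch of spawn_func()-style error handling. */
static int
spawn_func_sketch(struct spawn_data *d)
{
  pthread_t thread;
  if (pthread_create(&thread, NULL, thread_trampoline, d))
    return -1;  /* definitely no thread: caller may safely clean up */
  if (pthread_detach(thread)) {
    /* The thread IS running; log and carry on rather than pretending
     * the spawn failed and closing its socket out from under it. */
    fprintf(stderr, "pthread_detach failed; continuing anyway\n");
  }
  return 0;
}

static void
worker(void *arg)
{
  (void)arg;
  puts("worker ran");
}

int
main(void)
{
  struct spawn_data d = { worker, NULL };
  if (spawn_func_sketch(&d) < 0)
    return 1;
  sleep(1);  /* crude: give the detached thread time to run */
  return 0;
}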
I've made bug4345a_024 include a fix for the pthread_detach() issue above.
I agree that it probably isn't the root cause of the specific case reported above, but I think it might be worth merging bug4345a_024 anyway, as a possible bug that's worth fixing. Agree?
bug4345a_024 still looks good to me; merged it to 0.2.5. This bug is now a needs_information for 0.2.???, though if we do #9682 (moved) (better workqueue implementation), it will probably just be something to ignore.