Opened 8 years ago

Closed 4 years ago

Last modified 3 years ago

#4345 closed defect (wontfix)

Bug: closing wedged cpuworker.

Reported by: jn
Owned by:
Priority: Medium
Milestone:
Component: Core Tor/Tor
Version: Tor: 0.2.3.10-alpha
Severity:
Keywords: tor-relay 024-backport 025-triaged
Cc: juwi@…
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The relay DFRI0, running version 0.2.2.34 on FreeBSD 8.2-RELEASE-p4 amd64 with OpenSSL 0.9.8q (2 Dec 2010), got this error:

Oct 29 00:10:32.013 [notice] cull_wedged_cpuworkers(): Bug: closing wedged cpuworker. Can somebody find the bug?
Oct 29 00:10:32.024 [err] cpuworker_main(): Bug: writing response buf failed. Exiting.

Child Tickets

Change History (44)

comment:1 Changed 8 years ago by jn

More of the same error. I did not restart tor when the first error occurred, and it is still running now.

Oct 29 13:18:24.101 [notice] cull_wedged_cpuworkers(): Bug: closing wedged cpuworker. Can somebody find the bug?
Oct 29 13:18:24.109 [err] cpuworker_main(): Bug: writing response buf failed. Exiting.
Oct 29 13:44:19.301 [warn] Tried to establish rendezvous on non-OR or non-edge circuit.

libevent2-2.0.13 is in use.

comment:2 Changed 8 years ago by jn

Status: new → needs_information

Compiled tor again to note how it and libevent2 were compiled. This also caused an OpenSSL upgrade to version 1.0.0e. Will update the ticket if the bug appears again now.

comment:3 Changed 8 years ago by nickm

So, this message occurs when Tor has sent a question to a cpuworker, and hasn't gotten a response for an insanely long time.

One way to try to fix this, if you know your way around gdb, is to use gdb -p to attach to the tor process after the bug occurs, and look at each thread's stack in turn to see where each one is. If one of them really is stuck (on a lock or something), that should tell you where. Or you could try it after the process has been running for an hour or two: if this happens reliably, then odds are one or more cpuworkers will be stuck by then.
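
For example, something like this (the PID is a placeholder) will dump every thread's stack:

gdb -p <tor-pid>
(gdb) info threads
(gdb) thread apply all bt
(gdb) detach
(gdb) quit

A cpuworker that really is stuck should show up as a thread parked in a lock acquisition or a blocking recv().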

To provoke this message into occurring more quickly, we might lower CPUWORKER_BUSY_TIMEOUT; it is ridiculously high at the moment.

Any other ideas?

comment:4 Changed 8 years ago by nickm

Milestone: Tor: unspecified

comment:5 Changed 8 years ago by jn

The problem did not return on DFRI0 after tor was recompiled seven weeks ago. But I got the error again on another relay with a similar OS and similar versions, as noted in #4715. Will see if this error returns on that relay after the upgrade to version 0.2.2.35.

comment:6 Changed 8 years ago by juwi

Cc: juwi@… added
Version: Tor: 0.2.2.34 → Tor: 0.2.3.10-alpha

Hi, 

I'm running into exactly the same thing here on FreeBSD 9.0-RELEASE.

I have tor-devel-0.2.3.10.a running with libevent2-2.0.16 and compiled both against openssl-1.0.0f (openssl-1.0.0_8).

Has anyone found a solution or workaround for this yet, and could give some details?

If more information is required I will provide it.

comment:7 in reply to:  6 ; Changed 8 years ago by jn

Replying to juwi:

Hi, 

I'm running into exactly the same thing here on FreeBSD 9.0-RELEASE.

I have tor-devel-0.2.3.10.a running with libevent2-2.0.16 and compiled both against openssl-1.0.0f (openssl-1.0.0_8).

Has anyone found a solution or workaround for this yet, and could give some details?

If more information is required I will provide it.

No solution yet. My plan when time permits is to follow nickm's tip to lower CPUWORKER_BUSY_TIMEOUT to, say, one hour and then debug with gdb -p as soon as the problem happens again.

comment:8 in reply to:  7 Changed 8 years ago by juwi

Replying to jn:

Replying to juwi:

Hi,

I'm running into exactly the same thing here on FreeBSD 9.0-RELEASE.

I have tor-devel-0.2.3.10.a running with libevent2-2.0.16 and compiled both against openssl-1.0.0f (openssl-1.0.0_8).

Has anyone found a solution or workaround for this yet, and could give some details?

If more information is required I will provide it.

No solution yet. My plan when time permits is to follow nickm's tip to lower CPUWORKER_BUSY_TIMEOUT to, say, one hour and then debug with gdb -p as soon as the problem happens again.

Too bad. I'd hoped for an easy solution by now. I tried gdb yesterday already, although I have to admit that I'm not very good with it. I'll take a closer look into the matter as well, then. I hope that one of us will be able to resolve this sooner or later.

comment:9 Changed 8 years ago by juwi

I put some work into it and patched cpuworker.c. As nickm wrote, CPUWORKER_BUSY_TIMEOUT seems to be ridiculously high; gamambel also suggested just trying more cpuworkers, so I also set MAX_CPUWORKERS to 32. Of course this doesn't solve the underlying error yet, it is merely a workaround, but so far it makes a huge difference. While previously the relay merely hit about 8 MB/sec, I'm now up to about 60 MB/sec (~450-500 Mbit/s), which is pretty good, considering I just built a workaround.

Here comes the patch. Just put it in the files directory, recompile, and have fun.

I'll also try finding the actual cause of the problem, but I'll be busy the next two weeks, so that'll have to wait.

less /usr/ports/security/tor-devel/files/patch-src-or-cpuworker.c
--- src/or/cpuworker.c  2012-01-20 17:38:13.000000000 +0100
+++ src/or/cpuworker.c  2012-01-20 19:16:44.000000000 +0100
@@ -24,7 +24,7 @@
 #include "router.h"
 
 /** The maximum number of cpuworker processes we will keep around. */
-#define MAX_CPUWORKERS 16
+#define MAX_CPUWORKERS 32
 
 /** The minimum number of cpuworker processes we will keep around. */
 #define MIN_CPUWORKERS 1
@@ -405,7 +405,7 @@
  * up on it and decide that we have a bug or infinite loop?
  * This value is high because some servers with low memory/cpu
  * sometimes spend an hour or more swapping, and Tor starves. */
-#define CPUWORKER_BUSY_TIMEOUT (60*60*12)
+#define CPUWORKER_BUSY_TIMEOUT (60*2*1)
 
 /** We have a bug that I can't find. Sometimes, very rarely, cpuworkers get
  * stuck in the 'busy' state, even though the cpuworker process thinks of

comment:10 Changed 8 years ago by juwi

Another observation I've made is that it only affects one of my two running tor processes. So far I have no clue why this is, though.

comment:11 Changed 8 years ago by ln5

I can reproduce this on FreeBSD 8.2 within 10-30 minutes (using a single process, in fact).

It seems like the problem is either tor writing to the socketpair or the cpuworker reading from it.

Furthermore, adding 9 calls to log_debug() in read_all() seems to mask the bug. This suggests that we might want to think about a possible problem with recv() on a socketpair.

comment:12 Changed 8 years ago by ln5

When I wrote "seems to mask the bug" I meant to say that the bug didn't show itself. It seems I was wrong.

The bug showed up after some five hours, it seems.

comment:13 Changed 7 years ago by nickm

Linus found a lot of other interesting stuff when investigating this bug. Unfortunately, it doesn't seem to be described here. Linus, can you summarize, or should I try?

comment:14 Changed 7 years ago by ln5

It looks like the cpuworker is blocking on recv(fd, buf, 1, 0) in
cpuworker_main() even when tor thinks it has written to the
socketpair. The worker process then gets 0 from recv as soon as tor
has decided it's dead and closes the socket.

Time to verify the write in tor.
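
For reference, here is a minimal sketch (not Tor's actual cpuworker_main(), just the recv() pattern described above) of the worker's blocking one-byte read, and why it returns 0 only once tor closes its end of the socketpair:

#include <sys/types.h>
#include <sys/socket.h>

/* Sketch only: fd is the worker's end of the socketpair. */
static int
wait_for_request_tag(int fd, char *tag_out)
{
  ssize_t n = recv(fd, tag_out, 1, 0); /* blocks until a byte or EOF */
  if (n == 0)
    return -1; /* peer closed its end: tor has culled us as wedged */
  if (n < 0)
    return -1; /* genuine socket error */
  return 0;    /* got the one-byte request type */
}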

comment:15 Changed 7 years ago by ln5

Reading _connection_write_to_buf_impl() and friends makes me think
it'd be interesting to see if a busy relay on Linux compiled w/o
bufferevents could reproduce the bug. The three relays I've seen this
on are all compiled w/o bufferevents due to problems with bufferevents
on FreeBSD.

comment:16 Changed 7 years ago by ln5

Please ignore that last comment -- bufferevents are not being used in general.

comment:17 Changed 7 years ago by ln5

Flushing explicitly (connection_flush()) after writing in
assign_onionskin_to_cpuworker() on one of our relays. If this relay
hasn't produced a wedged cpuworker by this time tomorrow, we should
conclude that tor isn't writing.

Added some log [warn] calls to _connection_write_to_buf_impl() to catch
cases where tor doesn't write/flush; see bug4345-debugging in my
public repo for details. Running this on another relay.

comment:18 Changed 7 years ago by ln5

Flushing makes the bug go away.

--- a/src/or/cpuworker.c
+++ b/src/or/cpuworker.c
@@ -517,6 +517,7 @@ assign_onionskin_to_cpuworker(connection_t *cpuworker,
     connection_write_to_buf(qbuf, 1, cpuworker);
     connection_write_to_buf(tag, sizeof(tag), cpuworker);
     connection_write_to_buf(onionskin, ONIONSKIN_CHALLENGE_LEN, cpuworker);
+    connection_flush(cpuworker);
     tor_free(onionskin);
   }
   return 0;

comment:19 Changed 7 years ago by arma

Status: needs_information → assigned

stupid trac can't change from 'needs information' to 'needs review'

comment:20 Changed 7 years ago by arma

Status: assigned → needs_review

needs a changes entry too

comment:21 Changed 7 years ago by ln5

Flushing is more of a band-aid than a solution IMO.

We should try to find out what's going wrong, probably by inspecting
the send buffer when we're in the wedged state. Patch welcome.

comment:22 Changed 7 years ago by ln5

If I'm reading _connection_write_to_buf_impl() correctly, a non-OR
connection will never be flushed (unless zlib is true, which it's not
when coming from connection_write_to_buf()) -- ssize_t extra is set
only for CONN_TYPE_OR. This looks like a bug -- the comment says /*
just try to flush all of it */ but that won't happen.

For our case, fixing that might not help. Will
conn->outbuf_flushlen really be >= 1<<16?

Should we bother making the flushing work for this case or just force
a flush in assign_onionskin_to_cpuworker()?

comment:23 Changed 7 years ago by ln5

But hmm, why does this seem to happen only on FreeBSD systems? Do
socketpairs usually not need flushing?

comment:24 in reply to:  23 Changed 7 years ago by juwi

Replying to ln5:

But hmm, why does this seem to happen only on FreeBSD systems? Do
socketpairs usually not need flushing?

Strange, indeed. I wonder if this could actually be a FreeBSD bug instead of a Tor bug.

comment:25 Changed 7 years ago by ln5

I'm reading it wrong, sorry. :-(

If conn->write_event != 0 we'll call connection_start_writing().

I'll dig some more later.

comment:26 Changed 7 years ago by ln5

I'm unable to reproduce this on FreeBSD 8.2 and 9.0 both running Tor
0.2.3.16-alpha + Libevent 2.0.19 and OpenSSL 1.0.1c, all from ports.

I wonder if it'd be worthwhile to bisect Libevent versions. Didn't
Sebastian fix something related to kqueue in Libevent semi-recently?

comment:27 Changed 7 years ago by nickm

In 2.0.18, Sebastian had " o Properly zero the kevent in kq_setup_kevent() (c2c7b39 Sebastian Hahn)"

If this can't be reproduced any longer, I might be inclined to close it. If you want to verify whether some libevent version fixed it, that might be helpful, but not if doing so will take a long time IMO.

comment:28 Changed 7 years ago by ln5

Resolution: fixed
Status: needs_review → closed

Closing this as fixed.

More info on what the fix actually was still welcome of course.

comment:29 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:30 Changed 7 years ago by nickm

Component: Tor Relay → Tor

comment:31 Changed 6 years ago by arma

Skruffy says this bug (still present sometimes) is because
"Another end of socketpair is nowhere connected. Main thread writes
to unix domain socket wasting memory for 12 hours. Reason is missed check for
result of spawn_func."
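
To make that failure mode concrete, here is a hypothetical sketch (the spawn_func() signature and surrounding cleanup are assumptions, not Tor's actual code) of the check Skruffy says is missing. Without it, the main thread keeps its end of the socketpair registered as a live worker and buffers writes into it until the 12-hour wedged-worker timeout fires:

#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumed declarations; signatures are guesses, not copied from Tor. */
void cpuworker_main(void *arg);
int spawn_func(void (*func)(void *), void *data);

int
start_cpuworker_checked(void)
{
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
    return -1;
  if (spawn_func(cpuworker_main, (void *)(intptr_t)fds[1]) < 0) {
    /* This is the missing check: if the worker never launched, close
     * both ends instead of treating fds[0] as a busy worker that the
     * main thread keeps writing onionskins to for 12 hours. */
    close(fds[0]);
    close(fds[1]);
    return -1;
  }
  /* ...otherwise register fds[0] as the connection to the worker... */
  return 0;
}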

comment:32 Changed 6 years ago by nickm

Resolution: fixed
Status: closed → reopened

comment:33 Changed 6 years ago by nickm

Milestone: Tor: unspecified → Tor: 0.2.4.x-final
Status: reopened → needs_review

If that's the case then my better_workqueue branch should fix it (see #9682). For an 0.2.4 fix, here's a trivial one: bug4345a_024 in my public repository. It's been there since SVN revision 402...

comment:34 Changed 6 years ago by cypherpunks

For an 0.2.4 fix, here's a trivial one: bug4345a_024 in my public repository.

spawn_func returns errors for threads in cases:

  if (pthread_create(&thread,NULL,tor_pthread_helper_fn,d))
    return -1;
  if (pthread_detach(thread))
    return -1;

If pthread_create() failed, then for sure no thread was launched, but if pthread_detach() failed, the thread may still have been launched. Is it safe to close the other end of the socketpair if the thread is actually running? spawn_func should detect actual errors and terminate the failed thread or recover from the failed state.
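
Something like this sketch (plain pthreads, assumed names, not Tor's actual spawn_func) captures that distinction: only a pthread_create() failure means no thread exists, while a pthread_detach() failure leaves a running thread whose socketpair end must not be yanked away:

#include <pthread.h>

/* Sketch only: distinguish "no thread started" from "thread started
 * but detach failed". */
static int
spawn_func_sketch(void *(*fn)(void *), void *arg)
{
  pthread_t thread;
  if (pthread_create(&thread, NULL, fn, arg))
    return -1; /* nothing is running; caller may safely close the socketpair */
  if (pthread_detach(thread)) {
    /* The thread IS running: do not report this as "no worker", or the
     * caller will close the worker's end of the socketpair underneath
     * it.  At worst we leak an undetached thread handle. */
  }
  return 0;
}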

comment:35 Changed 6 years ago by cypherpunks

bug4345a_024 does not fix #4345; a thread that was never launched can't log anything like what was reported:

cpuworker_main(): Bug: writing response buf failed. Exiting.

bug4345a_024 fixes a falsely wedged cpuworker.

comment:36 Changed 6 years ago by nickm

I've made bug4345a_024 include a fix for the pthread_detach() issue above.

I agree that it probably isn't the root cause of the specific case reported above, but I think it might be worth merging bug4345a_024 anyway, since it addresses a possible bug that's worth fixing. Agree?

comment:37 Changed 6 years ago by cypherpunks

Agree?

Yes.

comment:38 Changed 5 years ago by nickm

Keywords: 024-backport added
Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final

Protocol is now to try things in 0.2.5.x first.

comment:39 Changed 5 years ago by andrea

Keywords: 025-triaged added

comment:40 Changed 5 years ago by nickm

bug4345a_024 still looks good to me; merged it to 0.2.5. This bug is now a needs_information for 0.2.???, though if we do #9682 (better workqueue implementation), it will probably just be something to ignore.

comment:41 Changed 5 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.???
Status: needs_review → needs_information

comment:42 Changed 4 years ago by nickm

Resolution: wontfix
Status: needs_information → closed

The cpuworker code has been completely rewritten in 0.2.6.

comment:43 Changed 3 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:44 Changed 3 years ago by nickm

Milestone: Tor: 0.3.???

Milestone deleted

Note: See TracTickets for help on using tickets.