Opened 7 years ago

Closed 2 years ago

#5650 closed defect (worksforme)

Tor hogging CPU

Reported by: One Sock Owned by: nickm
Priority: High Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-relay
Cc: Shondoit, mehturt@…, mikewolf53@…, pierre-yves.david@… Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Ordinarily Tor will run at about 5% CPU usage; but then, for no apparent reason, it will jump up to 50% usage and make the processor run at 50-60 degrees C. Once it jumps up, it stays there until I exit Tor, even if there's no traffic and I don't touch the mouse or keyboard.

I'm running Windows XP with a 2GHz core 2 CPU, and Tor browser bundle 2.2.35-8. I've been having the same problem with older Tor versions as well. I'm not running a relay.

Child Tickets

Attachments (1)

0001-Avoid-busy-looping-on-WANTREAD-within-connection_han.patch (2.0 KB) - added by nickm 6 years ago.
possible fix for #5650

Download all attachments as: .zip

Change History (49)

comment:1 Changed 7 years ago by Shondoit

Cc: Shondoit added
Component: - Select a componentTor bundles/installation
Owner: set to erinn

Do you happen to run Kaspersky?

Otherwise, does this happen at startup, or some point during browsing?
And does this happen consistently or randomly? i.e. Does it run fine one time and hog cpu another time.

Can you try with just a Vidalia bundle? (Can be found at https://www.torproject.org/download/download.html.en)
Be advised that this is not the recommended way to browse, so only use this for testing purposes.

comment:2 Changed 7 years ago by erinn

Status: newneeds_information

One Sock, can you let us know if you are still having this problem?

comment:3 Changed 6 years ago by mehturt

Cc: mehturt@… added

I'm seeing similar behavior, tor randomly starts to consume 100% of CPU of one core, until I restart it. I'm using tor on Debian stable, amd64.
But I only started to see this a week ago, was running fine for a year or so.

comment:4 Changed 6 years ago by mehturt

FYI, strace was showing repetitive epoll_wait(EPOLLOUT) returning 1 and read() returning -1 with errno being EAGAIN. After tor restart (and CPU normal), this is not the case.

comment:5 Changed 6 years ago by nickm

Component: Tor bundles/installationTor
Keywords: tor-relay added
Milestone: Tor: 0.2.4.x-final

Oh hey, that sounds like a bug.

If it happens again, I would love to see a stack trace with "bt --full" on the point where read is returning -1,EAGAIN over and over.

comment:6 Changed 6 years ago by temp5

I'm seeing the same thing as mehturt- tor relay consumes 100% of a core until I restart, then it's fine for a couple hours or more. Tor version 0.2.3.25 on Ubuntu Server 12.04 64 bit. And again same as him, been running fine for years until a few days ago.

I figured out how to run "perf top" but not sure how to run strace or anything. Would be happy to help too if you let me know what commands I should enter!

comment:7 Changed 6 years ago by mehturt

I think yoy can attach to a running tor process using "gdb tor $PID" and then type "bt" to get the stack trace.

comment:8 Changed 6 years ago by temp5

Thank you, while I was trying to figure out how to load symbols the 100% CPU usage went down to normal levels of 1%! So I'll keep working on it so I'm ready for next time.

I remembered too that I upgraded around 10 libraries a few days ago. I wonder if there's a bug in one of them, since Ubuntu is basically Debian?

comment:9 Changed 6 years ago by nickm

Owner: changed from erinn to nickm
Priority: normalmajor
Status: needs_informationassigned

Check out the stack traces from VladV on #8679 .

It looks like this is happening in or around read_to_buf_tls().

comment:10 Changed 6 years ago by nickm

Another thing to try: send Tor a USR2 signal while this is happening to turn on DEBUG-level logs. I want to see what SSL_read is returning and why we aren't disabling that event.

Oh hey; it's a connection_read_to_buf inside a connection_handle_write_impl. Interesting...

comment:11 Changed 6 years ago by nickm

#8680 (also closed as a duplicate) has a debug log. Thanks, alphawolf!

Apr 10 17:59:25.000 [debug] conn_write_callback(): socket 124 wants to write.
Apr 10 17:59:25.000 [debug] connection_read_to_buf(): 124: starting, inbuf_datalen 0 (0 pending in tls object). at_most 15872.
Apr 10 17:59:25.000 [debug] tor_tls_read(): read returned r=-1, err=-2
Apr 10 17:59:25.000 [debug] connection_read_to_buf(): After TLS read of 0: 0 read, 0 written
Apr 10 17:59:25.000 [debug] connection_or_process_cells_from_inbuf(): 124: starting, inbuf_datalen 0 (0 pending in tls object).

comment:12 Changed 6 years ago by alphawolf

Cc: mikewolf53@… added

Changed 6 years ago by nickm

possible fix for #5650

comment:13 Changed 6 years ago by nickm

Status: assignedneeds_review

Okay, it looks like this is happening while we get a write event on a connection in state OR_CONN_STATE_TLS_SERVER_RENEGOTIATING. We then call "connection_or_handle_read", since we expect that we're in read-blocked-on-write state. But in fact there is nothing to read.

I have a more or less totally untested fix in my branch "bug5650" in my public repository. It's against 0.2.3. It needs testing!

I'll also attach it as a patch. Please let me know if it stops this bug, or breaks the server entirely, or what.

comment:14 Changed 6 years ago by nickm

(This is not the cleanest approach to a fix for this issue, but it seemed like the one easiest to understand and review, which is a plus if we want to get it back into 0.2.3 or earlier.)

comment:15 Changed 6 years ago by alphawolf

I'm willing to test, but I'd rather not obliterate my current install (via repo). Can I just move my current tor binary (which is 0.2.4.11), build this 0.2.3.x from source but *not* "make install", then copy the new tor binary to /usr/sbin/tor? Then after I'm done testing, I can replace the test binary with my original?  Or do I need to uninstall the original tor completely?

comment:16 Changed 6 years ago by nickm

That would work. You could also apply the above patch to 0.2.4.11-alpha if you prefer and you want to minimize external variables -- I expect it will probably apply cleanly.

comment:17 Changed 6 years ago by marmoute

Cc: pierre-yves.david@… added

comment:18 Changed 6 years ago by alphawolf

Argh!! I'm a newb.  This is all pretty new to me... but I figured out how to clone the main repo... and how make a branch at 0.2.4.11-alpha... and how to apply the patch to that branch...  And it even compiles successfully.  But after 4 tries, I can't figure out the correct set of configure flags to make the darn thing read the torrc in the default location on Debain (/etc/tor/torrc).  I thought just compiling it *without* any flags would be correct, as long as I started it with "/etc/init.d/tor start" ...  I was wrong.  Here is my latest (failed) attempt.

# CFLAGS="-O2 -march=native -g" | ./configure --disable-asciidoc --sysconfdir=/etc/tor

Please help.

comment:19 Changed 6 years ago by nickm

Here are the rules that debian uses to configure Tor:
https://gitweb.torproject.org/debian/tor.git/blob/HEAD:/debian/rules

It looks like the requisite magic may be "--sysconfdir=/etc" ; the other configure options may be needed too.

(There is probably a better way to do this, but I don't know debian packaging so well.)

comment:20 Changed 6 years ago by alphawolf

I could have sworn I tried that...  anyway, that worked, thanks!  Tor is up and running, relaying traffic.  I'll let you know if the problem recurs.  If there is a manual way to recreate the bug, I'm willing to give that a try.

comment:21 Changed 6 years ago by nickm

Thank you! Keep an eye on your server as it goes, to make sure not only that the bug doesn't reappear, but also that you don't see any other weird behaviors.

How long did it typically take the behavior to manifest before?

comment:22 Changed 6 years ago by alphawolf

It seems like it started in the last 24 hours... and if I remember correctly, the first time I noticed it was after 16hrs of uptime, and the second time was after about 2.5hrs. But prior to this, it's never happened at all. It is interesting that 4 people reported having the problem today, and all 4 are running either Debian or a Debian derivative. (Two Ubuntu 12.04's, one Debian Squeeze, and one Debian Wheezy).

comment:23 Changed 6 years ago by alphawolf

This is odd... I seem to be experiencing this bug suddenly: #8235.

Is it possible the patch could have caused this?

comment:24 Changed 6 years ago by nickm

I'd guess that it's *probably* unrelated.

comment:25 Changed 6 years ago by nickm

Keywords: 023-backport added

(Other folks -- it would be great if you can test this too.)

comment:26 Changed 6 years ago by nickm

I tweaked it a little more and added a changes file. Roger, Andrea -- could one of you take a look at my change before I merge this into 0.2.4?

comment:27 Changed 6 years ago by arma

changes file looks fine. (Thanks!)

You might rebase it to remove the bonus blank line introduced in 098b085a89. Unless you wanted it there.

Also, it would be wonderful to add a few sentences under the TLS_WANTREAD case that explain what's going on. That will sure help the next developer struggling with related issues.

And finally, should we merge into 0.2.3, since we expect to want it there too?

comment:28 in reply to:  27 Changed 6 years ago by nickm

Replying to arma:

changes file looks fine. (Thanks!)

You might rebase it to remove the bonus blank line introduced in 098b085a89. Unless you wanted it there.

Also, it would be wonderful to add a few sentences under the TLS_WANTREAD case that explain what's going on. That will sure help the next developer struggling with related issues.

Will do.

And finally, should we merge into 0.2.3, since we expect to want it there too?

I think the plan is, "Try it in an 0.2.4 release, and assuming it doesn't break stuff, merge it into 0.2.3 too". Good plan?

comment:29 Changed 6 years ago by alphawolf

Just an update: my relay with 0.2.4.11-alpha + bug5650 patch has been up for 26 hours without any recurrence of this bug. If that's sufficient testing, I'd really like to reboot this instance and see if I can tackle #8235, which has gotten quite nasty on this machine.

comment:30 Changed 6 years ago by nickm

fixes done, and squashed into "bug5650_squashed", and merged into 0.2.4.

Marking for 0.2.3 backport.

Please let me know if the bug recurs with this patch in place.

comment:31 in reply to:  29 Changed 6 years ago by nickm

Replying to alphawolf:

Just an update: my relay with 0.2.4.11-alpha + bug5650 patch has been up for 26 hours without any recurrence of this bug. If that's sufficient testing, I'd really like to reboot this instance and see if I can tackle #8235, which has gotten quite nasty on this machine.

FWIW, we *think* we did a fix for #8235 in current maint-0.2.4. At the very least, the log messages should be better there.

comment:32 Changed 6 years ago by nickm

Milestone: Tor: 0.2.4.x-finalTor: 0.2.3.x-final

comment:33 Changed 6 years ago by arma

It sounds like we should try to get this into 0.2.3.26 (it's affecting relay operators, and I think quite a few). But it would be fine to wait until 0.2.3.27 too (if there even is one, that is).

comment:34 Changed 6 years ago by nickm

Assuming that it doesn't fail in 0.2.4.12-alpha, I suggest an early backport; this one is annoying and problematic, and the fix is pretty simple.

comment:35 Changed 6 years ago by arma

Merge it, I say. (We're not going to figure out if 0.2.4.12-alpha broke it anytime soon, if we haven't already figured it out.)

comment:36 Changed 6 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Merged into maint-0.2.3

comment:37 Changed 6 years ago by cypherpunks

Merged fix do hides real reason of this bug. Suggests another way to fix it.

--- connection.origin.c
+++ connection.c
@@ -3068,16 +3068,8 @@
         connection_start_writing(conn);
         return 0;
       case TOR_TLS_WANTREAD:
-        if (conn->in_connection_handle_write) {
-          /* We've been invoked from connection_handle_write, because we're
-           * waiting for a TLS renegotiation, the renegotiation started, and
-           * SSL_read returned WANTWRITE.  But now SSL_read is saying WANTREAD
-           * again.  Stop waiting for write events now, or else we'll
-           * busy-loop until data arrives for us to read. */
-          connection_stop_writing(conn);
-          if (!connection_is_reading(conn))
-            connection_start_reading(conn);
-        }
+        if (!connection_is_reading(conn)) /* XXX: It's bug if no reading */
+          connection_start_reading(conn);
         /* we're already reading, one hopes */
         result = 0;
         break;
@@ -3500,6 +3492,7 @@
       }
       return 0;
     } else if (conn->state == OR_CONN_STATE_TLS_SERVER_RENEGOTIATING) {
+      connection_stop_writing(conn);
       return connection_handle_read(conn);
     }

in_connection_handle_write flag is no need anymore.

comment:38 Changed 6 years ago by cypherpunks

I think part

+        if (!connection_is_reading(conn)) /* XXX: It's bug if no reading */
+          connection_start_reading(conn);

need to be removed. reading connection can be blocked on bw limits, no need to start it here if so.

comment:39 Changed 6 years ago by cypherpunks

Resolution: fixed
Status: closedreopened

comment:40 Changed 6 years ago by nickm

Keywords: 024-backport added
Milestone: Tor: 0.2.3.x-finalTor: 0.2.5.x-final

cypherpunks says:

Merged fix do hides real reason of this bug.

If I understand this correctly, you mean that this merged fix is not the best way to fix the bug, but that it should solve the symptoms of the bug anyway.

If that's correct, I'm marking this for 0.2.5 and a possible backport.

comment:41 Changed 5 years ago by nickm

Milestone: Tor: 0.2.5.x-finalTor: 0.2.6.x-final
Status: reopenedneeds_review

comment:42 Changed 4 years ago by nickm

Milestone: Tor: 0.2.6.x-finalTor: 0.2.???

comment:43 Changed 3 years ago by teor

Milestone: Tor: 0.2.???Tor: 0.3.???

Milestone renamed

comment:44 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.???Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:45 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:46 Changed 2 years ago by nickm

Keywords: 023-backport removed

0.2.3 is obsolete; there will be no 0.2.3 backports.

comment:47 Changed 2 years ago by nickm

Keywords: 024-backport removed

None of these is ripe for backport to 0.2.4 even if it does get fixed.

comment:48 Changed 2 years ago by nickm

Resolution: worksforme
Severity: Normal
Status: needs_reviewclosed

It looks like we fixed the issue here?

Note: See TracTickets for help on using tickets.