Opened 5 years ago

Closed 5 years ago

#10532 closed defect (worksforme)

[Tor relay] Random hangs

Reported by: mrc0mmand
Owned by:
Priority: Medium
Milestone:
Component: Core Tor/Tor
Version: Tor: unspecified
Severity:
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The relay runs smoothly for some time (1-10% CPU usage) until it suddenly starts consuming 100% CPU and its status in top/htop jumps between S and D. The process is uninterruptible/unresponsive and ignores every signal I send to it, so I can't attach strace/gdb to it or send SIGUSR2 for debug output. The only thing I can do is manually invoke the OOM killer, which kills mysqld (which is doing literally nothing at these times), after which Tor's usage returns to 1-10%. It happens at random intervals, anywhere from a few hours to days.

I'm experiencing this issue with three versions of Tor: 0.2.3.25, 0.2.4.19-rc, and now 0.2.4.20. I've tried many different configurations, including limiting RelayBandwidthRate and RelayBandwidthBurst to 100 KB/200 KB, which reduced CPU usage to 0.5-2%, but eventually it got stuck too.

For the reasons above, the only log I was able to get is a stack trace dump taken just before invoking the OOM killer. I'm not sure whether it's useful, but it's the only thing I have.

System info: Fedora 19 3.11.6-201.fc19.x86_64
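
The trace below passes through sysrq_handle_showstate, which suggests it was captured with the kernel's SysRq facility; the manual OOM kill mentioned above can be triggered the same way. A minimal C sketch of that (an assumption about the method used, not something stated in the report):

/* Minimal sketch, assuming the dump was produced via the kernel SysRq
 * interface: 't' writes the state of every task to the kernel log
 * (sysrq_handle_showstate), 'f' invokes the OOM killer manually.
 * Requires root and kernel.sysrq enabled. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void sysrq(char cmd)
{
    int fd = open("/proc/sysrq-trigger", O_WRONLY);
    if (fd < 0) {
        perror("open /proc/sysrq-trigger");
        return;
    }
    if (write(fd, &cmd, 1) != 1)
        perror("write /proc/sysrq-trigger");
    close(fd);
}

int main(void)
{
    sysrq('t');   /* dump task states, producing traces like the one below */
    /* sysrq('f');   would invoke the OOM killer, as described above */
    return 0;
}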

[2949490.851017] tor             R  running task        0  9426      1 0x00000084
[2949490.851017]  0000000000000000 ffff88001fc03c30 ffffffff81097b18 ffff88001c3acf40
[2949490.851017]  0000000000000000 ffff88001fc03c60 ffffffff81097bf2 ffff88001c3ad0c0
[2949490.851017]  ffffffff81c9cae0 0000000000000074 0000000000000002 ffff88001fc03c70
[2949490.851017] Call Trace:
[2949490.851017]  <IRQ>  [<ffffffff81097b18>] sched_show_task+0xa8/0x110
[2949490.851017]  [<ffffffff81097bf2>] show_state_filter+0x72/0xb0
[2949490.851017]  [<ffffffff813bb3d0>] sysrq_handle_showstate+0x10/0x20
[2949490.851017]  [<ffffffff813bba62>] __handle_sysrq+0xa2/0x170
[2949490.851017]  [<ffffffff813bbef2>] sysrq_filter+0x392/0x3d0
[2949490.851017]  [<ffffffff814ad849>] input_to_handler+0x59/0xf0
[2949490.851017]  [<ffffffff814aeb59>] input_pass_values.part.4+0x159/0x160
[2949490.851017]  [<ffffffff814b0ba5>] input_handle_event+0x125/0x530
[2949490.851017]  [<ffffffff814b1006>] input_event+0x56/0x70
[2949490.851017]  [<ffffffff814b798e>] atkbd_interrupt+0x5be/0x6b0
[2949490.851017]  [<ffffffff814aaa93>] serio_interrupt+0x43/0x90
[2949490.851017]  [<ffffffff814abafa>] i8042_interrupt+0x18a/0x370
[2949490.851017]  [<ffffffff810f5ffe>] handle_irq_event_percpu+0x3e/0x1e0
[2949490.851017]  [<ffffffff810f61d6>] handle_irq_event+0x36/0x60
[2949490.851017]  [<ffffffff810f8adf>] handle_edge_irq+0x6f/0x120
[2949490.851017]  [<ffffffff8101459f>] handle_irq+0xbf/0x150
[2949490.851017]  [<ffffffff8106c60f>] ? irq_enter+0x4f/0x90
[2949490.851017]  [<ffffffff81658acd>] do_IRQ+0x4d/0xc0
[2949490.851017]  [<ffffffff8164e46d>] common_interrupt+0x6d/0x6d
[2949490.851017]  <EOI>  [<ffffffff8114ed70>] ? shrink_page_list+0x460/0xb00
[2949490.851017]  [<ffffffff8114fa9a>] shrink_inactive_list+0x18a/0x4e0
[2949490.851017]  [<ffffffff81150475>] shrink_lruvec+0x345/0x670
[2949490.851017]  [<ffffffff8107f262>] ? insert_work+0x62/0xa0
[2949490.851017]  [<ffffffff8107f262>] ? insert_work+0x62/0xa0
[2949490.851017]  [<ffffffff81150806>] shrink_zone+0x66/0x1a0
[2949490.851017]  [<ffffffff81150cf0>] do_try_to_free_pages+0xf0/0x590
[2949490.851017]  [<ffffffff8114cfb4>] ? throttle_direct_reclaim.isra.40+0x84/0x270
[2949490.851017]  [<ffffffff81151261>] try_to_free_pages+0xd1/0x170
[2949490.851017]  [<ffffffff811459ea>] __alloc_pages_nodemask+0x69a/0xa30
[2949490.851017]  [<ffffffff811830e9>] alloc_pages_current+0xa9/0x170
[2949490.851017]  [<ffffffff81537600>] sk_page_frag_refill+0x70/0x160
[2949490.851017]  [<ffffffff81591720>] tcp_sendmsg+0x2f0/0xdc0
[2949490.851017]  [<ffffffff815baac4>] inet_sendmsg+0x64/0xb0
[2949490.851017]  [<ffffffff8129a973>] ? selinux_socket_sendmsg+0x23/0x30
[2949490.851017]  [<ffffffff81532e37>] sock_aio_write+0x137/0x150
[2949490.851017]  [<ffffffff811a7a30>] do_sync_write+0x80/0xb0
[2949490.851017]  [<ffffffff811a8235>] vfs_write+0x1b5/0x1e0
[2949490.851017]  [<ffffffff8164bf0a>] ? __schedule+0x2ba/0x750
[2949490.851017]  [<ffffffff811a8b79>] SyS_write+0x49/0xa0
[2949490.851017]  [<ffffffff81656919>] system_call_fastpath+0x16/0x1b
[2949490.851017] tor             S ffff88001fc14180     0 26681      1 0x00000080
[2949490.851017]  ffff88000f443b30 0000000000000082 ffff88000f443fd8 0000000000014180
[2949490.851017]  ffff88000f443fd8 0000000000014180 ffff88001cc8de80 7fffffffffffffff
[2949490.851017]  0000000000000000 ffff88001cc8de80 ffff88001eee4380 ffff88001eee4678
[2949490.851017] Call Trace:
[2949490.851017]  [<ffffffff8164c3c9>] schedule+0x29/0x70
[2949490.851017]  [<ffffffff8164a451>] schedule_timeout+0x201/0x2c0
[2949490.851017]  [<ffffffff811ecc83>] ? ep_poll_callback+0xf3/0x160
[2949490.851017]  [<ffffffff815ecd0e>] unix_stream_recvmsg+0x30e/0x850
[2949490.851017]  [<ffffffff81089420>] ? wake_up_atomic_t+0x30/0x30
[2949490.851017]  [<ffffffff81534328>] sock_recvmsg+0xa8/0xe0
[2949490.851017]  [<ffffffff81533c69>] ? sock_sendmsg+0x99/0xd0
[2949490.851017]  [<ffffffff811655e3>] ? handle_pte_fault+0x93/0xa70
[2949490.851017]  [<ffffffff8153448f>] SYSC_recvfrom+0xdf/0x160
[2949490.851017]  [<ffffffff8164bf0a>] ? __schedule+0x2ba/0x750
[2949490.851017]  [<ffffffff81534c0e>] SyS_recvfrom+0xe/0x10
[2949490.851017]  [<ffffffff81656919>] system_call_fastpath+0x16/0x1b

Child Tickets

Attachments (2)

detect_zero_of write.txt (3 bytes) - added by cypherpunks 5 years ago.
tor_debug_log.txt (10.1 KB) - added by mrc0mmand 5 years ago.
Tor debug log of the last freeze (the OOM killer was invoked at 20:14:31)


Change History (41)

comment:1 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:2 Changed 5 years ago by mrc0mmand

MySQL listens on port 3306 on the same interface as Tor, without SSL enabled, and is used only locally. ExtendAllowPrivateAddresses is set to its default; I haven't changed it. The exit policy now allows ports 80 and 443, but before that the machine was configured as a non-exit relay.

comment:3 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:4 Changed 5 years ago by mrc0mmand

Yes, it's not bound to a public IPv4 address.

comment:5 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:6 Changed 5 years ago by mrc0mmand

It has happened three times so far. It's hard to tell whether that's a coincidence or not, because the OOM killer always selects mysqld.

comment:7 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:8 Changed 5 years ago by mrc0mmand

I apologize for the late response. I am not aware of any application or utility that would communicate with both Tor and mysqld at once. Only two applications communicate with MySQL: php and php-fpm. Tor also has its control and socket ports disabled.
Anyway, a few minutes ago it happened again. I didn't use the OOM killer this time and instead tried killing processes other than mysqld - after killing php-fpm, Tor returned to its usual CPU usage. Apparently it's not just mysqld, which is really strange.

comment:9 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:10 Changed 5 years ago by mrc0mmand

That sounds reasonable. I've changed the ORPort from 110 to 9001; now the only thing I can do is wait. Thank you for your help so far - I'll post any results I get here.

comment:11 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:12 Changed 5 years ago by mrc0mmand

I telnetted to Tor's ORPort 9001, both with and without sending anything, but sadly there was no change in Tor's CPU usage.

comment:13 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:14 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:15 Changed 5 years ago by mrc0mmand

Unfortunately, it has now happened again, and in this case killing php/mysql didn't help at all - Tor kept jumping between the R and D states until the OOM killer killed it. It's really frustrating that I can't get any debug logs or anything else that would help in solving this issue.

comment:16 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:17 Changed 5 years ago by mrc0mmand

Every time I try to attach strace or gdb to tor (as root) during that unresponsive period, it just says "Attached to <PID>" and nothing more.

comment:18 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:19 Changed 5 years ago by mrc0mmand

When I ran it as 'gdb --pid <PID>', it got stuck at "Attaching to process <PID>". It didn't load any symbols from libraries, and I couldn't enter any commands.

comment:20 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:21 Changed 5 years ago by mrc0mmand

When I investigated the last 'freeze' for a longer time, I found that it only had 100% CPU usage while it was actually running (state R), which was for at most one second. During the above-mentioned uninterruptible sleep it had 0% CPU usage, and the 100% CPU usage itself is caused by kernel threads (the red CPU bar in htop). I apologize for the misinterpretation. I understand that uninterruptible sleep is necessary for I/O operations, but why does Tor get stuck in it?
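
One way to pin down how much time the process actually spends in each state, rather than watching top/htop, is to sample the state field of /proc/<pid>/stat. A small, hypothetical helper along those lines (an illustration only; it was not used in this ticket):

/* Hypothetical debugging aid: sample a PID's scheduler state
 * (R = running, S = sleeping, D = uninterruptible sleep) from
 * /proc/<pid>/stat a few times per second. Linux-only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static char proc_state(int pid)
{
    char path[64], buf[512];
    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return '?';
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    /* The state character is the first field after the ')' that closes
     * the command name. */
    char *p = strrchr(buf, ')');
    return (p && p[1] == ' ') ? p[2] : '?';
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    int pid = atoi(argv[1]);
    struct timespec ts = { 0, 200 * 1000 * 1000 };  /* 200 ms between samples */
    for (;;) {
        printf("%c\n", proc_state(pid));
        fflush(stdout);
        nanosleep(&ts, NULL);
    }
}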

comment:22 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:23 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:24 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:25 Changed 5 years ago by mrc0mmand

From what I can see, RLIMIT_FSIZE is set to 'unlimited' and the system has almost 7 GB of disk space left. Unfortunately, I have no idea how I could reproduce this problem in my environment.
Anyway, it happened again ~4 hours ago. At least I got a call trace log this time; maybe there is something worthwhile in it.

[3120575.364013] tor             R  running task        0 17990      1 0x00000084
[3120575.364013]  ffff88001165b768 0000000000000082 ffff88001165bfd8 0000000000014180
[3120575.364013]  ffff88001165bfd8 0000000000014180 ffff88001f6da620 ffff88001165a000
[3120575.364013]  00000001b9fb9fa3 ffff88001165ba68 0000000000000001 ffff88001ffe96c0
[3120575.364013] Call Trace:
[3120575.364013]  [<ffffffff810962e6>] __cond_resched+0x26/0x30
[3120575.364013]  [<ffffffff8164c78a>] _cond_resched+0x3a/0x50
[3120575.364013]  [<ffffffff8115af4d>] wait_iff_congested+0x6d/0x140
[3120575.364013]  [<ffffffff81089420>] ? wake_up_atomic_t+0x30/0x30
[3120575.364013]  [<ffffffff8114fb38>] shrink_inactive_list+0x228/0x4e0
[3120575.364013]  [<ffffffff81150475>] shrink_lruvec+0x345/0x670
[3120575.364013]  [<ffffffff81150806>] shrink_zone+0x66/0x1a0
[3120575.364013]  [<ffffffff81150cf0>] do_try_to_free_pages+0xf0/0x590
[3120575.364013]  [<ffffffff8114cfb4>] ? throttle_direct_reclaim.isra.40+0x84/0x270
[3120575.364013]  [<ffffffff81151261>] try_to_free_pages+0xd1/0x170
[3120575.364013]  [<ffffffff811459ea>] __alloc_pages_nodemask+0x69a/0xa30
[3120575.364013]  [<ffffffff811830e9>] alloc_pages_current+0xa9/0x170
[3120575.364013]  [<ffffffff81537600>] sk_page_frag_refill+0x70/0x160
[3120575.364013]  [<ffffffff81591720>] tcp_sendmsg+0x2f0/0xdc0
[3120575.364013]  [<ffffffff815baac4>] inet_sendmsg+0x64/0xb0
[3120575.364013]  [<ffffffff8129a973>] ? selinux_socket_sendmsg+0x23/0x30
[3120575.364013]  [<ffffffff81532e37>] sock_aio_write+0x137/0x150
[3120575.364013]  [<ffffffff811a7a30>] do_sync_write+0x80/0xb0
[3120575.364013]  [<ffffffff811a8235>] vfs_write+0x1b5/0x1e0
[3120575.364013]  [<ffffffff811a8b79>] SyS_write+0x49/0xa0
[3120575.364013]  [<ffffffff810e64f6>] ? __audit_syscall_exit+0x1f6/0x2a0
[3120575.364013]  [<ffffffff81656919>] system_call_fastpath+0x16/0x1b
[3120575.364013] tor             S ffff88001fc14180     0 19768      1 0x00000080
[3120575.364013]  ffff880001735b30 0000000000000082 ffff880001735fd8 0000000000014180
[3120575.364013]  ffff880001735fd8 0000000000014180 ffff88001c3acf40 7fffffffffffffff
[3120575.364013]  0000000000000000 ffff88001c3acf40 ffff880017e9a380 ffff880017e9a678
[3120575.364013] Call Trace:
[3120575.364013]  [<ffffffff8164c3c9>] schedule+0x29/0x70
[3120575.364013]  [<ffffffff8164a451>] schedule_timeout+0x201/0x2c0
[3120575.364013]  [<ffffffff811ecc83>] ? ep_poll_callback+0xf3/0x160
[3120575.364013]  [<ffffffff815ecd0e>] unix_stream_recvmsg+0x30e/0x850
[3120575.364013]  [<ffffffff81089420>] ? wake_up_atomic_t+0x30/0x30
[3120575.364013]  [<ffffffff81534328>] sock_recvmsg+0xa8/0xe0
[3120575.364013]  [<ffffffff81533c69>] ? sock_sendmsg+0x99/0xd0
[3120575.364013]  [<ffffffff811655e3>] ? handle_pte_fault+0x93/0xa70
[3120575.364013]  [<ffffffff810982e6>] ? try_to_wake_up+0xe6/0x290
[3120575.364013]  [<ffffffff8109ef87>] ? dequeue_entity+0x107/0x520
[3120575.364013]  [<ffffffff8153448f>] SYSC_recvfrom+0xdf/0x160
[3120575.364013]  [<ffffffff8164bf0a>] ? __schedule+0x2ba/0x750
[3120575.364013]  [<ffffffff81534c0e>] SyS_recvfrom+0xe/0x10
[3120575.364013]  [<ffffffff81656919>] system_call_fastpath+0x16/0x1b

comment:26 Changed 5 years ago by cypherpunks

Status: new → needs_review

comment:27 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:28 Changed 5 years ago by cypherpunks

Status: needs_review → new

Changed 5 years ago by cypherpunks

Attachment: detect_zero_of write.txt added

Changed 5 years ago by mrc0mmand

Attachment: tor_debug_log.txt added

Tor debug log of the last freeze (the OOM killer was invoked at 20:14:31)

comment:29 Changed 5 years ago by nickm

It looks like cpunks deleted whatever patch he wrote above before I had a chance to see it.

His theory, as I understand it, is that write() is returning 0 rather than blocking when trying to write to a full socket or file. That doesn't seem like correct behavior to me, but let's try it out.

I've attached a trivial patch that will fix the problem if that's the case. Can you try it out?

diff --git a/src/common/util.c b/src/common/util.c
index 054de3d..0665720 100644
--- a/src/common/util.c
+++ b/src/common/util.c
@@ -1762,6 +1762,9 @@ write_all(tor_socket_t fd, const char *buf, size_t count, int isSocket)
       result = write((int)fd, buf+written, count-written);
     if (result<0)
       return -1;
+    else if (result == 0) {
+      log_notice(LD_BUG, "Apparently write() can return 0.")
+    }
     written += result;
   }
   return (ssize_t)count;

Also, if you're hitting OOM conditions, you should make sure that you're using the MaxMemInCellQueues option.
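
For context on the zero-return theory: a defensive write loop normally treats a return value of 0 as "no forward progress" and bails out instead of retrying, since silently looping on it is exactly what would produce a busy spin. A standalone sketch of that pattern (an illustration, not Tor's actual write_all()):

/* Standalone illustration of the pattern the patch above is probing for:
 * if write() ever returns 0 without making progress, bail out rather
 * than looping on it forever. */
#include <errno.h>
#include <unistd.h>

ssize_t write_all_sketch(int fd, const char *buf, size_t count)
{
    size_t written = 0;
    while (written < count) {
        ssize_t result = write(fd, buf + written, count - written);
        if (result < 0) {
            if (errno == EINTR)
                continue;          /* interrupted: retry */
            return -1;             /* real error */
        }
        if (result == 0)
            return -1;             /* no progress: give up instead of spinning */
        written += (size_t)result;
    }
    return (ssize_t)written;
}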

comment:30 Changed 5 years ago by mrc0mmand

Hi, thank you for your answer. I've recompiled Tor with your patch; hopefully this will solve my problem.

As for MaxMemInCellQueues, I totally forgot about it. I will fix it asap, thanks.

(Just one note: there is a missing semicolon after the log_notice() call.)

comment:31 Changed 5 years ago by mrc0mmand

It looked promising, but after six hours it happened again.

Well, the first 'deadlock' was after ~2 hours, but when I stopped the ttrss update daemon, Tor somehow recovered from it. So I disabled that daemon and hoped it would work, but after another four hours Tor was back in that 'deadlock'. I think my server is haunted...

I forgot to mention this before, but when Tor gets into that 'deadlock' it affects several other apps: when I try to run dmesg or open a new connection via SSH, those processes get stuck in uninterruptible sleep as well, just like Tor.

comment:32 Changed 5 years ago by nickm

Hm. Does the patch mentioned at #4345 help at all?

comment:33 Changed 5 years ago by mrc0mmand

The patch from #4345 appears to be for tor-0.2.3.25. I'm not quite sure where I should put connection_flush() in cpuworker.c in version 0.2.4.20.

comment:34 Changed 5 years ago by nickm

I meant this one: https://gitweb.torproject.org/nickm/tor.git/commitdiff/85b46d57bcc40b8053dafe5d0ebb4b0bb611b484 , referenced in the last comment. I think it should apply cleanly to 0.2.4.

comment:35 Changed 5 years ago by mrc0mmand

Oh, I somehow missed that one, thanks. Applying and recompiling went without problems; we'll see.

comment:36 Changed 5 years ago by mrc0mmand

Sadly, I have to report failure again. Still the same problem and the same scenario.

comment:37 Changed 5 years ago by mrc0mmand

I've run Tor three times in debug mode, just to be sure about where it gets stuck. Here are the results:

Jan 18 23:44:29.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f601aa9d658 to channel 0x7f601bb1aa70 with global ID 468
Jan 18 23:44:29.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f601aa9d870 to channel 0x7f601bb1aa70 with global ID 468
Jan 18 23:44:29.000 [debug] conn_write_callback(): socket 127 wants to write.
Jan 18 23:44:29.000 [debug] flush_chunk_tls(): flushed 4064 bytes, 6216 ready to flush, 6216 remain.

----
Jan 19 14:58:43.000 [debug] circuit_resume_edge_reading(): resuming
Jan 19 14:58:43.000 [debug] connection_or_process_cells_from_inbuf(): 275: starting, inbuf_datalen 0 (0 pending in tls object).
Jan 19 14:58:43.000 [debug] conn_write_callback(): socket 275 wants to write.

----
Jan 19 16:11:17.000 [debug] connection_or_process_cells_from_inbuf(): 375: starting, inbuf_datalen 0 (0 pending in tls object).
Jan 19 16:11:17.000 [debug] conn_write_callback(): socket 375 wants to write.
Jan 19 16:11:17.000 [debug] flush_chunk_tls(): flushed 3936 bytes, 12448 ready to flush, 12448 remain.
Jan 19 16:11:17.000 [debug] flush_chunk_tls(): flushed 4064 bytes, 8384 ready to flush, 8384 remain.

comment:38 Changed 5 years ago by mrc0mmand

Well, I've upgraded the OS on my VPS to Fedora 20, optimized or removed some I/O operations, and now my relay has been running for more than a week (with one necessary restart). I don't know what caused the previous problem, but it looks like it has been resolved.

Anyway, thank you very much for your help; I hope my relay will run smoothly for a while.

comment:39 Changed 5 years ago by nickm

Resolution: worksforme
Status: new → closed

Okay. We should reopen this one if somebody else runs into it, or if we ever figure out how to reproduce it, or what might have been going on there.
