Opened 5 years ago

Closed 5 years ago

#10532 closed defect (worksforme)

[Tor relay] Random hangs

Reported by: mrc0mmand
Owned by:
Priority: Medium
Milestone:
Component: Core Tor/Tor
Version: Tor: unspecified
Severity:
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The relay runs smoothly for some time (1-10% CPU usage) until it suddenly starts consuming 100% CPU and its status in top/htop jumps between S and D. The process is uninterruptible/unresponsive and ignores every signal I send to it, so I can't attach strace/gdb to it or send SIGUSR2 for debug output. The only thing I can do is manually invoke the OOM killer, which kills mysqld (which is doing literally nothing at these times), after which Tor's usage returns to 1-10%. It happens at random intervals, anywhere from a few hours to days.

I'm experiencing this issue with three versions of Tor: 0.2.3.25, 0.2.4.19-rc, and now 0.2.4.20. I've tried many different configurations, including limiting RelayBandwidthRate and RelayBandwidthBurst to 100 KB/200 KB, which reduced CPU usage to 0.5-2%, but eventually it got stuck too.

For the reasons above, the only log I was able to get is a stack trace dump taken just before invoking the OOM killer. I'm not sure whether it's useful, but it's the only thing I have.

System info: Fedora 19 3.11.6-201.fc19.x86_64
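
The trace below passes through sysrq_handle_showstate, which suggests it was captured with the kernel's SysRq facility; the manual OOM kill mentioned above can be triggered the same way. A minimal C sketch of that (an assumption about the method used, not something stated in the report):

/* Minimal sketch, assuming the dump was produced via the kernel SysRq
 * interface: 't' writes the state of every task to the kernel log
 * (sysrq_handle_showstate), 'f' invokes the OOM killer manually.
 * Requires root and kernel.sysrq enabled. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void sysrq(char cmd)
{
    int fd = open("/proc/sysrq-trigger", O_WRONLY);
    if (fd < 0) {
        perror("open /proc/sysrq-trigger");
        return;
    }
    if (write(fd, &cmd, 1) != 1)
        perror("write /proc/sysrq-trigger");
    close(fd);
}

int main(void)
{
    sysrq('t');   /* dump task states, producing traces like the one below */
    /* sysrq('f');   would invoke the OOM killer, as described above */
    return 0;
}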

[2949490.851017] tor             R  running task        0  9426      1 0x00000084
[2949490.851017]  0000000000000000 ffff88001fc03c30 ffffffff81097b18 ffff88001c3acf40
[2949490.851017]  0000000000000000 ffff88001fc03c60 ffffffff81097bf2 ffff88001c3ad0c0
[2949490.851017]  ffffffff81c9cae0 0000000000000074 0000000000000002 ffff88001fc03c70
[2949490.851017] Call Trace:
[2949490.851017]  <IRQ>  [<ffffffff81097b18>] sched_show_task+0xa8/0x110
[2949490.851017]  [<ffffffff81097bf2>] show_state_filter+0x72/0xb0
[2949490.851017]  [<ffffffff813bb3d0>] sysrq_handle_showstate+0x10/0x20
[2949490.851017]  [<ffffffff813bba62>] __handle_sysrq+0xa2/0x170
[2949490.851017]  [<ffffffff813bbef2>] sysrq_filter+0x392/0x3d0
[2949490.851017]  [<ffffffff814ad849>] input_to_handler+0x59/0xf0
[2949490.851017]  [<ffffffff814aeb59>] input_pass_values.part.4+0x159/0x160
[2949490.851017]  [<ffffffff814b0ba5>] input_handle_event+0x125/0x530
[2949490.851017]  [<ffffffff814b1006>] input_event+0x56/0x70
[2949490.851017]  [<ffffffff814b798e>] atkbd_interrupt+0x5be/0x6b0
[2949490.851017]  [<ffffffff814aaa93>] serio_interrupt+0x43/0x90
[2949490.851017]  [<ffffffff814abafa>] i8042_interrupt+0x18a/0x370
[2949490.851017]  [<ffffffff810f5ffe>] handle_irq_event_percpu+0x3e/0x1e0
[2949490.851017]  [<ffffffff810f61d6>] handle_irq_event+0x36/0x60
[2949490.851017]  [<ffffffff810f8adf>] handle_edge_irq+0x6f/0x120
[2949490.851017]  [<ffffffff8101459f>] handle_irq+0xbf/0x150
[2949490.851017]  [<ffffffff8106c60f>] ? irq_enter+0x4f/0x90
[2949490.851017]  [<ffffffff81658acd>] do_IRQ+0x4d/0xc0
[2949490.851017]  [<ffffffff8164e46d>] common_interrupt+0x6d/0x6d
[2949490.851017]  <EOI>  [<ffffffff8114ed70>] ? shrink_page_list+0x460/0xb00
[2949490.851017]  [<ffffffff8114fa9a>] shrink_inactive_list+0x18a/0x4e0
[2949490.851017]  [<ffffffff81150475>] shrink_lruvec+0x345/0x670
[2949490.851017]  [<ffffffff8107f262>] ? insert_work+0x62/0xa0
[2949490.851017]  [<ffffffff8107f262>] ? insert_work+0x62/0xa0
[2949490.851017]  [<ffffffff81150806>] shrink_zone+0x66/0x1a0
[2949490.851017]  [<ffffffff81150cf0>] do_try_to_free_pages+0xf0/0x590
[2949490.851017]  [<ffffffff8114cfb4>] ? throttle_direct_reclaim.isra.40+0x84/0x270
[2949490.851017]  [<ffffffff81151261>] try_to_free_pages+0xd1/0x170
[2949490.851017]  [<ffffffff811459ea>] __alloc_pages_nodemask+0x69a/0xa30
[2949490.851017]  [<ffffffff811830e9>] alloc_pages_current+0xa9/0x170
[2949490.851017]  [<ffffffff81537600>] sk_page_frag_refill+0x70/0x160
[2949490.851017]  [<ffffffff81591720>] tcp_sendmsg+0x2f0/0xdc0
[2949490.851017]  [<ffffffff815baac4>] inet_sendmsg+0x64/0xb0
[2949490.851017]  [<ffffffff8129a973>] ? selinux_socket_sendmsg+0x23/0x30
[2949490.851017]  [<ffffffff81532e37>] sock_aio_write+0x137/0x150
[2949490.851017]  [<ffffffff811a7a30>] do_sync_write+0x80/0xb0
[2949490.851017]  [<ffffffff811a8235>] vfs_write+0x1b5/0x1e0
[2949490.851017]  [<ffffffff8164bf0a>] ? __schedule+0x2ba/0x750
[2949490.851017]  [<ffffffff811a8b79>] SyS_write+0x49/0xa0
[2949490.851017]  [<ffffffff81656919>] system_call_fastpath+0x16/0x1b
[2949490.851017] tor             S ffff88001fc14180     0 26681      1 0x00000080
[2949490.851017]  ffff88000f443b30 0000000000000082 ffff88000f443fd8 0000000000014180
[2949490.851017]  ffff88000f443fd8 0000000000014180 ffff88001cc8de80 7fffffffffffffff
[2949490.851017]  0000000000000000 ffff88001cc8de80 ffff88001eee4380 ffff88001eee4678
[2949490.851017] Call Trace:
[2949490.851017]  [<ffffffff8164c3c9>] schedule+0x29/0x70
[2949490.851017]  [<ffffffff8164a451>] schedule_timeout+0x201/0x2c0
[2949490.851017]  [<ffffffff811ecc83>] ? ep_poll_callback+0xf3/0x160
[2949490.851017]  [<ffffffff815ecd0e>] unix_stream_recvmsg+0x30e/0x850
[2949490.851017]  [<ffffffff81089420>] ? wake_up_atomic_t+0x30/0x30
[2949490.851017]  [<ffffffff81534328>] sock_recvmsg+0xa8/0xe0
[2949490.851017]  [<ffffffff81533c69>] ? sock_sendmsg+0x99/0xd0
[2949490.851017]  [<ffffffff811655e3>] ? handle_pte_fault+0x93/0xa70
[2949490.851017]  [<ffffffff8153448f>] SYSC_recvfrom+0xdf/0x160
[2949490.851017]  [<ffffffff8164bf0a>] ? __schedule+0x2ba/0x750
[2949490.851017]  [<ffffffff81534c0e>] SyS_recvfrom+0xe/0x10
[2949490.851017]  [<ffffffff81656919>] system_call_fastpath+0x16/0x1b

Child Tickets

Attachments (2)

detect_zero_of write.txt (3 bytes) - added by cypherpunks 5 years ago.
tor_debug_log.txt (10.1 KB) - added by mrc0mmand 5 years ago.
Tor debug log of the last freeze (the OOM killer was invoked at 20:14:31)


Change History (41)

comment:1 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:2 Changed 5 years ago by mrc0mmand

MySQL listens on port 3306 on the same interface as Tor, without SSL enabled, and is used only locally. ExtendAllowPrivateAddresses is set to its default; I haven't changed it. The exit policy now allows ports 80 and 443, but before that the machine was configured as a non-exit relay.

comment:3 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:4 Changed 5 years ago by mrc0mmand

Yes, it's not bound to a public IPv4 address.

comment:5 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:6 Changed 5 years ago by mrc0mmand

It has happened three times so far. It's hard to tell whether that's a coincidence or not, because the OOM killer always selects mysqld.

comment:7 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:8 Changed 5 years ago by mrc0mmand

I apologize for the late response. I am not aware of any application or utility that would communicate with both Tor and mysqld at once. Only two applications communicate with MySQL: php and php-fpm. Tor also has its control and socket ports disabled.
Anyway, a few minutes ago it happened again. I didn't use the OOM killer this time and instead tried killing processes other than mysqld - after killing php-fpm, Tor returned to its usual CPU usage. Apparently it's not just mysqld, which is really strange.

comment:9 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:10 Changed 5 years ago by mrc0mmand

That sounds reasonable. I've changed the ORPort from 110 to 9001; now the only thing I can do is wait. Thank you for your help so far - I'll post any results I get here.

comment:11 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:12 Changed 5 years ago by mrc0mmand

I telnetted to Tor's ORPort 9001, both with and without sending anything, but sadly there was no change in Tor's CPU usage.

comment:13 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:14 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:15 Changed 5 years ago by mrc0mmand

Unfortunately, it has now happened again, and in this case killing php/mysql didn't help at all - Tor kept jumping between the R and D states until the OOM killer killed it. It's really frustrating that I can't get any debug logs or anything else that would help in solving this issue.

comment:16 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:17 Changed 5 years ago by mrc0mmand

Every time I try to attach strace or gdb to tor (as root) during that unresponsive period, it just says "Attached to <PID>" and nothing more.

comment:18 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:19 Changed 5 years ago by mrc0mmand

When I ran it as 'gdb --pid <PID>', it got stuck at "Attaching to process <PID>". It didn't load any symbols from libraries, and I couldn't enter any commands.

comment:20 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:21 Changed 5 years ago by mrc0mmand

When I investigated the last 'freeze' for a longer time, I found that it only had 100% CPU usage while it was actually running (state R), which was for at most one second. During the above-mentioned uninterruptible sleep it had 0% CPU usage, and the 100% CPU usage itself is caused by kernel threads (the red CPU bar in htop). I apologize for the misinterpretation. I understand that uninterruptible sleep is necessary for I/O operations, but why does Tor get stuck in it?
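
One way to pin down how much time the process actually spends in each state, rather than watching top/htop, is to sample the state field of /proc/<pid>/stat. A small, hypothetical helper along those lines (an illustration only; it was not used in this ticket):

/* Hypothetical debugging aid: sample a PID's scheduler state
 * (R = running, S = sleeping, D = uninterruptible sleep) from
 * /proc/<pid>/stat a few times per second. Linux-only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static char proc_state(int pid)
{
    char path[64], buf[512];
    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return '?';
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    /* The state character is the first field after the ')' that closes
     * the command name. */
    char *p = strrchr(buf, ')');
    return (p && p[1] == ' ') ? p[2] : '?';
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    int pid = atoi(argv[1]);
    struct timespec ts = { 0, 200 * 1000 * 1000 };  /* 200 ms between samples */
    for (;;) {
        printf("%c\n", proc_state(pid));
        fflush(stdout);
        nanosleep(&ts, NULL);
    }
}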

comment:22 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:23 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:24 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:25 Changed 5 years ago by mrc0mmand

From what I can see, RLIMIT_FSIZE is set to 'unlimited' and the system has almost 7 GB of disk space left. Unfortunately, I have no idea how I could reproduce this problem in my environment.
Anyway, it happened again ~4 hours ago. At least I got a call trace log this time; maybe there is something worthwhile in it.

[3120575.364013] tor             R  running task        0 17990      1 0x00000084
[3120575.364013]  ffff88001165b768 0000000000000082 ffff88001165bfd8 0000000000014180
[3120575.364013]  ffff88001165bfd8 0000000000014180 ffff88001f6da620 ffff88001165a000
[3120575.364013]  00000001b9fb9fa3 ffff88001165ba68 0000000000000001 ffff88001ffe96c0
[3120575.364013] Call Trace:
[3120575.364013]  [<ffffffff810962e6>] __cond_resched+0x26/0x30
[3120575.364013]  [<ffffffff8164c78a>] _cond_resched+0x3a/0x50
[3120575.364013]  [<ffffffff8115af4d>] wait_iff_congested+0x6d/0x140
[3120575.364013]  [<ffffffff81089420>] ? wake_up_atomic_t+0x30/0x30
[3120575.364013]  [<ffffffff8114fb38>] shrink_inactive_list+0x228/0x4e0
[3120575.364013]  [<ffffffff81150475>] shrink_lruvec+0x345/0x670
[3120575.364013]  [<ffffffff81150806>] shrink_zone+0x66/0x1a0
[3120575.364013]  [<ffffffff81150cf0>] do_try_to_free_pages+0xf0/0x590
[3120575.364013]  [<ffffffff8114cfb4>] ? throttle_direct_reclaim.isra.40+0x84/0x270
[3120575.364013]  [<ffffffff81151261>] try_to_free_pages+0xd1/0x170
[3120575.364013]  [<ffffffff811459ea>] __alloc_pages_nodemask+0x69a/0xa30
[3120575.364013]  [<ffffffff811830e9>] alloc_pages_current+0xa9/0x170
[3120575.364013]  [<ffffffff81537600>] sk_page_frag_refill+0x70/0x160
[3120575.364013]  [<ffffffff81591720>] tcp_sendmsg+0x2f0/0xdc0
[3120575.364013]  [<ffffffff815baac4>] inet_sendmsg+0x64/0xb0
[3120575.364013]  [<ffffffff8129a973>] ? selinux_socket_sendmsg+0x23/0x30
[3120575.364013]  [<ffffffff81532e37>] sock_aio_write+0x137/0x150
[3120575.364013]  [<ffffffff811a7a30>] do_sync_write+0x80/0xb0
[3120575.364013]  [<ffffffff811a8235>] vfs_write+0x1b5/0x1e0
[3120575.364013]  [<ffffffff811a8b79>] SyS_write+0x49/0xa0
[3120575.364013]  [<ffffffff810e64f6>] ? __audit_syscall_exit+0x1f6/0x2a0
[3120575.364013]  [<ffffffff81656919>] system_call_fastpath+0x16/0x1b
[3120575.364013] tor             S ffff88001fc14180     0 19768      1 0x00000080
[3120575.364013]  ffff880001735b30 0000000000000082 ffff880001735fd8 0000000000014180
[3120575.364013]  ffff880001735fd8 0000000000014180 ffff88001c3acf40 7fffffffffffffff
[3120575.364013]  0000000000000000 ffff88001c3acf40 ffff880017e9a380 ffff880017e9a678
[3120575.364013] Call Trace:
[3120575.364013]  [<ffffffff8164c3c9>] schedule+0x29/0x70
[3120575.364013]  [<ffffffff8164a451>] schedule_timeout+0x201/0x2c0
[3120575.364013]  [<ffffffff811ecc83>] ? ep_poll_callback+0xf3/0x160
[3120575.364013]  [<ffffffff815ecd0e>] unix_stream_recvmsg+0x30e/0x850
[3120575.364013]  [<ffffffff81089420>] ? wake_up_atomic_t+0x30/0x30
[3120575.364013]  [<ffffffff81534328>] sock_recvmsg+0xa8/0xe0
[3120575.364013]  [<ffffffff81533c69>] ? sock_sendmsg+0x99/0xd0
[3120575.364013]  [<ffffffff811655e3>] ? handle_pte_fault+0x93/0xa70
[3120575.364013]  [<ffffffff810982e6>] ? try_to_wake_up+0xe6/0x290
[3120575.364013]  [<ffffffff8109ef87>] ? dequeue_entity+0x107/0x520
[3120575.364013]  [<ffffffff8153448f>] SYSC_recvfrom+0xdf/0x160
[3120575.364013]  [<ffffffff8164bf0a>] ? __schedule+0x2ba/0x750
[3120575.364013]  [<ffffffff81534c0e>] SyS_recvfrom+0xe/0x10
[3120575.364013]  [<ffffffff81656919>] system_call_fastpath+0x16/0x1b

comment:26 Changed 5 years ago by cypherpunks

Status: new → needs_review

comment:27 Changed 5 years ago by cypherpunks

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:28 Changed 5 years ago by cypherpunks

Status: needs_review → new

Changed 5 years ago by cypherpunks

Attachment: detect_zero_of write.txt added

Changed 5 years ago by mrc0mmand

Attachment: tor_debug_log.txt added

Tor debug log of the last freeze (the OOM killer was invoked at 20:14:31)

comment:29 Changed 5 years ago by nickm

It looks like cpunks deleted whatever patch he wrote above before I had a chance to see it.

His theory, as I understand it, is that write() is returning 0 rather than blocking when trying to write to a full socket or file. That doesn't seem like correct behavior to me, but let's try it out.

I've attached a trivial patch that will fix the problem if that's the case. Can you try it out?

diff --git a/src/common/util.c b/src/common/util.c
index 054de3d..0665720 100644
--- a/src/common/util.c
+++ b/src/common/util.c
@@ -1762,6 +1762,9 @@ write_all(tor_socket_t fd, const char *buf, size_t count, int isSocket)
       result = write((int)fd, buf+written, count-written);
     if (result<0)
       return -1;
+    else if (result == 0) {
+      log_notice(LD_BUG, "Apparently write() can return 0.")
+    }
     written += result;
   }
   return (ssize_t)count;

Also, if you're hitting OOM conditions, you should make sure that you're using the MaxMemInCellQueues option.
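
For context on the zero-return theory: a defensive write loop normally treats a return value of 0 as "no forward progress" and bails out instead of retrying, since silently looping on it is exactly what would produce a busy spin. A standalone sketch of that pattern (an illustration, not Tor's actual write_all()):

/* Standalone illustration of the pattern the patch above is probing for:
 * if write() ever returns 0 without making progress, bail out rather
 * than looping on it forever. */
#include <errno.h>
#include <unistd.h>

ssize_t write_all_sketch(int fd, const char *buf, size_t count)
{
    size_t written = 0;
    while (written < count) {
        ssize_t result = write(fd, buf + written, count - written);
        if (result < 0) {
            if (errno == EINTR)
                continue;          /* interrupted: retry */
            return -1;             /* real error */
        }
        if (result == 0)
            return -1;             /* no progress: give up instead of spinning */
        written += (size_t)result;
    }
    return (ssize_t)written;
}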

comment:30 Changed 5 years ago by mrc0mmand

Hi, thank you for your answer. I've recompiled Tor with your patch; hopefully this will solve my problem.

As for MaxMemInCellQueues, I totally forgot about it. I will fix it asap, thanks.

(Just one note: there is a missing semicolon after the log_notice() call.)

comment:31 Changed 5 years ago by mrc0mmand

It looked promising, but after six hours it happened again.

Well, the first 'deadlock' was after ~2 hours, but when I stopped the ttrss update daemon, Tor somehow recovered from it. So I disabled that daemon and hoped it would work, but after another four hours Tor was back in that 'deadlock'. I think my server is haunted...

I forgot to mention this before, but when Tor gets into that 'deadlock' it affects several other apps: when I try to run dmesg or open a new connection via SSH, those processes get stuck in uninterruptible sleep as well, just like Tor.

comment:32 Changed 5 years ago by nickm

Hm. Does the patch mentioned at #4345 help at all?

comment:33 Changed 5 years ago by mrc0mmand

The patch from #4345 appears to be for tor-0.2.3.25. I'm not quite sure where I should put connection_flush() in cpuworker.c in version 0.2.4.20.

comment:34 Changed 5 years ago by nickm

I meant this one: https://gitweb.torproject.org/nickm/tor.git/commitdiff/85b46d57bcc40b8053dafe5d0ebb4b0bb611b484 , referenced in the last comment. I think it should apply cleanly to 0.2.4.

comment:35 Changed 5 years ago by mrc0mmand

Oh, I somehow missed that one, thanks. Applying and recompiling went without problems; we'll see.

comment:36 Changed 5 years ago by mrc0mmand

Sadly, I have to report failure again. Still the same problem and the same scenario.

comment:37 Changed 5 years ago by mrc0mmand

I've run Tor three times in debug mode, just to be sure about where it gets stuck. Here are the results:

Jan 18 23:44:29.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f601aa9d658 to channel 0x7f601bb1aa70 with global ID 468
Jan 18 23:44:29.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f601aa9d870 to channel 0x7f601bb1aa70 with global ID 468
Jan 18 23:44:29.000 [debug] conn_write_callback(): socket 127 wants to write.
Jan 18 23:44:29.000 [debug] flush_chunk_tls(): flushed 4064 bytes, 6216 ready to flush, 6216 remain.

----
Jan 19 14:58:43.000 [debug] circuit_resume_edge_reading(): resuming
Jan 19 14:58:43.000 [debug] connection_or_process_cells_from_inbuf(): 275: starting, inbuf_datalen 0 (0 pending in tls object).
Jan 19 14:58:43.000 [debug] conn_write_callback(): socket 275 wants to write.

----
Jan 19 16:11:17.000 [debug] connection_or_process_cells_from_inbuf(): 375: starting, inbuf_datalen 0 (0 pending in tls object).
Jan 19 16:11:17.000 [debug] conn_write_callback(): socket 375 wants to write.
Jan 19 16:11:17.000 [debug] flush_chunk_tls(): flushed 3936 bytes, 12448 ready to flush, 12448 remain.
Jan 19 16:11:17.000 [debug] flush_chunk_tls(): flushed 4064 bytes, 8384 ready to flush, 8384 remain.

comment:38 Changed 5 years ago by mrc0mmand

Well, I've upgraded the OS on my VPS to Fedora 20, optimized or removed some I/O operations, and now my relay has been running for more than a week (with one necessary restart). I don't know what caused the previous problem, but it looks like it has been resolved.

Anyway, thank you very much for your help; I hope my relay will run smoothly for a while.

comment:39 Changed 5 years ago by nickm

Resolution: worksforme
Status: new → closed

Okay. We should reopen this one if somebody else runs into it, or if we ever figure out how to reproduce it, or what might have been going on there.
