#27813 closed defect (fixed)

Tor 0.3.4.8 is leaking memory

Reported by: anong Owned by:
Priority: Very High Milestone: Tor: 0.3.5.x-final
Component: Core Tor/Tor Version: Tor: 0.3.4.8
Severity: Critical Keywords: regression? memleak oom 034-backport tor-relay 035-must
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

I upgraded to Tor 0.3.4.8 yesterday and today the process has been killed by the oom killer.

Out of memory: Kill process 1369 (tor) score 338 or sacrifice child.

It seems Tor is leaking memory until there is nothing left.

Child Tickets

Attachments (1)

tor-sockstats.png (212.9 KB) - added by torkit 12 months ago.
tor sockstats with proposed patch applied


Change History (38)

comment:1 Changed 13 months ago by anong

It seems this has been observed by other users as well: https://www.reddit.com/r/TOR/comments/9gts72/memory_leaks_since_034_on_debian/

comment:2 Changed 13 months ago by anong

Here are the latest log entries before the process got killed. Tor is running on a Raspberry Pi under Debian 9.4 (DietPi), with kernel 4.14.70-v7+.

Sep 21 12:37:03 DietPi Tor[1369]: Heartbeat: Tor's uptime is 1 day 12:00 hours, with 1365 circuits open. I've sent 69.43 GB and received 68.12 GB.
Sep 21 12:37:03 DietPi Tor[1369]: Circuit handshake stats since last time: 3340/3340 TAP, 33194/33194 NTor.
Sep 21 12:37:03 DietPi Tor[1369]: Since startup, we have initiated 0 v1 connections, 0 v2 connections, 0 v3 connections, and 32984 v4 connections; and received 2 v1 connect
Sep 21 12:37:03 DietPi Tor[1369]: DoS mitigation since startup: 0 circuits killed with too many cells. 0 circuits rejected, 0 marked addresses. 0 connections closed. 389 si
Sep 21 13:17:01 DietPi CRON[7483]: pam_unix(cron:session): session opened for user root by (uid=0)
Sep 21 13:17:01 DietPi CRON[7487]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep 21 13:17:02 DietPi CRON[7483]: pam_unix(cron:session): session closed for user root
Sep 21 14:17:01 DietPi CRON[7615]: pam_unix(cron:session): session opened for user root by (uid=0)
Sep 21 14:17:01 DietPi CRON[7619]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep 21 14:17:01 DietPi CRON[7615]: pam_unix(cron:session): session closed for user root
Sep 21 14:26:18 DietPi dropbear[7726]: Child connection from 192.168.100.11:51216
Sep 21 14:26:18 DietPi dropbear[7726]: Pubkey auth succeeded for 'root' with key md5 de:04:75:29:92:ad:c2:a0:9a:62:27:4a:37:53:92:fb from 192.168.100.11:51216
Sep 21 14:26:18 DietPi dropbear[7727]: lastlog_perform_login: Couldn't stat /var/log/lastlog: No such file or directory
Sep 21 14:26:18 DietPi dropbear[7727]: lastlog_openseek: /var/log/lastlog is not a file or directory!
Sep 21 14:27:23 DietPi sudo[7880]:     root : TTY=pts/1 ; PWD=/root ; USER=debian-tor ; COMMAND=/usr/local/bin/nyx
Sep 21 14:27:23 DietPi sudo[7880]: pam_unix(sudo:session): session opened for user debian-tor by root(uid=0)
Sep 21 14:27:25 DietPi Tor[1369]: New control connection opened from 127.0.0.1.
Sep 21 14:28:30 DietPi sudo[7880]: pam_unix(sudo:session): session closed for user debian-tor
Sep 21 15:17:01 DietPi CRON[7943]: pam_unix(cron:session): session opened for user root by (uid=0)
Sep 21 15:17:01 DietPi CRON[7947]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep 21 15:17:02 DietPi CRON[7943]: pam_unix(cron:session): session closed for user root
Sep 21 15:34:42 DietPi dropbear[7726]: Exit (root): Disconnect received
Sep 21 16:17:01 DietPi CRON[8072]: pam_unix(cron:session): session opened for user root by (uid=0)
Sep 21 16:17:01 DietPi CRON[8076]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep 21 16:17:02 DietPi CRON[8072]: pam_unix(cron:session): session closed for user root
Sep 21 17:17:01 DietPi kernel: tor invoked oom-killer: gfp_mask=0x14080c0(GFP_KERNEL|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
Sep 21 17:17:02 DietPi kernel: tor cpuset=/ mems_allowed=0
Sep 21 17:17:02 DietPi kernel: CPU: 3 PID: 1376 Comm: tor Not tainted 4.14.70-v7+ #1144
Sep 21 17:17:02 DietPi kernel: Hardware name: BCM2835
Sep 21 17:17:02 DietPi kernel: [<8010ffd8>] (unwind_backtrace) from [<8010c240>] (show_stack+0x20/0x24)
Sep 21 17:17:02 DietPi kernel: [<8010c240>] (show_stack) from [<80787b24>] (dump_stack+0xd4/0x118)
Sep 21 17:17:02 DietPi kernel: [<80787b24>] (dump_stack) from [<80224564>] (dump_header+0xac/0x208)
Sep 21 17:17:02 DietPi kernel: [<80224564>] (dump_header) from [<802238cc>] (oom_kill_process+0x478/0x584)
Sep 21 17:17:02 DietPi kernel: [<802238cc>] (oom_kill_process) from [<8022422c>] (out_of_memory+0x124/0x334)
Sep 21 17:17:02 DietPi kernel: [<8022422c>] (out_of_memory) from [<80229ce8>] (__alloc_pages_nodemask+0x1060/0x11d8)
Sep 21 17:17:02 DietPi kernel: [<80229ce8>] (__alloc_pages_nodemask) from [<802561e8>] (__pte_alloc+0x34/0x19c)
Sep 21 17:17:02 DietPi kernel: [<802561e8>] (__pte_alloc) from [<80259eb4>] (handle_mm_fault+0xb98/0xc20)
Sep 21 17:17:02 DietPi kernel: [<80259eb4>] (handle_mm_fault) from [<802543f0>] (__get_user_pages+0x108/0x470)
Sep 21 17:17:02 DietPi kernel: [<802543f0>] (__get_user_pages) from [<80254dc4>] (populate_vma_page_range+0x84/0x8c)
Sep 21 17:17:02 DietPi kernel: [<80254dc4>] (populate_vma_page_range) from [<80254e7c>] (__mm_populate+0xb0/0x148)
Sep 21 17:17:02 DietPi kernel: [<80254e7c>] (__mm_populate) from [<802429cc>] (vm_mmap_pgoff+0xc4/0xc8)
Sep 21 17:17:02 DietPi kernel: [<802429cc>] (vm_mmap_pgoff) from [<8025d3d0>] (SyS_mmap_pgoff+0x58/0xd0)
Sep 21 17:17:02 DietPi kernel: [<8025d3d0>] (SyS_mmap_pgoff) from [<80108060>] (ret_fast_syscall+0x0/0x28)
Sep 21 17:17:02 DietPi kernel: Mem-Info:
Sep 21 17:17:02 DietPi kernel: active_anon:7 inactive_anon:0 isolated_anon:0
                                active_file:19 inactive_file:0 isolated_file:0
                                unevictable:177347 dirty:6 writeback:0 unstable:0
                                slab_reclaimable:2269 slab_unreclaimable:6250
                                mapped:18266 shmem:20 pagetables:394 bounce:0
                                free:5786 free_pcp:42 free_cma:1700
Sep 21 17:17:02 DietPi kernel: Node 0 active_anon:28kB inactive_anon:0kB active_file:76kB inactive_file:0kB unevictable:709388kB isolated(anon):0kB isolated(file):0kB mappe
Sep 21 17:17:02 DietPi kernel: Normal free:23144kB min:16384kB low:20480kB high:24576kB active_anon:196kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:7
Sep 21 17:17:02 DietPi kernel: lowmem_reserve[]: 0 0
Sep 21 17:17:02 DietPi kernel: Normal: 198*4kB (UMEC) 229*8kB (UMEC) 325*16kB (UMEC) 69*32kB (UMEC) 14*64kB (UMEC) 6*128kB (UMC) 2*256kB (UC) 3*512kB (M) 3*1024kB (UM) 1*20
Sep 21 17:17:02 DietPi kernel: 18353 total pagecache pages
Sep 21 17:17:02 DietPi kernel: 11 pages in swap cache
Sep 21 17:17:02 DietPi kernel: Swap cache stats: add 4204, delete 4193, find 24/46
Sep 21 17:17:02 DietPi kernel: Free swap  = 1080828kB
Sep 21 17:17:02 DietPi kernel: Total swap = 1097724kB
Sep 21 17:17:02 DietPi kernel: 255488 pages RAM
Sep 21 17:17:02 DietPi kernel: 0 pages HighMem/MovableOnly
Sep 21 17:17:02 DietPi kernel: 5443 pages reserved
Sep 21 17:17:02 DietPi kernel: 2048 pages cma reserved
Sep 21 17:17:02 DietPi kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Sep 21 17:17:02 DietPi kernel: [   84]     0    84     6889      446      10       0       98             0 systemd-journal
Sep 21 17:17:02 DietPi kernel: [  142]     0   142     3492      435       8       0      126         -1000 systemd-udevd
Sep 21 17:17:02 DietPi kernel: [  276]     0   276      616      262       6       0       30             0 dropbear
Sep 21 17:17:02 DietPi kernel: [  814]     0   814     1049      258       5       0       35             0 agetty
Sep 21 17:17:02 DietPi kernel: [  953]     0   953     1324      403       6       0       43             0 cron
Sep 21 17:17:02 DietPi kernel: [ 1369]  1001  1369   182943   177124     358       0        0             0 tor
Sep 21 17:17:02 DietPi kernel: Out of memory: Kill process 1369 (tor) score 338 or sacrifice child
Sep 21 17:17:02 DietPi kernel: Killed process 1369 (tor) total-vm:731772kB, anon-rss:635436kB, file-rss:73060kB, shmem-rss:0kB
Sep 21 17:17:02 DietPi kernel: oom_reaper: reaped process 1369 (tor), now anon-rss:636364kB, file-rss:73048kB, shmem-rss:0kB
Last edited 13 months ago by anong (previous) (diff)

comment:3 Changed 13 months ago by medo

The same thing is happening on my server, Tor version 0.3.4.8 (git-5da0e95e4871a0a1) on Ubuntu Xenial. You can clearly see in this graph when I updated to 0.3.4.8: https://www.ganggarrison.com/filehost/sharex/chrome_2018-09-23_02-28-57.png

comment:4 Changed 13 months ago by nickm

Keywords: 034-backport added
Milestone: Tor: 0.3.5.x-final

Hm. Did previous versions have this problem? Which is the first version that behaves this way for you?

Are you able to try running Tor with some kind of memory debugging tool like tcmalloc to see where the RAM is going?

comment:5 Changed 13 months ago by traumschule

Keywords: regression? memleak tor-relay added

comment:6 Changed 13 months ago by anong

This first happened after the upgrade to 0.3.4.8; it never happened before. The process just crashed again:

Sep 23 21:03:40 DietPi kernel: [<80254dc4>] (populate_vma_page_range) from [<80254e7c>] (__mm_populate+0xb0/0x148)
Sep 23 21:03:40 DietPi kernel: [<80254e7c>] (__mm_populate) from [<802429cc>] (vm_mmap_pgoff+0xc4/0xc8)
Sep 23 21:03:40 DietPi kernel: [<802429cc>] (vm_mmap_pgoff) from [<8025d3d0>] (SyS_mmap_pgoff+0x58/0xd0)
Sep 23 21:03:40 DietPi kernel: [<8025d3d0>] (SyS_mmap_pgoff) from [<80108060>] (ret_fast_syscall+0x0/0x28)
Sep 23 21:03:40 DietPi kernel: Mem-Info:
Sep 23 21:03:40 DietPi kernel: active_anon:0 inactive_anon:0 isolated_anon:0
                                active_file:10 inactive_file:25 isolated_file:0
                                unevictable:180523 dirty:16 writeback:0 unstable:0
                                slab_reclaimable:2202 slab_unreclaimable:6519
                                mapped:20297 shmem:0 pagetables:444 bounce:0
                                free:4014 free_pcp:200 free_cma:0
Sep 23 21:03:40 DietPi kernel: Node 0 active_anon:0kB inactive_anon:0kB active_file:40kB inactive_file:100kB unevictable:722092kB isolated(anon):0kB isolated(file):0kB mapped:81188kB dirty:64kB writeback:0kB shmem:0kB writeback_tmp:0kB
Sep 23 21:03:40 DietPi kernel: Normal free:16056kB min:16384kB low:20480kB high:24576kB active_anon:0kB inactive_anon:0kB active_file:160kB inactive_file:44kB unevictable:722096kB writepending:64kB present:1021952kB managed:1000180kB ml
Sep 23 21:03:40 DietPi kernel: lowmem_reserve[]: 0 0
Sep 23 21:03:40 DietPi kernel: Normal: 4*4kB (UM) 382*8kB (UME) 365*16kB (UE) 110*32kB (UME) 55*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15952kB
Sep 23 21:03:40 DietPi kernel: 20347 total pagecache pages
Sep 23 21:03:40 DietPi kernel: 0 pages in swap cache
Sep 23 21:03:40 DietPi kernel: Swap cache stats: add 22160, delete 22160, find 2613/5056
Sep 23 21:03:40 DietPi kernel: Free swap  = 4159740kB
Sep 23 21:03:40 DietPi kernel: Total swap = 4194300kB
Sep 23 21:03:40 DietPi kernel: 255488 pages RAM
Sep 23 21:03:40 DietPi kernel: 0 pages HighMem/MovableOnly
Sep 23 21:03:40 DietPi kernel: 5443 pages reserved
Sep 23 21:03:40 DietPi kernel: 2048 pages cma reserved
Sep 23 21:03:40 DietPi kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Sep 23 21:03:40 DietPi kernel: [   84]     0    84     7373      446      11       0      100             0 systemd-journal
Sep 23 21:03:40 DietPi kernel: [  142]     0   142     3492      356       8       0      140         -1000 systemd-udevd
Sep 23 21:03:40 DietPi kernel: [  276]     0   276      616      293       6       0       35             0 dropbear
Sep 23 21:03:40 DietPi kernel: [  814]     0   814     1049      258       5       0       35             0 agetty
Sep 23 21:03:40 DietPi kernel: [ 9666]     0  9666     1324      434       7       0       61             0 cron
Sep 23 21:03:40 DietPi kernel: [10659]  1001 10659   182060   180276     367       0        0             0 tor
Sep 23 21:03:40 DietPi kernel: [17067]     0 17067      733      371       6       0       53             0 dropbear
Sep 23 21:03:40 DietPi kernel: [17068]     0 17068     1678      338       7       0      476             0 bash
Sep 23 21:03:40 DietPi kernel: [17624]     0 17624      733      371       6       0       47             0 dropbear
Sep 23 21:03:40 DietPi kernel: [17625]     0 17625     1678      362       8       0      476             0 bash
Sep 23 21:03:40 DietPi kernel: Out of memory: Kill process 10659 (tor) score 139 or sacrifice child
Sep 23 21:03:40 DietPi kernel: Killed process 10659 (tor) total-vm:728240kB, anon-rss:640012kB, file-rss:81092kB, shmem-rss:0kB

I am not familiar with memory debugging / tcmalloc. I have downgraded to 0.3.3.9 now.

Last edited 13 months ago by anong (previous) (diff)

comment:7 Changed 13 months ago by r1610091651@…

Similar story here.

After the upgrade from 3.3.7 -> 3.4.8, I've noticed that memory usage is erratic. Please consult the image below, showing memory available in green and memory used in blue on a system dedicated to Tor.

https://www.dropbox.com/s/5mshoftrnpslima/mem-month.png

The upgrade to 3.4.8 took place on Saturday in week 37. The connection count / load was relatively stable:

https://www.dropbox.com/s/ojq6dhwi0hcdis0/torconns-month.png
(green inbound, blue outbound)
I was wondering if there were any changes from 3.3.7 to 3.4.8 which might impact memory behaviour on a middle relay.
I'm also wondering what the "source" of this usage is.

Potentially relevant config entries (the remainder is ports/IPs, accounting, and exit policy (non-exit)):
MaxConsensusAgeForDiffs 4 hours
DirCache 0
MaxMemInQueues 64 MB (but "defaulting" to 256 MB now)

Os: Ubuntu 16.04.5 LTS

Adding version number:
Sep 29 06:25:01.000 [notice] Tor 0.3.4.8 (git-5da0e95e4871a0a1) opening new log file.

Last edited 13 months ago by r1610091651@… (previous) (diff)

comment:8 Changed 13 months ago by nickm

We really need somebody who can run Tor as a middle relay using tcmalloc or a similar memory usage profiling tool. That would have the best chance to tell us what accounts for the increased memory usage.

comment:9 Changed 13 months ago by dgoulet

I've been running 0.3.4.8 for 2 days now and there is still no sign of a memory leak whatsoever. The relay is pushing 23 MB/s and using ~370 MB of RAM, which is normal. On Ubuntu Xenial here.

Sep 24 15:37:08.385 [notice] Tor 0.3.4.8 (git-da95b91355248ad8) running on Linux with Libevent 2.0.21-stable, OpenSSL 1.0.2g, Zlib 1.2.8, Liblzma 5.1.0alpha, and Libzstd N/A.

comment:10 Changed 13 months ago by traumschule

It would be interesting to see the version line from the reporters as well, just to be sure this is not caused by an older or newer version of any of the libraries. For example, to compare the versions 0.3.3.9-1~d90.stretch+1 and 0.3.4.8-1~d90.stretch+1 mentioned in the reddit post.

Last edited 13 months ago by traumschule (previous) (diff)

comment:11 Changed 13 months ago by anong

This was the version line from 0.3.4.8 where I experienced the issues:

Tor 0.3.4.8 (git-5da0e95e4871a0a1) running on Linux with Libevent 2.0.21-stable, OpenSSL 1.1.0f, Zlib 1.2.8, Liblzma 5.2.2, and Libzstd 1.1.2.

I have downgraded to 0.3.3.9 now which is running fine:

Tor 0.3.3.9 (git-ca1a436fa8e53a32) running on Linux with Libevent 2.0.21-stable, OpenSSL 1.1.0f, Zlib 1.2.8, Liblzma 5.2.2, and Libzstd 1.1.2.

comment:12 Changed 13 months ago by mem

It happens here as well (Tor relay configured as non-exit).

It takes about 8 hours to eat up 4 GB of memory, then it starts eating swap; once swap is exhausted, the process gets killed (oom-killer) and restarted by systemd.

It happens pretty reliably (every day since upgrading from 0.3.3.x).

It is only observed on Ubuntu 18.04 (not on Debian 9 or FreeBSD, which are also running 0.3.4.8).

The system is also running nyx.

I cannot help with debugging since I have no access to the affected box (I only saw the monitoring and logs).

We really need somebody who can run Tor as a middle relay using tcmalloc or a similar memory usage profiling tool. That would have the best chance to tell us what accounts for the increased memory usage.

can you write down the steps to do so, so affected people can collect the necessary information?
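
(For anyone who wants to try this, a rough sketch using gperftools' tcmalloc heap profiler; package names and the library path are assumptions that vary by distribution, and running a relay this way is noticeably slower:)

$ sudo apt install google-perftools libgoogle-perftools-dev
# stop the packaged service first, then start tor with the profiler preloaded;
# heap dumps land in /tmp/tor-heap.NNNN.heap as allocations grow
$ sudo -u debian-tor env \
      LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
      HEAPPROFILE=/tmp/tor-heap \
      /usr/bin/tor -f /etc/tor/torrc --RunAsDaemon 0
# afterwards, show which call sites hold the most memory
$ google-pprof --text /usr/bin/tor /tmp/tor-heap.0042.heap | head -n 20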

comment:13 Changed 13 months ago by nickm

Keywords: 035-must added

Tentatively assign the 035-must tag here, but we will have to defer this if we can't figure it out.

comment:14 Changed 13 months ago by nickm

Priority: High → Very High

Mark all 035-must tickets as "very high"

comment:15 Changed 12 months ago by torkit

I can also confirm the problem on Ubuntu 18.04, installed from deb.torproject.org:

Tor 0.3.4.8 (git-5da0e95e4871a0a1) running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.0g, Zlib 1.2.11, Liblzma 5.2.2, and Libzstd 1.3.3.

The system is on auto-update; the problems started on Oct 5th, 10 pm UTC. There do not seem to have been any updates during that time period.

The leak fills up memory roughly every 4 hours, at which point the process gets restarted. The relay is not an exit node, but it is a fallback directory. I'm happy to help investigate.

Please contact me via the address provided for the relay: https://metrics.torproject.org/rs.html#details/B291D30517D23299AD7CEE3E60DFE60D0E3A4664

comment:16 Changed 12 months ago by cypherbits

The same is happening to me on an Ubuntu 14.04 VPS. I start the service and within a few hours it is down with the same errors as in the first post.

Tor version: 0.3.4.8, Kernel version: 3.13.0-160-generic, OpenSSL 1.1.0h

Last edited 12 months ago by cypherbits (previous) (diff)

comment:17 Changed 12 months ago by wrmsr

Happens to me on 2 nodes, running Tor 0.3.4.8 and Linux 4.18.12.a-1-hardened: https://imgur.com/a/h4bzyhy

Version: Tor 0.3.4.8 (git-da95b91355248ad8) running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.1, Zlib 1.2.11, Liblzma 5.2.4, and Libzstd 1.3.5.

However, on a 3rd node, running the same Tor version but an older kernel (4.16.13-2-hardened), this does NOT happen.

What's odd about this memleak is that the memory usage "comes out of nowhere": in top it does not show up as VIRT, RES or even SHR. Only by restarting services, with some luck, did I find out that Tor was causing the leak. When the OOM killer was summoned, it also seemed unsure about the culprit, because it first killed Redis, PostgreSQL and Dovecot, and only then Tor, at which point the memory was freed again.

Last edited 12 months ago by wrmsr (previous) (diff)

comment:18 Changed 12 months ago by wrmsr

It seems that Tor doesn't like Valgrind's memory profiler:

Tor[22094]: connection_dir_finished_flushing(): Bug: Emptied a dirserv buffer, but it's still spooling! (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: connection_mark_for_close_internal_(): Bug: Duplicate call to connection_mark_for_close at src/or/directory.c:5201 (first at src/or/main.c:1210) (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: tor_bug_occurred_(): Bug: src/or/connection.c:841: connection_mark_for_close_internal_: This line should not have been reached. (Future instances of this warning will be silenced.) (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug: Line unexpectedly reached at connection_mark_for_close_internal_ at src/or/connection.c:841. Stack trace: (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/bin/tor(log_backtrace+0x45) [0x2970b5] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/bin/tor(tor_bug_occurred_+0xbc) [0x2b26cc] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/bin/tor(connection_dir_finished_flushing+0xad) [0x24a33d] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/bin/tor(connection_handle_read+0xb00) [0x21f5b0] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/bin/tor(+0x4fbff) [0x157bff] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/lib/libevent-2.1.so.6(+0x220d8) [0x4c120d8] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/lib/libevent-2.1.so.6(event_base_loop+0x53f) [0x4c12b1f] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/bin/tor(do_main_loop+0x225) [0x15a0b5] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/bin/tor(tor_run_main+0x1125) [0x15c935] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/bin/tor(tor_main+0x3b) [0x15440b] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/bin/tor(main+0x1a) [0x15418a] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/lib/libc.so.6(__libc_start_main+0xf3) [0x5b81223] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Bug:     /usr/bin/tor(_start+0x2e) [0x1541ee] (on Tor 0.3.4.8 da95b91355248ad8)
Tor[22094]: Failing because we have 991 connections already. Please read doc/TUNING for guidance.

comment:19 Changed 12 months ago by Christian

Interesting. I stumbled across this report via tor-talk and appear to be experiencing a similar issue. Non-exit node running on openSUSE Tumbleweed in a VM with 1 GB RAM. Upgraded from tor-0.3.3.9 to tor-0.3.4.8 (git-da95b91355248ad8) on September 16. Ever since, tor gets killed twice a day by the OOM killer:

 OOMs  month
    2  2018-02
   23  2018-09
   26  2018-10

I can't really put more memory into the machine, so I tried to use zram and now zswap to save some memory, but the OOM killer still thinks tor is using too much memory. Yesterday's OOM kill happened at 15:27 local time, so let's look at the memory usage around that time:

$ sar -f /var/log/sa/sa20181015 -s 15:25 -e 15:30 -r -S
Linux 4.18.12-1-default               10/15/2018      _x86_64_        (2 CPU)

03:25:01 PM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
03:26:01 PM     95876     21496    797884     79.11      6604     37264    380396     29.94     88728     87552         0
03:27:21 PM    865980    781032     53680      5.32      3748     21556    194388     15.30      9528     17088        36
03:28:01 PM    529876    560976    205424     20.37      5564    207108    333216     26.22    180396    184756     45472
03:29:01 PM    520572    560204    206152     20.44      6936    213964    333596     26.25    187636    187700       124
Average:       503076    480927    315785     31.31      5713    119973    310399     24.43    116572    119274     11408

03:25:01 PM kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
03:26:01 PM     23112    239028     91.18       688      0.29
03:27:21 PM    203400     58740     22.41      1932      3.29
03:28:01 PM    204168     57972     22.11      2048      3.53
03:29:01 PM    205192     56948     21.72      2848      5.00
Average:       158968    103172     39.36      1879      1.82

(Yes, I'm running sadc every minute now because of this)

So, I assume %memused went from 79.11% to 100% in a few seconds until tor was killed at 15:27:19, according to syslog. Swap usage (%swpused) was dangerously high at 91.18% -- however, I only switched to zswap a few days ago and used zram earlier, where %swpused would always stay below 60%; maybe tor was allocating memory so quickly that zram could not keep up.

Tuned two more settings, but to no avail so far:

$ sysctl vm.swappiness vm.min_free_kbytes
vm.swappiness = 100
vm.min_free_kbytes = 65536

Are there any tweaks besides MaxMemInQueues to limit Tor's memory usage?

Last edited 12 months ago by Christian (previous) (diff)
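
(For what it's worth, and not something confirmed anywhere in this thread as a mitigation: torrc also has ConstrainedSockets / ConstrainedSockSize, which ask the kernel to shrink per-socket send/receive buffers. The values below are the documented defaults and purely illustrative:)

ConstrainedSockets  1
ConstrainedSockSize 8192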

comment:20 in reply to:  19 ; Changed 12 months ago by arma

Replying to Christian:

Interesting. I stumbled across this report via tor-talk and appear to experience a similar issue. Non-exit node running on openSUSE Tumbleweed in VM with 1 GB RAM. Upgraded from tor-0.3.3.9 to tor-0.3.4.8 (git-da95b91355248ad8) on September 16.

What's in your torrc? Can you simplify it to the point that it's super simple, yet you still experience this problem?

comment:21 in reply to:  18 Changed 12 months ago by arma

Replying to wrmsr:

It seems that Tor doesn't like Valgrind's memory profiler:

Your paste here looks ok. So maybe it doesn't work, but your output doesn't indicate that. :)

Tor[22094]: connection_dir_finished_flushing(): Bug: Emptied a dirserv buffer, but it's still spooling! (on Tor 0.3.4.8 da95b91355248ad8)

Looks like #26896 maybe.

Tor[22094]: Failing because we have 991 connections already. Please read doc/TUNING for guidance.

This part is important, and you need to raise your ulimit -n or Tor won't do well as a relay.
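
(A sketch of one way to raise the limit; the unit name is an assumption and depends on the packaging, e.g. tor.service or tor@default.service:)

$ sudo systemctl edit tor@default.service
    # add these lines in the drop-in that opens:
    [Service]
    LimitNOFILE=65535
$ sudo systemctl restart tor@default.service

# or, for a tor started by hand from a shell:
$ ulimit -n 65535
$ tor -f /etc/tor/torrc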

comment:22 Changed 12 months ago by rm

Is there any way to subscribe to tickets without commenting on them? I wanted to subscribe to this one anyway.

comment:23 Changed 12 months ago by r1610091651@…

Some additional info.

I downgraded to 3.3.7 this weekend, and the relay is stable again. I noticed the following, though:

https://www.dropbox.com/s/44hmbd4j3hkip5s/throughput-week.png

After the upgrade from 3.3.7 -> 3.4.8 I noticed a drop in throughput with a similar connection count & weight. Now, with the downgrade done, I see an increase in throughput again.

Could it be that stream compression was added somewhere between 3.3.7 and 3.4.8? That could explain the drop in network usage, and, if the resources used for that compression aren't released properly, also the memory leak, all with a similar connection count & weight.

Regards

comment:24 Changed 12 months ago by dgoulet

I'm 99.9% certain the issue is caused by two things: primarily #28089, with #28090 making it partly worse.

comment:25 Changed 12 months ago by rm

Could be! This is the only relay where I had RelayBandwidthRate limiting in place:

# free
             total       used       free     shared    buffers     cached
Mem:      16327228   16101720     225508      71524      82848    1467484
-/+ buffers/cache:   14551388    1775840
Swap:      1048572          0    1048572
# pkill -u debian-tor
# free
             total       used       free     shared    buffers     cached
Mem:      16327228    4209652   12117576      71516      82856    1472024
-/+ buffers/cache:    2654772   13672456
Swap:      1048572          0    1048572

Three other relays with no bandwidth limit in place are not affected.

comment:26 Changed 12 months ago by dgoulet

Status: new → needs_information

So far, testing shows that #28089 was the cause of this memory leak on the kernel side. We'll keep this open for a while to monitor the situation after 0.3.4.9 is released, but it is looking good!

comment:27 in reply to:  20 ; Changed 12 months ago by Christian

Replying to arma:

What's in your torrc? Can you simplify it to the point that it's super simple, yet you still experience this problem?

I haven't touched torrc in a long time, except now to add and tweak MaxMemInQueues:

$ grep ^[A-Z] /etc/tor/torrc
SOCKSPort                 0
Log                       notice file /var/log/tor/notices.log
RunAsDaemon               0             # SystemD needs this to be 0
DataDirectory             /var/lib/tor
ControlPort               9051
HashedControlPassword     xxxx
ORPort                    9001
Nickname                  foobar
RelayBandwidthRate        1024 KBytes
RelayBandwidthBurst       1536 KBytes
AccountingStart           week 1 00:00
ContactInfo               XXX
DirPort                   9030
MyFamily                  $xxxx, $xxxx
ExitPolicy                reject *:*
ExitRelay                 0
DisableDebuggerAttachment 0
MaxMemInQueues            256 MB

I haven't had an OOM kill in two days now, and I like to imagine that switching from zram to zswap magically helped, maybe. I tried running tor under Valgrind, but I don't know if I did it right, and it didn't report any leaks:

==1697== Command: /usr/bin/tor -f /etc/tor/torrc --pidfile /var/run/tor/tor.pid --log notice\ file\ /var/log/tor/tor.log --runasdaemon 1 --datadirectory /var/lib/tor --user tor
[...]
==1699== LEAK SUMMARY:
==1699==    definitely lost: 0 bytes in 0 blocks
==1699==    indirectly lost: 0 bytes in 0 blocks
==1699==      possibly lost: 864 bytes in 3 blocks
==1699==    still reachable: 34,118 bytes in 206 blocks
==1699==         suppressed: 0 bytes in 0 blocks
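
(For reference, a minimal way to run such a leak check; these are standard Valgrind options, and note that Valgrind only sees user-space allocations, which is consistent with the clean summary above given that the memory in this ticket was queued kernel-side. If the torrc enables the Sandbox option, it would need to be off:)

# --RunAsDaemon 0 keeps tor in the foreground so Valgrind stays attached
$ sudo -u tor valgrind --leak-check=full --show-leak-kinds=definite \
      --log-file=/tmp/tor-valgrind.%p.log \
      /usr/bin/tor -f /etc/tor/torrc --RunAsDaemon 0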

comment:28 in reply to:  27 Changed 12 months ago by arma

Replying to Christian:

I haven't had a OOM-killing in two days now and I like to imagine that switching from zram to zswap magically helped, maybe.

As dgoulet mentioned above, we believe we have found and fixed the bug: #28089.

That fix is in 0.3.5.3-alpha, and it will be in 0.3.4.9 when it comes out.

comment:29 Changed 12 months ago by starlight

Reading this ticket and #28089 a potential concern comes to mind. In my experience the Linux kernel is fastidious in releasing skb memory when sockets close, which leads to the question: is an additional bug present in KIST where sockets are leaked? For kernel memory consumption to grow unbounded while the relay daemon continues to function correctly, a socket leak or stranding would appear necessary. I do not know this code and someone familiar with it would be in a better position to consider the possibility.

comment:30 in reply to:  29 ; Changed 12 months ago by dgoulet

Replying to starlight:

Reading this ticket and #28089 a potential concern comes to mind. In my experience the Linux kernel is fastidious in releasing skb memory when sockets close, which leads to the question: is an additional bug present in KIST where sockets are leaked? For kernel memory consumption to grow unbounded while the relay daemon continues to function correctly, a socket leak or stranding would appear necessary. I do not know this code and someone familiar with it would be in a better position to consider the possibility.

KIST doesn't manage sockets. It is a scheduler that only relays cells, based on per-socket kernel information, while trying to be fair by considering all circuits.

The issue here is clearly that one side of the connection (relay to relay) wasn't reading from the socket, which caused the TCP window to shrink to 0 and thus the other side to queue cells in the kernel. That explains why nobody sees the memory leak in userspace by looking at tor's memory.

In theory KIST should prevent that, but not entirely. Some data can always end up stranded, as when flushing 1M of cells and only one goes through, leaving the rest in the kernel.

The bigger issue is that we still have cells bypassing the KIST scheduler and being written directly to the socket. We have several tickets open about these issues.
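
(A quick way to observe this kernel-side queuing on an affected relay; these are standard Linux tools, and the exact output layout varies by version:)

# total kernel TCP buffer memory; the "mem" field of the TCP line is in pages
$ cat /proc/net/sockstat
# per-connection Recv-Q / Send-Q plus per-socket skbuff memory (-m) for tor
$ sudo ss -tnmp | grep -A1 '"tor"'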

comment:31 in reply to:  30 Changed 12 months ago by starlight

Replying to dgoulet:

. . .
In theory KIST should prevent that, but not entirely. Some data can always end up stranded, as when flushing 1M of cells and only one goes through, leaving the rest in the kernel.

The bigger issue is that we still have cells bypassing the KIST scheduler and being written directly to the socket. We have several tickets open about these issues.

Thank you for the update, the details are informative and interesting.

After posting I realized that stuck sockets will hide an EOF condition: when the other side closes, the EOF is not detected, resulting in a socket leak, though one cured by the fix. A more careful read turns up that performance degradation did result.

Newer kernels account kernel SKBs to the owning process for correct cgroup resource control, and the graphs above show the leaked memory as "app" memory.

Changed 12 months ago by torkit

Attachment: tor-sockstats.png added

tor sockstats with proposed patch applied

comment:32 Changed 12 months ago by torkit

I've tested the proposed patch that has been merged into the maint-0.3.4 branch and I see the same behavior as before: tor keeps getting killed. I'll check 0.3.3 for a regression now.

https://trac.torproject.org/projects/tor/attachment/ticket/27813/tor-sockstats.png

Last edited 12 months ago by torkit (previous) (diff)

comment:33 in reply to:  32 ; Changed 12 months ago by dgoulet

Replying to torkit:

I've tested the proposed patch that has been merged into maint-0.3.4 branch and do have the same behavior as before. tor keeps getting killed. I'll check 0.3.3 for regression now.

So in theory, your relay should no longer have a Recv-Q that keeps filling up ... but on the Send-Q side, because the rest of the network running <= 0.3.4.8 still has that issue, it will keep filling up.

Until we roll out 0.3.4.9 and everyone updates, it won't fully go away :S.

comment:34 in reply to:  33 Changed 12 months ago by rm

Replying to dgoulet:

but on the Send-Q side, because the rest of the network running <= 0.3.4.8 still has that issue, it will keep filling up.

Until we roll out 0.3.4.9 and everyone updates, it won't fully go away :S.

What? You are saying remote peers can do something (no matter what) that causes you to leak memory and eventually get OOM-killed? I.e., basically Tor has a remote DoS vulnerability? And your solution is to kindly ask everyone to stop doing that (by upgrading versions)? You can't be serious.

comment:35 Changed 12 months ago by torkit

So, there are two problems on Ubuntu. One seems to have been the one this bug describes, while the other is pretty mundane: the tor.service file installed by the distribution does not properly recognize the started tor instance and restarts it after the default timeout of 300 seconds. Which is weird, because the old tor.service file worked properly with the binary from the debian tor package, and I only replaced the tor binary with the self-built one from the maintenance branches.

I've replaced it with a better-suited tor.service file, and maint-0.3.3 has worked fine for the past 12 hours. I will recheck the maint-0.3.4 branch now.
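
(One guess about the mundane half, not confirmed here: the packaged unit uses Type=notify, which relies on tor being built with --enable-systemd so that it sends READY=1; a self-built binary without systemd support never notifies, so systemd assumes the start failed and restarts the service after the start timeout. A minimal drop-in to sidestep that could be:)

$ sudo systemctl edit tor@default.service
    # in the drop-in that opens:
    [Service]
    Type=simple
    NotifyAccess=none
$ sudo systemctl restart tor@default.service

Rebuilding with ./configure --enable-systemd would be the cleaner fix.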

comment:36 Changed 12 months ago by torkit

I can confirm the patch works - the node is stable on maint-0.3.4, although it does not seem to properly use all available cores. But that's a different story.

comment:37 Changed 12 months ago by nickm

Resolution: fixed
Status: needs_information → closed

Tor 0.3.4.9 ships today, with this bugfix included.
