Opened 9 months ago

Closed 8 weeks ago

Last modified 7 weeks ago

#22255 closed defect (fixed)

Frequent OOM kills of tor process

Reported by: DeS
Owned by:
Priority: High
Milestone: Tor: 0.3.0.x-final
Component: Core Tor/Tor
Version:
Severity: Major
Keywords:
Cc: starlight@…
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

We are currently experiencing frequent out-of-memory kills of our tor processes on two boxes running Ubuntu 14.04.

root@tor1:~# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.5 LTS
Release:	14.04
Codename:	trusty

We experience this on following versions:

0.3.0.5-rc-1~trusty+1 and 0.2.9.10-1~trusty+1

I changed to the rc after the normal 0.2.9.10-1 version kept having these problems. The 0.3.0.5-rc has the same problems.

From syslog

May 14 10:21:42 tor1 kernel: [123517.565688] UDP: bad checksum. From 82.247.214.125:51413 to 176.10.104.240:21446 ulen 38
May 14 11:02:47 tor1 kernel: [125982.993268] tor invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
May 14 11:02:47 tor1 kernel: [125982.993270] tor cpuset=/ mems_allowed=0
May 14 11:02:47 tor1 kernel: [125982.993272] CPU: 2 PID: 33180 Comm: tor Not tainted 3.13.0-117-generic #164-Ubuntu
May 14 11:02:47 tor1 kernel: [125982.993273] Hardware name: HP ProLiant ML310e Gen8 v2, BIOS P78 03/28/2014
May 14 11:02:47 tor1 kernel: [125982.993274]  0000000000000000 ffff8800a8679990 ffffffff8172d1c9 ffff88013c868000
May 14 11:02:47 tor1 kernel: [125982.993276]  0000000000000000 ffff8800a8679a18 ffffffff81727768 ffffffff8106a926
May 14 11:02:47 tor1 kernel: [125982.993278]  ffff8800a86799f0 ffffffff810cb27c 0000000001320122 0000000000000000
May 14 11:02:47 tor1 kernel: [125982.993279] Call Trace:
May 14 11:02:47 tor1 kernel: [125982.993285]  [<ffffffff8172d1c9>] dump_stack+0x64/0x82
May 14 11:02:47 tor1 kernel: [125982.993286]  [<ffffffff81727768>] dump_header+0x7f/0x1f1
May 14 11:02:47 tor1 kernel: [125982.993289]  [<ffffffff8106a926>] ? put_online_cpus+0x56/0x80
May 14 11:02:47 tor1 kernel: [125982.993291]  [<ffffffff810cb27c>] ? rcu_oom_notify+0xcc/0xf0
May 14 11:02:47 tor1 kernel: [125982.993294]  [<ffffffff81156d81>] oom_kill_process+0x201/0x360
May 14 11:02:47 tor1 kernel: [125982.993297]  [<ffffffff812dde35>] ? security_capable_noaudit+0x15/0x20
May 14 11:02:47 tor1 kernel: [125982.993299]  [<ffffffff81157511>] out_of_memory+0x471/0x4b0
May 14 11:02:47 tor1 kernel: [125982.993301]  [<ffffffff8115d82c>] __alloc_pages_nodemask+0xa6c/0xb90
May 14 11:02:47 tor1 kernel: [125982.993304]  [<ffffffff8119e25a>] alloc_pages_vma+0x9a/0x140
May 14 11:02:47 tor1 kernel: [125982.993307]  [<ffffffff811908eb>] read_swap_cache_async+0xeb/0x160
May 14 11:02:47 tor1 kernel: [125982.993308]  [<ffffffff811909f8>] swapin_readahead+0x98/0xe0
May 14 11:02:47 tor1 kernel: [125982.993320]  [<ffffffff8117e90b>] handle_mm_fault+0xa7b/0xf10
May 14 11:02:47 tor1 kernel: [125982.993322]  [<ffffffff81739344>] __do_page_fault+0x184/0x560
May 14 11:02:47 tor1 kernel: [125982.993325]  [<ffffffff810a3af5>] ? set_next_entity+0x95/0xb0
May 14 11:02:47 tor1 kernel: [125982.993328]  [<ffffffff810135db>] ? __switch_to+0x16b/0x4f0
May 14 11:02:47 tor1 kernel: [125982.993329]  [<ffffffff8173973a>] do_page_fault+0x1a/0x70
May 14 11:02:47 tor1 kernel: [125982.993331]  [<ffffffff81735a68>] page_fault+0x28/0x30
May 14 11:02:47 tor1 kernel: [125982.993350] Mem-Info:
May 14 11:02:47 tor1 kernel: [125982.993351] Node 0 DMA per-cpu:
May 14 11:02:47 tor1 kernel: [125982.993352] CPU    0: hi:    0, btch:   1 usd:   0
May 14 11:02:47 tor1 kernel: [125982.993353] CPU    1: hi:    0, btch:   1 usd:   0
May 14 11:02:47 tor1 kernel: [125982.993353] CPU    2: hi:    0, btch:   1 usd:   0
May 14 11:02:47 tor1 kernel: [125982.993354] CPU    3: hi:    0, btch:   1 usd:   0
May 14 11:02:47 tor1 kernel: [125982.993355] Node 0 DMA32 per-cpu:
May 14 11:02:47 tor1 kernel: [125982.993356] CPU    0: hi:  186, btch:  31 usd:  25
May 14 11:02:47 tor1 kernel: [125982.993356] CPU    1: hi:  186, btch:  31 usd:   2
May 14 11:02:47 tor1 kernel: [125982.993357] CPU    2: hi:  186, btch:  31 usd:   0
May 14 11:02:47 tor1 kernel: [125982.993358] CPU    3: hi:  186, btch:  31 usd:  26
May 14 11:02:47 tor1 kernel: [125982.993358] Node 0 Normal per-cpu:
May 14 11:02:47 tor1 kernel: [125982.993359] CPU    0: hi:  186, btch:  31 usd:   0
May 14 11:02:47 tor1 kernel: [125982.993360] CPU    1: hi:  186, btch:  31 usd:   0
May 14 11:02:47 tor1 kernel: [125982.993360] CPU    2: hi:  186, btch:  31 usd:   0
May 14 11:02:47 tor1 kernel: [125982.993361] CPU    3: hi:  186, btch:  31 usd:   0
May 14 11:02:47 tor1 kernel: [125982.993363] active_anon:0 inactive_anon:0 isolated_anon:0
May 14 11:02:47 tor1 kernel: [125982.993363]  active_file:141 inactive_file:204 isolated_file:0
May 14 11:02:47 tor1 kernel: [125982.993363]  unevictable:0 dirty:0 writeback:0 unstable:0
May 14 11:02:47 tor1 kernel: [125982.993363]  free:20517 slab_reclaimable:7176 slab_unreclaimable:41948
May 14 11:02:47 tor1 kernel: [125982.993363]  mapped:80 shmem:0 pagetables:692 bounce:0
May 14 11:02:47 tor1 kernel: [125982.993363]  free_cma:0
May 14 11:02:47 tor1 kernel: [125982.993365] Node 0 DMA free:15344kB min:268kB low:332kB high:400kB active_anon:0kB inactive_anon:0kB active_file:4kB inactive_file:28kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15892kB mlocked:0kB dirty:0kB writeback:8kB mapped:36kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:156kB kernel_stack:0kB pagetables:4kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:18 all_unreclaimable? no
May 14 11:02:47 tor1 kernel: [125982.993368] lowmem_reserve[]: 0 2688 3771 3771
May 14 11:02:47 tor1 kernel: [125982.993370] Node 0 DMA32 free:51080kB min:46532kB low:58164kB high:69796kB active_anon:0kB inactive_anon:0kB active_file:516kB inactive_file:788kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2832272kB managed:2753204kB mlocked:0kB dirty:0kB writeback:0kB mapped:284kB shmem:0kB slab_reclaimable:15612kB slab_unreclaimable:83432kB kernel_stack:312kB pagetables:2064kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:17 all_unreclaimable? no
May 14 11:02:47 tor1 kernel: [125982.993372] lowmem_reserve[]: 0 0 1082 1082
May 14 11:02:47 tor1 kernel: [125982.993373] Node 0 Normal free:15644kB min:18732kB low:23412kB high:28096kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1179644kB managed:1108484kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:13092kB slab_unreclaimable:84204kB kernel_stack:2072kB pagetables:700kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:203 all_unreclaimable? yes
May 14 11:02:47 tor1 kernel: [125982.993376] lowmem_reserve[]: 0 0 0 0
May 14 11:02:47 tor1 kernel: [125982.993377] Node 0 DMA: 3*4kB (M) 3*8kB (UM) 2*16kB (UM) 0*32kB 1*64kB (U) 1*128kB (M) 1*256kB (M) 1*512kB (M) 2*1024kB (UM) 2*2048kB (MR) 2*4096kB (M) = 15364kB
May 14 11:02:47 tor1 kernel: [125982.993384] Node 0 DMA32: 303*4kB (UE) 662*8kB (UEM) 418*16kB (UEM) 952*32kB (UEM) 51*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 51148kB
May 14 11:02:47 tor1 kernel: [125982.993394] Node 0 Normal: 375*4kB (UEM) 330*8kB (UEM) 649*16kB (UEM) 3*32kB (M) 2*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 0*1024kB 0*2048kB 0*4096kB = 15644kB
May 14 11:02:47 tor1 kernel: [125982.993401] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 14 11:02:47 tor1 kernel: [125982.993401] 385 total pagecache pages
May 14 11:02:47 tor1 kernel: [125982.993402] 20 pages in swap cache
May 14 11:02:47 tor1 kernel: [125982.993403] Swap cache stats: add 2432647, delete 2432627, find 700250/1142133
May 14 11:02:47 tor1 kernel: [125982.993404] Free swap  = 3627940kB
May 14 11:02:47 tor1 kernel: [125982.993404] Total swap = 4026364kB

May 14 11:02:47 tor1 kernel: [125982.993405] 1006973 pages RAM
May 14 11:02:47 tor1 kernel: [125982.993405] 0 pages HighMem/MovableOnly
May 14 11:02:47 tor1 kernel: [125982.993406] 17790 pages reserved
May 14 11:02:47 tor1 kernel: [125982.993406] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
May 14 11:02:47 tor1 kernel: [125982.993422] [32898]     0 32898     5073        0      13      251             0 upstart-udev-br
May 14 11:02:47 tor1 kernel: [125982.993423] [32903]     0 32903    12900        1      27      228         -1000 systemd-udevd
May 14 11:02:47 tor1 kernel: [125982.993425] [33049]     0 33049     3821        0      11       48             0 upstart-file-br
May 14 11:02:47 tor1 kernel: [125982.993426] [33092]   102 33092     9813        1      23       95             0 dbus-daemon
May 14 11:02:47 tor1 kernel: [125982.993436] [33096]   101 33096    63962        0      28      722             0 rsyslogd
May 14 11:02:47 tor1 kernel: [125982.993437] [33119]     0 33119    10864        1      26       88             0 systemd-logind
May 14 11:02:47 tor1 kernel: [125982.993439] [33206]     0 33206     3817        0      12       55             0 upstart-socket-
May 14 11:02:47 tor1 kernel: [125982.993440] [33053]     0 33053     3699        1      12       41             0 getty
May 14 11:02:47 tor1 kernel: [125982.993441] [33056]     0 33056     3699        1      12       41             0 getty
May 14 11:02:47 tor1 kernel: [125982.993442] [33061]     0 33061     3699        1      12       39             0 getty
May 14 11:02:47 tor1 kernel: [125982.993443] [33062]     0 33062     3699        1      11       40             0 getty
May 14 11:02:47 tor1 kernel: [125982.993444] [33064]     0 33064     3699        1      12       39             0 getty
May 14 11:02:47 tor1 kernel: [125982.993446] [33079]     0 33079     1094        0       8       35             0 acpid
May 14 11:02:47 tor1 kernel: [125982.993447] [33089]     0 33089    15347        1      33      171         -1000 sshd
May 14 11:02:47 tor1 kernel: [125982.993448] [33094]     0 33094     5915        0      17       81             0 cron
May 14 11:02:47 tor1 kernel: [125982.993449] [33095]     0 33095     4786        0      12       48             0 atd
May 14 11:02:47 tor1 kernel: [125982.993450] [33155]     0 33155     4801       34      13       73             0 irqbalance
May 14 11:02:47 tor1 kernel: [125982.993457] [33180]   106 33180   111003        0     152    50520             0 tor
May 14 11:02:47 tor1 kernel: [125982.993458] [33317]   108 33317    20352        0      41     9936             0 unbound
May 14 11:02:47 tor1 kernel: [125982.993459] [33369]   105 33369     1871       12       9       33             0 vnstatd
May 14 11:02:47 tor1 kernel: [125982.993460] [32890]     0 32890     3699        1      11       40             0 getty
May 14 11:02:47 tor1 kernel: [125982.993461] [32991]  1002 32991     2867        0      10       84             0 memlog.sh
May 14 11:02:47 tor1 kernel: [125982.993462] [33057]   106 33057    97196        0     128    36538             0 tor
May 14 11:02:47 tor1 kernel: [125982.993464] [57323]  1002 57323     2867        0      10       91             0 memlog.sh
May 14 11:02:47 tor1 kernel: [125982.993465] [57325]  1002 57325     2218       78      10       32             0 grep
May 14 11:02:47 tor1 kernel: [125982.993466] [57326]  1002 57326     2746       44      11       46             0 sed
May 14 11:02:47 tor1 kernel: [125982.993467] Out of memory: Kill process 33180 (tor) score 25 or sacrifice child
May 14 11:02:47 tor1 kernel: [125982.993478] Killed process 33180 (tor) total-vm:444012kB, anon-rss:0kB, file-rss:0kB

notices.log does not show anything around the kill time. The first record is the restart of the process at 20:32.

May 14 06:54:03.000 [notice] Tor 0.3.0.5-rc (git-61f68662421142d2) opening new log file.
May 14 07:13:01.000 [warn] eventdns: All nameservers have failed
May 14 07:13:01.000 [notice] eventdns: Nameserver 127.0.0.1:53 is back up
May 14 07:47:36.000 [warn] eventdns: All nameservers have failed
May 14 07:47:36.000 [notice] eventdns: Nameserver 127.0.0.1:53 is back up
May 14 08:10:38.000 [warn] eventdns: All nameservers have failed
May 14 08:10:38.000 [notice] eventdns: Nameserver 127.0.0.1:53 is back up
May 14 08:12:28.000 [warn] eventdns: All nameservers have failed
May 14 08:12:28.000 [notice] eventdns: Nameserver 127.0.0.1:53 is back up
May 14 08:20:08.000 [warn] eventdns: All nameservers have failed
May 14 08:20:09.000 [notice] eventdns: Nameserver 127.0.0.1:53 is back up
May 14 20:23:18.000 [notice] Tor 0.3.0.5-rc (git-61f68662421142d2) opening log file.
May 14 20:23:18.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
May 14 20:23:18.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
May 14 20:23:18.000 [notice] Configured to measure statistics. Look for the *-stats files that will first be written to the data directory in 24 hours from now.

We have been operating the * Exit Servers for years, and they became unstable about a week ago. No changes other than the normal software updates were performed.

Other than the two tor processes, only a local unbound resolver and ssh are running.

It feels like a memory leak. Since tor is the one getting killed, I assume the problem is in the tor software. Any thoughts?

Child Tickets

Attachments (27)

mem.log (595.7 KB) - added by DeS 9 months ago.
Logfile of memory usage
valgrind.txt (11.2 KB) - added by DeS 9 months ago.
Screenshot_2017-05-26_09-56-09.png (189.7 KB) - added by DeS 9 months ago.
syslog20170528 (23.4 KB) - added by DeS 9 months ago.
SMEM20170528.log (579.4 KB) - added by DeS 9 months ago.
Screenshot_2017-05-30_20-11-49.png (119.7 KB) - added by DeS 9 months ago.
SMEM_Mo.log (1.7 MB) - added by DeS 9 months ago.
PROC_MEMINFO_Mon.log (639.2 KB) - added by DeS 9 months ago.
munin.tar.gz (338.2 KB) - added by DeS 9 months ago.
Screenshot_2017-05-30_23-15-37.png (67.4 KB) - added by DeS 9 months ago.
Screenshot_2017-05-30_23-14-02.png (40.8 KB) - added by DeS 9 months ago.
Screenshot_2017-05-31_20-43-46.png (59.3 KB) - added by DeS 9 months ago.
netstat-pinpoint=1496233206,1496341206.png (57.4 KB) - added by DeS 9 months ago.
memory-pinpoint=1496233206,1496341206.png (104.3 KB) - added by DeS 9 months ago.
netstat-pinpoint=1496233206,1496341206.2.png (57.4 KB) - added by DeS 9 months ago.
memory-pinpoint=1496354400,1496430027.png (83.2 KB) - added by DeS 9 months ago.
netstat-pinpoint=1496354412,1496425827.png (47.7 KB) - added by DeS 9 months ago.
PROC_MEMINFO.log (301.0 KB) - added by DeS 9 months ago.
SMEM.log (825.8 KB) - added by DeS 9 months ago.
mem.2.log (177.2 KB) - added by DeS 9 months ago.
syslog (14.0 KB) - added by DeS 9 months ago.
collectl.log (234.1 KB) - added by DeS 9 months ago.
20170607.tar.gz (339.3 KB) - added by DeS 9 months ago.
syslog.2 (12.9 KB) - added by DeS 8 months ago.
20170615.tar.gz (303.8 KB) - added by DeS 8 months ago.
swap-pinpoint=1497453606,1497561606.png (35.9 KB) - added by DeS 8 months ago.
20170708.tar.gz (355.5 KB) - added by DeS 8 months ago.

Change History (109)

comment:1 Changed 9 months ago by DeS

Summary: Frequent OOM Kill of tor precess → Frequent OOM kills of tor precess

comment:2 Changed 9 months ago by nickm

Have you tried setting the MaxMemInQueues option? It tries to limit the total amount of memory Tor uses.
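
For reference, a rough sketch of what setting it could look like (the torrc path, the 2048 MB value, and the reload-by-SIGHUP step are placeholders/assumptions for a multi-instance setup like this one, not something taken from this report):

# Hypothetical example only: append a cap to one instance's torrc, then make
# that instance re-read its configuration by sending it SIGHUP.
echo "MaxMemInQueues 2048 MB" | sudo tee -a /etc/tor/torrc1
sudo pkill -HUP -f '/etc/tor/torrc1'   # matches the tor started with -f /etc/tor/torrc1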

comment:3 Changed 9 months ago by nickm

Milestone: Tor: 0.3.0.x-final

comment:4 Changed 9 months ago by arma

How much memory does the system have? I'm having trouble reading the log lines for it, but it looks like maybe not very much?

comment:5 Changed 9 months ago by ryru

The system has about 4 GB memory:

# cat /proc/meminfo
MemTotal: 3877580 kB
MemFree: 2250280 kB

We have been running this configuration since mid-2015.

No, we have not tried the MaxMemInQueues setting yet. It is unset, therefore Tor should pick a proper configuration based on the physically available memory, shouldn't it?

Many thanks for your support!

comment:6 in reply to:  5 Changed 9 months ago by arma

Replying to ryru:

we have not tried the MaxMemInQueues setting yet. It is unset, therefore Tor should pick a proper configuration based on the physically available memory, shouldn't it?

There's a notice-level log when Tor starts up, telling you what value it picked, e.g.

May 10 03:34:40.480 [notice] Based on detected system memory, MaxMemInQueues is set to 8192 MB. You can override this by setting MaxMemInQueues by hand.
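
A quick way to check which value a running instance picked is to grep for that notice in its log, e.g. (a sketch; the log path below is an assumption based on the notices.log mentioned in the description):

grep -i 'MaxMemInQueues' /var/log/tor/notices.log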

comment:7 in reply to:  5 Changed 9 months ago by arma

Replying to ryru:

The system has about 4 GB memory:

It looks from your earlier log like it takes 12 hours or something before it dies?

Can you track the memory size over that 12 hours? That is, is it fine until suddenly everything goes bad, or does it slowly grow?

Changed 9 months ago by DeS

Attachment: mem.log added

Logfile of memory usage

comment:8 Changed 9 months ago by DeS

I uploaded the output of a script I made. It shows memory usage as reported by the following commands and runs every 60 seconds.

#!/bin/bash -e

export LC_ALL=C
FILE=/home/user/mem.log
echo "      date     time $(free -m | grep total | sed -E 's/^    (.*)/\1/g')" >>$FILE
while true; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(/usr/bin/free -m | grep Mem: | sed 's/Mem://g')" >>$FILE
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(/usr/bin/free -m | grep Swap: | sed 's/Swap://g')" >>$FILE
    sleep 60 
done


You can see it starts 3 days ago. Most of the time the usage is quite stable at around 1563 MByte.
If you fast forward to line 4448 of mem.log, which is about 1 hour before the documented OOM, you will see a fast increase of the memory usage to 3706 MByte before the process is killed.

Many thanks for your support and help

comment:9 Changed 9 months ago by nickm

Component: Core Tor → Core Tor/Tor

comment:10 Changed 9 months ago by DeS

Before the problems occurred, the following Ubuntu updates were installed:

2017-05-06 06:50:53 status installed man-db:amd64 2.6.7.1-1ubuntu1
2017-05-06 06:50:53 status installed login:amd64 1:4.1.5.1-1ubuntu9.4
2017-05-06 06:50:53 status installed man-db:amd64 2.6.7.1-1ubuntu1
2017-05-06 06:50:53 status installed ureadahead:amd64 0.100.0-16
2017-05-06 06:50:53 status installed passwd:amd64 1:4.1.5.1-1ubuntu9.4
2017-05-06 13:29:10 status installed install-info:amd64 5.2.0.dfsg.1-2
2017-05-06 13:29:11 status installed man-db:amd64 2.6.7.1-1ubuntu1
2017-05-06 13:29:11 status installed bash:amd64 4.3-7ubuntu1.6
2017-05-10 06:26:21 status installed librtmp0:amd64 2.4+20121230.gitdf6c518-1ubuntu0.1
2017-05-10 06:26:21 status installed libfreetype6:amd64 2.5.2-1ubuntu2.8
2017-05-10 06:26:22 status installed libc-bin:amd64 2.19-0ubuntu6.11

Do you think one of these updated packages could have triggered this behaviour?

comment:11 Changed 9 months ago by arma

I found this part interesting:

3159    2017-05-13 23:22:01          3931          0       3931
3160	2017-05-13 23:23:01           3786       1852       1934          0         66        518
3161	2017-05-13 23:23:01          3931          0       3931
3162	2017-05-13 23:24:01           3786       2217       1569          0         66        518
3163	2017-05-13 23:24:01          3931          0       3931
3164	2017-05-13 23:25:01           3786       2335       1451          0         66        518
3165	2017-05-13 23:25:01          3931          0       3931
3166	2017-05-13 23:26:01           3786       2527       1259          0         66        496
3167	2017-05-13 23:26:01          3931          0       3931
3168	2017-05-13 23:27:01           3786       2952        833          0         66        496
3169	2017-05-13 23:27:01          3931          0       3931
3170	2017-05-13 23:28:01           3786       3020        766          0         66        496
3171	2017-05-13 23:28:01          3931          0       3931
3172	2017-05-13 23:29:01           3786       3110        675          0         66        496
3173	2017-05-13 23:29:01          3931          0       3931
3174	2017-05-13 23:30:01           3786       3023        763          0         66        496
3175	2017-05-13 23:30:01          3931          0       3931
3176	2017-05-13 23:31:01           3786       2694       1091          0         66        496
3177	2017-05-13 23:31:01          3931          0       3931
3178	2017-05-13 23:32:01           3786       2338       1448          0         66        496
3179	2017-05-13 23:32:01          3931          0       3931
3180	2017-05-13 23:33:01           3786       1893       1893          0         66        496
3181	2017-05-13 23:33:01          3931          0       3931
3182	2017-05-13 23:34:01           3786       1614       2171          0         66        496
3183	2017-05-13 23:34:01          3931          0       3931

It looks like memory load went up, but not enough to kill it, and then went down again on its own.

So it looks like it is not always following the pattern of "grow forever until it's too big".

comment:12 Changed 9 months ago by teor

Summary: Frequent OOM kills of tor precess → Frequent OOM kills of tor process

comment:13 in reply to:  11 Changed 9 months ago by DeS

I checked this time slot. No OOM kill there, so you are right: the process is not always killed.
However, yesterday we had 4-5 OOMs on this machine with two tor processes.

Replying to arma:

I found this part interesting:

3159    2017-05-13 23:22:01          3931          0       3931
3160	2017-05-13 23:23:01           3786       1852       1934          0         66        518
3161	2017-05-13 23:23:01          3931          0       3931
3162	2017-05-13 23:24:01           3786       2217       1569          0         66        518
3163	2017-05-13 23:24:01          3931          0       3931
3164	2017-05-13 23:25:01           3786       2335       1451          0         66        518
3165	2017-05-13 23:25:01          3931          0       3931
3166	2017-05-13 23:26:01           3786       2527       1259          0         66        496
3167	2017-05-13 23:26:01          3931          0       3931
3168	2017-05-13 23:27:01           3786       2952        833          0         66        496
3169	2017-05-13 23:27:01          3931          0       3931
3170	2017-05-13 23:28:01           3786       3020        766          0         66        496
3171	2017-05-13 23:28:01          3931          0       3931
3172	2017-05-13 23:29:01           3786       3110        675          0         66        496
3173	2017-05-13 23:29:01          3931          0       3931
3174	2017-05-13 23:30:01           3786       3023        763          0         66        496
3175	2017-05-13 23:30:01          3931          0       3931
3176	2017-05-13 23:31:01           3786       2694       1091          0         66        496
3177	2017-05-13 23:31:01          3931          0       3931
3178	2017-05-13 23:32:01           3786       2338       1448          0         66        496
3179	2017-05-13 23:32:01          3931          0       3931
3180	2017-05-13 23:33:01           3786       1893       1893          0         66        496
3181	2017-05-13 23:33:01          3931          0       3931
3182	2017-05-13 23:34:01           3786       1614       2171          0         66        496
3183	2017-05-13 23:34:01          3931          0       3931

It looks like memory load went up, but not enough to kill it, and then went down again on its own.

So it looks like it is not always following the pattern of "grow forever until it's too big".

comment:14 Changed 9 months ago by DeS

I added this line to the torrc of both processes:

MaxMemInQueues 3000MB

Since then (about 13 hours ago) everything has been stable. Let's see how it works out.

comment:15 Changed 9 months ago by DeS

Despite the MaxMemInQueues measure, we still experience the problem that memory usage starts rising until the tor process gets killed (OOM).

In the last 24 hours such an OOM took place 4 times on our two servers with 2 tor processes each.

I also had a quick discussion with another operator based on this tweet: https://twitter.com/FrennVunDerEnn/status/864583496072876034

It seems they experience a spontaneous rise in memory consumption as well.

comment:16 Changed 9 months ago by cypherpunks

I experience the same problem on 3 non-exit relays. It started on 2017-05-18 (running Debian 8, 0.3.0.5-rc at the time).

comment:17 Changed 9 months ago by DeS

Please see the tor-relays@… mailing list.

Several exit operators suffer from this problem.

comment:18 Changed 9 months ago by nickm

Is there any commonality between versions reporting this issue? Is it all 0.3.0? Is it all one operating system?

Is anybody able to build with some heap profiling tool (such as the one in gperftools) to see where all the memory is going?
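
For anyone attempting that, one possible approach is gperftools' heap profiler, roughly along these lines (a sketch only; the package names, library path, and output locations are assumptions for a Debian/Ubuntu system and may need adjusting):

# Hypothetical sketch: preload tcmalloc (which contains the heap profiler) and
# ask it to dump heap profiles, then inspect a dump with pprof.
sudo apt-get install libgoogle-perftools4 google-perftools
sudo -u debian-tor env HEAPPROFILE=/tmp/tor.hprof \
    LD_PRELOAD=/usr/lib/libtcmalloc.so.4 \
    /usr/bin/tor -f /etc/tor/torrc1
# The library may live elsewhere, e.g. /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.
# Once memory has grown, look at the latest numbered dump:
google-pprof --text /usr/bin/tor /tmp/tor.hprof.0001.heap | head -n 30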

comment:19 Changed 9 months ago by nickm

Priority: Medium → High
Severity: Normal → Major

comment:20 in reply to:  19 Changed 9 months ago by DeS

Replying to nickm:
Our system is Ubuntu 14.04.5 LTS, and we tried 0.3.0.5-rc-1~trusty+1 and 0.2.9.10-1~trusty+1.

The feedback from the mailing list was:
Debian Jessie with tor-0.2.9.10

Niftybunny reported the same problems with 0.3.0.5; upgrading to 0.3.0.7 helped on most relays.
(Roger remarked that there would be no relevant change between the 0.3.0.5 and 0.3.0.7 versions.)

So it seems the same across Debian/Ubuntu; both 0.2.9.x and 0.3.x are affected.

comment:21 Changed 9 months ago by DeS

I will try to get valgrind running with the tor process as described by Roger on the mailing list and report back.

MaxMemInQueues 3000MB does not seem to be a solution, since I'm down to 2000MB for two processes on a machine with 4GB physical memory and there has been absolutely no change so far.

comment:22 in reply to:  description ; Changed 9 months ago by arma

Replying to DeS:

May 14 11:02:47 tor1 kernel: [125982.993478] Killed process 33180 (tor) total-vm:444012kB, anon-rss:0kB, file-rss:0kB

This part has been bothering me. Am I reading that right, that the tor process it kills is using 444 megabytes of memory? That is way less than 4 gigs.

Changed 9 months ago by DeS

Attachment: valgrind.txt added

comment:23 Changed 9 months ago by DeS

As proposed by arma, I ran this command on a freshly compiled alpha tor:

nohup valgrind --leak-check=yes --error-limit=no --undef-value-errors=no src/or/tor geoipfile src/config/geoip -f /etc/tor/torrc2 &>valgrind.out &

Somewhere during the last day this process stopped for no known reason. I attach the output:
https://trac.torproject.org/projects/tor/attachment/ticket/22255/valgrind.txt

Hope that helps.

Frankly, our tor service is almost not maintainable anymore. I will try one last thing and check for the processes by script and restart them. But this problem is serious.

comment:24 in reply to:  23 ; Changed 9 months ago by arma

Replying to DeS:

Somewhere during the last day this process stopped for no known reason. I attach the output:
https://trac.torproject.org/projects/tor/attachment/ticket/22255/valgrind.txt

Awesome, you found a bug! I filed it as #22368.

Alas, it isn't enough to explain the OOM issues on this ticket. Once we've fixed that ticket, I'll ask everybody to run the valgrinds again and see what comes up next.

Thanks!

comment:25 in reply to:  24 Changed 9 months ago by arma

Replying to arma:

Once we've fixed that ticket, I'll ask everybody to run the valgrinds again and see what comes up next.

Oh, if you are excited to get going before that, trying valgrind on the "maint-0.3.0" branch will be more stable, but even more interesting if we find bugs in it.

To get to it, after the "git clone" line, do a "git checkout maint-0.3.0" and then proceed like before.
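
Spelled out, the steps might look roughly like this (a sketch; the clone URL and configure flags are assumptions, and the valgrind line simply mirrors the command recorded earlier in this ticket):

# Hypothetical sketch of the build-and-valgrind run on the maint-0.3.0 branch.
git clone https://git.torproject.org/tor.git
cd tor
git checkout maint-0.3.0
./autogen.sh && ./configure --disable-asciidoc && make
nohup valgrind --leak-check=yes --error-limit=no --undef-value-errors=no \
    src/or/tor geoipfile src/config/geoip -f /etc/tor/torrc2 &>valgrind.out &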

comment:26 Changed 9 months ago by DeS

Thanks for your efforts.
I'm not technical enough to read anything out of the valgrind output, and I don't really understand the defects you mention.
Nevertheless, if you need a valgrind test run I can do one.

But our relays went up and down so much that I want to give them some "recovery" time. As I write this, another OOM kill took place and the relay went from previously (3 weeks ago) around 400 Mbit/s to 40 Mbit/s.

Still, my only solution to the problem is to check if the processes are around, and if not, start them.
So if it is a big benefit for you, let me know (you have the email from the mailing list) and I will make another test run.

comment:27 Changed 9 months ago by DeS

Following teor's advice on the tor-relays@… mailing list, I set the following values:
DirPort 0
DirCache 0

comment:28 in reply to:  22 Changed 9 months ago by arma

Replying to arma:

Replying to DeS:

May 14 11:02:47 tor1 kernel: [125982.993478] Killed process 33180 (tor) total-vm:444012kB, anon-rss:0kB, file-rss:0kB

This part has been bothering me. Am I reading that right, that the tor process it kills is using 444 megabytes of memory? That is way less than 4 gigs.

DeS, have you explored this one? A fast Tor relay will use way more than 444 megabytes of memory, so of course it will be dying if that's all it can have. What else is happening on the system such that it can't get any more memory?

Changed 9 months ago by DeS

comment:29 Changed 9 months ago by DeS

Obviously a process will die if it can't get sufficient memory.
But as you saw in the memory logs, the normal usage is way below 400 MB.
And our exits are limited in regards to bandwidth, so we never really exceed 4xx Mbit/s.

But sometimes, as you saw in the memlog, memory consumption starts rising fast, ending with the kill or a completely unresponsive machine.
What we are looking for is the cause of the rising memory consumption.
I repeat: these machines have been running for 1-2 years without such problems, and the problem started occurring some weeks ago. As we have seen on the mailing list, it is also not just these servers; otherwise I would just set them up freshly.

dirk@tor2:~$ ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  33432  1528 ?        Ss   Mai22   0:00 /sbin/init
root         2  0.0  0.0      0     0 ?        S    Mai22   0:00 [kthreadd]
root         3  0.1  0.0      0     0 ?        S    Mai22   9:18 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kworker/0:0H]
root         7  0.2  0.0      0     0 ?        S    Mai22  16:50 [rcu_sched]
root         8  0.1  0.0      0     0 ?        S    Mai22   7:02 [rcuos/0]
root         9  0.1  0.0      0     0 ?        S    Mai22   7:16 [rcuos/1]
root        10  0.1  0.0      0     0 ?        S    Mai22   6:50 [rcuos/2]
root        11  0.1  0.0      0     0 ?        S    Mai22   7:25 [rcuos/3]
root        12  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/4]
root        13  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/5]
root        14  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/6]
root        15  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/7]
root        16  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/8]
root        17  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/9]
root        18  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/10]
root        19  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/11]
root        20  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/12]
root        21  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/13]
root        22  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/14]
root        23  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/15]
root        24  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/16]
root        25  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/17]
root        26  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/18]
root        27  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/19]
root        28  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/20]
root        29  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/21]
root        30  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/22]
root        31  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/23]
root        32  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/24]
root        33  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/25]
root        34  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/26]
root        35  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/27]
root        36  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/28]
root        37  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/29]
root        38  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/30]
root        39  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/31]
root        40  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/32]
root        41  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/33]
root        42  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/34]
root        43  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/35]
root        44  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/36]
root        45  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/37]
root        46  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/38]
root        47  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/39]
root        48  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/40]
root        49  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/41]
root        50  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/42]
root        51  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/43]
root        52  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/44]
root        53  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/45]
root        54  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/46]
root        55  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/47]
root        56  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/48]
root        57  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/49]
root        58  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/50]
root        59  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/51]
root        60  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/52]
root        61  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/53]
root        62  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/54]
root        63  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/55]
root        64  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/56]
root        65  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/57]
root        66  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/58]
root        67  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/59]
root        68  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/60]
root        69  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/61]
root        70  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/62]
root        71  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuos/63]
root        72  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcu_bh]
root        73  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/0]
root        74  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/1]
root        75  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/2]
root        76  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/3]
root        77  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/4]
root        78  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/5]
root        79  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/6]
root        80  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/7]
root        81  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/8]
root        82  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/9]
root        83  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/10]
root        84  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/11]
root        85  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/12]
root        86  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/13]
root        87  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/14]
root        88  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/15]
root        89  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/16]
root        90  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/17]
root        91  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/18]
root        92  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/19]
root        93  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/20]
root        94  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/21]
root        95  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/22]
root        96  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/23]
root        97  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/24]
root        98  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/25]
root        99  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/26]
root       100  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/27]
root       101  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/28]
root       102  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/29]
root       103  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/30]
root       104  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/31]
root       105  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/32]
root       106  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/33]
root       107  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/34]
root       108  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/35]
root       109  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/36]
root       110  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/37]
root       111  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/38]
root       112  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/39]
root       113  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/40]
root       114  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/41]
root       115  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/42]
root       116  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/43]
root       117  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/44]
root       118  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/45]
root       119  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/46]
root       120  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/47]
root       121  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/48]
root       122  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/49]
root       123  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/50]
root       124  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/51]
root       125  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/52]
root       126  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/53]
root       127  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/54]
root       128  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/55]
root       129  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/56]
root       130  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/57]
root       131  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/58]
root       132  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/59]
root       133  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/60]
root       134  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/61]
root       135  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/62]
root       136  0.0  0.0      0     0 ?        S    Mai22   0:00 [rcuob/63]
root       137  0.0  0.0      0     0 ?        S    Mai22   0:00 [migration/0]
root       138  0.0  0.0      0     0 ?        S    Mai22   0:01 [watchdog/0]
root       139  0.0  0.0      0     0 ?        S    Mai22   0:00 [watchdog/1]
root       140  0.0  0.0      0     0 ?        S    Mai22   0:00 [migration/1]
root       141  0.0  0.0      0     0 ?        S    Mai22   4:14 [ksoftirqd/1]
root       143  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kworker/1:0H]
root       144  0.0  0.0      0     0 ?        S    Mai22   0:01 [watchdog/2]
root       145  0.0  0.0      0     0 ?        S    Mai22   0:00 [migration/2]
root       146  0.1  0.0      0     0 ?        S    Mai22   8:21 [ksoftirqd/2]
root       148  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kworker/2:0H]
root       149  0.0  0.0      0     0 ?        S    Mai22   0:01 [watchdog/3]
root       150  0.0  0.0      0     0 ?        S    Mai22   0:00 [migration/3]
root       151  0.0  0.0      0     0 ?        S    Mai22   3:46 [ksoftirqd/3]
root       153  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kworker/3:0H]
root       154  0.0  0.0      0     0 ?        S<   Mai22   0:00 [khelper]
root       155  0.0  0.0      0     0 ?        S    Mai22   0:00 [kdevtmpfs]
root       156  0.0  0.0      0     0 ?        S<   Mai22   0:00 [netns]
root       157  0.0  0.0      0     0 ?        S<   Mai22   0:00 [writeback]
root       158  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kintegrityd]
root       159  0.0  0.0      0     0 ?        S<   Mai22   0:00 [bioset]
root       160  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kworker/u129:0]
root       161  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kblockd]
root       162  0.0  0.0      0     0 ?        S<   Mai22   0:00 [ata_sff]
root       163  0.0  0.0      0     0 ?        S    Mai22   0:00 [khubd]
root       164  0.0  0.0      0     0 ?        S<   Mai22   0:00 [md]
root       165  0.0  0.0      0     0 ?        S<   Mai22   0:00 [devfreq_wq]
root       166  0.3  0.0      0     0 ?        R    Mai22  20:23 [kworker/0:1]
root       168  0.0  0.0      0     0 ?        S    Mai22   0:00 [khungtaskd]
root       169  2.3  0.0      0     0 ?        S    Mai22 154:44 [kswapd0]
root       170  0.0  0.0      0     0 ?        S<   Mai22   0:00 [vmstat]
root       171  0.0  0.0      0     0 ?        SN   Mai22   0:00 [ksmd]
root       172  0.0  0.0      0     0 ?        SN   Mai22   0:02 [khugepaged]
root       173  0.0  0.0      0     0 ?        S    Mai22   0:00 [fsnotify_mark]
root       174  0.0  0.0      0     0 ?        S    Mai22   0:00 [ecryptfs-kthrea]
root       175  0.0  0.0      0     0 ?        S<   Mai22   0:00 [crypto]
root       187  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kthrotld]
root       189  0.2  0.0      0     0 ?        S    Mai22  13:59 [kworker/1:1]
root       208  0.0  0.0      0     0 ?        S<   Mai22   0:00 [deferwq]
root       209  0.0  0.0      0     0 ?        S<   Mai22   0:00 [charger_manager]
root       264  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kpsmoused]
root       266  0.0  0.0      0     0 ?        S    Mai22   0:00 [scsi_eh_0]
root       267  0.0  0.0      0     0 ?        S    Mai22   0:00 [hpsa]
root       270  0.0  0.0      0     0 ?        S    Mai22   0:00 [scsi_eh_1]
root       271  0.0  0.0      0     0 ?        S    Mai22   0:00 [scsi_eh_2]
root       272  0.0  0.0      0     0 ?        S    Mai22   0:00 [scsi_eh_3]
root       273  0.0  0.0      0     0 ?        S    Mai22   0:00 [scsi_eh_4]
root       274  0.0  0.0      0     0 ?        S    Mai22   0:00 [scsi_eh_5]
root       275  0.0  0.0      0     0 ?        S    Mai22   0:00 [scsi_eh_6]
root       281  0.2  0.0      0     0 ?        S    Mai22  18:31 [kworker/2:1]
root       284  0.4  0.0      0     0 ?        S    Mai22  26:21 [kworker/3:1]
root       293  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kdmflush]
root       294  0.0  0.0      0     0 ?        S<   Mai22   0:00 [bioset]
root       296  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kdmflush]
root       297  0.0  0.0      0     0 ?        S<   Mai22   0:00 [bioset]
root       314  0.0  0.0      0     0 ?        S    Mai22   0:01 [jbd2/dm-0-8]
root       315  0.0  0.0      0     0 ?        S<   Mai22   0:00 [ext4-rsv-conver]
root       440  0.0  0.0      0     0 ?        S<   Mai22   0:00 [ext4-rsv-conver]
root       479  0.0  0.0  19484   196 ?        S    Mai22   0:00 upstart-udev-bridge --daemon
root       506  0.0  0.0  51540     4 ?        Ss   Mai22   0:00 /lib/systemd/systemd-udevd --daemon
root       525  0.0  0.0  15380   208 ?        S    Mai22   0:00 upstart-file-bridge --daemon
root       557  0.0  0.0      0     0 ?        SN   Mai22   0:21 [kipmi0]
message+   617  0.0  0.0  39252   640 ?        Ss   Mai22   0:00 dbus-daemon --system --fork
syslog     639  0.0  0.0 255848  1688 ?        Ssl  Mai22   0:00 rsyslogd
root       644  0.0  0.0      0     0 ?        S<   Mai22   0:00 [kvm-irqfd-clean]
root       704  0.0  0.0  43456   924 ?        Ss   Mai22   0:00 /lib/systemd/systemd-logind
root       738  0.0  0.0  15268   216 ?        S    Mai22   0:00 upstart-socket-bridge --daemon
root      1034  0.0  0.0  17312     8 tty4     Ss+  Mai22   0:00 /sbin/getty -8 38400 tty4
root      1037  0.0  0.0  17312     8 tty5     Ss+  Mai22   0:00 /sbin/getty -8 38400 tty5
root      1042  0.0  0.0  17312     8 tty2     Ss+  Mai22   0:00 /sbin/getty -8 38400 tty2
root      1043  0.0  0.0  17312     8 tty3     Ss+  Mai22   0:00 /sbin/getty -8 38400 tty3
root      1045  0.0  0.0  17312     8 tty6     Ss+  Mai22   0:00 /sbin/getty -8 38400 tty6
root      1060  0.0  0.0  61388   320 ?        Ss   Mai22   0:00 /usr/sbin/sshd -D
root      1069  0.0  0.0   4376     4 ?        Ss   Mai22   0:00 acpid -c /etc/acpi/events -s /var/run/acpid.socket
root      1095  0.0  0.0  23660   192 ?        Ss   Mai22   0:00 cron
daemon    1096  0.0  0.0  19144     0 ?        Ss   Mai22   0:00 atd
root      1116  0.0  0.0  19200   400 ?        Ss   Mai22   0:29 /usr/sbin/irqbalance
unbound   1265  2.9  1.3  96720 51748 ?        Ss   Mai22 193:16 /usr/sbin/unbound
vnstat    1317  0.0  0.0   7484   360 ?        Ss   Mai22   0:16 /usr/sbin/vnstatd -d --pidfile /run/vnstat/vnstat.pid
root      1396  0.0  0.0  17312     8 tty1     Ss+  Mai22   0:00 /sbin/getty -8 38400 tty1
root      1410  0.0  0.0      0     0 ?        S    Mai22   0:00 [kauditd]
dirk      1610  0.0  0.0  14100   840 ?        S    Mai22   0:10 /bin/bash -e ./memlog.sh
root      1836  0.0  0.0      0     0 ?        S    Mai26   0:00 [kworker/1:0]
root     13028  0.0  0.0      0     0 ?        S    Mai23   0:00 [kworker/0:0]
root     14819  0.0  0.0      0     0 ?        S<   Mai23   0:00 [kworker/u129:1]
root     15495  0.0  0.0      0     0 ?        S    02:26   0:00 [kworker/u128:1]
root     20329  0.0  0.0      0     0 ?        S    08:06   0:00 [kworker/3:2]
debian-+ 20802 28.6  3.3 309628 128476 ?       Sl   08:11   7:49 /usr/bin/tor --defaults-torrc /usr/share/tor/tor-service-defaults-torrc -f /etc/tor/torrc1 --hush
debian-+ 21033 18.4  3.2 305140 124116 ?       Rl   08:12   4:49 /usr/bin/tor --defaults-torrc /usr/share/tor/tor-service-defaults-torrc -f /etc/tor/torrc2 --hush
root     21243  0.0  0.0      0     0 ?        S    08:23   0:00 [kworker/u128:2]
root     21386  0.0  0.1 105656  4340 ?        Ss   08:34   0:00 sshd: dirk [priv]   
dirk     21451  0.0  0.0 105656  1888 ?        S    08:34   0:00 sshd: dirk@pts/0    
dirk     21452  0.0  0.0  24080  3592 pts/0    Ss   08:34   0:00 -bash
dirk     21521  0.0  0.0   4352   360 ?        S    08:37   0:00 sleep 60
dirk     21523  0.0  0.0  20000  1316 pts/0    R+   08:38   0:00 ps aux
root     47576  0.0  0.0      0     0 ?        S    Mai25   0:00 [kworker/2:2]

I will attach a screenshot https://trac.torproject.org/projects/tor/attachment/ticket/22255/Screenshot_2017-05-26_09-56-09.png of top from when the problem was acute and I was logged in yesterday. I could not investigate further since the machine was very unresponsive (I was kicked out of the shell / local console).

Last edited 9 months ago by DeS (previous) (diff)

comment:30 Changed 9 months ago by cypherpunks

Do you have system monitoring in place that shows you memory usage over time (e.g. munin)? Can you attach the following graphs:

  • memory usage
  • swap usage
  • swap in/out per second

Also consider adding the following to your regularly running data collection scripts, so we might find out what else is using your memory such that tor only gets a small chunk of your 4GB.

ps aux --sort=-%mem

comment:31 Changed 9 months ago by cypherpunks

Also: when you look at "top" output, sort by used memory (press the "M" key after starting top).

Changed 9 months ago by DeS

Attachment: syslog20170528 added

Changed 9 months ago by DeS

Attachment: SMEM20170528.log added

comment:32 in reply to:  30 ; Changed 9 months ago by DeS

Replying to cypherpunks:

Do you have system monitoring in place that shows you memory usage over time (e.g. munin)? Can you attach the following graphs:

  • memory usage
  • swap usage
  • swap in/out per second

Also consider adding the following to your regularly running data collection scripts, so we might find out what else is using your memory such that tor only gets a small chunk of your 4GB.

ps aux --sort=-%mem

I have no munin running up to now. But let me throw at you what I have.

1) Logs which show that a process died and was restarted, from today:

Son Mai 28 03:50:06 CEST 2017 Process tor1 is not running -> starting
Son Mai 28 04:27:01 CEST 2017 Process tor2 is not running -> starting
Son Mai 28 04:32:13 CEST 2017 Process tor2 is not running -> starting

This will give us the rough times of three events today.

2) Next, the syslog (shortened) will show what happened: https://trac.torproject.org/projects/tor/attachment/ticket/22255/syslog20170528
3) I run smem every 5 minutes to provide info about the process memory usage (should be like ps aux); see attachment https://trac.torproject.org/projects/tor/attachment/ticket/22255/SMEM20170528.log

I hope that gives you what you need. If not, let me know which tools would provide the most suitable output for you and I will see to it.

comment:33 in reply to:  32 Changed 9 months ago by cypherpunks

Replying to DeS:

I have no munin running up to now.

I think you want to have system monitoring (CPU, memory, swap, network) on a system anyway for proper management (e.g. recognizing rising memory usage trends, excessive swapping, very high CPU usage, ...).

let me know what tools would provide the best suitible output for you and I see to it.

maybe /proc/meminfo
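
One way to collect that regularly, in the same spirit as the existing memlog.sh, might be the following sketch (the output path and 5-minute interval are assumptions; it also captures the ps aux --sort=-%mem output suggested earlier):

#!/bin/bash -e
# Hypothetical companion to memlog.sh: snapshot /proc/meminfo and the biggest
# memory consumers every 5 minutes, so an OOM can be correlated afterwards.
export LC_ALL=C
OUT=/home/user/meminfo.log
while true; do
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        cat /proc/meminfo
        ps aux --sort=-%mem | head -n 15
    } >>"$OUT"
    sleep 300
done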

comment:34 Changed 9 months ago by DeS

OK, I installed a local munin. Will this give you all the information needed?

comment:35 Changed 9 months ago by DeS

Current /proc/meminfo output; I will add it to be generated every 5 minutes.

MemTotal:        3877580 kB
MemFree:          777392 kB
Buffers:          112468 kB
Cached:           688980 kB
SwapCached:        10748 kB
Active:          1102040 kB
Inactive:         249044 kB
Active(anon):     512100 kB
Inactive(anon):    39020 kB
Active(file):     589940 kB
Inactive(file):   210024 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4026364 kB
SwapFree:        4007204 kB
Dirty:               128 kB
Writeback:             0 kB
AnonPages:        545892 kB
Mapped:            87924 kB
Shmem:              1484 kB
Slab:             269576 kB
SReclaimable:     112900 kB
SUnreclaim:       156676 kB
KernelStack:        2864 kB
PageTables:         5096 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     5965152 kB
Committed_AS:    1074140 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       84072 kB
VmallocChunk:   34359649392 kB
HardwareCorrupted:     0 kB
AnonHugePages:    428032 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      149388 kB
DirectMap2M:     3878912 kB
DirectMap1G:           0 kB

Changed 9 months ago by DeS

Changed 9 months ago by DeS

Attachment: SMEM_Mo.log added

Changed 9 months ago by DeS

Attachment: PROC_MEMINFO_Mon.log added

comment:36 Changed 9 months ago by DeS

So after I got Munin installed, I waited.
Yesterday two more OOMs occurred:

Mon Mai 29 16:06:07 CEST 2017 Process tor1 is not running -> starting
Mon Mai 29 17:20:01 CEST 2017 Process tor2 is not running -> starting

Syslog:

May 29 16:05:51 tor1 kernel: [502910.776183] Out of memory: Kill process 53045 (tor) score 32 or sacrifice child
May 29 17:17:07 tor1 kernel: [507187.634566] Out of memory: Kill process 53165 (tor) score 25 or sacrifice child

Unfortunately munin was not too successful. It died when things got interesting and only lived very briefly between the two occurrences.
https://trac.torproject.org/projects/tor/attachment/ticket/22255/Screenshot_2017-05-30_20-11-49.png

But I got you the logs from SMEM and cat /proc/meminfo.
SMEM
https://trac.torproject.org/projects/tor/attachment/ticket/22255/SMEM_Mo.log
cat /proc/meminfo
https://trac.torproject.org/projects/tor/attachment/ticket/22255/PROC_MEMINFO_Mon.log

Hope this helps.

comment:37 Changed 9 months ago by cypherpunks

When these extreme spikes in memory usage happen (example: before 2017-05-30 12:00), what do the load/CPU/bandwidth/established connections look like at that point in time?

Changed 9 months ago by DeS

Attachment: munin.tar.gz added

comment:38 Changed 9 months ago by DeS

Here are the graphics. https://trac.torproject.org/projects/tor/attachment/ticket/22255/munin.tar.gz
At that slot (today, 12:00) it does not seem very out of the ordinary.

As said, once the real problem starts, munin fails.

comment:39 Changed 9 months ago by cypherpunks

It looks like a significant increase in connections when the memory spikes happen. Please change the netstat graphs to non-log scale by adding this to your munin.conf (in your host's section, somewhere after [localhost]):

netstat.graph_args -l 0 --base 1000

so we can have a better look.

This part is kind of obvious:

  • when your system was knocked off, I/O wait was extremely high (probably due to swapping)

Please enable the swap munin plugin so we can see swap I/O.
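
For what it's worth, on a stock munin-node install enabling that plugin is typically just a symlink plus a restart (a sketch; the plugin and service paths below are assumptions about the Debian/Ubuntu packaging):

# Hypothetical example: activate the stock "swap" plugin and restart munin-node.
sudo ln -s /usr/share/munin/plugins/swap /etc/munin/plugins/swap
sudo service munin-node restart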

comment:40 in reply to:  39 ; Changed 9 months ago by DeS

Replying to cypherpunks:

It looks like a significant increase in connections when the memory spikes happen. Please change the netstat graphs to non-log scale by adding this to your munin.conf (in your host's section, somewhere after [localhost]):

netstat.graph_args -l 0 --base 1000

so we can have a better look.

Done.

This part is kind of obvious:

  • when your system was knocked off, I/O wait was extremely high (probably due to swapping)

Please enable the swap munin plugin so we can see swap I/O.

Swap is enabled. The picture is in the link: https://trac.torproject.org/projects/tor/attachment/ticket/22255/munin.tar.gz

comment:41 in reply to:  40 Changed 9 months ago by cypherpunks

Replying to DeS:

Replying to cypherpunks:

It looks like a significant increase in connections when the memory spikes happen. Please change the netstat graphs to non-log scale by adding this to your munin.conf (in your host's section, somewhere after [localhost]):

netstat.graph_args -l 0 --base 1000

so we can have a better look.

Done.

Please upload the netstat graph again (the change should take effect once the graphs are regenerated, which happens every 5 min and works retroactively).

Swap is enabled. The picture is in the Link https://trac.torproject.org/projects/tor/attachment/ticket/22255/munin.tar.gz

I see the swap partition file "ubuntu_vg_swap_1-day.png", but I don't see a swap-day.png (pages per second in/out).

Changed 9 months ago by DeS

Changed 9 months ago by DeS

comment:43 in reply to:  42 ; Changed 9 months ago by cypherpunks

Changed 9 months ago by DeS

comment:45 Changed 9 months ago by cypherpunks

Now you should have the tools to see if the spike in tor's memory usage is related to a spike in connections or bandwidth, but you won't have any data to compare with (from times where you didn't have the OOM problem).

On the graph around 2017-05-31 9:00 there is a (usual?) spike from 13k to 19k connections; I guess the memory spiked as well (you didn't provide memory graphs).

Also note: Be aware that connection graphs at that granularity are considered sensitive.

comment:46 in reply to:  45 ; Changed 9 months ago by DeS

Also note: Be aware that connection graphs at that granularity are considered sensitive.

I'm open to other ideas on how to exchange this information. But you will agree this problem needs to be looked at.

Clearly there is a correlation between memory and connections. Please refer to:

https://trac.torproject.org/projects/tor/attachment/ticket/22255/memory-pinpoint%3D1496233206%2C1496341206.png

https://trac.torproject.org/projects/tor/attachment/ticket/22255/netstat-pinpoint%3D1496233206%2C1496341206.png

BUT
1) As you can see in these pictures, the system "dies" once at 19k connections, but two times later everything is fine at more than 20k connections
2) At every time (except the blank, obviously) there is plenty of free memory.
3) Our machines operate limited to ~320-400 Mbit/s and memory has never been an issue in 1-2 years
4) How do we explain that the tor processes just use about 400 MB of memory each (see arma's comment above)?

Something is not adding up here and this is what we have to find out.

comment:47 Changed 9 months ago by DeS

... btw, no OOM today; the last one (blank zone on the images) was yesterday:
Mit Mai 31 16:01:23 CEST 2017 Process tor1 is not running -> starting
Mit Mai 31 16:01:23 CEST 2017 Process tor1 is not running -> starting
Mit Mai 31 17:33:53 CEST 2017 Process tor2 is not running -> starting

comment:48 in reply to:  46 ; Changed 9 months ago by cypherpunks

Replying to DeS:

1) As you can see in these pictures, the system "dies" once at 19k connections, but two times later everything is fine at more than 20k connections

I would not be so sure about that since who knows what amount of connections it reached when it failed.

4) How do we explain that the tor processes just use about 400 MB of memory each (see arma's comment above)?
Something is not adding up here and this is what we have to find out.

You didn't provide the OOM log or your other logs this time, which would tell us how much it used. Please provide all logs you have for all new OOMs (not just munin graphs).

I think running two instances with MaxMemInQueues at 2000MB is too high, because you do not have more than 4 GB of memory.

Let's try setting it to:
x = (4000 MB - (memory_your_os_uses_without_tor + 400 MB safety)) / number_of_tor_instances

Are these physical or virtual machines?


comment:49 in reply to:  48 Changed 9 months ago by DeS

Replying to cypherpunks:

I think running two instances with MaxMemInQueues at 2000MB is too high, because you do not have more than 4 GB of memory.

Let's try setting it to:
x = (4000 MB - (memory_your_os_uses_without_tor + 400 MB safety)) / number_of_tor_instances

Will do so after the next comment, with a hopefully fully documented problem.

Are these physical or virtual machines?

These are physical machines.

Changed 9 months ago by DeS

Attachment: PROC_MEMINFO.log added

Changed 9 months ago by DeS

Attachment: SMEM.log added

Changed 9 months ago by DeS

Attachment: mem.2.log added

Changed 9 months ago by DeS

Attachment: syslog added

Changed 9 months ago by DeS

Attachment: collectl.log added

comment:51 Changed 9 months ago by DeS

And the question really boils down to:
How can a rising number of connections (to tor; nothing else is running) drain the 4 GB of memory from the machine when the tor processes themselves only use about 1 GB?
Where does the memory go, and why?

comment:52 Changed 9 months ago by DeS

x = (4000 MB - (memory_your_os_uses_without_tor + 400 MB safety)) / number_of_tor_instances

4000MB - (300MB +400 MB) = 3300MB
MaxMemInQueues 1650 MB

Trying this on server2 - leaving server1 with current values to make sure we can continue analyzing.
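
For reference, a hedged sketch of what this could look like in each instance's torrc on server2 (the value comes from the calculation above; adjust it if your measured OS baseline differs):

# torrc (one of the two instances on server2)
MaxMemInQueues 1650 MB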

comment:53 Changed 9 months ago by cypherpunks

some observations:

minimal logged free memory was 84MB at 2017-06-02 18:59:11
https://trac.torproject.org/projects/tor/attachment/ticket/22255/mem.2.log#L2256

2017-06-02 18:59:11           3786       3702         84          0          0          4

but https://trac.torproject.org/projects/tor/attachment/ticket/22255/PROC_MEMINFO.log
tells another story (it has a lower polling rate)? The lowest value is at:

Fre Jun  2 18:45:49 CEST 2017
MemTotal:        3877580 kB
MemFree:          139124 kB

already better after that:

Fre Jun  2 18:56:41 CEST 2017
MemTotal:        3877580 kB
MemFree:         3052860 kB

4000MB - (300MB +400 MB) = 3300MB

you have 3786 MB memory (not 4000 MB)

I find percentage (or at least '-k') output with relevant process at the top much more readable/useful:

smem -pr

(you might also reduce the output to relevant columns only)

-c 'command rss swap'

How can a rising number of connections (to tor; nothing else is running) drain the 4 GB of memory from the machine when the tor processes themselves only use about 1 GB?

My feeling is that tor uses about 500MB per process when it gets killed but uses much more before that.
Running 'smem' every 60 seconds might provide some valuable data.

comment:54 in reply to:  53 Changed 9 months ago by DeS

Thanks for the feedback.
smem logging pushed to every 60 seconds.
Options added as described.
Waiting for the next incident to document.

Replying to cypherpunks:

some observations:

minimal logged free memory was 84MB at 2017-06-02 18:59:11
https://trac.torproject.org/projects/tor/attachment/ticket/22255/mem.2.log#L2256

2017-06-02 18:59:11           3786       3702         84          0          0          4

but https://trac.torproject.org/projects/tor/attachment/ticket/22255/PROC_MEMINFO.log
tells another story (it has a lower polling rate)? The lowest value is at:

Fre Jun  2 18:45:49 CEST 2017
MemTotal:        3877580 kB
MemFree:          139124 kB

already better after that:

Fre Jun  2 18:56:41 CEST 2017
MemTotal:        3877580 kB
MemFree:         3052860 kB

4000MB - (300MB +400 MB) = 3300MB

you have 3786 MB memory (not 4000 MB)

I find percentage (or at least '-k') output with relevant process at the top much more readable/useful:

smem -pr

(you might also reduce the output to relevant columns only)

-c 'command rss swap'

How can a rising number of connections (to tor; nothing else is running) drain the 4 GB of memory from the machine when the tor processes themselves only use about 1 GB?

My feeling is that tor uses about 500MB per process when it gets killed but uses much more before that.
Running 'smem' every 60 seconds might provide some valuable data.

comment:55 Changed 9 months ago by cypherpunks

please also add the

-t

parameter (to have a total line)
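
Putting the suggested options together, the periodic logging command might end up looking like this (a sketch combining the flags from comments 53 and 55):

smem -prt -c 'command rss swap'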

Changed 9 months ago by DeS

Attachment: 20170607.tar.gz added

comment:56 Changed 9 months ago by DeS

OK, good news: I added all the parameters to my script, and today we had two OOMs.
Now the bad news: if you look at the relevant time, you will see the logs start to look strange.

Very easy explanation: the script which collects the stats needs more than a minute to run, and cron fires the next one, so we end up with several outputs in the files at the same time.

Stupid me, I did not put a lock file in. But wait: if the machine is that slow, what good would the extra disk access do? So that is no real solution either. Not running it via cron would not really help either; the server does not even have the resources to execute date and write the output to a file.

The fact is I can try some things, but getting clean logs from a completely dying machine is like sending a probe into a black hole and expecting pictures back.

But now, let's step back a little and think.

a) We see the tor processes use around 250-400 MB normally.
b) When they get killed they are bigger, 550-600 MB.
c) We have a machine with 3877580 kB (~3.8 GB).

We can try to collect log files for everything, but if you ask me this will not help. All you would see is the memory usage of tor going up to 600 MB. So what?
The question is: where did around 1-2 GB of memory go?
Even in the messed-up logs you will see no monster process consuming this memory. The biggest and fastest-growing process is tor, and the question is why it behaves like this.
I do not believe a 25% rise in TCP connections will swallow up all that memory, so the problem must be somewhere else - if you ask me, in tor.

Any chance tor has debug features (memory allocation ...)?

https://trac.torproject.org/projects/tor/attachment/ticket/22255/20170607.tar.gz

Last edited 9 months ago by DeS (previous) (diff)

Changed 8 months ago by DeS

Attachment: syslog.2 added

Changed 8 months ago by DeS

Attachment: 20170615.tar.gz added

comment:57 Changed 8 months ago by DeS

It became awfully quiet in this ticket.

Since the last report I changed the measurement script. It now runs permanently and sleeps 60 seconds between iterations. I also added a nice level of -19.
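
A rough sketch of such a permanently running logger (script name, log path, and the exact commands collected are placeholders; the smem flags follow the earlier suggestions):

#!/bin/sh
# memlogger.sh - start once with: nice -n -19 ./memlogger.sh &
while true; do
    {
        date
        cat /proc/meminfo
        smem -prt -c 'command rss swap'
    } >> /var/log/memlogger.log 2>&1
    sleep 60
done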

After the latest update I had two days of hope that the problem was fixed - until today.

I could measure an OOM this morning.
https://trac.torproject.org/projects/tor/attachment/ticket/22255/20170615.tar.gz

The logs look better now, but you will notice the entries are not written every minute. This is because of the machine slowdown once the problem starts.
Anyway, I think you will get a good impression of what happens.

Last edited 8 months ago by DeS (previous) (diff)

comment:58 Changed 8 months ago by cypherpunks

peak time was at Don Jun 15 04:21:40 CEST 2017

your smem logs say that

  • RSS system total: 0.33%
  • tor used very little memory but a lot of swap space (>270MB per process)

(which very likely causes your system to hang completely)
(again you do not provide any swap in/out graphs)

your /proc/meminfo says:

  • 3664 MB memory used (does not show in smem -> kernel memory?)
  • 679 MB SWAP used

What is your swappiness?

Possible next items:

  • add another smem output to list kernel memory:
  • smem -kwt
  • Try disabling tor's swap usage: DisableAllSwap 1

I guess this will kill tor earlier but prevent the extreme swapping/thrashing.

  • try to reproduce on clean system with same relay keys
  • guess I was wrong when I said tor might be using more memory before it was killed

Changed 8 months ago by DeS

comment:60 Changed 8 months ago by cypherpunks

https://en.wikipedia.org/wiki/Swappiness

cat /proc/sys/vm/swappiness

(but not very relevant anyway - I guess)

comment:61 in reply to:  59 Changed 8 months ago by cypherpunks

Replying to DeS:

Swappiness is a new word for me.
Do you mean this
https://trac.torproject.org/projects/tor/attachment/ticket/22255/swap-pinpoint%3D1497453606%2C1497561606.png

This is not the swappiness, but the swap in/out pages per second that I asked for.
And yes, a value >250 will make a system hang.

comment:63 Changed 8 months ago by DeS

Sorry about the late reply - Summer flu got me.

Swappiness

root@tor1:~#  cat /proc/sys/vm/swappiness
60

smem -kwt added to memory logger.

DisableAllSwap 1 configured. Had to add

  capability sys_resource,
  capability ipc_lock,

to the apparmor config for tor in order to start it.
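
For anyone hitting the same apparmor denials: the capability lines go inside the tor profile, which on Debian/Ubuntu is typically /etc/apparmor.d/system_tor, followed by a profile reload. A sketch (paths assumed):

# add inside the profile block of /etc/apparmor.d/system_tor
  capability sys_resource,
  capability ipc_lock,
# then reload the profile
apparmor_parser -r /etc/apparmor.d/system_tor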

Waiting for next event.

Changed 8 months ago by DeS

Attachment: 20170708.tar.gz added

comment:64 Changed 8 months ago by DeS

Finally I got around to checking again. OOMs still happen, but not as bad as before.
The logs should contain the information mentioned above.

The config contains DisableAllSwap 1 and is based on tor 0.3.0.9-1~trusty+1, since that is the current stable.
Here are the logfiles/munin screenshots:
https://trac.torproject.org/projects/tor/attachment/ticket/22255/20170708.tar.gz

comment:65 Changed 8 months ago by cypherpunks

4) How do we explain that the tor processes just use about 400 MB of memory each (see arma's comment above)?
How can a rising number of connections (to tor; nothing else is running) drain the 4 GB of memory from the machine when the tor processes themselves only use about 1 GB?
Where does the memory go, and why?

Sockets' buffers!

comment:66 Changed 7 months ago by cypherpunks

DeS, check your network hardware and uplink. If everything is OK, then try to reduce memory usage via:

net.ipv4.tcp_rmem
net.ipv4.tcp_wmem

Replace the third values (6 and 4 megabytes by default for your RAM size) with 262144 (256 KB).
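
A sketch of what that could look like in /etc/sysctl.conf (the first two values of each triple are the usual kernel defaults; only the third, the auto-tuning maximum, is capped at 256 KB as suggested - check your current values with 'sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem' first):

net.ipv4.tcp_rmem = 4096 87380 262144
net.ipv4.tcp_wmem = 4096 16384 262144
# apply with: sysctl -p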

comment:67 Changed 7 months ago by cypherpunks

try to reduce memory usage

If UDP is involved here, then also look at the net.ipv4.udp_mem values (in pages), and/or net.core.rmem_* and net.core.wmem_*.
TCP + UDP sockets can eat up to 1 GB on your system by default.
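
Along the same lines, a hedged sketch for inspecting and capping the UDP/core buffer knobs (net.ipv4.udp_mem is measured in pages; the example caps are illustrative only):

sysctl net.ipv4.udp_mem                # three values: min, pressure, max (in pages)
sysctl -w net.core.rmem_max=262144     # example cap, in bytes
sysctl -w net.core.wmem_max=262144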

comment:68 Changed 7 months ago by cypherpunks

Collecting the puzzle pieces.

#18580

I'm 80-90% sure it's a bug in the way the Tor daemon interacts with unbound's behavior w/r/t large numbers of timing-out DNS queries. Unbound appears to be perfectly fine with the situation when it occurs. Tor daemon DNS queries lock up wholesale, thus preventing normal exit browsing behavior.

As a result: #21394
unbound
OOM in the end

comment:69 Changed 7 months ago by cypherpunks

Tor daemon DNS queries

I believe Tor handles DNS queries with mostly wrong/broken logic, because of bugs in tor's and libevent's (evdns) code, leading to large queues, errors, and timeouts. Anybody want to discuss/fix it?

comment:70 Changed 7 months ago by cypherpunks

Can confirm OOM kills on two nodes:

  • 1: Debian GNU/Linux 8.8 (jessie), 1G memory, Tor version 0.2.5.14, Exit
  • 2: Debian GNU/Linux 8.8 (jessie), 2G memory, Tor version 0.2.5.14, Non-Exit

No unbound running on either. OOM never occurred in the last 5 years, and now it suddenly happens? Seems very strange to me.

Side question: where did Tor Weather go?

comment:71 Changed 4 months ago by DeS

Long overdue update.
The problem disappeared in August. On one host there were a few occurrences in September.
What helped was:

MaxMemInQueues 1650MB
DisableAllSwap 1

It did not resolve the problem but made it much better. Later on, maybe due to a tor update, it seems to have mostly stopped.

comment:72 Changed 4 months ago by DeS

Resolution: worksforme
Status: new → closed

comment:73 Changed 2 months ago by cypherpunks

Resolution: worksforme
Status: closed → reopened

I think I have to re-open this ticket; it still happens almost daily to my instance, Tor 0.2.9.14-1 on Debian 9.3 (stretch). It's getting to the point where it's not only annoying to have it restart, but the OOM kills sometimes also affect other processes and daemons.

I've tried lowering the MaxMemInQueues setting until the OOM kills stop happening, but it's now at < 500 MB and the problem still occurs.

It would be really nice to solve the root cause of this.

comment:74 Changed 2 months ago by cypherpunks

Hi,
could you upgrade to 0.3.2.8-rc to test if the latest fixes solve your problem?
thanks!

comment:75 Changed 8 weeks ago by DeS

Can someone finally comment on whether this is the same thing? It certainly sounds like it.

https://lists.torproject.org/pipermail/tor-project/2017-December/001604.html

comment:76 Changed 8 weeks ago by teor

Resolution: fixed
Status: reopened → closed

There are many reasons for load spikes in Tor: HSDirs are one of them, millions of added clients are another.
Given the timing, it is unlikely that these have the same cause, but they may trigger similar bugs in Tor.

We are dealing with various bugs in a few different tickets.
Please:

  • Upgrade to tor 0.3.2.8-rc
  • Set MaxMemInQueues to your available memory, minus a few hundred megabytes
  • Give the tor user enough file descriptors (see the sketch below)

Then, if the problem persists, open a new ticket for any out of memory kill issues.
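
For the file-descriptor item, a hedged sketch on a systemd-based system (the tor@default unit name is an assumption from current Debian packaging; on non-systemd setups, ulimit -n or /etc/security/limits.conf plays the same role):

# systemctl edit tor@default  -> creates a drop-in override containing:
[Service]
LimitNOFILE=65536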

comment:77 in reply to:  74 Changed 7 weeks ago by cypherpunks

Replying to cypherpunks:

Hi,
could you upgrade to 0.3.2.8-rc to test if the latest fixes solve your problem?
thanks!

Done; it has been running stable for a couple of days now, so the problem seems to be fixed.

Would it be helpful to do a version bisection to see where the bug was fixed? The root cause still seems to be unknown, if I understood this correctly.

comment:78 Changed 7 weeks ago by starlight

please allow me to jump in here with some advice:

0.3.2.8-rc fixed the problem for you most likely due to #24666, which dramatically reduces memory consumption w/r/t DESTROY cells

however I see above an assumption that on a 4G machine MaxMemInQueues=1650MB is reasonable -- without a doubt this value is much too high

empirical observation indicates the daemon consumes 2x MaxMemInQueues once the actual maximum has been obtained (most likely due to sniper attacks)

therefore to truly harden tor daemon against OOM kills on a 4G machine, one should set MaxMemInQueues=1024MB

for further detail and additional tuning tips see #24737

comment:79 in reply to:  78 ; Changed 7 weeks ago by teor

Replying to starlight:

please allow me to jump in here with some advice:

0.3.2.8-rc fixed the problem for you most likely due to #24666, which dramatically reduces memory consumption w/r/t DESTROY cells

however I see above an assumption that on a 4G machine MaxMemInQueues=1650MB is reasonable -- without a doubt this value is much too high

empirical observation indicates the daemon consumes 2x MaxMemInQueues once the actual maximum has been obtained (most likely due to sniper attacks)

I'd still like to see someone repeat this analysis with 0.3.2.8-rc, and post the results to #24737.
It's going to be hard for us to close that ticket without any idea of the effect of our changes.

comment:80 in reply to:  79 ; Changed 7 weeks ago by starlight

Cc: starlight@… added

Replying to teor:

I'd still like to see someone repeat this analysis with 0.3.2.8-rc, and post the results to #24737.
It's going to be hard for us to close that ticket without any idea of the effect of our changes.

I'm not willing to run a newer version till one is declared LTS, but can say that even when my relay is not under attack memory consumption goes to 1.5G with the 1G max queue setting. Seems to me the 2x max queues memory consumption is a function of the overheads associated with tor daemon queues and related processing, including malloc slack space.

Anyone running a busy relay on an older/slower system and with MaxMemInQueues=1024MB can check /proc/<pid>/status to see how much memory is consumed. Be sure DisableAllSwap=1 is set and the queue limit is not higher since the point is to observe actual memory consumed relative to a limit likely to be approached under normal operation.
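
A minimal sketch of such a check (the fields are the standard /proc/<pid>/status entries; the loop handles multiple tor instances):

for pid in $(pidof tor); do
    echo "== tor pid $pid =="
    grep -E 'VmPeak|VmRSS|VmSwap' /proc/$pid/status
done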

Another idea is to add an option to the daemon to cause queue memory preallocation. This would be a nice hardening feature as it will reduce malloc() calls issued under stress, and of course would allow more accurate estimates of worst-case memory consumption. If OOM strikes with preallocated queues that would indicate memory leakage.

comment:81 Changed 7 weeks ago by starlight

Also allow me to point out that memory is cheap. It is better to set the queue max conservatively (or buy more memory) than to experience OOM crashes.

comment:82 in reply to:  80 Changed 7 weeks ago by teor

Replying to starlight:

Replying to teor:

I'd still like to see someone repeat this analysis with 0.3.2.8-rc, and post the results to #24737.
It's going to be hard for us to close that ticket without any idea of the effect of our changes.

I'm not willing to run a newer version till one is declared LTS, but can say that even when my relay is not under attack memory consumption goes to 1.5G with the 1G max queue setting. Seems to me the 2x max queues memory consumption is a function of the overheads associated with tor daemon queues and related processing, including malloc slack space.

Anyone running a busy relay on an older/slower system and with MaxMemInQueues=1024MB can check /proc/<pid>/status to see how much memory is consumed. Be sure DisableAllSwap=1 is set and the queue limit is not higher since the point is to observe actual memory consumed relative to a limit likely to be approached under normal operation.

Another idea is to add an option to the daemon to cause queue memory preallocation. This would be a nice hardening feature as it will reduce malloc() calls issued under stress, and of course would allow more accurate estimates of worst-case memory consumption. If OOM strikes with preallocated queues that would indicate memory leakage.

I have answered this on #24737, because it will get lost on a closed ticket.
Please move all future discussion there.
