#22255 closed defect (fixed)
Frequent OOM kills of tor process
Reported by: | DeS | Owned by: | |
---|---|---|---|
Priority: | High | Milestone: | Tor: 0.3.0.x-final |
Component: | Core Tor/Tor | Version: | |
Severity: | Major | Keywords: | |
Cc: | starlight@… | Actual Points: | |
Parent ID: | Points: | ||
Reviewer: | Sponsor: |
Description
We are currently experiencing frequent out-of-memory kills of our tor processes on two boxes running Ubuntu 14.04.
```
root@tor1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty
```
We experience this on following versions:
0.3.0.5-rc-1~trusty+1 and 0.2.9.10-1~trusty+1
I switched to the rc after the regular 0.2.9.10-1 version kept having these problems. The 0.3.0.5-rc has the same problems.
From syslog
May 14 10:21:42 tor1 kernel: [123517.565688] UDP: bad checksum. From 82.247.214.125:51413 to 176.10.104.240:21446 ulen 38 May 14 11:02:47 tor1 kernel: [125982.993268] tor invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0 May 14 11:02:47 tor1 kernel: [125982.993270] tor cpuset=/ mems_allowed=0 May 14 11:02:47 tor1 kernel: [125982.993272] CPU: 2 PID: 33180 Comm: tor Not tainted 3.13.0-117-generic !#164-Ubuntu May 14 11:02:47 tor1 kernel: [125982.993273] Hardware name: HP !ProLiant ML310e Gen8 v2, BIOS P78 03/28/2014 May 14 11:02:47 tor1 kernel: [125982.993274] 0000000000000000 ffff8800a8679990 ffffffff8172d1c9 ffff88013c868000 May 14 11:02:47 tor1 kernel: [125982.993276] 0000000000000000 ffff8800a8679a18 ffffffff81727768 ffffffff8106a926 May 14 11:02:47 tor1 kernel: [125982.993278] ffff8800a86799f0 ffffffff810cb27c 0000000001320122 0000000000000000 May 14 11:02:47 tor1 kernel: [125982.993279] Call Trace: May 14 11:02:47 tor1 kernel: [125982.993285] [<ffffffff8172d1c9>] dump_stack+0x64/0x82 May 14 11:02:47 tor1 kernel: [125982.993286] [<ffffffff81727768>] dump_header+0x7f/0x1f1 May 14 11:02:47 tor1 kernel: [125982.993289] [<ffffffff8106a926>] ? put_online_cpus+0x56/0x80 May 14 11:02:47 tor1 kernel: [125982.993291] [<ffffffff810cb27c>] ? rcu_oom_notify+0xcc/0xf0 May 14 11:02:47 tor1 kernel: [125982.993294] [<ffffffff81156d81>] oom_kill_process+0x201/0x360 May 14 11:02:47 tor1 kernel: [125982.993297] [<ffffffff812dde35>] ? security_capable_noaudit+0x15/0x20 May 14 11:02:47 tor1 kernel: [125982.993299] [<ffffffff81157511>] out_of_memory+0x471/0x4b0 May 14 11:02:47 tor1 kernel: [125982.993301] [<ffffffff8115d82c>] !__alloc_pages_nodemask+0xa6c/0xb90 May 14 11:02:47 tor1 kernel: [125982.993304] [<ffffffff8119e25a>] alloc_pages_vma+0x9a/0x140 May 14 11:02:47 tor1 kernel: [125982.993307] [<ffffffff811908eb>] read_swap_cache_async+0xeb/0x160 May 14 11:02:47 tor1 kernel: [125982.993308] [<ffffffff811909f8>] swapin_readahead+0x98/0xe0 May 14 11:02:47 tor1 kernel: [125982.993320] [<ffffffff8117e90b>] handle_mm_fault+0xa7b/0xf10 May 14 11:02:47 tor1 kernel: [125982.993322] [<ffffffff81739344>] !__do_page_fault+0x184/0x560 May 14 11:02:47 tor1 kernel: [125982.993325] [<ffffffff810a3af5>] ? set_next_entity+0x95/0xb0 May 14 11:02:47 tor1 kernel: [125982.993328] [<ffffffff810135db>] ? 
!__switch_to+0x16b/0x4f0 May 14 11:02:47 tor1 kernel: [125982.993329] [<ffffffff8173973a>] do_page_fault+0x1a/0x70 May 14 11:02:47 tor1 kernel: [125982.993331] [<ffffffff81735a68>] page_fault+0x28/0x30 May 14 11:02:47 tor1 kernel: [125982.993350] Mem-Info: May 14 11:02:47 tor1 kernel: [125982.993351] Node 0 DMA per-cpu: May 14 11:02:47 tor1 kernel: [125982.993352] CPU 0: hi: 0, btch: 1 usd: 0 May 14 11:02:47 tor1 kernel: [125982.993353] CPU 1: hi: 0, btch: 1 usd: 0 May 14 11:02:47 tor1 kernel: [125982.993353] CPU 2: hi: 0, btch: 1 usd: 0 May 14 11:02:47 tor1 kernel: [125982.993354] CPU 3: hi: 0, btch: 1 usd: 0 May 14 11:02:47 tor1 kernel: [125982.993355] Node 0 DMA32 per-cpu: May 14 11:02:47 tor1 kernel: [125982.993356] CPU 0: hi: 186, btch: 31 usd: 25 May 14 11:02:47 tor1 kernel: [125982.993356] CPU 1: hi: 186, btch: 31 usd: 2 May 14 11:02:47 tor1 kernel: [125982.993357] CPU 2: hi: 186, btch: 31 usd: 0 May 14 11:02:47 tor1 kernel: [125982.993358] CPU 3: hi: 186, btch: 31 usd: 26 May 14 11:02:47 tor1 kernel: [125982.993358] Node 0 Normal per-cpu: May 14 11:02:47 tor1 kernel: [125982.993359] CPU 0: hi: 186, btch: 31 usd: 0 May 14 11:02:47 tor1 kernel: [125982.993360] CPU 1: hi: 186, btch: 31 usd: 0 May 14 11:02:47 tor1 kernel: [125982.993360] CPU 2: hi: 186, btch: 31 usd: 0 May 14 11:02:47 tor1 kernel: [125982.993361] CPU 3: hi: 186, btch: 31 usd: 0 May 14 11:02:47 tor1 kernel: [125982.993363] active_!anon:0 inactive_!anon:0 isolated_!anon:0 May 14 11:02:47 tor1 kernel: [125982.993363] active_!file:141 inactive_!file:204 isolated_!file:0 May 14 11:02:47 tor1 kernel: [125982.993363] !unevictable:0 !dirty:0 !writeback:0 !unstable:0 May 14 11:02:47 tor1 kernel: [125982.993363] !free:20517 slab_!reclaimable:7176 slab_!unreclaimable:41948 May 14 11:02:47 tor1 kernel: [125982.993363] !mapped:80 !shmem:0 !pagetables:692 !bounce:0 May 14 11:02:47 tor1 kernel: [125982.993363] free_!cma:0 May 14 11:02:47 tor1 kernel: [125982.993365] Node 0 DMA !free:15344kB !min:268kB !low:332kB !high:400kB active_!anon:0kB inactive_!anon:0kB active_!file:4kB inactive_!file:28kB !unevictable:0kB isolated(anon):0kB isolated(file):0kB !present:15976kB !managed:15892kB !mlocked:0kB !dirty:0kB !writeback:8kB !mapped:36kB !shmem:0kB slab_!reclaimable:0kB slab_!unreclaimable:156kB kernel_!stack:0kB !pagetables:4kB !unstable:0kB !bounce:0kB free_!cma:0kB writeback_!tmp:0kB pages_!scanned:18 all_unreclaimable? no May 14 11:02:47 tor1 kernel: [125982.993368] lowmem_reserve[]: 0 2688 3771 3771 May 14 11:02:47 tor1 kernel: [125982.993370] Node 0 DMA32 !free:51080kB !min:46532kB !low:58164kB !high:69796kB active_!anon:0kB inactive_!anon:0kB active_!file:516kB inactive_!file:788kB !unevictable:0kB isolated(anon):0kB isolated(file):0kB !present:2832272kB !managed:2753204kB !mlocked:0kB !dirty:0kB !writeback:0kB !mapped:284kB !shmem:0kB slab_!reclaimable:15612kB slab_!unreclaimable:83432kB kernel_!stack:312kB !pagetables:2064kB !unstable:0kB !bounce:0kB free_!cma:0kB writeback_!tmp:0kB pages_!scanned:17 all_unreclaimable? 
no May 14 11:02:47 tor1 kernel: [125982.993372] lowmem_reserve[]: 0 0 1082 1082 May 14 11:02:47 tor1 kernel: [125982.993373] Node 0 Normal !free:15644kB !min:18732kB !low:23412kB !high:28096kB active_!anon:0kB inactive_!anon:0kB active_!file:44kB inactive_!file:0kB !unevictable:0kB isolated(anon):0kB isolated(file):0kB !present:1179644kB !managed:1108484kB !mlocked:0kB !dirty:0kB !writeback:0kB !mapped:0kB !shmem:0kB slab_!reclaimable:13092kB slab_!unreclaimable:84204kB kernel_!stack:2072kB !pagetables:700kB !unstable:0kB !bounce:0kB free_!cma:0kB writeback_!tmp:0kB pages_!scanned:203 all_unreclaimable? yes May 14 11:02:47 tor1 kernel: [125982.993376] lowmem_reserve[]: 0 0 0 0 May 14 11:02:47 tor1 kernel: [125982.993377] Node 0 DMA: 3*4kB (M) 3*8kB (UM) 2*16kB (UM) 0*32kB 1*64kB (U) 1*128kB (M) 1*256kB (M) 1*512kB (M) 2*1024kB (UM) 2*2048kB (MR) 2*4096kB (M) = 15364kB May 14 11:02:47 tor1 kernel: [125982.993384] Node 0 DMA32: 303*4kB (UE) 662*8kB (UEM) 418*16kB (UEM) 952*32kB (UEM) 51*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 51148kB May 14 11:02:47 tor1 kernel: [125982.993394] Node 0 Normal: 375*4kB (UEM) 330*8kB (UEM) 649*16kB (UEM) 3*32kB (M) 2*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 0*1024kB 0*2048kB 0*4096kB = 15644kB May 14 11:02:47 tor1 kernel: [125982.993401] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB May 14 11:02:47 tor1 kernel: [125982.993401] 385 total pagecache pages May 14 11:02:47 tor1 kernel: [125982.993402] 20 pages in swap cache May 14 11:02:47 tor1 kernel: [125982.993403] Swap cache stats: add 2432647, delete 2432627, find 700250/1142133 May 14 11:02:47 tor1 kernel: [125982.993404] Free swap = 3627940kB May 14 11:02:47 tor1 kernel: [125982.993404] Total swap = 4026364kB May 14 11:02:47 tor1 kernel: [125982.993405] 1006973 pages RAM May 14 11:02:47 tor1 kernel: [125982.993405] 0 pages !HighMem/MovableOnly May 14 11:02:47 tor1 kernel: [125982.993406] 17790 pages reserved May 14 11:02:47 tor1 kernel: [125982.993406] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name May 14 11:02:47 tor1 kernel: [125982.993422] ![32898] 0 32898 5073 0 13 251 0 upstart-udev-br May 14 11:02:47 tor1 kernel: [125982.993423] ![32903] 0 32903 12900 1 27 228 -1000 systemd-udevd May 14 11:02:47 tor1 kernel: [125982.993425] ![33049] 0 33049 3821 0 11 48 0 upstart-file-br May 14 11:02:47 tor1 kernel: [125982.993426] ![33092] 102 33092 9813 1 23 95 0 dbus-daemon May 14 11:02:47 tor1 kernel: [125982.993436] ![33096] 101 33096 63962 0 28 722 0 rsyslogd May 14 11:02:47 tor1 kernel: [125982.993437] ![33119] 0 33119 10864 1 26 88 0 systemd-logind May 14 11:02:47 tor1 kernel: [125982.993439] ![33206] 0 33206 3817 0 12 55 0 upstart-socket- May 14 11:02:47 tor1 kernel: [125982.993440] ![33053] 0 33053 3699 1 12 41 0 getty May 14 11:02:47 tor1 kernel: [125982.993441] ![33056] 0 33056 3699 1 12 41 0 getty May 14 11:02:47 tor1 kernel: [125982.993442] ![33061] 0 33061 3699 1 12 39 0 getty May 14 11:02:47 tor1 kernel: [125982.993443] ![33062] 0 33062 3699 1 11 40 0 getty May 14 11:02:47 tor1 kernel: [125982.993444] ![33064] 0 33064 3699 1 12 39 0 getty May 14 11:02:47 tor1 kernel: [125982.993446] ![33079] 0 33079 1094 0 8 35 0 acpid May 14 11:02:47 tor1 kernel: [125982.993447] ![33089] 0 33089 15347 1 33 171 -1000 sshd May 14 11:02:47 tor1 kernel: [125982.993448] ![33094] 0 33094 5915 0 17 81 0 cron May 14 11:02:47 tor1 kernel: [125982.993449] ![33095] 0 33095 4786 0 12 48 0 atd May 14 11:02:47 tor1 kernel: [125982.993450] 
![33155] 0 33155 4801 34 13 73 0 irqbalance May 14 11:02:47 tor1 kernel: [125982.993457] ![33180] 106 33180 111003 0 152 50520 0 tor May 14 11:02:47 tor1 kernel: [125982.993458] ![33317] 108 33317 20352 0 41 9936 0 unbound May 14 11:02:47 tor1 kernel: [125982.993459] ![33369] 105 33369 1871 12 9 33 0 vnstatd May 14 11:02:47 tor1 kernel: [125982.993460] ![32890] 0 32890 3699 1 11 40 0 getty May 14 11:02:47 tor1 kernel: [125982.993461] ![32991] 1002 32991 2867 0 10 84 0 memlog.sh May 14 11:02:47 tor1 kernel: [125982.993462] ![33057] 106 33057 97196 0 128 36538 0 tor May 14 11:02:47 tor1 kernel: [125982.993464] ![57323] 1002 57323 2867 0 10 91 0 memlog.sh May 14 11:02:47 tor1 kernel: [125982.993465] ![57325] 1002 57325 2218 78 10 32 0 grep May 14 11:02:47 tor1 kernel: [125982.993466] ![57326] 1002 57326 2746 44 11 46 0 sed May 14 11:02:47 tor1 kernel: [125982.993467] Out of memory: Kill process 33180 (tor) score 25 or sacrifice child May 14 11:02:47 tor1 kernel: [125982.993478] Killed process 33180 (tor) !total-vm:444012kB, !anon-rss:0kB, !file-rss:0kB
notices.log does not show anything around the kill time. The first record is the restart of the process at 20:32.
```
May 14 06:54:03.000 [notice] Tor 0.3.0.5-rc (git-61f68662421142d2) opening new log file.
May 14 07:13:01.000 [warn] eventdns: All nameservers have failed
May 14 07:13:01.000 [notice] eventdns: Nameserver 127.0.0.1:53 is back up
May 14 07:47:36.000 [warn] eventdns: All nameservers have failed
May 14 07:47:36.000 [notice] eventdns: Nameserver 127.0.0.1:53 is back up
May 14 08:10:38.000 [warn] eventdns: All nameservers have failed
May 14 08:10:38.000 [notice] eventdns: Nameserver 127.0.0.1:53 is back up
May 14 08:12:28.000 [warn] eventdns: All nameservers have failed
May 14 08:12:28.000 [notice] eventdns: Nameserver 127.0.0.1:53 is back up
May 14 08:20:08.000 [warn] eventdns: All nameservers have failed
May 14 08:20:09.000 [notice] eventdns: Nameserver 127.0.0.1:53 is back up
May 14 20:23:18.000 [notice] Tor 0.3.0.5-rc (git-61f68662421142d2) opening log file.
May 14 20:23:18.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
May 14 20:23:18.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
May 14 20:23:18.000 [notice] Configured to measure statistics. Look for the *-stats files that will first be written to the data directory in 24 hours from now.
```
We have been operating the * exit servers for years, and they became unstable about a week ago. No changes other than the normal software updates were performed.
Other than the two tor processes, only a local unbound resolver and ssh are running.
It feels like a memory leak. Since tor is the process getting killed, I assume the problem is in the tor software. Any thoughts?
Child Tickets
Attachments (27)
Change History (109)
comment:1 Changed 22 months ago by
Summary: | Frequent OOM Kill of tor precess → Frequent OOM kills of tor precess |
---|
comment:2 Changed 22 months ago by
comment:3 Changed 22 months ago by
Milestone: | → Tor: 0.3.0.x-final |
---|
comment:4 Changed 22 months ago by
How much memory does the system have? I'm having trouble reading the log lines for it, but it looks like maybe not very much?
comment:5 follow-ups: 6 7 Changed 22 months ago by
The system has about 4 GB memory:
# cat /proc/meminfo
MemTotal: 3877580 kB
MemFree: 2250280 kB
We have been running this configuration since mid-2015.
No, we have not tried the MaxMemInQueues setting yet. It is unset, therefore Tor should pick a proper configuration based on the physically available memory, shouldn't it?
Many thanks for your support!
comment:6 Changed 22 months ago by
Replying to ryru:
we have not tried the MaxMemInQueues setting yet. It is unset, therefore Tor should pick a proper configuration based on the physically available memory, shouldn't it?
There's a notice-level log when Tor starts up, telling you what value it picked, e.g.
May 10 03:34:40.480 [notice] Based on detected system memory, MaxMemInQueues is set to 8192 MB. You can override this by setting MaxMemInQueues by hand.
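If the relay has been running for a while, that line can also be pulled out of the existing log. A minimal sketch, assuming notices go to a log file (the path is only an example; use whatever your torrc's Log line points at):

```
grep "MaxMemInQueues is set" /var/log/tor/notices.log
```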
comment:7 Changed 22 months ago by
Replying to ryru:
The system has about 4 GB memory:
It looks from your earlier log like it takes 12 hours or something before it dies?
Can you track the memory size over that 12 hours? That is, is it fine until suddenly everything goes bad, or does it slowly grow?
comment:8 Changed 22 months ago by
I uploaded the output of a script I made. It shows memory usage as reported by the following commands and runs every 60 seconds.
```
#!/bin/bash -e
export LC_ALL=C
FILE=/home/user/mem.log
echo " date time $(free -m | grep total | sed -E 's/^ (.*)/\1/g')" >>$FILE
while true; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(/usr/bin/free -m | grep Mem: | sed 's/Mem://g')" >>$FILE
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(/usr/bin/free -m | grep Swap: | sed 's/Swap://g')" >>$FILE
    sleep 60
done
```
You can see it starts 3 days ago. The usage is quite stable at around 1563 MB.
If you fast-forward to line 4448 of mem.log, which is about one hour before the documented OOM, you will see a fast increase of memory usage to 3706 MB before the process is killed.
Many thanks for your support and help
comment:9 Changed 22 months ago by
Component: | Core Tor → Core Tor/Tor |
---|
comment:10 Changed 22 months ago by
Before the problems occurred, the following Ubuntu updates were installed:
```
2017-05-06 06:50:53 status installed man-db:amd64 2.6.7.1-1ubuntu1
2017-05-06 06:50:53 status installed login:amd64 1:4.1.5.1-1ubuntu9.4
2017-05-06 06:50:53 status installed man-db:amd64 2.6.7.1-1ubuntu1
2017-05-06 06:50:53 status installed ureadahead:amd64 0.100.0-16
2017-05-06 06:50:53 status installed passwd:amd64 1:4.1.5.1-1ubuntu9.4
2017-05-06 13:29:10 status installed install-info:amd64 5.2.0.dfsg.1-2
2017-05-06 13:29:11 status installed man-db:amd64 2.6.7.1-1ubuntu1
2017-05-06 13:29:11 status installed bash:amd64 4.3-7ubuntu1.6
2017-05-10 06:26:21 status installed librtmp0:amd64 2.4+20121230.gitdf6c518-1ubuntu0.1
2017-05-10 06:26:21 status installed libfreetype6:amd64 2.5.2-1ubuntu2.8
2017-05-10 06:26:22 status installed libc-bin:amd64 2.19-0ubuntu6.11
```
Do you think one of these updated packages could have triggered this behaviour?
comment:11 follow-up: 13 Changed 21 months ago by
I found this part interesting:
```
3159 2017-05-13 23:22:01 3931 0 3931
3160 2017-05-13 23:23:01 3786 1852 1934 0 66 518
3161 2017-05-13 23:23:01 3931 0 3931
3162 2017-05-13 23:24:01 3786 2217 1569 0 66 518
3163 2017-05-13 23:24:01 3931 0 3931
3164 2017-05-13 23:25:01 3786 2335 1451 0 66 518
3165 2017-05-13 23:25:01 3931 0 3931
3166 2017-05-13 23:26:01 3786 2527 1259 0 66 496
3167 2017-05-13 23:26:01 3931 0 3931
3168 2017-05-13 23:27:01 3786 2952 833 0 66 496
3169 2017-05-13 23:27:01 3931 0 3931
3170 2017-05-13 23:28:01 3786 3020 766 0 66 496
3171 2017-05-13 23:28:01 3931 0 3931
3172 2017-05-13 23:29:01 3786 3110 675 0 66 496
3173 2017-05-13 23:29:01 3931 0 3931
3174 2017-05-13 23:30:01 3786 3023 763 0 66 496
3175 2017-05-13 23:30:01 3931 0 3931
3176 2017-05-13 23:31:01 3786 2694 1091 0 66 496
3177 2017-05-13 23:31:01 3931 0 3931
3178 2017-05-13 23:32:01 3786 2338 1448 0 66 496
3179 2017-05-13 23:32:01 3931 0 3931
3180 2017-05-13 23:33:01 3786 1893 1893 0 66 496
3181 2017-05-13 23:33:01 3931 0 3931
3182 2017-05-13 23:34:01 3786 1614 2171 0 66 496
3183 2017-05-13 23:34:01 3931 0 3931
```
It looks like memory load went up, but not enough to kill it, and then went down again on its own.
So it looks like it is not always following the pattern of "grow forever until it's too big".
comment:12 Changed 21 months ago by
Summary: | Frequent OOM kills of tor precess → Frequent OOM kills of tor process |
---|
comment:13 Changed 21 months ago by
I checked this time slot. No OOM kill there, so you are right: the process is not always killed.
However, yesterday we had 4-5 OOMs on this machine with two tor processes.
Replying to arma:
I found this part interesting:
[mem.log excerpt quoted above]
It looks like memory load went up, but not enough to kill it, and then went down again on its own.
So it looks like it is not always following the pattern of "grow forever until it's too big".
comment:14 Changed 21 months ago by
I added this line to the torrc of both processes:
MaxMemInQueues 3000MB
Since then (about 13 hours ago) everything has been stable. Let's see how it works out.
comment:15 Changed 21 months ago by
Despite the MaxMemInQueues measure, we still experience the problem that memory usage starts rising until the tor process gets killed (OOM).
In the last 24 hours such an OOM took place 4 times on our two servers, each running 2 tor processes.
I also had a quick discussion with another operator based on this tweet https://twitter.com/FrennVunDerEnn/status/864583496072876034
It seems they experience spontaneous rises of memory consumption as well.
comment:16 Changed 21 months ago by
I experience the same problem on 3 non-exit relays. It started on 2017-05-18 (running Debian 8, 0.3.0.5-rc at the time).
comment:17 Changed 21 months ago by
Please see the tor-relays@… mailing list.
Several exit operators suffer from this problem.
comment:18 Changed 21 months ago by
Is there any commonality between versions reporting this issue? Is it all 0.3.0? Is it all one operating system?
Is anybody able to build with some heap profiling tool (such as the one in gperftools) to see where all the memory is going?
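One possible way to do that with the gperftools heap profiler is sketched below; the package names, paths, and configure flags are assumptions about a typical Debian/Ubuntu build, not a documented procedure, and may need adjusting:

```
# install gperftools (assumed Debian/Ubuntu package names)
apt-get install libgoogle-perftools-dev google-perftools

# link tor against tcmalloc so the heap profiler is available
./autogen.sh
./configure CFLAGS="-g" LIBS="-ltcmalloc"
make

# run with heap profiling; writes /tmp/tor-heap.NNNN.heap dumps
HEAPPROFILE=/tmp/tor-heap src/or/tor -f /etc/tor/torrc

# later, inspect where the memory went
google-pprof --text src/or/tor /tmp/tor-heap.0001.heap
```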
comment:19 follow-up: 20 Changed 21 months ago by
Priority: | Medium → High |
---|---|
Severity: | Normal → Major |
comment:20 Changed 21 months ago by
Replying to nickm:
Our system is Ubuntu 14.04.5 LTS, and we tried 0.3.0.5-rc-1~trusty+1 and 0.2.9.10-1~trusty+1.
The feedback from the mailing list was:
Debian Jessie with tor-0.2.9.10
Niftybunny reported the same problems with 0.3.0.5; upgrading to 0.3.0.7 helped on most relays.
(Roger remarked that there would not be a relevant change between the 0.3.0.5 and 0.3.0.7 versions.)
So it seems to be the same picture across Debian/Ubuntu, and both 0.2.9.x and 0.3.x are affected.
comment:21 Changed 21 months ago by
I will try to get valgrind running with the tor process as described by Roger on the mailing list and report back.
MaxMemInQueues 3000MB does not seem to be a solution: I'm down to 2000MB for two processes on a machine with 4 GB physical memory and there has been absolutely no change so far.
comment:22 follow-up: 28 Changed 21 months ago by
Replying to DeS:
May 14 11:02:47 tor1 kernel: [125982.993478] Killed process 33180 (tor) total-vm:444012kB, anon-rss:0kB, file-rss:0kB
This part has been bothering me. Am I reading that right, that the tor process it kills is using 444 megabytes of memory? That is way less than 4 gigs.
Changed 21 months ago by
Attachment: | valgrind.txt added |
---|
comment:23 follow-up: 24 Changed 21 months ago by
As proposed by arma, I ran this command on a freshly compiled alpha tor:
nohup valgrind --leak-check=yes --error-limit=no --undef-value-errors=no src/or/tor geoipfile src/config/geoip -f /etc/tor/torrc2 &>valgrind.out &
Somewhere during the last day this process stopped for no known reason. I attach the output:
https://trac.torproject.org/projects/tor/attachment/ticket/22255/valgrind.txt
Hope that helps.
Frankly, our tor service is almost unmaintainable anymore. I will try one last thing and check for the processes by script and restart them. But this problem is serious.
comment:24 follow-up: 25 Changed 21 months ago by
Replying to DeS:
Somewhere during the last day this process stopped for no known reason. I attach the output:
https://trac.torproject.org/projects/tor/attachment/ticket/22255/valgrind.txt
Awesome, you found a bug! I filed it as #22368.
Alas, it isn't enough to explain the OOM issues on this ticket. Once we've fixed that ticket, I'll ask everybody to run the valgrinds again and see what comes up next.
Thanks!
comment:25 Changed 21 months ago by
Replying to arma:
Once we've fixed that ticket, I'll ask everybody to run the valgrinds again and see what comes up next.
Oh, if you are excited to get going before that: trying valgrind on the "maint-0.3.0" branch will be more stable, but even more interesting if we find bugs in it.
To get to it, after the "git clone" line, do a "git checkout maint-0.3.0" and then proceed like before.
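Put together, the steps would look roughly like this; the clone URL and valgrind invocation follow what was used earlier in this ticket, and your configure options and torrc path may differ:

```
git clone https://git.torproject.org/tor.git
cd tor
git checkout maint-0.3.0
./autogen.sh && ./configure && make

nohup valgrind --leak-check=yes --error-limit=no --undef-value-errors=no \
    src/or/tor -f /etc/tor/torrc2 &> valgrind.out &
```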
comment:26 Changed 21 months ago by
Thanks for your efforts.
I'm not technical enough to read anything out of the valgrind output, and I don't really understand the defects you mention.
Nevertheless, if you need another valgrind test run I can do one.
But our relays have gone up and down so much that I want to give them some "recovery" time. As I write this, another OOM kill took place and the relay went from around 400 Mbit/s (3 weeks ago) down to 40 Mbit/s.
Still, my only solution to the problem is to check whether the processes are still around - and if not, start them.
So if it is of real benefit to you, let me know (you have my email from the mailing list) and I will do another test run.
comment:27 Changed 21 months ago by
Following teor's advice on the tor-relays@… mailing list, I set the following values:
DirPort 0
DirCache 0
comment:28 Changed 21 months ago by
Replying to arma:
Replying to DeS:
May 14 11:02:47 tor1 kernel: [125982.993478] Killed process 33180 (tor) total-vm:444012kB, anon-rss:0kB, file-rss:0kB
This part has been bothering me. Am I reading that right, that the tor process it kills is using 444 megabytes of memory? That is way less than 4 gigs.
DeS, have you explored this one? A fast Tor relay will use way more than 444 megabytes of memory, so of course it will be dying if that's all it can have. What else is happening on the system such that it can't get any more memory?
Changed 21 months ago by
Attachment: | Screenshot_2017-05-26_09-56-09.png added |
---|
comment:29 Changed 21 months ago by
Obviously a process will die if it can't get sufficient memory.
But as you saw in the memory logs, the normal usage is way below 400 MB.
And our exits are limited in regards to bandwidth, so we never really exceed 4xx Mbit/s.
But sometimes, as you saw in the memlog, memory consumption starts rising fast - ending with the kill or a completely unresponsive machine.
What we are looking for is the cause of the rising memory consumption.
To repeat: these machines have been running for 1-2 years without such problems, and the problem started occurring a few weeks ago. As we have seen on the mailing list, it is also not just these servers - otherwise I would just set them up freshly.
dirk@tor2:~$ ps aux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 0.0 0.0 33432 1528 ? Ss Mai22 0:00 /sbin/init root 2 0.0 0.0 0 0 ? S Mai22 0:00 [kthreadd] root 3 0.1 0.0 0 0 ? S Mai22 9:18 [ksoftirqd/0] root 5 0.0 0.0 0 0 ? S< Mai22 0:00 [kworker/0:0H] root 7 0.2 0.0 0 0 ? S Mai22 16:50 [rcu_sched] root 8 0.1 0.0 0 0 ? S Mai22 7:02 [rcuos/0] root 9 0.1 0.0 0 0 ? S Mai22 7:16 [rcuos/1] root 10 0.1 0.0 0 0 ? S Mai22 6:50 [rcuos/2] root 11 0.1 0.0 0 0 ? S Mai22 7:25 [rcuos/3] root 12 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/4] root 13 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/5] root 14 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/6] root 15 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/7] root 16 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/8] root 17 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/9] root 18 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/10] root 19 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/11] root 20 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/12] root 21 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/13] root 22 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/14] root 23 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/15] root 24 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/16] root 25 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/17] root 26 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/18] root 27 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/19] root 28 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/20] root 29 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/21] root 30 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/22] root 31 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/23] root 32 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/24] root 33 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/25] root 34 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/26] root 35 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/27] root 36 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/28] root 37 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/29] root 38 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/30] root 39 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/31] root 40 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/32] root 41 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/33] root 42 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/34] root 43 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/35] root 44 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/36] root 45 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/37] root 46 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/38] root 47 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/39] root 48 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/40] root 49 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/41] root 50 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/42] root 51 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/43] root 52 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/44] root 53 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/45] root 54 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/46] root 55 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/47] root 56 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/48] root 57 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/49] root 58 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/50] root 59 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/51] root 60 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/52] root 61 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/53] root 62 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/54] root 63 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/55] root 64 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/56] root 65 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/57] root 66 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/58] root 67 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/59] root 68 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/60] root 69 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/61] root 70 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/62] root 71 0.0 0.0 0 0 ? S Mai22 0:00 [rcuos/63] root 72 0.0 0.0 0 0 ? S Mai22 0:00 [rcu_bh] root 73 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/0] root 74 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/1] root 75 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/2] root 76 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/3] root 77 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/4] root 78 0.0 0.0 0 0 ? 
S Mai22 0:00 [rcuob/5] root 79 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/6] root 80 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/7] root 81 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/8] root 82 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/9] root 83 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/10] root 84 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/11] root 85 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/12] root 86 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/13] root 87 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/14] root 88 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/15] root 89 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/16] root 90 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/17] root 91 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/18] root 92 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/19] root 93 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/20] root 94 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/21] root 95 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/22] root 96 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/23] root 97 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/24] root 98 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/25] root 99 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/26] root 100 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/27] root 101 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/28] root 102 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/29] root 103 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/30] root 104 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/31] root 105 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/32] root 106 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/33] root 107 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/34] root 108 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/35] root 109 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/36] root 110 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/37] root 111 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/38] root 112 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/39] root 113 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/40] root 114 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/41] root 115 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/42] root 116 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/43] root 117 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/44] root 118 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/45] root 119 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/46] root 120 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/47] root 121 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/48] root 122 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/49] root 123 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/50] root 124 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/51] root 125 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/52] root 126 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/53] root 127 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/54] root 128 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/55] root 129 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/56] root 130 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/57] root 131 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/58] root 132 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/59] root 133 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/60] root 134 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/61] root 135 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/62] root 136 0.0 0.0 0 0 ? S Mai22 0:00 [rcuob/63] root 137 0.0 0.0 0 0 ? S Mai22 0:00 [migration/0] root 138 0.0 0.0 0 0 ? S Mai22 0:01 [watchdog/0] root 139 0.0 0.0 0 0 ? S Mai22 0:00 [watchdog/1] root 140 0.0 0.0 0 0 ? S Mai22 0:00 [migration/1] root 141 0.0 0.0 0 0 ? S Mai22 4:14 [ksoftirqd/1] root 143 0.0 0.0 0 0 ? S< Mai22 0:00 [kworker/1:0H] root 144 0.0 0.0 0 0 ? S Mai22 0:01 [watchdog/2] root 145 0.0 0.0 0 0 ? S Mai22 0:00 [migration/2] root 146 0.1 0.0 0 0 ? S Mai22 8:21 [ksoftirqd/2] root 148 0.0 0.0 0 0 ? S< Mai22 0:00 [kworker/2:0H] root 149 0.0 0.0 0 0 ? S Mai22 0:01 [watchdog/3] root 150 0.0 0.0 0 0 ? S Mai22 0:00 [migration/3] root 151 0.0 0.0 0 0 ? S Mai22 3:46 [ksoftirqd/3] root 153 0.0 0.0 0 0 ? S< Mai22 0:00 [kworker/3:0H] root 154 0.0 0.0 0 0 ? S< Mai22 0:00 [khelper] root 155 0.0 0.0 0 0 ? S Mai22 0:00 [kdevtmpfs] root 156 0.0 0.0 0 0 ? 
S< Mai22 0:00 [netns] root 157 0.0 0.0 0 0 ? S< Mai22 0:00 [writeback] root 158 0.0 0.0 0 0 ? S< Mai22 0:00 [kintegrityd] root 159 0.0 0.0 0 0 ? S< Mai22 0:00 [bioset] root 160 0.0 0.0 0 0 ? S< Mai22 0:00 [kworker/u129:0] root 161 0.0 0.0 0 0 ? S< Mai22 0:00 [kblockd] root 162 0.0 0.0 0 0 ? S< Mai22 0:00 [ata_sff] root 163 0.0 0.0 0 0 ? S Mai22 0:00 [khubd] root 164 0.0 0.0 0 0 ? S< Mai22 0:00 [md] root 165 0.0 0.0 0 0 ? S< Mai22 0:00 [devfreq_wq] root 166 0.3 0.0 0 0 ? R Mai22 20:23 [kworker/0:1] root 168 0.0 0.0 0 0 ? S Mai22 0:00 [khungtaskd] root 169 2.3 0.0 0 0 ? S Mai22 154:44 [kswapd0] root 170 0.0 0.0 0 0 ? S< Mai22 0:00 [vmstat] root 171 0.0 0.0 0 0 ? SN Mai22 0:00 [ksmd] root 172 0.0 0.0 0 0 ? SN Mai22 0:02 [khugepaged] root 173 0.0 0.0 0 0 ? S Mai22 0:00 [fsnotify_mark] root 174 0.0 0.0 0 0 ? S Mai22 0:00 [ecryptfs-kthrea] root 175 0.0 0.0 0 0 ? S< Mai22 0:00 [crypto] root 187 0.0 0.0 0 0 ? S< Mai22 0:00 [kthrotld] root 189 0.2 0.0 0 0 ? S Mai22 13:59 [kworker/1:1] root 208 0.0 0.0 0 0 ? S< Mai22 0:00 [deferwq] root 209 0.0 0.0 0 0 ? S< Mai22 0:00 [charger_manager] root 264 0.0 0.0 0 0 ? S< Mai22 0:00 [kpsmoused] root 266 0.0 0.0 0 0 ? S Mai22 0:00 [scsi_eh_0] root 267 0.0 0.0 0 0 ? S Mai22 0:00 [hpsa] root 270 0.0 0.0 0 0 ? S Mai22 0:00 [scsi_eh_1] root 271 0.0 0.0 0 0 ? S Mai22 0:00 [scsi_eh_2] root 272 0.0 0.0 0 0 ? S Mai22 0:00 [scsi_eh_3] root 273 0.0 0.0 0 0 ? S Mai22 0:00 [scsi_eh_4] root 274 0.0 0.0 0 0 ? S Mai22 0:00 [scsi_eh_5] root 275 0.0 0.0 0 0 ? S Mai22 0:00 [scsi_eh_6] root 281 0.2 0.0 0 0 ? S Mai22 18:31 [kworker/2:1] root 284 0.4 0.0 0 0 ? S Mai22 26:21 [kworker/3:1] root 293 0.0 0.0 0 0 ? S< Mai22 0:00 [kdmflush] root 294 0.0 0.0 0 0 ? S< Mai22 0:00 [bioset] root 296 0.0 0.0 0 0 ? S< Mai22 0:00 [kdmflush] root 297 0.0 0.0 0 0 ? S< Mai22 0:00 [bioset] root 314 0.0 0.0 0 0 ? S Mai22 0:01 [jbd2/dm-0-8] root 315 0.0 0.0 0 0 ? S< Mai22 0:00 [ext4-rsv-conver] root 440 0.0 0.0 0 0 ? S< Mai22 0:00 [ext4-rsv-conver] root 479 0.0 0.0 19484 196 ? S Mai22 0:00 upstart-udev-bridge --daemon root 506 0.0 0.0 51540 4 ? Ss Mai22 0:00 /lib/systemd/systemd-udevd --daemon root 525 0.0 0.0 15380 208 ? S Mai22 0:00 upstart-file-bridge --daemon root 557 0.0 0.0 0 0 ? SN Mai22 0:21 [kipmi0] message+ 617 0.0 0.0 39252 640 ? Ss Mai22 0:00 dbus-daemon --system --fork syslog 639 0.0 0.0 255848 1688 ? Ssl Mai22 0:00 rsyslogd root 644 0.0 0.0 0 0 ? S< Mai22 0:00 [kvm-irqfd-clean] root 704 0.0 0.0 43456 924 ? Ss Mai22 0:00 /lib/systemd/systemd-logind root 738 0.0 0.0 15268 216 ? S Mai22 0:00 upstart-socket-bridge --daemon root 1034 0.0 0.0 17312 8 tty4 Ss+ Mai22 0:00 /sbin/getty -8 38400 tty4 root 1037 0.0 0.0 17312 8 tty5 Ss+ Mai22 0:00 /sbin/getty -8 38400 tty5 root 1042 0.0 0.0 17312 8 tty2 Ss+ Mai22 0:00 /sbin/getty -8 38400 tty2 root 1043 0.0 0.0 17312 8 tty3 Ss+ Mai22 0:00 /sbin/getty -8 38400 tty3 root 1045 0.0 0.0 17312 8 tty6 Ss+ Mai22 0:00 /sbin/getty -8 38400 tty6 root 1060 0.0 0.0 61388 320 ? Ss Mai22 0:00 /usr/sbin/sshd -D root 1069 0.0 0.0 4376 4 ? Ss Mai22 0:00 acpid -c /etc/acpi/events -s /var/run/acpid.socket root 1095 0.0 0.0 23660 192 ? Ss Mai22 0:00 cron daemon 1096 0.0 0.0 19144 0 ? Ss Mai22 0:00 atd root 1116 0.0 0.0 19200 400 ? Ss Mai22 0:29 /usr/sbin/irqbalance unbound 1265 2.9 1.3 96720 51748 ? Ss Mai22 193:16 /usr/sbin/unbound vnstat 1317 0.0 0.0 7484 360 ? Ss Mai22 0:16 /usr/sbin/vnstatd -d --pidfile /run/vnstat/vnstat.pid root 1396 0.0 0.0 17312 8 tty1 Ss+ Mai22 0:00 /sbin/getty -8 38400 tty1 root 1410 0.0 0.0 0 0 ? 
S Mai22 0:00 [kauditd] dirk 1610 0.0 0.0 14100 840 ? S Mai22 0:10 /bin/bash -e ./memlog.sh root 1836 0.0 0.0 0 0 ? S Mai26 0:00 [kworker/1:0] root 13028 0.0 0.0 0 0 ? S Mai23 0:00 [kworker/0:0] root 14819 0.0 0.0 0 0 ? S< Mai23 0:00 [kworker/u129:1] root 15495 0.0 0.0 0 0 ? S 02:26 0:00 [kworker/u128:1] root 20329 0.0 0.0 0 0 ? S 08:06 0:00 [kworker/3:2] debian-+ 20802 28.6 3.3 309628 128476 ? Sl 08:11 7:49 /usr/bin/tor --defaults-torrc /usr/share/tor/tor-service-defaults-torrc -f /etc/tor/torrc1 --hush debian-+ 21033 18.4 3.2 305140 124116 ? Rl 08:12 4:49 /usr/bin/tor --defaults-torrc /usr/share/tor/tor-service-defaults-torrc -f /etc/tor/torrc2 --hush root 21243 0.0 0.0 0 0 ? S 08:23 0:00 [kworker/u128:2] root 21386 0.0 0.1 105656 4340 ? Ss 08:34 0:00 sshd: dirk [priv] dirk 21451 0.0 0.0 105656 1888 ? S 08:34 0:00 sshd: dirk@pts/0 dirk 21452 0.0 0.0 24080 3592 pts/0 Ss 08:34 0:00 -bash dirk 21521 0.0 0.0 4352 360 ? S 08:37 0:00 sleep 60 dirk 21523 0.0 0.0 20000 1316 pts/0 R+ 08:38 0:00 ps aux root 47576 0.0 0.0 0 0 ? S Mai25 0:00 [kworker/2:2]
I will attach a screenshot https://trac.torproject.org/projects/tor/attachment/ticket/22255/Screenshot_2017-05-26_09-56-09.png of top from when the problem was acute and I was logged in yesterday. I could not investigate further since the machine was very unresponsive (I was kicked out of the shell / local console).
comment:30 follow-up: 32 Changed 21 months ago by
Do you have system monitoring in place that shows memory usage over time
(e.g. munin)? Can you attach the following graphs:
- memory usage
- swap usage
- swap in/out per second
Also consider adding the following to your regularly running data collection scripts, so we might find out what else is using your memory such that tor only gets a small chunk of your 4 GB:
ps aux --sort=-%mem
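For example, a small logger in the same spirit as the memlog.sh script above could capture that once a minute (the file path is just a placeholder):

```
#!/bin/bash
FILE=/home/user/psmem.log
while true; do
    # timestamp plus the 20 most memory-hungry processes
    { date '+%Y-%m-%d %H:%M:%S'; ps aux --sort=-%mem | head -n 20; } >> "$FILE"
    sleep 60
done
```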
comment:31 Changed 21 months ago by
Also: When you look at "top" output sort by used memory (press "M" key after starting top).
Changed 21 months ago by
Attachment: | syslog20170528 added |
---|
Changed 21 months ago by
Attachment: | SMEM20170528.log added |
---|
comment:32 follow-up: 33 Changed 21 months ago by
Replying to cypherpunks:
Do you have system monitoring in place that shows memory usage over time
(e.g. munin)? Can you attach the following graphs:
- memory usage
- swap usage
- swap in/out per second
Also consider adding the following to your regularly running data collection scripts, so we might find out what else is using your memory such that tor only gets a small chunk of your 4 GB:
ps aux --sort=-%mem
I have no munin running up to now. But let me throw at you what I have.
1) Logs which show that a process died and was restarted, from today:
```
Son Mai 28 03:50:06 CEST 2017 Process tor1 is not running -> starting
Son Mai 28 04:27:01 CEST 2017 Process tor2 is not running -> starting
Son Mai 28 04:32:13 CEST 2017 Process tor2 is not running -> starting
```
This gives us the rough times of three events today.
2) Next, the (shortened) syslog shows what happened: https://trac.torproject.org/projects/tor/attachment/ticket/22255/syslog20170528
3) I run smem every 5 minutes to provide info about per-process memory usage (similar to ps aux); see the attachment https://trac.torproject.org/projects/tor/attachment/ticket/22255/SMEM20170528.log
I hope that gives you what you need. If not, let me know which tools would provide the most suitable output for you and I will see to it.
comment:33 Changed 21 months ago by
Replying to DeS:
I have no munin running up to now.
I think you want to have system monitoring (CPU, memory, swap, network) on a system anyway for proper management (e.g. to recognize rising memory usage trends, excessive swapping, very high CPU usage, ...).
let me know what tools would provide the best suitible output for you and I see to it.
maybe /proc/meminfo
comment:34 Changed 21 months ago by
OK, I installed a local munin. Will this give you all the information needed?
comment:35 Changed 21 months ago by
Current /proc/meminfo; I will add it to the collection so it is recorded every 5 minutes:
```
MemTotal:       3877580 kB
MemFree:         777392 kB
Buffers:         112468 kB
Cached:          688980 kB
SwapCached:       10748 kB
Active:         1102040 kB
Inactive:        249044 kB
Active(anon):    512100 kB
Inactive(anon):   39020 kB
Active(file):    589940 kB
Inactive(file):  210024 kB
Unevictable:          0 kB
Mlocked:              0 kB
SwapTotal:      4026364 kB
SwapFree:       4007204 kB
Dirty:              128 kB
Writeback:            0 kB
AnonPages:       545892 kB
Mapped:           87924 kB
Shmem:             1484 kB
Slab:            269576 kB
SReclaimable:    112900 kB
SUnreclaim:      156676 kB
KernelStack:       2864 kB
PageTables:        5096 kB
NFS_Unstable:         0 kB
Bounce:               0 kB
WritebackTmp:         0 kB
CommitLimit:    5965152 kB
Committed_AS:   1074140 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      84072 kB
VmallocChunk:   34359649392 kB
HardwareCorrupted:    0 kB
AnonHugePages:   428032 kB
HugePages_Total:      0
HugePages_Free:       0
HugePages_Rsvd:       0
HugePages_Surp:       0
Hugepagesize:      2048 kB
DirectMap4k:     149388 kB
DirectMap2M:    3878912 kB
DirectMap1G:          0 kB
```
Changed 21 months ago by
Attachment: | Screenshot_2017-05-30_20-11-49.png added |
---|
Changed 21 months ago by
Attachment: | SMEM_Mo.log added |
---|
Changed 21 months ago by
Attachment: | PROC_MEMINFO_Mon.log added |
---|
comment:36 Changed 21 months ago by
So after I got Munin installed I waited.
Yesterday two more OOMs occurred:
```
Mon Mai 29 16:06:07 CEST 2017 Process tor1 is not running -> starting
Mon Mai 29 17:20:01 CEST 2017 Process tor2 is not running -> starting
```
Syslog:
```
May 29 16:05:51 tor1 kernel: [502910.776183] Out of memory: Kill process 53045 (tor) score 32 or sacrifice child
May 29 17:17:07 tor1 kernel: [507187.634566] Out of memory: Kill process 53165 (tor) score 25 or sacrifice child
```
Unfortunately, munin was not too successful. It died when things got interesting and only ran very briefly between the two occurrences.
https://trac.torproject.org/projects/tor/attachment/ticket/22255/Screenshot_2017-05-30_20-11-49.png
But I got you the logs from SMEM and cat /proc/meminfo.
SMEM
https://trac.torproject.org/projects/tor/attachment/ticket/22255/SMEM_Mo.log
cat /proc/meminfo
https://trac.torproject.org/projects/tor/attachment/ticket/22255/PROC_MEMINFO_Mon.log
Hope this helps.
comment:37 Changed 21 months ago by
When these extreme spikes in memory usage happen (example: before 2017-05-30 12:00), what do the load/CPU/bandwidth/established connections look like at that point in time?
Changed 21 months ago by
Attachment: | munin.tar.gz added |
---|
comment:38 Changed 21 months ago by
Here are the graphics. https://trac.torproject.org/projects/tor/attachment/ticket/22255/munin.tar.gz
In that slot today around 12:00, nothing seems very out of the ordinary.
As said: once the real problem starts, munin fails.
comment:39 follow-up: 40 Changed 21 months ago by
It looks like a significant increase in connections when the memory spikes happen. Please change the netstat graphs to non-log scale by adding this to your munin.conf (in your host's section, somewhere after [localhost]):
netstat.graph_args -l 0 --base 1000
so we can have a better look.
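For context, the relevant part of munin.conf would then look roughly like this (the host name and address are just examples):

```
[localhost]
    address 127.0.0.1
    netstat.graph_args -l 0 --base 1000
```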
This is kind of obvious:
- when your system was knocked out (I/O wait was extremely high, probably due to swapping)
Please enable the swap munin plugin so we can see swap I/O.
comment:40 follow-up: 41 Changed 21 months ago by
Replying to cypherpunks:
It looks like a significant increase in connections when the memory spikes happen. Please change the netstat graphs to non-log scale by adding this to your munin.conf (in your host's section, somewhere after [localhost]):
netstat.graph_args -l 0 --base 1000
so we can have a better look.
Done.
This is kind of obvious:
- when your system was knocked out (I/O wait was extremely high, probably due to swapping)
Please enable the swap munin plugin so we can see swap I/O.
Swap is enabled. The picture is in the link https://trac.torproject.org/projects/tor/attachment/ticket/22255/munin.tar.gz
comment:41 Changed 21 months ago by
Replying to DeS:
Replying to cypherpunks:
It looks like a significant increase in connections when the memory spikes happen. Please change the netstat graphs to non-log scale by adding this to your munin.conf (in your host's section, somewhere after [localhost]):
netstat.graph_args -l 0 --base 1000
so we can have a better look.
Done.
Please upload the netstat graph again (the change should take effect once the graphs are regenerated, happens every 5min and works retroactively)
Swap is enabled. The picture is in the Link https://trac.torproject.org/projects/tor/attachment/ticket/22255/munin.tar.gz
I see the swap partition file "ubuntu_vg_swap_1-day.png", but I don't see a swap-day.png (pages per second in/out).
Changed 21 months ago by
Attachment: | Screenshot_2017-05-30_23-15-37.png added |
---|
Changed 21 months ago by
Attachment: | Screenshot_2017-05-30_23-14-02.png added |
---|
comment:42 follow-up: 43 Changed 21 months ago by
comment:43 follow-up: 44 Changed 21 months ago by
Replying to DeS:
Netstat
https://trac.torproject.org/projects/tor/attachment/ticket/22255/Screenshot_2017-05-30_23-15-37.png
The graph is still in log scale.
Changed 21 months ago by
Attachment: | Screenshot_2017-05-31_20-43-46.png added |
---|
comment:44 Changed 21 months ago by
comment:45 follow-up: 46 Changed 21 months ago by
Now you should have the tools to see if the spike in tor's memory usage is related to a spike in connections or bandwidth, but you won't have any data to compare with (from times where you didn't have the OOM problem).
On the graph around 2017-05-31 9:00 there is an (unusual?) spike from 13k to 19k connections; I guess the memory spiked as well (you didn't provide memory graphs).
Also note: Be aware that connection graphs at that granularity are considered sensitive.
Changed 21 months ago by
Attachment: | netstat-pinpoint=1496233206,1496341206.png added |
---|
Changed 21 months ago by
Attachment: | memory-pinpoint=1496233206,1496341206.png added |
---|
Changed 21 months ago by
Attachment: | netstat-pinpoint=1496233206,1496341206.2.png added |
---|
comment:46 follow-up: 48 Changed 21 months ago by
Also note: Be aware that connection graphs at that granularity are considered sensitive.
I'm open to other ideas on how to exchange this information, but you will agree this problem needs to be looked at.
Clearly there is a correlation between memory and connections. Please refer to:
BUT
1) As you can see in these pictures, the system "dies" once at 19k connections, but twice later everything is fine at more than 20k connections.
2) At all times (except the blank period, obviously) there is plenty of free memory.
3) Our machines operate limited to ~320-400 Mbit/s and memory has never been an issue in 1-2 years.
4) How do we explain that the tor processes use just 400 MB of memory each (see arma's comment above)?
Something is not adding up here, and this is what we have to find out.
comment:47 Changed 21 months ago by
... btw, no OOM today; the last one (blank zone on the images) was yesterday:
Mit Mai 31 16:01:23 CEST 2017 Process tor1 is not running -> starting
Mit Mai 31 16:01:23 CEST 2017 Process tor1 is not running -> starting
Mit Mai 31 17:33:53 CEST 2017 Process tor2 is not running -> starting
comment:48 follow-up: 49 Changed 21 months ago by
Replying to DeS:
1) As you can see in these pictures, the system "dies" once at 19k connections, but twice later everything is fine at more than 20k connections.
I would not be so sure about that since who knows what amount of connections it reached when it failed.
4) How do we explain that the tor processes use just 400 MB of memory each (see arma's comment above)?
Something is not adding up here, and this is what we have to find out.
You didn't provide the OOM log and your other logs this time, which would tell us how much it used. Please provide all the logs you have for all new OOMs (not just munin graphs).
I think running two instances with MaxMemInQueues at 2000MB is too high, because you do not have more than 4 GB of memory.
Let's try setting it to
x = (4000 MB - (memory_your_os_uses_without_tor + 400 MB safety)) / number_of_tor_instances
Are these physical or virtual machines?
comment:49 Changed 21 months ago by
Replying to cypherpunks:
I think running two instances with MaxMemInQueues at 2000MB is too high, because you do not have more than 4 GB of memory.
Let's try setting it to
x = (4000 MB - (memory_your_os_uses_without_tor + 400 MB safety)) / number_of_tor_instances
Will do so after the next comment, with a hopefully fully documented problem.
Are these physical or virtual machines?
These are physical machines.
Changed 21 months ago by
Attachment: | memory-pinpoint=1496354400,1496430027.png added |
---|
Changed 21 months ago by
Attachment: | netstat-pinpoint=1496354412,1496425827.png added |
---|
Changed 21 months ago by
Attachment: | PROC_MEMINFO.log added |
---|
Changed 21 months ago by
Changed 21 months ago by
Changed 21 months ago by
Changed 21 months ago by
Attachment: | collectl.log added |
---|
comment:50 Changed 21 months ago by
OK, here we are with a new documented case:
Fre Jun 2 19:01:48 CEST 2017 Process tor2 is not running -> starting
We got the Munin graphs:
Memory
https://trac.torproject.org/projects/tor/attachment/ticket/22255/memory-pinpoint%3D1496354400%2C1496430027.png
Netstat
https://trac.torproject.org/projects/tor/attachment/ticket/22255/netstat-pinpoint%3D1496354412%2C1496425827.png
Syslog:
https://trac.torproject.org/projects/tor/raw-attachment/ticket/22255/syslog
/proc/meminfo:
https://trac.torproject.org/projects/tor/attachment/ticket/22255/PROC_MEMINFO.log
SMEM: (Process memory overview)
https://trac.torproject.org/projects/tor/attachment/ticket/22255/SMEM.log
free output:
https://trac.torproject.org/projects/tor/attachment/ticket/22255/mem.2.log
collectl output:
https://trac.torproject.org/projects/tor/attachment/ticket/22255/collectl.log
Hope that helps.
And please be so kind as to state as exactly as possible which logs (in full) you do or do not need, so that I can provide them.
comment:51 Changed 21 months ago by
And the question really boils down to:
How can a rising number of connections (to tor - nothing else is running) drain the 4 GB of memory out of the machine when the tor processes themselves only use about 1 GB?
Where does the memory go, and why?
comment:52 Changed 21 months ago by
x = (4000 MB - (memory_your_os_uses_without_tor + 400 MB safety)) / number_of_tor_instances
4000MB - (300MB +400 MB) = 3300MB
MaxMemInQueues 1650 MB
Trying this on server2 - leaving server1 with current values to make sure we can continue analyzing.
comment:53 follow-up: 54 Changed 21 months ago by
some observations:
minimal logged free memory was 84MB at 2017-06-02 18:59:11
https://trac.torproject.org/projects/tor/attachment/ticket/22255/mem.2.log#L2256
2017-06-02 18:59:11 3786 3702 84 0 0 4
but https://trac.torproject.org/projects/tor/attachment/ticket/22255/PROC_MEMINFO.log
tells another story? (lower polling rate), lowest value at:
Fre Jun 2 18:45:49 CEST 2017 MemTotal: 3877580 kB MemFree: 139124 kB
already better after that:
Fre Jun 2 18:56:41 CEST 2017 MemTotal: 3877580 kB MemFree: 3052860 kB
4000MB - (300MB +400 MB) = 3300MB
you have 3786 MB memory (not 4000 MB)
I find percentage (or at least '-k') output with the relevant processes at the top much more readable/useful:
smem -pr
(you might also reduce the output to relevant columns only)
-c 'command rss swap'
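Combined, that would be something like the following (assuming your smem version accepts these options together):

```
smem -p -r -c 'command rss swap' | head -n 20
```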
How can a rising number of connections (to tor - nothing else is running) drain the 4 GB of memory out of the machine when the tor processes themselves only use about 1 GB?
My feeling is that tor uses about 500MB per process when it gets killed but uses much more before that.
Running 'smem' every 60 seconds might provide some valuable data.
comment:54 Changed 21 months ago by
Thx for the feedback
smem log pushed to every 60 seconds.
Options added as described.
Waiting for the next incident to document.
Replying to cypherpunks:
some observations:
minimal logged free memory was 84MB at 2017-06-02 18:59:11
https://trac.torproject.org/projects/tor/attachment/ticket/22255/mem.2.log#L2256
2017-06-02 18:59:11 3786 3702 84 0 0 4
but https://trac.torproject.org/projects/tor/attachment/ticket/22255/PROC_MEMINFO.log
tells another story? (lower polling rate), lowest value at:
Fre Jun 2 18:45:49 CEST 2017 MemTotal: 3877580 kB MemFree: 139124 kB
already better after that:
Fre Jun 2 18:56:41 CEST 2017 MemTotal: 3877580 kB MemFree: 3052860 kB
4000MB - (300MB +400 MB) = 3300MB
you have 3786 MB memory (not 4000 MB)
I find percentage (or at least '-k') output with relevant process at the top much more readable/useful:
smem -pr
(you might also reduce the output to relevant columns only)
-c 'command rss swap'
How can a rising number of connections (to tor - nothing else is running) drain the 4 GB of memory out of the machine when the tor processes themselves only use about 1 GB?
My feeling is that tor uses about 500MB per process when it gets killed but uses much more before that.
Running 'smem' every 60 seconds might provide some valuable data.
Changed 21 months ago by
Attachment: | 20170607.tar.gz added |
---|
comment:56 Changed 21 months ago by
OK, good news: I added all the parameters to my script, and today we had two OOMs.
Now the bad news: if you look at the relevant time, you will see the logs start to get strange.
Very easy explanation: the script which collects the stats needs more than a minute to run, and cron fires the next one, so we have several outputs in the files at the same time.
Stupid me - I did not put a lock file in - but wait: if the machine is that slow, what good would disk access do? So that is no real solution either. Likewise, not running it via cron would not really help; the server does not even have the resources to execute date and write it to a file.
The fact is, I can try some things, but getting clean logs from a completely dying machine is like sending a probe into a black hole and expecting pictures back.
But now, let's step back a little and think.
a) We see the tor processes use around 250-400 MB normally.
b) When they get killed they are bigger, 550-600 MB.
c) We have a machine with 3877580 kB (~3.8 GB).
We can try to get logfiles of everything, but if you ask me this will not help. All you would see is the memory usage of tor going up to 600 MB. So what?
The question is: where did around 1-2 GB of memory go?
Even in the messed-up logs you will see no monster process consuming this memory. The biggest and growing memory user is tor, and the question is why it behaves like this.
I do not believe a 25% rise in TCP connections will swallow up all the memory, so the problem must be somewhere else - if you ask me, in tor.
Any chance tor has debug features (memory allocation ...)?
https://trac.torproject.org/projects/tor/attachment/ticket/22255/20170607.tar.gz
Changed 20 months ago by
Changed 20 months ago by
Attachment: | 20170615.tar.gz added |
---|
comment:57 Changed 20 months ago by
It became awfully quiet in this ticket.
Since the last report I changed the measurement script. It now runs permanently and sleeps 60 seconds in between. As well, I added a nice value of -19.
After the latest update I had two days of hope that the problem was fixed - until today.
I could measure an OOM this morning.
https://trac.torproject.org/projects/tor/attachment/ticket/22255/20170615.tar.gz
The logs look better now, but you will notice the entries are not happening every minute. This is because of the machine slowdown once the problem starts.
Anyway, I think you will get a good impression of what happens.
comment:58 Changed 20 months ago by
peak time was at Don Jun 15 04:21:40 CEST 2017
your smem logs say that
- RSS system total: 0.33%
- tor used very little memory but a lot of swap space (>270MB per process)
(which very likely causes your system to hang completely)
(again you do not provide any swap in/out graphs)
your /proc/meminfo says:
- 3664 MB memory used (does not show in smem -> kernel memory?)
- 679 MB SWAP used
What is your swappiness?
Possible next items:
- add another smem output to list kernel memory:
smem -kwt
- Try disabling tor's swap usage:
DisableAllSwap 1
I guess this will kill tor earlier but prevent the extreme swapping/thrashing.
- try to reproduce on clean system with same relay keys
- guess I was wrong when I said tor might be using more memory before it was killed
Changed 20 months ago by
Attachment: | swap-pinpoint=1497453606,1497561606.png added |
---|
comment:59 follow-up: 61 Changed 20 months ago by
Swappiness is a new word for me.
Do you mean this
https://trac.torproject.org/projects/tor/attachment/ticket/22255/swap-pinpoint%3D1497453606%2C1497561606.png
comment:60 Changed 20 months ago by
https://en.wikipedia.org/wiki/Swappiness
cat /proc/sys/vm/swappiness
(but not very relevant anyway - I guess)
comment:61 Changed 20 months ago by
Replying to DeS:
Swappiness is a new word for me.
Do you mean this
https://trac.torproject.org/projects/tor/attachment/ticket/22255/swap-pinpoint%3D1497453606%2C1497561606.png
This is not the swappiness, but the swap in/out pages per second I asked for.
And yes a value >250 will make a system hang.
comment:63 Changed 20 months ago by
Sorry about the late reply - a summer flu got me.
Swappiness
```
root@tor1:~# cat /proc/sys/vm/swappiness
60
```
smem -kwt added to memory logger.
DisableAllSwap 1 configured. I had to add
capability sys_resource,
capability ipc_lock,
to the apparmor config for tor to start.
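For reference, on a Debian/Ubuntu-packaged tor those lines would typically go into the local override of the packaged profile; the exact path and profile name below are assumptions, so adjust them to your apparmor layout:

```
# /etc/apparmor.d/local/system_tor  (assumed local-override file for the system_tor profile)
capability sys_resource,
capability ipc_lock,
```

Then reload the profile (e.g. `apparmor_parser -r /etc/apparmor.d/system_tor`) and restart tor.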
Waiting for next event.
Changed 20 months ago by
Attachment: | 20170708.tar.gz added |
---|
comment:64 Changed 20 months ago by
Finally I got around to checking again. OOM kills still happen, but not as badly as before.
The logs should contain the information requested above.
The config contains DisableAllSwap 1 and is based on tor 0.3.0.9-1~trusty+1, since that is the stable release.
Here are the logfiles/munin screenshots:
https://trac.torproject.org/projects/tor/attachment/ticket/22255/20170708.tar.gz
comment:65 Changed 20 months ago by
4) How do we explain that the tor processes just use 400 MB of memory each (see arma's comment above)?
How can a rising number of connections (to tor - nothing else is running) drain the 4 GB of memory out of the machine when the tor processes themselves only use about 1 GB?
Where does the memory go, and why?
Sockets' buffers!
comment:66 Changed 20 months ago by
DeS, check your network hardware and uplink. If everything is OK, then try to reduce memory usage via:
net.ipv4.tcp_rmem and net.ipv4.tcp_wmem
Replace the third values (6 and 4 megabytes by default for your RAM size) with 262144 (256 KB).
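A sketch of what that could look like; the first two fields below are common kernel defaults and should be kept at whatever "sysctl net.ipv4.tcp_rmem" and "sysctl net.ipv4.tcp_wmem" currently report on the box:

# cap TCP auto-tuning at 256 KB per socket (third field); first two fields left unchanged
sysctl -w net.ipv4.tcp_rmem="4096 87380 262144"
sysctl -w net.ipv4.tcp_wmem="4096 16384 262144"
# to persist, put the same settings (without "sysctl -w") into a file under /etc/sysctl.d/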
comment:67 Changed 20 months ago by
try to reduce memory usage
If UDP is involved here, then also look at the net.ipv4.udp_mem values (in pages), and/or net.core.rmem_* and net.core.wmem_*.
TCP + UDP sockets can eat up to 1 GB on your system by default.
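Before changing anything, the current limits can be inspected read-only, for example:

sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
sysctl net.ipv4.udp_mem                                   # min/pressure/max, counted in pages
sysctl net.core.rmem_max net.core.wmem_max net.core.rmem_default net.core.wmem_default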
comment:68 Changed 20 months ago by
Collecting the pieces of the puzzle.
I'm 80-90% sure it's a bug in the way the Tor daemon interacts with unbound's behavior w/r/t large numbers of timing-out DNS queries. Unbound appears to be perfectly fine with the situation when it occurs. Tor daemon DNS queries lock up wholesale, thus preventing normal exit browsing behavior.
As a result: #21394 (unbound), ending in an OOM.
comment:69 Changed 20 months ago by
Tor daemon DNS queries
I believe Tor's DNS query handling is mostly wrong/broken because of bugs in tor's and libevent's (evdns) code, leading to large queues, errors, and timeouts. Anybody want to discuss/fix it?
comment:70 Changed 19 months ago by
I can confirm OOM kills on two nodes:
- 1: Debian GNU/Linux 8.8 (jessie), 1G memory, Tor version 0.2.5.14, Exit
- 2: Debian GNU/Linux 8.8 (jessie), 2G memory, Tor version 0.2.5.14, Non-Exit
No unbound is running on either. OOM never occurred in the last 5 years; now it suddenly happens? Seems very strange to me.
Side question: where did Tor Weather go?
comment:71 Changed 17 months ago by
Long overdue update.
The problem disappeared in August. On one host there were a few occurrences in September.
What helped is:
MaxMemInQueues 1650MB
DisableAllSwap 1
It did not resolve the problem but made it much better. Later on, maybe due to an update of tor, it seems to have mostly stopped.
comment:72 Changed 17 months ago by
Resolution: | → worksforme |
---|---|
Status: | new → closed |
comment:73 Changed 14 months ago by
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
I think I have to re-open this ticket; it still happens almost daily to my instance, Tor 0.2.9.14-1 on Debian 9.3 (stretch). It's now getting to the point where it's not only annoying to have it restart, but the OOM kills also sometimes affect other processes and daemons.
I've tried lowering the MaxMemInQueues setting until the OOM kills stop happening, but it is now at < 500 MB and the problem still occurs.
It would be really nice to solve the root cause of this.
comment:74 follow-up: 77 Changed 14 months ago by
Hi,
could you upgrade to 0.3.2.8-rc to test if the latest fixes solve your problem?
thanks!
comment:75 Changed 14 months ago by
Can someone finally comment on whether this is the same thing - it sounds awfully like it.
https://lists.torproject.org/pipermail/tor-project/2017-December/001604.html
comment:76 Changed 14 months ago by
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
There are many reasons for load spikes in Tor: HSDirs are one of them, millions of added clients are another.
Given the timing, it is unlikely that these have the same cause, but they may trigger similar bugs in Tor.
We are dealing with various bugs in a few different tickets.
Please:
- Upgrade to tor 0.3.2.8-rc
- Set MaxMemInQueues to your available memory, minus a few hundred megabytes
- give the tor user enough file descriptors (see the systemd sketch at the end of this comment)
Then, if the problem persists, open a new ticket for any out of memory kill issues.
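On a systemd-based install, one way to raise the tor user's file descriptor limit is a unit drop-in; the unit name, path and value below are assumptions for a typical Debian package:

# /etc/systemd/system/tor@default.service.d/limits.conf  (assumed unit name and path)
[Service]
LimitNOFILE=65536
# then: systemctl daemon-reload && systemctl restart tor@default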
comment:77 Changed 14 months ago by
Replying to cypherpunks:
Hi,
could you upgrade to 0.3.2.8-rc to test if the latest fixes solve your problem?
thanks!
Done - it has been running stable for a couple of days now, so the problem seems to be fixed.
Would it be helpful to do a version bisection to see where the bug was fixed? The root cause still seems to be unknown, if I understood this correctly.
comment:78 follow-up: 79 Changed 14 months ago by
please allow me to jump in here with some advice:
0.3.2.8-rc fixed the problem for you most likely due to #24666, which dramatically reduces memory consumption w/r/t DESTROY cells
however I see above an assumption that on a 4G machine MaxMemInQueues=1650MB is reasonable -- without a doubt this value is much too high
empirical observation indicates the daemon consumes 2x MaxMemInQueues once the actual maximum has been obtained (most likely due to sniper attacks)
therefore to truly harden tor daemon against OOM kills on a 4G machine, one should set MaxMemInQueues=1024MB
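As a sketch, that hardening advice for a ~4 GB machine would amount to a torrc along these lines (the 1024 MB figure is this comment's empirical recommendation, not an official default):

# torrc excerpt (assumed values for a ~4 GB relay)
MaxMemInQueues 1024 MB
DisableAllSwap 1    # needs CAP_IPC_LOCK / the AppArmor additions mentioned earlier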
for further detail and additional tuning tips see #24737
comment:79 follow-up: 80 Changed 14 months ago by
Replying to starlight:
please allow me to jump in here with some advice:
0.3.2.8-rc fixed the problem for you most likely due to #24666, which dramatically reduces memory consumption w/r/t DESTROY cells
however I see above an assumption that on a 4G machine MaxMemInQueues=1650MB is reasonable -- without a doubt this value is much too high
empirical observation indicates the daemon consumes 2x MaxMemInQueues once the actual maximum has been obtained (most likely due to sniper attacks)
I'd still like to see someone repeat this analysis with 0.3.2.8-rc, and post the results to #24737.
It's going to be hard for us to close that ticket without any idea of the effect of our changes.
comment:80 follow-up: 82 Changed 14 months ago by
Cc: | starlight@… added |
---|
Replying to teor:
I'd still like to see someone repeat this analysis with 0.3.2.8-rc, and post the results to #24737.
It's going to be hard for us to close that ticket without any idea of the effect of our changes.
I'm not willing to run a newer version until one is declared LTS, but I can say that even when my relay is not under attack, memory consumption goes to 1.5 GB with the 1 GB max queue setting. It seems to me the 2x MaxMemInQueues memory consumption is a function of the overheads associated with the tor daemon's queues and related processing, including malloc slack space.
Anyone running a busy relay on an older/slower system with MaxMemInQueues=1024MB can check /proc/<pid>/status to see how much memory is consumed (a quick check is sketched below). Be sure DisableAllSwap 1 is set and the queue limit is not set higher, since the point is to observe actual memory consumption relative to a limit likely to be approached under normal operation.
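A quick way to do that check (a sketch; pidof may return several PIDs if more than one tor instance is running):

pid=$(pidof tor | awk '{print $1}')              # pick one instance
grep -E 'VmPeak|VmRSS|VmSwap' /proc/$pid/status  # peak, resident and swapped-out sizes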
Another idea is to add an option to the daemon to cause queue memory preallocation. This would be a nice hardening feature as it will reduce malloc() calls issued under stress, and of course would allow more accurate estimates of worst-case memory consumption. If OOM strikes with preallocated queues that would indicate memory leakage.
comment:81 Changed 14 months ago by
Also allow me to remind everyone that memory is cheap. It is better to set the queue max conservatively (or buy more memory) than to experience OOM crashes.
comment:82 Changed 14 months ago by
Replying to starlight:
Replying to teor:
I'd still like to see someone repeat this analysis with 0.3.2.8-rc, and post the results to #24737.
It's going to be hard for us to close that ticket without any idea of the effect of our changes.
I'm not willing to run a newer version until one is declared LTS, but I can say that even when my relay is not under attack, memory consumption goes to 1.5 GB with the 1 GB max queue setting. It seems to me the 2x MaxMemInQueues memory consumption is a function of the overheads associated with the tor daemon's queues and related processing, including malloc slack space.
Anyone running a busy relay on an older/slower system with MaxMemInQueues=1024MB can check /proc/<pid>/status to see how much memory is consumed. Be sure DisableAllSwap 1 is set and the queue limit is not set higher, since the point is to observe actual memory consumption relative to a limit likely to be approached under normal operation.
Another idea is to add an option to the daemon to cause queue memory preallocation. This would be a nice hardening feature as it will reduce malloc() calls issued under stress, and of course would allow more accurate estimates of worst-case memory consumption. If OOM strikes with preallocated queues that would indicate memory leakage.
I have answered this on #24737, because it will get lost on a closed ticket.
Please move all future discussion there.
Have you tried setting the MaxMemInQueues option? It tries to limit the total amount of memory Tor uses.