Opened 12 years ago

Last modified 7 years ago

#468 closed defect (Fixed)

Tor server excessive RAM usage

Reported by: phobos Owned by: nickm
Priority: High Milestone: post 0.2.0.x
Component: Core Tor/Tor Version: 0.1.2.14
Severity: Keywords:
Cc: phobos, nickm, plebno, arma Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

A number of people on or-talk are reporting very high memory usage (512MB to 1GB) for their servers. This entry is
to track the issues and provide feedback to those experiencing the problem.
See http://archives.seul.org/or/talk/Jul-2007/msg00135.html for the start of the thread.

If you are experiencing this problem, please list your operating system (uname -rmpio), tor version, zlib version,
openssl version, exit server or not, and virtual/resident memory usage of Tor.

Thanks!

[Automatically added by flyspray2trac: Operating System: Other Linux]

Child Tickets

Change History (28)

comment:1 Changed 12 years ago by xiando

Linux godzilla 2.6.20-gentoo-r8 #5 Sat Jun 9 13:31:34 CEST 2007 i686 Pentium II (Deschutes) GenuineIntel GNU/Linux

PID USER PR NI RES VIRT SHR S %CPU %MEM TIME+ COMMAND
885 tor 15 0 115m 154m 23m S 7.8 26.2 230:11.18 tor

Checked out revision 10894.
sys-libs/zlib 1.2.3-r1
dev-libs/openssl 0.9.8d

# Bandwidth limits
BandwidthRate 32 KB
BandwidthBurst 64 KB

This isn't "very high" memory usage (as in 512 MB), but poor Godzilla only has 384 MB, so it'd be nice if Tor used a lower percentage of its total.

comment:2 Changed 12 years ago by mikeperry

I noticed this a bit ago, and have been running 0.1.1.26 for about a couple of weeks. It appears to be leaking at about the same rate as 0.1.2.14 did. It is currently consuming 1 GB of memory (it also took 0.1.2.14 about 2-3 weeks to get to 1 GB on my setup).

Since this problem suddenly showed up, yet 0.1.1.26 has been out for ages, perhaps it is a client problem? There is that issue where clients can send too many SENDMEs and fill up server buffers... Maybe there is a SENDME leak? Or some other behavioral change caused by 0.1.2.x?

Linux 2.6, zlib-1.2.3, openssl-0.9.8b

comment:3 Changed 12 years ago by phobos

So, is this still occurring with 0.1.2.17 or 0.2.0.6-alpha?

comment:4 Changed 12 years ago by plebno

For me, it eats up memory much more slowly with 0.1.2.17 than it did earlier - however, it still uses about 400 MB after 3 days. Is this considered normal with "BandwidthRate 1 MB"? I restarted it and now it uses 327m VIRT / 296m RES after 12 hours with constant 1 MB/s traffic.

(Linux 2.6.18-5-k7 i686, Debian etch, zlib 1.2.3-13, openssl 0.9.8c-4)

comment:5 Changed 11 years ago by phobos

We believe this has been partially fixed in 0.2.0.12-alpha. It appears to be related to DNS caching of query responses.

comment:6 Changed 11 years ago by nickm

14:42 < nickm> Okay.
14:42 < nickm> It looks like our remaining memory options are:
14:43 < nickm> 1) try to fix the stuff that dmalloc says is major, since that's all the data we have
14:43 < nickm> 2) look for a tool that can give better data
14:43 < nickm> what are the other options?
14:43 < arma> if the most dmalloc can tell us about is a few megs here and there, then option 1 isn't going to buy us much
14:43 < arma> option 3 would be to think really hard and guess what might be triggering bad memory behavior and try fixing it and see if stuff improves.
14:44 < nickm> Right
14:44 < nickm> So, what could be giving bad behavior?
14:44 < nickm> Hypotheses:
14:44 < nickm> 1) It is something we're mallocing that dmalloc doesn't know about.
14:44 < nickm> 2) It is something we are not mallocing.
14:44 < nickm> 2a) It is malloced in libevent or zlib
14:45 < nickm> 2b) It is something we are getting through some means other than mallocing.
14:45 < nickm> 3) The bloat is caused by memory fragmentation or some other weird libc issue.
14:45 < nickm> Other hypotheses?
14:47 < arma> the other hint is that valgrind doesn't find any leaks, when run on a server for several hours
14:48 < arma> historically, valgrind has been good about knowing if there are leaks. so i think there aren't any.
14:48 < nickm> right
14:48 < nickm> i am pretty sure this is not a matter of leaks
14:48 < arma> ok. and the other other hint is that our memory usage is basically zero when we exit. meaning we're freeing it all, including the mystery bloat.
14:48 < nickm> if it is a (3) above, then dmalloc stuff _will_ help: mallocing less reduces fragmentation.
[....]
14:52 < arma> theory 4) massive memory confusion due to mmaps, correctly handled or not.
14:54 < nickm> That theory 4 looks like 2b to me.

comment:7 Changed 11 years ago by nickm

I'm calling this a memory fragmentation issue. Theory 3 is still my favorite bet.

I got some usage-over-time numbers on a couple of runs of peacetime, and then restarted peacetime, using a port
of the OpenBSD malloc code as mentioned here:

http://blog.pavlov.net/2007/11/10/memory-fragmentation/

(Lots of other neat stuff is mentioned there too.)

After 40 minutes, it's at 93 MB resident (as opposed to 145 MB resident with the glibc malloc).

More data to follow.

After 1 hour, 99 MB (glibc had 152 MB resident at the 1-hour mark).

After 3 hours, _ (glibc had 160 MB at the 3-hour mark).

comment:8 Changed 11 years ago by nickm

Of course, VSS isn't RSS, but this is almost kinda progress.

To try the openbsd allocator yourself (on Linux):

  1. download http://mr.himki.net/OpenBSD_malloc_Linux.c
  2. gcc -shared -fPIC -O2 OpenBSD_malloc_Linux.c -o malloc.so
  3. LD_PRELOAD=/path/to/malloc.so /path/to/tor

[Directions snarfed from link above]

comment:9 Changed 11 years ago by arma

I'm running moria1 and moria2 on the new allocator. They're still using a lot of RAM (480 MB and 484 MB respectively), but that's slightly less than before.

What's more interesting is that their cpu usage is down a whole lot.

comment:10 Changed 11 years ago by nickm

> What's more interesting is that their cpu usage is down a whole lot.

How's their throughput?

comment:11 Changed 11 years ago by arma

Throughput is doing fine, I think.

moria1 and moria2 are probably atypical, in that they're directory authorities.

Moria1 is:
Dec 04 09:27:22.161 [info] Average bandwidth: 14603363634/249777 = 58465 bytes/sec reading
Dec 04 09:27:22.161 [info] Average bandwidth: 200818173905/249777 = 803989 bytes/sec writing

Moria2 is:
Dec 04 08:05:32.648 [info] Average bandwidth: 12797620736/245152 = 52202 bytes/sec reading
Dec 04 08:05:32.648 [info] Average bandwidth: 168733587914/245152 = 688281 bytes/sec writing

They both have "MaxAdvertisedBandwidth 20 KB" set, you see.

It would be nice to get some comparisons from some more "normal" fast servers, like Tonga or
blutmagie.

comment:12 Changed 11 years ago by phobos

[22:14:50] < Lucky> Tonga memory usage seems to hold steady at about 0.5 GB with the BSD malloc. "Good enough for me".

on 2007-12-14

comment:13 Changed 11 years ago by nickm

Neat. This is probably memory fragmentation under glibc malloc then. I think that the use of BSD malloc is only
a workaround; we need to solve the actual fragmentation.

comment:14 Changed 11 years ago by Falo

After 11 days of operation, blutmagie's memory usage is still below 500 MB with the BSD malloc. Very good!

comment:15 Changed 11 years ago by shamrock

Tonga remained constant at ~500MB memory usage after about a week of operation. The BSD malloc definitely solves the problem for me.

comment:16 Changed 11 years ago by nickm

It might also be cool to see if Google's TCMalloc helps:

http://code.google.com/p/google-perftools/wiki/GooglePerformanceTools

comment:17 Changed 11 years ago by nickm

I can confirm that the following code has horrible heap behavior on glibc, and fine heap behavior on openbsd malloc:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
  int i = 0;
  void *p[10000];
  for (i = 0; i < 10000; ++i) {
    /* grow each allocation to 500 KB, then shrink it back down to 1 byte */
    p[i] = malloc(1);
    p[i] = realloc(p[i], 1024*500);
    p[i] = realloc(p[i], 1);
  }
  puts("DONE.");
  sleep(200);  /* keep the process alive so its memory use can be inspected */
  return 0;
}

This is very similar to how Tor up to 0.2.0.15-alpha handles chunks of memory used for buf_t and put on a freelist.
I think that the new buffer.c code (in 0.2.0.15-alpha-dev) fixes this instance of the problem at least. We should
audit our other uses of realloc, though.
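
For illustration, here is a minimal sketch of the chunk-on-a-freelist approach (hypothetical code, not the actual buffer.c; chunk_t, chunk_get, and chunk_release are made-up names). Because buffers are built from fixed-size chunks recycled through a freelist, growing or shrinking a buffer never calls realloc(), and the allocator only ever sees one object size:

#include <stdlib.h>

#define CHUNK_PAYLOAD 4096

typedef struct chunk_t {
  struct chunk_t *next;      /* next chunk in the buffer or on the freelist */
  size_t used;               /* bytes of data currently stored in this chunk */
  char data[CHUNK_PAYLOAD];
} chunk_t;

static chunk_t *freelist = NULL;

/* Grab a chunk, reusing one from the freelist if possible. */
static chunk_t *chunk_get(void)
{
  chunk_t *c;
  if (freelist) {
    c = freelist;
    freelist = c->next;
  } else {
    c = malloc(sizeof(chunk_t));  /* every allocation is the same size */
    if (!c)
      return NULL;
  }
  c->next = NULL;
  c->used = 0;
  return c;
}

/* Return a chunk to the freelist instead of free()ing it. */
static void chunk_release(chunk_t *c)
{
  c->next = freelist;
  freelist = c;
}

int main(void)
{
  chunk_t *c = chunk_get();
  if (c)
    chunk_release(c);        /* chunk goes back on the freelist for reuse */
  return 0;
}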

comment:18 Changed 11 years ago by nickm

Another trick I've been told about, if none of the above works, is to instrument our malloc/etc wrappers to dump a record of all uses to a file, and replay the record independently to figure out which subsequence of it is tickling bad platform behavior.
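
A minimal sketch of what such instrumentation could look like (hypothetical names; these are not Tor's actual wrappers): each wrapper appends one line per call to a trace file, so a separate replay tool can later feed the same sequence of sizes to different allocators:

#include <stdio.h>
#include <stdlib.h>

static FILE *trace_file = NULL;

static FILE *get_trace(void)
{
  if (!trace_file)
    trace_file = fopen("alloc-trace.log", "a");
  return trace_file;
}

/* malloc wrapper: record the requested size and the returned pointer. */
void *traced_malloc(size_t n)
{
  void *p = malloc(n);
  FILE *f = get_trace();
  if (f)
    fprintf(f, "M %p %zu\n", p, n);
  return p;
}

/* realloc wrapper: record the old pointer, new pointer, and new size. */
void *traced_realloc(void *old, size_t n)
{
  void *p = realloc(old, n);
  FILE *f = get_trace();
  if (f)
    fprintf(f, "R %p %p %zu\n", old, p, n);
  return p;
}

/* free wrapper: record the pointer being released. */
void traced_free(void *p)
{
  FILE *f = get_trace();
  if (f)
    fprintf(f, "F %p\n", p);
  free(p);
}

int main(void)
{
  /* Tiny usage example: the same grow-then-shrink pattern as above,
   * but now every call leaves a line in alloc-trace.log. */
  void *p = traced_malloc(1);
  p = traced_realloc(p, 1024 * 500);
  traced_free(p);
  return 0;
}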

comment:19 Changed 11 years ago by phobos

0.2.0.18 seems to fix this problem to a degree.

comment:20 Changed 11 years ago by Fredzupy

0.2.0.18 seems to be OK for me too, but 0.2.0.19 does not.

Here is Tor RAM usage for my system and various releases:
http://pastebin.ca/900948 0.1.0.19
http://pastebin.ca/900950 0.2.0.18
http://pastebin.ca/900952 0.2.0.19

relay name ethnao:
2.6.17-hardened-r1 i686 VIA Ezra CentaurHauls GNU/Linux
zlib version 1.2.3
openssl version 0.9.8g

comment:21 Changed 11 years ago by nickm

For 0.2.0.x-rc, add the openbsd malloc as a build option.

comment:22 Changed 11 years ago by nickm

Okay; as of 0.2.0.20-alpha (0.2.0.20-rc?), Tor will ship with the openbsd malloc code, but it will be off
by default. You can turn it on by passing --enable-openbsd-malloc to configure.

Also, if you want to use tcmalloc from google's perftools, I've added a fragmentary --with-tcmalloc
option to configure.

comment:23 Changed 11 years ago by nickm

Firefox also likes the "use a different allocator" solution, it seems:

http://ventnorsblog.blogspot.com/2008/02/beta-3.html

comment:24 Changed 11 years ago by arma

It looks like 0.2.1.1-alpha solves the memory fragmentation problem on Linux.
Yay, finally.

Shall we close this bug?

comment:25 Changed 11 years ago by coderman

I say close it!

0.2.1.2-alpha has been running quite stably for weeks now. Memory use is drastically reduced; previously this node would have hundreds of MB allocated, sometimes over 0.5 GB. The latest alpha has stayed at or below 22 MB resident for weeks now, and actually reduces its memory footprint during periods of lower network activity (only a few MB, but still impressive :)

comment:26 Changed 11 years ago by nickm

Closing. There are remaining memory issues that we'd do well to address, but the memory use is now merely
"unpleasant" rather than "excessive".

comment:27 Changed 11 years ago by nickm

flyspray2trac: bug closed.

comment:28 Changed 7 years ago by nickm

Component: Tor Relay → Tor