Opened 2 years ago

Closed 8 weeks ago

#20424 closed defect (fixed)

Remove --enable-openbsd-malloc (Tor maxes CPU when --enable-openbsd-malloc is used)

Reported by: icanhasaccount Owned by:
Priority: Low Milestone: Tor: 0.3.5.x-final
Component: Core Tor/Tor Version:
Severity: Minor Keywords: review-group-31, 034-triage-20180328, fast-fix, 035-triaged-in-20180711
Cc: alex_y_xu@… Actual Points:
Parent ID: Points: .2
Reviewer: ahf Sponsor:

Description

When tor is built with --enable-openbsd-malloc running 'make test' consumes 100% of the CPU. Attaching a debugger to the process shows it looping forever in calloc (src/ext/OpenBSD_malloc_Linux.c). Changing from -O2 to -O3 (or -O1) in the Makefile "fixes" (as in the tests run) the issues so it seems to be a compiler bug.

Some more details:
Debian testing, x86_64

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 6.2.0-6' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 6.2.0 20161010 (Debian 6.2.0-6)

I've attached the output of objdump for the failing and working object files, but I'm too stupid to debug it any further, so hopefully someone more intelligent can.

Child Tickets

Attachments (9)

O1.txt (701.1 KB) - added by icanhasaccount 2 years ago.
O1 - make test passes
O2.txt (525.0 KB) - added by icanhasaccount 2 years ago.
O2 - make test fails
example.c (1.5 KB) - added by nickm 2 years ago.
example.2.c (1.4 KB) - added by nickm 2 years ago.
example2.c (883 bytes) - added by nickm 2 years ago.
Variant example, with no headers included.
remove_openbsd_malloc.diff (50.6 KB) - added by icanhasaccount 2 years ago.
Diff to remove openbsd malloc if thats desirable
configure_ac_patch.patch (468 bytes) - added by icanhasaccount 2 years ago.
Work around compilation issues with some versions of gcc
0001-Add-with-jemalloc.-20424-23777.patch (2.4 KB) - added by Hello71 11 months ago.
patch to add --with-jemalloc
0001-Add-with-malloc-configure-option.-20424-23777.patch (5.4 KB) - added by Hello71 11 months ago.
so apparently autoconf calls it LIBS

Download all attachments as: .zip

Change History (50)

Changed 2 years ago by icanhasaccount

Attachment: O1.txt added

O1 - make test passes

Changed 2 years ago by icanhasaccount

Attachment: O2.txt added

O2 - make test fails

comment:1 Changed 2 years ago by nickm

I can confirm this (reproducing it on Fedora with gcc-6.2.1-2.fc24.x86_64). This definitely looks like a compiler bug to me.

comment:2 Changed 2 years ago by dgoulet

Milestone: Tor: 0.3.0.x-final

comment:3 Changed 2 years ago by nickm

This bug seems to depend on the name of the "malloc" function. I'm attaching a minimal example to reproduce this bug (down to 67 lines from 2049). The comment near the head of the file shows three ways to build this: two of which work, and one of which gives us the infinite loop that you ran into.

I think it's time to report this upstream to gcc, unless they already know about it.

Changed 2 years ago by nickm

Attachment: example.c added

Changed 2 years ago by nickm

Attachment: example.2.c added

comment:4 Changed 2 years ago by nickm

I've attached an even simpler example.

comment:5 Changed 2 years ago by icanhasaccount

Yep - example.2.c shows the same problem. I'll report it to the gcc folk if its not reported already.

Thanks!

Changed 2 years ago by nickm

Attachment: example2.c added

Variant example, with no headers included.

comment:6 Changed 2 years ago by nickm

I think they might already know: here's a bug report that looks quite similar.

https://gcc.gnu.org/ml/gcc-help/2015-03/msg00091.html

comment:7 Changed 2 years ago by nickm

The -fno-builtin-malloc option seems to be a workaround here. In production, you might need -fno-builtin-... for all of malloc, free, realloc, etc.

comment:8 Changed 2 years ago by icanhasaccount

I wonder if it would be worth removing the option to build with openbsd malloc from future releases?

If the email relates to the same issue its been broken for a while now (the email is from 2015 and gcc 5) and no one else has run into it.

comment:9 Changed 2 years ago by cypherpunks

FWIW the OpenBSD memory allocator was added in ticket:468#comment:22.

I agree with comment:8 with regards to removing the openbsd malloc build option. It's obvious the code isn't tested or maintained and causes people to shoot themselves in the foot.

Changed 2 years ago by icanhasaccount

Attachment: remove_openbsd_malloc.diff added

Diff to remove openbsd malloc if thats desirable

comment:10 Changed 2 years ago by nickm

I wonder if it would be worth removing the option to build with openbsd malloc from future releases?

A while ago, lots and lots of relays were built to use it, because of glibc malloc problems with memory fragmentation. Have those all gone away?

comment:11 Changed 2 years ago by icanhasaccount

Sorry - I didn't look too closely at why the openbsd allocator was added in the first place. That was ignorant and lazy of me.

Adding the following to configure.ac (as suggested by Nick in comment 7) allows the tests to run. I'm not familiar with autotools though so perhaps there would be a cleaner way to do this if we could narrow it down to a specific version of gcc?

Changed 2 years ago by icanhasaccount

Attachment: configure_ac_patch.patch added

Work around compilation issues with some versions of gcc

comment:12 Changed 2 years ago by asn

Summary: Make test fails when --enable-openbsd-malloc is usedTor maxes CPU when --enable-openbsd-malloc is used

Renaming ticket, as my understanding is that this is not test-specific, and it occurs with tor in general when the --enable-openbsd-malloc feature is enabled.

comment:13 Changed 22 months ago by dgoulet

Keywords: triage-out-030-201612 added
Milestone: Tor: 0.3.0.x-finalTor: 0.3.1.x-final

Triaged out on December 2016 from 030 to 031.

comment:14 Changed 19 months ago by nickm

Points: .2
Status: newneeds_revision

comment:15 Changed 17 months ago by nickm

Milestone: Tor: 0.3.1.x-finalTor: 0.3.2.x-final

0.3.1 feature freeze today: these don't seem like they will be all sorted out in time. Let's try to make progress in 0.3.2.

comment:16 Changed 16 months ago by nickm

Keywords: triage-out-030-201612 removed

comment:17 Changed 13 months ago by nickm

Milestone: Tor: 0.3.2.x-finalTor: 0.3.4.x-final
Summary: Tor maxes CPU when --enable-openbsd-malloc is usedRemove --enable-openbsd-malloc (Tor maxes CPU when --enable-openbsd-malloc is used)

comment:18 Changed 12 months ago by Hello71

gcc 7 simply deletes the functions, so --enable-openbsd-malloc now instead does nothing except slightly increase compilation resources and executable size.

comment:19 Changed 11 months ago by Hello71

Cc: alex_y_xu@… added

proposal: remove tcmalloc and openbsd malloc, add jemalloc and enable by default on Linux if installed.

pros compared to ptmalloc2:

  • moderately decreased CPU usage (#23777)
  • slightly decreased memory usage (fragmentation is much less bad in 0.3.2 already, but jemalloc still does better by default)

pros compared to openbsd malloc:

  • remove dead code
  • code path will be actually tested (unlike with openbsd malloc)
  • jemalloc is actually maintained by distributions and -ljemalloc is a standard method of overriding glibc. this way will continue to work, unlike just defining functions

cons:

  • if always-on, adds an extra prerequisite. if used only when available, might cause more Linux-only bugs
  • glibc does slightly more runtime checks, but AFAIK these are almost entirely useless for security
  • might be fingerprintable? seems highly unlikely, and would also apply to other OSes

Changed 11 months ago by Hello71

patch to add --with-jemalloc

comment:20 Changed 11 months ago by Hello71

Status: needs_revisionneeds_review

so I read what I'm pretty sure is all of the information online about replacing malloc, and I'm pretty sure that this is the best global option possible. another option that can be explored is to have tor_malloc_ call jemalloc, but AFAICT most distros build jemalloc with no prefix, so dlopen magic would be needed.

Changed 11 months ago by Hello71

so apparently autoconf calls it LIBS

comment:21 Changed 8 months ago by nickm

Keywords: review-group-31 added

comment:22 Changed 8 months ago by nickm

Reviewer: nickm

comment:23 Changed 8 months ago by nickm

Status: needs_reviewneeds_revision

(I'm guessing that the second patch should be applied _without_ the first, right?)

I like this patch, but I'm not so sure about having jemalloc be the default right away. Let's let people test it as an option before we decide it's the best choice for everyone. Also, if I'm reading the patch right, this code will break on Linux if jemalloc isn't installed, which probably isn't what you'd intended?

Other questions:

  • Why move the mlockall check?
  • Why does the patch move dmalloc? Have you tested this? Does it work?
  • Where and how have you tested this patch?

Other notes:

  • This will need a changes file, explaining to people how to use this feature.
  • We'll still want to remove the openbsd_malloc code.

comment:24 Changed 8 months ago by Hello71

I wanted to put all of the allocator-related configure.ac code together. I tested at least one of the versions of the patches both with and without jemalloc installed, but I only tested on Linux. The plan was to remove openbsd malloc first. I see no benefit to using it over jemalloc.

comment:25 Changed 8 months ago by Hello71

I still need to test what happens when rust is linked in.

comment:26 Changed 6 months ago by nickm

Keywords: 034-triage-20180328 added

comment:27 Changed 6 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:28 Changed 6 months ago by nickm

Keywords: fast-fix added; 034-removed-20180328 removed

comment:29 Changed 4 months ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: 0.3.5.x-final

Move most needs_revision tickets from 0.3.4 to 0.3.5: we're about to hit the feature freeze.

If you really need to get this into 0.3.4, please drop me a line. :)

comment:30 Changed 4 months ago by cypherpunks

I'd like to point out that jemalloc has more issues with security than either ptmalloc or OpenBSD's malloc. The large, aligned heaps make ASLR less effective (and that's one reason why Firefox on *nix machines is so much easier to compromise than on Windows).

comment:31 Changed 4 months ago by Hello71

I asked Firefox people and they said that Firefox uses "jemalloc for Windows/Mac/Linux/Android, system malloc on iOS".

comment:32 Changed 4 months ago by Hello71

additionally, starting with 5, jemalloc now uses extent based allocation, so from what I understand, ASLR will be just as good as with other allocators.

comment:33 Changed 4 months ago by cypherpunks

They use jemalloc for Windows? I was under the impression Windows used the system malloc. And yeah I know newer versions of jemalloc are improved (redzones, etc), but I'm still a little weary, especially given that OpenBSD's malloc is actually really well-designed from a security standpoint. Though honestly the Copperhead malloc (a hybrid of bionic and OpenBSD's) is even better, but I feel it's unlikely that it would be included in Tor.

comment:34 Changed 3 months ago by nickm

Keywords: 035-triaged-in-20180711 added

comment:35 Changed 3 months ago by nickm

Reviewer: nickm
Status: needs_revisionneeds_review

Let's try to get this unstuck, by disentangling the various options.

I've made a branch remove_openbsd_malloc that ports icanhasaccount's removal code to master if we decide to do that. PR at https://github.com/torproject/tor/pull/224 . Not my preferred approach.

And I've made a branch called with_malloc that ports the parts of Hello71's patch that I agree with to master: it adds --with-malloc=, doesn't change our default, doesn't break old configure options, and keeps openbsd malloc as an option. Works for me, but probably needs more review and testing. PR at https://github.com/torproject/tor/pull/225 .

The above two branches apply to master only and are mutually exclusive, though we could probably combine them into one.

For backport purposes, I have a branch fix_nonstandard_malloc_029 that only does the minimum needed to fix the originally supported issue. PR at https://github.com/torproject/tor/pull/226 . Works for me.

comment:36 in reply to:  35 Changed 3 months ago by Hello71

Replying to nickm:

Let's try to get this unstuck, by disentangling the various options.

I've made a branch remove_openbsd_malloc that ports icanhasaccount's removal code to master if we decide to do that. PR at https://github.com/torproject/tor/pull/224 . Not my preferred approach.

And I've made a branch called with_malloc that ports the parts of Hello71's patch that I agree with to master: it adds --with-malloc=, doesn't change our default, doesn't break old configure options, and keeps openbsd malloc as an option. Works for me, but probably needs more review and testing. PR at https://github.com/torproject/tor/pull/225 .

The above two branches apply to master only and are mutually exclusive, though we could probably combine them into one.

For backport purposes, I have a branch fix_nonstandard_malloc_029 that only does the minimum needed to fix the originally supported issue. PR at https://github.com/torproject/tor/pull/226 . Works for me.

Thanks for fixing my patch and giving credit, but I'd appreciate it if you called me Alex Xu in changelog, or at least not put my nick in quotation marks (looks like scare quotes to me). Also, I think you put too many twos in the change file.

On the patch itself, I did actually test what happens when Rust is linked in: it works fine. I think Rust probably uses the system allocator on Linux, so replacing the malloc this way in glibc should also replace the Rust allocator and everything works fine.

Also, I'm not sure it's a good idea to leave all of these options in if nobody is actually going to test them (as shown by openbsd malloc being broken by all this time and nobody really caring that much to fix it).

comment:37 Changed 3 months ago by nickm

ok, added a squash commit to fix ticket number and credit you in the form you prefer.

(FYI, our house style has been to put pseudonyms that don't look like names into quotation marks, since not doing so has caused confusion in the past with strings like "patch from randomguessing" or "patch from Some Guy".)

comment:38 Changed 3 months ago by Hello71

OK, thanks!

comment:39 Changed 2 months ago by dgoulet

Reviewer: ahf

comment:40 Changed 8 weeks ago by ahf

Status: needs_reviewmerge_ready

comment:41 Changed 8 weeks ago by nickm

Resolution: fixed
Status: merge_readyclosed

I've merged fix_nonstandard_malloc_029 to 0.2.9 and forward.

I've squashed with_malloc and merged it to master, then added a commit in 5597ddc36047a4 to mark openbsd-malloc as deprecated.

Note: See TracTickets for help on using tickets.