I experienced a segfault after I upgraded my Tor to 0.2.4.16-rc.
The log entries before the crash follow:
Aug 27 04:40:02 tor kernel: [8051709.437719] grsec: From 38.229.70.34: Segmentation fault occurred at (nil) in /usr/bin/tor[tor:21268] uid/euid:220/220 gid/egid:220/220, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Aug 27 04:40:02 tor kernel: [8051709.437663] tor[21268] general protection ip:54fda6ae sp:5e4e0eb0 error:0 in libc-2.15.so[54f94000+17e000]
It probably happened after a SIGUSR1 was sent to the process.
Unfortunately I don't have any more details about this, as it has happened only once so far. Before this, I've been running 0.2.3.25 for quite a while without any problems.
Trac: Username: pyllyukko
Ok. It crashed again, and now I've modified ulimits so that I'll get a core dump when it crashes a third time. I'm just a bit hesitant about enabling core dumps on a production server.
Here are the log lines:
Sep 4 04:40:04 tor kernel: [8742911.434324] tor[10460] general protection ip:4bf9c6ae sp:5c33df00 error:0 in libc-2.15.so[4bf56000+17e000]
Sep 4 04:40:04 tor kernel: [8742911.434384] grsec: From 176.31.156.199: Segmentation fault occurred at (nil) in /usr/bin/tor[tor:10460] uid/euid:220/220 gid/egid:220/220, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
It definitely has something to do with the jobs run from cron. The logs were not rotated at that point, so again SIGUSR1 could be the trigger.
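For reference, the ulimit change mentioned above can be sketched roughly like this (the core pattern and paths are assumptions for illustration, not the reporter's actual setup, and the sysctl step needs root):

```shell
# Allow core dumps of unlimited size in the shell that starts tor
ulimit -c unlimited

# Write cores to a known, writable location (pattern/path are examples)
sysctl -w kernel.core_pattern='/var/tmp/core.%e.%p'

# After restarting tor, verify the limit actually applies to the daemon
grep core "/proc/$(pidof tor)/limits"
```

The `grep` on `/proc/<pid>/limits` is worth doing: ulimits are inherited at process start, so a running daemon won't pick up the new limit until it is restarted.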
Ok, so now I got some core dumps. Here is some further info.
[New LWP 8539]
[New LWP 8543]
warning: Could not load shared library symbols for linux-gate.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
Core was generated by `/usr/bin/tor -f /etc/tor/torrc'.
Program terminated with signal 11, Segmentation fault.
#0  0x50a186ae in vfprintf () from /lib/libc.so.6
bt full:
#0  0x50a186ae in vfprintf () from /lib/libc.so.6
No symbol table info available.
#1  0x50ace048 in __vsnprintf_chk () from /lib/libc.so.6
No symbol table info available.
#2  0x11272127 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
The backtrace doesn't look really useful, so I'll probably need further instructions on how to dig deeper with this.
I'm able to reproduce this by sending SIGUSR1 to the process. It doesn't crash on every signal, but if you send enough of them it's guaranteed to crash at some point.
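A reproduction loop along those lines might look like this (a hypothetical sketch; the one-second delay is an arbitrary choice, and `pidof tor` assumes a single tor process):

```shell
#!/bin/sh
# Keep sending SIGUSR1 (tor's dump-statistics signal) until the
# process disappears, i.e. until kill can no longer deliver it.
PID=$(pidof tor)
while kill -USR1 "$PID" 2>/dev/null; do
    sleep 1          # give tor time to handle the signal (and maybe crash)
done
echo "tor (pid $PID) is gone"
```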
Hm. Maybe we can work backwards here. Is there anything about your OS, your computer, your setup, your Tor configuration, or the way you're building or running Tor that might be unusual? Maybe we can figure this out by investigating why you're seeing this and other people aren't.
Nothing I can think of. Just a standard x86 running Slackware 14.0 + grsec kernel.
I'd like to add that I've been running this Tor node for years with a similar config (with OS upgrades + patches, of course), but the baseline has been the same for ages. It only started crashing once I upgraded Tor to 0.2.4.x. It was really stable before that.
Not many operators are likely to be sending SIGUSR1 to their Tor process every now and then, which could explain why others aren't seeing this. The grsec kernel should be easy enough to rule out; I'll just boot a regular kernel and see if it still happens.
You can see from here how it's built: [http://slackbuilds.org/slackbuilds/14.0/network/tor/tor.SlackBuild]
I just added SLKCFLAGS+=" -g" to get debug symbols, and of course changed the version. I also changed 'make install-strip' to 'make install', but that (like the debug symbols) was only after it started crashing.
FWIW, I had 2 v0.2.4.17-rc relays down on Sunday. They apparently did not recover from the reload ("kill -1 $PID") done when the logs were rotated (via logrotate, initiated by a cron job). I never saw this with the v0.2.3.x builds.
Sorry I can't provide more info. Just this datapoint.
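For context, a SIGHUP-on-rotate setup like the one described usually looks something like this in logrotate (the paths, schedule, and pidfile location here are assumptions for illustration, not this operator's actual config):

```
/var/log/tor/*.log {
    weekly
    rotate 5
    compress
    missingok
    postrotate
        # SIGHUP (kill -1) makes tor reload its config and reopen its logs
        kill -HUP "$(cat /var/run/tor/tor.pid)" 2>/dev/null || true
    endscript
}
```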
Hope this helps. Maybe you can start looking at the code from where this is called and try to find the bug from there. I'll start digging through Tor's code as soon as I get the chance, but I'm definitely not an expert, so no guarantees on any results :)
Okay. I'll look when I can, though I'm afraid I'm not very good at assembly-level stuff.
One more thing to consider: have you tried reproducing this while running Tor under valgrind? (See instructions in doc/HACKING for how to avoid spurious errors.) Often that can produce better stack traces than gdb for stack-corruption cases.
I'm now trying with Valgrind, but I'm unable to send SIGUSR1 to Tor, since the running process is Valgrind's, not Tor's. Any ideas?
In theory, if the internet can be believed, you can send a SIGUSR1 to the valgrind process. I tried it out on my linux desktop just now, and it worked okay for me.
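A minimal sketch of that, assuming the standard torrc path and an arbitrary startup delay (see doc/HACKING for the recommended valgrind flags and suppressions):

```shell
#!/bin/sh
# Run tor under valgrind in the background, then signal the valgrind
# process itself; valgrind delivers the signal to the program it runs.
valgrind /usr/bin/tor -f /etc/tor/torrc &
VGPID=$!

sleep 10                # rough guess: let tor finish starting up
kill -USR1 "$VGPID"     # reaches tor's SIGUSR1 handler inside valgrind
```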