Segfault in Tor 0.2.4.1[67]-rc after SIGUSR1

changed milestone to %Tor: 0.2.4.x-final

added 2016-bug-retrospective component::core tor/tor milestone::Tor: 0.2.4.x-final priority::medium reporter::pyllyukko resolution::fixed status::closed type::defect version::tor 0.2.4.16-rc labels

Okay. If at all possible, try to get a stack trace if this happens again. Without a stack trace, there isn't much I can do to figure this out.

Trac:
Status: new to needs_information
Milestone: N/A to Tor: 0.2.4.x-final

Ok. It crashed again, and now I modified ulimits to get the core dump when it does it the third time. I'm just a bit hesitant about enabling core dumps on a production server.

Here are the log lines:

Sep  4 04:40:04 tor kernel: [8742911.434324] tor[10460] general protection ip:4bf9c6ae sp:5c33df00 error:0 in libc-2.15.so[4bf56000+17e000]
Sep  4 04:40:04 tor kernel: [8742911.434384] grsec: From 176.31.156.199: Segmentation fault occurred at    (nil) in /usr/bin/tor[tor:10460] uid/euid:220/220 gid/egid:220/220, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0

It definitely has got something to do with stuff happening from cron. The logs were not rotated at that point, so again the SIGUSR1 could be the trigger to this.

Trac:
Username: pyllyukko
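
Editorial note on the kernel line above: the faulting instruction pointer can be turned into an offset inside the mapped library by subtracting the mapping's base address. The following is a minimal sketch, assuming the log line has the shape shown in this comment; `fault_offset` and the regular expression are hypothetical helpers written for illustration and are not part of Tor or the kernel.

```python
import re

# Assumed shape of the kernel line: "... ip:<hex> sp:<hex> error:<n> in <lib>[<base>+<size>]"
FAULT_RE = re.compile(
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:\d+ "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (library name, offset of the faulting ip inside it), or None."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    return match.group("lib"), ip - base


line = (
    "Sep  4 04:40:04 tor kernel: [8742911.434324] tor[10460] general protection "
    "ip:4bf9c6ae sp:5c33df00 error:0 in libc-2.15.so[4bf56000+17e000]"
)
lib, offset = fault_offset(line)
print(lib, hex(offset))  # libc-2.15.so 0x466ae
```

The offset ends in 6ae, as do the faulting eip values in the gdb output later in this thread (vfprintf+6398). That is consistent with the same libc instruction faulting each time, though it does not prove it.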

Ok, so now I got some core dumps. Here is some further info.

[New LWP 8539]
[New LWP 8543]

warning: Could not load shared library symbols for linux-gate.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
Core was generated by `/usr/bin/tor -f /etc/tor/torrc'.
Program terminated with signal 11, Segmentation fault.
#0  0x50a186ae in vfprintf () from /lib/libc.so.6

bt full:

#0  0x50a186ae in vfprintf () from /lib/libc.so.6
No symbol table info available.
#1  0x50ace048 in __vsnprintf_chk () from /lib/libc.so.6
No symbol table info available.
#2  0x11272127 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

info reg:

eax            0x0	0
ecx            0xffffffff	-1
edx            0xcccccccc	-858993460
ebx            0x50b52ff4	1354051572
esp            0x5f30d360	0x5f30d360
ebp            0x5f30d8d8	0x5f30d8d8
esi            0x5f30d910	1597036816
edi            0xcccccccc	-858993460
eip            0x50a186ae	0x50a186ae <vfprintf+6398>
eflags         0x210246	[ PF ZF IF RF ID ]
cs             0x73	115
ss             0x7b	123
ds             0x7b	123
es             0x7b	123
fs             0x0	0
gs             0x33	51

The backtrace doesn't look really useful, so I'll probably need further instructions on how to dig deeper with this.

I'm able to reproduce this by sending SIGUSR1 to the process. It doesn't crash on every signal, but if you send enough of them it's guaranteed to crash at some point.

Trac:
Username: pyllyukko
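
Editorial note: the reproduction described above (keep sending SIGUSR1 until the process dies) could be scripted roughly as follows. This is a sketch under assumptions: `send_sigusr1`, `count` and `delay` are hypothetical names, and the script needs to run as a user allowed to signal the target process.

```python
import os
import signal
import time


def send_sigusr1(pid, count=100, delay=1.0):
    """Send SIGUSR1 to `pid` up to `count` times, `delay` seconds apart.

    Returns how many signals were sent before the process disappeared,
    or `count` if it survived all of them.
    """
    sent = 0
    for _ in range(count):
        try:
            os.kill(pid, signal.SIGUSR1)
        except ProcessLookupError:
            # No such process any more: it exited, possibly by crashing.
            break
        sent += 1
        time.sleep(delay)
    return sent
```

Calling `send_sigusr1(pid)` with the PID of the running tor process and comparing the return value against `count` shows roughly how many signals it took before the process went away. A return value below `count` only says the process is gone, not why, so the kernel log or a core dump is still needed to confirm a segfault.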

Trac:
Username: pyllyukko
Summary: Segfault in Tor 0.2.4.16-rc to Segfault in Tor 0.2.4.1[67]-rc

I updated to 0.2.4.17-rc and it's still segfaulting.

Trac:
Username: pyllyukko

Hm. Maybe we can work backwards here. Is there anything about your OS or your computer or your setup or your Tor configuration or the way you're building or running Tor that might be unusual? Maybe we can figure this out by investigating why you're seeing this and other people aren't.

Nothing I can think of. Just a standard x86 running Slackware 14.0 + grsec kernel.

I'd like to add that I've been running this Tor node for years with a similar config (of course with OS upgrades + patches), but the baseline has been similar for ages. Only now that I upgraded Tor to 0.2.4.x has it started crashing. It's been really stable before that.

Trac:
Username: pyllyukko

Trac:
Username: pyllyukko
Summary: Segfault in Tor 0.2.4.1[67]-rc to Segfault in Tor 0.2.4.1[67]-rc after SIGUSR1

Hm. I don't want to go blaming the grsec thing, but I can't think what else could be unusual. But surely many other operators are using grsec too.

Are you building with any special compiler options, or using a stock package, or building from source?

Many operators might not be sending SIGUSR1 to the Tor process every now and then. The grsec should be easy enough to rule out: I'll just boot a regular kernel and see if it still happens.

You can see from here how it's built: [http://slackbuilds.org/slackbuilds/14.0/network/tor/tor.SlackBuild] I've just added "SLKCFLAGS+=" -g"" to get the debug symbols and of course changed the version. Also changed the 'make install-strip' to 'make install', but that was only after it started crashing, as were the debug symbols.

Trac:
Username: pyllyukko

It's not about the grsec. I'm able to reproduce it with a regular kernel.

Trac:
Username: pyllyukko

That's good news. Did the -g make the stack trace any more useful this time?

FWIW, I had 2 v0.2.4.17-rc relays down on Sunday. They apparently did not recover from the reload ("kill -1 $PID") done when the logs were rotated (via logrotate, initiated by a cron job). I never saw this with the v0.2.3.x builds.

Sorry I can't provide more info. Just this datapoint.

Trac:
Username: tmpname0901

Replying to nickm:

That's good news. Did the -g make the stack trace any more useful this time?

Unfortunately no, it still says "Backtrace stopped".

Trac:
Username: pyllyukko

Bit more details:

(gdb) x/4a $ebp
0xbff95778:	0xbff957b0	0xb728c048 <__vsnprintf_chk+232>	0xbff957b0	0xb777c173
(gdb) x/20s 0xb777c173
0xb777c173:	"%s:%u"
0xb777c179:	"TLS channel (no connection)"
0xb777c195:	"TLS channel (connection %llu)"
0xb777c1b3:	"conn->chan == chan"
0xb777c1c6:	"chan->conn == conn"
0xb777c1d9:	"non-versioned"
0xb777c1e7:	"a v1"
0xb777c1ec:	"behind"
0xb777c1f3:	"ahead"
0xb777c1f9:	"<none>"
0xb777c200:	"chan->conn->link_proto >= 3"
0xb777c21c:	" NETINFO"
0xb777c225:	" AUTH_CHALLENGE"
0xb777c235:	" CERTS"
0xb777c23c:	" VERSIONS"
0xb777c246:	"chan->conn->handshake_state"
0xb777c262:	"Sending cells:"
0xb777c271:	"Couldn't send versions cell"
0xb777c28d:	"Couldn't send certs cell"
0xb777c2a6:	"Couldn't send netinfo cell"
(gdb) x/5i $pc
=> 0xb71d66ae <vfprintf+6398>:	repnz scas %es:(%edi),%al
   0xb71d66b0 <vfprintf+6400>:	movl   $0x0,-0x494(%ebp)
   0xb71d66ba <vfprintf+6410>:	not    %ecx
   0xb71d66bc <vfprintf+6412>:	lea    -0x1(%ecx),%edi
   0xb71d66bf <vfprintf+6415>:	jmp    0xb71d6412 <vfprintf+5730>
(gdb) info reg edi
edi            0xcccccccc	-858993460
(gdb) info reg al
al             0x0	0

Hope this helps. Maybe you can start looking at the code, from where this is called, and try to find the bug from there. I'll start digging through Tor's code as soon as I get the chance, but I'm definitely not an expert, so no guarantees on any results :)

Trac:
Username: pyllyukko
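
Editorial note on the gdb output above: `repnz scas %es:(%edi),%al` with al = 0 and ecx = -1 (as in the earlier `info reg` dump), followed by `not %ecx` and `lea -0x1(%ecx),%edi`, is the usual inline string-length scan. One reading, which is an interpretation and not something stated in the thread, is that vfprintf was measuring a string for a `%s` conversion and the pointer it was handed was 0xcccccccc. The toy model below emulates that scan over a pretend memory map; `emulated_strlen`, `map_bytes` and `UnmappedAddress` are hypothetical names, and treating 0xcccccccc as unmapped is an assumption based on the fault.

```python
def to_signed32(value):
    """Interpret a 32-bit register value as signed, the way gdb prints it."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


class UnmappedAddress(Exception):
    """Raised when the emulated scan reads an address that is not mapped."""


def map_bytes(base, data):
    """Build a pretend memory map {address: byte} for `data` placed at `base`."""
    return {base + i: byte for i, byte in enumerate(data)}


def emulated_strlen(memory, edi):
    """Emulate the scan with al = 0 and ecx = 0xffffffff, then not/lea."""
    ecx = 0xFFFFFFFF
    while ecx != 0:                      # repnz: repeat while ecx != 0 ...
        if edi not in memory:
            raise UnmappedAddress(hex(edi))
        byte = memory[edi]               # scas: compare al (0) with the byte at edi
        edi = (edi + 1) & 0xFFFFFFFF
        ecx = (ecx - 1) & 0xFFFFFFFF
        if byte == 0:                    # ... and stop once the byte equals al
            break
    ecx = ~ecx & 0xFFFFFFFF              # not %ecx
    return (ecx - 1) & 0xFFFFFFFF        # lea -0x1(%ecx),%edi


memory = map_bytes(0xB777C173, b"%s:%u\x00")
print(emulated_strlen(memory, 0xB777C173))   # 5
print(to_signed32(0xCCCCCCCC))               # -858993460
```

With a mapped string the scan returns its length. Calling `emulated_strlen(memory, 0xCCCCCCCC)` raises `UnmappedAddress` on the first read, which mirrors the fault at vfprintf+6398 in this simplified model.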

Okay. I'll look when I can, though I'm afraid I'm not very good at assembly-level stuff.

One more thing to consider: have you tried reproducing this while running Tor under valgrind? (See instructions in doc/HACKING for how to avoid spurious errors.) Often that can produce better stack traces than gdb for stack-corruption cases.

Replying to nickm:

One more thing to consider: have you tried reproducing this while running Tor under valgrind? (See instructions in doc/HACKING for how to avoid spurious errors.) Often that can produce better stack traces than gdb for stack-corruption cases.

I'm now trying with Valgrind, but now I'm unable to send the SIGUSR1 to Tor. Since it's not Tor's process but Valgrind's. Any ideas?

Trac:
Username: pyllyukko

In theory, if the internet can be believed, you can send a SIGUSR1 to the valgrind process. I tried it out on my linux desktop just now, and it worked okay for me.
