Opened 4 years ago

Closed 4 years ago

Last modified 18 months ago

#9602 closed defect (fixed)

Segfault in Tor 0.2.4.1[67]-rc after SIGUSR1

Reported by: pyllyukko Owned by:
Priority: Medium Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version: Tor: 0.2.4.16-rc
Severity: Keywords: 2016-bug-retrospective
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

I experienced an segfault after I upgraded my Tor to 0.2.4.16-rc.

The log entries before this follows:

Aug 27 04:40:02 tor kernel: [8051709.437719] grsec: From 38.229.70.34: Segmentation fault occurred at (nil) in /usr/bin/tor[tor:21268] uid/euid:220/220 gid/egid:220/220, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Aug 27 04:40:02 tor kernel: [8051709.437663] tor[21268] general protection ip:54fda6ae sp:5e4e0eb0 error:0 in libc-2.15.so[54f94000+17e000]

It was probably after SIGUSR1 to the process.

Unfortunately I don't have any more details about this, as it has happened only once so far. Before this, I've been running 0.2.3.25 for quite a while without any problems.

Child Tickets

Change History (40)

comment:1 Changed 4 years ago by nickm

Milestone: Tor: 0.2.4.x-final
Status: newneeds_information

Okay. If at all possible, try to get a stack trace if this happens again. Without a stack trace, there isn't much I can do to figure this out.

comment:2 Changed 4 years ago by pyllyukko

Ok. It crashed again, and now I modified ulimits to get the core dump when it does it the third time. I'm just a bit hesitant about enabling core dumps on a production server.

Here are the log lines:

Sep  4 04:40:04 tor kernel: [8742911.434324] tor[10460] general protection ip:4bf9c6ae sp:5c33df00 error:0 in libc-2.15.so[4bf56000+17e000]
Sep  4 04:40:04 tor kernel: [8742911.434384] grsec: From 176.31.156.199: Segmentation fault occurred at    (nil) in /usr/bin/tor[tor:10460] uid/euid:220/220 gid/egid:220/220, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0

It definitely has got something to do with stuff happening from cron. The logs were not rotated at that point, so again the SIGUSR1 could be the trigger to this.

Last edited 4 years ago by pyllyukko (previous) (diff)

comment:3 Changed 4 years ago by pyllyukko

Ok, so now I got some core dumps. Here is some further info.

[New LWP 8539]
[New LWP 8543]

warning: Could not load shared library symbols for linux-gate.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
Core was generated by `/usr/bin/tor -f /etc/tor/torrc'.
Program terminated with signal 11, Segmentation fault.
#0  0x50a186ae in vfprintf () from /lib/libc.so.6

bt full:

#0  0x50a186ae in vfprintf () from /lib/libc.so.6
No symbol table info available.
#1  0x50ace048 in __vsnprintf_chk () from /lib/libc.so.6
No symbol table info available.
#2  0x11272127 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

info reg:

eax            0x0	0
ecx            0xffffffff	-1
edx            0xcccccccc	-858993460
ebx            0x50b52ff4	1354051572
esp            0x5f30d360	0x5f30d360
ebp            0x5f30d8d8	0x5f30d8d8
esi            0x5f30d910	1597036816
edi            0xcccccccc	-858993460
eip            0x50a186ae	0x50a186ae <vfprintf+6398>
eflags         0x210246	[ PF ZF IF RF ID ]
cs             0x73	115
ss             0x7b	123
ds             0x7b	123
es             0x7b	123
fs             0x0	0
gs             0x33	51

The backtrace doesn't look really useful, so I'll probably need further instructions on how to dig deeper with this.

I'm able to reproduce this by sending SIGUSR1 signal to the process, it doesn't crash always with the sig, but when you send enough it's guaranteed to crash at some point.

Last edited 4 years ago by pyllyukko (previous) (diff)

comment:4 Changed 4 years ago by pyllyukko

Summary: Segfault in Tor 0.2.4.16-rcSegfault in Tor 0.2.4.1[67]-rc

comment:5 Changed 4 years ago by pyllyukko

I updated to 0.2.4.17-rc and it's still segfaulting.

comment:6 Changed 4 years ago by nickm

Hm. Maybe we can work backwards here. Is there anything about your OS or your computer or your setup or your Tor configuration or the way you're building or running Tor that might be unusual? Maybe we can figure this out by investigating why you're seeing this and other people aren'.

comment:7 Changed 4 years ago by pyllyukko

Nothing I can think of. Just a standard x86 running Slackware 14.0 + grsec kernel.

I'd like to add that I've been running this Tor node for years with similar config (of course with OS upgrades + patches), but the baseline has been similar for ages. Only until now that I upgraded Tor to 0.2.4.x it has started crashing. It's been really stable before that.

comment:8 Changed 4 years ago by pyllyukko

Summary: Segfault in Tor 0.2.4.1[67]-rcSegfault in Tor 0.2.4.1[67]-rc after SIGUSR1

comment:9 Changed 4 years ago by nickm

Hm. I don't want to go blaming the grsec thing, but I can't think what else could be unusual. But surely many other operators are using grsec too.

Are you building with any special compiler options, or using a stock package, or building from source?

comment:10 Changed 4 years ago by pyllyukko

Many operators might not be sending SIGUSR1 to the Tor process every now and then. The grsec should be easy enough to rule out, I'll just boot a regular kernel and see if it still happens.

You can see from here how it's built: http://slackbuilds.org/slackbuilds/14.0/network/tor/tor.SlackBuild
I've just added "SLKCFLAGS+=" -g"" to get the debug symbols and of course changed the version. Also changed the 'make install-strip' to 'make install', but that was only after it started crashing, as was the debug symbols.

Last edited 4 years ago by pyllyukko (previous) (diff)

comment:11 Changed 4 years ago by pyllyukko

It's not about the grsec. I'm able to reproduce it with regular kernel.

comment:12 Changed 4 years ago by nickm

That's good news. Did the -g make the stack trace any more useful this time?

comment:13 Changed 4 years ago by tmpname0901

FWIW, I had 2 v0.2.4.17-rc relays down on Sunday. They apparently did not recover from the reload ("kill -1 $PID") done when the logs were rotated (via logrotate, initiated by a cron job). I never saw this with the v0.2.3.x builds.

Sorry I can't provide more info. Just this datapoint.

comment:14 in reply to:  12 Changed 4 years ago by pyllyukko

Replying to nickm:

That's good news. Did the -g make the stack trace any more useful this time?

Unfortunately no, it still says "Backtrace stopped".

comment:15 Changed 4 years ago by pyllyukko

Bit more details:

(gdb) x/4a $ebp
0xbff95778:	0xbff957b0	0xb728c048 <__vsnprintf_chk+232>	0xbff957b0	0xb777c173
(gdb) x/20s 0xb777c173
0xb777c173:	"%s:%u"
0xb777c179:	"TLS channel (no connection)"
0xb777c195:	"TLS channel (connection %llu)"
0xb777c1b3:	"conn->chan == chan"
0xb777c1c6:	"chan->conn == conn"
0xb777c1d9:	"non-versioned"
0xb777c1e7:	"a v1"
0xb777c1ec:	"behind"
0xb777c1f3:	"ahead"
0xb777c1f9:	"<none>"
0xb777c200:	"chan->conn->link_proto >= 3"
0xb777c21c:	" NETINFO"
0xb777c225:	" AUTH_CHALLENGE"
0xb777c235:	" CERTS"
0xb777c23c:	" VERSIONS"
0xb777c246:	"chan->conn->handshake_state"
0xb777c262:	"Sending cells:"
0xb777c271:	"Couldn't send versions cell"
0xb777c28d:	"Couldn't send certs cell"
0xb777c2a6:	"Couldn't send netinfo cell"
(gdb) x/5i $pc
=> 0xb71d66ae <vfprintf+6398>:	repnz scas %es:(%edi),%al
   0xb71d66b0 <vfprintf+6400>:	movl   $0x0,-0x494(%ebp)
   0xb71d66ba <vfprintf+6410>:	not    %ecx
   0xb71d66bc <vfprintf+6412>:	lea    -0x1(%ecx),%edi
   0xb71d66bf <vfprintf+6415>:	jmp    0xb71d6412 <vfprintf+5730>
(gdb) info reg edi
edi            0xcccccccc	-858993460
(gdb) info reg al
al             0x0	0

Hope this helps. Maybe you can start looking the code, from where this is called and try to find the bug from there. I'll start digging through Tor's code as soon as I get the change, but I'm definitely not an expert, so no guarantees on any results :)

comment:16 Changed 4 years ago by nickm

Okay. I'll look when I can, though I'm afraid I'm not very good at assembly-level stuff.

One more thing to consider: have you tried reproducing this while running Tor under valgrind? (See instructions in doc/HACKING for how to avoid spurious errors.) Often that can produce better stack traces than gdb for stack-corruption cases.

comment:17 in reply to:  16 Changed 4 years ago by pyllyukko

Replying to nickm:

One more thing to consider: have you tried reproducing this while running Tor under valgrind? (See instructions in doc/HACKING for how to avoid spurious errors.) Often that can produce better stack traces than gdb for stack-corruption cases.

I'm now trying with Valgrind, but now I'm unable to send the SIGUSR1 to Tor. Since it's not Tor's process but Valgrind's. Any ideas?

comment:18 Changed 4 years ago by nickm

In theory, if the internet can be believed, you can send a SIGUSR1 to the valgrind process. I tried it out on my linux desktop just now, and it worked okay for me.

comment:19 Changed 4 years ago by arma

I just killed moria1 with -USR1, and it died. Here's the trace:

#0  0x00007f1c27a5b1b5 in *__GI_raise (sig=<value optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f1c27a5dfc0 in *__GI_abort () at abort.c:92
#2  0x00000000004fa8be in crash_handler (sig=<value optimized out>, 
    si=<value optimized out>, ctx_=0x3) at src/common/backtrace.c:130
#3  <signal handler called>
#4  0x00007f1c27a6e88d in _IO_vfprintf_internal (s=0x7fff297a8370, 
    format=<value optimized out>, ap=0x7fff297a84f0) at vfprintf.c:1617
#5  0x00007f1c27a90732 in _IO_vsnprintf (string=0x79ac60 "", 
    maxlen=<value optimized out>, format=0x53cc1a "%s:%u", args=0x7fff297a84f0)
    at vsnprintf.c:120
#6  0x00000000004fc9c4 in tor_vsnprintf (str=0x79ac60 "", size=4294961464, 
    format=0x20 <Address 0x20 out of bounds>, args=0xffffffffffffffff)
    at src/common/compat.c:385
#7  0x00000000004fca61 in tor_snprintf (
    str=0xcccccccccccccccc <Address 0xcccccccccccccccc out of bounds>, 
    size=4294961464, format=0x20 <Address 0x20 out of bounds>)
    at src/common/compat.c:366
#8  0x000000000048c11a in channel_tls_get_remote_descr_method (
    chan=<value optimized out>, flags=0) at src/or/channeltls.c:463
#9  0x0000000000484435 in channel_dump_statistics (chan=0x45727f0, severity=6)
    at src/or/channel.c:3281
#10 0x000000000048496c in channel_dumpstats (severity=6)
    at src/or/channel.c:2736
#11 0x000000000042dee0 in dumpstats (sig=<value optimized out>)
    at src/or/main.c:2207
#12 process_signal (sig=<value optimized out>) at src/or/main.c:2088
#13 0x00007f1c28796344 in event_base_loop () from /usr/lib/libevent-1.4.so.2
#14 0x000000000042c4d1 in do_main_loop () at src/or/main.c:2018
#15 0x000000000042d1f5 in tor_main (argc=<value optimized out>, 
    argv=<value optimized out>) at src/or/main.c:2882
#16 0x00007f1c27a47c8d in __libc_start_main (main=<value optimized out>, 
    argc=<value optimized out>, ubp_av=<value optimized out>, 
    init=<value optimized out>, fini=<value optimized out>, 
    rtld_fini=<value optimized out>, stack_end=0x7fff297a8b98)
    at libc-start.c:228
#17 0x000000000042aff9 in _start ()

comment:20 Changed 4 years ago by arma

Last info level log:

Feb 06 13:02:30.092 [info] Channel 36525638 (at 0x45727f0) with transport TLS channel (connection 14757395258967641292) is in state channel error (5)
Feb 06 13:02:30.092 [info]  * Channel 36525638 was created at 1391709570 (180 seconds ago) and last active at 1391709570 (180 seconds ago)
Feb 06 13:02:30.092 [info]  * Channel 36525638 says it is connected to an OR with digest 0AD519DCCA1EBCE1A5430FDED641781723F1BE0B and no known nickname

comment:21 Changed 4 years ago by arma

In valgrind, after the -USR1, here's the first complaint:

==22579== Invalid read of size 8
==22579==    at 0x48C223: channel_tls_describe_transport_method (channeltls.c:374)
==22579==    by 0x48431C: channel_dump_statistics (channel.c:3229)
==22579==    by 0x48496B: channel_dumpstats (channel.c:2736)
==22579==    by 0x42DEDF: process_signal (main.c:2207)
==22579==    by 0x52C9343: event_base_loop (in /usr/lib/libevent-1.4.so.2.1.3)
==22579==    by 0x42C4D0: do_main_loop (main.c:2018)
==22579==    by 0x42D1F4: tor_main (main.c:2882)
==22579==    by 0x5EFEC8C: (below main) (libc-start.c:228)
==22579==  Address 0x130b5210 is 144 bytes inside a block of size 320 free'd
==22579==    at 0x4C240FD: free (vg_replace_malloc.c:366)
==22579==    by 0x42E5B7: close_closeable_connections (main.c:947)
==22579==    by 0x52C9343: event_base_loop (in /usr/lib/libevent-1.4.so.2.1.3)
==22579==    by 0x42C4D0: do_main_loop (main.c:2018)
==22579==    by 0x42D1F4: tor_main (main.c:2882)
==22579==    by 0x5EFEC8C: (below main) (libc-start.c:228)

comment:22 Changed 4 years ago by arma

For extra kicks I am running Tor 0.2.5.1-alpha-dev (git-573ee36eae63962f)

comment:23 Changed 4 years ago by nickm

It looks to me like we've got a connection that was freed, but for which the corresponding channel still has a link to it.

comment:24 Changed 4 years ago by nickm

Somewhere in the whole channel_closed() call stack ... or possibly somewhere else called from connection_or_about_to_close ... we need to set the channel pointer NULL.

I have a brute-force kludge at bug9602_kludge, which is done against 0.2.4. It should also merge cleanly into master.

comment:25 Changed 4 years ago by nickm

Arma confirms that my bug9602_kludge branch no longer dies when he does a USR1 on it.

comment:26 Changed 4 years ago by andrea

Yeah, I think something like this is the right fix; this is probably being caused by a connection being closed but the channel still hanging around until channel_run_cleanup() gets to it.

comment:27 Changed 4 years ago by andrea

I think you missed tor_assert(tlschan->conn) instances in channel_tls_get_transport_name_method(), channel_tls_has_queued_writes_method() and a few other places though; if we're going to ever set conn to NULL, should we perhaps make those do something other than crash, even if we don't think they're ever going to be called in those conditions?

comment:28 Changed 4 years ago by andrea

Try my bug9602 branch.

comment:29 Changed 4 years ago by arma

nickm's bug9602_kludge and andrea's bug9602 both work for me.

comment:30 Changed 4 years ago by nickm

Resolution: fixed
Status: needs_informationclosed

Merging andrea's patch to maint-0.2.4 and master.

comment:31 Changed 4 years ago by arma

Resolution: fixed
Status: closedreopened

The plot thickens! On shutdown, my Tor running Andrea's branch asserted.

Feb 07 09:11:08.978 [notice] Clean shutdown finished. Exiting.
Feb 07 09:11:09.057 [info] or_state_save(): Saved state to "moria1/state"
Feb 07 09:11:09.548 [info] cpuworker_main(): read request failed. Exiting.
Feb 07 09:11:09.549 [info] connection_free_(): Freeing orconn at 0x69b4bf0, saw channel 0x155ae650 with ID 423724 left un-NULLed
Feb 07 09:11:09.549 [info] connection_free_(): Freeing linked Exit connection [open] with 0 bytes on inbuf, 0 on outbuf.
Feb 07 09:11:09.549 [info] connection_free_(): Freeing orconn at 0x5fb5550, saw channel 0x6319af0 with ID 221628 left un-NULLed
Feb 07 09:11:09.549 [info] connection_free_(): Freeing orconn at 0x24020620, saw channel 0x5a6e920 with ID 59814 left un-NULLed
Feb 07 09:11:09.549 [info] connection_free_(): Freeing orconn at 0x1c8b6b50, saw channel 0x22e8a9c0 with ID 423837 left un-NULLed
Feb 07 09:11:09.549 [info] connection_free_(): Freeing orconn at 0x25765c70, saw channel 0x614cd90 with ID 193228 left un-NULLed
Feb 07 09:11:09.549 [info] connection_free_(): Freeing orconn at 0x11401660, saw channel 0x6ad94e0 with ID 346149 left un-NULLed
Feb 07 09:11:09.549 [info] connection_free_(): Freeing orconn at 0x5d6db10, saw channel 0x25996270 with ID 84891 left un-NULLed
Feb 07 09:11:09.549 [info] connection_free_(): Freeing orconn at 0x6891ab0, saw channel 0x5bf0760 with ID 585161680 left un-NULLed
Feb 07 09:11:09.550 [info] connection_free_(): Freeing orconn at 0x6be89c0, saw channel 0x652e7e0 with ID 265367 left un-NULLed
Feb 07 09:11:09.550 [info] connection_free_(): Freeing orconn at 0x690b510, saw channel 0x54822a0 with ID 21468 left un-NULLed
Feb 07 09:11:09.550 [info] connection_free_(): Freeing orconn at 0x7b489f0, saw channel 0x6613530 with ID 310365 left un-NULLed
Feb 07 09:11:09.550 [info] connection_free_(): Freeing linked Directory connection [waiting for command] with 3486 bytes on inbuf, 0 on outbuf.
Feb 07 09:11:09.550 [info] connection_free_(): Freeing linked Exit connection [open] with 16423 bytes on inbuf, 0 on outbuf.
Feb 07 09:11:09.550 [info] connection_free_(): Freeing orconn at 0x28556550, saw channel 0x25065fc0 with ID 297190 left un-NULLed
Feb 07 09:11:09.550 [info] connection_free_(): Freeing orconn at 0x7a72dd0, saw channel 0x21bfa090 with ID 424411 left un-NULLed
Feb 07 09:11:09.550 [info] connection_free_(): Freeing orconn at 0x260d2ff0, saw channel 0x7cf11b0 with ID 131868 left un-NULLed
Feb 07 09:11:09.550 [info] connection_free_(): Freeing orconn at 0x5716980, saw channel 0x2588fca0 with ID 423042 left un-NULLed
Feb 07 09:11:09.550 [info] connection_free_(): Freeing orconn at 0x28955600, saw channel 0x289553d0 with ID 131927 left un-NULLed
Feb 07 09:11:09.551 [info] connection_free_(): Freeing orconn at 0x66bb880, saw channel 0x15fd0a60 with ID 425162 left un-NULLed
Feb 07 09:11:09.551 [info] connection_free_(): Freeing orconn at 0x5c439e0, saw channel 0x273908c0 with ID 117204 left un-NULLed
Feb 07 09:11:09.551 [info] connection_free_(): Freeing orconn at 0x663a250, saw channel 0x2300e810 with ID 126573 left un-NULLed
Feb 07 09:11:09.551 [info] connection_free_(): Freeing linked Directory connection [waiting for command] with 996 bytes on inbuf, 0 on outbuf.
Feb 07 09:11:09.551 [info] connection_free_(): Freeing orconn at 0x24711e80, saw channel 0x5f5b250 with ID 154510 left un-NULLed
Feb 07 09:11:09.551 [info] connection_free_(): Freeing orconn at 0x6383d50, saw channel 0x26720d10 with ID 421042 left un-NULLed
Feb 07 09:11:09.551 [info] connection_free_(): Freeing linked Exit connection [open] with 14976 bytes on inbuf, 0 on outbuf.
Feb 07 09:11:09.551 [info] cpuworker_main(): read request failed. Exiting.
Feb 07 09:11:09.551 [info] connection_free_(): Freeing orconn at 0x60d3350, saw channel 0x52298a0 with ID 150267 left un-NULLed
Feb 07 09:11:09.552 [info] connection_free_(): Freeing orconn at 0x5cd3600, saw channel 0x6a83f60 with ID 247606 left un-NULLed
Feb 07 09:11:09.552 [info] connection_free_(): Freeing linked Directory connection [waiting for command] with 0 bytes on inbuf, 0 on outbuf.
Feb 07 09:11:09.552 [info] connection_free_(): Freeing linked Directory connection [waiting for command] with 0 bytes on inbuf, 0 on outbuf.
Feb 07 09:11:09.552 [info] connection_free_(): Freeing orconn at 0x756e2a0, saw channel 0x2699c5d0 with ID 424209 left un-NULLed
Feb 07 09:11:09.552 [info] connection_free_(): Freeing linked Exit connection [open] with 16387 bytes on inbuf, 0 on outbuf.
Feb 07 09:11:09.552 [info] connection_free_(): Freeing orconn at 0x6371710, saw channel 0x5e453e0 with ID 309581 left un-NULLed
Feb 07 09:11:09.552 [info] connection_free_(): Freeing orconn at 0x5635100, saw channel 0x2540ef90 with ID 13594 left un-NULLed
Feb 07 09:11:09.552 [info] connection_free_(): Freeing orconn at 0x629d240, saw channel 0x23ad1ee0 with ID 210607 left un-NULLed
Feb 07 09:11:09.552 [info] connection_free_(): Freeing orconn at 0x77c84c0, saw channel 0x67ec370 with ID 104223 left un-NULLed
Feb 07 09:11:09.552 [info] connection_free_(): Freeing orconn at 0x789dfe0, saw channel 0x35ee0d0 with ID 110394928 left un-NULLed
Feb 07 09:11:09.552 [info] connection_free_(): Freeing orconn at 0x257652e0, saw channel 0x6c4a860 with ID 420851 left un-NULLed
Feb 07 09:11:09.553 [info] connection_free_(): Freeing orconn at 0x5fce9c0, saw channel 0x6491a90 with ID 15824 left un-NULLed
Feb 07 09:11:09.553 [info] connection_free_(): Freeing orconn at 0x5af1db0, saw channel 0x5bd1620 with ID 69650 left un-NULLed
Feb 07 09:11:09.553 [info] connection_free_(): Freeing orconn at 0x66e5ca0, saw channel 0x26c558d0 with ID 8419 left un-NULLed
Feb 07 09:11:09.553 [info] connection_free_(): Freeing orconn at 0x6b5b4d0, saw channel 0x23a795e0 with ID 365863 left un-NULLed
Feb 07 09:11:09.553 [info] connection_free_(): Freeing orconn at 0x24e92d80, saw channel 0x637d5f0 with ID 199493 left un-NULLed
Feb 07 09:11:09.553 [info] connection_free_(): Freeing linked Exit connection [open] with 16466 bytes on inbuf, 0 on outbuf.
Feb 07 09:11:09.554 [info] connection_free_(): Freeing linked Exit connection [open] with 0 bytes on inbuf, 0 on outbuf.
Feb 07 09:11:09.554 [info] connection_free_(): Freeing orconn at 0x16a592c0, saw channel 0x159708d0 with ID 417527 left un-NULLed
Feb 07 09:11:09.554 [info] connection_free_(): Freeing orconn at 0x22aae6c0, saw channel 0x24ec09c0 with ID 425063 left un-NULLed
Feb 07 09:11:09.554 [info] connection_free_(): Freeing linked Directory connection [writing] with 0 bytes on inbuf, 4094 on outbuf.
Feb 07 09:11:09.554 [info] connection_free_(): Freeing orconn at 0x5abe8b0, saw channel 0x67ff9c0 with ID 104023552 left un-NULLed
Feb 07 09:11:09.554 [info] connection_free_(): Freeing orconn at 0x76bcdb0, saw channel 0x2706cc50 with ID 626755120 left un-NULLed
Feb 07 09:11:09.554 [info] connection_free_(): Freeing orconn at 0x25b7f370, saw channel 0x7a79ff0 with ID 86479 left un-NULLed
Feb 07 09:11:09.554 [info] connection_free_(): Freeing orconn at 0x6aee8e0, saw channel 0x65cf250 with ID 425191 left un-NULLed
Feb 07 09:11:09.555 [info] connection_free_(): Freeing orconn at 0x67609f0, saw channel 0x23a01de0 with ID 425169 left un-NULLed
Feb 07 09:11:09.555 [info] connection_free_(): Freeing orconn at 0x79af5c0, saw channel 0x2868c3b0 with ID 424331 left un-NULLed
Feb 07 09:11:09.555 [info] connection_free_(): Freeing orconn at 0x5fdf470, saw channel 0x2618e4b0 with ID 151121 left un-NULLed
Feb 07 09:11:09.555 [info] connection_free_(): Freeing orconn at 0x79caf30, saw channel 0x61f9490 with ID 300089 left un-NULLed
Feb 07 09:11:09.555 [info] connection_free_(): Freeing orconn at 0x745e650, saw channel 0x67fe0b0 with ID 650459024 left un-NULLed
Feb 07 09:11:09.555 [info] connection_free_(): Freeing orconn at 0x27ec16e0, saw channel 0x7a2ff40 with ID 424989 left un-NULLed
Feb 07 09:11:09.555 [info] connection_free_(): Freeing orconn at 0x38a9f20, saw channel 0x23e6b6c0 with ID 424770 left un-NULLed
Feb 07 09:11:09.555 [info] connection_free_(): Freeing orconn at 0x6022430, saw channel 0x25f01020 with ID 421990 left un-NULLed
Feb 07 09:11:09.555 [info] connection_free_(): Freeing orconn at 0x22e24890, saw channel 0x27249180 with ID 425070 left un-NULLed
Feb 07 09:11:09.556 [info] connection_free_(): Freeing orconn at 0x5079f40, saw channel 0x5ff7050 with ID 126805 left un-NULLed
Feb 07 09:11:09.556 [info] connection_free_(): Freeing orconn at 0x25a491b0, saw channel 0x22116f90 with ID 424986 left un-NULLed
Feb 07 09:11:09.556 [info] connection_free_(): Freeing orconn at 0x747e4e0, saw channel 0x6c36cf0 with ID 423938 left un-NULLed
Feb 07 09:11:09.556 [info] connection_free_(): Freeing orconn at 0x5d46630, saw channel 0x232a6250 with ID 267096 left un-NULLed
Feb 07 09:11:09.556 [info] connection_free_(): Freeing orconn at 0x24713850, saw channel 0x28573210 with ID 424763 left un-NULLed
Feb 07 09:11:09.556 [info] connection_free_(): Freeing orconn at 0x26aa5bf0, saw channel 0xe71cf50 with ID 288727 left un-NULLed
Feb 07 09:11:09.556 [info] connection_free_(): Freeing orconn at 0x63bb1a0, saw channel 0x64f5020 with ID 293784 left un-NULLed
Feb 07 09:11:09.556 [info] connection_free_(): Freeing orconn at 0x27b03c80, saw channel 0x7a5f710 with ID 423969 left un-NULLed
Feb 07 09:11:09.556 [info] connection_free_(): Freeing orconn at 0x6646bc0, saw channel 0x62b7710 with ID 425278 left un-NULLed
Feb 07 09:11:09.557 [info] connection_free_(): Freeing orconn at 0x69f8130, saw channel 0x25ea8fd0 with ID 420942 left un-NULLed
Feb 07 09:11:09.557 [info] connection_free_(): Freeing orconn at 0x5ec3a70, saw channel 0x6da8f10 with ID 280086 left un-NULLed
Feb 07 09:11:09.557 [info] connection_free_(): Freeing orconn at 0x27a0f570, saw channel 0x27dfa920 with ID 424741 left un-NULLed
Feb 07 09:11:09.557 [info] connection_free_(): Freeing orconn at 0x2dfffe0, saw channel 0x25721080 with ID 51710 left un-NULLed
Feb 07 09:11:09.557 [info] connection_free_(): Freeing orconn at 0x2664ecc0, saw channel 0x6c5da70 with ID 425161 left un-NULLed
Feb 07 09:11:09.557 [info] connection_free_(): Freeing orconn at 0x622fc80, saw channel 0x258eb020 with ID 296957 left un-NULLed
Feb 07 09:11:09.557 [info] connection_free_(): Freeing orconn at 0x506ce90, saw channel 0x63f55a0 with ID 110020 left un-NULLed
Feb 07 09:11:09.557 [info] connection_free_(): Freeing orconn at 0x62c5490, saw channel 0x6850da0 with ID 4019 left un-NULLed
Feb 07 09:11:09.558 [info] connection_free_(): Freeing orconn at 0x573fd20, saw channel 0x260f16a0 with ID 424055 left un-NULLed
Feb 07 09:11:09.558 [info] connection_free_(): Freeing orconn at 0x5a4fd80, saw channel 0x65c31c0 with ID 420753 left un-NULLed
Feb 07 09:11:09.558 [info] connection_free_(): Freeing orconn at 0x7f014c6e0e90, saw channel 0x7f014c6e10d0 with ID 139643546766288 left un-NULLed
Feb 07 09:11:09.558 [err] channel_change_state(): Bug: src/or/channel.c:1864: channel_change_state: Assertion channel_state_is_valid(from_state) failed; aborting.
#0  0x00007f01574bb1b5 in *__GI_raise (sig=<value optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f01574bdfc0 in *__GI_abort () at abort.c:92
#2  0x000000000046ec49 in channel_change_state (chan=0x7f014c6e10d0, 
    to_state=CHANNEL_STATE_CLOSING) at src/or/channel.c:1864
#3  0x00000000004a05da in connection_free_ (conn=0x7f014c6e0e90)
    at src/or/connection.c:527
#4  0x00000000004a0af4 in connection_free_all () at src/or/connection.c:4505
#5  0x0000000000408d4a in tor_free_all (postfork=0) at src/or/main.c:2490
#6  0x0000000000408e64 in tor_cleanup () at src/or/main.c:2549
#7  0x00000000004e7f65 in consider_hibernation (now=1391782268)
    at src/or/hibernate.c:914
#8  0x000000000040c8af in run_scheduled_events (timer=<value optimized out>, 
    arg=<value optimized out>) at src/or/main.c:1183
#9  second_elapsed_callback (timer=<value optimized out>, 
    arg=<value optimized out>) at src/or/main.c:1684
#10 0x00007f01581f6344 in event_base_loop () from /usr/lib/libevent-1.4.so.2
#11 0x000000000040a0d1 in do_main_loop () at src/or/main.c:1987
#12 0x000000000040a40d in tor_main (argc=<value optimized out>, 
    argv=0x7fffc9790008) at src/or/main.c:2703
#13 0x00007f01574a7c8d in __libc_start_main (main=<value optimized out>, 
    argc=<value optimized out>, ubp_av=<value optimized out>, 
    init=<value optimized out>, fini=<value optimized out>, 
    rtld_fini=<value optimized out>, stack_end=0x7fffc978fff8)
    at libc-start.c:228
#14 0x0000000000408899 in _start ()
#2  0x000000000046ec49 in channel_change_state (chan=0x7f014c6e10d0,
    to_state=CHANNEL_STATE_CLOSING) at src/or/channel.c:1864
1864      tor_assert(channel_state_is_valid(from_state));
(gdb) print from_state
$1 = 32513
(gdb) print chan
$2 = (channel_t *) 0x7f014c6e10d0
(gdb) print *chan
$3 = {magic = 1281393552, state = 32513, global_identifier = 139643546766288, 
  registered = 0 '\000', has_been_open = 0, 
  reason_for_closing = CHANNEL_CLOSE_FOR_ERROR, timestamp_created = 0, 
  timestamp_active = 1391782255, free = 0, 
  close = 0x473c40 <channel_tls_close_method>, 
  describe_transport = 0x473b70 <channel_tls_describe_transport_method>, 
  dumpstats = 0, cell_handler = 0x48d7a0 <command_process_cell>, 
  var_cell_handler = 0x48ca80 <command_process_var_cell>, 
  get_remote_addr = 0x473a70 <channel_tls_get_remote_addr_method>, 
  get_remote_descr = 0x4738c0 <channel_tls_get_remote_descr_method>, 
  has_queued_writes = 0x4737d0 <channel_tls_has_queued_writes_method>, 
  is_canonical = 0x46fdb0 <channel_tls_is_canonical_method>, 
  matches_extend_info = 0x472420 <channel_tls_matches_extend_info_method>, 
  matches_target = 0x4722f0 <channel_tls_matches_target_method>,
  write_cell = 0x4736a0 <channel_tls_write_cell_method>,
  write_packed_cell = 0x473550 <channel_tls_write_packed_cell_method>,
  write_var_cell = 0x473420 <channel_tls_write_var_cell_method>,
  identity_digest = '\000' <repeats 19 times>, nickname = 0x0,
  next_with_same_id = {le_next = 0x0, le_prev = 0x7f014c381308},
  incoming_queue = {sqh_first = 0x0, sqh_last = 0x7f014c6e11a0},
  outgoing_queue = {sqh_first = 0x0, sqh_last = 0x7f014c6e11b0}, cmux = 0x0,
  circ_id_type = CIRC_ID_TYPE_LOWER, wide_circ_ids = 0, next_circ_id = 10355,
  num_n_circuits = 0, num_p_circuits = 0, is_bad_for_new_circs = 0,
  is_client = 0, is_incoming = 1, is_local = 0, timestamp_client = 0,
  timestamp_drained = 1391777454, timestamp_recv = 1391777454,
  timestamp_xmit = 1391777454, timestamp_last_added_nonpadding = 1391777454,
  dirreq_id = 139894, n_cells_recved = 71, n_cells_xmitted = 29}

comment:32 Changed 4 years ago by arma

Here's a valgrind of the failure, on current git master (a0577aacb4)

==23346== Invalid read of size 8
==23346==    at 0x4B1826: connection_free_ (connection.c:545)
==23346==    by 0x4B1C4B: connection_free_all (connection.c:4834)
==23346==    by 0x42B0F9: tor_free_all (main.c:2527)
==23346==    by 0x42B223: tor_cleanup (main.c:2588)
==23346==    by 0x4F3854: consider_hibernation (hibernate.c:921)
==23346==    by 0x42EFFF: second_elapsed_callback (main.c:1203)
==23346==    by 0x52C9343: event_base_loop (in /usr/lib/libevent-1.4.so.2.1.3)
==23346==    by 0x42C2C0: do_main_loop (main.c:2013)
==23346==    by 0x42CFC4: tor_main (main.c:2877)
==23346==    by 0x5EFEC8C: (below main) (libc-start.c:228)
==23346==  Address 0xb324cd8 is 8 bytes inside a block of size 352 free'd
==23346==    at 0x4C240FD: free (vg_replace_malloc.c:366)
==23346==    by 0x485DA0: channel_free_list (channel.c:899)
==23346==    by 0x4860BA: channel_free_all (channel.c:2942)
==23346==    by 0x42B0F4: tor_free_all (main.c:2526)
==23346==    by 0x42B223: tor_cleanup (main.c:2588)
==23346==    by 0x4F3854: consider_hibernation (hibernate.c:921)
==23346==    by 0x42EFFF: second_elapsed_callback (main.c:1203)
==23346==    by 0x52C9343: event_base_loop (in /usr/lib/libevent-1.4.so.2.1.3)
==23346==    by 0x42C2C0: do_main_loop (main.c:2013)
==23346==    by 0x42CFC4: tor_main (main.c:2877)
==23346==    by 0x5EFEC8C: (below main) (libc-start.c:228)

comment:33 Changed 4 years ago by andrea

Looks like adding the extra test in connection_free() means we also need to make sure orconn->chan gets properly nulled out when the channel is freed without being closed/errored out first from channel_free_all(); fix is to add a free method to channel_tls_t to do that.

comment:34 Changed 4 years ago by andrea

Updated bug9602 branch seems to fix it.

comment:35 Changed 4 years ago by andrea

Status: reopenedneeds_review

comment:36 Changed 4 years ago by arma

I see Nick merged the new commits into master. I tried them on moria1 and it looks good so far.

comment:37 Changed 4 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

They're merged in 0.2.4 as well. Closing.

comment:38 Changed 18 months ago by nickm

Keywords: 2016-bug-retrospective added

Mark bugs for 2016 bug retrospective based on hand-examination of changelogs for 0.2.5 onwards.

comment:39 Changed 18 months ago by nickm

Mark bugs for 2016 bug retrospective based on hand-examination of changelogs for 0.2.5 onwards.

comment:40 Changed 18 months ago by nickm

Mark bugs for 2016 bug retrospective based on hand-examination of changelogs for 0.2.5 onwards.

Note: See TracTickets for help on using tickets.