Opened 11 years ago

Last modified 7 years ago

#957 closed defect (Fixed)

v0.2.1.13-alpha crashing

Reported by: Falo Owned by:
Priority: High Milestone: 0.2.1.x-final
Component: Core Tor/Tor Version: 0.2.1.13-alpha
Severity: Keywords:
Cc: Falo, nickm, arma Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Tor v0.2.1.13-alpha (r18828) on Debian Lenny x86_64

My Blutmagie server process died dumping a core file. gdb backtrace output refers to eventdns.

anonymizer2:~# gdb /usr/sbin/tor ~debian-tor/.tor/core.20793
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...

warning: core file may not match specified executable file.

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libevent-1.3e.so.1...done.
Loaded symbols for /usr/lib/libevent-1.3e.so.1
Reading symbols from /usr/lib/libssl.so.0.9.8...done.
Loaded symbols for /usr/lib/libssl.so.0.9.8
Reading symbols from /usr/lib/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.8
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/librt.so.1...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Core was generated by `/usr/sbin/tor'.
Program terminated with signal 6, Aborted.
[New process 20793]
[New process 20803]
[New process 20802]
[New process 20801]
[New process 20800]
[New process 20799]
[New process 20798]
[New process 20797]
[New process 20796]
[New process 20795]
[New process 20794]
#0 0x00007f99debe8ed5 in raise () from /lib/libc.so.6
(gdb) bt
#0 0x00007f99debe8ed5 in raise () from /lib/libc.so.6
#1 0x00007f99debea3f3 in abort () from /lib/libc.so.6
#2 0x00007f99debe1dc9 in assert_fail () from /lib/libc.so.6
#3 0x00007f99df91c453 in ?? () from /usr/lib/libevent-1.3e.so.1
#4 0x00007f99df91c734 in event_add () from /usr/lib/libevent-1.3e.so.1
#5 0x0000000000496922 in evdns_request_transmit (req=0x7f99d0acee00)

at eventdns.c:482

#6 0x000000000049716a in evdns_requests_pump_waiting_queue ()

at eventdns.c:2168

#7 0x000000000049769a in reply_handle (req=0x7f99d36b3c00,

flags=<value optimized out>, ttl=<value optimized out>,
reply=<value optimized out>) at eventdns.c:826

#8 0x0000000000497f47 in nameserver_read (ns=0x7f99dff4e800)

at eventdns.c:1031

#9 0x00007f99df91cec1 in event_base_loop () from /usr/lib/libevent-1.3e.so.1
#10 0x000000000045f606 in do_main_loop () at main.c:1435
#11 0x000000000045f855 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

#12 0x00007f99debd51a6 in libc_start_main () from /lib/libc.so.6
#13 0x0000000000407689 in _start ()
(gdb)

[Automatically added by flyspray2trac: Operating System: Other Linux]

Child Tickets

Change History (64)

comment:1 Changed 11 years ago by Falo

today tor segfaulted again:

anonymizer2:~# gdb /usr/sbin/tor ~debian-tor/.tor/core.22844
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libevent-1.3e.so.1...done.
Loaded symbols for /usr/lib/libevent-1.3e.so.1
Reading symbols from /usr/lib/libssl.so.0.9.8...done.
Loaded symbols for /usr/lib/libssl.so.0.9.8
Reading symbols from /usr/lib/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.8
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/librt.so.1...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Core was generated by `/usr/sbin/tor'.
Program terminated with signal 11, Segmentation fault.
[New process 22844]
[New process 29276]
[New process 29275]
[New process 29274]
[New process 29273]
[New process 29272]
[New process 29271]
[New process 29269]
[New process 29268]
[New process 29267]
[New process 29266]
#0 0x00007fa9e4b72b4b in event_tree_RB_REMOVE_COLOR ()

from /usr/lib/libevent-1.3e.so.1

(gdb) bt
#0 0x00007fa9e4b72b4b in event_tree_RB_REMOVE_COLOR ()

from /usr/lib/libevent-1.3e.so.1

#1 0x00007fa9e4b72e93 in event_tree_RB_REMOVE ()

from /usr/lib/libevent-1.3e.so.1

#2 0x00007fa9e4b735a3 in event_del () from /usr/lib/libevent-1.3e.so.1
#3 0x00000000004971fa in request_finished (req=0x7fa9cd4ae000,

head=<value optimized out>) at eventdns.c:648

#4 0x000000000049769a in reply_handle (req=0x7fa9cd4ae000,

flags=<value optimized out>, ttl=<value optimized out>,
reply=<value optimized out>) at eventdns.c:826

#5 0x0000000000497f47 in nameserver_read (ns=0x7fa9e51a5600)

at eventdns.c:1031

#6 0x00007fa9e4b73ec1 in event_base_loop () from /usr/lib/libevent-1.3e.so.1
#7 0x000000000045f606 in do_main_loop () at main.c:1435
#8 0x000000000045f855 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

#9 0x00007fa9e3e2c1a6 in libc_start_main () from /lib/libc.so.6
#10 0x0000000000407689 in _start ()
(gdb)

comment:2 Changed 11 years ago by Falo

and here comes crash number three within two weeks:

anonymizer2:~# gdb /usr/sbin/tor ~debian-tor/.tor/core.30629
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libevent-1.3e.so.1...done.
Loaded symbols for /usr/lib/libevent-1.3e.so.1
Reading symbols from /usr/lib/libssl.so.0.9.8...done.
Loaded symbols for /usr/lib/libssl.so.0.9.8
Reading symbols from /usr/lib/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.8
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/librt.so.1...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Core was generated by `/usr/sbin/tor'.
Program terminated with signal 11, Segmentation fault.
[New process 30629]
[New process 30639]
[New process 30638]
[New process 30637]
[New process 30636]
[New process 30635]
[New process 30634]
[New process 30633]
[New process 30632]
[New process 30631]
[New process 30630]
#0 0x00007f2b60a74a88 in event_tree_RB_REMOVE_COLOR ()

from /usr/lib/libevent-1.3e.so.1

(gdb) bt
#0 0x00007f2b60a74a88 in event_tree_RB_REMOVE_COLOR ()

from /usr/lib/libevent-1.3e.so.1

#1 0x00007f2b60a74e93 in event_tree_RB_REMOVE ()

from /usr/lib/libevent-1.3e.so.1

#2 0x00007f2b60a755a3 in event_del () from /usr/lib/libevent-1.3e.so.1
#3 0x00000000004971fa in request_finished (req=0x7f2b516d2600,

head=<value optimized out>) at eventdns.c:648

#4 0x000000000049769a in reply_handle (req=0x7f2b516d2600,

flags=<value optimized out>, ttl=<value optimized out>,
reply=<value optimized out>) at eventdns.c:826

#5 0x0000000000497f47 in nameserver_read (ns=0x7f2b610a7600)

at eventdns.c:1031

#6 0x00007f2b60a75ec1 in event_base_loop () from /usr/lib/libevent-1.3e.so.1
#7 0x000000000045f606 in do_main_loop () at main.c:1435
#8 0x000000000045f855 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

#9 0x00007f2b5fd2e1a6 in libc_start_main () from /lib/libc.so.6
#10 0x0000000000407689 in _start ()
(gdb)

comment:3 Changed 11 years ago by Falo

this is crash number four within a few TB

anonymizer2:~# gdb /usr/sbin/tor ~debian-tor/.tor/core.31857
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libevent-1.3e.so.1...done.
Loaded symbols for /usr/lib/libevent-1.3e.so.1
Reading symbols from /usr/lib/libssl.so.0.9.8...done.
Loaded symbols for /usr/lib/libssl.so.0.9.8
Reading symbols from /usr/lib/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.8
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/librt.so.1...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Core was generated by `/usr/sbin/tor'.
Program terminated with signal 11, Segmentation fault.
[New process 31857]
[New process 399]
[New process 398]
[New process 397]
[New process 396]
#0 0x00007f8a078c5a88 in event_tree_RB_REMOVE_COLOR ()

from /usr/lib/libevent-1.3e.so.1

(gdb) bt
#0 0x00007f8a078c5a88 in event_tree_RB_REMOVE_COLOR ()

from /usr/lib/libevent-1.3e.so.1

#1 0x00007f8a078c5e93 in event_tree_RB_REMOVE ()

from /usr/lib/libevent-1.3e.so.1

#2 0x00007f8a078c65a3 in event_del () from /usr/lib/libevent-1.3e.so.1
#3 0x00000000004971fa in request_finished (req=0x7f89ff75d800,

head=<value optimized out>) at eventdns.c:648

#4 0x000000000049769a in reply_handle (req=0x7f89ff75d800,

flags=<value optimized out>, ttl=<value optimized out>,
reply=<value optimized out>) at eventdns.c:826

#5 0x0000000000497f47 in nameserver_read (ns=0x7f8a07ef8800)

at eventdns.c:1031

#6 0x00007f8a078c6ec1 in event_base_loop () from /usr/lib/libevent-1.3e.so.1
#7 0x000000000045f606 in do_main_loop () at main.c:1435
#8 0x000000000045f855 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

#9 0x00007f8a06b7f1a6 in libc_start_main () from /lib/libc.so.6
#10 0x0000000000407689 in _start ()
(gdb)

comment:4 Changed 11 years ago by nickm

Hm. We have a guard for the timeout event case now, so this must be on port->event, at line 1342. How weird.

comment:5 Changed 11 years ago by nickm

oh wait, that's stupid. I need to get some coffee. It must somehow be at timeout_event, which makes no sense... or
wait. That's the request's timeout. Maybe there's a bug there.

comment:6 Changed 11 years ago by Falo

is there anything I can do helping? Perhaps compiling tor with gcc -g option?

comment:7 Changed 11 years ago by nickm

First, in gdb, see if you can get the value of *req at level 3 of the stack trace (request_finished).
This would typically be a matter of "up 3; print *req" ?

Then, try replacing the 0xF0 on line 155 with 0x00 and see if that makes the problem go away?

comment:8 Changed 11 years ago by Falo

you mean replacing in line 155 of eventdns.c?
#define CLEAR(x) do { memset((x), 0xF0, sizeof(*(x))); } while(0)

gdb output
==========
crash #2
[...]
(gdb) up
#1 0x00007fa9e4b72e93 in event_tree_RB_REMOVE ()

from /usr/lib/libevent-1.3e.so.1

(gdb) up
#2 0x00007fa9e4b735a3 in event_del () from /usr/lib/libevent-1.3e.so.1
(gdb) up
#3 0x00000000004971fa in request_finished (req=0x7fa9cd4ae000,

head=<value optimized out>) at eventdns.c:648

648 del_timeout_event(req);
(gdb) print *req
$1 = {request = 0x7fa9cd4ae108 "É+\001", request_len = 50, reissue_count = 0,

tx_count = 1, request_type = 1, user_pointer = 0x7fa9e088ef80,
user_callback = 0x457740 <evdns_callback>, ns = 0x7fa9e51a5600,
search_index = 0, search_state = 0x0, search_origname = 0x0,
search_flags = 0, next = 0x7fa9da667400, prev = 0x7fa9df837200,
timeout_event_deleted = 648, timeout_event = {ev_next = {tqe_next = 0x0,

tqe_prev = 0x0}, ev_active_next = {tqe_next = 0x0, tqe_prev = 0x0},

ev_signal_next = {tqe_next = 0x0, tqe_prev = 0x0}, ev_timeout_node = {

rbe_left = 0x0, rbe_right = 0x0, rbe_parent = 0x7fa9df837268,
rbe_color = 0}, ev_base = 0x7fa9e519f000, ev_fd = -1, ev_events = 0,

ev_ncalls = 0, ev_pncalls = 0x0, ev_timeout = {tv_sec = 2141062,

tv_usec = 125369}, ev_pri = 0,

ev_callback = 0x497790 <evdns_request_timeout_callback>,
ev_arg = 0x7fa9cd4ae000, ev_res = 0, ev_flags = 128}, trans_id = 51499,

request_appended = 1 '\001', transmit_me = 0 '\0'}

crash #3
[...]
(gdb) up
#1 0x00007f8a078c5e93 in event_tree_RB_REMOVE ()

from /usr/lib/libevent-1.3e.so.1

(gdb) up
#2 0x00007f8a078c65a3 in event_del () from /usr/lib/libevent-1.3e.so.1
(gdb) up
#3 0x00000000004971fa in request_finished (req=0x7f89ff75d800,

head=<value optimized out>) at eventdns.c:648

648 del_timeout_event(req);
(gdb) print *req
$1 = {request = 0x7f89ff75d908 "Í7\001", request_len = 28, reissue_count = 0,

tx_count = 1, request_type = 1, user_pointer = 0x7f89f6dfe8a0,
user_callback = 0x457740 <evdns_callback>, ns = 0x7f8a07ef8800,
search_index = 0, search_state = 0x0, search_origname = 0x0,
search_flags = 0, next = 0x7f8a02a9dc00, prev = 0x7f89fece2400,
timeout_event_deleted = 648, timeout_event = {ev_next = {tqe_next = 0x0,

tqe_prev = 0x0}, ev_active_next = {tqe_next = 0x0, tqe_prev = 0x0},

ev_signal_next = {tqe_next = 0x0, tqe_prev = 0x0}, ev_timeout_node = {

rbe_left = 0x0, rbe_right = 0x0, rbe_parent = 0x7f89fece2468,
rbe_color = 0}, ev_base = 0x7f8a07ef2000, ev_fd = -1, ev_events = 0,

ev_ncalls = 0, ev_pncalls = 0x0, ev_timeout = {tv_sec = 2457856,

tv_usec = 641449}, ev_pri = 0,

ev_callback = 0x497790 <evdns_request_timeout_callback>,
ev_arg = 0x7f89ff75d800, ev_res = 0, ev_flags = 128}, trans_id = 52535,

request_appended = 1 '\001', transmit_me = 0 '\0'}

(gdb)

and crash #4
[...]
(gdb) up
#1 0x00007f2b60a74e93 in event_tree_RB_REMOVE ()

from /usr/lib/libevent-1.3e.so.1

(gdb) up
#2 0x00007f2b60a755a3 in event_del () from /usr/lib/libevent-1.3e.so.1
(gdb) up
#3 0x00000000004971fa in request_finished (req=0x7f2b516d2600,

head=<value optimized out>) at eventdns.c:648

648 del_timeout_event(req);
(gdb) print *req
$1 = {request = 0x7f2b516d2708 "\në\001", request_len = 31, reissue_count = 0,

tx_count = 1, request_type = 1, user_pointer = 0x7f2b544aad90,
user_callback = 0x457740 <evdns_callback>, ns = 0x7f2b610a7600,
search_index = 0, search_state = 0x0, search_origname = 0x0,
search_flags = 0, next = 0x7f2b5dd09800, prev = 0x7f2b5dddf200,
timeout_event_deleted = 648, timeout_event = {ev_next = {tqe_next = 0x0,

tqe_prev = 0x0}, ev_active_next = {tqe_next = 0x0, tqe_prev = 0x0},

ev_signal_next = {tqe_next = 0x0, tqe_prev = 0x0}, ev_timeout_node = {

rbe_left = 0x7f2b5dddf268, rbe_right = 0x0, rbe_parent = 0x7f2b5dd68068,
rbe_color = 0}, ev_base = 0x7f2b610a1000, ev_fd = -1, ev_events = 0,

ev_ncalls = 0, ev_pncalls = 0x0, ev_timeout = {tv_sec = 2297441,

tv_usec = 353938}, ev_pri = 0,

ev_callback = 0x497790 <evdns_request_timeout_callback>,
ev_arg = 0x7f2b516d2600, ev_res = 0, ev_flags = 128}, trans_id = 2795,

request_appended = 1 '\001', transmit_me = 0 '\0'}

comment:9 Changed 11 years ago by Falo

changing eventdns.c line 155 into
#define CLEAR(x) do { memset((x), 0x00, sizeof(*(x))); } while(0)
doesn't help. Today tor process just died again.

anonymizer2:~# gdb /usr/sbin/tor ~debian-tor/.tor/core.16489
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libevent-1.3e.so.1...done.
Loaded symbols for /usr/lib/libevent-1.3e.so.1
Reading symbols from /usr/lib/libssl.so.0.9.8...done.
Loaded symbols for /usr/lib/libssl.so.0.9.8
Reading symbols from /usr/lib/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.8
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/librt.so.1...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Core was generated by `/usr/sbin/tor'.
Program terminated with signal 11, Segmentation fault.
[New process 16489]
[New process 17078]
[New process 17077]
[New process 17076]
[New process 17075]
[New process 17074]
[New process 17073]
[New process 17072]
[New process 17071]
[New process 17070]
[New process 17069]
#0 0x00007f55a75f2b4b in event_tree_RB_REMOVE_COLOR ()

from /usr/lib/libevent-1.3e.so.1

(gdb) bt
#0 0x00007f55a75f2b4b in event_tree_RB_REMOVE_COLOR ()

from /usr/lib/libevent-1.3e.so.1

#1 0x00007f55a75f2e93 in event_tree_RB_REMOVE ()

from /usr/lib/libevent-1.3e.so.1

#2 0x00007f55a75f35a3 in event_del () from /usr/lib/libevent-1.3e.so.1
#3 0x000000000049715a in request_finished (req=0x7f55a3e6fa00,

head=<value optimized out>) at eventdns.c:648

#4 0x00000000004975ba in reply_handle (req=0x7f55a3e6fa00,

flags=<value optimized out>, ttl=<value optimized out>,
reply=<value optimized out>) at eventdns.c:826

#5 0x0000000000497e57 in nameserver_read (ns=0x7f55a7c26a00)

at eventdns.c:1031

#6 0x00007f55a75f3ec1 in event_base_loop () from /usr/lib/libevent-1.3e.so.1
#7 0x000000000045f606 in do_main_loop () at main.c:1435
#8 0x000000000045f855 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

#9 0x00007f55a68ad5a6 in libc_start_main () from /lib/libc.so.6
#10 0x0000000000407689 in _start () at ../sysdeps/x86_64/elf/start.S:113
(gdb) up
#1 0x00007f55a75f2e93 in event_tree_RB_REMOVE ()

from /usr/lib/libevent-1.3e.so.1

(gdb) up
#2 0x00007f55a75f35a3 in event_del () from /usr/lib/libevent-1.3e.so.1
(gdb) up
#3 0x000000000049715a in request_finished (req=0x7f55a3e6fa00,

head=<value optimized out>) at eventdns.c:648

648 del_timeout_event(req);
(gdb) print *req
$1 = {request = 0x7f55a3e6fb08 "Ä5\001", request_len = 38, reissue_count = 0,

tx_count = 1, request_type = 1, user_pointer = 0x7f559595a440,
user_callback = 0x457740 <evdns_callback>, ns = 0x7f55a7c26a00,
search_index = 0, search_state = 0x0, search_origname = 0x0,
search_flags = 0, next = 0x7f55a3e73c00, prev = 0x7f55a3dd6800,
timeout_event_deleted = 648, timeout_event = {ev_next = {tqe_next = 0x0,

tqe_prev = 0x0}, ev_active_next = {tqe_next = 0x0, tqe_prev = 0x0},

ev_signal_next = {tqe_next = 0x0, tqe_prev = 0x0}, ev_timeout_node = {

rbe_left = 0x0, rbe_right = 0x0, rbe_parent = 0x7f55a3d44e68,
rbe_color = 0}, ev_base = 0x7f55a7c20000, ev_fd = -1, ev_events = 0,

ev_ncalls = 0, ev_pncalls = 0x0, ev_timeout = {tv_sec = 2710055,

tv_usec = 581950}, ev_pri = 0,

ev_callback = 0x4976b0 <evdns_request_timeout_callback>,
ev_arg = 0x7f55a3e6fa00, ev_res = 0, ev_flags = 128}, trans_id = 50229,

request_appended = 1 '\001', transmit_me = 0 '\0'}

(gdb)

comment:10 Changed 11 years ago by Falo

Would help a core dump from v0.2.1.14-rc compiled with --enable-debug? I've got one.

comment:11 Changed 10 years ago by nickm

Would it be hard to try Libevent 1.4.10 and see if that works any better? The 1.3e red-black tree code is pretty
complicated.

comment:12 Changed 10 years ago by nickm

Oh! And if possible , can you try Libevent built with debugging symbols?

comment:13 Changed 10 years ago by nickm

(Observation: the timeval values are very strange. All the tv_sec values are in 1970.)

comment:14 Changed 10 years ago by Falo

ok, I just restartet v0.2.1.14-rc (r19307) using the most recent Debian package source libevent-1.4.9 compiled with CFLAGS=-DUSE_DEBUG.

comment:15 Changed 10 years ago by Falo

tomorrow morning tor core dumped in libevent-1.4.9

anonymizer2:~# gdb /usr/sbin/tor ~debian-tor/.tor/core.25761
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/local/lib/libevent-1.4.so.2...done.
Loaded symbols for /usr/local/lib/libevent-1.4.so.2
Reading symbols from /usr/lib/libssl.so.0.9.8...done.
Loaded symbols for /usr/lib/libssl.so.0.9.8
Reading symbols from /usr/lib/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.8
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/librt.so.1...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Core was generated by `/usr/sbin/tor'.
Program terminated with signal 11, Segmentation fault.
[New process 25761]
[New process 30098]
[New process 30097]
[New process 30096]
[New process 30095]
[New process 30094]
[New process 30093]
[New process 30091]
[New process 30090]
[New process 30089]
[New process 30088]
#0 0x00007f10ce478d37 in event_queue_insert ()

from /usr/local/lib/libevent-1.4.so.2

(gdb) bt
#0 0x00007f10ce478d37 in event_queue_insert ()

from /usr/local/lib/libevent-1.4.so.2

#1 0x00007f10ce478463 in event_active () from /usr/local/lib/libevent-1.4.so.2
#2 0x00007f10ce47887d in timeout_process ()

from /usr/local/lib/libevent-1.4.so.2

#3 0x00007f10ce4779f6 in event_base_loop ()

from /usr/local/lib/libevent-1.4.so.2

#4 0x00007f10ce477827 in event_loop () from /usr/local/lib/libevent-1.4.so.2
#5 0x000000000045f966 in do_main_loop () at main.c:1435
#6 0x000000000045fbb5 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

#7 0x00007f10cd7315a6 in libc_start_main () from /lib/libc.so.6
#8 0x00000000004076b9 in _start () at ../sysdeps/x86_64/elf/start.S:113
Current language: auto; currently asm
(gdb) up
#1 0x00007f10ce478463 in event_active () from /usr/local/lib/libevent-1.4.so.2
(gdb) up
#2 0x00007f10ce47887d in timeout_process ()

from /usr/local/lib/libevent-1.4.so.2

(gdb) up
#3 0x00007f10ce4779f6 in event_base_loop ()

from /usr/local/lib/libevent-1.4.so.2

(gdb) up
#4 0x00007f10ce477827 in event_loop () from /usr/local/lib/libevent-1.4.so.2
(gdb) up
#5 0x000000000045f966 in do_main_loop () at main.c:1435
1435 loop_result = event_loop(called_loop_once ? EVLOOP_ONCE : 0);
Current language: auto; currently c
(gdb) up
#6 0x000000000045fbb5 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

2060 result = do_main_loop();
(gdb)

comment:16 Changed 10 years ago by Falo

s/tomorrow/today :-)

comment:17 Changed 10 years ago by Falo

and again...

anonymizer2:~# gdb /usr/sbin/tor ~debian-tor/.tor/core.30154
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/local/lib/libevent-1.4.so.2...done.
Loaded symbols for /usr/local/lib/libevent-1.4.so.2
Reading symbols from /usr/lib/libssl.so.0.9.8...done.
Loaded symbols for /usr/lib/libssl.so.0.9.8
Reading symbols from /usr/lib/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.8
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/librt.so.1...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Core was generated by `/usr/sbin/tor'.
Program terminated with signal 11, Segmentation fault.
[New process 30154]
[New process 30164]
[New process 30163]
[New process 30162]
[New process 30161]
[New process 30160]
[New process 30159]
[New process 30158]
[New process 30157]
[New process 30156]
[New process 30155]
#0 0x00007fee8f3d1d37 in event_queue_insert () from /usr/local/lib/libevent-1.4.so.2
(gdb) bt
#0 0x00007fee8f3d1d37 in event_queue_insert () from /usr/local/lib/libevent-1.4.so.2
#1 0x00007fee8f3d1463 in event_active () from /usr/local/lib/libevent-1.4.so.2
#2 0x00007fee8f3d187d in timeout_process () from /usr/local/lib/libevent-1.4.so.2
#3 0x00007fee8f3d09f6 in event_base_loop () from /usr/local/lib/libevent-1.4.so.2
#4 0x00007fee8f3d0827 in event_loop () from /usr/local/lib/libevent-1.4.so.2
#5 0x000000000045f966 in do_main_loop () at main.c:1435
#6 0x000000000045fbb5 in tor_main (argc=1, argv=<value optimized out>) at main.c:2060
#7 0x00007fee8e68a5a6 in libc_start_main () from /lib/libc.so.6
#8 0x00000000004076b9 in _start () at ../sysdeps/x86_64/elf/start.S:113
Current language: auto; currently asm
(gdb) up
#1 0x00007fee8f3d1463 in event_active () from /usr/local/lib/libevent-1.4.so.2
(gdb) up
#2 0x00007fee8f3d187d in timeout_process () from /usr/local/lib/libevent-1.4.so.2
(gdb) up
#3 0x00007fee8f3d09f6 in event_base_loop () from /usr/local/lib/libevent-1.4.so.2
(gdb) up
#4 0x00007fee8f3d0827 in event_loop () from /usr/local/lib/libevent-1.4.so.2
(gdb) up
#5 0x000000000045f966 in do_main_loop () at main.c:1435
1435 loop_result = event_loop(called_loop_once ? EVLOOP_ONCE : 0);
Current language: auto; currently c

comment:18 Changed 10 years ago by nickm

Hmmn. Does installing the debugging symbols for libevent do anything? It would be neat to know what's going on in levels
1 and 2.

comment:19 Changed 10 years ago by Falo

don't know. At least debugging syms are available:

anonymizer2:~# nm --debug-syms /usr/local/lib/libevent-1.4.so.2
000000000021ef00 b .bss
0000000000000000 n .comment
000000000021e5c0 d .ctors
000000000021ee80 d .data
000000000021e600 d .data.rel.ro
0000000000000000 N .debug_abbrev
0000000000000000 N .debug_aranges
[...]

comment:20 Changed 10 years ago by nickm

At level 0 (event_queue_insert), can you tell me the values for "*base", "*ev", and "queue" ?

(If gdb won't print those out, it probably means there's another libevent-dgb package you need to install or something.)

comment:21 Changed 10 years ago by nickm

(Getting a line number in event_queue_insert could be very helpful too. The fact that gdb isn't reporting one
suggests to me that there might be a missing libevent-debug or libevent-dbg or something.)

comment:22 Changed 10 years ago by Falo

[...]
#0 0x00007fee8f3d1d37 in event_queue_insert () from /usr/local/lib/libevent-1.4.so.2
(gdb) print *base
No symbol "base" in current context.
(gdb) print *ev;
No symbol "ev" in current context.
(gdb) print queue;
No symbol "queue" in current context.

I've built libevent with "CFLAGS=-DUSE_DEBUG ./configure"
What am I missing?

comment:23 Changed 10 years ago by nickm

The -g flag for building libevent, I think. The -DUSE_DEBUG flag has nothing to do with generating debugging
symbols. I think you should just be able to build libevent with the default flags, in fact.

comment:24 Changed 10 years ago by Falo

ok, I just compiled libevent with the -g flag. Let's wait for the next crash.

comment:25 Changed 10 years ago by nickm

Another thing to see is whether setting the environment variable "EVENT_NOEPOLL" to 1 makes this bug go away.
If so, then a bugfix that just went into Libevent svn and that will appear in Libevent 1.4.11 might help.

comment:26 Changed 10 years ago by nickm

(But one thing at a time.)

comment:27 Changed 10 years ago by Falo

and here comes the next crashed with libevent compiled with -g flag. Variables are printed now.

[...]
#0 0x00007f8ef8c0bb8c in event_queue_insert (base=0x0, ev=0x7f8edf4fba68,

queue=8) at event.c:980

980 base->event_count++;
(gdb) bt
#0 0x00007f8ef8c0bb8c in event_queue_insert (base=0x0, ev=0x7f8edf4fba68,

queue=8) at event.c:980

#1 0x00007f8ef8c0b305 in event_active (ev=0x7f8edf4fba68, res=1, ncalls=1)

at event.c:843

#2 0x00007f8ef8c0b6d6 in timeout_process (base=0x7f8ef923e000) at event.c:935
#3 0x00007f8ef8c0a98d in event_base_loop (base=0x7f8ef923e000, flags=0)

at event.c:536

#4 0x00007f8ef8c0a7d6 in event_loop (flags=0) at event.c:463
#5 0x000000000045f966 in do_main_loop () at main.c:1435
#6 0x000000000045fbb5 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

#7 0x00007f8ef7ec45a6 in libc_start_main () from /lib/libc.so.6
#8 0x00000000004076b9 in _start () at ../sysdeps/x86_64/elf/start.S:113
(gdb) print *base
Cannot access memory at address 0x0
(gdb) print *ev
$1 = {ev_next = {tqe_next = 0x0, tqe_prev = 0x0}, ev_active_next = {

tqe_next = 0x0, tqe_prev = 0x0}, ev_signal_next = {tqe_next = 0x0,
tqe_prev = 0x0}, min_heap_idx = 0, ev_base = 0x0, ev_fd = 0,

ev_events = 0, ev_ncalls = 1, ev_pncalls = 0x0, ev_timeout = {tv_sec = 0,

tv_usec = 0}, ev_pri = 0, ev_callback = 0, ev_arg = 0x0, ev_res = 1,

ev_flags = 0}

(gdb) print queue
$2 = 8
(gdb)

comment:28 Changed 10 years ago by nickm

That actually helps!

The base field of the event shouldn't be null; something is zeroing it out that shouldn't be.

Time to pore over the code once more. In the mean time, let me know if running with the EVENT_NOEPOLL environment variable helps anything?

(And thanks for your patience: this bug is a hard one!)

comment:29 Changed 10 years ago by nickm

(Was this last stack trace with the change to CLEAR() I suggested on 12 april, or not?)

comment:30 Changed 10 years ago by Falo

no, I only tried it with tor-0.2.1.13 but not with tor-0.2.1.14-rc which I'm running now. Do you want me to change line 155 to
#define CLEAR(x) do { memset((x), 0x00, sizeof(*(x))); } while(0)
again?

comment:31 Changed 10 years ago by nickm

Actually, no need to make that change: the distinction between 0x00 and 0xF0 allows me to rule out some possible sources
of this very weird bug.

comment:32 Changed 10 years ago by Falo

got two additional core dumps within the last hours. The three variables have the same values like the dump I reported yesterday.

comment:33 Changed 10 years ago by Falo

this core dump looks different

#0 0x00007f1cf8c201ab in event_del (ev=0x7f1cf651e868) at event.c:806
806 evsel = base->evsel;
(gdb) bt
#0 0x00007f1cf8c201ab in event_del (ev=0x7f1cf651e868) at event.c:806
#1 0x00007f1cf8c206c3 in timeout_process (base=0x7f1cf9253000) at event.c:931
#2 0x00007f1cf8c1f98d in event_base_loop (base=0x7f1cf9253000, flags=0)

at event.c:536

#3 0x00007f1cf8c1f7d6 in event_loop (flags=0) at event.c:463
#4 0x000000000045f966 in do_main_loop () at main.c:1435
#5 0x000000000045fbb5 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

#6 0x00007f1cf7ed95a6 in libc_start_main () from /lib/libc.so.6
#7 0x00000000004076b9 in _start () at ../sysdeps/x86_64/elf/start.S:113
(gdb) print *base
Cannot access memory at address 0xf0f0f0f0f0f0f0f0
(gdb) print *ev
$1 = {ev_next = {tqe_next = 0xf0f0f0f0f0f0f0f0,

tqe_prev = 0xf0f0f0f0f0f0f0f0}, ev_active_next = {
tqe_next = 0xf0f0f0f0f0f0f0f0, tqe_prev = 0xf0f0f0f0f0f0f0f0},

ev_signal_next = {tqe_next = 0xf0f0f0f0f0f0f0f0,

tqe_prev = 0xf0f0f0f0f0f0f0f0}, min_heap_idx = 0,

ev_base = 0xf0f0f0f0f0f0f0f0, ev_fd = -252645136, ev_events = -3856,
ev_ncalls = -3856, ev_pncalls = 0xf0f0f0f0f0f0f0f0, ev_timeout = {

tv_sec = -1085102592571150096, tv_usec = -1085102592571150096},

ev_pri = -252645136, ev_callback = 0xf0f0f0f0f0f0f0f0,
ev_arg = 0xf0f0f0f0f0f0f0f0, ev_res = -252645136, ev_flags = -252645136}

(gdb) print queue
No symbol "queue" in current context.

comment:34 Changed 10 years ago by Falo

and another one

#0 0x00007fe7340ba1ab in event_del (ev=0x7fe7266bd468) at event.c:806
806 evsel = base->evsel;
(gdb) bt
#0 0x00007fe7340ba1ab in event_del (ev=0x7fe7266bd468) at event.c:806
#1 0x00007fe7340ba6c3 in timeout_process (base=0x7fe7346ed000) at event.c:931
#2 0x00007fe7340b998d in event_base_loop (base=0x7fe7346ed000, flags=0)

at event.c:536

#3 0x00007fe7340b97d6 in event_loop (flags=0) at event.c:463
#4 0x000000000045f966 in do_main_loop () at main.c:1435
#5 0x000000000045fbb5 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

#6 0x00007fe7333735a6 in libc_start_main () from /lib/libc.so.6
#7 0x00000000004076b9 in _start () at ../sysdeps/x86_64/elf/start.S:113
(gdb) print *base
Cannot access memory at address 0xaaaaaaaaaaaaaaaa
(gdb) print *ev
$1 = {ev_next = {tqe_next = 0xaaaaaaaaaaaaaaaa,

tqe_prev = 0xaaaaaaaaaaaaaaaa}, ev_active_next = {
tqe_next = 0xaaaaaaaaaaaaaaaa, tqe_prev = 0xaaaaaaaaaaaaaaaa},

ev_signal_next = {tqe_next = 0xaaaaaaaaaaaaaaaa,

tqe_prev = 0xaaaaaaaaaaaaaaaa}, min_heap_idx = 2863311530,

ev_base = 0xaaaaaaaaaaaaaaaa, ev_fd = -1431655766, ev_events = -21846,
ev_ncalls = -21846, ev_pncalls = 0xaaaaaaaaaaaaaaaa, ev_timeout = {

tv_sec = -6148914691236517206, tv_usec = -6148914691236517206},

ev_pri = -1431655766, ev_callback = 0xaaaaaaaaaaaaaaaa,
ev_arg = 0xaaaaaaaaaaaaaaaa, ev_res = -1431655766, ev_flags = -1431655766}

(gdb) print queue
No symbol "queue" in current context.

comment:35 Changed 10 years ago by nickm

Staring at this bug, a quote comes to mind:
"Oh, Mr Beeblebrox sir, you're so weird, you should be in pictures!"

-- Douglas Adams

So we use 0xf0 to clear unused events in eventds.c, and we use 0xaa to clear connections and circuits in
connection.c and circuitlist.c respectively. But what an 0xaa could possibly be doing in an event,
I have no idea!

These events are appearing in timeout_process because they are at the top of the timeout heap. But if they
are getting filled with 0xaa, that probably means that connection.c thinks that memory is part of a connection_t!
But that would imply that we freed the event structure but didn't delete it from the event base!

I will keep looking at this, but there are a couple of things I'd like you to try first if you don't mind:

  • Can you try using Libevent 1.4.11 (released yesterday)? We fixed a big epoll-related bug and I want to make sure it is not to blame here.
  • When you use Libevent 1.4.11, could you please rebuild Tor against it completely, double-checking to make sure that it is really building with the correct headers and the correct matching library?

comment:36 Changed 10 years ago by Falo

ok, I just built libevent-1.4.11 and afterwards tor from scratch with the new libevent and the correct headers in /usr/local/include/. Tor process really uses libevent-1.4.11:

anonymizer2:~# lsof -p cat /var/run/tor/tor.pid -a /usr/local/lib/libevent-1.4.so.2.1.3
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
tor 20350 debian-tor mem REG 8,1 293082 5325259 /usr/local/lib/libevent-1.4.so.2.1.3
anonymizer2:~# strings /usr/local/lib/libevent-1.4.so.2.1.3 | grep 1.4
libevent-1.4.so.2
1.4.11-stable

comment:37 Changed 10 years ago by Falo

libevent-1.4.11 seems to be evil, too. I made sure libevent-1.4.11 as well as the .h headers has been compiled into my tor 0.2.1.14-rc binary.

anonymizer2:~# lsof -p cat /var/run/tor/tor.pid -a /usr/local/lib/libevent.so
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
tor 21465 debian-tor mem REG 8,1 293082 5325259 /usr/local/lib/libevent-1.4.so.2.1.3
anonymizer2:~# strings /usr/local/lib/libevent-1.4.so.2.1.3 | grep 1.4
libevent-1.4.so.2
1.4.11-stable

and here's core dump's backtrace. It's looking like the other backtraces:

#0 0x00007f0c7058dc41 in event_queue_insert (base=0x0, ev=0x7f0c6d2bb268,

queue=8) at event.c:985

985 base->event_count++;
(gdb) bt
#0 0x00007f0c7058dc41 in event_queue_insert (base=0x0, ev=0x7f0c6d2bb268,

queue=8) at event.c:985

#1 0x00007f0c7058d35c in event_active (ev=0x7f0c6d2bb268, res=1, ncalls=1)

at event.c:848

#2 0x00007f0c7058d72d in timeout_process (base=0x7f0c70bc0000) at event.c:940
#3 0x00007f0c7058c9e4 in event_base_loop (base=0x7f0c70bc0000, flags=0)

at event.c:541

#4 0x00007f0c7058c82d in event_loop (flags=0) at event.c:468
#5 0x000000000045f966 in do_main_loop () at main.c:1435
#6 0x000000000045fbb5 in tor_main (argc=1, argv=<value optimized out>)

at main.c:2060

#7 0x00007f0c6f8465a6 in libc_start_main () from /lib/libc.so.6
#8 0x00000000004076b9 in _start () at ../sysdeps/x86_64/elf/start.S:113
(gdb) print *base
Cannot access memory at address 0x0
(gdb) print *ev
$1 = {ev_next = {tqe_next = 0x0, tqe_prev = 0x0}, ev_active_next = {

tqe_next = 0x0, tqe_prev = 0x0}, ev_signal_next = {tqe_next = 0x0,
tqe_prev = 0x0}, min_heap_idx = 0, ev_base = 0x0, ev_fd = 0,

ev_events = 0, ev_ncalls = 1, ev_pncalls = 0x0, ev_timeout = {tv_sec = 0,

tv_usec = 0}, ev_pri = 0, ev_callback = 0, ev_arg = 0x0, ev_res = 1,

ev_flags = 0}

(gdb) print queue
$2 = 8

comment:38 Changed 10 years ago by Falo

libevent-1.4.11 doesn't change anything. Tor process core dumped three times within the last 24 hours.

comment:39 Changed 10 years ago by Falo

I should have mentioned tor always was built with --enable-openbsd-malloc. I'll try it without this config option and wait what happens.

comment:40 Changed 10 years ago by nickm

Do the other Libevent-1.4.11 crashes have the same tracebacks, full of 0x0 bytes? Or do they sometimes do 0xf0 and
0xaa?

Also, here's a question I should have asked before: what were you running before 0.2.1.13-alpha that didn't have this
bug? Maybe we could isolate what version introduced it, then maybe what commit.

comment:41 Changed 10 years ago by Falo

yes, the other two libevent-1.4.11 tracebacks look full of 0x0 bytes, too. So far no 0xf0 and 0xaa.

0.2.1.13-alpha was the first version ever installed on my new x86_64 box. I don't remember having seen this issue on my old 32 bit node. So it might be 64 bit related, but on the other hand there're other x86_64 exit nodes like UBIT2 with long uptimes.

v0.2.1.14-rc with standard malloc() now runs for about half a day without crashing. Process size is 461 MB. This suggests there still might be the old glibc malloc() memory leak.

comment:42 Changed 10 years ago by Falo

compiling without --enable-openbsd-malloc doesn't change anything. This is the coredump now showing 0xf0:

#0 0x00007f07abb15c41 in event_queue_insert (base=0x0, ev=0x7f079ac9e118, queue=8)

at event.c:985

985 base->event_count++;
(gdb) bt
#0 0x00007f07abb15c41 in event_queue_insert (base=0x0, ev=0x7f079ac9e118, queue=8)

at event.c:985

#1 0x00007f07abb1535c in event_active (ev=0x7f079ac9e118, res=1, ncalls=1)

at event.c:848

#2 0x00007f07abb1572d in timeout_process (base=0x1f5b460) at event.c:940
#3 0x00007f07abb149e4 in event_base_loop (base=0x1f5b460, flags=0) at event.c:541
#4 0x00007f07abb1482d in event_loop (flags=0) at event.c:468
#5 0x000000000045f656 in do_main_loop () at main.c:1435
#6 0x000000000045f8a5 in tor_main (argc=1, argv=<value optimized out>) at main.c:2060
#7 0x00007f07aadce5a6 in libc_start_main () from /lib/libc.so.6
#8 0x00000000004073a9 in _start () at ../sysdeps/x86_64/elf/start.S:113
(gdb) print *base
Cannot access memory at address 0x0
(gdb) print *ev
$1 = {ev_next = {tqe_next = 0x45f8e0, tqe_prev = 0x7f079bc5f450}, ev_active_next = {

tqe_next = 0x8200000000, tqe_prev = 0xf0f0f0f0f0f0f0f0}, ev_signal_next = {
tqe_next = 0x95, tqe_prev = 0x7f079a0d5940}, min_heap_idx = 0, ev_base = 0x0,

ev_fd = 0, ev_events = 0, ev_ncalls = 1, ev_pncalls = 0x0, ev_timeout = {tv_sec = 0,

tv_usec = 4294967295}, ev_pri = 32879712, ev_callback = 0x12000017f5, ev_arg = 0x0,

ev_res = 1, ev_flags = 0}

(gdb) print queue
$2 = 8

comment:43 Changed 10 years ago by nickm

Okay, so 0.2.0.x worked fine, but 0.2.1.x never did. That's interesting.

Do you think you might be comfortable using "git bisect" to find which commit was the one that broke stuff?

In other news, the most recent traceback actually is pretty interesting. One part is set to 0xf0f0f0f0f0, but
the rest looks like a perfectly decent event. I wonder why.

comment:44 Changed 10 years ago by nickm

(Also, let's stick with Libevent 1.4.11 and no openbsd malloc for the time being. It is possible that there are
_multiple_ bugs happening here, so let's eliminate as many possible causes as we can.)

comment:45 Changed 10 years ago by Falo

0.2.0.x as well as 0.2.1.x worked fine on Linux i686. But 0.2.1.x never worked on x86_64.
Now I'll try 0.2.0.34 on my x86_64 box (libevent 1.4.11 and no openbsd malloc).

I'm not familiar with "git bisect". Pls give me some time to figure out what I've to do.

comment:46 Changed 10 years ago by Falo

after 22 hours 0.2.0.34 crashed with the same error like 0.2.1.14 did before.

Since there're other x86_64 nodes with long uptimes there might be something
special in my torrc file like "NumCpus 10".

comment:47 Changed 10 years ago by nickm

Are those nodes also exits?

comment:48 Changed 10 years ago by nickm

So, let me think this through some...

Lots of these tracebacks from inside event_active() from inside timeout_process() are happening when an timeout
triggers. But the RAM at which the event is stored seems to be totally munged: sometimes with 0x00 bytes,
sometimes with 0xaa, sometimes with 0xf0.

Now, if the event were clobbered before it was added, then libevent would not have been able to insert
it into the timeout queue. So the clobbering must happen after the event is clobbered, but before it triggers.

Given that the data with which the event can be clobbered seems to exist in a huge variety of forms, I would
_guess_ that it isn't a general pattern of events getting overwritten by the same thing every time. It seems
likeliest that an event is getting freed without getting deleted from the event base, then getting overwritten.

Oh! Here is a conceivable scenario. Suppose that we event_set an event, then add it, then event_set it again,
then add it again. There would then be two pointers to that event in the event base's timeout queue. When
we called event_del and freed the event, only one would actually be deleted. Gnarly!

comment:49 Changed 10 years ago by nickm

If my most recent comment guessed the cause right, and there were no mistakes in my analysis of where in the
code base this could occur, then the most recent versions of the maint-0.2.1 and master branches in the Git
repo might just fix this bug.

comment:50 Changed 10 years ago by Falo

yes, there are x86_64 exits with a long uptime.

I just compiled the version from the repo. Let's see what happens.

comment:51 Changed 10 years ago by Falo

there's some debugging output produced by the new code:

May 19 04:40:00.554 [warn] eventdns: BUG: Event that I thought was non-added as of line 2112 was actually added on line 0
May 19 04:51:54.966 [warn] eventdns: BUG: Event that I thought was non-added as of line 2112 was actually added on line 0

comment:52 Changed 10 years ago by nickm

That line means that my diagnosis of the bug is possible. Hooray.

comment:53 Changed 10 years ago by nickm

Any news?

comment:54 Changed 10 years ago by Falo

no news is good news :-)

The patched version didn't crash so far, being an indication the bug is really fixed.
I'll keep you posted.

comment:55 Changed 10 years ago by nickm

Let the record indicate that I am doing a happy little dance.

comment:56 Changed 10 years ago by nickm

So, assuming that this was the solution, I think the right fix for 0.2.1 (and for 0.2.0) is to just keep the
current detect-and-workaround code we have, but downgrade the warnings. It's not the pretties solution, but
it is known to work and won't introduce any new bugs.

For 0.2.2, we should probably drop the whole CLEAR() bug detection business, and only set up the timeout events
when their corresponding structures are first allocated. We can have an extra flag that tracks whether the event
is currently active in order to detect the double-delete and double-add cases.

comment:57 Changed 10 years ago by Falo

yes, I think the bug is solved. No crashes and core dumps anymore.

comment:58 Changed 10 years ago by nickm

Hooray! Thank you so much for all your patience and work on this, Olaf!

comment:59 Changed 10 years ago by Falo

Thank you for all your support on this issue, Nick!

Within its 8 weeks of operation the new anonymizer2.blutmagie.de moved 25 TB tor traffic. The E8600 cpu is much stronger than the old P4 Xeons anonymizer.blutmagie.de is equipped with.

anonymizer2:~# ifconfig eth0 | grep bytes

RX bytes:27436550504868 (24.9 TiB) TX bytes:27990353508331 (25.4 TiB)

comment:60 Changed 10 years ago by nickm

That is amazing. BTW, I hate to impose more on your time, but I had to do three versions of the fix: the one
you're using in 0.2.1, one for 0.2.0, and one for 0.2.2. There's no hurry here at all, but if you could see
whether the maint-0.2.0 and the master branch do or don't have this bug for you, that would be grand. Please
don't put a very high priority on it, though.

comment:61 Changed 10 years ago by Falo

ok, I'll do so. In the meantime some more log warnings appeared but tor still didn't crash.

May 23 00:47:39.080 [warn] eventdns: BUG: Event that I thought was non-added as of line 2112 was actually added on line 0
May 23 00:47:39.080 [warn] eventdns: BUG: Event that I thought was non-added as of line 2112 was actually added on line 0
May 23 03:58:32.258 [warn] eventdns: BUG: Event that I thought was non-added as of line 2112 was actually added on line 0

comment:62 Changed 10 years ago by nickm

Yeah. Those warnings are unnecessary now; I've turned them off. In effect, all they mean now is, "I just stopped
Tor from crashing."

I'm going to close this bug as fixed, at long long last. Please mark for reopening or open a new bug if it shows
up again.

Thanks again!

comment:63 Changed 10 years ago by nickm

flyspray2trac: bug closed.

comment:64 Changed 7 years ago by nickm

Component: Tor RelayTor
Note: See TracTickets for help on using tickets.