Opened 12 years ago

Last modified 7 years ago

#451 closed defect (Fixed)

Bug: main.c:367: connection_stop_writing: Assertion conn->write_event failed; aborting

Reported by: weasel Owned by:
Priority: Low Milestone: 0.2.0.x-final
Component: Core Tor/Tor Version: 0.2.0.2-alpha
Severity: Keywords:
Cc: weasel, nickm, arma, fk Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

On tor26, after running for about 2 weeks, I get:

[err] Bug: main.c:367: connection_stop_writing: Assertion conn->write_event failed; aborting.

tor26 is:

  • r10579 -O0 ../../patch-debian-rules-DDEBUG_ROUTERLIST+INSTRUMENTDOWNLOADS ../../patch-tor-docs ../../patch-routerlist--two-asserts # just assert pointers added to the rl->routers list are even (%2==0) ../../patch-routerlist--shouldrebuildstore # rebuild store iff stats->journal_len > (5000) ../../patch-routerlist--writeoutforeach # in routerlist_assert_ok write out the first instance of the SMARTLIST_FOREACH makro

(gdb) bt
#0 0xffffe410 in kernel_vsyscall ()
#1 0xb7d04885 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7d06002 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x080a0c0f in connection_stop_writing (conn=0xed58948) at main.c:367
#4 0x0807b8f4 in connection_or_finished_flushing (conn=0xed58948) at connection_or.c:283
#5 0x08071afe in connection_finished_flushing (conn=0xed58948) at connection.c:2601
#6 0x08070ad1 in connection_handle_write (conn=0xed58948, force=0) at connection.c:2144
#7 0x080a14a5 in conn_write_callback (fd=241, events=4, _conn=0xed58948) at main.c:525

(gdb) p *conn
$1 = {magic = 2100428547, type = 4 '\004', state = 5 '\005', purpose = 0 '\0',

read_blocked_on_bw = 0, write_blocked_on_bw = 0,
hold_open_until_flushed = 0, inbuf_reached_eof = 0, edge_has_sent_end = 0,
edge_blocked_on_circ = 0, or_is_obsolete = 0, chosen_exit_optional = 0,
s = -1, conn_array_index = 561, read_event = 0x0, write_event = 0x0,
inbuf = 0xae38088, outbuf = 0xa0ee8a0, outbuf_flushlen = 0,
timestamp_lastread = 1182903016, timestamp_lastwritten = 1182903023,
timestamp_created = 1182869105, socket_family = 2, addr = 2331677697,
port = 41178, marked_for_close = 2087,
marked_for_close_file = 0x80fb026 "connection.c",
address = 0x8c44208 "138.250.148.1", linked_conn = 0x0, linked = 0,
reading_from_linked_conn = 0, writing_to_linked_conn = 0,
active_on_link = 0, dns_server_port = 0x0}

Let me know if there's any more info you might want to have.

[Automatically added by flyspray2trac: Operating System: All]

Child Tickets

Change History (10)

comment:1 Changed 12 years ago by nickm

So that's an OR connection in state OPEN. It was marked for close on line 2087 of connection.c:
that'll be for a TLS_ERROR of some kind, which comes along with a close_immediate, which in turn explains how
the socket is -1 and the events are NULL. But I'm confused by how we actually got to execute the _rest_ of
connection_handle_write, since right after the mark_for_close on 2087, we return -1.

I guess there must be two calls to handle_write, the current one (that's calling finished_flushing) and an
earlier one (that called mark_for_close). But the current one is getting called from the libevent callback--
that's just weird.

comment:2 Changed 12 years ago by weasel

again:

Jul 11 20:05:04.153 [err] Bug: main.c:372: connection_stop_writing: Assertion conn->write_event failed; aborting.

  • r10753
  • ../../patch-debian-rules-DDEBUG_ROUTERLIST+INSTRUMENTDOWNLOADS ../../patch-routerlist--writeoutforeach ../../patch-tor-docs

(gdb) bt
#0 0xffffe410 in kernel_vsyscall ()
#1 0xb7cc1885 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7cc3002 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x08098434 in connection_stop_writing (conn=0x80f9706) at main.c:380
#4 0x08077aab in connection_or_finished_flushing (conn=0xeb7a990)

at connection_or.c:283

#5 0x0806d2f6 in connection_handle_write (conn=0xeb7a990, force=0)

at connection.c:2144

#6 0x08098c77 in conn_write_callback (fd=2346, events=4, _conn=0xeb7a990)

at main.c:530

#7 0xb7f13c79 in event_base_priority_init () from /usr/lib/libevent-1.1a.so.1
#8 0xb7f13f65 in event_base_loop () from /usr/lib/libevent-1.1a.so.1
#9 0xb7f13dcb in event_loop () from /usr/lib/libevent-1.1a.so.1
#10 0x0809a723 in do_main_loop () at main.c:1369
#11 0x0809b82d in tor_main (argc=0, argv=0x0) at main.c:2606
#12 0x080c6d8b in main (argc=0, argv=0x0) at tor_main.c:28
(gdb) up
..

The conn struct at #3 is completely broken. The magic is wrong, and
->write_event is not null:

#3 0x08098434 in connection_stop_writing (conn=0x80f9706) at main.c:380
380 log_warn(LD_NET, "Error from libevent setting write event state for %d "
(gdb) p conn
$1 = (connection_t *) 0x80f9706
(gdb) p *conn
$2 = {magic = 1852731235, type = 45 '-', state = 62 '>', purpose = 119 'w',

read_blocked_on_bw = 0, write_blocked_on_bw = 1,
hold_open_until_flushed = 0, inbuf_reached_eof = 0, edge_has_sent_end = 1,
edge_blocked_on_circ = 1, or_is_obsolete = 1, chosen_exit_optional = 0,
s = 1600484457, conn_array_index = 1852143205, read_event = 0x6f630074,
write_event = 0x63656e6e, inbuf = 0x6e6f6974, outbuf = 0x6174735f,
outbuf_flushlen = 2002744434, timestamp_lastread = 1769236850,
timestamp_lastwritten = 1660970862, timestamp_created = 762211951,
socket_family = 1852402750, addr = 543450475, port = 15677,
marked_for_close = 12576,
marked_for_close_file = 0x6e6f6300 <Address 0x6e6f6300 out of bounds>,
address = 0x65725f6e <Address 0x65725f6e out of bounds>,
linked_conn = 0x635f6461, linked = 1, reading_from_linked_conn = 0,
writing_to_linked_conn = 0, active_on_link = 0, dns_server_port = 0x6b6361}

going one step further up we get a sane conn, with a proper magic, so
clearly gdb is confused about something again.

(gdb) p conn
$3 = (or_connection_t *) 0xeb7a990
(gdb) p *conn
$4 = {_base = {magic = 2100428547, type = 4 '\004', state = 5 '\005',

purpose = 0 '\0', read_blocked_on_bw = 0, write_blocked_on_bw = 0,
hold_open_until_flushed = 0, inbuf_reached_eof = 0, edge_has_sent_end = 0,
edge_blocked_on_circ = 0, or_is_obsolete = 0, chosen_exit_optional = 0,
s = -1, conn_array_index = 637, read_event = 0x0, write_event = 0x0,
inbuf = 0x84a48e0, outbuf = 0xa28ec30, outbuf_flushlen = 0,
timestamp_lastread = 1184177102, timestamp_lastwritten = 1184177104,
timestamp_created = 1184176719, socket_family = 2, addr = 3480443501,
port = 64442, marked_for_close = 2087,
marked_for_close_file = 0x80e8a6a "connection.c",
address = 0xd976208 "207.115.90.109", linked_conn = 0x0, linked = 0,
reading_from_linked_conn = 0, writing_to_linked_conn = 0,
active_on_link = 0, dns_server_port = 0x0},

identity_digest = "*2/����\204Ƿ\000\222�G����\202�",
nickname = 0x8ab7be0 "$2A322FFFECEDD884C7B70092E047CDE5EEDD82F5",
tls = 0xe7abef0, tls_error = 0, client_used = 0 '\0',
circ_id_type = CIRC_ID_TYPE_LOWER, next_circ_id = 13643,
timestamp_lastempty = 1184177034, bandwidthrate = 3145728,
bandwidthburst = 10485760, read_bucket = 10485760, n_circuits = 3,
active_circuits = 0x9365780, next_with_same_id = 0x0}

(gdb) p &conn->_base
$5 = (connection_t *) 0xeb7a990

Good night

comment:3 Changed 12 years ago by johndoe32102002

I am getting this same problem every 24 hours. It is shutting down my tor server every 24 hours or less at random times.

main.c:367 connection_stop_writing: Assertion conn-->write_event_failed; aborting. Aborted.

It seems to only be in the alpha version. It is not in the stable.

Running:
Plenty of free space
Linux
Compiled from source

comment:4 Changed 12 years ago by nickm

John DOe: Can you get a stack trace for this?

comment:5 Changed 12 years ago by weasel

on r10953:

Jul 31 06:15:31.400 [err] Bug: main.c:372: connection_stop_writing: Assertion conn->write_event failed; aborting.

(gdb) bt
#0 0xffffe410 in kernel_vsyscall ()
#1 0xb7d3a885 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7d3c002 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x0809ae64 in connection_stop_writing (conn=0x80ff506) at main.c:380
#4 0x0807847b in connection_or_finished_flushing (conn=0x8e5adf8)

at connection_or.c:283

#5 0x0806d8a6 in connection_handle_write (conn=0x8e5adf8, force=0)

at connection.c:2154

#6 0x0809b6a7 in conn_write_callback (fd=427, events=4, _conn=0x8e5adf8)

at main.c:530

#7 0xb7f8cc79 in event_base_priority_init () from /usr/lib/libevent-1.1a.so.1
#8 0xb7f8cf65 in event_base_loop () from /usr/lib/libevent-1.1a.so.1
#9 0xb7f8cdcb in event_loop () from /usr/lib/libevent-1.1a.so.1
#10 0x0809d193 in do_main_loop () at main.c:1379
#11 0x0809e2cd in tor_main (argc=0, argv=0x0) at main.c:2624
#12 0x080ca29b in main (argc=0, argv=0x0) at tor_main.c:28
(gdb) up
#1 0xb7d3a885 in raise () from /lib/tls/i686/cmov/libc.so.6
(gdb)
#2 0xb7d3c002 in abort () from /lib/tls/i686/cmov/libc.so.6
(gdb)
#3 0x0809ae64 in connection_stop_writing (conn=0x80ff506) at main.c:380
380 log_warn(LD_NET, "Error from libevent setting write event state fo
r %d "
(gdb) p *conn
$1 = {magic = 1852731235, type = 45 '-', state = 62 '>', purpose = 119 'w',

read_blocked_on_bw = 0, write_blocked_on_bw = 1,
hold_open_until_flushed = 0, inbuf_reached_eof = 0, edge_has_sent_end = 1,
edge_blocked_on_circ = 1, or_is_obsolete = 1, chosen_exit_optional = 0,
s = 1600484457, conn_array_index = 1852143205, read_event = 0x6f630074,
write_event = 0x63656e6e, inbuf = 0x6e6f6974, outbuf = 0x6174735f,
outbuf_flushlen = 2002744434, timestamp_lastread = 1769236850,
timestamp_lastwritten = 1660970862, timestamp_created = 762211951,
socket_family = 1852402750, addr = 543450475, port = 15677,
marked_for_close = 12576,
marked_for_close_file = 0x6e6f6300 <Address 0x6e6f6300 out of bounds>,
address = 0x65725f6e <Address 0x65725f6e out of bounds>,
linked_conn = 0x635f6461, linked = 1, reading_from_linked_conn = 0,
writing_to_linked_conn = 0, active_on_link = 0, dns_server_port = 0x6b6361}

(gdb) up
#4 0x0807847b in connection_or_finished_flushing (conn=0x8e5adf8)

at connection_or.c:283

283 connection_stop_writing(TO_CONN(conn));
(gdb) p *conn
$2 = {_base = {magic = 2100428547, type = 4 '\004', state = 5 '\005',

purpose = 0 '\0', read_blocked_on_bw = 0, write_blocked_on_bw = 0,
hold_open_until_flushed = 0, inbuf_reached_eof = 0, edge_has_sent_end = 0,
edge_blocked_on_circ = 0, or_is_obsolete = 0, chosen_exit_optional = 0,
s = -1, conn_array_index = 1603, read_event = 0x0, write_event = 0x0,
inbuf = 0x86730d8, outbuf = 0x94e2be8, outbuf_flushlen = 0,
timestamp_lastread = 1185855331, timestamp_lastwritten = 1185855331,
timestamp_created = 1185851652, socket_family = 2, addr = 3269568014,
port = 49579, marked_for_close = 2097,
marked_for_close_file = 0x80edd4a "connection.c",
address = 0xad56a00 "194.225.166.14", linked_conn = 0x0, linked = 0,
reading_from_linked_conn = 0, writing_to_linked_conn = 0,
active_on_link = 0, dns_server_port = 0x0},

identity_digest = "Z#���D\025꼮���\a�\211\004\034\002�",
nickname = 0xa837d28 "$5A23D8E0E64415EABCAECBF2D107F389041C02F2",
tls = 0xcb7bce0, tls_error = 0, client_used = 0 '\0',
circ_id_type = CIRC_ID_TYPE_LOWER, next_circ_id = 20030,
timestamp_lastempty = 1185855320, bandwidthrate = 3145728,
bandwidthburst = 10485760, read_bucket = 10485174, n_circuits = 5,
active_circuits = 0x0, next_with_same_id = 0x0}

interesting how conn at #3 is as broken as in the original report.

comment:6 Changed 12 years ago by fk

With 0.2.0.6-alpha (r11277) running on FreeBSD 7.0-CURRENT i386:

Sep 17 19:08:08.881 [err] Bug: main.c:372: connection_stop_writing: Assertion conn->write_event failed; aborting.
Sep 17 23:57:37.509 [err] Bug: main.c:372: connection_stop_writing: Assertion conn->write_event failed; aborting.

I got the the first one with a stripped and optimized
build so the back trace is worthless. I was better prepared
for the second one though:

(gdb) where
#0 0x2837a537 in thr_kill () from /lib/libc.so.7
#1 0x2832e196 in pthread_kill () from /lib/libthr.so.3
#2 0x2832be7a in raise () from /lib/libthr.so.3
#3 0x2840a76a in abort () from /lib/libc.so.7
#4 0x080acb22 in connection_stop_writing (conn=0x28df6420) at main.c:372
#5 0x0807f8ed in connection_or_finished_flushing (conn=0x28df6420) at connection_or.c:283
#6 0x08074cf2 in connection_finished_flushing (conn=0x28df6420) at connection.c:2651
#7 0x08073a3b in connection_handle_write (conn=0x28df6420, force=0) at connection.c:2170
#8 0x08073d8c in _connection_write_to_buf_impl (string=0x2b7383c4 "[scrubbed]", len=512,

conn=0x28df6420, zlib=0) at connection.c:2265

#9 0x080b6a97 in connection_write_to_buf (string=0x2b7383c4 "[scrubbed]", len=512, conn=0x28df6420)

at or.h:2482

#10 0x080b8480 in connection_or_flush_from_first_active_circuit (conn=0x28df6420, max=1) at relay.c:1826
#11 0x0807f7b3 in connection_or_flushed_some (conn=0x28df6420) at connection_or.c:255
#12 0x08074c18 in connection_flushed_some (conn=0x28df6420) at connection.c:2631
#13 0x08073a02 in connection_handle_write (conn=0x28df6420, force=0) at connection.c:2165
#14 0x080ad450 in conn_write_callback (fd=151, events=4, _conn=0x28df6420) at main.c:530
#15 0x281793de in event_base_loop () from /usr/local/lib/libevent-1.3b.so.1
#16 0x281795a9 in event_loop () from /usr/local/lib/libevent-1.3b.so.1
#17 0x080af1db in do_main_loop () at main.c:1409
#18 0x080b0649 in tor_main (argc=15, argv=0xbfbfeb90) at main.c:2660
#19 0x080e82b2 in main (argc=Cannot access memory at address 0x3
) at tor_main.c:28
(gdb) f 4
[...]
(gdb) p *conn
$1 = {magic = 2100428547, state = 5 '\005', type = 4 '\004', purpose = 0 '\0', read_blocked_on_bw = 0,

write_blocked_on_bw = 0, hold_open_until_flushed = 0, inbuf_reached_eof = 0, edge_has_sent_end = 0,
edge_blocked_on_circ = 0, or_is_obsolete = 0, chosen_exit_optional = 0, linked = 0,
reading_from_linked_conn = 0, writing_to_linked_conn = 0, active_on_link = 0, s = -1, conn_array_index = 293,
read_event = 0x0, write_event = 0x0, inbuf = 0x28c9b4a0, outbuf = 0x28d0ae80, outbuf_flushlen = 0,
timestamp_lastread = 1190066256, timestamp_lastwritten = 1190066257, timestamp_created = 1190054656,
socket_family = 2, addr = 1427736811, port = 59001, marked_for_close = 2113,
marked_for_close_file = 0x810e39a "connection.c", address = 0x2856ce10 "[scrubbed]", linked_conn = 0x0,
dns_server_port = 0x0}

"[scrubbed]" added by me.

conn->write_event is 0x0 in all the frames.

comment:7 Changed 12 years ago by nickm

0.6-alpha, since that's where
the two most recent reports I've seen are from):

The connections are marked for close on connection.c:2113. That's
in connection_handle_write(), when flush_buf_tls() returns a TLS
error or a TLS close. Then, connection_handle_write returns -1.
Call this invocation of connection_handle_write [1].

The crash occurs nested in a call from connection_handle_write() on line

  1. This can't be the same invocation as [1], since [1] returned immediately after marking the connection. Call the crashing invocation [2]. [2] can't be in a subsequent call to connection_handle_write(), since connection_handle_write() returns 0 immediately if it's called with a marked-for-close connection. But [2] can't be in an earlier call than [1], or [1] would have returned for the same reason. Therefore, the call to [1] must be nested in a call for [2], as in:

connection_handle_write(conn) [2]

...

...

connection_handle_write(conn) [1]

mark-for-close on line 2113
return -1

call connection_finished_flushing(conn) on line 2170

all hell breaks lose.

So, where is [1] exactly? Who calls connection_handle_write()?

main.c:conn_write_callback()? Not likely; this is only invoked from
libevent.

control.c:send_control_event_string()? Also not likely; this is only
supposed to be called for control connections.

connection.c:_connection_write_to_buf_impl()? Aha.

comment:8 Changed 12 years ago by nickm

On furthre analysis, I think the bug happens like this:

connection_handle_write(conn) [2]

connection_flushed_some()

connection_or_flushed_some()

...
connection_write_to_buf()

connection_handle_write(conn) [1]

mark-for-close on line 2113
return -1

call connection_finished_flushing(conn) on line 2170

all hell breaks lose.

So, there are two logical fixes:

A) Don't call connection_handle_write from a connection_write_to_buf_impl() that's nested inside a

connection_flushed_some(). This is silly anyway and probably leads to unfair load balancing.

B) Check whether the connection's closed before calling connection_finished_flushing().

I'm definitely going to do B, and possibly A as well.

comment:9 Changed 12 years ago by nickm

flyspray2trac: bug closed.
Fixed in r11882. Fix should appear in 0.2.0.8-alpha and 0.1.2.18.

comment:10 Changed 7 years ago by nickm

Component: Tor RelayTor
Note: See TracTickets for help on using tickets.