Opened 7 years ago

Closed 7 years ago

#9017 closed defect (fixed)

Crash in assert_connection_ok

Reported by: usuality Owned by:
Priority: High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version: Tor: 0.2.3.25
Severity: Keywords: 023-backport
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

After somewhere between a few and 24 hours, Tor on my exit node crashes with the same log message. This has happened every day for the past 3.

Jun 4 08:14:47 tor Tor[24513]: connection_exit_connect(): Bug: newly connected conn had data waiting!

Jun 4 08:14:57 tor Tor[24513]: assert_connection_ok(): Bug: connection.c:4066: assert_connection_ok: Assertion (conn->type == CONN_TYPE_EXIT && conn->state == EXIT_CONN_STATE_RESOLVING)
connection_is_writing(conn) conn->write_blocked_on_bw (CONN_IS_EDGE(conn) && TO_EDGE_CONN(conn)->edge_blocked_on_circ) failed; aborting.

Jun 04 15:48:12.279 [notice] Tor v0.2.3.25 (git-17c24b3118224d65) running on OpenBSD. (5.3, amd64)
Jun 04 15:48:12.289 [notice] Initialized libevent version 1.4.14b-stable using method kqueue. Good.

Child Tickets

Attachments (3)

9017.patch (550 bytes) - added by nickm 7 years ago.
connect-test.c (3.5 KB) - added by andrea 7 years ago.
Test case for async socket connect() behavior
9017_v2.patch (952 bytes) - added by nickm 7 years ago.

Download all attachments as: .zip

Change History (28)

comment:1 Changed 7 years ago by nickm

Status: newneeds_information

If you can run it under a debugger to find a stack trace, and the value of *conn at the point where the assertion happens, that would really help.

comment:2 Changed 7 years ago by usuality

What other information do you need from here?

Program received signal SIGPIPE, Broken pipe.
0x00000bfba90b5b1a in sendto () at <stdin>:2
2       <stdin>: No such file or directory.
        in <stdin>
Current language:  auto; currently asm
(gdb) where
#0  0x00000bfba90b5b1a in sendto () at <stdin>:2
#1  0x00000bfb9fbabcbd in sendto (fd=192, msg=0xbfba7bae020, len=219, flags=0, to=0x0, tolen=0)
    at /usr/src/lib/librthread/rthread_cancel.c:526
#2  0x00000bf99b9826c0 in flush_chunk (s=192, buf=0xbfba8232000, chunk=0xbfba7bae000, sz=219, 
    buf_flushlen=0xbfba6bb6630) at buffers.c:791
#3  0x00000bf99b982589 in flush_buf (s=192, buf=0xbfba8232000, sz=219, buf_flushlen=0xbfba6bb6630)
    at buffers.c:877
#4  0x00000bf99b9c3b87 in connection_handle_write_impl (conn=0xbfba6bb6600, force=0) at connection.c:3283
#5  0x00000bf99b9c3d39 in connection_handle_write (conn=0xbfba6bb6600, force=0) at connection.c:3340
#6  0x00000bf99b9114b8 in conn_write_callback (fd=192, events=4, _conn=0xbfba6bb6600) at main.c:738
#7  0x00000bfba33f1a38 in event_base_loop (base=0xbfba6ca0c00, flags=Variable "flags" is not available.
) at /usr/src/lib/libevent/event.c:402
#8  0x00000bf99b913cae in do_main_loop () at main.c:1959
#9  0x00000bf99b915341 in tor_main (argc=1, argv=0x7f7ffffc9000) at main.c:2652
#10 0x00000bf99b90f8ab in main (argc=1, argv=0x7f7ffffc9000) at tor_main.c:30
(gdb) 

comment:3 Changed 7 years ago by usuality

Oops, that wasn't actually the crash. Here it is:

connection.c:4066 assert_connection_ok: Assertion (conn->type == CONN_TYPE_EXIT && conn->state == EXIT_CONN_STATE_RESOLVING) || connection_is_writing(conn) || conn->write_blocked_on_bw || (CONN_IS_EDGE(conn) && TO_EDGE_CONN(conn)->edge_blocked_on_circ) failed; aborting.

Program received signal SIGABRT, Aborted.
0x00000bfba90c074a in kill () at <stdin>:2
2       <stdin>: No such file or directory.
        in <stdin>
(gdb) where                                                                                                  
#0  0x00000bfba90c074a in kill () at <stdin>:2
#1  0x00000bfba91247ca in abort () at /usr/src/lib/libc/stdlib/abort.c:70
#2  0x00000bf99b9c650c in assert_connection_ok (conn=0xbfba11eef00, now=0) at connection.c:4061
#3  0x00000bf99b9bd720 in _connection_mark_for_close (conn=0xbfba11eef00, line=1172, 
    file=0xbf99bb55f5b "relay.c") at connection.c:673
#4  0x00000bf99b92efe5 in connection_edge_process_relay_cell (cell=0x7f7ffffc8870, circ=0xbfba42e7e00, 
    conn=0xbfba11eef00, layer_hint=0x0) at relay.c:1172
#5  0x00000bf99b92b647 in circuit_receive_relay_cell (cell=0x7f7ffffc8870, circ=0xbfba42e7e00, 
    cell_direction=CELL_DIRECTION_OUT) at relay.c:192
#6  0x00000bf99b9a493a in command_process_relay_cell (cell=0x7f7ffffc8870, conn=0xbfba2a15200)
    at command.c:576
#7  0x00000bf99b9a378e in command_process_cell (cell=0x7f7ffffc8870, conn=0xbfba2a15200) at command.c:192
#8  0x00000bf99b9d7551 in connection_or_process_cells_from_inbuf (conn=0xbfba2a15200) at connection_or.c:1882
#9  0x00000bf99b9d3e32 in connection_or_process_inbuf (conn=0xbfba2a15200) at connection_or.c:395
#10 0x00000bf99b9c5231 in connection_process_inbuf (conn=0xbfba2a15200, package_partial=1)
    at connection.c:3827
#11 0x00000bf99b9c2b3a in connection_handle_read_impl (conn=0xbfba2a15200) at connection.c:2706
#12 0x00000bf99b9c2c48 in connection_handle_read (conn=0xbfba2a15200) at connection.c:2748
#13 0x00000bf99b9112d0 in conn_read_callback (fd=130, event=2, _conn=0xbfba2a15200) at main.c:705
#14 0x00000bfba33f1a38 in event_base_loop (base=0xbfba6ca0c00, flags=Variable "flags" is not available.
) at /usr/src/lib/libevent/event.c:402
#15 0x00000bf99b913cae in do_main_loop () at main.c:1959
#16 0x00000bf99b915341 in tor_main (argc=1, argv=0x7f7ffffc9000) at main.c:2652
#17 0x00000bf99b90f8ab in main (argc=1, argv=0x7f7ffffc9000) at tor_main.c:30
(gdb) 

comment:4 Changed 7 years ago by nickm

Okay. Can I see the value of *conn? Just go to stack level 2 ("up 2") and print out conn ("p *conn").

comment:5 Changed 7 years ago by usuality

(gdb) up 2
#2  0x00000bf99b9c650c in assert_connection_ok (conn=0xbfba11eef00, now=0) at connection.c:4061
4061    connection.c: No such file or directory.
        in connection.c
Current language:  auto; currently c
(gdb) p *conn
$1 = {magic = 4030152723, state = 3 '\003', type = 5, purpose = 1, read_blocked_on_bw = 0, 
  write_blocked_on_bw = 0, hold_open_until_flushed = 0, inbuf_reached_eof = 0, in_flushed_some = 0, 
  linked = 0, reading_from_linked_conn = 0, writing_to_linked_conn = 0, active_on_link = 0, 
  linked_conn_is_closed = 0, proxy_state = 0, s = 109, conn_array_index = 26, read_event = 0xbfba415e300, 
  write_event = 0xbfba966ba00, inbuf = 0xbfba0c4a700, outbuf = 0xbfba15807c0, outbuf_flushlen = 451, 
  timestamp_lastread = 1370399151, timestamp_lastwritten = 1370399151, timestamp_created = 1370399151, 
  socket_family = 2, addr = {family = 2 '\002', addr = {dummy_ = 3874701106, in_addr = {
        s_addr = 3874701106}, in6_addr = {__u6_addr = {__u6_addr8 = "2?óæ", '\0' <repeats 11 times>, 
          __u6_addr16 = {16178, 59123, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {3874701106, 0, 0, 0}}}}}, 
  port = 80, marked_for_close = 0, marked_for_close_file = 0x0, address = 0xbfba81713a0 "ocsp.godaddy.com", 
  linked_conn = 0x0, global_identifier = 72631, dirreq_id = 0}
(gdb) 

comment:6 Changed 7 years ago by nickm

Okay, that's connection type CONN_TYPE_EXIT, state EXIT_CONN_STATE_OPEN, purpose EXIT_PURPOSE_CONNECT.

Does "Bug: newly connected conn had data waiting!" occur right before every crash?

Could I see the output of "p *conn->outbuf" ?

comment:7 Changed 7 years ago by usuality

Yes, that message occurs each time.

(gdb) p *conn->outbuf
$3 = {magic = 2969563922, datalen = 451, default_chunk_size = 4096, head = 0xbfba4741000, 
  tail = 0xbfba4741000}

Changed 7 years ago by nickm

Attachment: 9017.patch added

comment:8 Changed 7 years ago by nickm

Status: needs_informationneeds_review

Does the attached patch prevent the crashes for you? (The "Bug:..." messages will probably still occur with this one.)

comment:9 Changed 7 years ago by nickm

Keywords: 023-backport added
Milestone: Tor: 0.2.4.x-final

comment:10 Changed 7 years ago by usuality

I'm running with that change now, will report back in a day.

comment:11 Changed 7 years ago by andrea

I've written a small non-blocking connect() test program which I will attach, and made observations. On both Linux (kernel 3.9.1, x86-64) and OpenBSD 5.2 on x86-64, the output is as follows, which does not reproduce the presumed behavior if nickm's hypothesis on the cause is correct:

connect() gave EINPROGRESS
About to select()...
The socket has become writeable according to select() (returned 1)
Async connect() succeeded
Socket closed

Changed 7 years ago by andrea

Attachment: connect-test.c added

Test case for async socket connect() behavior

comment:12 Changed 7 years ago by nickm

I hypothesize that the fast-return case only happens under high network load. I could be wrong though.

comment:13 Changed 7 years ago by andrea

Hmm; time for me to go reading the OpenBSD kernel source, I think.

comment:14 Changed 7 years ago by usuality

Crashed again.

connection.c:4066 assert_connection_ok: Assertion (conn->type == CONN_TYPE_EXIT && conn->state == EXIT_CONN_STATE_RESOLVING) || connection_is_writing(conn) || conn->write_blocked_on_bw || (CONN_IS_EDGE(conn) && TO_EDGE_CONN(conn)->edge_blocked_on_circ) failed; aborting.

Program received signal SIGABRT, Aborted.
0x000008e1c28ae74a in kill () at <stdin>:2
2       <stdin>: No such file or directory.
        in <stdin>
Current language:  auto; currently asm
(gdb) where
#0  0x000008e1c28ae74a in kill () at <stdin>:2
#1  0x000008e1c29127ca in abort () at /usr/src/lib/libc/stdlib/abort.c:70
#2  0x000008dfbbcc650c in assert_connection_ok (conn=0x8e1c6655000, now=0) at connection.c:4061
#3  0x000008dfbbcbd720 in _connection_mark_for_close (conn=0x8e1c6655000, line=135, 
    file=0x8dfbbe8159f "connection_edge.c") at connection.c:673
#4  0x000008dfbbcc8187 in connection_edge_reached_eof (conn=0x8e1c6655000) at connection_edge.c:135
#5  0x000008dfbbcc582a in connection_reached_eof (conn=0x8e1c6655000) at connection.c:3945
#6  0x000008dfbbcc2c0e in connection_handle_read_impl (conn=0x8e1c6655000) at connection.c:2733
#7  0x000008dfbbcc2c48 in connection_handle_read (conn=0x8e1c6655000) at connection.c:2748
#8  0x000008dfbbc112d0 in conn_read_callback (fd=103, event=2, _conn=0x8e1c6655000) at main.c:705
#9  0x000008e1cb237a38 in event_base_loop (base=0x8e1cb63d800, flags=Variable "flags" is not available.
) at /usr/src/lib/libevent/event.c:402
#10 0x000008dfbbc13cae in do_main_loop () at main.c:1959
#11 0x000008dfbbc15341 in tor_main (argc=1, argv=0x7f7fffff4388) at main.c:2652
#12 0x000008dfbbc0f8ab in main (argc=1, argv=0x7f7fffff4388) at tor_main.c:30
(gdb) up 2
#2  0x000008dfbbcc650c in assert_connection_ok (conn=0x8e1c6655000, now=0) at connection.c:4061
4061    connection.c: No such file or directory.
        in connection.c
Current language:  auto; currently c
(gdb) p *conn
$1 = {magic = 4030152723, state = 3 '\003', type = 5, purpose = 1, read_blocked_on_bw = 0, 
  write_blocked_on_bw = 0, hold_open_until_flushed = 0, inbuf_reached_eof = 1, in_flushed_some = 0, 
  linked = 0, reading_from_linked_conn = 0, writing_to_linked_conn = 0, active_on_link = 0, 
  linked_conn_is_closed = 0, proxy_state = 0, s = 103, conn_array_index = 330, read_event = 0x8e1c85da480, 
  write_event = 0x8e1bdce5a80, inbuf = 0x8e1c01f3ac0, outbuf = 0x8e1c93d5200, outbuf_flushlen = 392, 
  timestamp_lastread = 1370519533, timestamp_lastwritten = 1370519523, timestamp_created = 1370519523, 
  socket_family = 2, addr = {family = 2 '\002', addr = {dummy_ = 3242793927, in_addr = {
        s_addr = 3242793927}, in6_addr = {__u6_addr = {__u6_addr8 = "Ç\033IÁ", '\0' <repeats 11 times>, 
          __u6_addr16 = {7111, 49481, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {3242793927, 0, 0, 0}}}}}, 
  port = 80, marked_for_close = 0, marked_for_close_file = 0x0, 
  address = 0x8e1c85b2880 "api-public.addthis.com", linked_conn = 0x0, global_identifier = 326395, 
  dirreq_id = 0}
(gdb) p *conn->outbuf
$2 = {magic = 2969563922, datalen = 392, default_chunk_size = 4096, head = 0x8e1c20d0000, 
  tail = 0x8e1c20d0000}

comment:15 Changed 7 years ago by nickm

Status: needs_reviewneeds_revision

Ugh. Well, I still think the patch above is *right*, but apparently it's incomplete.

Did this one come with a "Bug: newly connected conn had data waiting" right before it?

Could I see p *conn->write_event and p *(edge_connection_t*)conn ?

comment:16 Changed 7 years ago by usuality

Yes, 10 seconds before the abort it logged that error.

(gdb) p *conn->write_event
$3 = {ev_next = {tqe_next = 0x8e1c85da480, tqe_prev = 0x8e1c5f54100}, ev_active_next = {tqe_next = 0x0, 
    tqe_prev = 0x0}, ev_signal_next = {tqe_next = 0x0, tqe_prev = 0x0}, min_heap_idx = 4294967295, 
  ev_base = 0x8e1cb63d800, ev_fd = 103, ev_events = 20, ev_ncalls = 0, ev_pncalls = 0x0, ev_timeout = {
    tv_sec = 0, tv_usec = 0}, ev_pri = 0, ev_callback = 0x8dfbbc1139e <conn_write_callback>, 
  ev_arg = 0x8e1c6655000, ev_res = 0, ev_flags = 128}
(gdb) p *(edge_connection_t*)conn
$4 = {_base = {magic = 4030152723, state = 3 '\003', type = 5, purpose = 1, read_blocked_on_bw = 0, 
    write_blocked_on_bw = 0, hold_open_until_flushed = 0, inbuf_reached_eof = 1, in_flushed_some = 0, 
    linked = 0, reading_from_linked_conn = 0, writing_to_linked_conn = 0, active_on_link = 0, 
    linked_conn_is_closed = 0, proxy_state = 0, s = 103, conn_array_index = 330, 
    read_event = 0x8e1c85da480, write_event = 0x8e1bdce5a80, inbuf = 0x8e1c01f3ac0, outbuf = 0x8e1c93d5200, 
    outbuf_flushlen = 392, timestamp_lastread = 1370519533, timestamp_lastwritten = 1370519523, 
    timestamp_created = 1370519523, socket_family = 2, addr = {family = 2 '\002', addr = {
        dummy_ = 3242793927, in_addr = {s_addr = 3242793927}, in6_addr = {__u6_addr = {
            __u6_addr8 = "Ç\033IÁ", '\0' <repeats 11 times>, __u6_addr16 = {7111, 49481, 0, 0, 0, 0, 0, 0}, 
            __u6_addr32 = {3242793927, 0, 0, 0}}}}}, port = 80, marked_for_close = 0, 
    marked_for_close_file = 0x0, address = 0x8e1c85b2880 "api-public.addthis.com", linked_conn = 0x0, 
    global_identifier = 326395, dirreq_id = 0}, next_stream = 0x8e1cde6ee00, package_window = 500, 
  deliver_window = 499, on_circuit = 0x8e1be07f400, cpath_layer = 0x0, rend_data = 0x0, address_ttl = 16, 
  stream_id = 51317, end_reason = 6, n_read = 0, n_written = 0, is_dns_request = 0, edge_has_sent_end = 1, 
  edge_blocked_on_circ = 0}
(gdb) 

comment:17 Changed 7 years ago by andrea

Hmm; the non-immediate-return case code for handling when the connection completes is in connection_edge_finished_connecting(); the code is substantially similar to the latter part of connection_exit_connect() and this should be refactored to avoid the duplication, but probably not in maint-0.2.3. The only differences I see are

  • connection_edge_finished_connecting() calls rep_hist_note_exit_stream_opened(conn->port) before it puts the connection in EXIT_CONN_STATE_OPEN.
  • connection_edge_finished_connecting() has an extra connection_edge_process_inbuf(edge_conn, 1) before it returns, which connection_exit_connect() lacks.

comment:18 Changed 7 years ago by nickm

Hm. So the best fix I can think of here right now then is to fix the known bug in connection_edge_finished_connecting(), and refactor that part of connection_assert_ok() into a new function that just logs a bug and enables writing. Not pretty, but neither are crash bugs. Andrea, does that sound plausible?

Changed 7 years ago by nickm

Attachment: 9017_v2.patch added

comment:19 Changed 7 years ago by nickm

Status: needs_revisionneeds_review

Oh duh; my fix would not have worked! connection_watch_events(READ_EVENT) turns off the write event. I have another patch to try; please see if "9017_v2.patch" works for you?

comment:20 Changed 7 years ago by usuality

Running with that now.

comment:21 Changed 7 years ago by andrea

Patch looks good to me if it works for usuality, but nickm will know libevent stuff much better than I do.

comment:22 Changed 7 years ago by usuality

Jun 8 10:23:28 tor Tor[32697]: Heartbeat: Tor's uptime is 18:00 hours, with 477 circuits open. I've sent 25.44 GB and received 25.13 GB.

Still up with no errors logged.

comment:23 Changed 7 years ago by arma

Sounds like a good sign!

comment:24 Changed 7 years ago by usuality

Still up 24 hours later with no crashes or errors logged.

Jun 9 10:23:28 tor Tor[32697]: Heartbeat: Tor's uptime is 1 day 18:00 hours, with 1004 circuits open. I've sent 92.92 GB and received 91.21 GB.

comment:25 Changed 7 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Okay. I've made a branch "bug9017" and merged it to 0.2.3 and forward. Thanks for your patience and testing!

Note: See TracTickets for help on using tickets.