Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#7212 closed defect (fixed)

circuitmux assertion failure in 0.2.4.4-alpha

Reported by: nickm Owned by:
Priority: High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version: Tor: 0.2.4.4-alpha
Severity: Keywords: tor-relay
Cc: andrea, shamrock, aagbsn, tor-admin@… Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Andreas Lehner reports the following assertion failure on 0.2.4.4-alpha

Oct 25 14:56:51.000 [err] update_circuit_on_cmux(): Bug: src/or/relay.c:2002:
update_circuit_on_cmux: Assertion circuitmux_is_circuit_attached(cmux, circ)
failed; aborting.

I've asked for a stack trace.

Child Tickets

Change History (30)

comment:1 Changed 8 years ago by nickm

Version: Tor: 0.2.4.4-alpha

comment:2 Changed 8 years ago by arma

Lucky apparently experienced this on Tonga (linux, 0.2.4.5-alpha).

comment:3 Changed 8 years ago by nickm

Probably authority-related somehow; possibly related to authorities testing whether relays are up. A stack trace would really, really help.

comment:4 Changed 8 years ago by arma

Cc: shamrock added

comment:5 in reply to:  3 Changed 8 years ago by shamrock

Replying to nickm:
I promise you a stack trace as soon as this happens again on Tonga and I get a core.

comment:6 Changed 8 years ago by kargig

FWIW I got the same crash on an exit node

Nov 05 06:25:02.000 [notice] Tor 0.2.4.5-alpha (git-630e0786909c83dc) opening new log file.
Nov 05 06:58:20.000 [err] update_circuit_on_cmux(): Bug: src/or/relay.c:2002: update_circuit_on_cmux: Assertion circuitmux_is_circuit_attached(cmux, circ) failed; aborting.

comment:7 Changed 8 years ago by nickm

Okay, so it's not just authorities. But we really need a stack trace here to make progress.

comment:8 Changed 8 years ago by aagbsn

Cc: aagbsn added

Here's the stack trace:
My system is:

tor --version
Tor version 0.2.4.5-alpha (git-17a6aa56cefe9d83).
uname -a
Linux DC03R05DS10 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64 GNU/Linux
Debian Squeeze
Nov 05 10:00:17.000 [notice] Heartbeat: Tor's uptime is 1 day 0:00 hours, with 2213 circuits open. I've sent 119.52 GB and received 117.94 GB.
Nov 05 13:12:41.000 [err] update_circuit_on_cmux(): Bug: src/or/relay.c:2002: update_circuit_on_cmux: Assertion circuitmux_is_circuit_attached(cmux, circ) failed; aborting.
src/or/relay.c:2002 update_circuit_on_cmux: Assertion circuitmux_is_circuit_attached(cmux, circ) failed; aborting.

Program received signal SIGABRT, Aborted.


(gdb) bt full
#0  0x00007ffff68341b5 in raise () from /lib/libc.so.6
No symbol table info available.
#1  0x00007ffff6836fc0 in abort () from /lib/libc.so.6
No symbol table info available.
#2  0x00007ffff7e8f9fd in TO_OR_CIRCUIT (x=<optimized out>) at src/or/or.h:2954
No locals.
#3  update_circuit_on_cmux (circ=0x7ffff99f4eb0, direction=CELL_DIRECTION_IN)
    at src/or/relay.c:1991
        chan = <optimized out>
        or_circ = 0x7ffff99f4eb0
        __PRETTY_FUNCTION__ = "update_circuit_on_cmux"
        __func__ = "update_circuit_on_cmux"
#4  0x00007ffff7eee224 in circuit_mark_for_close_ (circ=0x7ffff99f4eb0, 
    reason=8, line=1461, file=0x7ffff7fb3a0f "src/or/circuitlist.c")
    at src/or/circuitlist.c:1441
        conn = <optimized out>
        __PRETTY_FUNCTION__ = "circuit_mark_for_close_"
        __func__ = "circuit_mark_for_close_"
#5  0x00007ffff7eee513 in circuit_mark_for_close_ (circ=0x7ffff8decff0, 
    reason=8, line=1047, file=0x7ffff7fb3a0f "src/or/circuitlist.c")
    at src/or/circuitlist.c:1461
        __PRETTY_FUNCTION__ = "circuit_mark_for_close_"
        __func__ = "circuit_mark_for_close_"
#6  0x00007ffff7eee947 in circuit_unlink_all_from_channel (
    chan=0x7fffe3fa13b0, reason=8) at src/or/circuitlist.c:1047
        mark = 0
        circ = 0x7ffff8decff0
#7  0x00007ffff7ede308 in channel_closed (chan=0x7fffe3fa13b0)
    at src/or/channel.c:1327
        __PRETTY_FUNCTION__ = "channel_closed"
        __func__ = "channel_closed"
#8  0x00007ffff7f2607e in connection_or_about_to_close (or_conn=0x7fffd41be540)
    at src/or/connection_or.c:601
        now = 1352139161
#9  0x00007ffff7e772bd in connection_unlink (conn=0x673c) at src/or/main.c:385
No locals.
#10 0x00007ffff7e775ad in conn_close_if_marked (i=<optimized out>)
    at src/or/main.c:932
        conn = 0x7fffd41be540
        retval = -132057072
        now = 1352139161
#11 close_closeable_connections () at src/or/main.c:702
        conn = 0x673c
        i = -1
#12 0x00007ffff757896c in event_base_loop () from /usr/lib/libevent-2.0.so.5
No symbol table info available.
#13 0x00007ffff7e755a1 in do_main_loop () at src/or/main.c:1987
        loop_result = <optimized out>
        now = <optimized out>
        __PRETTY_FUNCTION__ = "do_main_loop"
        __func__ = "do_main_loop"
#14 0x00007ffff7e7595d in tor_main (argc=<optimized out>, argv=0x7fffffffe218)
    at src/or/main.c:2699
        result = -1
        __PRETTY_FUNCTION__ = "tor_main"
#15 0x00007ffff6820c8d in __libc_start_main () from /lib/libc.so.6
No symbol table info available.
#16 0x00007ffff7e73579 in _start ()
No symbol table info available.

comment:9 Changed 8 years ago by aagbsn

Ah, forgot to mention that this is a non-exit relay.

I built Tor as follows:

./autogen.sh; ./configure --disable-asciidoc; make clean; make

My torrc is as follows:
SocksPort 0
Nickname lna24
AccountingMax 150 GB
AccountingStart day 00:00
DirPort 9024
MyFamily $8644744F908727C42720DB549F3D2BDB4D0B129A
ExitPolicy reject *:*
ORPort 2424
DisableDebuggerAttachment 0

comment:10 Changed 8 years ago by aagbsn

(gdb) select-frame 6
(gdb) info frame
Stack level 6, frame at 0x7fffffffdc40:
 rip = 0x7ffff7eee947 in circuit_unlink_all_from_channel
    (src/or/circuitlist.c:1047); saved rip 0x7ffff7ede308
 called by frame at 0x7fffffffdc70, caller of frame at 0x7fffffffdc00
 source language c.
 Arglist at 0x7fffffffdbf8, args: chan=0x7fffe3fa13b0, reason=8
 Locals at 0x7fffffffdbf8, Previous frame's sp is 0x7fffffffdc40
 Saved registers:
  rbx at 0x7fffffffdc20, rbp at 0x7fffffffdc28, r12 at 0x7fffffffdc30,
  rip at 0x7fffffffdc38
(gdb) info locals
mark = 0
circ = 0x7ffff8decff0
(gdb) p *circ
$1 = {magic = 2561392719, n_chan = 0x0, n_circ_id = 0, n_mux = 0x0, 
  n_chan_cells = {head = 0x0, tail = 0x0, n = 0, insertion_times = 0x0}, 
  n_hop = 0x0, streams_blocked_on_n_chan = 0, streams_blocked_on_p_chan = 0, 
  state = 3 '\003', purpose = 4 '\004', package_window = 1000, 
  deliver_window = 1000, n_chan_onionskin = 0x0, timestamp_created = {
    tv_sec = 1352138892, tv_usec = 948938}, timestamp_dirty = 0, 
  marked_for_close = 1047, 
  marked_for_close_file = 0x7ffff7fb3a0f "src/or/circuitlist.c", 
  dirreq_id = 0, next = 0x7ffff8a6b730, next_active_on_n_chan = 0x0, 
  prev_active_on_n_chan = 0x0}
(gdb) p *chan
$2 = {magic = 2316903463, state = CHANNEL_STATE_CLOSING, 
  global_identifier = 116130, registered = 1 '\001', 
  reason_for_closing = CHANNEL_CLOSE_FOR_ERROR, 
  timestamp_created = 1352131496, timestamp_active = 1352139161, free = 0, 
  close = 0x7ffff7ee31f0 <channel_tls_close_method>, 
  describe_transport = 0x7ffff7ee30d0 <channel_tls_describe_transport_method>, 
  dumpstats = 0, cell_handler = 0x7ffff7efdfd0 <command_process_cell>, 
  var_cell_handler = 0x7ffff7efdc80 <command_process_var_cell>, 
  get_remote_addr = 0x7ffff7ee2f20 <channel_tls_get_remote_addr_method>, 
  get_remote_descr = 0x7ffff7ee2cd0 <channel_tls_get_remote_descr_method>, 
  has_queued_writes = 0x7ffff7ee2b80 <channel_tls_has_queued_writes_method>, 
  is_canonical = 0x7ffff7ede820 <channel_tls_is_canonical_method>, 
  matches_extend_info = 0x7ffff7ee1380 <channel_tls_matches_extend_info_method>, matches_target = 0x7ffff7ee11d0 <channel_tls_matches_target_method>, 
  write_cell = 0x7ffff7ee29d0 <channel_tls_write_cell_method>, 
  write_packed_cell = 0x7ffff7ee2810 <channel_tls_write_packed_cell_method>, 
  write_var_cell = 0x7ffff7ee2660 <channel_tls_write_var_cell_method>, 
  identity_digest = ")�1:x O��PILu\f����Rt", nickname = 0x0,
  next_with_same_id = 0x0, prev_with_same_id = 0x0, incoming_queue = 0x0,
  outgoing_queue = 0x0, cmux = 0x7fffe1953ed0,
  circ_id_type = CIRC_ID_TYPE_LOWER, next_circ_id = 20738,
  num_n_circuits = 4294967280, num_p_circuits = 4294967290,
  is_bad_for_new_circs = 0, is_client = 0, is_incoming = 1, is_local = 0,
---Type <return> to continue, or q <return> to quit---
  timestamp_client = 0, timestamp_drained = 1352139161,
  timestamp_recv = 1352139160, timestamp_xmit = 1352139161,
  timestamp_last_added_nonpadding = 1352139160, dirreq_id = 0,
  n_cells_recved = 98393, n_cells_xmitted = 37552}

comment:11 Changed 8 years ago by aagbsn

(gdb) select-frame 7
(gdb) info locals
__PRETTY_FUNCTION__ = "channel_closed"
__func__ = "channel_closed"
(gdb) info args
chan = 0x7fffe3fa13b0
(gdb) p *chan
$15 = {magic = 2316903463, state = CHANNEL_STATE_CLOSING, 
  global_identifier = 116130, registered = 1 '\001', 
  reason_for_closing = CHANNEL_CLOSE_FOR_ERROR, 
  timestamp_created = 1352131496, timestamp_active = 1352139161, free = 0, 
  close = 0x7ffff7ee31f0 <channel_tls_close_method>, 
  describe_transport = 0x7ffff7ee30d0 <channel_tls_describe_transport_method>, 
  dumpstats = 0, cell_handler = 0x7ffff7efdfd0 <command_process_cell>, 
  var_cell_handler = 0x7ffff7efdc80 <command_process_var_cell>, 
  get_remote_addr = 0x7ffff7ee2f20 <channel_tls_get_remote_addr_method>, 
  get_remote_descr = 0x7ffff7ee2cd0 <channel_tls_get_remote_descr_method>, 
  has_queued_writes = 0x7ffff7ee2b80 <channel_tls_has_queued_writes_method>, 
  is_canonical = 0x7ffff7ede820 <channel_tls_is_canonical_method>, 
  matches_extend_info = 0x7ffff7ee1380 <channel_tls_matches_extend_info_method>, matches_target = 0x7ffff7ee11d0 <channel_tls_matches_target_method>, 
  write_cell = 0x7ffff7ee29d0 <channel_tls_write_cell_method>, 
  write_packed_cell = 0x7ffff7ee2810 <channel_tls_write_packed_cell_method>, 
  write_var_cell = 0x7ffff7ee2660 <channel_tls_write_var_cell_method>, 
  identity_digest = ")�1:x O��PILu\f����Rt", nickname = 0x0, 
  next_with_same_id = 0x0, prev_with_same_id = 0x0, incoming_queue = 0x0, 
  outgoing_queue = 0x0, cmux = 0x7fffe1953ed0, 
  circ_id_type = CIRC_ID_TYPE_LOWER, next_circ_id = 20738, 
  num_n_circuits = 4294967280, num_p_circuits = 4294967290, 
  is_bad_for_new_circs = 0, is_client = 0, is_incoming = 1, is_local = 0, 
---Type <return> to continue, or q <return> to quit---
  timestamp_client = 0, timestamp_drained = 1352139161, 
  timestamp_recv = 1352139160, timestamp_xmit = 1352139161, 
  timestamp_last_added_nonpadding = 1352139160, dirreq_id = 0, 
  n_cells_recved = 98393, n_cells_xmitted = 37552}
(gdb) 

comment:12 Changed 8 years ago by aagbsn

oops, that should have been:

(gdb) info frame
Stack level 10, frame at 0x7fffffffdfc0:
 rip = 0x7ffff7e775ad in conn_close_if_marked (src/or/main.c:932); 
    saved rip 0x7ffff757896c
 inlined into frame 11, caller of frame at 0x7fffffffdf00
 source language c.
 Arglist at unknown address.
 Locals at unknown address, Previous frame's sp is 0x7fffffffdf00
 Saved registers:
  rbx at 0x7fffffffdef0, rip at 0x7fffffffdef8
(gdb) p *conn
$17 = {magic = 2100428547, state = 8 '\b', type = 4, purpose = 0, 
  read_blocked_on_bw = 0, write_blocked_on_bw = 0, 
  hold_open_until_flushed = 0, inbuf_reached_eof = 0, in_flushed_some = 0, 
  linked = 0, reading_from_linked_conn = 0, writing_to_linked_conn = 0, 
  active_on_link = 0, linked_conn_is_closed = 0, proxy_state = 0, s = -1, 
  conn_array_index = 337, read_event = 0x0, write_event = 0x0, 
  inbuf = 0x7fffe3a13060, outbuf = 0x7fffe3a13090, outbuf_flushlen = 0, 
  timestamp_lastread = 1352139160, timestamp_lastwritten = 1352139161, 
  timestamp_created = 1352131496, socket_family = 2, addr = {family = 2, 
    addr = {dummy_ = 52650066, in_addr = {s_addr = 52650066}, in6_addr = {
        __in6_u = {__u6_addr8 = "R`#\003", '\000' <repeats 11 times>, 
          __u6_addr16 = {24658, 803, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {
            52650066, 0, 0, 0}}}}}, port = 443, marked_for_close = 2730, 
  marked_for_close_file = 0x7ffff7fc1d18 "src/or/connection.c", 
  address = 0x7fffe1854410 "82.96.35.3", linked_conn = 0x0, 
  global_identifier = 116757}

comment:13 Changed 8 years ago by mo

Seeing the same on some of our exits (0.2.4.5-alpha).

comment:14 Changed 8 years ago by arma

Just closed #7429 as a duplicate. Looks like real relays are hitting this pretty often.

comment:15 Changed 8 years ago by torland

Cc: tor-admin@… added

comment:16 Changed 8 years ago by aagbsn

I updated to latest master,

tor --version
Tor version 0.2.4.5-alpha-dev (git-0523c8de7d7f775e).
oing through channels at src/or/hibernate.c:850
Nov 10 23:18:55.000 [warn] Something tried to close an or_connection_t without going through channels at src/or/hibernate.c:850
Nov 10 23:18:55.000 [warn] Something tried to close an or_connection_t without going through channels at src/or/hibernate.c:850
... snip (some hundreds or so of similar lines) ...
Nov 10 23:18:55.000 [warn] Something tried to close an or_connection_t without going through channels at src/or/hibernate.c:850
Nov 10 23:18:55.000 [err] update_circuit_on_cmux(): Bug: src/or/relay.c:2002: update_circuit_on_cmux: Assertion circuitmux_is_circuit_attached(cmux, circ) failed; aborting.
src/or/relay.c:2002 update_circuit_on_cmux: Assertion circuitmux_is_circuit_attached(cmux, circ) failed; aborting.

Program received signal SIGABRT, Aborted.
0x00007ffff68331b5 in raise () from /lib/libc.so.6
(gdb) bt full
#0  0x00007ffff68331b5 in raise () from /lib/libc.so.6
No symbol table info available.
#1  0x00007ffff6835fc0 in abort () from /lib/libc.so.6
No symbol table info available.
#2  0x00007ffff7e8ea7d in TO_OR_CIRCUIT (x=<optimized out>) at src/or/or.h:2954
No locals.
#3  update_circuit_on_cmux (circ=0x7fffec51f6a0, direction=CELL_DIRECTION_IN)
    at src/or/relay.c:1991
        chan = <optimized out>
        or_circ = 0x7fffec51f6a0
        __PRETTY_FUNCTION__ = "update_circuit_on_cmux"
        __func__ = "update_circuit_on_cmux"
#4  0x00007ffff7eed548 in circuit_mark_for_close_ (circ=0x7fffec51f6a0,
    reason=8, line=1471, file=0x7ffff7fb396f "src/or/circuitlist.c")
    at src/or/circuitlist.c:1446
        conn = <optimized out>
        __PRETTY_FUNCTION__ = "circuit_mark_for_close_"
        __func__ = "circuit_mark_for_close_"
#5  0x00007ffff7eed843 in circuit_mark_for_close_ (circ=0x7fffed0d8720,
    reason=8, line=1047, file=0x7ffff7fb396f "src/or/circuitlist.c")
    at src/or/circuitlist.c:1471
        __PRETTY_FUNCTION__ = "circuit_mark_for_close_"
---Type <return> to continue, or q <return> to quit---
        __func__ = "circuit_mark_for_close_"
#6  0x00007ffff7eedc77 in circuit_unlink_all_from_channel (
    chan=0x7ffff9bca200, reason=8) at src/or/circuitlist.c:1047
        mark = 0
        circ = 0x7fffed0d8720
#7  0x00007ffff7edc0e8 in channel_closed (chan=0x7ffff9bca200)
    at src/or/channel.c:1288
        __PRETTY_FUNCTION__ = "channel_closed"
        __func__ = "channel_closed"
#8  0x00007ffff7f257ce in connection_or_about_to_close (or_conn=0x7ffff927cfa0)
    at src/or/connection_or.c:601
        now = 1352607535
#9  0x00007ffff7e7634d in connection_unlink (conn=0x1ee6) at src/or/main.c:385
No locals.
#10 0x00007ffff7e7663d in conn_close_if_marked (i=<optimized out>)
    at src/or/main.c:932
        conn = 0x7ffff927cfa0
        retval = 601
        now = 1352607535
#11 close_closeable_connections () at src/or/main.c:702
        conn = 0x1ee6
        i = -1
---Type <return> to continue, or q <return> to quit---
#12 0x00007ffff7e77848 in run_scheduled_events (now=<optimized out>)
    at src/or/main.c:1535
        have_dir_info = <optimized out>
        last_rotated_x509_certificate = 1352606336
        time_to_check_descriptor = 1352607592
        time_to_add_entropy = 1352609944
        time_to_downrate_stability = 1352635128
        time_to_check_for_expired_networkstatus = 1352607654
        time_to_write_stats_files = 1352609928
        should_init_bridge_stats = 1
        time_to_check_ipaddress = 1352608188
        time_to_save_stability = 0
        time_to_clean_caches = 1352608160
        time_to_retry_dns_init = 1352607625
        i = 601
        time_to_check_v3_certificate = 1352607723
        time_to_write_bridge_status_file = 0
        time_to_write_bridge_stats = 0
        time_to_shrink_memory = 1352607592
        time_to_try_getting_descriptors = 1352607554
        time_to_reset_descriptor_failures = 1352609944
        time_to_recheck_bandwidth = 1352636397
---Type <return> to continue, or q <return> to quit---
        time_to_launch_reachability_tests = 0
        time_to_next_heartbeat = 1352613528
        options = <optimized out>
        is_server = 1
        time_to_check_listeners = 1352607348
        time_to_check_port_forwarding = 0
#13 second_elapsed_callback (timer=<optimized out>, arg=<optimized out>)
    at src/or/main.c:1691
        current_second = 1352607534
        now = 1352607535
        bytes_written = <optimized out>
        bytes_read = <optimized out>
        seconds_elapsed = <optimized out>
        options = 0x259
        __PRETTY_FUNCTION__ = "second_elapsed_callback"
#14 0x00007ffff757796c in event_base_loop () from /usr/lib/libevent-2.0.so.5
No symbol table info available.
#15 0x00007ffff7e74631 in do_main_loop () at src/or/main.c:1987
        loop_result = <optimized out>
        now = <optimized out>
        __PRETTY_FUNCTION__ = "do_main_loop"
        __func__ = "do_main_loop"
---Type <return> to continue, or q <return> to quit---
#16 0x00007ffff7e749ed in tor_main (argc=<optimized out>, argv=0x7fffffffe218)
    at src/or/main.c:2699
        result = -1
        __PRETTY_FUNCTION__ = "tor_main"
#17 0x00007ffff681fc8d in __libc_start_main () from /lib/libc.so.6
No symbol table info available.
#18 0x00007ffff7e72609 in _start ()
No symbol table info available.
(gdb) 

comment:17 in reply to:  16 ; Changed 8 years ago by nickm

Status: newneeds_review

Aha, Maybe.

Both of the failures had this pair of stack frames in common:

#4  0x00007ffff7eee224 in circuit_mark_for_close_ (circ=0x7ffff99f4eb0, 
    reason=8, line=1461, file=0x7ffff7fb3a0f "src/or/circuitlist.c")
    at src/or/circuitlist.c:1441
        conn = <optimized out>
        __PRETTY_FUNCTION__ = "circuit_mark_for_close_"
        __func__ = "circuit_mark_for_close_"
#5  0x00007ffff7eee513 in circuit_mark_for_close_ (circ=0x7ffff8decff0, 
    reason=8, line=1047, file=0x7ffff7fb3a0f "src/or/circuitlist.c")
    at src/or/circuitlist.c:1461
        __PRETTY_FUNCTION__ = "circuit_mark_for_close_"
        __func__ = "circuit_mark_for_close_"

That's a circuit_mark_for_close_() calling a circuit_mark_for_close_() on its spliced rendezvous circuit at what is now line 1471 of circuitlist.c. That's a clue, perhaps.

Also, circuit_mark_for_close_() doesn't call update_circuit_on_cmux() directly; it only calls circuit_cell_queue_clear(). That could be another clue.

I've taken a wild stab at a workaround in branch "bug7212_wild_stab" in my public repository. Andrea, what do you think of this one? Do you think it's safe to merge?

comment:18 in reply to:  17 Changed 8 years ago by andrea

Replying to nickm:

Aha, Maybe.

Both of the failures had this pair of stack frames in common:

#4  0x00007ffff7eee224 in circuit_mark_for_close_ (circ=0x7ffff99f4eb0, 
    reason=8, line=1461, file=0x7ffff7fb3a0f "src/or/circuitlist.c")
    at src/or/circuitlist.c:1441
        conn = <optimized out>
        __PRETTY_FUNCTION__ = "circuit_mark_for_close_"
        __func__ = "circuit_mark_for_close_"
#5  0x00007ffff7eee513 in circuit_mark_for_close_ (circ=0x7ffff8decff0, 
    reason=8, line=1047, file=0x7ffff7fb3a0f "src/or/circuitlist.c")
    at src/or/circuitlist.c:1461
        __PRETTY_FUNCTION__ = "circuit_mark_for_close_"
        __func__ = "circuit_mark_for_close_"

That's a circuit_mark_for_close_() calling a circuit_mark_for_close_() on its spliced rendezvous circuit at what is now line 1471 of circuitlist.c. That's a clue, perhaps.

Also, circuit_mark_for_close_() doesn't call update_circuit_on_cmux() directly; it only calls circuit_cell_queue_clear(). That could be another clue.

I've taken a wild stab at a workaround in branch "bug7212_wild_stab" in my public repository. Andrea, what do you think of this one? Do you think it's safe to merge?

Thanks, I think you can go ahead and merge that. Fix the typo in your log message first, though. :)

comment:19 in reply to:  17 Changed 8 years ago by andrea

Replying to nickm:

Aha, Maybe.

Both of the failures had this pair of stack frames in common:

#4  0x00007ffff7eee224 in circuit_mark_for_close_ (circ=0x7ffff99f4eb0, 
    reason=8, line=1461, file=0x7ffff7fb3a0f "src/or/circuitlist.c")
    at src/or/circuitlist.c:1441
        conn = <optimized out>
        __PRETTY_FUNCTION__ = "circuit_mark_for_close_"
        __func__ = "circuit_mark_for_close_"
#5  0x00007ffff7eee513 in circuit_mark_for_close_ (circ=0x7ffff8decff0, 
    reason=8, line=1047, file=0x7ffff7fb3a0f "src/or/circuitlist.c")
    at src/or/circuitlist.c:1461
        __PRETTY_FUNCTION__ = "circuit_mark_for_close_"
        __func__ = "circuit_mark_for_close_"

That's a circuit_mark_for_close_() calling a circuit_mark_for_close_() on its spliced rendezvous circuit at what is now line 1471 of circuitlist.c. That's a clue, perhaps.

Also, circuit_mark_for_close_() doesn't call update_circuit_on_cmux() directly; it only calls circuit_cell_queue_clear(). That could be another clue.

I've taken a wild stab at a workaround in branch "bug7212_wild_stab" in my public repository. Andrea, what do you think of this one? Do you think it's safe to merge?

Come to think of it, I think you probably nailed it. Spliced rendezvous circuits are an exception to 'every circuit has a channel and should be attached to its cmux', since IIRC they deliver cells to another circuit that is then attached to a channel instead.

comment:20 Changed 8 years ago by nickm

Thanks; merged! I hope I guessed what the typo was. :)

Spliced rendezvous circuits are an exception to 'every circuit has a channel and should be attached to its cmux', since IIRC they deliver cells to another circuit that is then attached to a channel instead

That's right, although they *do* have a p_chan. It's an n_chan that they should lack. Still think we should close?

comment:21 in reply to:  20 Changed 8 years ago by andrea

Replying to nickm:

Thanks; merged! I hope I guessed what the typo was. :)

Spliced rendezvous circuits are an exception to 'every circuit has a channel and should be attached to its cmux', since IIRC they deliver cells to another circuit that is then attached to a channel instead

That's right, although they *do* have a p_chan. It's an n_chan that they should lack. Still think we should close?

Well, I think it could happen in either direction. I'm going to have a close read through the rendezvous code and see what exactly happens and/or if anything should be changed to make a cleaner fix. Leave this open for now so if I have any further patches I have a placeholder for them, and we'll see if there are any more reports of it.

comment:22 Changed 8 years ago by andrea

I suspect the rendezvous splicing theory is correct and I have an instrumented version which emits the channel IDs/circuit IDs when a splice happens and indicates if the circuits are attached, and if this bug is triggered emits information about any spliced circuit, in my bug7212_instrumented branch. I've got that on a test run now.

comment:23 Changed 8 years ago by andrea

I've got one instance of a circuit being spliced for rendezvous with the extra debug logging:

Nov 13 10:35:00.000 [info] rend_mid_establish_rendezvous(): Established rendezvous point on circuit 60151 for cookie 0E5E85EF
Nov 13 10:38:18.000 [info] rend_mid_rendezvous(): Got request for rendezvous from circuit 43276 to cookie 0E5E85EF.
Nov 13 10:38:18.000 [info] rend_mid_rendezvous(): Completing rendezvous: circuit 43276 joins circuit 60151 (cookie 0E5E85EF)
Nov 13 10:38:18.000 [debug] rend_mid_rendezvous(): Rendezvous circuit #1 (p_circ_id 43276, n_circ_id 0) has no n_chan
Nov 13 10:38:18.000 [debug] rend_mid_rendezvous(): Rendezvous circuit #1 (p_circ_id 43276, n_circ_id 0) has p_chan 3462 at 0x1821b50
Nov 13 10:38:18.000 [debug] rend_mid_rendezvous(): Rendezvous circuit #1 is attached to its p_chan's cmux
Nov 13 10:38:18.000 [debug] rend_mid_rendezvous(): Rendezvous circuit #2 (p_circ_id 60151, n_circ_id 0) has no n_chan
Nov 13 10:38:18.000 [debug] rend_mid_rendezvous(): Rendezvous circuit #2 (p_circ_id 60151, n_circ_id 0) has p_chan 6161 at 0x1054e60
Nov 13 10:38:18.000 [debug] rend_mid_rendezvous(): Rendezvous circuit #2 is attached to its p_chan's cmux

That's about what I expected to see; I haven't managed to personally observe the bug yet, though, with any version. I'd be curious to learn, if people see it and/or the warning messages that should replace it in 0.2.4.6, how much uptime and how much data relayed is typical between occurrences.

comment:24 Changed 8 years ago by andrea

Well, I found *a* bug: hibernate.c calls connection_mark_for_close() directly on orconns rather than the proper channel functions. That's the cause of the numerous '[warn] Something tried to close an or_connection_t without going through channels at src/or/hibernate.c:850' aagbsn reported.

comment:25 Changed 8 years ago by andrea

Now, if this happens while a spliced circuit is active, it will hit one or the other end's p_chan first, and then circuit_unlink_all_from_channel() first calls channel_unlink_all_circuits(), then circuit_mark_for_close() on every unmarked circuit. If that hits the spliced circuit, it then recursively calls circuit_mark_for_close() on the spliced circuit, and then circuit_clear_cell_queue(), and finally update_circuit_on_cmux().

Before nickm's recent fix to circuit_clear_cell_queue(), it would not be checked whether the spliced circuit was attached to the cmux, so that could trigger the assert if it were not attached. Is it possible for a spliced rendezvous circuit to, by coincidence, have the same p_chan on both ends? I think it is, and if so that could trigger the bug.

If any instances of this are observed in the future, I'd quite like to see the p_chan as well as the n_chan, but I think nickm's change to circuit_clear_cell_queue() fixes it.

comment:26 Changed 8 years ago by andrea

Also, I note the doubtful num_n_circuits and num_p_circuits seen in the channel in the stack dump. I believe this is caused by channel_unlink_all_circuits() resetting the counters to zero and then them being subsequently decremented during the circuit_mark_for_close() loop in circuit_unlink_all_from_channel(). Clearly, circuit_unlink_all_from_channel() needs some re-engineering here, and the hibernate.c call to connection_mark_for_close() should change. These are independent bugs and I will open new tickets.

comment:27 Changed 8 years ago by andrea

Observation: if my hypothesis about this being triggered by spliced rendezvous segments coincidentally sharing the same p_chan is correct, then this is only feasible to observe on fast relays, since the frequency of rendezvous events should scale with relay bandwidth, but the probably of a second circuit choosing the same link would also scale with relay bandwidth, so I think the probability per unit time for this to occur is quadratic in relay bandwidth. That explains why I've been unable to observe this in weeks of test runs on my little 80 kbyte/sec test relay at home.

comment:28 Changed 8 years ago by andrea

The hibernate.c connection_mark_for_close() bug turned out to be the cause of bug 7267, so no new ticket to be filed there. A proposed fix is now in my bug 7267 branch.

comment:29 Changed 8 years ago by andrea

Resolution: fixed
Status: needs_reviewclosed

I've created ticket 7471 for demoronizing circuit_unlink_all_from_channel(); closing this one now.

comment:30 in reply to:  25 Changed 8 years ago by arma

Replying to andrea:

Is it possible for a spliced rendezvous circuit to, by coincidence, have the same p_chan on both ends? I think it is, and if so that could trigger the bug.

I think it is too.

Note: See TracTickets for help on using tickets.