Opened 6 years ago

Closed 5 years ago

#7350 closed defect (fixed)

Assertion chan->state == CHANNEL_STATE_OPENING || chan->state == CHANNEL_STATE_OPEN || chan->state == CHANNEL_STATE_MAINT failed

Reported by: andrea Owned by: andrea
Priority: High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version: Tor: 0.2.4.5-alpha
Severity: Keywords: tor-relay
Cc: andrea Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

I saw this while trying to reproduce #7267:

src/or/channel.c:1668 channel_write_cell_queue_entry: Assertion chan->state == CHANNEL_STATE_OPENING
chan->state == CHANNEL_STATE_OPEN chan->state == CHANNEL_STATE_MAINT failed; aborting.

The stack trace is:

#0 0x00007ffff69619b5 in raise () from /lib64/libc.so.6
#1 0x00007ffff696305f in abort () from /lib64/libc.so.6
#2 0x000000000047fbd9 in channel_write_cell_queue_entry (chan=0x1409340, q=0x7fffffffdd90) at src/or/channel.c:1666
#3 0x00000000004801ef in channel_write_cell (chan=0x1409340, cell=0x7fffffffddf0) at src/or/channel.c:1748
#4 0x00000000004831b6 in channel_send_destroy (circ_id=62492, chan=0x1409340, reason=8) at src/or/channel.c:2598
#5 0x0000000000498a5f in circuit_mark_for_close_ (circ=0x198fe80, reason=8, line=1047,

file=0x5753a9 "src/or/circuitlist.c") at src/or/circuitlist.c:1442

#6 0x0000000000497996 in circuit_unlink_all_from_channel (chan=0x11a7720, reason=8) at src/or/circuitlist.c:1047
#7 0x000000000047ec78 in channel_closed (chan=0x11a7720) at src/or/channel.c:1288
#8 0x00000000004d50b8 in connection_or_about_to_close (or_conn=0x26e2c80) at src/or/connection_or.c:601
#9 0x00000000004bf91a in connection_about_to_close_connection (conn=0x26e2c80) at src/or/connection.c:633
#10 0x0000000000409108 in connection_unlink (conn=0x26e2c80) at src/or/main.c:385
#11 0x000000000040aa69 in conn_close_if_marked (i=18) at src/or/main.c:932
#12 0x000000000040a151 in close_closeable_connections () at src/or/main.c:702
#13 0x000000000040be70 in run_scheduled_events (now=1352251511) at src/or/main.c:1535
#14 0x000000000040c396 in second_elapsed_callback (timer=0xd7afc0, arg=0x0) at src/or/main.c:1691
#15 0x000000000054f892 in periodic_timer_cb (fd=-1, what=1, arg=0xd7afc0) at src/common/compat_libevent.c:599
#16 0x00007ffff772f930 in event_process_active (base=0x7e04d0, flags=<value optimized out>) at event.c:395
#17 event_base_loop (base=0x7e04d0, flags=<value optimized out>) at event.c:547
#18 0x000000000040cbdd in do_main_loop () at src/or/main.c:1987
#19 0x000000000040e19d in tor_main (argc=3, argv=0x7fffffffe638) at src/or/main.c:2699
#20 0x00000000004087a4 in main (argc=3, argv=0x7fffffffe638) at src/or/tor_main.c:30

Child Tickets

Attachments (1)

bug7350-debug.log.txt (108.1 KB) - added by andrea 6 years ago.

Download all attachments as: .zip

Change History (21)

comment:1 Changed 6 years ago by andrea

This was after about 25 hours of operation with a bit under 3G relayed. It looks like what happened was it somehow got stalled for a while, and then this assert got triggered while resuming and closing old circuits.

From GDB:

Program received signal SIGPIPE, Broken pipe.

---Type <return> to continue, or q <return> to quit---
Program received signal SIGPIPE, Broken pipe.

src/or/channel.c:1668 channel_write_cell_queue_entry: Assertion chan->state == CHANNEL_STATE_OPENING
chan->state == CHANNEL_STATE_OPEN chan->state == CHANNEL_STATE_MAINT failed; aborting.

Program received signal SIGABRT, Aborted.
0x00007ffff69619b5 in raise () from /lib64/libc.so.6

From notice-level log:

Nov 06 16:09:29.000 [notice] Heartbeat: Tor's uptime is 1 day 0:30 hours, with 14 circuits open. I've sent 2.89 GB and received 2.72 GB.
Nov 06 17:25:11.000 [warn] Your system clock just jumped 3981 seconds forward; assuming established circuits no longer work.

Nov 06 17:25:11.000 [err] channel_write_cell_queue_entry(): Bug: src/or/channel.c:1668: channel_write_cell_queue_entry: Assertion chan->state == CHANNEL_STATE_OPENING
chan->state == CHANNEL_STATE_OPEN chan->state == CHANNEL_STATE_MAINT failed; aborting.

comment:2 Changed 6 years ago by andrea

Is it possible gdb is actually dumb enough to block the attached program after enough SIGPIPE messages to fill the terminal and make it issue the "Type <return> to continue" prompt?

comment:3 Changed 6 years ago by andrea

Anyway, it looks like something goes wonky resuming after the clock skips, so there's a bug there somewhere. I'll attach a section of debug log output.

Changed 6 years ago by andrea

Attachment: bug7350-debug.log.txt added

comment:4 Changed 6 years ago by andrea

Hmm, I don't think it quite makes sense to channel_send_destroy() if the channel is closing, so let's just make that function check for this case, and let's have circuit_mark_for_close_() avoid calling it too.

comment:5 Changed 6 years ago by andrea

For the record before I kill the process:

(gdb) print *chan
$3 = {magic = 2316903463, state = CHANNEL_STATE_CLOSING, global_identifier = 8008, registered = 1 '\001',

reason_for_closing = CHANNEL_CLOSE_FROM_BELOW, timestamp_created = 1352246346, timestamp_active = 1352251511,
free = 0, close = 0x487b53 <channel_tls_close_method>,
describe_transport = 0x487c50 <channel_tls_describe_transport_method>, dumpstats = 0,
cell_handler = 0x4ab9aa <command_process_cell>, var_cell_handler = 0x4abaa4 <command_process_var_cell>,
get_remote_addr = 0x487d8a <channel_tls_get_remote_addr_method>,
get_remote_descr = 0x487f55 <channel_tls_get_remote_descr_method>,
has_queued_writes = 0x488204 <channel_tls_has_queued_writes_method>,
is_canonical = 0x48834b <channel_tls_is_canonical_method>,
matches_extend_info = 0x4884c6 <channel_tls_matches_extend_info_method>,
matches_target = 0x488639 <channel_tls_matches_target_method>,
write_cell = 0x488807 <channel_tls_write_cell_method>,
write_packed_cell = 0x4889ce <channel_tls_write_packed_cell_method>,
write_var_cell = 0x488baa <channel_tls_write_var_cell_method>,
identity_digest = "\032z4\375\026\036\357# r\216y\373V9\026`2\231U", nickname = 0x0, next_with_same_id = {

le_next = 0x0, le_prev = 0x104fc98}, incoming_queue = {sqh_first = 0x0, sqh_last = 0x1409410}, outgoing_queue = {
sqh_first = 0x0, sqh_last = 0x1409420}, cmux = 0xe24630, circ_id_type = CIRC_ID_TYPE_LOWER, next_circ_id = 477,

num_n_circuits = 0, num_p_circuits = 1, is_bad_for_new_circs = 0, is_client = 0, is_incoming = 1, is_local = 0,
timestamp_client = 0, timestamp_drained = 1352251511, timestamp_recv = 1352251511, timestamp_xmit = 1352251511,
timestamp_last_added_nonpadding = 1352247530, dirreq_id = 0, n_cells_recved = 342, n_cells_xmitted = 988}

comment:6 Changed 6 years ago by andrea

Status: newneeds_review

Proposed fix in my bug7350 branch.

comment:7 Changed 6 years ago by nickm

(looks good; needs changes file-- then feel free to merge.)

comment:8 Changed 6 years ago by andrea

Resolution: fixed
Status: needs_reviewclosed

Reviewed by nickm and merged.

comment:9 Changed 6 years ago by fk

Resolution: fixed
Status: closedreopened

Apparently the assertion can still be triggered by Tor 0.2.4.7-alpha:

Dec 28 11:12:16.000 [notice] Received reload signal (hup). Reloading config and resetting internal state.
Dec 28 11:12:16.000 [notice] Read configuration file "/usr/share/tor/tor-service-defaults-torrc".
Dec 28 11:12:16.000 [notice] Read configuration file "/etc/tor/torrc".
Dec 28 11:12:16.000 [notice] Tor 0.2.4.7-alpha (git-91734c428358878d) opening log file.
Dec 28 11:12:16.000 [notice] Configured hibernation.  This interval began at 2012-12-01 00:00:00; 
the scheduled wake-up time was 2012-12-01 00:00:00; we expect to exhaust our quota for this
interval around 2013-01-01 00:00:00; the next interval begins at 2013-01-01 00:00:00 (all times local)
Dec 28 16:53:05.000 [notice] Heartbeat: Tor's uptime is 1 day 6:00 hours, with 1698 circuits open.
I've sent 89.43 GB and received 88.78 GB.
Dec 28 16:53:05.000 [notice] Average packaged cell fullness: 92.220%
Dec 28 22:07:41.000 [notice] Bandwidth soft limit reached; commencing hibernation. No new connections will be accepted
Dec 28 22:33:27.000 [notice] Going dormant. Blowing away remaining connections.
Dec 28 22:33:27.000 [err] channel_write_cell_queue_entry(): Bug: src/or/channel.c:1681:
channel_write_cell_queue_entry: Assertion chan->state == CHANNEL_STATE_OPENING ||
chan->state == CHANNEL_STATE_OPEN || chan->state == CHANNEL_STATE_MAINT failed; aborting.

comment:10 Changed 6 years ago by nickm

Cc: andrea added
Priority: normalmajor

Yuck. Did you get a stack trace on that one at all?

The "Going dormant. Blowing away remaining connections." line looks like it could be a clue.

comment:11 Changed 6 years ago by fk

Unfortunately core file size was set to 0 so the log messages are all the information I have.

I agree that the "Going dormant" message might be relevant and already slightly increased AccountingMax to see if the problem is reproducible.

If it is, I should be able to provide a stack trace.

comment:12 Changed 6 years ago by fk

While I reproduced the issue, it turns out that deb.torproject.org provides PIE binaries even for the alphas, and apparently this (or something else?) makes gdb sad:

GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...

warning: The current binary is a PIE (Position Independent Executable), which
GDB does NOT currently support.  Most debugger features will fail if used
in this session.

Reading symbols from /usr/sbin/tor...Reading symbols from /usr/lib/debug/usr/sbin/tor...done.
(no debugging symbols found)...done.
[New Thread 3899]
Core was generated by `/usr/sbin/tor --defaults-torrc /usr/share/tor/tor-service-defaults-torrc --hush'.
Program terminated with signal 6, Aborted.
#0  0xb7db3832 in ?? () from /lib/ld-linux.so.2
(gdb) where
#0  0xb7db3832 in ?? () from /lib/ld-linux.so.2
#1  0xb7a8b781 in ?? ()
#2  0xb7a8ebb2 in ?? ()
#3  0xb7e52b50 in ?? ()
[...]

Anyway, I also enabled debug logging so the exercise wasn't completely pointless:

Dec 29 16:58:55.067 [info] hibernate_go_dormant(): Closing conn type 4
Dec 29 16:58:55.067 [debug] channel_mark_for_close(): Closing channel 0xb55b7408 (global ID 5377) by request
Dec 29 16:58:55.067 [debug] channel_change_state(): Changing state of channel 0xb55b7408 (global ID 5377) from "open" to "closing"
Dec 29 16:58:55.067 [debug] channel_remove_from_digest_map(): Removed channel 0xb55b7408 (global ID 5377) from identity map in state closing (4) with digest B765CC9AAB8EA8E2FDD275D60433C56C0DE5645E
Dec 29 16:58:55.067 [debug] connection_mark_for_close_internal_(): Calling connection_mark_for_close_internal_() on an OR conn at src/or/connection_or.c:1148
Dec 29 16:58:55.067 [info] hibernate_go_dormant(): Closing conn type 4
[...]
Dec 29 16:58:55.085 [debug] channel_remove_from_digest_map(): Removed channel 0xb4ae7490 (global ID 6977) from identity map in state closing (4) with digest DF62347844BBFBF0A7C5ED340B96CB5A62481652
Dec 29 16:58:55.085 [debug] connection_mark_for_close_internal_(): Calling connection_mark_for_close_internal_() on an OR conn at src/or/connection_or.c:1148
Dec 29 16:58:55.085 [info] hibernate_go_dormant(): Closing conn type 4
Dec 29 16:58:55.085 [debug] channel_mark_for_close(): Closing channel 0xb3fc9178 (global ID 7066) by request
Dec 29 16:58:55.085 [debug] channel_change_state(): Changing state of channel 0xb3fc9178 (global ID 7066) from "opening" to "closing"
Dec 29 16:58:55.085 [debug] connection_mark_for_close_internal_(): Calling connection_mark_for_close_internal_() on an OR conn at src/or/connection_or.c:1148
Dec 29 16:58:55.085 [debug] connection_edge_end(): Sending end on conn (fd 181).
Dec 29 16:58:55.085 [debug] relay_send_command_from_edge(): delivering 3 cell backward.
Dec 29 16:58:55.085 [debug] append_cell_to_circuit_queue(): Made a circuit active.
Dec 29 16:58:55.085 [debug] append_cell_to_circuit_queue(): Primed a buffer.
Dec 29 16:58:55.085 [debug] channel_write_packed_cell(): Writing packed_cell_t 0xb3b9f3d8 to channel 0xb55b7408 with global ID 5377
Dec 29 16:58:55.085 [err] channel_write_cell_queue_entry(): Bug: src/or/channel.c:1681: channel_write_cell_queue_entry: Assertion chan->state == CHANNEL_STATE_OPENING || chan->state == CHANNEL_STATE_OPEN || chan->state == CHANNEL_STATE_MAINT failed; aborting.

comment:13 in reply to:  12 ; Changed 6 years ago by rransom

Replying to fk:

While I reproduced the issue, it turns out that deb.torproject.org provides PIE binaries even for the alphas, and apparently this (or something else?) makes gdb sad:

Install the tor-dbg package and try again.

comment:14 in reply to:  13 Changed 6 years ago by fk

Replying to rransom:

Replying to fk:

While I reproduced the issue, it turns out that deb.torproject.org provides PIE binaries even for the alphas, and apparently this (or something else?) makes gdb sad:

Install the tor-dbg package and try again.

tor-dbg is installed already and gdb seems to read /usr/lib/debug/usr/sbin/tor just fine.

comment:15 Changed 5 years ago by nickm

Keywords: tor-relay added

comment:16 Changed 5 years ago by nickm

Ah, okay. The channel is closing because we're hibernating. So it marks the or_connection, but tells it to flush. And then a connection_edge_end tells it to queue an END cell while the or_connection is marked but flushing and the channel is in state CHANNEL_STATE_CLOSING.

comment:17 in reply to:  16 Changed 5 years ago by andrea

Replying to nickm:

Ah, okay. The channel is closing because we're hibernating. So it marks the or_connection, but tells it to flush. And then a connection_edge_end tells it to queue an END cell while the or_connection is marked but flushing and the channel is in state CHANNEL_STATE_CLOSING.

I think the fix is to not mark the or_connection or close the channel right away, but keep track of that until after we queue the END cell. I'd have to have a closer look to know how complex that is.

comment:18 Changed 5 years ago by nickm

Status: reopenedneeds_review

How about branch "bug7350_redux" in my public repository.

comment:19 in reply to:  18 Changed 5 years ago by andrea

Replying to nickm:

How about branch "bug7350_redux" in my public repository.

I think this is correct and mergeable.

comment:20 Changed 5 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Great; merging.

Note: See TracTickets for help on using tickets.