cycle% image symbol------ ----- ------ 19.37% tor circuit_unlink_all_from_channel
It looks like that function crawls the whole circuit list every time we close a connection. With circuit lists being huge these days, it is plausible that it takes a lot of effort to crawl it.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items 0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items 0
Link issues together to show that they're related.
Learn more.
Maybe we could investigate some kind of cpu/memory trade-off, such as keeping a list of related circuits in the channel_t? This way, we can just iterate that list of circuits, instead of iterating global_circuitlist.
To see if it's worth it, we need to know some more info, like the size of the circuitlist in busy relays and the number of circuits that circuit_unlink_all_from_channel usually closes.
Mikeperry pointed out that "it looks like we should already have that mapping from channels to circuits in the circuitmux map".
I have an implementation in "bug9683_testing_024". It won't be faster yet, since it runs the old algorithm and the new cleverer one, and compares the results to make sure they match. Can somebody try it out on a relay, and see whether you get a bunch of BUG messages?
It also needs review. Adding andrea to the cc list here since she wrote the original circuitmux code.
Trac: Status: new to needs_review Cc: N/Ato andrea
I have an implementation in "bug9683_testing_024". It won't be faster yet, since it runs the old algorithm and the new cleverer one, and compares the results to make sure they match. Can somebody try it out on a relay, and see whether you get a bunch of BUG messages?
It also needs review. Adding andrea to the cc list here since she wrote the original circuitmux code.
Code review looks okay; I'll give it a bit of testing time on a relay.
Sep 25 04:04:37.000 [warn] circuit_unlink_all_from_channel(): Bug: Mismatch in list of detached circuits.Sep 25 04:04:37.000 [warn] circuit_unlink_all_from_channel(): Bug: Circuit on detached list which I had no reason to markSep 25 04:05:38.000 [err] channel_tls_from_base(): Bug: src/or/channeltls.c:319: channel_tls_from_base: Assertion chan->magic == TLS_CHAN_MAGIC failed; aborting.
Probably a memory overrun stomping the channel. Obviously I shouldn't review code at 3 AM.
I'm rerunning the test with the circuit_unlink_all_from_channel() bug messages breakpointed so I can catch the memory corruption as it happens and confirm. This is a bit slow since there has to be relay traffic to trigger it, so I get to wait until the next consensus goes out. Once I've got that, I'll modify it and retest.
Actually, there's another bug here: there exist a few circumstances where a circuit is not attached to its channel's mux but still has n_chan/p_chan set. Specifically, if the circuit is marked for close (see the circuitmux_detach_circuit() calls in circuit_mark_for_close_() which was the case I just observed this for), and briefly during circuit_set_circid_chan_helper() (but I don't think there's any way for this to be called in that interval).
Okay, is this something that has a fix, or do we need a different approach for this patch?
Not sure yet; need to check into whether there's actually any good reason for n_chan/p_chan to still be set after circuit_mark_for_close_() or it's just a historical accident. If the latter, then the easy fix is to set them to NULL immediately after calling circuitmux_detach_circuit().
I think it's safe to do this; we immediately exit handling any cells in relay.c if a circuit is marked for close - see append_cell_to_circuit_queue() and circuit_receive_relay_cell(). Probably best to patch circuit_mark_for_close() to null out n_chan/p_chan and test it a few days as a relay to see if it blows up, just to be sure.
Triage for 0.2.5.x: it sounds likely that this will be mergeable pending testing of the proposed circuit_mark_for_close_() change, so this and #7471 (moved) should remain in 0.2.5.x.
This branch just died with 'src/or/circuitlist.c:1726 assert_circuit_ok: Assertion c->magic == ORIGIN_CIRCUIT_MAGIC || c->magic == OR_CIRCUIT_MAGIC failed; aborting.' Stack trace follows:
#0 0x00007ffff69359b5 in raise () from /lib64/libc.so.6
#1 0x00007ffff693705f in abort () from /lib64/libc.so.6
#2 (closed) 0x00000000004a82d4 in assert_circuit_ok (c=0x2427ef0) at src/or/circuitlist.c:1726
#3 (closed) 0x00000000004a6bf3 in circuit_mark_for_close_ (circ=0x2427ef0, reason=1, line=352,
file=0x5bfbe0 "src/or/command.c") at src/or/circuitlist.c:1397
#4 (closed) 0x00000000004bdf76 in command_process_created_cell (cell=0x7fffffffdd40, chan=0x21cc0e0) at src/or/command.c:352
#5 (closed) 0x00000000004bd3c7 in command_process_cell (chan=0x21cc0e0, cell=0x7fffffffdd40) at src/or/command.c:143
#6 (closed) 0x000000000048c6c6 in channel_queue_cell (chan=0x21cc0e0, cell=0x7fffffffdd40) at src/or/channel.c:2540
#7 (closed) 0x00000000004941b8 in channel_tls_handle_cell (cell=0x7fffffffdd40, conn=0x224e780) at src/or/channeltls.c:923
#8 (closed) 0x00000000004ed0bf in connection_or_process_cells_from_inbuf (conn=0x224e780) at src/or/connection_or.c:1972
#9 (closed) 0x00000000004e9664 in connection_or_process_inbuf (conn=0x224e780) at src/or/connection_or.c:483
#10 (closed) 0x00000000004dc28e in connection_process_inbuf (conn=0x224e780, package_partial=1) at src/or/connection.c:4001
#11 (closed) 0x00000000004d9a2a in connection_handle_read_impl (conn=0x224e780) at src/or/connection.c:2839
#12 (closed) 0x00000000004d9b57 in connection_handle_read (conn=0x224e780) at src/or/connection.c:2880
#13 (closed) 0x000000000040bafb in conn_read_callback (fd=33, event=2, _conn=0x224e780) at src/or/main.c:718
#14 (closed) 0x00007ffff770cdd4 in event_process_active_single_queue (base=0x826e10, flags=)
at event.c:1350
#15 (closed) event_process_active (base=0x826e10, flags=) at event.c:1420
#16 (closed) event_base_loop (base=0x826e10, flags=) at event.c:1621
#17 (closed) 0x000000000040e65d in do_main_loop () at src/or/main.c:1987
#18 (closed) 0x000000000040fda3 in tor_main (argc=3, argv=0x7fffffffe5f8) at src/or/main.c:2703
#19 (closed) 0x0000000000409d74 in main (argc=3, argv=0x7fffffffe5f8) at src/or/tor_main.c:30
oh hey, maybe 957f252df5e8 is the problem. We're setting p_chan and/or n_chan to NULL without calling circuit_set_p_circid_chan() and/or circuit_set_n_circid_chan(). That means that later, when circuit_free() (or whoeever) calls circuit_set_[pn]_circid_chan() to remove the channel from the chan_circid_map, it won't find the old entry.
There's an obvious fix but I want to make sure it doesn't interfere with anything.
I added a fixup commit for the issue to my branch.
The changes to p/n_chan that I made above do seem like they might interfere with the channel_note_destroy_pending() mechanism, or they might not. I think not, but it needs consideration. We'll need to test this as merged into 0.2.5 separately from how we test it in 0.2.4.
(The reason I think that the changes to p/n_chan might interfere with channel_note_destroy_pending() is that the latter function assumes that a circuit that has been been through circuit_mark_for_close might still be present in the chan_circid_map. Then again, it might be that my patch in c8d9fd13edda93b049 actually simplifies this case so that the extra logic in channel_note_destroy_pending might not be needed any more. I'm not sure.)
Mar 26 02:05:02.940 [notice] List of 0 circuits was as expected.Mar 26 02:05:02.942 [notice] List of 0 circuits was as expected.Mar 26 02:05:02.944 [notice] List of 0 circuits was as expected.Mar 26 02:05:02.946 [notice] List of 0 circuits was as expected.Mar 26 02:05:02.948 [notice] List of 0 circuits was as expected.Mar 26 02:05:02.950 [notice] List of 0 circuits was as expected.Mar 26 02:05:02.952 [notice] List of 0 circuits was as expected.Mar 26 02:05:02.955 [notice] List of 0 circuits was as expected....