Opened 7 years ago

Closed 7 years ago

Last modified 6 years ago

#9683 closed defect (fixed)

circuit_unlink_all_from_channel() is a performance bottleneck?

Reported by: arma Owned by:
Priority: High Milestone: Tor: 0.2.5.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-relay, 024-backport, 025-triaged
Cc: andrea, jeroen@… Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

In https://lists.torproject.org/pipermail/tor-relays/2013-September/002715.html Andy reports some profiling on his fast exit. Top on his list is

cycle%   image              symbol
------   -----              ------
 19.37%  tor                circuit_unlink_all_from_channel

It looks like that function crawls the whole circuit list every time we close a connection. With circuit lists being huge these days, it is plausible that it takes a lot of effort to crawl it.

Child Tickets

Change History (33)

comment:1 Changed 7 years ago by asn

Maybe we could investigate some kind of cpu/memory trade-off, such as keeping a list of related circuits in the channel_t? This way, we can just iterate that list of circuits, instead of iterating global_circuitlist.

To see if it's worth it, we need to know some more info, like the size of the circuitlist in busy relays and the number of circuits that circuit_unlink_all_from_channel usually closes.

Mikeperry pointed out that "it looks like we should already have that mapping from channels to circuits in the circuitmux map".

comment:2 Changed 7 years ago by nickm

See also #7727 and #7471.

comment:3 Changed 7 years ago by nickm

Keywords: 024-backport added
Priority: normalmajor

comment:4 Changed 7 years ago by nickm

Cc: andrea added
Status: newneeds_review

I have an implementation in "bug9683_testing_024". It won't be faster yet, since it runs the old algorithm and the new cleverer one, and compares the results to make sure they match. Can somebody try it out on a relay, and see whether you get a bunch of BUG messages?

It also needs review. Adding andrea to the cc list here since she wrote the original circuitmux code.

comment:5 Changed 7 years ago by massar

Cc: jeroen@… added

comment:6 in reply to:  4 Changed 7 years ago by andrea

Replying to nickm:

I have an implementation in "bug9683_testing_024". It won't be faster yet, since it runs the old algorithm and the new cleverer one, and compares the results to make sure they match. Can somebody try it out on a relay, and see whether you get a bunch of BUG messages?

It also needs review. Adding andrea to the cc list here since she wrote the original circuitmux code.

Code review looks okay; I'll give it a bit of testing time on a relay.

comment:7 Changed 7 years ago by andrea

Test relay dies with this:

Sep 25 04:04:37.000 [warn] circuit_unlink_all_from_channel(): Bug: Mismatch in list of detached circuits.
Sep 25 04:04:37.000 [warn] circuit_unlink_all_from_channel(): Bug: Circuit on detached list which I had no reason to mark
Sep 25 04:05:38.000 [err] channel_tls_from_base(): Bug: src/or/channeltls.c:319: channel_tls_from_base: Assertion chan->magic == TLS_CHAN_MAGIC failed; aborting.

Probably a memory overrun stomping the channel. Obviously I shouldn't review code at 3 AM.

comment:8 Changed 7 years ago by andrea

Ah, I think I see the problem: smartlist_add(detached_2, chan) should be smartlist_add(detached_2, circ)

comment:9 Changed 7 years ago by nickm

Okay. Are you testing with that change now? Should I modify the branch, or wait for you?

comment:10 Changed 7 years ago by andrea

I'm rerunning the test with the circuit_unlink_all_from_channel() bug messages breakpointed so I can catch the memory corruption as it happens and confirm. This is a bit slow since there has to be relay traffic to trigger it, so I get to wait until the next consensus goes out. Once I've got that, I'll modify it and retest.

comment:11 Changed 7 years ago by andrea

Actually, there's another bug here: there exist a few circumstances where a circuit is not attached to its channel's mux but still has n_chan/p_chan set. Specifically, if the circuit is marked for close (see the circuitmux_detach_circuit() calls in circuit_mark_for_close_() which was the case I just observed this for), and briefly during circuit_set_circid_chan_helper() (but I don't think there's any way for this to be called in that interval).

comment:12 Changed 7 years ago by nickm

Okay, is this something that has a fix, or do we need a different approach for this patch?

comment:13 in reply to:  12 Changed 7 years ago by andrea

Replying to nickm:

Okay, is this something that has a fix, or do we need a different approach for this patch?

Not sure yet; need to check into whether there's actually any good reason for n_chan/p_chan to still be set after circuit_mark_for_close_() or it's just a historical accident. If the latter, then the easy fix is to set them to NULL immediately after calling circuitmux_detach_circuit().

comment:14 Changed 7 years ago by nickm

When we close this, we can maybe close #7471.

comment:15 Changed 7 years ago by andrea

I think it's safe to do this; we immediately exit handling any cells in relay.c if a circuit is marked for close - see append_cell_to_circuit_queue() and circuit_receive_relay_cell(). Probably best to patch circuit_mark_for_close() to null out n_chan/p_chan and test it a few days as a relay to see if it blows up, just to be sure.

comment:16 Changed 7 years ago by andrea

Triage for 0.2.5.x: it sounds likely that this will be mergeable pending testing of the proposed circuit_mark_for_close_() change, so this and #7471 should remain in 0.2.5.x.

comment:17 Changed 7 years ago by nickm

Okay, bug9683_testing_024 should do what you want now, I hope.

comment:18 Changed 7 years ago by andrea

I'm starting a test relay with this branch now.

comment:19 Changed 7 years ago by andrea

This branch just died with 'src/or/circuitlist.c:1726 assert_circuit_ok: Assertion c->magic == ORIGIN_CIRCUIT_MAGIC
c->magic == OR_CIRCUIT_MAGIC failed; aborting.' Stack trace follows:

#0 0x00007ffff69359b5 in raise () from /lib64/libc.so.6
#1 0x00007ffff693705f in abort () from /lib64/libc.so.6
#2 0x00000000004a82d4 in assert_circuit_ok (c=0x2427ef0) at src/or/circuitlist.c:1726
#3 0x00000000004a6bf3 in circuit_mark_for_close_ (circ=0x2427ef0, reason=1, line=352,

file=0x5bfbe0 "src/or/command.c") at src/or/circuitlist.c:1397

#4 0x00000000004bdf76 in command_process_created_cell (cell=0x7fffffffdd40, chan=0x21cc0e0) at src/or/command.c:352
#5 0x00000000004bd3c7 in command_process_cell (chan=0x21cc0e0, cell=0x7fffffffdd40) at src/or/command.c:143
#6 0x000000000048c6c6 in channel_queue_cell (chan=0x21cc0e0, cell=0x7fffffffdd40) at src/or/channel.c:2540
#7 0x00000000004941b8 in channel_tls_handle_cell (cell=0x7fffffffdd40, conn=0x224e780) at src/or/channeltls.c:923
#8 0x00000000004ed0bf in connection_or_process_cells_from_inbuf (conn=0x224e780) at src/or/connection_or.c:1972
#9 0x00000000004e9664 in connection_or_process_inbuf (conn=0x224e780) at src/or/connection_or.c:483
#10 0x00000000004dc28e in connection_process_inbuf (conn=0x224e780, package_partial=1) at src/or/connection.c:4001
#11 0x00000000004d9a2a in connection_handle_read_impl (conn=0x224e780) at src/or/connection.c:2839
#12 0x00000000004d9b57 in connection_handle_read (conn=0x224e780) at src/or/connection.c:2880
#13 0x000000000040bafb in conn_read_callback (fd=33, event=2, _conn=0x224e780) at src/or/main.c:718
#14 0x00007ffff770cdd4 in event_process_active_single_queue (base=0x826e10, flags=<value optimized out>)

at event.c:1350

#15 event_process_active (base=0x826e10, flags=<value optimized out>) at event.c:1420
#16 event_base_loop (base=0x826e10, flags=<value optimized out>) at event.c:1621
#17 0x000000000040e65d in do_main_loop () at src/or/main.c:1987
#18 0x000000000040fda3 in tor_main (argc=3, argv=0x7fffffffe5f8) at src/or/main.c:2703
#19 0x0000000000409d74 in main (argc=3, argv=0x7fffffffe5f8) at src/or/tor_main.c:30

comment:20 Changed 7 years ago by nickm

Still have the core? any chance of seeing what was in *circ?

comment:21 in reply to:  20 Changed 7 years ago by andrea

Replying to nickm:

Still have the core? any chance of seeing what was in *circ?

(gdb) frame 3
#3 0x00000000004a6bf3 in circuit_mark_for_close_ (circ=0x2427ef0, reason=1, line=352,

file=0x5bfbe0 "src/or/command.c") at src/or/circuitlist.c:1397

1397 src/or/circuitlist.c: No such file or directory.

in src/or/circuitlist.c

(gdb) print *circ
$1 = {magic = 4148688768, n_chan = 0x0, n_circ_id = 0, n_mux = 0x2427f30, n_chan_cells = {head = 0x0,

tail = 0x7ffff71a9840, n = -143680640, insertion_times = 0x71}, n_hop = 0xe6222efb7a94a304,

streams_blocked_on_n_chan = 1, streams_blocked_on_p_chan = 1, state = 225 '\341', purpose = 64 '@',
package_window = -1135566341, deliver_window = 1469111653, n_chan_create_cell = 0xa2a583aa00000000,
timestamp_began = {tv_sec = 6385653667941287783, tv_usec = 0}, timestamp_created = {tv_sec = 0, tv_usec = 0},
timestamp_dirty = 0, marked_for_close = 0, marked_for_close_file = 0x0, dirreq_id = 51539607552,
next = 0x7ffff76f9708, next_active_on_n_chan = 0x41, prev_active_on_n_chan = 0x0}

I have a copy of the core dump if you want to look closer.

comment:22 Changed 7 years ago by nickm

oh hey, maybe 957f252df5e8 is the problem. We're setting p_chan and/or n_chan to NULL without calling circuit_set_p_circid_chan() and/or circuit_set_n_circid_chan(). That means that later, when circuit_free() (or whoeever) calls circuit_set_[pn]_circid_chan() to remove the channel from the chan_circid_map, it won't find the old entry.

There's an obvious fix but I want to make sure it doesn't interfere with anything.

comment:23 Changed 7 years ago by nickm

I added a fixup commit for the issue to my branch.

The changes to p/n_chan that I made above do seem like they might interfere with the channel_note_destroy_pending() mechanism, or they might not. I *think* not, but it needs consideration. We'll need to test this as merged into 0.2.5 separately from how we test it in 0.2.4.

comment:24 Changed 7 years ago by nickm

(The reason I think that the changes to p/n_chan might interfere with channel_note_destroy_pending() is that the latter function assumes that a circuit that has been been through circuit_mark_for_close might still be present in the chan_circid_map. Then again, it might be that my patch in c8d9fd13edda93b049 actually simplifies this case so that the extra logic in channel_note_destroy_pending might not be needed any more. I'm not sure.)

comment:25 Changed 7 years ago by nickm

I've squashed and rebased this all to "bug9683_rebased" for testing. Now it's on master.

comment:26 Changed 7 years ago by andrea

Testing new branch now; hasn't blown up yet.

comment:27 Changed 7 years ago by andrea

Mar 17 07:39:15.000 [notice] Heartbeat: Tor's uptime is 2 days 17:47 hours, with 26 circuits open. I've sent 2.81 GB and received 2.83 GB.

...and it still lives!

comment:28 Changed 7 years ago by andrea

Eventually debug log ran my test box out of disk space, last heartbeat was:

Mar 17 18:39:15.000 [notice] Heartbeat: Tor's uptime is 3 days 4:47 hours, with 20 circuits open. I've sent 3.16 GB and received 3.19 GB.

I think I'm okay with this branch in light of that.

comment:29 Changed 7 years ago by nickm

Okay, I'll merge it after 0.2.5.3-alpha is out. Remind me then if I don't merge it right away.

comment:30 Changed 7 years ago by nickm

Keywords: 025-triaged added

comment:31 Changed 7 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Merged for 0.2.5.4-alpha.

comment:32 Changed 6 years ago by arma

moria1's log is getting bombed with

Mar 26 02:05:02.940 [notice] List of 0 circuits was as expected.
Mar 26 02:05:02.942 [notice] List of 0 circuits was as expected.
Mar 26 02:05:02.944 [notice] List of 0 circuits was as expected.
Mar 26 02:05:02.946 [notice] List of 0 circuits was as expected.
Mar 26 02:05:02.948 [notice] List of 0 circuits was as expected.
Mar 26 02:05:02.950 [notice] List of 0 circuits was as expected.
Mar 26 02:05:02.952 [notice] List of 0 circuits was as expected.
Mar 26 02:05:02.955 [notice] List of 0 circuits was as expected.
...

comment:33 Changed 6 years ago by nickm

whoops. That testing code isn't supposed to still be on. Fixed in 6da2544

Note: See TracTickets for help on using tickets.