Opened 15 months ago

Closed 15 months ago

Last modified 13 months ago

#18116 closed defect (fixed)

abort in circuitmux_detach_all_circuits on Tor Browser exit with stem scripts running

Reported by: teor Owned by: andrea
Priority: Very High Milestone: Tor: 0.2.8.x-final
Component: Core Tor/Tor Version:
Severity: Major Keywords: crash, TorCoreTeam201602, 2016-bug-retrospective
Cc: Actual Points:
Parent ID: #18117 Points:
Reviewer: Sponsor:

Description

Using master (0.2.8.0-alpha-dev) on OS X I get a crash in circuitmux_detach_all_circuits when I exit Tor Browser (modified to use the binary from a master build) while running a script similar to stem's custom_path_selection.py example.

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	0x00007fff8d787002 __pthread_kill + 10
1   libsystem_pthread.dylib       	0x00007fff96ce75c5 pthread_kill + 90
2   libsystem_c.dylib             	0x00007fff9de5f6e7 abort + 129
3   tor                           	0x00000001036b3e57 circuitmux_detach_all_circuits + 743
4   tor                           	0x00000001036a0abc channel_free_list + 236
5   tor                           	0x00000001036a0863 channel_free_all + 147
6   tor                           	0x0000000103727a43 tor_free_all + 131
7   tor                           	0x0000000103727cb8 tor_cleanup + 296
8   tor                           	0x0000000103727262 signal_callback + 2498
9   libevent-2.0.5.dylib          	0x0000000103902be0 event_signal_closure + 185
10  libevent-2.0.5.dylib          	0x00000001039029f7 event_process_active_single_queue + 237
11  libevent-2.0.5.dylib          	0x00000001039030b1 event_process_active + 122
12  libevent-2.0.5.dylib          	0x00000001039036ca event_base_loop + 647
13  tor                           	0x0000000103725aac do_main_loop + 1532
14  tor                           	0x0000000103727dbf tor_main + 223
15  tor                           	0x00000001036943a9 main + 25
16  libdyld.dylib                 	0x00007fff903cb5ad start + 1

Child Tickets

Attachments (1)

bug18116_stem.py (1.3 KB) - added by teor 15 months ago.
Minimal example that triggers the bug

Download all attachments as: .zip

Change History (20)

comment:1 Changed 15 months ago by teor

Here's the specific assertion and surrounding log messages:

27:07.000 [notice] No circuits are opened. Relaxed timeout for circuit 170239 (a Circuit made by controller 3-hop circuit in state doing handshakes with channel state open) to 60000ms. However, it appears the circuit has timed out anyway. 20 guards are live. [59 similar message(s) suppressed in last 3600 seconds]
00:12.000 [notice] Owning controller connection has closed -- exiting now.
00:12.000 [notice] Catching signal TERM, exiting cleanly.
00:12.000 [warn] Couldn't find circuit 2162585894 (for channel 599)
00:12.000 [err] tor_assertion_failed_: Bug: src/or/circuitmux.c:501: circuitmux_detach_all_circuits: Assertion to_remove->muxinfo.policy_data == NULL failed; aborting. (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug: Assertion to_remove->muxinfo.policy_data == NULL failed in circuitmux_detach_all_circuits at src/or/circuitmux.c:501. Stack trace: (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     0   tor                                 0x000000010378ba79 log_backtrace + 73 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     1   tor                                 0x0000000103799467 tor_assertion_failed_ + 151 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     2   tor                                 0x00000001036b3e52 circuitmux_detach_all_circuits + 738 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     3   tor                                 0x00000001036a0abc channel_free_list + 236 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     4   tor                                 0x00000001036a0863 channel_free_all + 147 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     5   tor                                 0x0000000103727a43 tor_free_all + 131 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     6   tor                                 0x0000000103727cb8 tor_cleanup + 296 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     7   tor                                 0x0000000103727262 signal_callback + 2498 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     8   libevent-2.0.5.dylib                0x0000000103902be0 event_signal_closure + 185 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     9   libevent-2.0.5.dylib                0x00000001039029f7 event_process_active_single_queue + 237 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     10  libevent-2.0.5.dylib                0x00000001039030b1 event_process_active + 122 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     11  libevent-2.0.5.dylib                0x00000001039036ca event_base_loop + 647 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     12  tor                                 0x0000000103725aac do_main_loop + 1532 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     13  tor                                 0x0000000103727dbf tor_main + 223 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     14  tor                                 0x00000001036943a9 main + 25 (on Tor 0.2.8.0-alpha-dev )
00:12.000 [err] Bug:     15  libdyld.dylib                       0x00007fff903cb5ad start + 1 (on Tor 0.2.8.0-alpha-dev )

comment:2 Changed 15 months ago by nickm

  • Owner set to andrea
  • Status changed from new to assigned

Does this have an easy explanation?

comment:3 follow-up: Changed 15 months ago by andrea

Is this reproducible? It sounds like something screwed up some cmux-internal data structures and it quite plausibly has the same cause as the line right above it:

00:12.000 [warn] Couldn't find circuit 2162585894 (for channel 599)

To find the cause of this, it'd be most helpful to figure out exactly where that warning came from; I suspect something destroyed a circuit without telling cmux about it. IIRC I had to fix a few like this while writing cmux by finding everywhere we were doing anything like that and making sure they went through proper interfaces - perhaps something new was introduced recently that doesn't do that?

comment:4 in reply to: ↑ 3 Changed 15 months ago by teor

Replying to andrea:

Is this reproducible?

I'll have time to check this in a few days.

It sounds like something screwed up some cmux-internal data structures and it quite plausibly has the same cause as the line right above it:

00:12.000 [warn] Couldn't find circuit 2162585894 (for channel 599)

To find the cause of this, it'd be most helpful to figure out exactly where that warning came from; I suspect something destroyed a circuit without telling cmux about it.

I'll have time to find where the log message is coming from in a few more days.

IIRC I had to fix a few like this while writing cmux by finding everywhere we were doing anything like that and making sure they went through proper interfaces - perhaps something new was introduced recently that doesn't do that?

Tor's implementation of the controller CLOSECIRCUIT command is a possible culprit here.

Alternately, it could be that when I create the circuit in the controller, I mark it as purpose = "controller", and it's these open circuits that are causing issues when I quit tor.

Otherwise, periodic events have been refactored in master (9f31908a), and attach_circuit has been decoupled (#17590), and I've had code merged that closes circuits, but I believe it does it through the proper interfaces (#4483).

comment:5 Changed 15 months ago by teor

  • Parent ID set to #18117

Also see #18117, but let's fix this first and see if that goes away.

Changed 15 months ago by teor

Minimal example that triggers the bug

comment:6 Changed 15 months ago by teor

The bug is reproducible, but it's a race condition, which means getting the timing right can be difficult. I think it happens if tor is terminated just after stem calls close_circuit(), but before some cleanup is done (perhaps the next second_elapsed_callback()?).

To reproduce:

  • start a tor client
  • run python bug18116_stem.py [ControlPort]
  • kill tor when the stem script asks you to

To increase your chances of triggering the race condition:

  • increase N_ATTEMPTS in the script,
  • run multiple scripts against the same tor instance,
  • terminate tor after they've each made a few connections.

(You only have to get the timing right with one script to trigger the bug.)

comment:7 Changed 15 months ago by teor

(I couldn't reproduce this issue on 0.2.7.6, even with multiple connections happening at once.)

comment:8 Changed 15 months ago by andrea

The warning originates on circuitmux.c:488, BTW, also in circuitmux_detach_all_circuits():

 391 /**
 392  * Detach all circuits from a circuitmux (use before circuitmux_free())
 393  *
 394  * If <b>detached_out</b> is non-NULL, add every detached circuit_t to
 395  * detached_out.
 396  */
 397 
 398 void
 399 circuitmux_detach_all_circuits(circuitmux_t *cmux, smartlist_t *detached_out)
 400 {
 401   chanid_circid_muxinfo_t **i = NULL, *to_remove;
 402   channel_t *chan = NULL;
 403   circuit_t *circ = NULL;
 404 
 405   tor_assert(cmux);
 406   /*
 407    * Don't circuitmux_assert_okay_paranoid() here; this gets called when
 408    * channels are being freed and have already been unregistered, so
 409    * the channel ID lookups it does will fail.
 410    */
 411 
 412   i = HT_START(chanid_circid_muxinfo_map, cmux->chanid_circid_map);
 413   while (i) {
 414     to_remove = *i;
 415 
 416     if (! to_remove) {
 417       log_warn(LD_BUG, "Somehow, an HT iterator gave us a NULL pointer.");
 418       break;
 419     } else {
 420       /* Find a channel and circuit */
 421       chan = channel_find_by_global_id(to_remove->chan_id);
 422       if (chan) {
 423         circ =
 424           circuit_get_by_circid_channel_even_if_marked(to_remove->circ_id,
 425                                                        chan);
 426         if (circ) {
 427           /* Clear the circuit's mux for this direction */
 428           if (to_remove->muxinfo.direction == CELL_DIRECTION_OUT) {
 429             /*
 430              * Update active_circuits et al.; this does policy notifies, so
 431              * comes before freeing policy data
 432              */
 433 
 434             if (to_remove->muxinfo.cell_count > 0) {
 435               circuitmux_make_circuit_inactive(cmux, circ, CELL_DIRECTION_OUT);
 436             }
 437 
 438             /* Clear n_mux */
 439             circ->n_mux = NULL;
 440 
 441             if (detached_out)
 442               smartlist_add(detached_out, circ);
 443           } else if (circ->magic == OR_CIRCUIT_MAGIC) {
 444             /*
 445              * Update active_circuits et al.; this does policy notifies, so
 446              * comes before freeing policy data
 447              */
 448 
 449             if (to_remove->muxinfo.cell_count > 0) {
 450               circuitmux_make_circuit_inactive(cmux, circ, CELL_DIRECTION_IN);
 451             }
 452 
 453             /*
 454              * It has a sensible p_chan and direction == CELL_DIRECTION_IN,
 455              * so clear p_mux.
 456              */
 457             TO_OR_CIRCUIT(circ)->p_mux = NULL;
 458 
 459             if (detached_out)
 460               smartlist_add(detached_out, circ);
 461           } else {
 462             /* Complain and move on */
 463             log_warn(LD_CIRC,
 464                      "Circuit %u/channel " U64_FORMAT " had direction == "
 465                      "CELL_DIRECTION_IN, but isn't an or_circuit_t",
 466                      (unsigned)to_remove->circ_id,
 467                      U64_PRINTF_ARG(to_remove->chan_id));
 468           }
 469 
 470           /* Free policy-specific data if we have it */
 471           if (to_remove->muxinfo.policy_data) {
 472             /*
 473              * If we have policy data, assert that we have the means to
 474              * free it
 475              */
 476             tor_assert(cmux->policy);
 477             tor_assert(cmux->policy->free_circ_data);
 478             /* Call free_circ_data() */
 479             cmux->policy->free_circ_data(cmux,
 480                                          cmux->policy_data,
 481                                          circ,
 482                                          to_remove->muxinfo.policy_data);
 483             to_remove->muxinfo.policy_data = NULL;
 484           }
 485         } else {
 486           /* Complain and move on */
 487           log_warn(LD_CIRC,
 488                    "Couldn't find circuit %u (for channel " U64_FORMAT ")",
 489                    (unsigned)to_remove->circ_id,
 490                    U64_PRINTF_ARG(to_remove->chan_id));
 491         }
 492       } else {
 493         /* Complain and move on */
 494         log_warn(LD_CIRC,
 495                  "Couldn't find channel " U64_FORMAT " (for circuit id %u)",
 496                  U64_PRINTF_ARG(to_remove->chan_id),
 497                  (unsigned)to_remove->circ_id);
 498       }
 499 
 500       /* Assert that we don't have un-freed policy data for this circuit */
 501       tor_assert(to_remove->muxinfo.policy_data == NULL);
 502     }
 503 
 504     i = HT_NEXT_RMV(chanid_circid_muxinfo_map, cmux->chanid_circid_map, i);
 505 
 506     /* Free it */
 507     tor_free(to_remove);
 508   }
 509 
 510   cmux->n_circuits = 0;
 511   cmux->n_active_circuits = 0;
 512   cmux->n_cells = 0;
 513 }

comment:9 Changed 15 months ago by andrea

So, the assert is happening because it failed to find a circuit for the chanid/circid pair left over in the chanid_circid_muxinfo_map, and couldn't free the cmux policy specific data for that reason. The bug here is definitely that something is removing a circuit earlier on without telling cmux about it, and leaving behind a chanid_circid_muxinfo_map entry.

comment:10 Changed 15 months ago by andrea

Another possible option for trying to repro is to rebuild with CMUX_PARANOIA #defined, which causes many cmux calls to perform expensive consistency checks and will make data structure problems like this surface much earlier.

comment:11 Changed 15 months ago by andrea

I can definitely buy the second_elapsed_callback() thing; the order of some of those cleanup calls mattered for circuitmuxes IIRC, and that code got messed with recently.

comment:12 Changed 15 months ago by nickm

  • Priority changed from Medium to Very High
  • Severity changed from Normal to Major

comment:13 Changed 15 months ago by asn

Seems like #18251 is a similar bug but on relays.

comment:14 Changed 15 months ago by andrea

Ah, looks like my theory was wrong, so don't worry about the second_elapsed_callback() code. The bug was introduced in 8b4e5b7ee902fb7fa07767410a18433d752c7aef from nickm's decouple_circuit_mark branch, which split circuit_about_to_free() out from circuit_free(), but left a bare circuit_free() call in the circuit_free_all() loop, so that if any circuits are left open at the time that call occurs (this is the race), cmux won't be properly notified to detach them.

comment:15 Changed 15 months ago by andrea

  • Status changed from assigned to needs_review

See my bug18116 branch for proposed fix.

comment:16 follow-up: Changed 15 months ago by nickm

  • Resolution set to fixed
  • Status changed from needs_review to closed

Looks reasonable; merging!

(I'm renaming "terminal" to "atexit" because "terminal" also means something else.)

comment:17 in reply to: ↑ 16 Changed 15 months ago by cypherpunks

Replying to nickm:

(I'm renaming "terminal" to "atexit" because "terminal" also means something else.)

This change isn't in the repository?

comment:18 Changed 15 months ago by nickm

whoops. cae59b913f7daa154c6b1eb9083d1f582c8d2a1e does it. (I started reviewing the next ticket while this was compiling, and forgot I had more work to do)

comment:19 Changed 13 months ago by nickm

  • Keywords 2016-bug-retrospective added

Marking these tickets (based on severity and hand-review) for inclusion in 2016 bug retrospective

Note: See TracTickets for help on using tickets.