Subsystem "thread_cleanup" is never called

changed milestone to %Tor: unspecified

added 044-deferred actualpoints::.2 component::core tor/tor milestone::Tor: unspecified owner::nickm priority::medium reviewer::catalyst severity::normal status::accepted type::defect labels

Trac:
Keywords: N/A deleted, 042-should added
Milestone: N/A to Tor: 0.4.2.x-final

I made a PR on GitHub with some changes that should fix this problem. Feel free to use them if they're helpful.

This fixes ticket #32103 (moved). It extends the threadpool to allow a customizable thread spawn function. This allows us to use our own spawn function which calls subsystems_thread_cleanup and a new subsystems_thread_init. It also calls 'spawn_exit', which negates the need for 1412. Finally the above two functions are called in the main thread after subsystems_init and before subsystems_shutdown respectively.

Trac:
Status: new to needs_review

Trac:
Reviewer: N/A to catalyst

Trac:
Keywords: N/A deleted, extra-review added

Thanks for the patches!

Overall, it looks like a good start. There are a few changes that I think should happen before merging. I'm also asking nickm to do some extra review in case there are subtle things related to how we use threads.

The new wrapper in tor_threads.c apparently changes the contract of spawn_func(), which says func should not return, but rather should call spawn_exit(). The documentation of spawn_func() isn't changed accordingly. Maybe the wrapper for the thread-local shutdown should go in spawn_exit(), instead of expecting the thread-start function to return? It seems like this would have been caught by a unit test that tests the new functionality. (I'm not quite sure how difficult such a test would be to write. Please let me know if it would be exceptionally difficult.)

At an architectural level, I would prefer to avoid adding a new general interface that only has one user. It seems to add considerable complexity: new fields and parameters in several places, new files, etc.

Why not call subsystems_thread_init() from spawn_func() and subsystems_thread_cleanup() from spawn_exit()?

Minor: The patches expand tor_run_main() beyond the practracker limit of 100 lines. (This is causing CI to fail.) We can consider refactoring, or add a new exception for that function.

Minor: The loop direction fixup should probably be squashed before merge.

Minor: There should probably be a changes file (see doc/HACKING/CodingStandards.md). We can help with this if you like.

You noticed that the crypto subsystem uses a singleton pattern currently. Your patches don't seem to modify this. Is that intentional?

How will these changes interact with building tor as a library that possibly gets started and stopped, or loaded and unloaded?

Trac:
Reviewer: catalyst to nickm

Replying to catalyst:

Thanks for the patches!

Overall, it looks like a good start. There are a few changes that I think should happen before merging. I'm also asking nickm to do some extra review in case there are subtle things related to how we use threads.

Thanks for taking a look at it! I'll try to explain some of my design reasoning below. Let me know what changes you'd like me to make, and I can do that.

The new wrapper in tor_threads.c apparently changes the contract of spawn_func(), which says func should not return, but rather should call spawn_exit(). The documentation of spawn_func() isn't changed accordingly. Maybe the wrapper for the thread-local shutdown should go in spawn_exit(), instead of expecting the thread-start function to return? It seems like this would have been caught by a unit test that tests the new functionality. (I'm not quite sure how difficult such a test would be to write. Please let me know if it would be exceptionally difficult.)

At an architectural level, I would prefer to avoid adding a new general interface that only has one user. It seems to add considerable complexity: new fields and parameters in several places, new files, etc.

Why not call subsystems_thread_init() from spawn_func() and subsystems_thread_cleanup() from spawn_exit()?

My reason for this was to make the start_tor_thread functionality be an app-level abstraction on top of the original tor threading library. Calling the subsystem manager code (such as subsystems_thread_init()) from the threading library (such as within spawn_func()) causes the subsystem module to become a dependency of the threading library. My goal was to not have this library (in the lib/ directory) depend on application-specific code (in the app/ directory). This approach also means that someone using a thread (created with start_tor_thread()) does not need to worry about running spawn_exit() or any additional cleanup-code since it will happen automatically, hopefully reducing the possibility of bugs. For example, the threadpool already forgets to call spawn_exit() (see https://github.com/torproject/tor/pull/1412/), which means there would be a bug.

Additionally, in the future someone may wish to create a new thread which does something very simple, and may not want that thread interacting with the tor subsystem manager. Integrating the subsystem manager into the threading library would make this impossible.

Minor: The patches expand tor_run_main() beyond the practracker limit of 100 lines. (This is causing CI to fail.) We can consider refactoring, or add a new exception for that function.

Yes, I didn't want to refactor tor's main method in this PR since that could probably be a whole new PR on its own, but also didn't want to reduce the readability of this PR to satisfy the CI system. (Aside: practracker seems to include empty lines in the limit, but it would be nice if it ignored them since empty lines can make the code more readable). Let me know what approach you would prefer and I can make that change.

Minor: The loop direction fixup should probably be squashed before merge.

I'm not sure how GitHub handles push --force when people have left comments on commits (I don't want to overwrite teor's comment), so I haven't done this yet. Please let me know if you'd like me to squash now or later.

Minor: There should probably be a changes file (see doc/HACKING/CodingStandards.md). We can help with this if you like.

Okay thanks, I'll add one and then just let me know if you'd like me to change it.

You noticed that the crypto subsystem uses a singleton pattern currently. Your patches don't seem to modify this. Is that intentional?

I didn't modify that in this PR since I didn't want to make too many changes at once (although I have modified it in my own personal changes to tor).

How will these changes interact with building tor as a library that possibly gets started and stopped, or loaded and unloaded?

I'm not sure, I have no experience building tor as a library. Currently threads are only created when tor is running as a relay (in server mode). Does anyone use tor as a library when running a relay? I feel like this would already break things, such as the threadpool which is never freed, and whose work entries are also not freed when shutting down.

I've done a secondary review at Catalyst's request. Comments are on the PR. I think I agree with Catalyst's suggestions.

Trac:
Status: needs_review to needs_revision

Thanks for the review. If you want to call subsystems_thread_cleanup() from spawn_exit(), then I think it's best to close this PR, and make a new patch that does that instead since none of this code will be needed. The problem is that it make linking complicated, since now you need to link the tor code from app/ into the tools in src/tools, or to remove the lib/thread code from these tools. I'm not sure the best way to do this, so I'll let someone else who is more familiar with the tor build system take a look at this.

This fix is unlikely to make our 0.4.2 release.

Feel free to open a new PR, and put the link in this ticket.

Trac:
Milestone: Tor: 0.4.2.x-final to Tor: 0.4.3.x-final
Keywords: 042-should deleted, 043-should added

The 'defect' status and milestone should probably be removed from this ticket. When I made the ticket I didn't realize that threads in tor never exit, so there is no reason to clean up any memory in these threads. It might be best just to remove the thread_cleanup function altogether.

Do we need a thread cleanup function, nickm?

Trac:
Owner: N/A to nickm
Status: needs_revision to assigned

I believe we do. Threads do exit (at shutdown), and thread-local storage is indeed a thing we use.

See branch bug32103 with PR at https://github.com/torproject/tor/pull/1699

This is nontrivial code, and it invokes code that wasn't actually run before. Therefore I think we shouldn't backport it?, but it is worth doing this patch to avoid memory leaks on embedded stop/start usage.

Trac:
Actualpoints: N/A to .2
Status: assigned to needs_review

Hi, just wanted to leave a few comments:

(1) The function 'tor_run_thread_cleanup_fn()' is never run. I checked this by adding the following and checking that they never execute (added the call to system since I didn't check if the logging subsystem will still log at this point of the shutdown procedure).

tor_run_thread_cleanup_fn(void)
{
  log_warn(LD_OR, "HERE");
  system("touch /tmp/abc");
  if (thread_cleanup_fn)
    thread_cleanup_fn();
}

(2) The code comments still say "func should not return, but rather should call spawn_exit", but if you do this, then "tor_run_thread_cleanup_fn()" would be run twice (once in spawn_exit and once in tor_pthread_helper_fn).

I don't think that second part is true -- spawn_exit() causes the current thread to exit without running any more code, so it won't reach the end of tor_pthread_helper_fn.

Ah you're right sorry, it would only be called once, although it is a little confusing. Is there a reason not to remove spawn_exit() entirely at this point and just call pthread_exit(NULL) within the wrapper?

One other thing is that you're writing to a global (thread_cleanup_fn) in the main thread, but reading it from other threads. This should be fine (I'm assuming NUMA won't complicate things), but I think it's worth documenting that you must not call tor_set_thread_cleanup_fn() after you have started any threads.

Trac:
Reviewer: nickm to N/A

Hm. I'd rather retain the option to have both ways to exit a thread, since we're likely to do more thread stuff in the future.

I'll add a comment about setting thread_cleanup_fn before starting any threads.

It looks like the reason that the cleanup functions don't currently run is that the thread pool itself is not shut down anywhere on shutdown, so the threads are not told to exit. This is not yet a problem for embedded users, since they are mostly clients and clients don't yet use threads, but it's something we should fix.

I'll see how hard the fix is.

Trac:
Status: needs_review to needs_revision

Trac:
0001-Uncoupled-the-thread-pool-from-the-reply-queue.patch

Trac:
0002-Changed-threads-to-be-joinable-rather-than-detatched.patch

Trac:
0003-Added-the-ability-to-shutdown-the-threadpool.patch

Above I added some patches for how I did this a few months ago. I'm not proposing to use them on tor as I didn't test them thoroughly, but they might be helpful to look at when you or someone else in the future decides to fix this. The first patch isn't necessary for this bug, but makes some changes to the thread pool that might affect the other two patches.

Trac:
Status: needs_revision to needs_review

Trac:
Reviewer: N/A to catalyst

This isn't actually ready for review yet: my branch is incomplete, and opara's patches are not proposed for Tor integration. I'll work on both to see what I can use, and then make a branch for review.

Trac:
Status: needs_review to accepted

Putting this into 044 since it requires actually sending the shutdown signal to the threads, which we haven't tested much yet.

Trac:
Milestone: Tor: 0.4.3.x-final to Tor: 0.4.4.x-final
Keywords: extra-review, 043-should deleted, N/A added

Bulk-remove tickets from 0.4.4. Add the 044-deferred label to them.

Trac:
Milestone: Tor: 0.4.4.x-final to Tor: unspecified
Keywords: N/A deleted, 044-deferred added

added 1h 36m of time spent

moved to tpo/core/tor#32103

Subsystem "thread_cleanup" is never called

Child items ...

Activity