Subsystems implement the interface of struct subsys_fns_t, with one of the optional function pointers being void (*thread_cleanup)(void). This thread_cleanup function is called for all subsystems by the subsystem manager function void subsystems_thread_cleanup(void), but the subsystems_thread_cleanup function is never called anywhere in the code.
At the moment, the only subsystem to implement the thread_cleanup interface is the crypto subsystem, which uses thread_cleanup for freeing the threadlocal crypto_fast_rng_t, as well as freeing the threadlocal error queue on old versions of OpenSSL. As far as I can tell, this is never run.
I think that the subsystems_thread_cleanup function should be run somewhere in the code, but it's not clear to me how this subsystems_thread_cleanup is expected to be used. It seems like there should also be subsystems_thread_init and thread_init functions as well for initializing threadlocal variables. Right now the crypto subsystem does an "initialize on first use" singleton pattern, but it might be useful to add this initialization interface function so that subsystems have the option of initializing all of their threadlocals in one place.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
I made a PR on GitHub with some changes that should fix this problem. Feel free to use them if they're helpful.
This fixes ticket #32103 (moved). It extends the threadpool to allow a customizable thread spawn function. This allows us to use our own spawn function which calls subsystems_thread_cleanup and a new subsystems_thread_init. It also calls 'spawn_exit', which negates the need for 1412. Finally the above two functions are called in the main thread after subsystems_init and before subsystems_shutdown respectively.
Overall, it looks like a good start. There are a few changes that I think should happen before merging. I'm also asking nickm to do some extra review in case there are subtle things related to how we use threads.
The new wrapper in tor_threads.c apparently changes the contract of spawn_func(), which says func should not return, but rather should call spawn_exit(). The documentation of spawn_func() isn't changed accordingly. Maybe the wrapper for the thread-local shutdown should go in spawn_exit(), instead of expecting the thread-start function to return? It seems like this would have been caught by a unit test that tests the new functionality. (I'm not quite sure how difficult such a test would be to write. Please let me know if it would be exceptionally difficult.)
At an architectural level, I would prefer to avoid adding a new general interface that only has one user. It seems to add considerable complexity: new fields and parameters in several places, new files, etc.
Why not call subsystems_thread_init() from spawn_func() and subsystems_thread_cleanup() from spawn_exit()?
Minor: The patches expand tor_run_main() beyond the practracker limit of 100 lines. (This is causing CI to fail.) We can consider refactoring, or add a new exception for that function.
Minor: The loop direction fixup should probably be squashed before merge.
Minor: There should probably be a changes file (see doc/HACKING/CodingStandards.md). We can help with this if you like.
You noticed that the crypto subsystem uses a singleton pattern currently. Your patches don't seem to modify this. Is that intentional?
How will these changes interact with building tor as a library that possibly gets started and stopped, or loaded and unloaded?
Overall, it looks like a good start. There are a few changes that I think should happen before merging. I'm also asking nickm to do some extra review in case there are subtle things related to how we use threads.
Thanks for taking a look at it! I'll try to explain some of my design reasoning below. Let me know what changes you'd like me to make, and I can do that.
The new wrapper in tor_threads.c apparently changes the contract of spawn_func(), which says func should not return, but rather should call spawn_exit(). The documentation of spawn_func() isn't changed accordingly. Maybe the wrapper for the thread-local shutdown should go in spawn_exit(), instead of expecting the thread-start function to return? It seems like this would have been caught by a unit test that tests the new functionality. (I'm not quite sure how difficult such a test would be to write. Please let me know if it would be exceptionally difficult.)
At an architectural level, I would prefer to avoid adding a new general interface that only has one user. It seems to add considerable complexity: new fields and parameters in several places, new files, etc.
Why not call subsystems_thread_init() from spawn_func() and subsystems_thread_cleanup() from spawn_exit()?
My reason for this was to make the start_tor_thread functionality be an app-level abstraction on top of the original tor threading library. Calling the subsystem manager code (such as subsystems_thread_init()) from the threading library (such as within spawn_func()) causes the subsystem module to become a dependency of the threading library. My goal was to not have this library (in the lib/ directory) depend on application-specific code (in the app/ directory). This approach also means that someone using a thread (created with start_tor_thread()) does not need to worry about running spawn_exit() or any additional cleanup-code since it will happen automatically, hopefully reducing the possibility of bugs. For example, the threadpool already forgets to call spawn_exit() (see https://github.com/torproject/tor/pull/1412/), which means there would be a bug.
Additionally, in the future someone may wish to create a new thread which does something very simple, and may not want that thread interacting with the tor subsystem manager. Integrating the subsystem manager into the threading library would make this impossible.
Minor: The patches expand tor_run_main() beyond the practracker limit of 100 lines. (This is causing CI to fail.) We can consider refactoring, or add a new exception for that function.
Yes, I didn't want to refactor tor's main method in this PR since that could probably be a whole new PR on its own, but also didn't want to reduce the readability of this PR to satisfy the CI system. (Aside: practracker seems to include empty lines in the limit, but it would be nice if it ignored them since empty lines can make the code more readable). Let me know what approach you would prefer and I can make that change.
Minor: The loop direction fixup should probably be squashed before merge.
I'm not sure how GitHub handles push --force when people have left comments on commits (I don't want to overwrite teor's comment), so I haven't done this yet. Please let me know if you'd like me to squash now or later.
Minor: There should probably be a changes file (see doc/HACKING/CodingStandards.md). We can help with this if you like.
Okay thanks, I'll add one and then just let me know if you'd like me to change it.
You noticed that the crypto subsystem uses a singleton pattern currently. Your patches don't seem to modify this. Is that intentional?
I didn't modify that in this PR since I didn't want to make too many changes at once (although I have modified it in my own personal changes to tor).
How will these changes interact with building tor as a library that possibly gets started and stopped, or loaded and unloaded?
I'm not sure, I have no experience building tor as a library. Currently threads are only created when tor is running as a relay (in server mode). Does anyone use tor as a library when running a relay? I feel like this would already break things, such as the threadpool which is never freed, and whose work entries are also not freed when shutting down.
Thanks for the review. If you want to call subsystems_thread_cleanup() from spawn_exit(), then I think it's best to close this PR, and make a new patch that does that instead since none of this code will be needed. The problem is that it make linking complicated, since now you need to link the tor code from app/ into the tools in src/tools, or to remove the lib/thread code from these tools. I'm not sure the best way to do this, so I'll let someone else who is more familiar with the tor build system take a look at this.
The 'defect' status and milestone should probably be removed from this ticket. When I made the ticket I didn't realize that threads in tor never exit, so there is no reason to clean up any memory in these threads. It might be best just to remove the thread_cleanup function altogether.
This is nontrivial code, and it invokes code that wasn't actually run before. Therefore I think we shouldn't backport it?, but it is worth doing this patch to avoid memory leaks on embedded stop/start usage.
Trac: Actualpoints: N/Ato .2 Status: assigned to needs_review
(1) The function 'tor_run_thread_cleanup_fn()' is never run. I checked this by adding the following and checking that they never execute (added the call to system since I didn't check if the logging subsystem will still log at this point of the shutdown procedure).
tor_run_thread_cleanup_fn(void){ log_warn(LD_OR, "HERE"); system("touch /tmp/abc"); if (thread_cleanup_fn) thread_cleanup_fn();}
(2) The code comments still say "func should not return, but rather should call spawn_exit", but if you do this, then "tor_run_thread_cleanup_fn()" would be run twice (once in spawn_exit and once in tor_pthread_helper_fn).
I don't think that second part is true -- spawn_exit() causes the current thread to exit without running any more code, so it won't reach the end of tor_pthread_helper_fn.
Ah you're right sorry, it would only be called once, although it is a little confusing. Is there a reason not to remove spawn_exit() entirely at this point and just call pthread_exit(NULL) within the wrapper?
One other thing is that you're writing to a global (thread_cleanup_fn) in the main thread, but reading it from other threads. This should be fine (I'm assuming NUMA won't complicate things), but I think it's worth documenting that you must not call tor_set_thread_cleanup_fn() after you have started any threads.
It looks like the reason that the cleanup functions don't currently run is that the thread pool itself is not shut down anywhere on shutdown, so the threads are not told to exit. This is not yet a problem for embedded users, since they are mostly clients and clients don't yet use threads, but it's something we should fix.
Above I added some patches for how I did this a few months ago. I'm not proposing to use them on tor as I didn't test them thoroughly, but they might be helpful to look at when you or someone else in the future decides to fix this. The first patch isn't necessary for this bug, but makes some changes to the thread pool that might affect the other two patches.
This isn't actually ready for review yet: my branch is incomplete, and opara's patches are not proposed for Tor integration. I'll work on both to see what I can use, and then make a branch for review.