This change should give us enough information to diagnose the hangs that we're seeing in tor's CI.
Now I see tracebacks like this:
$ make src/app/tor && timelimit -p -t 30 -s USR1 -T 10 -S ABRT python "$HOME"/stem/run_tests.py --tor src/app/tor --integ --log notice --target RUN_ALL...timelimit: sending kill signal 6Signal 6 received by thread 140735628112768 in process 23883:Traceback for thread 123145553137664 in process 23883: File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 885, in _bootstrap self._bootstrap_inner() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner self.run() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 865, in run self._target(*self._args, **self._kwargs) File "/Users/base/stem/stem/util/test_tools.py", line 169, in _wrapper runner(*args) if args else runner() File "/Users/base/stem/test/integ/descriptor/server_descriptor.py", line 34, in test_cached_descriptor for desc in stem.descriptor.parse_file(descriptor_file, 'server-descriptor 1.0', validate = True): File "/Users/base/stem/stem/descriptor/__init__.py", line 354, in parse_file for desc in parse(descriptor_file): File "/Users/base/stem/stem/descriptor/__init__.py", line 401, in _parse_metrics_file for desc in stem.descriptor.server_descriptor._parse_file(descriptor_file, is_bridge = False, validate = validate, **kwargs): File "/Users/base/stem/stem/descriptor/server_descriptor.py", line 221, in _parse_file yield RelayDescriptor(descriptor_text, validate, annotations, **kwargs) File "/Users/base/stem/stem/descriptor/server_descriptor.py", line 868, in __init__ if stem.prereq._is_crypto_ed25519_supported() and self.certificate: File "/Users/base/stem/stem/prereq.py", line 252, in _is_crypto_ed25519_supported if not is_crypto_available(): File "/Users/base/stem/stem/prereq.py", line 127, in is_crypto_available from cryptography.utils import int_from_bytes, int_to_bytes File "<frozen importlib._bootstrap>", line 983, in _find_and_load File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed File "<frozen importlib._bootstrap>", line 983, in _find_and_load File "<frozen importlib._bootstrap>", line 963, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 906, in _find_spec File "<frozen importlib._bootstrap_external>", line 1280, in find_spec File "<frozen importlib._bootstrap_external>", line 1252, in _get_spec File "<frozen importlib._bootstrap_external>", line 1364, in find_spec File "<frozen importlib._bootstrap_external>", line 81, in _path_statTraceback for thread 140735628112768 in process 23883: File "/Users/base/stem/run_tests.py", line 468, in <module> main() File "/Users/base/stem/run_tests.py", line 309, in main run_result = _run_test(args, test_class, output_filters) File "/Users/base/stem/run_tests.py", line 449, in _run_test run_result = stem.util.test_tools.TimedTestRunner(test_results, verbosity = 2).run(suite) File "/Users/base/stem/stem/util/test_tools.py", line 308, in run return super(TimedTestRunner, self).run(test) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/runner.py", line 176, in run test(result) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/suite.py", line 84, in __call__ return self.run(*args, **kwds) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/suite.py", line 122, in run test(result) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/case.py", line 663, in __call__ return self.run(*args, **kwds) File "/Users/base/stem/stem/util/test_tools.py", line 258, in run result = super(type(self), self).run(result) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/case.py", line 615, in run testMethod() File "/Users/base/stem/stem/util/test_tools.py", line 152, in <lambda> self.method = lambda test: self.result(test) # method that can be mixed into TestCases File "/Users/base/stem/stem/util/test_tools.py", line 220, in result self._result = self._process_pipe.recv() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 250, in recv buf = self._recv_bytes() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes buf = self._recv(4) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 379, in _recv chunk = read(handle, remaining) File "/Users/base/stem/run_tests.py", line 118, in log_traceback for ident, frame_ in sys._current_frames().items()]Signal 6 received by thread 140735628112768 in process 23920:Traceback for thread 123145308028928 in process 23920: File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 885, in _bootstrap self._bootstrap_inner() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner self.run() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 865, in run self._target(*self._args, **self._kwargs) File "/Users/base/stem/test/integ/process.py", line 413, in short_launch timeout = timeout_arg, File "/Users/base/stem/stem/process.py", line 285, in launch_tor_with_config return launch_tor(tor_cmd, ['-f', '-'], None, completion_percent, init_msg_handler, timeout, take_ownership, close_output, stdin = config_str) File "/Users/base/stem/stem/process.py", line 153, in launch_tor init_line = tor_process.stdout.readline().decode('utf-8', 'replace').strip()Traceback for thread 140735628112768 in process 23920: File "/Users/base/stem/run_tests.py", line 468, in <module> main() File "/Users/base/stem/run_tests.py", line 283, in main test_classes[0].run_tests(async_args) File "/Users/base/stem/test/integ/process.py", line 120, in run_tests async_test.run(TOR_CMD) File "/Users/base/stem/stem/util/test_tools.py", line 201, in run self._process.start() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 112, in start self._popen = self._Popen(self) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/context.py", line 223, in _Popen return _default_context.get_context().Process._Popen(process_obj) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/context.py", line 277, in _Popen return Popen(process_obj) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__ self._launch(process_obj) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/popen_fork.py", line 74, in _launch code = process_obj._bootstrap() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap self.run() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run self._target(*self._args, **self._kwargs) File "/Users/base/stem/stem/util/test_tools.py", line 169, in _wrapper runner(*args) if args else runner() File "/Users/base/stem/test/integ/process.py", line 437, in test_can_run_multithreaded if launch_async_with_timeout(stem.process.DEFAULT_INIT_TIMEOUT) is not None: File "/Users/base/stem/test/integ/process.py", line 420, in launch_async_with_timeout t.join() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1032, in join self._wait_for_tstate_lock() File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock elif lock.acquire(block, timeout): File "/Users/base/stem/run_tests.py", line 118, in log_traceback for ident, frame_ in sys._current_frames().items()]Exit 255
Trac: Actualpoints: N/Ato 0.4 Status: assigned to needs_review
20:56 <+atagar> teor4: Hi teor. Concerning the 'ignore signals while in a signal handler' part of your stem signaling patch could you please provide repro steps for a case in which the disable_signal_handlers() call is helpful? I just tried sending SIGUSR1s both with and without doing that and in practice the dumps seem to have the same stacktraces but I probably simply need to adjust what I'm doing somehow.
20:56 <+atagar> teor4: Hi teor. Concerning the 'ignore signals while in a signal handler' part of your stem signaling patch could you please provide repro steps for a case in which the disable_signal_handlers() call is helpful? I just tried sending SIGUSR1s both with and without doing that and in practice the dumps seem to have the same stacktraces but I probably simply need to adjust what I'm doing somehow.}}}
Signals are delivered asynchronously, so it is possible to have a signal arrive while python is executing a signal handler. Ignoring signals in a signal handler is a standard Unix precaution against these kinds of race conditions.
From the Python documentation:
Although Python signal handlers are called asynchronously as far as the Python user is concerned, they can only occur between the “atomic” instructions of the Python interpreter. This means that signals arriving during long calculations implemented purely in C (such as regular expression matches on large bodies of text) may be delayed for an arbitrary amount of time.
I'm not sure if you can trigger a race condition with the current code. I wrote it carefully, so the outcome of each line (and most expressions) wouldn't be affected by multiple signals. But you can definitely get different stack traces, if the timing is just right.
If you want to simulate a race condition, you can patch the pull request to signal the process itself:
{{{
propagate the signal to any multiprocessing children
pgid = os.getpgid(pid)
for p in multiprocessing.active_children():
avoid race conditions
if p.is_alive():
os.kill(p.pid, sig)
import time
for _ in range(0, 10):
cause a race condition by repeating the signal to this process
os.kill(pid, sig)
now keep the stack around, so it appears in future traces
time.sleep(5)
Or you can just add a time.sleep(10) somewhere in the signal handler, after it prints the stack trace and flushes the buffer.**Trac**: **Status**: needs_information **to** needs_review
Gotcha. Figured this was the case. I asked because I suspect checking the executing thread (rather than de-registering signals) will be a cleaner approach and I was unsure how you exercised that part of the patch.
Gotcha. Figured this was the case. I asked because I suspect checking the executing thread (rather than de-registering signals) will be a cleaner approach and I was unsure how you exercised that part of the patch.
I'll experiment a bit later.
Our priority is fixing Tor's CI, and we need this patch to find out why test-stem is failing.
Can you merge this patch as-is, and fix it later?
Or can you merge this patch without the parts that disable signals, and add them later?
I just noticed that there is still a race condition here, if the process is alive when the list is created, and for the is_alive() check, but exits before it is signalled: