Opened 3 months ago

Closed 3 months ago

Last modified 8 weeks ago

#30122 closed enhancement (fixed)

Make stem's unit tests propagate the backtrace signals to child processes

Reported by: teor Owned by: teor
Priority: Medium Milestone:
Component: Core Tor/Stem Version:
Severity: Normal Keywords: tor-ci-fail-sometimes
Cc: Actual Points: 0.6
Parent ID: #29437 Points: 0.5
Reviewer: Sponsor: Sponsor31-can

Description

Sometimes, our test backtraces just show us that stem's test is waiting for a response from a child process.

But we can fix that.

Child Tickets

Change History (10)

comment:1 Changed 3 months ago by teor

Actual Points: 0.4
Status: assignedneeds_review

See my pull request:
https://github.com/torproject/stem/pull/12

This change should give us enough information to diagnose the hangs that we're seeing in tor's CI.

Now I see tracebacks like this:

$ make src/app/tor && timelimit -p -t 30 -s USR1 -T 10 -S ABRT python "$HOME"/stem/run_tests.py --tor src/app/tor --integ --log notice --target RUN_ALL
...
timelimit: sending kill signal 6
Signal 6 received by thread 140735628112768 in process 23883:

Traceback for thread 123145553137664 in process 23883:

  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 885, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/base/stem/stem/util/test_tools.py", line 169, in _wrapper
    runner(*args) if args else runner()
  File "/Users/base/stem/test/integ/descriptor/server_descriptor.py", line 34, in test_cached_descriptor
    for desc in stem.descriptor.parse_file(descriptor_file, 'server-descriptor 1.0', validate = True):
  File "/Users/base/stem/stem/descriptor/__init__.py", line 354, in parse_file
    for desc in parse(descriptor_file):
  File "/Users/base/stem/stem/descriptor/__init__.py", line 401, in _parse_metrics_file
    for desc in stem.descriptor.server_descriptor._parse_file(descriptor_file, is_bridge = False, validate = validate, **kwargs):
  File "/Users/base/stem/stem/descriptor/server_descriptor.py", line 221, in _parse_file
    yield RelayDescriptor(descriptor_text, validate, annotations, **kwargs)
  File "/Users/base/stem/stem/descriptor/server_descriptor.py", line 868, in __init__
    if stem.prereq._is_crypto_ed25519_supported() and self.certificate:
  File "/Users/base/stem/stem/prereq.py", line 252, in _is_crypto_ed25519_supported
    if not is_crypto_available():
  File "/Users/base/stem/stem/prereq.py", line 127, in is_crypto_available
    from cryptography.utils import int_from_bytes, int_to_bytes
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 963, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 906, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1280, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1252, in _get_spec
  File "<frozen importlib._bootstrap_external>", line 1364, in find_spec
  File "<frozen importlib._bootstrap_external>", line 81, in _path_stat


Traceback for thread 140735628112768 in process 23883:

  File "/Users/base/stem/run_tests.py", line 468, in <module>
    main()
  File "/Users/base/stem/run_tests.py", line 309, in main
    run_result = _run_test(args, test_class, output_filters)
  File "/Users/base/stem/run_tests.py", line 449, in _run_test
    run_result = stem.util.test_tools.TimedTestRunner(test_results, verbosity = 2).run(suite)
  File "/Users/base/stem/stem/util/test_tools.py", line 308, in run
    return super(TimedTestRunner, self).run(test)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/runner.py", line 176, in run
    test(result)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/case.py", line 663, in __call__
    return self.run(*args, **kwds)
  File "/Users/base/stem/stem/util/test_tools.py", line 258, in run
    result = super(type(self), self).run(result)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/case.py", line 615, in run
    testMethod()
  File "/Users/base/stem/stem/util/test_tools.py", line 152, in <lambda>
    self.method = lambda test: self.result(test)  # method that can be mixed into TestCases
  File "/Users/base/stem/stem/util/test_tools.py", line 220, in result
    self._result = self._process_pipe.recv()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/Users/base/stem/run_tests.py", line 118, in log_traceback
    for ident, frame_ in sys._current_frames().items()]

Signal 6 received by thread 140735628112768 in process 23920:

Traceback for thread 123145308028928 in process 23920:

  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 885, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/base/stem/test/integ/process.py", line 413, in short_launch
    timeout = timeout_arg,
  File "/Users/base/stem/stem/process.py", line 285, in launch_tor_with_config
    return launch_tor(tor_cmd, ['-f', '-'], None, completion_percent, init_msg_handler, timeout, take_ownership, close_output, stdin = config_str)
  File "/Users/base/stem/stem/process.py", line 153, in launch_tor
    init_line = tor_process.stdout.readline().decode('utf-8', 'replace').strip()


Traceback for thread 140735628112768 in process 23920:

  File "/Users/base/stem/run_tests.py", line 468, in <module>
    main()
  File "/Users/base/stem/run_tests.py", line 283, in main
    test_classes[0].run_tests(async_args)
  File "/Users/base/stem/test/integ/process.py", line 120, in run_tests
    async_test.run(TOR_CMD)
  File "/Users/base/stem/stem/util/test_tools.py", line 201, in run
    self._process.start()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/popen_fork.py", line 74, in _launch
    code = process_obj._bootstrap()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/base/stem/stem/util/test_tools.py", line 169, in _wrapper
    runner(*args) if args else runner()
  File "/Users/base/stem/test/integ/process.py", line 437, in test_can_run_multithreaded
    if launch_async_with_timeout(stem.process.DEFAULT_INIT_TIMEOUT) is not None:
  File "/Users/base/stem/test/integ/process.py", line 420, in launch_async_with_timeout
    t.join()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1032, in join
    self._wait_for_tstate_lock()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/Users/base/stem/run_tests.py", line 118, in log_traceback
    for ident, frame_ in sys._current_frames().items()]

Exit 255

comment:2 Changed 3 months ago by atagar

Status: needs_reviewneeds_information

Hi teor. Moving over from irc...

20:56 <+atagar> teor4: Hi teor. Concerning the 'ignore signals while
  in a signal handler' part of your stem signaling patch could you
  please provide repro steps for a case in which the disable_signal_handlers()
  call is helpful? I just tried sending SIGUSR1s both with and without
  doing that and in practice the dumps seem to have the same stacktraces
  but I probably simply need to adjust what I'm doing somehow.

comment:3 in reply to:  2 Changed 3 months ago by teor

Status: needs_informationneeds_review

Replying to atagar:

Hi teor. Moving over from irc...

20:56 <+atagar> teor4: Hi teor. Concerning the 'ignore signals while
  in a signal handler' part of your stem signaling patch could you
  please provide repro steps for a case in which the disable_signal_handlers()
  call is helpful? I just tried sending SIGUSR1s both with and without
  doing that and in practice the dumps seem to have the same stacktraces
  but I probably simply need to adjust what I'm doing somehow.

Signals are delivered asynchronously, so it is possible to have a signal arrive while python is executing a signal handler. Ignoring signals in a signal handler is a standard Unix precaution against these kinds of race conditions.

From the Python documentation:

Although Python signal handlers are called asynchronously as far as the Python user is concerned, they can only occur between the “atomic” instructions of the Python interpreter. This means that signals arriving during long calculations implemented purely in C (such as regular expression matches on large bodies of text) may be delayed for an arbitrary amount of time.

https://docs.python.org/2/library/signal.html#module-signal

For more information, see:

https://en.wikipedia.org/wiki/Unix_signal#Risks

I'm not sure if you can trigger a race condition with the current code. I wrote it carefully, so the outcome of each line (and most expressions) wouldn't be affected by multiple signals. But you can definitely get different stack traces, if the timing is just right.

If you want to simulate a race condition, you can patch the pull request to signal the process itself:

-  # propagate the signal to any multiprocessing children
-  pgid = os.getpgid(pid)
-  for p in multiprocessing.active_children():
-    # avoid race conditions
-    if p.is_alive():
-      os.kill(p.pid, sig)
+  import time
+  for _ in range(0, 10):
+    # cause a race condition by repeating the signal to this process
+    os.kill(pid, sig)
+  # now keep the stack around, so it appears in future traces
+  time.sleep(5)

Or you can just add a time.sleep(10) somewhere in the signal handler, after it prints the stack trace and flushes the buffer.

comment:4 Changed 3 months ago by atagar

Gotcha. Figured this was the case. I asked because I suspect checking the executing thread (rather than de-registering signals) will be a cleaner approach and I was unsure how you exercised that part of the patch.

I'll experiment a bit later.

comment:5 in reply to:  4 Changed 3 months ago by teor

Replying to atagar:

Gotcha. Figured this was the case. I asked because I suspect checking the executing thread (rather than de-registering signals) will be a cleaner approach and I was unsure how you exercised that part of the patch.

I'll experiment a bit later.

Our priority is fixing Tor's CI, and we need this patch to find out why test-stem is failing.

Can you merge this patch as-is, and fix it later?
Or can you merge this patch without the parts that disable signals, and add them later?

comment:6 Changed 3 months ago by atagar

Resolution: fixed
Status: needs_reviewclosed

Hi teor, pushed with some adjustments. Feel free to reopen if you'd care to discuss any of them or make any tweaks.

comment:7 Changed 3 months ago by teor

Resolution: fixed
Status: closedreopened

Thanks for the merge and improvements!

I just noticed that there is still a race condition here, if the process is alive when the list is created, and for the is_alive() check, but exits before it is signalled:

-    # avoid race conditions
     if p.is_alive():

We should just ignore the exception instead:
https://github.com/torproject/stem/pull/13

comment:8 Changed 3 months ago by teor

Actual Points: 0.40.6
Status: reopenedneeds_review

comment:9 Changed 3 months ago by atagar

Resolution: fixed
Status: needs_reviewclosed

Thanks teor! Great catch, fix pushed. Feel free to reopen if you spot anything else. :)

comment:10 Changed 8 weeks ago by teor

Sponsor: Sponsor31-can

Setting as sponsor 31 can, because we use these jobs to make sure our refactoring works.

Note: See TracTickets for help on using tickets.