Opened 5 months ago

Last modified 6 weeks ago

#29437 assigned defect

test-stem times out intermittently

Reported by: rl1987 Owned by:
Priority: Medium Milestone: Tor: 0.2.9.x-final
Component: Core Tor/Stem Version: Tor: 0.2.4.8-alpha
Severity: Normal Keywords: tor-ci-fail-sometimes
Cc: Actual Points: 0.4
Parent ID: Points: 0.2
Reviewer: Sponsor: Sponsor31-can

Description

  control.base_controller...                           success (1.23s)
  control.controller...                               
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

https://travis-ci.org/rl1987/tor/jobs/490611608

Child Tickets

TicketStatusOwnerSummaryComponent
#30011closedteorKill test-stem if takes more than 9.5 minutesCore Tor/Tor
#30012closedteorWhen stem receives a signal, log useful informationCore Tor/Stem
#30021closednickmDo not cache cipher list classification if cipher list is not yet available.Core Tor/Tor
#30117closedteorSupport stem's backtrace signals in TravisCore Tor/Tor
#30122closedteorMake stem's unit tests propagate the backtrace signals to child processesCore Tor/Stem
#30234closedteorGet a stacktrace from tor processes launched by stemCore Tor/Tor
#30235needs_informationStem hangs waiting for control port dataCore Tor/Stem
#30243newatagarStem's tests use fixed ports for tor, which may already be in useCore Tor/Stem
#30257newatagarPropagate USR1 and ABRT signals from stem tests through to torCore Tor/Stem
#30591needs_reviewteorMake tor's Travis stem job log controller messages from stem and torCore Tor/Tor
#30695newConsider running tor's stem CI job in --target ONLINE mode, maybe with chutneyCore Tor/Tor
#30744closedteorAllow failures in the Travis test-stem jobCore Tor/Tor
#30901newAdd control port trace logging to torCore Tor/Tor

Change History (30)

comment:1 Changed 5 months ago by atagar

Hi rl1987. I've seen that from time to time too but haven't dug in since it's pretty infrequent. Lack of a consistent repro is going to make this a bit troublesome to solve.

First step if someone would care to dig in is to run our tests with '--verbose --log TRACE' until we get a hang and attach the output for the controller tests.

comment:2 Changed 5 months ago by atagar

Status: newneeds_information

comment:3 Changed 5 months ago by atagar

Component: Core Tor/TorCore Tor/Stem

comment:4 Changed 5 months ago by atagar

Another report of this: #29644

comment:5 Changed 5 months ago by teor

How verbose are the logs?

Our CI has a 10,000 line log limit, and we're already around 5,000.

Normally, we would send the logs to a file, then tail them on error. But we can't do that if the process hangs, because the tail command would never run.

We may be able to trigger the tail using a sleep command, after the process has hung for 50 minutes?
(Our builds normally take 5-10 minutes.)

comment:7 Changed 4 months ago by teor

This also caused a failure in today's 0.2.9 backport rollup after merging to master:
https://travis-ci.org/teor2345/tor/jobs/505980176

comment:8 Changed 4 months ago by teor

This bug caused our maint-0.4.0 cron job to fail today:
https://travis-ci.org/torproject/tor/jobs/507687407

comment:9 Changed 4 months ago by teor

Keywords: tor-ci-fail-sometimes added

comment:10 Changed 4 months ago by teor

Actual Points: 0.2
Points: 0.2

#30011 kills stem when it hangs, so this issue will still cause CI failures, but they won't take an hour to fail.

comment:11 Changed 4 months ago by teor

We can get better diagnostics when #30012 is implemented in Stem.

comment:12 Changed 4 months ago by teor

Owner: set to atagar
Status: needs_informationassigned

Assigning this ticket to atagar, so it doesn't accidentally get assigned to someone else in the network team.

comment:13 Changed 4 months ago by teor

Status: assignedneeds_information

comment:14 Changed 4 months ago by nickm

I find that this (or something similar) happens for me on my desktop: I get long timeouts on control.controller....

On closer investigation, though, it seems to happen every time when I'm running with Python>=3.6, and not when I am running with Python<=3.5. I don't know if this trend will happen for others, though. Please let me know if there's anything I can do to help diagnose what's actually going on with these stalls.

comment:15 Changed 4 months ago by nickm

I've investigated this for a while; here's what I found.

I investigated the stall in test.integ.client.connection, which I see with Python >= 3.6 but not Python <= 3.5. I used a hacked up version of stem and a hacked up version of tor that logged a bunch of extra information. I found that the buggy behavior, from Stem's point of view, is that Tor accepted a VERSIONS cell but never replied.

From Tor's point of view, however, Stem negotiated a v1 link connection with it, and so there is no such thing as a variable-length cell or a VERSIONS cell.

Why does Tor think that stem negotiated a v1 link connection? It decides this in tor_tls_server_info_callback(), when it first calls tor_tls_client_is_using_v2_ciphers(). For some reason, SSL_get_client_ciphers() returns null at this point, meaning "the client has declared no ciphers" and convincing tor that the v1 link protocol is in use. This result of scanning the ciphers is cached through subsequent calls. If I remove the code in Tor that caches this result, then the stem test passes, since later calls to SSL_get_client_ciphers() do in fact return non-NULL.

Why does SSL_get_client_ciphers() return NULL for python3.6, but not python3.5?

When Python 3.5 is in use, the session_secret_cb function is invoked before the tor_tls_server_info_callback() function triggers cipher classification, and at that point, the ciphers are set. When Python 3.6 is in use, the tor_tls_server_info_callback() function is hit twice early on to hit cipher classification: once with the ciphers not set, and then later with the ciphers set.

comment:16 Changed 4 months ago by nickm

I have a fix for this issue that works for me in #30021. I do not completely understand _why_ this issue occurs with Python 3.6's usage of openssl, however.

comment:17 Changed 4 months ago by teor

Status: needs_informationnew

This ticket can close when its child tickets close.

comment:18 Changed 4 months ago by teor

Owner: changed from atagar to teor
Status: newassigned

Fix tags and owner, everything happens in the children now.

comment:19 Changed 4 months ago by atagar

Yikes, this must have been quite a tricky issue. Thank you Nick for you detailed description, it was a fascinating read and congratulations on sorting it out!

comment:20 Changed 4 months ago by teor

Milestone: Tor: 0.2.9.x-final
Resolution: fixed
Status: assignedclosed
Version: Tor: 0.2.4.8-alpha

Fixed in children.

comment:21 Changed 3 months ago by teor

Resolution: fixed
Status: closedreopened

This ticket appears to have multiple causes.

We merged #30021 to 0.2.9 and later, but it's still failing:

  control.controller...                               timelimit: sending warning signal 15
make: *** [test-stem] Terminated
/home/travis/.travis/functions: line 104: 16451 Terminated              timelimit -p -t 540 -T 30 make src/app/tor test-stem
The command "if [[ "$TEST_STEM" != "" ]]; then timelimit -p -t 540 -T 30 make src/app/tor test-stem; fi" exited with 143.

We're seeing this issue with python 3.4 in:
https://travis-ci.org/tlyu/tor/jobs/518187625
https://travis-ci.org/tlyu/tor/jobs/518187612

I made #30012 a child of this ticket again, so we can get a stack trace from stem.

comment:22 Changed 3 months ago by teor

I submitted a pull request on #30012.

comment:23 Changed 3 months ago by teor

It seems like #30021 actually makes stem/tor more unstable with python 2.7:
https://travis-ci.org/torproject/tor/jobs/518181138

That's a good thing, because we should be able to reproduce this issue quickly.

comment:24 Changed 3 months ago by teor

This issue failed 0.4.0, but we did not backport #30117 to 0.4.0.
https://travis-ci.org/torproject/tor/jobs/521087438

comment:25 Changed 3 months ago by teor

Owner: teor deleted
Status: reopenedassigned

Removing myself as owner, because I will be away from the middle of this week, and we need to make progress while I'm away.

comment:26 Changed 2 months ago by catalyst

This Travis failure seems to not be related to #30235?

[34;1m  control.controller...                               [0mtimelimit: sending warning signal 10

================================================================================
Signal SIGUSR1 received by thread MainThread in process 19670
--------------------------------------------------------------------------------
Event notifier thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 991, in _event_loop
    self._event_notice.wait(0.05)
  File "/usr/lib/python3.4/threading.py", line 553, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.4/threading.py", line 294, in wait
    gotit = waiter.acquire(True, timeout)

--------------------------------------------------------------------------------
Tor listener thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 946, in _reader_loop
    control_message = self._socket.recv()
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in recv
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 274, in _recv
    return handler(my_socket, my_socket_file)
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in <lambda>
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 676, in recv_message
    line = control_file.readline()
  File "/usr/lib/python3.4/socket.py", line 374, in readinto
    return self._sock.recv_into(b)

--------------------------------------------------------------------------------
MainThread thread stacktrace

  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 451, in <module>
    main()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 292, in main
    run_result = _run_test(args, test_class, output_filters)
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 432, in _run_test
    run_result = stem.util.test_tools.TimedTestRunner(test_results, verbosity = 2).run(suite)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 308, in run
    return super(TimedTestRunner, self).run(test)
  File "/usr/lib/python3.4/unittest/runner.py", line 168, in run
    test(result)
  File "/usr/lib/python3.4/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.4/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.4/unittest/case.py", line 625, in __call__
    return self.run(*args, **kwds)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 258, in run
    result = super(type(self), self).run(result)
  File "/usr/lib/python3.4/unittest/case.py", line 577, in run
    testMethod()
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/integ/control/controller.py", line 527, in test_without_ephemeral_hidden_services
    with test.runner.get_runner().get_tor_controller() as controller:
  File "/home/travis/build/torproject/tor/stem/test/runner.py", line 482, in get_tor_controller
    controller.authenticate(password = CONTROL_PASSWORD, chroot_path = self.get_chroot())
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 1110, in authenticate
    stem.connection.authenticate(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 530, in authenticate
    protocolinfo_response = get_protocolinfo(controller)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1007, in get_protocolinfo
    protocolinfo_response = _msg(controller, 'PROTOCOLINFO 1')
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1036, in _msg
    return controller.msg(message)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 661, in msg
    response = self._reply_queue.get()
  File "/usr/lib/python3.4/queue.py", line 167, in get
    self.not_empty.wait()
  File "/usr/lib/python3.4/threading.py", line 290, in wait
    waiter.acquire()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 98, in log_traceback
    for thread_name, stacktrace in test.output.thread_stacktraces().items():
  File "/home/travis/build/torproject/tor/stem/test/output.py", line 110, in thread_stacktraces
    stacktraces[thread.name] = ''.join(traceback.format_stack(frame))

================================================================================
timelimit: sending kill signal 6

================================================================================
Signal SIGIOT received by thread MainThread in process 19670
--------------------------------------------------------------------------------
Event notifier thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 991, in _event_loop
    self._event_notice.wait(0.05)
  File "/usr/lib/python3.4/threading.py", line 553, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.4/threading.py", line 294, in wait
    gotit = waiter.acquire(True, timeout)

--------------------------------------------------------------------------------
Tor listener thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 946, in _reader_loop
    control_message = self._socket.recv()
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in recv
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 274, in _recv
    return handler(my_socket, my_socket_file)
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in <lambda>
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 676, in recv_message
    line = control_file.readline()
  File "/usr/lib/python3.4/socket.py", line 374, in readinto
    return self._sock.recv_into(b)

--------------------------------------------------------------------------------
MainThread thread stacktrace

  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 451, in <module>
    main()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 292, in main
    run_result = _run_test(args, test_class, output_filters)
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 432, in _run_test
    run_result = stem.util.test_tools.TimedTestRunner(test_results, verbosity = 2).run(suite)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 308, in run
    return super(TimedTestRunner, self).run(test)
  File "/usr/lib/python3.4/unittest/runner.py", line 168, in run
    test(result)
  File "/usr/lib/python3.4/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.4/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.4/unittest/case.py", line 625, in __call__
    return self.run(*args, **kwds)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 258, in run
    result = super(type(self), self).run(result)
  File "/usr/lib/python3.4/unittest/case.py", line 577, in run
    testMethod()
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/integ/control/controller.py", line 527, in test_without_ephemeral_hidden_services
    with test.runner.get_runner().get_tor_controller() as controller:
  File "/home/travis/build/torproject/tor/stem/test/runner.py", line 482, in get_tor_controller
    controller.authenticate(password = CONTROL_PASSWORD, chroot_path = self.get_chroot())
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 1110, in authenticate
    stem.connection.authenticate(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 530, in authenticate
    protocolinfo_response = get_protocolinfo(controller)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1007, in get_protocolinfo
    protocolinfo_response = _msg(controller, 'PROTOCOLINFO 1')
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1036, in _msg
    return controller.msg(message)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 661, in msg
    response = self._reply_queue.get()
  File "/usr/lib/python3.4/queue.py", line 167, in get
    self.not_empty.wait()
  File "/usr/lib/python3.4/threading.py", line 290, in wait
    waiter.acquire()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 98, in log_traceback
    for thread_name, stacktrace in test.output.thread_stacktraces().items():
  File "/home/travis/build/torproject/tor/stem/test/output.py", line 110, in thread_stacktraces
    stacktraces[thread.name] = ''.join(traceback.format_stack(frame))

================================================================================
travis_time:end:009e4e68:start=1558563743787916536,finish=1558564362060945271,duration=618273028735
[0K[31;1mThe command "if [[ "$TEST_STEM" != "" ]]; then make src/app/tor; timelimit -p -t 540 -s USR1 -T 30 -S ABRT python3 "$STEM_SOURCE_DIR"/run_tests.py --tor src/app/tor --integ --log notice --target RUN_ALL; fi" exited with 255.[0m

comment:27 Changed 2 months ago by atagar

Thanks catalyst, that's interesting. The operative bit that jumps out to me is....

  File "/home/travis/build/torproject/tor/stem/test/integ/control/controller.py", line 527, in test_without_ephemeral_hidden_services
    with test.runner.get_runner().get_tor_controller() as controller:
...
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1007, in get_protocolinfo
    protocolinfo_response = _msg(controller, 'PROTOCOLINFO 1')

This indicates that test_without_ephemeral_hidden_services likely hung when calling PROTOCOLINFO in an attempt to authenticate. Most likely this is *not* an issue during test_without_ephemeral_hidden_services, but rather a prior test got tor into a borked state where it no longer responds in a healthy fashion to PROTOCOLINFO.

Previously I suggested collecting trace runlevel logs from Stem. If we don't have repro steps I can use to generate this locally then that's the next step.

comment:28 Changed 8 weeks ago by gaba

Actual Points: 0.2

comment:29 Changed 8 weeks ago by teor

Actual Points: 0.3

I talked with Gaba and I can use actual points to track my time on incomplete tickets.

comment:30 Changed 8 weeks ago by teor

Actual Points: 0.30.4
Sponsor: Sponsor31-can

Setting as sponsor 31 can, because we use these jobs to make sure our refactoring works.

Note: See TracTickets for help on using tickets.