Opened 21 months ago

Closed 9 months ago

#29437 closed defect (fixed)

test-stem times out intermittently

Reported by: rl1987
Owned by: nickm
Priority: Medium
Milestone: Tor: 0.4.3.x-final
Component: Core Tor/Tor
Version: Tor: 0.2.4.8-alpha
Severity: Normal
Keywords: tor-ci-fail-sometimes stem-wants 043-should
Cc:
Actual Points: 0.4
Parent ID:
Points: 0.2
Reviewer:
Sponsor:

Description

  control.base_controller...                           success (1.23s)
  control.controller...                               
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

https://travis-ci.org/rl1987/tor/jobs/490611608

Child Tickets

Ticket | Status | Owner | Summary | Component
#30011 | closed | teor | Kill test-stem if takes more than 9.5 minutes | Core Tor/Tor
#30012 | closed | teor | When stem receives a signal, log useful information | Archived/Stem
#30021 | closed | nickm | Do not cache cipher list classification if cipher list is not yet available. | Core Tor/Tor
#30117 | closed | teor | Support stem's backtrace signals in Travis | Core Tor/Tor
#30122 | closed | teor | Make stem's unit tests propagate the backtrace signals to child processes | Archived/Stem
#30234 | closed | teor | Get a stacktrace from tor processes launched by stem | Core Tor/Tor
#30235 | closed |  | Stem hangs waiting for control port data | Archived/Stem
#30243 | closed | atagar | Stem's tests use fixed ports for tor, which may already be in use | Archived/Stem
#30257 | closed | atagar | Propagate USR1 and ABRT signals from stem tests through to tor | Archived/Stem
#30591 | closed | teor | Make tor's Travis stem job log controller messages from stem and tor | Core Tor/Tor
#30744 | closed | teor | Allow failures in the Travis test-stem job | Core Tor/Tor

Change History (42)

comment:1 Changed 21 months ago by atagar

Hi rl1987. I've seen that from time to time too but haven't dug in since it's pretty infrequent. Lack of a consistent repro is going to make this a bit troublesome to solve.

The first step, if someone would care to dig in, is to run our tests with '--verbose --log TRACE' until we get a hang, then attach the output for the controller tests.
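
For example, something along these lines from a stem checkout (the --tor path here is illustrative; point it at whatever tor binary you built, and the other flags match what our CI already uses):

  python3 run_tests.py --integ --verbose --log TRACE --tor ../tor/src/app/tor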

comment:2 Changed 21 months ago by atagar

Status: new → needs_information

comment:3 Changed 20 months ago by atagar

Component: Core Tor/Tor → Core Tor/Stem

comment:4 Changed 20 months ago by atagar

Another report of this: #29644

comment:5 Changed 20 months ago by teor

How verbose are the logs?

Our CI has a 10,000 line log limit, and we're already around 5,000.

Normally, we would send the logs to a file, then tail them on error. But we can't do that if the process hangs, because the tail command would never run.

We may be able to trigger the tail using a sleep command, after the process has hung for 50 minutes?
(Our builds normally take 5-10 minutes.)
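
A rough sketch of that idea (the file name, test flags, and the 50-minute figure are illustrative, not our actual Travis config):

  # Run the stem tests with output going to a file instead of the console.
  python3 run_tests.py --integ --log TRACE --tor src/app/tor > stem.log 2>&1 &
  TEST_PID=$!
  # If the tests are still running after 50 minutes, assume a hang and dump the log tail.
  ( sleep 3000; kill -0 "$TEST_PID" 2>/dev/null && tail -n 100 stem.log ) &
  wait "$TEST_PID"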

comment:7 Changed 20 months ago by teor

This also caused a failure in today's 0.2.9 backport rollup after merging to master:
https://travis-ci.org/teor2345/tor/jobs/505980176

comment:8 Changed 19 months ago by teor

This bug caused our maint-0.4.0 cron job to fail today:
https://travis-ci.org/torproject/tor/jobs/507687407

comment:9 Changed 19 months ago by teor

Keywords: tor-ci-fail-sometimes added

comment:10 Changed 19 months ago by teor

Actual Points: 0.2
Points: 0.2

#30011 kills stem when it hangs, so this issue will still cause CI failures, but they won't take an hour to fail.

comment:11 Changed 19 months ago by teor

We can get better diagnostics when #30012 is implemented in Stem.

comment:12 Changed 19 months ago by teor

Owner: set to atagar
Status: needs_information → assigned

Assigning this ticket to atagar, so it doesn't accidentally get assigned to someone else in the network team.

comment:13 Changed 19 months ago by teor

Status: assigned → needs_information

comment:14 Changed 19 months ago by nickm

I find that this (or something similar) happens for me on my desktop: I get long timeouts on control.controller....

On closer investigation, though, it seems to happen every time I'm running with Python >= 3.6, and not when I'm running with Python <= 3.5. I don't know whether this trend will hold for others. Please let me know if there's anything I can do to help diagnose what's actually going on with these stalls.

comment:15 Changed 19 months ago by nickm

I've investigated this for a while; here's what I found.

I investigated the stall in test.integ.client.connection, which I see with Python >= 3.6 but not Python <= 3.5. I used a hacked up version of stem and a hacked up version of tor that logged a bunch of extra information. I found that the buggy behavior, from Stem's point of view, is that Tor accepted a VERSIONS cell but never replied.

From Tor's point of view, however, Stem negotiated a v1 link connection with it, and so there is no such thing as a variable-length cell or a VERSIONS cell.

Why does Tor think that stem negotiated a v1 link connection? It decides this in tor_tls_server_info_callback(), when it first calls tor_tls_client_is_using_v2_ciphers(). For some reason, SSL_get_client_ciphers() returns NULL at this point, meaning "the client has declared no ciphers", which convinces tor that the v1 link protocol is in use. The result of this cipher scan is then cached and reused on subsequent calls. If I remove the code in Tor that caches this result, then the stem test passes, since later calls to SSL_get_client_ciphers() do in fact return non-NULL.

Why does SSL_get_client_ciphers() return NULL for python3.6, but not python3.5?

When Python 3.5 is in use, the session_secret_cb function is invoked before tor_tls_server_info_callback() triggers cipher classification, and at that point the ciphers are set. When Python 3.6 is in use, tor_tls_server_info_callback() is hit twice early on, triggering cipher classification: once with the ciphers not set, and then later with the ciphers set.
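
To illustrate the shape of the problem, here is a simplified sketch (not Tor's actual code; classify_cipher_list() is a hypothetical stand-in for the real check):

  #include <openssl/ssl.h>

  /* Hypothetical helper standing in for tor's real cipher-list check. */
  int classify_cipher_list(STACK_OF(SSL_CIPHER) *ciphers);

  int
  client_is_using_v2_ciphers(SSL *ssl, int *cached_result /* -1 = not yet known */)
  {
    if (*cached_result != -1)
      return *cached_result;            /* reuse the earlier classification */

    STACK_OF(SSL_CIPHER) *ciphers = SSL_get_client_ciphers(ssl);
    if (ciphers == NULL) {
      /* The ClientHello hasn't been parsed yet.  Caching "v1" here is the bug:
       * with Python 3.6 the info callback fires before the ciphers are set, so
       * tor permanently decides the v1 link protocol is in use.  The fix in
       * #30021 is to return without caching in this case. */
      return 0;
    }

    *cached_result = classify_cipher_list(ciphers);  /* safe to cache now */
    return *cached_result;
  }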

comment:16 Changed 19 months ago by nickm

I have a fix for this issue that works for me in #30021. I do not completely understand _why_ this issue occurs with Python 3.6's usage of openssl, however.

comment:17 Changed 19 months ago by teor

Status: needs_information → new

This ticket can close when its child tickets close.

comment:18 Changed 19 months ago by teor

Owner: changed from atagar to teor
Status: new → assigned

Fixing tags and owner; everything happens in the children now.

comment:19 Changed 19 months ago by atagar

Yikes, this must have been quite a tricky issue. Thank you Nick for your detailed description; it was a fascinating read, and congratulations on sorting it out!

comment:20 Changed 19 months ago by teor

Milestone: Tor: 0.2.9.x-final
Resolution: fixed
Status: assigned → closed
Version: Tor: 0.2.4.8-alpha

Fixed in children.

comment:21 Changed 19 months ago by teor

Resolution: fixed deleted
Status: closed → reopened

This ticket appears to have multiple causes.

We merged #30021 to 0.2.9 and later, but it's still failing:

  control.controller...                               timelimit: sending warning signal 15
make: *** [test-stem] Terminated
/home/travis/.travis/functions: line 104: 16451 Terminated              timelimit -p -t 540 -T 30 make src/app/tor test-stem
The command "if [[ "$TEST_STEM" != "" ]]; then timelimit -p -t 540 -T 30 make src/app/tor test-stem; fi" exited with 143.

We're seeing this issue with python 3.4 in:
https://travis-ci.org/tlyu/tor/jobs/518187625
https://travis-ci.org/tlyu/tor/jobs/518187612

I made #30012 a child of this ticket again, so we can get a stack trace from stem.

comment:22 Changed 19 months ago by teor

I submitted a pull request on #30012.

comment:23 Changed 19 months ago by teor

It seems like #30021 actually makes stem/tor more unstable with python 2.7:
https://travis-ci.org/torproject/tor/jobs/518181138

That's a good thing, because we should be able to reproduce this issue quickly.

comment:24 Changed 18 months ago by teor

This issue caused a CI failure on 0.4.0, but we did not backport #30117 to 0.4.0.
https://travis-ci.org/torproject/tor/jobs/521087438

comment:25 Changed 18 months ago by teor

Owner: teor deleted
Status: reopened → assigned

Removing myself as owner, because I will be away from the middle of this week, and we need to make progress while I'm away.

comment:26 Changed 17 months ago by catalyst

This Travis failure seems not to be related to #30235?

  control.controller...                               timelimit: sending warning signal 10

================================================================================
Signal SIGUSR1 received by thread MainThread in process 19670
--------------------------------------------------------------------------------
Event notifier thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 991, in _event_loop
    self._event_notice.wait(0.05)
  File "/usr/lib/python3.4/threading.py", line 553, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.4/threading.py", line 294, in wait
    gotit = waiter.acquire(True, timeout)

--------------------------------------------------------------------------------
Tor listener thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 946, in _reader_loop
    control_message = self._socket.recv()
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in recv
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 274, in _recv
    return handler(my_socket, my_socket_file)
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in <lambda>
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 676, in recv_message
    line = control_file.readline()
  File "/usr/lib/python3.4/socket.py", line 374, in readinto
    return self._sock.recv_into(b)

--------------------------------------------------------------------------------
MainThread thread stacktrace

  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 451, in <module>
    main()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 292, in main
    run_result = _run_test(args, test_class, output_filters)
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 432, in _run_test
    run_result = stem.util.test_tools.TimedTestRunner(test_results, verbosity = 2).run(suite)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 308, in run
    return super(TimedTestRunner, self).run(test)
  File "/usr/lib/python3.4/unittest/runner.py", line 168, in run
    test(result)
  File "/usr/lib/python3.4/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.4/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.4/unittest/case.py", line 625, in __call__
    return self.run(*args, **kwds)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 258, in run
    result = super(type(self), self).run(result)
  File "/usr/lib/python3.4/unittest/case.py", line 577, in run
    testMethod()
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/integ/control/controller.py", line 527, in test_without_ephemeral_hidden_services
    with test.runner.get_runner().get_tor_controller() as controller:
  File "/home/travis/build/torproject/tor/stem/test/runner.py", line 482, in get_tor_controller
    controller.authenticate(password = CONTROL_PASSWORD, chroot_path = self.get_chroot())
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 1110, in authenticate
    stem.connection.authenticate(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 530, in authenticate
    protocolinfo_response = get_protocolinfo(controller)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1007, in get_protocolinfo
    protocolinfo_response = _msg(controller, 'PROTOCOLINFO 1')
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1036, in _msg
    return controller.msg(message)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 661, in msg
    response = self._reply_queue.get()
  File "/usr/lib/python3.4/queue.py", line 167, in get
    self.not_empty.wait()
  File "/usr/lib/python3.4/threading.py", line 290, in wait
    waiter.acquire()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 98, in log_traceback
    for thread_name, stacktrace in test.output.thread_stacktraces().items():
  File "/home/travis/build/torproject/tor/stem/test/output.py", line 110, in thread_stacktraces
    stacktraces[thread.name] = ''.join(traceback.format_stack(frame))

================================================================================
timelimit: sending kill signal 6

================================================================================
Signal SIGIOT received by thread MainThread in process 19670
--------------------------------------------------------------------------------
Event notifier thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 991, in _event_loop
    self._event_notice.wait(0.05)
  File "/usr/lib/python3.4/threading.py", line 553, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.4/threading.py", line 294, in wait
    gotit = waiter.acquire(True, timeout)

--------------------------------------------------------------------------------
Tor listener thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 946, in _reader_loop
    control_message = self._socket.recv()
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in recv
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 274, in _recv
    return handler(my_socket, my_socket_file)
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in <lambda>
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 676, in recv_message
    line = control_file.readline()
  File "/usr/lib/python3.4/socket.py", line 374, in readinto
    return self._sock.recv_into(b)

--------------------------------------------------------------------------------
MainThread thread stacktrace

  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 451, in <module>
    main()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 292, in main
    run_result = _run_test(args, test_class, output_filters)
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 432, in _run_test
    run_result = stem.util.test_tools.TimedTestRunner(test_results, verbosity = 2).run(suite)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 308, in run
    return super(TimedTestRunner, self).run(test)
  File "/usr/lib/python3.4/unittest/runner.py", line 168, in run
    test(result)
  File "/usr/lib/python3.4/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.4/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.4/unittest/case.py", line 625, in __call__
    return self.run(*args, **kwds)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 258, in run
    result = super(type(self), self).run(result)
  File "/usr/lib/python3.4/unittest/case.py", line 577, in run
    testMethod()
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/integ/control/controller.py", line 527, in test_without_ephemeral_hidden_services
    with test.runner.get_runner().get_tor_controller() as controller:
  File "/home/travis/build/torproject/tor/stem/test/runner.py", line 482, in get_tor_controller
    controller.authenticate(password = CONTROL_PASSWORD, chroot_path = self.get_chroot())
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 1110, in authenticate
    stem.connection.authenticate(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 530, in authenticate
    protocolinfo_response = get_protocolinfo(controller)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1007, in get_protocolinfo
    protocolinfo_response = _msg(controller, 'PROTOCOLINFO 1')
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1036, in _msg
    return controller.msg(message)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 661, in msg
    response = self._reply_queue.get()
  File "/usr/lib/python3.4/queue.py", line 167, in get
    self.not_empty.wait()
  File "/usr/lib/python3.4/threading.py", line 290, in wait
    waiter.acquire()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 98, in log_traceback
    for thread_name, stacktrace in test.output.thread_stacktraces().items():
  File "/home/travis/build/torproject/tor/stem/test/output.py", line 110, in thread_stacktraces
    stacktraces[thread.name] = ''.join(traceback.format_stack(frame))

================================================================================
The command "if [[ "$TEST_STEM" != "" ]]; then make src/app/tor; timelimit -p -t 540 -s USR1 -T 30 -S ABRT python3 "$STEM_SOURCE_DIR"/run_tests.py --tor src/app/tor --integ --log notice --target RUN_ALL; fi" exited with 255.

comment:27 Changed 17 months ago by atagar

Thanks catalyst, that's interesting. The operative bit that jumps out to me is....

  File "/home/travis/build/torproject/tor/stem/test/integ/control/controller.py", line 527, in test_without_ephemeral_hidden_services
    with test.runner.get_runner().get_tor_controller() as controller:
...
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1007, in get_protocolinfo
    protocolinfo_response = _msg(controller, 'PROTOCOLINFO 1')

This indicates that test_without_ephemeral_hidden_services likely hung when calling PROTOCOLINFO in an attempt to authenticate. Most likely this is *not* an issue during test_without_ephemeral_hidden_services, but rather a prior test got tor into a borked state where it no longer responds in a healthy fashion to PROTOCOLINFO.

Previously I suggested collecting trace runlevel logs from Stem. If we don't have repro steps I can use to reproduce this locally, then that's the next step.

comment:28 Changed 17 months ago by gaba

Actual Points: 0.2

comment:29 Changed 17 months ago by teor

Actual Points: 0.3

I talked with Gaba and I can use actual points to track my time on incomplete tickets.

comment:30 Changed 17 months ago by teor

Actual Points: 0.3 → 0.4
Sponsor: Sponsor31-can

Setting as sponsor 31 can, because we use these jobs to make sure our refactoring works.

comment:31 Changed 11 months ago by teor

Milestone: Tor: 0.2.9.x-final → Tor: 0.4.3.x-final

comment:32 Changed 10 months ago by atagar

Component: Core Tor/Stem → Core Tor/Tor

All stem subtickets have been resolved or moved to GitHub. I'm discontinuing the stem Trac component, so I'm sending this to tor's. I have no need for this ticket, so I don't mind if it's closed.

Feel free to file requests within github if you need anything from me.

comment:33 Changed 9 months ago by nickm

Keywords: stem-wants 043-should added

comment:34 Changed 9 months ago by nickm

Owner: set to nickm

My plan here is to look at the travis logs for a couple of weeks on master and track down any lingering issues.

comment:35 Changed 9 months ago by nickm

(Please let me know if you see a test-stem timeout in the wild on current versions?)

comment:36 Changed 9 months ago by teor

Stem could be hanging due to a tor crash or stderr output; for a similar issue, see:
https://trac.torproject.org/projects/tor/ticket/33006#comment:10

comment:37 in reply to:  35 Changed 9 months ago by teor

Replying to nickm:

  (Please let me know if you see a test-stem timeout in the wild on current versions?)

It seems like all the stem jobs are passing, from maint-0.3.5 through to master:

(Scroll down to the bottom to see the stem job.)

I spot-checked the 0.3.5, 0.4.2, and master job outputs, to make sure that we aren't ignoring any test failures.

So I think we can close this issue, if we don't see any timeouts for a week or two. But before we do that, we should remove make test-stem from the allow_failure block in Travis. I'll open a child ticket.

comment:38 Changed 9 months ago by teor

Re-enabling failures on the job is a simple change; see #33075.

If we don't see any failures, we should also be able to close this ticket, and close or un-parent a bunch of tickets under it.

comment:39 Changed 9 months ago by nickm

#33075 is now merged to master, so we can see whether there are lingering problems of this kind.

comment:40 Changed 9 months ago by gaba

Sponsor: Sponsor31-can deleted

No more sponsor 31. All these tickets remained open after sponsor 31 ended.

comment:41 Changed 9 months ago by nickm

Status: assigned → needs_information

comment:42 Changed 9 months ago by nickm

Resolution: fixed
Status: needs_information → closed

We've been running with this change on master for over a week, with no apparent stem timeouts. Closing this ticket.

(Dear future developers: if you are seeing a stem timeout, please consider how long it has been since the last time we saw one. If it has been a super long time, maybe you are seeing a new bug in Tor or stem rather than this bug once again, and you should open a new ticket rather than reopening this one?)
