Opened 21 months ago

Closed 9 months ago

#29437 closed defect (fixed)

test-stem times out intermittently

Reported by: rl1987
Owned by: nickm
Priority: Medium
Milestone: Tor: 0.4.3.x-final
Component: Core Tor/Tor
Version: Tor: 0.2.4.8-alpha
Severity: Normal
Keywords: tor-ci-fail-sometimes stem-wants 043-should
Cc:
Actual Points: 0.4
Parent ID:
Points: 0.2
Reviewer:
Sponsor:

Description

  control.base_controller...                           success (1.23s)
  control.controller...                               
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

https://travis-ci.org/rl1987/tor/jobs/490611608

Child Tickets

Ticket | Status | Owner | Summary | Component
#30011 | closed | teor | Kill test-stem if takes more than 9.5 minutes | Core Tor/Tor
#30012 | closed | teor | When stem receives a signal, log useful information | Archived/Stem
#30021 | closed | nickm | Do not cache cipher list classification if cipher list is not yet available. | Core Tor/Tor
#30117 | closed | teor | Support stem's backtrace signals in Travis | Core Tor/Tor
#30122 | closed | teor | Make stem's unit tests propagate the backtrace signals to child processes | Archived/Stem
#30234 | closed | teor | Get a stacktrace from tor processes launched by stem | Core Tor/Tor
#30235 | closed |  | Stem hangs waiting for control port data | Archived/Stem
#30243 | closed | atagar | Stem's tests use fixed ports for tor, which may already be in use | Archived/Stem
#30257 | closed | atagar | Propagate USR1 and ABRT signals from stem tests through to tor | Archived/Stem
#30591 | closed | teor | Make tor's Travis stem job log controller messages from stem and tor | Core Tor/Tor
#30744 | closed | teor | Allow failures in the Travis test-stem job | Core Tor/Tor

Change History (42)

comment:1 Changed 21 months ago by atagar

Hi rl1987. I've seen that from time to time too but haven't dug in since it's pretty infrequent. Lack of a consistent repro is going to make this a bit troublesome to solve.

The first step, if someone would care to dig in, is to run our tests with '--verbose --log TRACE' until we get a hang, then attach the output for the controller tests.
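
For example, something along these lines from a stem checkout (the --tor path here is illustrative; point it at whatever tor binary you built, and the other flags match what our CI already uses):

  python3 run_tests.py --integ --verbose --log TRACE --tor ../tor/src/app/tor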

comment:2 Changed 21 months ago by atagar

Status: new → needs_information

comment:3 Changed 20 months ago by atagar

Component: Core Tor/Tor → Core Tor/Stem

comment:4 Changed 20 months ago by atagar

Another report of this: #29644

comment:5 Changed 20 months ago by teor

How verbose are the logs?

Our CI has a 10,000 line log limit, and we're already around 5,000.

Normally, we would send the logs to a file, then tail them on error. But we can't do that if the process hangs, because the tail command would never run.

We may be able to trigger the tail using a sleep command, after the process has hung for 50 minutes?
(Our builds normally take 5-10 minutes.)
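
A rough sketch of that idea (the file name, test flags, and the 50-minute figure are illustrative, not our actual Travis config):

  # Run the stem tests with output going to a file instead of the console.
  python3 run_tests.py --integ --log TRACE --tor src/app/tor > stem.log 2>&1 &
  TEST_PID=$!
  # If the tests are still running after 50 minutes, assume a hang and dump the log tail.
  ( sleep 3000; kill -0 "$TEST_PID" 2>/dev/null && tail -n 100 stem.log ) &
  wait "$TEST_PID"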

comment:7 Changed 20 months ago by teor

This also caused a failure in today's 0.2.9 backport rollup after merging to master:
https://travis-ci.org/teor2345/tor/jobs/505980176

comment:8 Changed 19 months ago by teor

This bug caused our maint-0.4.0 cron job to fail today:
https://travis-ci.org/torproject/tor/jobs/507687407

comment:9 Changed 19 months ago by teor

Keywords: tor-ci-fail-sometimes added

comment:10 Changed 19 months ago by teor

Actual Points: 0.2
Points: 0.2

#30011 kills stem when it hangs, so this issue will still cause CI failures, but they won't take an hour to fail.

comment:11 Changed 19 months ago by teor

We can get better diagnostics when #30012 is implemented in Stem.

comment:12 Changed 19 months ago by teor

Owner: set to atagar
Status: needs_information → assigned

Assigning this ticket to atagar, so it doesn't accidentally get assigned to someone else in the network team.

comment:13 Changed 19 months ago by teor

Status: assigned → needs_information

comment:14 Changed 19 months ago by nickm

I find that this (or something similar) happens for me on my desktop: I get long timeouts on control.controller....

On closer investigation, though, it seems to happen every time I'm running with Python >= 3.6, and not when I'm running with Python <= 3.5. I don't know whether this trend will hold for others. Please let me know if there's anything I can do to help diagnose what's actually going on with these stalls.

comment:15 Changed 19 months ago by nickm

I've investigated this for a while; here's what I found.

I investigated the stall in test.integ.client.connection, which I see with Python >= 3.6 but not Python <= 3.5. I used a hacked up version of stem and a hacked up version of tor that logged a bunch of extra information. I found that the buggy behavior, from Stem's point of view, is that Tor accepted a VERSIONS cell but never replied.

From Tor's point of view, however, Stem negotiated a v1 link connection with it, and so there is no such thing as a variable-length cell or a VERSIONS cell.

Why does Tor think that stem negotiated a v1 link connection? It decides this in tor_tls_server_info_callback(), when it first calls tor_tls_client_is_using_v2_ciphers(). For some reason, SSL_get_client_ciphers() returns NULL at this point, meaning "the client has declared no ciphers", which convinces tor that the v1 link protocol is in use. The result of this cipher scan is then cached and reused on subsequent calls. If I remove the code in Tor that caches this result, then the stem test passes, since later calls to SSL_get_client_ciphers() do in fact return non-NULL.

Why does SSL_get_client_ciphers() return NULL for python3.6, but not python3.5?

When Python 3.5 is in use, the session_secret_cb function is invoked before tor_tls_server_info_callback() triggers cipher classification, and at that point the ciphers are set. When Python 3.6 is in use, tor_tls_server_info_callback() is hit twice early on, triggering cipher classification: once with the ciphers not set, and then later with the ciphers set.
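
To illustrate the shape of the problem, here is a simplified sketch (not Tor's actual code; classify_cipher_list() is a hypothetical stand-in for the real check):

  #include <openssl/ssl.h>

  /* Hypothetical helper standing in for tor's real cipher-list check. */
  int classify_cipher_list(STACK_OF(SSL_CIPHER) *ciphers);

  int
  client_is_using_v2_ciphers(SSL *ssl, int *cached_result /* -1 = not yet known */)
  {
    if (*cached_result != -1)
      return *cached_result;            /* reuse the earlier classification */

    STACK_OF(SSL_CIPHER) *ciphers = SSL_get_client_ciphers(ssl);
    if (ciphers == NULL) {
      /* The ClientHello hasn't been parsed yet.  Caching "v1" here is the bug:
       * with Python 3.6 the info callback fires before the ciphers are set, so
       * tor permanently decides the v1 link protocol is in use.  The fix in
       * #30021 is to return without caching in this case. */
      return 0;
    }

    *cached_result = classify_cipher_list(ciphers);  /* safe to cache now */
    return *cached_result;
  }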

comment:16 Changed 19 months ago by nickm

I have a fix for this issue that works for me in #30021. I do not completely understand _why_ this issue occurs with Python 3.6's usage of openssl, however.

comment:17 Changed 19 months ago by teor

Status: needs_information → new

This ticket can close when its child tickets close.

comment:18 Changed 19 months ago by teor

Owner: changed from atagar to teor
Status: new → assigned

Fixing tags and owner; everything happens in the children now.

comment:19 Changed 19 months ago by atagar

Yikes, this must have been quite a tricky issue. Thank you Nick for your detailed description; it was a fascinating read, and congratulations on sorting it out!

comment:20 Changed 19 months ago by teor

Milestone: Tor: 0.2.9.x-final
Resolution: fixed
Status: assigned → closed
Version: Tor: 0.2.4.8-alpha

Fixed in children.

comment:21 Changed 19 months ago by teor

Resolution: fixed deleted
Status: closed → reopened

This ticket appears to have multiple causes.

We merged #30021 to 0.2.9 and later, but it's still failing:

  control.controller...                               timelimit: sending warning signal 15
make: *** [test-stem] Terminated
/home/travis/.travis/functions: line 104: 16451 Terminated              timelimit -p -t 540 -T 30 make src/app/tor test-stem
The command "if [[ "$TEST_STEM" != "" ]]; then timelimit -p -t 540 -T 30 make src/app/tor test-stem; fi" exited with 143.

We're seeing this issue with python 3.4 in:
https://travis-ci.org/tlyu/tor/jobs/518187625
https://travis-ci.org/tlyu/tor/jobs/518187612

I made #30012 a child of this ticket again, so we can get a stack trace from stem.

comment:22 Changed 19 months ago by teor

I submitted a pull request on #30012.

comment:23 Changed 19 months ago by teor

It seems like #30021 actually makes stem/tor more unstable with python 2.7:
https://travis-ci.org/torproject/tor/jobs/518181138

That's a good thing, because we should be able to reproduce this issue quickly.

comment:24 Changed 18 months ago by teor

This issue caused a CI failure on 0.4.0, but we did not backport #30117 to 0.4.0.
https://travis-ci.org/torproject/tor/jobs/521087438

comment:25 Changed 18 months ago by teor

Owner: teor deleted
Status: reopened → assigned

Removing myself as owner, because I will be away from the middle of this week, and we need to make progress while I'm away.

comment:26 Changed 17 months ago by catalyst

This Travis failure seems not to be related to #30235?

  control.controller...                               timelimit: sending warning signal 10

================================================================================
Signal SIGUSR1 received by thread MainThread in process 19670
--------------------------------------------------------------------------------
Event notifier thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 991, in _event_loop
    self._event_notice.wait(0.05)
  File "/usr/lib/python3.4/threading.py", line 553, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.4/threading.py", line 294, in wait
    gotit = waiter.acquire(True, timeout)

--------------------------------------------------------------------------------
Tor listener thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 946, in _reader_loop
    control_message = self._socket.recv()
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in recv
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 274, in _recv
    return handler(my_socket, my_socket_file)
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in <lambda>
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 676, in recv_message
    line = control_file.readline()
  File "/usr/lib/python3.4/socket.py", line 374, in readinto
    return self._sock.recv_into(b)

--------------------------------------------------------------------------------
MainThread thread stacktrace

  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 451, in <module>
    main()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 292, in main
    run_result = _run_test(args, test_class, output_filters)
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 432, in _run_test
    run_result = stem.util.test_tools.TimedTestRunner(test_results, verbosity = 2).run(suite)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 308, in run
    return super(TimedTestRunner, self).run(test)
  File "/usr/lib/python3.4/unittest/runner.py", line 168, in run
    test(result)
  File "/usr/lib/python3.4/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.4/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.4/unittest/case.py", line 625, in __call__
    return self.run(*args, **kwds)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 258, in run
    result = super(type(self), self).run(result)
  File "/usr/lib/python3.4/unittest/case.py", line 577, in run
    testMethod()
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/integ/control/controller.py", line 527, in test_without_ephemeral_hidden_services
    with test.runner.get_runner().get_tor_controller() as controller:
  File "/home/travis/build/torproject/tor/stem/test/runner.py", line 482, in get_tor_controller
    controller.authenticate(password = CONTROL_PASSWORD, chroot_path = self.get_chroot())
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 1110, in authenticate
    stem.connection.authenticate(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 530, in authenticate
    protocolinfo_response = get_protocolinfo(controller)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1007, in get_protocolinfo
    protocolinfo_response = _msg(controller, 'PROTOCOLINFO 1')
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1036, in _msg
    return controller.msg(message)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 661, in msg
    response = self._reply_queue.get()
  File "/usr/lib/python3.4/queue.py", line 167, in get
    self.not_empty.wait()
  File "/usr/lib/python3.4/threading.py", line 290, in wait
    waiter.acquire()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 98, in log_traceback
    for thread_name, stacktrace in test.output.thread_stacktraces().items():
  File "/home/travis/build/torproject/tor/stem/test/output.py", line 110, in thread_stacktraces
    stacktraces[thread.name] = ''.join(traceback.format_stack(frame))

================================================================================
timelimit: sending kill signal 6

================================================================================
Signal SIGIOT received by thread MainThread in process 19670
--------------------------------------------------------------------------------
Event notifier thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 991, in _event_loop
    self._event_notice.wait(0.05)
  File "/usr/lib/python3.4/threading.py", line 553, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.4/threading.py", line 294, in wait
    gotit = waiter.acquire(True, timeout)

--------------------------------------------------------------------------------
Tor listener thread stacktrace

  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 946, in _reader_loop
    control_message = self._socket.recv()
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in recv
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 274, in _recv
    return handler(my_socket, my_socket_file)
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 474, in <lambda>
    return self._recv(lambda s, sf: recv_message(sf))
  File "/home/travis/build/torproject/tor/stem/stem/socket.py", line 676, in recv_message
    line = control_file.readline()
  File "/usr/lib/python3.4/socket.py", line 374, in readinto
    return self._sock.recv_into(b)

--------------------------------------------------------------------------------
MainThread thread stacktrace

  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 451, in <module>
    main()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 292, in main
    run_result = _run_test(args, test_class, output_filters)
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 432, in _run_test
    run_result = stem.util.test_tools.TimedTestRunner(test_results, verbosity = 2).run(suite)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 308, in run
    return super(TimedTestRunner, self).run(test)
  File "/usr/lib/python3.4/unittest/runner.py", line 168, in run
    test(result)
  File "/usr/lib/python3.4/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.4/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.4/unittest/case.py", line 625, in __call__
    return self.run(*args, **kwds)
  File "/home/travis/build/torproject/tor/stem/stem/util/test_tools.py", line 258, in run
    result = super(type(self), self).run(result)
  File "/usr/lib/python3.4/unittest/case.py", line 577, in run
    testMethod()
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/require.py", line 57, in wrapped
    return func(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/test/integ/control/controller.py", line 527, in test_without_ephemeral_hidden_services
    with test.runner.get_runner().get_tor_controller() as controller:
  File "/home/travis/build/torproject/tor/stem/test/runner.py", line 482, in get_tor_controller
    controller.authenticate(password = CONTROL_PASSWORD, chroot_path = self.get_chroot())
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 1110, in authenticate
    stem.connection.authenticate(self, *args, **kwargs)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 530, in authenticate
    protocolinfo_response = get_protocolinfo(controller)
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1007, in get_protocolinfo
    protocolinfo_response = _msg(controller, 'PROTOCOLINFO 1')
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1036, in _msg
    return controller.msg(message)
  File "/home/travis/build/torproject/tor/stem/stem/control.py", line 661, in msg
    response = self._reply_queue.get()
  File "/usr/lib/python3.4/queue.py", line 167, in get
    self.not_empty.wait()
  File "/usr/lib/python3.4/threading.py", line 290, in wait
    waiter.acquire()
  File "/home/travis/build/torproject/tor/stem/run_tests.py", line 98, in log_traceback
    for thread_name, stacktrace in test.output.thread_stacktraces().items():
  File "/home/travis/build/torproject/tor/stem/test/output.py", line 110, in thread_stacktraces
    stacktraces[thread.name] = ''.join(traceback.format_stack(frame))

================================================================================
The command "if [[ "$TEST_STEM" != "" ]]; then make src/app/tor; timelimit -p -t 540 -s USR1 -T 30 -S ABRT python3 "$STEM_SOURCE_DIR"/run_tests.py --tor src/app/tor --integ --log notice --target RUN_ALL; fi" exited with 255.

comment:27 Changed 17 months ago by atagar

Thanks catalyst, that's interesting. The operative bit that jumps out to me is....

  File "/home/travis/build/torproject/tor/stem/test/integ/control/controller.py", line 527, in test_without_ephemeral_hidden_services
    with test.runner.get_runner().get_tor_controller() as controller:
...
  File "/home/travis/build/torproject/tor/stem/stem/connection.py", line 1007, in get_protocolinfo
    protocolinfo_response = _msg(controller, 'PROTOCOLINFO 1')

This indicates that test_without_ephemeral_hidden_services likely hung when calling PROTOCOLINFO in an attempt to authenticate. Most likely this is *not* an issue during test_without_ephemeral_hidden_services, but rather a prior test got tor into a borked state where it no longer responds in a healthy fashion to PROTOCOLINFO.

Previously I suggested collecting trace runlevel logs from Stem. If we don't have repro steps I can use to reproduce this locally, then that's the next step.

comment:28 Changed 17 months ago by gaba

Actual Points: 0.2

comment:29 Changed 17 months ago by teor

Actual Points: 0.3

I talked with Gaba and I can use actual points to track my time on incomplete tickets.

comment:30 Changed 17 months ago by teor

Actual Points: 0.3 → 0.4
Sponsor: Sponsor31-can

Setting as sponsor 31 can, because we use these jobs to make sure our refactoring works.

comment:31 Changed 11 months ago by teor

Milestone: Tor: 0.2.9.x-final → Tor: 0.4.3.x-final

comment:32 Changed 10 months ago by atagar

Component: Core Tor/Stem → Core Tor/Tor

All stem subtickets have been resolved or moved to GitHub. I'm discontinuing the stem Trac component, so I'm sending this to tor's. I have no need for this ticket, so I don't mind if it's closed.

Feel free to file requests within github if you need anything from me.

comment:33 Changed 9 months ago by nickm

Keywords: stem-wants 043-should added

comment:34 Changed 9 months ago by nickm

Owner: set to nickm

My plan here is to look at the travis logs for a couple of weeks on master and track down any lingering issues.

comment:35 Changed 9 months ago by nickm

(Please let me know if you see a test-stem timeout in the wild on current versions?)

comment:36 Changed 9 months ago by teor

Stem could be hanging due to a tor crash or stderr output; for a similar issue, see:
https://trac.torproject.org/projects/tor/ticket/33006#comment:10

comment:37 in reply to:  35 Changed 9 months ago by teor

Replying to nickm:

  (Please let me know if you see a test-stem timeout in the wild on current versions?)

It seems like all the stem jobs are passing, from maint-0.3.5 through to master:

(Scroll down to the bottom to see the stem job.)

I spot-checked the 0.3.5, 0.4.2, and master job outputs, to make sure that we aren't ignoring any test failures.

So I think we can close this issue, if we don't see any timeouts for a week or two. But before we do that, we should remove make test-stem from the allow_failure block in Travis. I'll open a child ticket.

comment:38 Changed 9 months ago by teor

Re-enabling failures on the job is a simple change; see #33075.

If we don't see any failures, we should also be able to close this ticket, and close or un-parent a bunch of tickets under it.

comment:39 Changed 9 months ago by nickm

#33075 is now merged to master, so we can see whether there are lingering problems of this kind.

comment:40 Changed 9 months ago by gaba

Sponsor: Sponsor31-can deleted

No more sponsor 31. All these tickets remained open after sponsor 31 ended.

comment:41 Changed 9 months ago by nickm

Status: assigned → needs_information

comment:42 Changed 9 months ago by nickm

Resolution: fixed
Status: needs_information → closed

We've been running with this change on master for over a week, with no apparent stem timeouts. Closing this ticket.

(Dear future developers: if you are seeing a stem timeout, please consider how long it has been since the last time we saw one. If it has been a super long time, maybe you are seeing a new bug in Tor or stem rather than this bug once again, and you should open a new ticket rather than reopening this one?)
