Opened 5 years ago

Closed 5 years ago

#5917 closed defect (duplicate)

integ.control.base_controller hangs

Reported by: gsathya Owned by: atagar
Priority: High Milestone:
Component: Core Tor/Stem Version:
Severity: Keywords:
Cc: gsathya Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

integ.control.base_controller hangs up when running the integ tests

Child Tickets

Change History (15)

comment:1 Changed 5 years ago by gsathya

to be more clear, integ.control.base_controller stops and doesn't proceed (as arma pointed out "hangs up" means "disconnects" which isn't the case here)

comment:2 follow-up: Changed 5 years ago by atagar

Is this with the current master? Are you able to reproduce the issue?

comment:3 in reply to: ↑ 2 Changed 5 years ago by gsathya

Replying to atagar:

Is this with the current master?

Yes

Are you able to reproduce the issue?

Yes

comment:4 follow-up: Changed 5 years ago by atagar

Is this with the current master?

Yes

Damn

Are you able to reproduce the issue?

Yes

Great, though I'm having difficulty doing the same. When running the integ tests against tor commit ef0bc7f (the one that you're using in the other ticket) I get an error from the base_controller integ, but it doesn't hang. I've fixed the error - mind seeing if it runs for you now?

https://gitweb.torproject.org/stem.git/commitdiff/aed91aa634b9f960e87296b9e631a2c6aa05f2c0

If it does then we should still figure out what about that test caused stem to hang on your platform.

comment:5 in reply to: ↑ 4 Changed 5 years ago by gsathya

Replying to atagar:

I've fixed the error - mind seeing if it runs for you now?

https://gitweb.torproject.org/stem.git/commitdiff/aed91aa634b9f960e87296b9e631a2c6aa05f2c0

If it does then we should still figure out what about that test caused stem to hang on your platform.

It still hangs on the latest master :( Should I try running against a different Tor?

comment:6 follow-up: Changed 5 years ago by atagar

It still hangs on the latest master :( Should I try running against a different Tor?

Nope. We can now either try to figure out how our systems differ or try to do debugging on yours. Personally I favor the later - first step of that is to narrow this down to a single test. Add "self.skipTest("blah")" to all of the base_controller integ tests, and then try removing them until you've figured out which test is choking. This'll narrow the haystack quite a bit.

comment:7 in reply to: ↑ 6 Changed 5 years ago by gsathya

Replying to atagar:r?

Add "self.skipTest("blah")" to all of the base_controller integ tests, and then try removing them until you've figured out which test is choking. This'll narrow the haystack quite a bit.

test_msg, test_msg_invalid, test_msg_invalid_getinfo, test_msg_repeatedly, test_status_notifications are the ones that choke.

comment:8 Changed 5 years ago by gsathya

Also, sometimes SIGINT doesn't seem to work when the base_controller hangs, I need to SIGKILL it (This happens only intermittently though)

comment:9 follow-up: Changed 5 years ago by atagar

test_msg, test_msg_invalid, test_msg_invalid_getinfo, test_msg_repeatedly, test_status_notifications are the ones that choke.

Yikes. Ok, lets just focus on the simplest of those tests (test_msg). That does three things...

  • constructs a BaseController around a control socket
  • sends it 'GETINFO config-file', checking its result
  • closes the controller

Skip all tests but test_msg and try adding print statements to figure out which of those points its hanging on. Ideally try to narrow it down to the actual like that it's getting stuck on. My guess is that there's some call (like send()) that I didn't think would be blocking but is for some reason on your system.

comment:10 in reply to: ↑ 9 Changed 5 years ago by gsathya

Replying to atagar:

Running it through pdb,

> /Users/sathya/Documents/stem/test/integ/control/base_controller.py(151)test_asynchronous_event_handling()
-> class ControlledListener(stem.control.BaseController):
(Pdb) c
> /Users/sathya/Documents/stem/test/integ/control/base_controller.py(70)test_connect_repeatedly()
-> with test.runner.get_runner().get_tor_socket() as control_socket:
(Pdb) c
> /Users/sathya/Documents/stem/test/integ/control/base_controller.py(45)test_from_port()
-> if test.runner.Torrc.PORT in test.runner.get_runner().get_options():
(Pdb) c
> /Users/sathya/Documents/stem/test/integ/control/base_controller.py(57)test_from_socket_file()
-> if test.runner.Torrc.SOCKET in test.runner.get_runner().get_options():
(Pdb) c
> /Users/sathya/Documents/stem/test/integ/control/base_controller.py(83)test_msg()
-> with test.runner.get_runner().get_tor_socket() as control_socket:
(Pdb) n

test_asynchronous_event_handling(), test_connect_repeatedly(), test_from_port(), test_from_socket_file() get executed successfully. The test_msg hangs(). It gets stuck at https://gitweb.torproject.org/stem.git/blob/HEAD:/test/integ/control/base_controller.py#l78

comment:11 Changed 5 years ago by atagar

That's interesting. I assume that this is the default target, which means that stem is trying to establish a socket connection to tor (which it has already done dozens of times before) but somehow getting stuck. Could you please dig into get_tor_socket() to figure out if it's getting stuck on the connection or authentication? It would be helpful to figure out the last line in the stem codebase where we get stuck. One possability is that we're sending a PROTOCOLINFO and never receiving a response from tor, which would be... bad.

It's weird that the bug only manifests when multiple tests are run. We have a with clause wrapping all of the ControlSockets here, so there shouldn't be bleeding between the tests... I'm puzzled.

comment:12 Changed 5 years ago by atagar

I've reproed this on a mac and spent the last four hours unsuccessfully trying to figure it out. Here's what I know so far...

  • The issue always occurs in test_status_notifications(), when we're sending garbage data to the socket prior to calling AUTHENTICATE (which causes tor to terminate the socket connection). We're hanging up on the readline() call when trying to receive the response, which should *not* be blocking (it should either read content or error).
  • This issue only manifests if we also run a couple other base_controller integ tests that put a heavy load on the socket prior to the problem. From what I can tell nothing should be bleeding between the tests, so my guess is that the prior activity must be putting the socket into some problematic state.
  • This occurs with both port and file based sockets.

I'm using a socket based file object (derived from socket.makefile()) to have an efficient method for reading newline delimited responses. I haven't a clue why it's only blocking on macs, and only after encountering a heavy load (python bug?).

Guess that tomorrow I'll try removing our usage of a socket file, and make raw recv() calls to see if that works.

comment:13 Changed 5 years ago by atagar

Good news and bad news. Good news is that I have a 'fix' to stop the tests from hanging...
https://gitweb.torproject.org/user/atagar/stem.git/commitdiff/8206fe96c00fadf885d1d6c0be5970cad3537f2b

Bad news is that the problem isn't that the readline() simply becomes blocking. Somehow our ability to talk with tor's control socket breaks, and all further calls fail. Even with new sockets...

Traceback:

File "/Users/atagar/stem/test/integ/control/base_controller.py", line 95, in test_msg_repeatedly

with test.runner.get_runner().get_tor_socket() as control_socket:

File "/Users/atagar/stem/test/runner.py", line 497, in get_tor_socket

stem.connection.authenticate(control_socket, CONTROL_PASSWORD, self.get_chroot())

File "/Users/atagar/stem/stem/connection.py", line 291, in authenticate

raise AuthenticationFailure("socket connection failed" % exc)

AuthenticationFailure: socket connection failed

Naturally all further tests fail. The tor instance itself is fine, I can still use telnet to connect and authenticate. This is... extremely weird.

comment:14 Changed 5 years ago by atagar

  • Priority changed from normal to major

I've reproed this issue without using stem, and sent a ticket to Nick for ideas...
https://trac.torproject.org/6235

At this point I'm completely stuck. The control socket is going into CLOSE_WAIT, and after creating roughly three hundred connections we're unable to make any more. I've scoured forums and the python bugtracker for hours without finding a solution.

comment:15 Changed 5 years ago by atagar

  • Resolution set to duplicate
  • Status changed from new to closed

Skipping the problematic tests on Mac OSX...
https://gitweb.torproject.org/stem.git/commitdiff/48404ed766a775b2d958d9b4b686e96cc88914c0

Tested on a mac and they now pass without freezing. It doesn't look like this is a stem bug so resolving in favor of #6235. Feel free to reopen if you still get freezing issues or I'm wrong about it not being a stem bug.

Note: See TracTickets for help on using tickets.