Opened 12 days ago

Closed 2 days ago

#27053 closed defect (fixed)

Check controller's event error handling

Reported by: atagar Owned by: atagar
Priority: Very High Milestone:
Component: Core Tor/Stem Version:
Severity: Major Keywords: controller
Cc: dmr Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Mike had a great question: 'what happens to events if there's an error?'. On reflection the answer is: nothing good.

Our handle_event method parses events without any kind of error handling...

https://gitweb.torproject.org/stem.git/tree/stem/control.py#n3876

This would in turn break our event handling thread. Instead we should direct malformed events into a 'broken event' queue.

Child Tickets

Change History (16)

comment:1 Changed 12 days ago by mikeperry

Related: if a controller itself messes up handling an event callback and throws an arbitrary exception, it should have a way to know that this happened, so it can decide to exit. Or perhaps equivalently for my purposes: if stem just exits the program in this case.

Without this, controllers that are supposed to be long-lived might sit around in a zombie-state where users think they are running, but they actually are not reacting to anything.

comment:2 Changed 12 days ago by atagar

Hi Mike. When our caller's handler raises an exception it should be ignored by Stem. Presently it breaks the event thread and that is definitely wrong.

Stem intends to provide the following contract...

  • Users should not need to know about Stem's internal event thread and it should never fail. If it does that's my bug.
  • When a listener is subscribed it should always be invoked when a validly formatted event of that type is received.
  • What the listener does (and if it fails) is of no interest to Stem.

Unfortunately I need to carve out one exception for that last point: if the called function hangs then that will break our event thread. To protect us from that I could invoke handlers in subthreads like we do for status changed listeners...

https://stem.torproject.org/api/control.html#stem.control.BaseController.add_status_listener

... but unlike status changes events are numerous enough that guardrails like that would be costly.

comment:3 Changed 9 days ago by atagar

Hi Mike, sorry for the delay. While looking into this I spotted another oversight I had to correct first (in particular, ControlMessage equality/hashing wasn't yet implemented). Pushed tests that repro the issues mentioned above...

https://gitweb.torproject.org/stem.git/commit/?id=b557d09

I'll push fixes for these in the next few days.

comment:4 in reply to:  3 ; Changed 8 days ago by dmr

Cc: dmr added

Replying to atagar:

[...] Pushed tests that repro the issues mentioned above...

https://gitweb.torproject.org/stem.git/commit/?id=b557d09

I'll push fixes for these in the next few days.

One of these tests is failing intermittently. I've seen 3 failures in py35, and no failures in py27 (yet, at least). Here's the output:

======================================================================
FAIL: test_events_get_received
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.5/unittest/mock.py", line 1159, in patched
    return func(*args, **keywargs)
  File "/path/to/test/unit/control/controller.py", line 633, in test_events_get_received
    self.circ_listener.assert_called_once_with(CIRC_EVENT)
  File "/usr/lib/python3.5/unittest/mock.py", line 804, in assert_called_once_with
    raise AssertionError(msg)
AssertionError: Expected 'mock' to be called once. Called 0 times.

comment:5 Changed 8 days ago by dmr

Keywords: controller added

comment:6 in reply to:  4 Changed 8 days ago by dmr

Replying to dmr:

[...] and no failures in py27 (yet, at least).

Spoke too soon...

======================================================================
FAIL: test_events_get_received
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/path/to/.tox/py27/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/path/to/test/unit/control/controller.py", line 633, in test_events_get_received
    self.circ_listener.assert_called_once_with(CIRC_EVENT)
  File "/path/to/.tox/py27/local/lib/python2.7/site-packages/mock/mock.py", line 947, in assert_called_once_with
    raise AssertionError(msg)
AssertionError: Expected 'mock' to be called once. Called 0 times.

comment:7 Changed 7 days ago by dmr

Somehow I missed this before, but currently I'm getting this output as well, when running all tests without any options beyond -a:

Threads lingering after test run:
  <_MainThread(MainThread, started 140422286939904)>
  <Thread(Event Notifier, started daemon 140421808256768)>
68 TESTS WERE SKIPPED
TESTING PASSED (100 seconds)

Specifically I've seen this for py27 and py35, but the above is only for py27 (it's probably slightly different for py35).

It only seems to happen when the aforementioned test doesn't fail.

It doesn't happen from the preceding commit.

It also doesn't happen if only unit tests are run (-u), so there might be some weird interplay going on.

comment:8 Changed 7 days ago by atagar

Resolution: fixed
Status: newclosed

Hi Mike, fix pushed. Feel free to reopen if you run into any further issues...

https://gitweb.torproject.org/stem.git/commit/?id=ed6460d

One of these tests is failing intermittently.

Thanks Dave, I saw this a bit too. Bumping the sleep looks to have done the trick.

Threads lingering after test run

Pity, thought we had all those sorted out. We've sorted those when they cropped up in the past but presently I'm not seeing it. If you get a reliable repro then a patch would be great. That said, shouldn't be related to this ticket.

comment:9 in reply to:  8 ; Changed 6 days ago by dmr

Replying to atagar:

We've sorted those when they cropped up in the past but presently I'm not seeing it. If you get a reliable repro then a patch would be great. That said, shouldn't be related to this ticket.

They were reproducing pretty reliably at b557d09c2721858785b6310f331e29c06b76887f and later (my commits), and not at d1284e6d0301de59ddf286b5e799dee50a5d4a5a (right before the WIP test-repo commit). I'd say: either the aformentioned test would fail OR the threads would linger - I never saw a case where neither happened.

That said, I'll let you know if I happen to still see them now that you've submitted ed6460df4a6fa8446a1e2923bddde652f5d91ed7. (About to rebase my work onto that.)

comment:10 in reply to:  9 Changed 6 days ago by dmr

Resolution: fixed
Status: closedreopened

Replying to dmr:

That said, I'll let you know if I happen to still see them now that you've submitted ed6460df4a6fa8446a1e2923bddde652f5d91ed7. (About to rebase my work onto that.)

Ok, just finished a test run on master (ed6460df4a6fa8446a1e2923bddde652f5d91ed7).

Based on the output (still failing) and past experience prior to this change, I'm going to reopen this ticket - something in these commits seems to be causing this.

atagar, if you think it's something specific to my configuration / setup, let's discuss over IRC to nail this down.

Ran:
tox -- -a
(which is basically run_tests.py -a on py27 and p35

Here's snippet output...

py27:

py27 runtests: commands[1] | python run_tests.py -a
======================================================================
                             INITIALISING                             
======================================================================

  checking stem version...                             1.6.0-dev
  checking tor version...                              0.3.2.10
  checking python version...                           2.7.13
  checking cryptography version...                     2.2.2
  checking pynacl version...                           1.2.1
  checking mock version...                             2.0.0
  checking pyflakes version...                         2.0.0
  checking pycodestyle version...                      2.4.0
  checking for orphaned .pyc files...                  done (0.0s)
  checking for unused tests...                         done (0.1s)
  importing test modules...                            done (0.1s)
  running pyflakes...                                  running
  running pycodestyle...                               running

...
Shutting down tor... done

Threads lingering after test run:
  <_MainThread(MainThread, started 140094097536768)>
  <Thread(Event Notifier, started daemon 140093752395520)>
  <Thread(Event Notifier, started daemon 140093760788224)>
  <Thread(Event Notifier, started daemon 140093780129536)>
69 TESTS WERE SKIPPED
TESTING PASSED (23 seconds)

py35:

py35 runtests: commands[1] | python run_tests.py -a
======================================================================
                             INITIALISING                             
======================================================================

  checking stem version...                             1.6.0-dev
  checking tor version...                              0.3.2.10
  checking python version...                           3.5.3
  checking cryptography version...                     2.2.2
  checking pynacl version...                           1.2.1
  checking mock version...                             1.0
  checking pyflakes version...                         2.0.0
  checking pycodestyle version...                      2.4.0
  checking for orphaned .pyc files...                  done (0.0s)
  checking for unused tests...                         done (0.0s)
  importing test modules...                            done (0.3s)
  running pyflakes...                                  running
  running pycodestyle...                               running

======================================================================
                              UNIT TESTS                              
======================================================================

...
  control.controller...                                failed (0.58s)
test_add_event_listener                                4 ms  [SUCCESS]
test_attach_stream                                     0 ms  [SUCCESS]
test_drop_guards                                       1 ms  [SUCCESS]
test_event_description                                 7 ms  [SUCCESS]
test_event_description_includes_all_events             0 ms  [SUCCESS]
test_event_listing_with_error                        170 ms  [SUCCESS]
test_event_listing_with_malformed_event                      [FAILURE]
test_events_get_received                             103 ms  [SUCCESS]
test_get_accounting_stats                              1 ms  [SUCCESS]
test_get_effective_rate                                1 ms  [SUCCESS]
test_get_exit_policy                                   4 ms  [SUCCESS]
test_get_exit_policy_if_not_relaying                   0 ms  [SUCCESS]
test_get_info                                          1 ms  [SUCCESS]
test_get_info_address_caching                          2 ms  [SUCCESS]
test_get_info_without_fingerprint                      3 ms  [SUCCESS]
test_get_network_status                                1 ms  [SUCCESS]
test_get_network_status_for_ourselves                  1 ms  [SUCCESS]
test_get_network_status_when_unavailable               1 ms  [SUCCESS]
test_get_pid_by_getinfo                                0 ms  [SUCCESS]
test_get_pid_by_name                                   9 ms  [SUCCESS]
test_get_pid_by_pid_file                               1 ms  [SUCCESS]
test_get_pid_remote                                    1 ms  [SUCCESS]
test_get_ports                                         4 ms  [SUCCESS]
test_get_protocolinfo                                  1 ms  [SUCCESS]
test_get_socks_listeners_new                           1 ms  [SUCCESS]
test_get_socks_listeners_old                           2 ms  [SUCCESS]
test_get_streams                                       0 ms  [SUCCESS]
test_get_user_by_getinfo                               0 ms  [SUCCESS]
test_get_user_by_system                                1 ms  [SUCCESS]
test_get_user_remote                                   1 ms  [SUCCESS]
test_get_version                                       1 ms  [SUCCESS]
test_parse_circ_path                                   0 ms  [SUCCESS]
test_timeout                                         101 ms  [SUCCESS]

======================================================================
ERROR: test_event_listing_with_malformed_event
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.5/unittest/mock.py", line 1159, in patched
    return func(*args, **keywargs)
  File "/path/to/test/unit/control/controller.py", line 676, in test_event_listing_with_malformed_event
    self.malformed_listener.assert_called_once()
  File "/usr/lib/python3.5/unittest/mock.py", line 585, in __getattr__
    raise AttributeError(name)
AttributeError: assert_called_once

----------------------------------------------------------------------
Ran 33 tests in 0.576s

FAILED (errors=1)

...

======================================================================
                          INTEGRATION TESTS                           
======================================================================

...

Shutting down tor... done

Threads lingering after test run:
  <_MainThread(MainThread, started 139884796438272)>
  <Thread(Event Notifier, started daemon 139884448605952)>
  <Thread(Event Notifier, started daemon 139884459190016)>
  <Thread(Event Notifier, started daemon 139884417718016)>
TESTING FAILED (30 seconds)
  [UNIT TEST] test_event_listing_with_malformed_event (test.unit.control.controller.TestControl) ... ERROR

So, seemingly: seeing multiple threads lingering, independent of test failure.

comment:11 Changed 6 days ago by atagar

Thanks Dave. Yup, you're right, this caused the lingering thread regression. Fixed.

https://gitweb.torproject.org/stem.git/commit/?id=bc06836

As for the unit test failures I have a theory. When we run all our tests 'run_tests.py --all' it performs our heaviest integ tests asynchronously along with our unit tests to speed things up. This tends to peg the CPU, possibly enough for the bumped sleep to still not be enough. Sleeping like that was a lazy hack anyway so replaced it with something better.

https://gitweb.torproject.org/stem.git/commit/?id=c9cc0f5

Does that do the trick?

comment:12 in reply to:  11 Changed 6 days ago by dmr

Replying to atagar:

As for the unit test failures I have a theory. When we run all our tests 'run_tests.py --all' it performs our heaviest integ tests asynchronously along with our unit tests to speed things up. This tends to peg the CPU, possibly enough for the bumped sleep to still not be enough. Sleeping like that was a lazy hack anyway so replaced it with something better.

https://gitweb.torproject.org/stem.git/commit/?id=c9cc0f5

Does that do the trick?

Still seeing unit-test failures. :(
tox -- -a again.

py27:

======================================================================
FAIL: test_event_listing_with_error
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/path/to/.tox/py27/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/path/to/test/unit/control/controller.py", line 663, in test_event_listing_with_error
    self.circ_listener.assert_called_once_with(CIRC_EVENT)
  File "/path/to/.tox/py27/local/lib/python2.7/site-packages/mock/mock.py", line 947, in assert_called_once_with
    raise AssertionError(msg)
AssertionError: Expected 'mock' to be called once. Called 0 times.

----------------------------------------------------------------------
Ran 33 tests in 0.308s

FAILED (failures=1)

py35:

======================================================================
ERROR: test_event_listing_with_malformed_event
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.5/unittest/mock.py", line 1159, in patched
    return func(*args, **keywargs)
  File "/path/to/test/unit/control/controller.py", line 688, in test_event_listing_with_malformed_event
    self.malformed_listener.assert_called_once()
  File "/usr/lib/python3.5/unittest/mock.py", line 585, in __getattr__
    raise AttributeError(name)
AttributeError: assert_called_once

----------------------------------------------------------------------
Ran 33 tests in 0.327s

FAILED (errors=1)

Note that these are different tests!

Thanks Dave. Yup, you're right, this caused the lingering thread regression. Fixed.

https://gitweb.torproject.org/stem.git/commit/?id=bc06836

On the plus side, it does look like the lingering threads are gone for me! :)

comment:13 Changed 5 days ago by atagar

Huh, pity. Since I can't repro this would you mind looking into this, Dave?

comment:14 Changed 4 days ago by atagar

Hi Dave. Good news, bad news time. The good news is that I figured out the python3 mock issue you found. Turns out there's a difference between PyPI and Python3's mock modules. Fixed...

https://gitweb.torproject.org/stem.git/commit/?id=e75cf25

Bad news is that this doesn't fix the transient test failures. I was able to repro those a few times, but once I started digging in the tests once again began passing reliably. Frustrating...

comment:15 Changed 3 days ago by atagar

Hi Dave, think I got it but tough to be sure. Does this do the trick for you?

https://gitweb.torproject.org/stem.git/commit/?id=abc6f29

comment:16 in reply to:  15 Changed 2 days ago by dmr

Resolution: fixed
Status: reopenedclosed

Replying to atagar:

Hi Dave, think I got it but tough to be sure. Does this do the trick for you?

https://gitweb.torproject.org/stem.git/commit/?id=abc6f29

(Yep still failing as of that rev)

As discussed over IRC, there's nothing here that leads me to believe the implementation for this ticket is wrong, so closing it out in favor of another ticket to track the unit-test failures:
#27173

Note: See TracTickets for help on using tickets.