investigate halt of scheduling for one of many tasks in collector's scheduler

changed milestone to %CollecTor 1.0.0

added component::metrics/collector milestone::CollecTor 1.0.0 owner::iwakeh priority::high resolution::fixed severity::normal status::closed type::defect labels

fixed typos, assigned.

Trac:
Status: new to assigned
Description: Affected is the current HEAD of the master branch.

The scheduling of the updateindex was stopped w/o any exception/warning whatsoever.

Positive about this is that the other scheduled task were not affected.

First steps:

add some logging to threads of the thread-pool
maybe, use other ways to study this behavior
implement a solution based on the facts found in the steps above.

to

Affected is the current HEAD of the master branch.

The scheduling of updateindex was stopped w/o any exception/warning whatsoever.

Positive about this is that the other scheduled tasks were not affected.

First steps:

add some logging to threads of the thread-pool
maybe, use other ways to study this behavior
implement a solution based on the facts found in the steps above.

Added a test that reproduces the issue and a solution patch. The problem occurs when a java.lang.Error is thrown. Now, Throwable is caught in CollecTorMain.run and the error is logged.

Usually, Errors pose a serious problem; so there should be a note in the Operator guide about monitoring logs. (added this to #18734 (moved))

Please review the patch branch.

Trac:
Status: assigned to needs_review
Priority: Medium to High

Fix looks good, but test hangs here:

    [junit] Testcase: testUrlValueException took 0.001 sec
    [junit] Testsuite: org.torproject.collector.cron.CollecTorMainTest
    [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.574 sec
    [junit] 
    [junit] Testcase: testCheckAvailableSpace took 0.551 sec
    [junit] Testsuite: org.torproject.collector.cron.SchedulerTest

Replying to karsten:

Fix looks good, but test hangs here:

{{{ [junit] Testcase: testUrlValueException took 0.001 sec [junit] Testsuite: org.torproject.collector.cron.CollecTorMainTest [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.574 sec [junit] [junit] Testcase: testCheckAvailableSpace took 0.551 sec [junit] Testsuite: org.torproject.collector.cron.SchedulerTest }}}

This test lasts a few minutes to give the Scheduler time.

Oh. Then we should either make it much faster, not add it, or add a separate test target for tests taking longer than a few seconds. Otherwise we're not going to run tests very often. What do you think?

Well, I always let the tests run. But I understand your concern.

The added test reproduces the problem and demonstrates that the solution works, but I don't expect it to reveal any other problems in future. But, other tests for the Scheduler might need similarly long.

Thinking about it, it might be easy to change Scheduler a little in order to provide a fast way for testing by simply exposing a milisecond version of the scheduling method.

So, accept the test for now and have a new ticket for making this test and potentially others related to scheduling faster? That way the long test times will go away soon.

Hmmm. How about we run the test now prior to merging, merge only the fix without the test, and add the test to a new ticket that provides faster tests for the scheduler?

Replying to karsten:

Hmmm. How about we run the test now prior to merging, merge only the fix without the test, and add the test to a new ticket that provides faster tests for the scheduler?

With a minor difference b/c the test won't be changed too much, I'd like to keep it.

As you suggest: run the test before merging and add the ignore annotation to the test method: @Ignore("This test takes <time>, which is too long.") with the time you measured.

Then I'll create the ticket about improving the test timing.

Would that be ok?

Sure, adding an @Ignore annotation sounds good.

But! The test doesn't pass here:

    [junit] Testcase: testScheduleBrokenClass took 120.039 sec
    [junit] 	FAILED
    [junit] expected:<15> but was:<10>
    [junit] junit.framework.AssertionFailedError: expected:<15> but was:<10>
    [junit] 	at org.torproject.collector.cron.SchedulerTest.testScheduleBrokenClass(SchedulerTest.java:113)

Please find the rebased and slightly tweaked patch in my task-19771-2 branch. Did I tweak it too much? Does it work for you?

Well, the tweaking didn't go far enough :-)

As the scheduling behavior got changed the test now needs to wait a little longer. Here is the diff:

diff --git a/src/test/java/org/torproject/collector/cron/SchedulerTest.java b/src/test/java/org/torproject/collector/cron/SchedulerTest.java
index 3fad830..85f7002 100644
--- a/src/test/java/org/torproject/collector/cron/SchedulerTest.java
+++ b/src/test/java/org/torproject/collector/cron/SchedulerTest.java
@@ -107,7 +107,7 @@ public class SchedulerTest {
         schedulerField.get(Scheduler.getInstance());
     Scheduler.getInstance().scheduleModuleRuns(ctms, conf);
     long sysNow = System.currentTimeMillis();
-    while (System.currentTimeMillis() - sysNow < 120_000) {
+    while (System.currentTimeMillis() - sysNow < 180_000) {
       try { Thread.sleep(10_000);} catch (Exception e) { /* ignored */ }
     }
     assertEquals(15, Broken.count.intValue());

With the old version the first runs happened immediately, now the scheduler waits for the beginning of the next minute. So, increasing test time makes it pass again.

(You probably noticed, but just in case: the logs in generated/test-logs help pass the time waiting for the test to complete :-)

Okay, tweaked more, looks good now. Pushed to master and closing. Thanks!

Trac:
Resolution: N/A to fixed
Status: needs_review to closed

And fixed compile in a separate commit. Oops.

closed

investigate halt of scheduling for one of many tasks in collector's scheduler

Child items 0

Activity