Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#19771 closed defect (fixed)

investigate halt of scheduling for one of many tasks in collector's scheduler

Reported by: iwakeh Owned by: iwakeh
Priority: High Milestone: CollecTor 1.0.0
Component: Metrics/CollecTor Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by iwakeh)

Affected is the current HEAD of the master branch.

The scheduling of updateindex was stopped w/o any exception/warning whatsoever.

Positive about this is that the other scheduled tasks were not affected.

First steps:

  1. add some logging to threads of the thread-pool
  2. maybe, use other ways to study this behavior
  3. implement a solution based on the facts found in the steps above.

Child Tickets

Change History (12)

comment:1 Changed 3 years ago by iwakeh

Description: modified (diff)
Status: newassigned

fixed typos, assigned.

comment:2 Changed 3 years ago by iwakeh

Priority: MediumHigh
Status: assignedneeds_review

Added a test that reproduces the issue and a solution patch.
The problem occurs when a java.lang.Error is thrown. Now, Throwable is caught in CollecTorMain.run and the error is logged.

Usually, Errors pose a serious problem; so there should be a note in the Operator guide about monitoring logs. (added this to #18734)

Please review the patch branch.

comment:3 Changed 3 years ago by karsten

Fix looks good, but test hangs here:

    [junit] Testcase: testUrlValueException took 0.001 sec
    [junit] Testsuite: org.torproject.collector.cron.CollecTorMainTest
    [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.574 sec
    [junit] 
    [junit] Testcase: testCheckAvailableSpace took 0.551 sec
    [junit] Testsuite: org.torproject.collector.cron.SchedulerTest

comment:4 in reply to:  3 Changed 3 years ago by iwakeh

Replying to karsten:

Fix looks good, but test hangs here:

    [junit] Testcase: testUrlValueException took 0.001 sec
    [junit] Testsuite: org.torproject.collector.cron.CollecTorMainTest
    [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.574 sec
    [junit] 
    [junit] Testcase: testCheckAvailableSpace took 0.551 sec
    [junit] Testsuite: org.torproject.collector.cron.SchedulerTest

This test lasts a few minutes to give the Scheduler time.

comment:5 Changed 3 years ago by karsten

Oh. Then we should either make it much faster, not add it, or add a separate test target for tests taking longer than a few seconds. Otherwise we're not going to run tests very often. What do you think?

comment:6 Changed 3 years ago by iwakeh

Well, I always let the tests run. But I understand your concern.

The added test reproduces the problem and demonstrates that the solution works, but I don't
expect it to reveal any other problems in future. But, other tests for the Scheduler might
need similarly long.

Thinking about it, it might be easy to change Scheduler a little in order to provide a fast
way for testing by simply exposing a milisecond version of the scheduling method.

So, accept the test for now and have a new ticket for making this test and potentially others related to scheduling faster?
That way the long test times will go away soon.

comment:7 Changed 3 years ago by karsten

Hmmm. How about we run the test now prior to merging, merge only the fix without the test, and add the test to a new ticket that provides faster tests for the scheduler?

comment:8 in reply to:  7 Changed 3 years ago by iwakeh

Replying to karsten:

Hmmm. How about we run the test now prior to merging, merge only the fix without the test, and add the test to a new ticket that provides faster tests for the scheduler?

With a minor difference b/c the test won't be changed too much, I'd like to keep it.

As you suggest: run the test before merging and
add the ignore annotation to the test method: @Ignore("This test takes <time>, which is too long.") with the time you measured.

Then I'll create the ticket about improving the test timing.

Would that be ok?

comment:9 Changed 3 years ago by karsten

Sure, adding an @Ignore annotation sounds good.

But! The test doesn't pass here:

    [junit] Testcase: testScheduleBrokenClass took 120.039 sec
    [junit] 	FAILED
    [junit] expected:<15> but was:<10>
    [junit] junit.framework.AssertionFailedError: expected:<15> but was:<10>
    [junit] 	at org.torproject.collector.cron.SchedulerTest.testScheduleBrokenClass(SchedulerTest.java:113)

Please find the rebased and slightly tweaked patch in my task-19771-2 branch. Did I tweak it too much? Does it work for you?

comment:10 Changed 3 years ago by iwakeh

Well, the tweaking didn't go far enough :-)

As the scheduling behavior got changed the test now needs to wait a little longer.
Here is the diff:

diff --git a/src/test/java/org/torproject/collector/cron/SchedulerTest.java b/src/test/java/org/torproject/collector/cron/SchedulerTest.java
index 3fad830..85f7002 100644
--- a/src/test/java/org/torproject/collector/cron/SchedulerTest.java
+++ b/src/test/java/org/torproject/collector/cron/SchedulerTest.java
@@ -107,7 +107,7 @@ public class SchedulerTest {
         schedulerField.get(Scheduler.getInstance());
     Scheduler.getInstance().scheduleModuleRuns(ctms, conf);
     long sysNow = System.currentTimeMillis();
-    while (System.currentTimeMillis() - sysNow < 120_000) {
+    while (System.currentTimeMillis() - sysNow < 180_000) {
       try { Thread.sleep(10_000);} catch (Exception e) { /* ignored */ }
     }
     assertEquals(15, Broken.count.intValue());

With the old version the first runs happened immediately, now the scheduler waits for the beginning of the next minute. So, increasing test time makes it pass again.

(You probably noticed, but just in case: the logs in generated/test-logs help pass the time waiting for the test to complete :-)

comment:11 Changed 3 years ago by karsten

Resolution: fixed
Status: needs_reviewclosed

Okay, tweaked more, looks good now. Pushed to master and closing. Thanks!

comment:12 Changed 3 years ago by karsten

And fixed compile in a separate commit. Oops.

Note: See TracTickets for help on using tickets.