Opened 3 years ago

Closed 2 years ago

#20440 closed defect (fixed)

endless wait in BlockingIteratorImpl

Reported by: iwakeh Owned by: iwakeh
Priority: High Milestone: metrics-lib 1.6.0
Component: Metrics/Library Version:
Severity: Normal Keywords:
Cc: dgoulet Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

As reported by David:

Onionoo updater didn't stop after supposedly successful run.

Here's the relevant thread dump part:

"Thread-0" #8 prio=5 os_prio=0 tid=0x00007f0d602f4000 nid=0x21a7 in Object.wait() [0x00007f0d435e0000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000007157e4ee0> (a org.torproject.descriptor.impl.BlockingIteratorImpl)
	at java.lang.Object.wait(Object.java:502)
	at org.torproject.descriptor.impl.BlockingIteratorImpl.add(BlockingIteratorImpl.java:40)
	- locked <0x00000007157e4ee0> (a org.torproject.descriptor.impl.BlockingIteratorImpl)
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readDescriptors(DescriptorReaderImpl.java:291)
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.run(DescriptorReaderImpl.java:188)
	at java.lang.Thread.run(Thread.java:745)

This is the while loop in BlockingIteratorImpl.add

    while (this.queue.size() >= this.maxQueueSize) {
      try {
        wait();
      } catch (InterruptedException e) {
        /* nothing to be done */
      }
    }

As there are no consumers anymore and the queue is full the while-loop doesn't stop.

Child Tickets

Change History (17)

comment:1 Changed 3 years ago by iwakeh

Please test and review my branch.

This only prevents the endless-while (hope this can be confirmed), but doesn't yet solve the underlying issue.

I'll add a test later.

At some point java.util.concurrent classes should be used (already suggested in #14071).

comment:2 Changed 3 years ago by iwakeh

This is another branch based on the branch in comment:1 providing a test that fails when 'outOfDescriptors' was set while waiting to add an item.

comment:3 Changed 3 years ago by iwakeh

Improved the naming in the test in this commit.

In order to avoid confusion:
For testing please just use this branch for testing.
Thanks!

comment:4 Changed 3 years ago by karsten

Could this be related to #20421? The reason why this only affects Onionoo with the --update-only execution mode might be that Onionoo without that execution mode simply accumulates waiting metrics-lib threads without anybody noticing.

comment:5 in reply to:  4 Changed 3 years ago by iwakeh

Replying to karsten:

Could this be related to #20421? The reason why this only affects Onionoo with the --update-only execution mode might be that Onionoo without that execution mode simply accumulates waiting metrics-lib threads without anybody noticing.

I don't think so, because the problematic descriptors in #20421 were initially written by CollecTor w/o using metrics-lib. These descriptors only became visible while Oninoo processed CollecTor data.

comment:6 Changed 3 years ago by karsten

I'm sorry, I don't understand your last comment. Can you rephrase?

comment:7 Changed 3 years ago by dgoulet

Ok I did ran b4e557ee5d92c133b49fc7e593bbb50ffa20f1c9 but it's still hangs and no ERROR output from those extra logging statements in the patch.

Extra information, if I run the updater _not_ in the data repository that is where there is in/ and out/ and status/, it doesn't stall but then doesn't do much as well :).

comment:8 Changed 3 years ago by iwakeh

With the provided data I could reproduce. And actually it seems that one thread is caught in 'wait' not in the while-loop while anything else finished already. The jvm jut waits too and doesn't shutdown.
The immediate 'cure' is to make the responsible thread a daemon thread which works in my test setting.

Please, use this branch task-20440-endless-wait for a 'hot-fix'.

This issue needs some more investigation in order to fix it properly.

comment:9 Changed 3 years ago by dgoulet

Ah fix does work! Thanks a lot for this!

comment:10 Changed 2 years ago by iwakeh

Milestone: metrics-lib 1.6.0

This ought to be in 1.6.0.

comment:11 Changed 2 years ago by iwakeh

Owner: changed from karsten to iwakeh
Status: newassigned

comment:12 Changed 2 years ago by iwakeh

Status: assignedneeds_review

@karsten: Please take a look at this change.

I would recommend to add a new ticket for verifying the concurrent use-scenarios of metrics-lib.

comment:13 Changed 2 years ago by karsten

The fix looks good to me, even unrelated to this particular error case. What if an application starts reading descriptors and decides at some point to stop processing descriptors, possibly because a database connection has broke or the disk has run full or whatever. In such a case the application shouldn't be forced to request all descriptors from the iterator until it's empty. And the application shouldn't be forced to System.exit() just to kill the library thread. In a way, we're only using a thread to read and parse descriptors to speed up the process and not do all this when the application asks for another descriptor.

I'm still curious though: how did you reproduce this bug? Can I have the data you mention above that would reproduce it? (Or maybe I already have it, but where?) Thanks!

comment:14 in reply to:  13 Changed 2 years ago by iwakeh

Replying to karsten:

The fix looks good to me, even unrelated to this particular error case. What if an application starts reading descriptors and decides at some point to stop processing descriptors, possibly because a database connection has broke or the disk has run full or whatever. In such a case the application shouldn't be forced to request all descriptors from the iterator until it's empty. And the application shouldn't be forced to System.exit() just to kill the library thread. In a way, we're only using a thread to read and parse descriptors to speed up the process and not do all this when the application asks for another descriptor.

I'm still curious though: how did you reproduce this bug? Can I have the data you mention above that would reproduce it? (Or maybe I already have it, but where?) Thanks!

The data is not public, but check your mail (Oct 24, 2016)

I think, it might be a concurrency issue or at least that affects metrics-lib in addition.
For this discussion see ticket #21365.

comment:15 Changed 2 years ago by karsten

Thanks for pointing me to that email from October. I was able to reproduce the issue!

Here's what happens: Onionoo reads descriptors from in/archive/ and finds descriptor types in there that it does not recognize, including votes, microdescs, and certs. It decides (for whatever reason!) that it should stop reading from that directory, leaves the descriptor queue alone, and continues with reading descriptors from in/recent/. However, the metrics-lib non-daemon thread that reads descriptors and puts them into the queue blocks as soon as the queue runs full and later prevents the JVM from terminating.

This means we have two bugs here:

  • metrics-lib should allow the application to stop reading descriptors for whatever reason. Therefore, turning the reader thread into a daemon thread seems like the right fix. Please review my task-20440 branch with your cherry-picked commit and a change log entry.
  • Onionoo should be a little more robust against unexpected data in the in/archive/ directory. It could simply skip the unrecognized descriptor and carry on with the next descriptor. That's s/break/continue/ in this line. Happy to discuss that further in an Onionoo ticket, or happy to just make this change.

Of course, we could as well blame the operator for putting votes, microdescs, and certs into the in/archive/ directory, knowing very well that Onionoo would not recognize those. ;)

comment:16 in reply to:  15 Changed 2 years ago by iwakeh

Status: needs_reviewmerge_ready

Replying to karsten:

Thanks for pointing me to that email from October. I was able to reproduce the issue!

Here's what happens: Onionoo reads descriptors from in/archive/ and finds descriptor types in there that it does not recognize, including votes, microdescs, and certs. It decides (for whatever reason!) that it should stop reading from that directory, ...

Oh, good to know.

This means we have two bugs here:

  • metrics-lib should allow the application to stop reading descriptors for whatever reason. Therefore, turning the reader thread into a daemon thread seems like the right fix. Please review my task-20440 branch with your cherry-picked commit and a change log entry.

Yes, this looks fine. Ready for merge.

  • Onionoo should be a little more robust against unexpected data in the in/archive/ directory. It could simply skip the unrecognized descriptor and carry on with the next descriptor. That's s/break/continue/ in this line. Happy to discuss that further in an Onionoo ticket, or happy to just make this change.

Yes, this is a new ticket #21370, but rather an enhancement ;-)

comment:17 Changed 2 years ago by karsten

Resolution: fixed
Status: merge_readyclosed

Great, merged! I guess that means this ticket is done. Closing. Thanks!

Note: See TracTickets for help on using tickets.