Opened 4 years ago

Closed 4 years ago

#14071 closed defect (fixed)

BlockingIteratorImpl.hasNext waits forever

Reported by: iwakeh Owned by: karsten
Priority: Immediate Milestone:
Component: Metrics/Library Version:
Severity: Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

My Onionoo server blocks after about 20 runs.
The analysis yields that it or the descriptor lib waits forever in
org.torproject.descriptor.impl.BlockingIteratorImpl.hasNext

This issue blocks #14070

Maybe also think about using java.util.concurrent throughout the library?
E.g. some BlockingQueue implementation, and atomic variables?

Child Tickets

Change History (7)

comment:1 Changed 4 years ago by karsten

Status: newaccepted

Interesting. Let's debug and fix this first, and then consider switching to something else. (Otherwise we might be changing too many variables at once.)

I just started a local Onionoo instance with the Main class you posted to #13600. Let's see if I run into similar problems.

Of course, if you find out more details about the issue, please post them here.

comment:2 Changed 4 years ago by iwakeh

It is not necessary to use the "new" Main class.

I reproduced the problem using the old version on master.
Just run the background app starting from scratch.
And at some point there seems to be a data situation that
the initial wait() in hasNext() is called, but no notifyAll() or notify()
can occur later.

comment:3 Changed 4 years ago by karsten

Can you attach your logs to this ticket?

comment:4 Changed 4 years ago by iwakeh

Analysis

From reading the source I concluded that the above situation is going to happen
when the thread running 'DescriptorReaderRunnable' (in DescriptorReaderImpl.java)
is terminated before this.descriptorQueue.setOutOfDescriptors();
inside the run method.

The methods called from run() contain unhandled exceptions and any other odd RuntimeException
could just terminate this thread prematurely.
If this happens before the call to 'setOutOfDescriptors()' the while-loop in BlockingIteratorIml.hasNext() won't ever exit.

Possible remedies

Maybe, surround the contents of run method with try-catch for all throwables or have a catch-clause for each line in run() in order to handle errors differently?
What do you think? Should the application exit? Or only 'this.descriptorQueue.setOutOfDescriptors()' be called?
Logging would be very important too, but metrics-lib does not use slf4j yet and its introduction would take a while, afaik?

In the long run an event mechanism and using java.concurrent might be a good replacement.
With the customizable factory classes (I think this was #12868) it would be easy to have two versions of metrics-lib available.
So, a new approach would not be risky for the other projects depending on metrics-lib.

comment:5 Changed 4 years ago by karsten

Wow, good catch! Want to try out my branch task-14071? I'm particularly interested what runtime exception that was, because we should fix that regardless of surrounding things with try.

Regarding logging, I agree that we should have that. But for now, I'd say write to stderr if something really bad happens. (Like a runtime exception in the descriptor reader thread...)

Regarding the suggested event mechanism, I'd say let's just write a new DescriptorReader interface and implementation class, and deprecate the existing interface. The current interface has some parts I'd want to get rid of anyway: all configuration should happen in a builder class, and reading/writing the parse history file shouldn't be done by the reader at all.

Sorry for taking so long to respond. And thanks a lot for investigating this! I'm curious to learn what the exception was.

comment:6 Changed 4 years ago by iwakeh

Well, I didn't keep all those logfiles around, but I remember there was a number format
exception mentioned in the nohup out-file. I could not reproduce this particular exception.

Currently, I have something running that looks like your task-14071 branch except for the
'if' statement (and logging ;-)

I'll post an update when there comes something up.

Last edited 4 years ago by iwakeh (previous) (diff)

comment:7 Changed 4 years ago by karsten

Resolution: fixed
Status: acceptedclosed

Okay, I merged the patch with an additional comment and "logging message" saying that this is a bug. I'm resolving this ticket. If you run into this bug again, please open a new ticket. Thanks!

Note: See TracTickets for help on using tickets.