Opened 5 years ago

Closed 5 years ago

#12566 closed defect (fixed)

Onionoo stalled while downloading descriptors

Reported by: karsten
Owned by: karsten
Priority: Medium
Milestone:
Component: Metrics/Onionoo
Version:
Severity:
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Yesterday evening, Onionoo's back-end stalled forever while downloading descriptors from CollecTor. Subsequent back-end runs found the stalled run's lock file and terminated immediately. After six hours, the front-end considered its data to be stale and replied to all requests with 500 Internal Server Error. This was the reason for #12565.
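The front-end behavior described above can be sketched roughly as follows. This is an illustration only, not Onionoo's actual front-end code: the class name `StalenessCheck`, its methods, and the mapping to a bare status code are all hypothetical. Only the six-hour threshold and the 500 response come from this ticket.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper; not taken from the Onionoo code base. */
final class StalenessCheck {

  /** Six-hour threshold as stated in the ticket description. */
  static final Duration MAX_AGE = Duration.ofHours(6);

  /** Returns true if the last successful update is older than MAX_AGE. */
  static boolean isStale(Instant lastUpdate, Instant now) {
    return Duration.between(lastUpdate, now).compareTo(MAX_AGE) > 0;
  }

  /** Assumed mapping: 500 when the data are stale, 200 otherwise. */
  static int statusCode(Instant lastUpdate, Instant now) {
    return isStale(lastUpdate, now) ? 500 : 200;
  }
}
```

With a last update at 22:15:02 UTC, such a check would start returning 500 shortly after 04:15:02 UTC the next morning.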

From the back-end log:

     [java] Sun Jul 06 22:15:02 UTC 2014: Downloading descriptors.
[...]
     [java] Sun Jul 06 23:15:02 UTC 2014: Initializing.
     [java]   Could not acquire lock.  Is Onionoo already running?  Terminating (00:00.000 minutes).

Three ideas to fix this problem:

  • The terminating runs should have sent error messages to the operator, so that the problem would have been detected much earlier.
  • Six hours may be too short for the front-end to consider its data stale. Maybe 24 hours is more realistic. After all, 24-hour-old data are not wrong, they're just not as fresh as users would expect. Maybe clients could display results and add a little warning that the data are not as fresh as usual.
  • I have no idea why downloading descriptors stalled in the first place. My current idea is to add more log statements to track down what went wrong.
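A minimal sketch of the first idea, assuming the lock is a plain file created atomically at the start of each run. The class `LockFile`, its methods, and the message wording are hypothetical and not taken from Onionoo. How the message reaches the operator is also an assumption: printing to standard error from a cron job is one common way, since cron can mail such output.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical lock-file wrapper; not Onionoo's actual implementation. */
final class LockFile {

  private final Path path;

  LockFile(Path path) {
    this.path = path;
  }

  /** Atomically creates the lock file; returns false if it already exists. */
  boolean acquire() throws IOException {
    try {
      Files.createFile(path);
      return true;
    } catch (FileAlreadyExistsException e) {
      return false;
    }
  }

  /** Time elapsed since the existing lock file was last modified. */
  Duration age(Instant now) throws IOException {
    return Duration.between(Files.getLastModifiedTime(path).toInstant(), now);
  }

  /** Removes the lock file at the end of a run. */
  void release() throws IOException {
    Files.deleteIfExists(path);
  }

  /** Builds the error text a terminating run could send to the operator. */
  static String describeFailure(Duration lockAge) {
    return String.format(
        "Could not acquire lock (held for %d minutes). Is a previous run stalled?",
        lockAge.toMinutes());
  }
}
```

A run that fails to acquire the lock could then write `LockFile.describeFailure(lock.age(Instant.now()))` to `System.err` before terminating, so that a lock held for hours becomes visible after the first skipped run.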

Change History (2)

comment:1 Changed 5 years ago by karsten

Fixed the first thing on the list: "The terminating runs should have sent error messages to the operator, so that the problem would have been detected much earlier."

comment:2 in reply to:  description Changed 5 years ago by karsten

Resolution: fixed
Status: new → closed

Replying to karsten:

  • Six hours may be too short for the front-end to consider its data stale. Maybe 24 hours is more realistic. After all, 24-hour-old data are not wrong, they're just not as fresh as users would expect. Maybe clients could display results and add a little warning that the data are not as fresh as usual.

On second thought, six hours is not a bad number, and 12 or 24 hours are not obviously better. Let's leave this unchanged.

  • I have no idea why downloading descriptors stalled in the first place. My current idea is to add more log statements to track down what went wrong.

It didn't happen again. But now that error messages will be sent to the service operator, we'll detect the problem much faster if it ever happens again.
