Opened 3 years ago

Closed 3 years ago

#22602 closed defect (duplicate)

CollecTor's relaydescs module freezes while downloading from directory authorities

Reported by: karsten
Owned by: metrics-team
Priority: High
Milestone:
Component: Metrics/CollecTor
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:


This morning, 2017-06-14 ~07:00, I noticed that the latest consensus retrieved by CollecTor was valid after 2017-06-13 17:00.

The last log lines from the relaydescs module were:

2017-06-13 17:05:00,001 INFO o.t.c.c.CollecTorMain:66 Starting relaydescs module of CollecTor.
2017-06-13 17:05:26,184 INFO o.t.c.r.CachedRelayDescriptorReader:255 Finished importing relay descriptors from local Tor data directories:
cached-consensus: 2017-06-13 17:00:00
cached-descriptors: parsed 0, skipped 24560 server descriptors parsed 608, skipped 8585 server descriptors
cached-extrainfo: parsed 0, skipped 24543 extra-info descriptors parsed 607, skipped 8239 extra-info descriptors
v3-status-votes: parsed 8, skipped 0 votes

All other modules continued as usual.

Here's a stack trace obtained using jcmd:

"CollecTor-Scheduled-Thread-8" daemon prio=10 tid=0x00007fedd8006800 nid=0x6411 runnable [0x00007fee023fd000]
   java.lang.Thread.State: RUNNABLE
        at Method)
        - locked <0x000000078fd3b3d8> (a
        - locked <0x000000078fd3b418> (a
        at org.torproject.collector.relaydescs.RelayDescriptorDownloader.downloadResourceFromAuthority(
        at org.torproject.collector.relaydescs.RelayDescriptorDownloader.downloadDescriptors(
        at org.torproject.collector.relaydescs.ArchiveWriter.startProcessing(
        at java.util.concurrent.Executors$
        at java.util.concurrent.FutureTask.runAndReset(
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(
        at java.util.concurrent.ScheduledThreadPoolExecutor$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$

I stopped and restarted CollecTor and am now working on filling the gap of relay descriptors published in these ~16 hours by syncing from the backup instance.

I guess the fix is to start using a timeout somewhere. It's just curious that we didn't run into this case before. We didn't change anything there recently, did we?
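
For illustration, here is a minimal sketch of what such a timeout could look like, assuming the download goes through java.net.HttpURLConnection as comment:1 suggests. The class name, method names, and timeout values are hypothetical and not taken from CollecTor's code; setConnectTimeout and setReadTimeout are standard JDK methods.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of CollecTor.
class TimeoutDownloadSketch {

  /**
   * Creates an HTTP connection object with both timeouts set.
   * No network traffic happens here; the connection is only
   * established once the caller reads from it.
   */
  static HttpURLConnection openWithTimeouts(URL url,
      int connectTimeoutMillis, int readTimeoutMillis)
      throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    // Upper bound on establishing the TCP connection.
    conn.setConnectTimeout(connectTimeoutMillis);
    // Upper bound on each blocking read; the JDK default of 0
    // means "wait forever".
    conn.setReadTimeout(readTimeoutMillis);
    return conn;
  }

  /**
   * Downloads the resource at the given URL into memory. A stalled
   * peer leads to a java.net.SocketTimeoutException (a subclass of
   * IOException) rather than an indefinitely blocked thread.
   */
  static byte[] download(URL url, int connectTimeoutMillis,
      int readTimeoutMillis) throws IOException {
    HttpURLConnection conn = openWithTimeouts(url,
        connectTimeoutMillis, readTimeoutMillis);
    try (InputStream in = conn.getInputStream()) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
      }
      return out.toByteArray();
    } finally {
      conn.disconnect();
    }
  }
}
```

Note that the read timeout applies to each individual read, so it bounds how long the thread can sit idle but not the total download time. A peer that keeps trickling data would still need a separate overall deadline.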

Change History (4)

comment:1 Changed 3 years ago by iwakeh

I assume this is the HttpURLConnection problem we already have in several tickets (#20516, #20515, etc).

comment:2 Changed 3 years ago by iwakeh

Analysis in #20323.

comment:3 Changed 3 years ago by karsten

Oh well. I hadn't looked yet. One more reason to finally fix this.

comment:4 Changed 3 years ago by karsten

Resolution: duplicate
Status: new → closed

Looks like this is a duplicate of #20515. Closing, but referencing this ticket there, in case we need a stack trace to fix this.
