Opened 14 months ago

Closed 10 months ago

Last modified 9 months ago

#24290 closed defect (fixed)

Configure timeout for metrics-lib clients, e.g., those using DescriptorIndexCollector

Reported by: karsten Owned by: iwakeh
Priority: High Milestone: metrics-lib 2.3.0
Component: Metrics/Library Version:
Severity: Normal Keywords:
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Tonight the relaydescs module of the main CollecTor instance froze when attempting to fetch the remote index.json from the backup CollecTor instance. Here's a stack trace I got from jcmd:

"CollecTor-Scheduled-Thread-10" #38 daemon prio=5 os_prio=0 tid=0x00007fb7ac009800 nid=0xea7 runnable [0x00007fb7e5893000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
        at sun.security.ssl.InputRecord.read(InputRecord.java:503)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
        - locked <0x000000008061f060> (a java.lang.Object)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)
        - locked <0x000000008061f188> (a java.lang.Object)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1413)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1397)
        at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1564)
        - locked <0x000000008061f1f8> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
        - locked <0x000000008061f1f8> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263)
        - locked <0x000000008061f2e0> (a sun.net.www.protocol.https.HttpsURLConnectionImpl)
        at java.net.URL.openStream(URL.java:1045)
        at org.torproject.descriptor.index.IndexNode.fetchIndex(IndexNode.java:101)
        at org.torproject.descriptor.index.DescriptorIndexCollector.collectDescriptors(DescriptorIndexCollector.java:74)
        at org.torproject.collector.sync.SyncManager.collectFromOtherInstances(SyncManager.java:59)
        at org.torproject.collector.sync.SyncManager.merge(SyncManager.java:43)
        at org.torproject.collector.cron.CollecTorMain.run(CollecTorMain.java:76)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I didn't look at the code in detail, but I could imagine that a timeout would have helped in this case. Maybe there are other possible fixes, though.

Child Tickets

Change History (14)

comment:1 Changed 14 months ago by karsten

Component: Metrics/CollecTorMetrics/Library

Erm, moving to Library, as this issue is likely to be found there.

comment:2 Changed 14 months ago by iwakeh

Owner: changed from metrics-team to iwakeh
Status: newaccepted

comment:3 Changed 12 months ago by iwakeh

Owner: changed from iwakeh to metrics-team
Status: acceptedassigned

Move to metrics-team as these are not worked on by me during the next week.

comment:4 Changed 11 months ago by iwakeh

Owner: changed from metrics-team to iwakeh
Status: assignedaccepted

comment:5 Changed 11 months ago by iwakeh

Component: Metrics/LibraryMetrics
Status: acceptedneeds_review
Summary: Use timeout for fetching remote index.json in DescriptorIndexCollectorConfigure timeout for metrics-lib client, e.g., those using DescriptorIndexCollector

Moving this to component Metrics because all metrics-lib clients are concerned.

I could reproduce the exact thread stack trace above. The waiting there would indeed be stopped using a read timeout.
Apparently, there is some connect timeout in place already (probably in the socket implementation) as can be viewed in CollecTor2's logs:

WARN o.t.d.i.DescriptorIndexCollector:77 Cannot fetch index file https://collector.torproject.org/index/index.json and hence cannot determine which remote files to fetch.  Aborting descriptor collection.
java.net.ConnectException: Connection timed out (Connection timed out)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:673)
        at sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:173)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
       ...
sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263)
        at java.net.URL.openStream(URL.java:1045)
        at org.torproject.descriptor.index.IndexNode.fetchIndex(IndexNode.java:101)
        at org.torproject.descriptor.index.DescriptorIndexCollector.collectDescriptors(DescriptorIndexCollector.java:74)
        at org.torproject.collector.sync.SyncManager.collectFromOtherInstances(SyncManager.java:59)
        at org.torproject.collector.sync.SyncManager.merge(SyncManager.java:43)
        at org.torproject.collector.cron.CollecTorMain.run(CollecTorMain.java:76)
...
        at java.lang.Thread.run(Thread.java:748)

Today I've noticed similar traces in the Onionoo logs.

So in general, this is a java configuration not an implementation issue. The timeouts should not be hard-coded as the need for shorter or longer settings depends on the environment.

As remedy two jvm properties need to be set

-Dsun.net.client.defaultConnectTimeout=5000 
-Dsun.net.client.defaultReadTimeout=5000

The read timeout fixes the problem reported in the description. Setting the connect timeout, too, in order to be consistent. Five seconds should be ok?

(Slightly related: For the ftp implementation the timeout got changed from infinite to five minutes in JDK 8u151.)

comment:6 Changed 11 months ago by iwakeh

Summary: Configure timeout for metrics-lib client, e.g., those using DescriptorIndexCollectorConfigure timeout for metrics-lib clients, e.g., those using DescriptorIndexCollector

comment:7 Changed 11 months ago by karsten

Status: needs_reviewneeds_revision

Great that you were able to track this down!

Regarding the suggested solution, leaving this as JVM property feels a bit ugly to me. First, with every config option we're making it harder to remember how to run our tools. And in this case, if somebody forgets to set the option, they'll end up with a bad default value. Second, this timeout is something that we should decide as developers and not leave up to operators. If we don't know what a good value is, how do the operators? And it's not that environments vary by that much.

Should we simply hard-code five minutes? It's way better than waiting infinitely, and five seconds seems possibly too short. And if we find out that five minutes is a bad value, we can further tweak it in the future. Of course, we should document this timeout in the relevant documentation. And then it should be okay to hard-code it.

comment:8 in reply to:  7 ; Changed 11 months ago by iwakeh

Status: needs_revisionneeds_information

Replying to karsten:

Great that you were able to track this down!

Regarding the suggested solution, leaving this as JVM property feels a bit ugly to me. First, with every config option we're making it harder to remember how to run our tools. And in this case, if somebody forgets to set the option, they'll end up with a bad default value. Second, this timeout is something that we should decide as developers and not leave up to operators. If we don't know what a good value is, how do the operators? And it's not that environments vary by that much.


As operator I might be able to know if some other host sits on the same rack in the same server room or is on a slow connection far away. The above properties shouldn't be overridden in code. To provide a better default behavior and avoid overriding a sort of system property I'd suggest setting a timeout only if sun.net.client.default<Connect|Read>Timeout is not set. Maybe, one minute for connect and two for read?

comment:9 in reply to:  8 Changed 11 months ago by karsten

Replying to iwakeh:

As operator I might be able to know if some other host sits on the same rack in the same server room or is on a slow connection far away.

Well, I guess we can safely assume that none of our tools runs in such an environment.

The above properties shouldn't be overridden in code. To provide a better default behavior and avoid overriding a sort of system property I'd suggest setting a timeout only if sun.net.client.default<Connect|Read>Timeout is not set.

Right, I didn't mean to override those properties. I was rather thinking of switching to methods that allow us to specify connect/read timeouts.

But I also don't think we need to respect system properties starting with sun.net.*. We can just ignore them and set the timeout we think is right.

Maybe, one minute for connect and two for read?

Anything between 1 and 5 minutes should be okay to start with.

comment:10 Changed 10 months ago by iwakeh

Status: needs_informationaccepted

comment:11 Changed 10 months ago by iwakeh

Status: acceptedneeds_review

Please review the timeout patch, which sets read and connect timeouts to one minute unless the relevant system properties were set on the command line.

comment:12 Changed 10 months ago by iwakeh

Cc: metrics-team added

comment:13 Changed 10 months ago by karsten

Resolution: fixed
Status: needs_reviewclosed

Looks good, merged with another commit for the change log entry! Closing. Thanks!

comment:14 Changed 9 months ago by karsten

Component: MetricsMetrics/Library
Milestone: metrics-lib 2.3.0
Note: See TracTickets for help on using tickets.