Opened 4 weeks ago

Closed 3 weeks ago

#32194 closed defect (fixed)

NetworkStatusEntryImpl#parseSLine (and probably other methods) is not thread-safe

Reported by: karsten Owned by: metrics-team
Priority: High Milestone:
Component: Metrics/Library Version:
Severity: Normal Keywords:
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: irl Sponsor:

Description

Turns out that creating multiple DescriptorReader instances in separate threads can cause thread-safety issues in NetworkStatusEntryImpl#parseSLine:

18:02:24.650 [Thread-1] INFO org.torproject.descriptor.DescriptorSourceFactory - Serving implementation org.torproject.descriptor.impl.DescriptorReaderImpl for descriptor.reader.
18:02:24.650 [Thread-5] INFO org.torproject.descriptor.DescriptorSourceFactory - Serving implementation org.torproject.descriptor.impl.DescriptorReaderImpl for descriptor.reader.
18:02:24.650 [Thread-3] INFO org.torproject.descriptor.DescriptorSourceFactory - Serving implementation org.torproject.descriptor.impl.DescriptorReaderImpl for descriptor.reader.
18:02:24.947 [Thread-10] ERROR org.torproject.descriptor.impl.DescriptorReaderImpl - Bug: uncaught exception or error while reading descriptors.
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at org.torproject.descriptor.impl.DescriptorParserImpl.parseOneDescriptor(DescriptorParserImpl.java:236)
	at org.torproject.descriptor.impl.DescriptorParserImpl.parseOneOrMoreDescriptors(DescriptorParserImpl.java:213)
	at org.torproject.descriptor.impl.DescriptorParserImpl.detectTypeAndParseDescriptors(DescriptorParserImpl.java:61)
	at org.torproject.descriptor.impl.DescriptorParserImpl.parseDescriptors(DescriptorParserImpl.java:33)
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readTarball(DescriptorReaderImpl.java:320)
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readTarballs(DescriptorReaderImpl.java:270)
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.run(DescriptorReaderImpl.java:161)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException: null
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.torproject.descriptor.impl.DescriptorParserImpl.parseOneDescriptor(DescriptorParserImpl.java:229)
	... 7 common frames omitted
Caused by: java.lang.NullPointerException: null
	at org.torproject.descriptor.impl.NetworkStatusEntryImpl.parseSLine(NetworkStatusEntryImpl.java:177)
	at org.torproject.descriptor.impl.NetworkStatusEntryImpl.parseStatusEntryBytes(NetworkStatusEntryImpl.java:98)
	at org.torproject.descriptor.impl.NetworkStatusEntryImpl.<init>(NetworkStatusEntryImpl.java:58)
	at org.torproject.descriptor.impl.RelayNetworkStatusConsensusImpl.parseStatusEntry(RelayNetworkStatusConsensusImpl.java:121)
	at org.torproject.descriptor.impl.NetworkStatusImpl.parseStatusEntries(NetworkStatusImpl.java:85)
	at org.torproject.descriptor.impl.NetworkStatusImpl.splitAndParseParts(NetworkStatusImpl.java:59)
	at org.torproject.descriptor.impl.NetworkStatusImpl.<init>(NetworkStatusImpl.java:27)
	at org.torproject.descriptor.impl.RelayNetworkStatusConsensusImpl.<init>(RelayNetworkStatusConsensusImpl.java:28)
	... 12 common frames omitted
18:02:24.948 [Thread-1] INFO org.torproject.descriptor.DescriptorSourceFactory - Serving implementation org.torproject.descriptor.impl.DescriptorReaderImpl for descriptor.reader.

Code to reproduce this in 1 out of 10 cases:

import org.torproject.descriptor.Descriptor;
import org.torproject.descriptor.DescriptorSourceFactory;
import org.torproject.descriptor.RelayNetworkStatusConsensus;

import java.io.File;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Main {
  public static void main(String[] args) {
    ExecutorService executor = Executors.newFixedThreadPool(15,
        (Runnable r) -> {
          Thread t = new Thread(r);
          //t.setDaemon(true);
          return t;
        });
    SortedSet<File> descriptorFiles = new TreeSet<>();
    for (File descriptorFile : new File("in").listFiles()) {
      descriptorFiles.add(descriptorFile);
    }
    for (int i = 0; i < 5; i++) {
      for (File descriptorFile : descriptorFiles) {
        Thread readerThread = new Thread() {
          @Override
          public void run() {
            parseDescriptorFile(descriptorFile);
          }
        };
        executor.submit(readerThread);
      }
    }
  }
  private static void parseDescriptorFile(File descriptorFile) {
    for (Descriptor descriptor
        : DescriptorSourceFactory.createDescriptorReader()
        .readDescriptors(descriptorFile)) {
      // ignore
      if (descriptor instanceof RelayNetworkStatusConsensus) {
        continue;
      }
    }
  }
}

There are probably similar issues with other methods.

Child Tickets

Change History (3)

comment:1 Changed 4 weeks ago by karsten

Reviewer: irl
Status: newneeds_review

While writing a fix for this particular issue with parsing "s" lines I didn't find any similar issues. We do have similar code in Onionoo, which is not thread-safe either. But we're also not using multiple threads in the Onionoo updater, so that's not an issue.

I'm going to do some more testing with multi-threaded parsing in metrics-lib in the near future. Just in case there are more issues, I might spot them. For now, I'll assume this was the only such case.

Please review commit a99e20b in my task-32194 branch.

comment:2 Changed 3 weeks ago by irl

Status: needs_reviewmerge_ready

LGTM.

I have noticed that the Onionoo updater is finishing very close to the next start time. We're seeing that sometimes it is taking too long (hence we get a pile of alerts from Nagios that the time is older than 2 hours, which get resolved quite quickly). We should be thinking about thread-safety in any new code to avoid this being a problem later.

comment:3 in reply to:  2 Changed 3 weeks ago by karsten

Resolution: fixed
Status: merge_readyclosed

Replying to irl:

LGTM.

Great! Merging.

I have noticed that the Onionoo updater is finishing very close to the next start time. We're seeing that sometimes it is taking too long (hence we get a pile of alerts from Nagios that the time is older than 2 hours, which get resolved quite quickly). We should be thinking about thread-safety in any new code to avoid this being a problem later.

Ah, in that specific case we don't have to worry about thread-safety, because we're never running two Onionoo updaters at the same time. We're using this method for scheduling runs. Its documentation says: "If any execution of this task takes longer than its period, then subsequent executions may start late, but will not concurrently execute." That doesn't mean we shouldn't be looking into why executions take so long. But at least that doesn't cause new trouble with concurrent executions.

Closing this ticket. Thanks!

Note: See TracTickets for help on using tickets.