Opened 3 years ago

Closed 3 years ago

#20335 closed defect (not a bug)

ReferenceChecker causes OOM

Reported by: iwakeh Owned by:
Priority: High Milestone:
Component: Metrics/CollecTor Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: #20519 Points:
Reviewer: Sponsor:

Description

In general, statistics and analysis should be separated from normal functioning.

After a sync-run ReferenceChecker hogs CPU and memory. It might need to be disabled for the first sync-release.

The implementation should be redesigned.

Child Tickets

Change History (15)

comment:1 Changed 3 years ago by iwakeh

In addition, the reported statistics are somewhat doubtful, but I didn't explicitly verify.

comment:2 in reply to:  description Changed 3 years ago by karsten

Replying to iwakeh:

In general, statistics and analysis should be separated from normal functioning.

Separated how? We need to run these checks every hour to learn early if we're missing more descriptors than usual.

After a sync-run ReferenceChecker hogs CPU and memory. It might need to be disabled for the first sync-release.

Do you have an idea why it would do that after a sync run and not after a usual download run? I'd rather not want to disable the checker in a release and instead fix it. I can try to take a look if you give me something to reproduce the issue.

The implementation should be redesigned.

How? Maybe we can get away by fixing it for the moment.

comment:3 in reply to:  1 Changed 3 years ago by karsten

Replying to iwakeh:

In addition, the reported statistics are somewhat doubtful, but I didn't explicitly verify.

Can you elaborate?

comment:4 Changed 3 years ago by iwakeh

Steps to reproduce (from #18910, the discussion should take place here):
Collector branch: Run the jar in a new location, i.e. with empty recent and out. Configure relaydescs and sync. RunOnce will be sufficient the OOM heap dump contains very many Reference objects. (Your assumption in one of the previous comments hinted in the right direction, I think, but I couldn't verify that yet.)

comment:5 Changed 3 years ago by iwakeh

Suggestion for how to separate statistics:

Well, mostly the code for the stats tries to implement database functionality, i.e. persistence between runs, queries, and cleaning data items that are too old.

I think a simple hsqldb would be fine here. This needs just the hsqldb dependency (available in debian). It creates the db if it is not there, below the stats path for example. The data could be written from the newly introduced *Persistence classes. The clean-up would be just a query per table (there are foreign keys and cascades available). Schema and access rights are delivered with inside collector.jar hidden from the user. So, there is some freedom to adapt and change with time.

An additional benefit would be that this db could be queried by command line for additional information. Or, the date could easily even be transferred to a 'real' dbms for continuous analysis, if necessary.

Last edited 3 years ago by iwakeh (previous) (diff)

comment:6 Changed 3 years ago by iwakeh

I forgot: such a db could solve the microdesc validAfter-date-in-out-path question, too.

comment:7 in reply to:  4 ; Changed 3 years ago by karsten

Replying to iwakeh:

Steps to reproduce (from #18910, the discussion should take place here):
Collector branch: Run the jar in a new location, i.e. with empty recent and out. Configure relaydescs and sync. RunOnce will be sufficient the OOM heap dump contains very many Reference objects. (Your assumption in one of the previous comments hinted in the right direction, I think, but I couldn't verify that yet.)

Hmm, I didn't run out of memory while checking references. And I don't quite understand how that would happen in an initial run, because the reference checker is run before the sync manager even starts to run.

I did run out of heap space while reading descriptors in a second run:

$ java -DLOGBASE=log -jar collector-1.1.0-dev.jar 
2016-10-13 12:54:55,995 ERROR o.t.d.i.DescriptorReaderImpl:192 Bug: uncaught exception or error while reading descriptors: Java heap space
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_102]
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) ~[na:1.8.0_102]
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) ~[na:1.8.0_102]
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) ~[na:1.8.0_102]
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readFile(DescriptorReaderImpl.java:378) ~[collector-1.1.0-dev.jar:1.1.0-dev-5eb89c1]
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readDescriptors(DescriptorReaderImpl.java:284) ~[collector-1.1.0-dev.jar:1.1.0-dev-5eb89c1]
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.run(DescriptorReaderImpl.java:188) ~[collector-1.1.0-dev.jar:1.1.0-dev-5eb89c1]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_102]

Can you try to put the reference checker back in and see if you still run out of memory somewhere? I'm not yet convinced that it's the reference checker that is to blame here.

comment:8 in reply to:  7 Changed 3 years ago by iwakeh

Replying to karsten:

...

I did run out of heap space while reading descriptors in a second run:

Right, this is where the OOM occurs. Without the reference checker out, there was no oom.

But, I'll test this again with the reference checker included after applying the other changes from #18910.

comment:9 Changed 3 years ago by iwakeh

I think I found a way to avoid the oom:

The sync-process provides collector with many (in the constant*10E+6 range) new descriptor references that are actually stale. I added a check for the valid date to addReference, i.e.

 
   private void addReference(String referencing, String referenced,
       double weight, long expiresAfterMillis) {
-    this.references.add(new Reference(referencing.toUpperCase(),
-        referenced.toUpperCase(), weight, expiresAfterMillis));
+    if (this.currentTimeMillis <= expiresAfterMillis) {
+      this.references.add(new Reference(referencing.toUpperCase(),
+          referenced.toUpperCase(), weight, expiresAfterMillis));
+    }

This seems to reduce the used memory; the sync-test is still running ...

Is there anything problematic with this change that I missed?

comment:10 Changed 3 years ago by karsten

Priority: MediumHigh

I believe that's a smart thing to do. I'm not sure if that would break something.

But I think it doesn't solve the issue here. Or, I ran out of memory with a similar patch applied. And I think I know why:

-rw-r--r--  1 karsten  staff   2.1G Oct 18 11:52 /Users/karsten/src/collector/recent/relay-descriptors/votes/2016-10-18-09-22-46-votes

With the recent change to append all votes to a single file, we're creating a 2.1G large file when doing the first synchronization run. However, metrics-lib wants to read that entire file to memory which probably doubles in size when it's parsed. Ugh!

Note that the CollecTor's reference checker is just one application that breaks when input files are too large. Other applications using metrics-lib would similarly be affected.

A longer-term fix might be to change metrics-lib to read through large files and pause when it has parsed enough descriptors that the application still has to process.

A short-term fix would be to stick with the current naming scheme for files in the recent/ directory, as that would have avoided this issue.

Fun times!

comment:11 in reply to:  10 Changed 3 years ago by iwakeh

Replying to karsten:

I believe that's a smart thing to do. I'm not sure if that would break something.

Unfortunately, you're right and this was not the reason for the oom :-/

But I think it doesn't solve the issue here. Or, I ran out of memory with a similar patch applied. And I think I know why:

-rw-r--r--  1 karsten  staff   2.1G Oct 18 11:52 /Users/karsten/src/collector/recent/relay-descriptors/votes/2016-10-18-09-22-46-votes

With the recent change to append all votes to a single file, we're creating a 2.1G large file when doing the first synchronization run. However, metrics-lib wants to read that entire file to memory which probably doubles in size when it's parsed. Ugh!

Note that the CollecTor's reference checker is just one application that breaks when input files are too large. Other applications using metrics-lib would similarly be affected.

A longer-term fix might be to change metrics-lib to read through large files and pause when it has parsed enough descriptors that the application still has to process.

Did you verify that metrics-lib is the culprit? Like, reading the 2G file for parsing?

A short-term fix would be to stick with the current naming scheme for files in the recent/ directory, as that would have avoided this issue.

The commit for the review in #18910 is easily adaptable, if necessary.

Fun times!

Wouldn't it be boring without the unforeseen ;-)

comment:12 Changed 3 years ago by karsten

Yes, metrics-lib is to blame. Here's a small class I wrote to show that:

package oom;

import org.torproject.descriptor.DescriptorReader;
import org.torproject.descriptor.DescriptorSourceFactory;

import java.io.File;

public class Main {
  public static void main(String[] args) {
    DescriptorReader descriptorReader =
        DescriptorSourceFactory.createDescriptorReader();
    descriptorReader.addDirectory(new File("in"));
    descriptorReader.readDescriptors();
  }
}

And here's the output, once with the default heap size and once with 8G:

$ javac -d bin/ -cp lib/descriptor-1.4.0.jar src/oom/Main.java && java -cp bin/:lib/descriptor-1.4.0.jar:lib/slf4j-api-1.7.7.jar:lib/logback-classic-1.1.2.jar:lib/logback-core-1.1.2.jar:lib/commons-compress-1.9.jar oom.Main
19:53:31.906 [Thread-0] ERROR o.t.d.impl.DescriptorReaderImpl - Bug: uncaught exception or error while reading descriptors: Java heap space
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_102]
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) ~[na:1.8.0_102]
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) ~[na:1.8.0_102]
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) ~[na:1.8.0_102]
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readFile(DescriptorReaderImpl.java:378) ~[descriptor-1.4.0.jar:1.4.0-adf4a67]
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readDescriptors(DescriptorReaderImpl.java:284) ~[descriptor-1.4.0.jar:1.4.0-adf4a67]
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.run(DescriptorReaderImpl.java:188) ~[descriptor-1.4.0.jar:1.4.0-adf4a67]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_102]
$ javac -d bin/ -cp lib/descriptor-1.4.0.jar src/oom/Main.java && java -Xmx8g -cp bin/:lib/descriptor-1.4.0.jar:lib/slf4j-api-1.7.7.jar:lib/logback-classic-1.1.2.jar:lib/logback-core-1.1.2.jar:lib/commons-compress-1.9.jar oom.Main
19:54:02.620 [Thread-0] ERROR o.t.d.impl.DescriptorReaderImpl - Bug: uncaught exception or error while reading descriptors: null
java.lang.OutOfMemoryError: null
	at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) ~[na:1.8.0_102]
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) ~[na:1.8.0_102]
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) ~[na:1.8.0_102]
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) ~[na:1.8.0_102]
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readFile(DescriptorReaderImpl.java:378) ~[descriptor-1.4.0.jar:1.4.0-adf4a67]
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readDescriptors(DescriptorReaderImpl.java:284) ~[descriptor-1.4.0.jar:1.4.0-adf4a67]
	at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.run(DescriptorReaderImpl.java:188) ~[descriptor-1.4.0.jar:1.4.0-adf4a67]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_102]

I can't think of a clean way to avoid this problem other than keeping votes in the recent/ folder in separate files. Seriously, I didn't think of this error case at all. Wow.

comment:13 Changed 3 years ago by iwakeh

Good to know the true reason now!

I'll add a metrics-lib ticket for further work on that. (see #20395)

Regarding Collector it would also cause troubles if a Collector instance syncs from another Collector that has such files. So the grouping by acquisition date needs to be postponed until there is a fix in metrics-lib.

Last edited 3 years ago by iwakeh (previous) (diff)

comment:14 Changed 3 years ago by iwakeh

Parent ID: #20519

comment:15 in reply to:  13 Changed 3 years ago by karsten

Resolution: not a bug
Status: newclosed

Replying to iwakeh:

Good to know the true reason now!

I'll add a metrics-lib ticket for further work on that. (see #20395)

Okay, I think this means we can resolve this issue, because the real issue is #20395, and the fix to that will magically lead to the ReferenceChecker not OOMing anymore. Please re-open if this was wrong. Closing. Thanks!

Note: See TracTickets for help on using tickets.