Make CollecTor's webstats module use less RAM and wall time
I re-processed 516 xz-compressed log files downloaded from webstats.tp.o with a total size of 4.4M. Here's what I found:
My first attempt to process these log files with 4G RAM failed after 10 minutes with a java.lang.OutOfMemoryError:
2018-01-31 13:42:18,189 INFO o.t.c.cron.Scheduler:73 Prepare single run for org.torproject.collector.webstats.SanitizeWeblogs.
2018-01-31 13:42:18,193 INFO o.t.c.cron.Scheduler:150 New Thread created: CollecTor-Scheduled-Thread-1
2018-01-31 13:42:18,194 INFO o.t.c.c.CollecTorMain:66 Starting webstats module of CollecTor.
2018-01-31 13:42:18,302 INFO o.t.c.w.SanitizeWeblogs:98 Found log files for 1 virtual hosts.
2018-01-31 13:42:18,302 INFO o.t.c.w.SanitizeWeblogs:105 Processing logs for metrics.torproject.org on meronense.torproject.org.
2018-01-31 13:53:02,368 ERROR o.t.c.c.CollecTorMain:71 The webstats module failed: null
java.lang.OutOfMemoryError: null
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.torproject.collector.webstats.SanitizeWeblogs.findCleanWrite(SanitizeWeblogs.java:110)
at org.torproject.collector.webstats.SanitizeWeblogs.startProcessing(SanitizeWeblogs.java:87)
at org.torproject.collector.cron.CollecTorMain.run(CollecTorMain.java:67)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3236)
at sun.misc.IOUtils.readFully(IOUtils.java:60)
at java.util.jar.JarFile.getBytes(JarFile.java:425)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:193)
at java.util.jar.JarFile.getManifest(JarFile.java:180)
at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:981)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at ch.qos.logback.classic.spi.LoggingEvent.<init>(LoggingEvent.java:119)
at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:419)
at ch.qos.logback.classic.Logger.filterAndLog_2(Logger.java:414)
at ch.qos.logback.classic.Logger.debug(Logger.java:490)
at org.torproject.descriptor.log.WebServerAccessLogLine.makeLine(WebServerAccessLogLine.java:129)
at org.torproject.collector.webstats.SanitizeWeblogs.lambda$lineStream$7(SanitizeWeblogs.java:189)
at org.torproject.collector.webstats.SanitizeWeblogs$$Lambda$24/1590539684.apply(Unknown Source)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.torproject.collector.webstats.SanitizeWeblogs.lineStream(SanitizeWeblogs.java:190)
at org.torproject.collector.webstats.SanitizeWeblogs.lambda$findCleanWrite$1(SanitizeWeblogs.java:109)
2018-01-31 13:53:02,376 INFO o.t.c.c.ShutdownHook:23 Shutdown in progress ...
2018-01-31 13:53:02,377 INFO o.t.c.cron.Scheduler:127 Waiting at most 10 minutes for termination of running tasks ...
2018-01-31 13:53:02,377 INFO o.t.c.cron.Scheduler:132 Shutdown of all scheduled tasks completed successfully.
2018-01-31 13:53:02,378 INFO o.t.c.c.ShutdownHook:25 Shutdown finished. Exiting.
My second attempt with 8G RAM succeeded after 20 minutes and produced correct log files.
We need to make this code run much faster and consume far less RAM. webstats.tp.o holds roughly 1G of logs in total, so the sample processed here is only about 0.5% of what the module will eventually have to handle.
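The stack trace suggests the OOM hits while a stream collector materializes all parsed log lines for a virtual host before reducing them (SanitizeWeblogs.lineStream / findCleanWrite). One direction worth exploring, sketched below with hypothetical names (countByDate, StreamingCounts) that are not part of CollecTor's actual API: fold each line into a running per-date count while streaming the decompressed file, so memory scales with the number of distinct keys rather than the number of log lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: aggregate per-date counts while streaming a log,
 *  instead of collecting every parsed line into memory first. */
public class StreamingCounts {

  /** Counts lines per date token (assumed here to be the first
   *  whitespace-separated field). Memory use is proportional to the
   *  number of distinct dates, not the number of log lines. */
  static Map<String, Long> countByDate(BufferedReader reader) throws IOException {
    Map<String, Long> counts = new TreeMap<>();
    String line;
    while ((line = reader.readLine()) != null) {
      int sp = line.indexOf(' ');
      String date = sp < 0 ? line : line.substring(0, sp);
      counts.merge(date, 1L, Long::sum);  // fold into running count, discard the line
    }
    return counts;
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for a decompressed access-log stream.
    String sample = "2018-01-31 GET /\n2018-01-31 GET /about\n2018-02-01 GET /\n";
    Map<String, Long> counts =
        countByDate(new BufferedReader(new StringReader(sample)));
    System.out.println(counts);  // {2018-01-31=2, 2018-02-01=1}
  }
}
```

In the real module the reader would wrap the xz-decompressing input stream per file, and the reduction key would be whatever the sanitizer groups by; the point is only that the full parsed-line list never needs to exist at once.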