Opened 2 years ago

Closed 2 years ago

#25100 closed enhancement (fixed)

Make CollecTor's webstats module use less RAM and wall time

Reported by: karsten
Owned by: iwakeh
Priority: High
Component: Metrics/CollecTor
Severity: Normal

Description

I re-processed 516 xz-compressed log files downloaded from webstats.tp.o with a total size of 4.4M. Here's what I found:

My first attempt to process these log files with 4G RAM failed after 10 minutes with a java.lang.OutOfMemoryError:

2018-01-31 13:42:18,189 INFO o.t.c.cron.Scheduler:73 Prepare single run for org.torproject.collector.webstats.SanitizeWeblogs.
2018-01-31 13:42:18,193 INFO o.t.c.cron.Scheduler:150 New Thread created: CollecTor-Scheduled-Thread-1
2018-01-31 13:42:18,194 INFO o.t.c.c.CollecTorMain:66 Starting webstats module of CollecTor.
2018-01-31 13:42:18,302 INFO o.t.c.w.SanitizeWeblogs:98 Found log files for 1 virtual hosts.
2018-01-31 13:42:18,302 INFO o.t.c.w.SanitizeWeblogs:105 Processing logs for metrics.torproject.org on meronense.torproject.org.
2018-01-31 13:53:02,368 ERROR o.t.c.c.CollecTorMain:71 The webstats module failed: null
java.lang.OutOfMemoryError: null
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
	at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
	at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
	at org.torproject.collector.webstats.SanitizeWeblogs.findCleanWrite(SanitizeWeblogs.java:110)
	at org.torproject.collector.webstats.SanitizeWeblogs.startProcessing(SanitizeWeblogs.java:87)
	at org.torproject.collector.cron.CollecTorMain.run(CollecTorMain.java:67)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOf(Arrays.java:3236)
	at sun.misc.IOUtils.readFully(IOUtils.java:60)
	at java.util.jar.JarFile.getBytes(JarFile.java:425)
	at java.util.jar.JarFile.getManifestFromReference(JarFile.java:193)
	at java.util.jar.JarFile.getManifest(JarFile.java:180)
	at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:981)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at ch.qos.logback.classic.spi.LoggingEvent.<init>(LoggingEvent.java:119)
	at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:419)
	at ch.qos.logback.classic.Logger.filterAndLog_2(Logger.java:414)
	at ch.qos.logback.classic.Logger.debug(Logger.java:490)
	at org.torproject.descriptor.log.WebServerAccessLogLine.makeLine(WebServerAccessLogLine.java:129)
	at org.torproject.collector.webstats.SanitizeWeblogs.lambda$lineStream$7(SanitizeWeblogs.java:189)
	at org.torproject.collector.webstats.SanitizeWeblogs$$Lambda$24/1590539684.apply(Unknown Source)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.Iterator.forEachRemaining(Iterator.java:116)
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
	at org.torproject.collector.webstats.SanitizeWeblogs.lineStream(SanitizeWeblogs.java:190)
	at org.torproject.collector.webstats.SanitizeWeblogs.lambda$findCleanWrite$1(SanitizeWeblogs.java:109)
2018-01-31 13:53:02,376 INFO o.t.c.c.ShutdownHook:23 Shutdown in progress ... 
2018-01-31 13:53:02,377 INFO o.t.c.cron.Scheduler:127 Waiting at most 10 minutes for termination of running tasks ... 
2018-01-31 13:53:02,377 INFO o.t.c.cron.Scheduler:132 Shutdown of all scheduled tasks completed successfully.
2018-01-31 13:53:02,378 INFO o.t.c.c.ShutdownHook:25 Shutdown finished. Exiting.

My second attempt with 8G RAM succeeded after 20 minutes and produced correct log files.

We need to make this code run much, much faster and consume much, much less RAM. We have roughly 1G of logs on webstats.tp.o, so we're talking about 0.5% of logs here.
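
For context, reading these xz-compressed logs as a stream, rather than inflating a whole file into memory, looks roughly like the sketch below. It uses Apache Commons Compress (which needs the org.tukaani xz library on the classpath); this only illustrates the streaming idea and is not necessarily how metrics-lib reads the files.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;

/** Sketch: stream an xz-compressed access log line by line so that
 * memory use stays constant regardless of the decompressed size. */
class XzLineStream {

  static long countLines(Path xzFile) throws IOException {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new XZCompressorInputStream(Files.newInputStream(xzFile)),
        StandardCharsets.UTF_8))) {
      return reader.lines().count();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(countLines(Paths.get(args[0])));
  }
}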

Child Tickets

Ticket   Type         Status  Owner         Summary
#25103   enhancement  closed  metrics-team  Improve webstats performance

Change History (20)

comment:1 Changed 2 years ago by iwakeh

Owner: changed from metrics-team to iwakeh
Status: new → accepted

I could reproduce this behavior; heap usage quickly climbs to just below 6G and stays there for the whole processing time, with a few peaks of up to 6.671G.
Looking further.

comment:2 Changed 2 years ago by iwakeh

See the patch in the child ticket for metrics-lib.
It shortens processing time of 516 meronense logs to four minutes.

2018-01-31 16:31:37,805 INFO o.t.c.c.CollecTorMain:66 Starting webstats module of CollecTor.
2018-01-31 16:31:37,910 INFO o.t.c.w.SanitizeWeblogs:98 Found log files for 1 virtual hosts.
2018-01-31 16:31:37,911 INFO o.t.c.w.SanitizeWeblogs:105 Processing logs for metrics.torproject.org on meronense.torproject.org.
2018-01-31 16:35:21,428 INFO o.t.c.c.CollecTorMain:68 Terminating webstats module of CollecTor.
2018-01-31 16:35:21,430 INFO o.t.c.c.ShutdownHook:23 Shutdown in progress ... 
2018-01-31 16:35:21,432 INFO o.t.c.cron.Scheduler:127 Waiting at most 10 minutes for termination of running tasks ... 
2018-01-31 16:35:21,432 INFO o.t.c.cron.Scheduler:132 Shutdown of all scheduled tasks completed successfully.
2018-01-31 16:35:21,433 INFO o.t.c.c.ShutdownHook:25 Shutdown finished. Exiting.

comment:3 Changed 2 years ago by karsten

I'll take a look in a bit. But let me ask anyway: does it affect the memory requirement, too?

comment:4 Changed 2 years ago by iwakeh

Less noticeably, but that requirement might not increase linearly with the amount of logs to process at all. I'm running tests with more logs (meronense.torproject.org and weschniakowii.torproject.org) once my download finishes.

comment:5 Changed 2 years ago by iwakeh

Please review two commits: one adapting to the changes in #25103 (this commit) and one using more parallelization.

The two additional metrics-lib tweaks, the adaptation for CollecTor, and the added parallelization improved time and memory usage a little.
The initial median memory usage was 5.867G (max 6.671G). Now it is possible to process the 516 meronense logs with 6G of memory (median 4.541G, max 5.647G) in less than three minutes (2:36 min).

Testing with 8867 logs ...

comment:6 Changed 2 years ago by karsten

Status: accepted → needs_revision

Commits a5f3d6a and 1873f12 look fine.

But these are all just tweaks that save some memory here and there. Should we also think about taking a different approach that scales better by design? Even if we can limit memory usage to 6G, that's far too much. Ideally, we'd keep the -Xmx2g setting for all of CollecTor, or maybe -Xmx4g. But imagine what we'd have to set when bulk-importing logs in 2019 or 2022.

How about we sanitize logs in two steps: in the first step we scan all input files just for the dates they contain, and in the second step we iterate over the input files in an order that lets us keep just a sliding window of the log lines needed to write the output files.
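
Roughly sketched (all names here are made up to illustrate the idea; plain-text input is assumed for brevity, and the actual sanitizing/writing step is omitted):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Two-pass sketch: pass 1 only records which dates occur in which
 * input files; pass 2 re-reads files one date at a time, so memory is
 * bounded by a single date's lines instead of the whole import. */
class TwoPassSketch {

  private static final Pattern DATE = Pattern.compile("\\[(\\d{2}/\\w{3}/\\d{4})");
  private static final DateTimeFormatter FMT =
      DateTimeFormatter.ofPattern("dd/MMM/yyyy", Locale.ENGLISH);

  private static LocalDate dateOf(String line) {
    Matcher m = DATE.matcher(line);
    return m.find() ? LocalDate.parse(m.group(1), FMT) : null;
  }

  static void sanitize(List<Path> inputs) throws IOException {
    /* Pass 1: scan all input files just for the dates they contain. */
    Map<LocalDate, Set<Path>> filesByDate = new TreeMap<>();
    for (Path file : inputs) {
      try (Stream<String> lines = Files.lines(file)) {
        lines.map(TwoPassSketch::dateOf).filter(Objects::nonNull)
            .forEach(d -> filesByDate.computeIfAbsent(d, x -> new TreeSet<>()).add(file));
      }
    }
    /* Pass 2: sliding window holding one date's worth of log lines. */
    for (Map.Entry<LocalDate, Set<Path>> e : filesByDate.entrySet()) {
      List<String> window = new ArrayList<>();
      for (Path file : e.getValue()) {
        try (Stream<String> lines = Files.lines(file)) {
          lines.filter(l -> e.getKey().equals(dateOf(l))).forEach(window::add);
        }
      }
      // sanitize 'window' and write the output file for e.getKey() here
    }
  }
}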

comment:7 in reply to: 6; Changed 2 years ago by iwakeh

True, so far we didn't trade memory for time, but got some improvements that could be picked up easily, even winning some time here.
Keeping counts of different sanitized lines in memory could also help and might be only a small change; I'm looking into this next.

But first, we should make sure that the performance tuning focuses on the usual scenario (not the rare bulk import) before starting bigger changes.

  1. The usual import amount will be logs of a few days, not a whole year's logs, right?
  2. Major bulk imports like the initial one should work, but will happen very rarely. Correct?

Do you have some reasonable figures as examples for each?

Last edited 2 years ago by iwakeh

comment:8 in reply to: 7; Changed 2 years ago by karsten

Replying to iwakeh:

> True, so far we didn't trade memory for time, but got some improvements that could be picked up easily, even winning some time here.
> Keeping counts of different sanitized lines in memory could also help and might be only a small change; I'm looking into this next.

Aha! That sounds very promising, too. Maybe even leave out the date part from sanitized lines and keep, for each sanitized line, a bag of the dates it occurs on. Something like Map<String, Bag<LocalDate>> (yes, I know that there's no Bag type in Java; time to add Apache Commons Collections?). And later, when we write sanitized logs, we simply put the date back in.
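
For illustration, roughly like this (the class and method names are made up; the point is just the Map/Bag shape, using commons-collections4):

import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.collections4.Bag;
import org.apache.commons.collections4.bag.HashBag;

/** Sketch: store each distinct sanitized line once, with a Bag that
 * counts how often the line occurs on each date; writing output later
 * just re-attaches the date and repeats the line getCount() times. */
class LineCounts {

  private final Map<String, Bag<LocalDate>> datesByLine = new HashMap<>();

  void count(String sanitizedLineWithoutDate, LocalDate date) {
    datesByLine
        .computeIfAbsent(sanitizedLineWithoutDate, k -> new HashBag<>())
        .add(date); // a Bag keeps duplicates as (element, count) pairs
  }

  int occurrences(String sanitizedLineWithoutDate, LocalDate date) {
    Bag<LocalDate> dates = datesByLine.get(sanitizedLineWithoutDate);
    return dates == null ? 0 : dates.getCount(date);
  }
}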

> But first, we should make sure that the performance tuning focuses on the usual scenario (not the rare bulk import) before starting bigger changes.
>
>   1. The usual import amount will be logs of a few days, not a whole year's logs, right?
>   2. Major bulk imports like the initial one should work, but will happen very rarely. Correct?
>
> Do you have some reasonable figures as examples for each?

Agreed that bulk imports are rare. Still, they may happen. Maybe the suggestion above resolves this relatively easily.

comment:9 in reply to: 8; Changed 2 years ago by iwakeh

Replying to karsten:

> Replying to iwakeh:
>
>> True, so far we didn't trade memory for time, but got some improvements that could be picked up easily, even winning some time here.
>> Keeping counts of different sanitized lines in memory could also help and might be only a small change; I'm looking into this next.
>
> Aha! That sounds very promising, too. Maybe even leave out the date part from sanitized lines and keep, for each sanitized line, a bag of the dates it occurs on. Something like Map<String, Bag<LocalDate>> (yes, I know that there's no Bag type in Java; time to add Apache Commons Collections?). And later, when we write sanitized logs, we simply put the date back in.

Depending on the target scenarios, it might also be very fruitful, and a reusable approach for other CollecTor modules, not to implement 'compression' (which the above is) by hand, but rather to use some in-memory database that compresses the highly redundant data at hand. Reasoning: the above-mentioned 8867 logs from weschniakowii and meronense combined are just 60M when xz-compressed and roughly 20G (plus/minus x) decompressed. Even if the in-memory db achieves a compression about ten times less efficient than xz, only 600M would be needed. Plus, we'd get some SQL(-like) query support.

If it works, we'd have a useful approach to recycle widely in the Metrics code base.
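
To make the idea concrete, here is a minimal sketch with H2 as one candidate embedded in-memory database; the schema and names are invented here, just to show the shape and the SQL support we'd get for free:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.time.LocalDate;

/** Sketch: push sanitized lines into an embedded in-memory database
 * instead of holding them in Java collections, then query per-date
 * aggregates with plain SQL. */
class InMemoryDbSketch {

  public static void main(String[] args) throws SQLException {
    try (Connection db = DriverManager.getConnection("jdbc:h2:mem:webstats")) {
      try (Statement st = db.createStatement()) {
        st.execute("CREATE TABLE log_lines (log_date DATE, line VARCHAR)");
      }
      try (PreparedStatement ins =
          db.prepareStatement("INSERT INTO log_lines VALUES (?, ?)")) {
        ins.setObject(1, LocalDate.of(2018, 1, 31));
        ins.setString(2, "0.0.0.0 - - [...] \"GET / HTTP/1.1\" 200 -");
        ins.execute();
      }
      // SQL(-like) query support comes for free:
      try (Statement st = db.createStatement();
          ResultSet rs = st.executeQuery(
              "SELECT log_date, line, COUNT(*) FROM log_lines "
                  + "GROUP BY log_date, line ORDER BY log_date")) {
        while (rs.next()) {
          System.out.printf("%s %s x%d%n",
              rs.getDate(1), rs.getString(2), rs.getLong(3));
        }
      }
    }
  }
}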

Thoughts?

comment:10 Changed 2 years ago by karsten

I think that an in-memory database is the second-best solution. The "manual compression" sounds more promising to me, because it leverages a specific redundancy of web server logs. Of course, we could further normalize the data and store request line parts in separate tables. But I'd say that the effort to make those code changes and get them reviewed is several times as high as using the suggested data structure, whereas we'd already achieve 2/3 or 3/4 of the possible improvement just from that, without using a database.

comment:11 Changed 2 years ago by iwakeh

Be aware that the compression achieved by the in-memory db will not be based on normalization and/or a complex db schema, but rather on an efficient, transparent underlying storage mechanism for strings and repeated values. It won't need the review time that deploying a real persistent db would require.

The "manual compression" also increases review and later maintenance time (I know that from reviewing older Metrics' code bases).

There would still be some one-time overhead in investigating which in-memory solution offers the wanted features and in getting it running.

I'll take a look at both the "manual compression" and the in-memory db in parallel; either one of them gets done very quickly, or a show-stopper comes up for the other approach.

comment:12 Changed 2 years ago by karsten

Uhh, sounds like we'd be relying on the database to implicitly do magic for us, whereas we could explicitly do magic ourselves. I think I'd want to see both solutions if we really want to go the internal database route without a normalized schema. In my (possibly naive) understanding, the improved data structure is a 20-line patch.

comment:13 Changed 2 years ago by iwakeh

Summary: Make CollecTor's webstats module use less RAM and CPU time → Make CollecTor's webstats module use less RAM and wall time

comment:14 Changed 2 years ago by iwakeh

Status: needs_revision → needs_review

Please review two small commits applying parallelization tweaks and making use of the metrics-lib changes.

With the metrics-lib task-25103 branch, meronense's 516 logs can be processed using 2G (heap usage stayed mostly below 1.5G, with three peaks slightly above 1.5G) in 2.5 min.

Test runs processing larger imports using 8G are still under way. So far, the memory handling looks fine, mostly well below 4G; only the processing of the aus1.tp.o logs, which are quite large, needed the full 8G. Considering that these are the logs of almost a year (327 days), this should be OK; CollecTor should now handle the regular daily imports, and the occasional import of a few months, with modest RAM and time demands.

I'll test further and post results, but I think the changes here and those in metrics-lib can be reviewed now.

comment:15 Changed 2 years ago by karsten

My Java 8 is not good enough to say much about commit 6264477, but at least it doesn't look wrong to me. And commit ae12c2e looks obviously correct. I say if it compiles and passes tests, let's do it.

Thanks for putting in this effort! I'm positive that it will be worth it.

Let me know when I should do something. Otherwise I'll just wait for more commits or a green light to merge.

comment:16 in reply to: 15; Changed 2 years ago by iwakeh

Replying to karsten:

> My Java 8 is not good enough to say much about commit 6264477, but at least it doesn't look wrong to me. And commit ae12c2e looks obviously correct. I say if it compiles and passes tests, let's do it.

Thanks for the quick review!

> Thanks for putting in this effort! I'm positive that it will be worth it.
>
> Let me know when I should do something. Otherwise I'll just wait for more commits or a green light to merge.

I didn't only mean review, but also testing, or whatever you did that led to this ticket :-)

Unless any bugs show up, I don't intend to push more commits.

comment:17 Changed 2 years ago by karsten

Alright, I'll run more tests on Monday, or maybe over the weekend as time permits.

comment:18 Changed 2 years ago by iwakeh

Using the two latest branches from this and the child ticket, I didn't find any more bugs while processing the following benchmarks, which should help determine how to approach the import.

The logs from weschniakowii for the second quarter of 2017 amount to 32M (compressed) and can be processed in 36 min using 8G. The entire year won't work with just 8G.

85 min and 16G are needed for the entire available archives of meronense and weschniakowii together (59M compressed). The median heap usage is 8.5G, the max 15.8G.

So, depending on the hardware, a conservative import strategy might be to import quarterly slices in successive CollecTor runs.

comment:19 Changed 2 years ago by karsten

Status: needs_reviewmerge_ready

Sounds good! Pushed to master! We can leave this open for additional tests or close it, up to you. Thanks!

comment:20 Changed 2 years ago by iwakeh

Resolution: fixed
Status: merge_ready → closed

If anything new comes up, it should be a new ticket.

Closing. Thanks!
