Opened 2 years ago

Closed 2 years ago

#25100 closed enhancement (fixed)

Make CollecTor's webstats module use less RAM and wall time

Reported by: karsten
Owned by: iwakeh
Priority: High
Component: Metrics/CollecTor
Severity: Normal

Description

I re-processed 516 xz-compressed log files downloaded from webstats.tp.o with a total size of 4.4M. Here's what I found:

My first attempt to process these log files with 4G RAM failed after 10 minutes with a java.lang.OutOfMemoryError:

2018-01-31 13:42:18,189 INFO o.t.c.cron.Scheduler:73 Prepare single run for org.torproject.collector.webstats.SanitizeWeblogs.
2018-01-31 13:42:18,193 INFO o.t.c.cron.Scheduler:150 New Thread created: CollecTor-Scheduled-Thread-1
2018-01-31 13:42:18,194 INFO o.t.c.c.CollecTorMain:66 Starting webstats module of CollecTor.
2018-01-31 13:42:18,302 INFO o.t.c.w.SanitizeWeblogs:98 Found log files for 1 virtual hosts.
2018-01-31 13:42:18,302 INFO o.t.c.w.SanitizeWeblogs:105 Processing logs for metrics.torproject.org on meronense.torproject.org.
2018-01-31 13:53:02,368 ERROR o.t.c.c.CollecTorMain:71 The webstats module failed: null
java.lang.OutOfMemoryError: null
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
	at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
	at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
	at org.torproject.collector.webstats.SanitizeWeblogs.findCleanWrite(SanitizeWeblogs.java:110)
	at org.torproject.collector.webstats.SanitizeWeblogs.startProcessing(SanitizeWeblogs.java:87)
	at org.torproject.collector.cron.CollecTorMain.run(CollecTorMain.java:67)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOf(Arrays.java:3236)
	at sun.misc.IOUtils.readFully(IOUtils.java:60)
	at java.util.jar.JarFile.getBytes(JarFile.java:425)
	at java.util.jar.JarFile.getManifestFromReference(JarFile.java:193)
	at java.util.jar.JarFile.getManifest(JarFile.java:180)
	at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:981)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at ch.qos.logback.classic.spi.LoggingEvent.<init>(LoggingEvent.java:119)
	at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:419)
	at ch.qos.logback.classic.Logger.filterAndLog_2(Logger.java:414)
	at ch.qos.logback.classic.Logger.debug(Logger.java:490)
	at org.torproject.descriptor.log.WebServerAccessLogLine.makeLine(WebServerAccessLogLine.java:129)
	at org.torproject.collector.webstats.SanitizeWeblogs.lambda$lineStream$7(SanitizeWeblogs.java:189)
	at org.torproject.collector.webstats.SanitizeWeblogs$$Lambda$24/1590539684.apply(Unknown Source)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.Iterator.forEachRemaining(Iterator.java:116)
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
	at org.torproject.collector.webstats.SanitizeWeblogs.lineStream(SanitizeWeblogs.java:190)
	at org.torproject.collector.webstats.SanitizeWeblogs.lambda$findCleanWrite$1(SanitizeWeblogs.java:109)
2018-01-31 13:53:02,376 INFO o.t.c.c.ShutdownHook:23 Shutdown in progress ... 
2018-01-31 13:53:02,377 INFO o.t.c.cron.Scheduler:127 Waiting at most 10 minutes for termination of running tasks ... 
2018-01-31 13:53:02,377 INFO o.t.c.cron.Scheduler:132 Shutdown of all scheduled tasks completed successfully.
2018-01-31 13:53:02,378 INFO o.t.c.c.ShutdownHook:25 Shutdown finished. Exiting.

My second attempt with 8G RAM succeeded after 20 minutes and produced correct log files.

We need to make this code run much, much faster and consume much, much less RAM. We have roughly 1G of logs on webstats.tp.o, so we're talking about 0.5% of logs here.
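
For context, reading these xz-compressed logs as a stream, rather than inflating a whole file into memory, looks roughly like the sketch below. It uses Apache Commons Compress (which needs the org.tukaani xz library on the classpath); this only illustrates the streaming idea and is not necessarily how metrics-lib reads the files.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;

/** Sketch: stream an xz-compressed access log line by line so that
 * memory use stays constant regardless of the decompressed size. */
class XzLineStream {

  static long countLines(Path xzFile) throws IOException {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new XZCompressorInputStream(Files.newInputStream(xzFile)),
        StandardCharsets.UTF_8))) {
      return reader.lines().count();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(countLines(Paths.get(args[0])));
  }
}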

Child Tickets

Ticket   Type         Status  Owner         Summary
#25103   enhancement  closed  metrics-team  Improve webstats performance

Change History (20)

comment:1 Changed 2 years ago by iwakeh

Owner: changed from metrics-team to iwakeh
Status: new → accepted

I could reproduce this behavior; heap usage quickly climbs to just below 6G and stays there for the whole processing time, with a few peaks of up to 6.671G.
Looking further.

comment:2 Changed 2 years ago by iwakeh

See the patch in the child ticket for metrics-lib.
It shortens processing time of 516 meronense logs to four minutes.

2018-01-31 16:31:37,805 INFO o.t.c.c.CollecTorMain:66 Starting webstats module of CollecTor.
2018-01-31 16:31:37,910 INFO o.t.c.w.SanitizeWeblogs:98 Found log files for 1 virtual hosts.
2018-01-31 16:31:37,911 INFO o.t.c.w.SanitizeWeblogs:105 Processing logs for metrics.torproject.org on meronense.torproject.org.
2018-01-31 16:35:21,428 INFO o.t.c.c.CollecTorMain:68 Terminating webstats module of CollecTor.
2018-01-31 16:35:21,430 INFO o.t.c.c.ShutdownHook:23 Shutdown in progress ... 
2018-01-31 16:35:21,432 INFO o.t.c.cron.Scheduler:127 Waiting at most 10 minutes for termination of running tasks ... 
2018-01-31 16:35:21,432 INFO o.t.c.cron.Scheduler:132 Shutdown of all scheduled tasks completed successfully.
2018-01-31 16:35:21,433 INFO o.t.c.c.ShutdownHook:25 Shutdown finished. Exiting.

comment:3 Changed 2 years ago by karsten

I'll take a look in a bit. But let me ask anyway: does it affect the memory requirement, too?

comment:4 Changed 2 years ago by iwakeh

Less noticeably, but that requirement might not increase linearly with the amount of logs to process at all. I'm running tests with more logs (meronense.torproject.org and weschniakowii.torproject.org) once my download finishes.

comment:5 Changed 2 years ago by iwakeh

Please review two commits: one adapting to the changes in #25103 (this commit) and one using more parallelization.

The two additional metrics-lib tweaks, the adaptation for CollecTor, and the added parallelization improved time and memory usage a little.
The initial median memory usage was 5.867G (max 6.671G). Now it is possible to process the 516 meronense logs with 6G of memory (median 4.541G, max 5.647G) in less than three minutes (2:36 min).

Testing with 8867 logs ...

comment:6 Changed 2 years ago by karsten

Status: accepted → needs_revision

Commits a5f3d6a and 1873f12 look fine.

But these are all just tweaks that save some memory here and there. Should we also think about taking a different approach that scales better by design? Even if we can limit memory usage to 6G, that's far too much. Ideally, we'd keep the -Xmx2g setting for all of CollecTor, or maybe -Xmx4g. But imagine what we'd have to set when bulk-importing logs in 2019 or 2022.

How about we sanitize logs in two steps: in the first step we scan all input files just for the dates they contain, and in the second step we iterate over the input files in an order that lets us keep just a sliding window of the log lines needed to write the output files.
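
Roughly sketched (all names here are made up to illustrate the idea; plain-text input is assumed for brevity, and the actual sanitizing/writing step is omitted):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Two-pass sketch: pass 1 only records which dates occur in which
 * input files; pass 2 re-reads files one date at a time, so memory is
 * bounded by a single date's lines instead of the whole import. */
class TwoPassSketch {

  private static final Pattern DATE = Pattern.compile("\\[(\\d{2}/\\w{3}/\\d{4})");
  private static final DateTimeFormatter FMT =
      DateTimeFormatter.ofPattern("dd/MMM/yyyy", Locale.ENGLISH);

  private static LocalDate dateOf(String line) {
    Matcher m = DATE.matcher(line);
    return m.find() ? LocalDate.parse(m.group(1), FMT) : null;
  }

  static void sanitize(List<Path> inputs) throws IOException {
    /* Pass 1: scan all input files just for the dates they contain. */
    Map<LocalDate, Set<Path>> filesByDate = new TreeMap<>();
    for (Path file : inputs) {
      try (Stream<String> lines = Files.lines(file)) {
        lines.map(TwoPassSketch::dateOf).filter(Objects::nonNull)
            .forEach(d -> filesByDate.computeIfAbsent(d, x -> new TreeSet<>()).add(file));
      }
    }
    /* Pass 2: sliding window holding one date's worth of log lines. */
    for (Map.Entry<LocalDate, Set<Path>> e : filesByDate.entrySet()) {
      List<String> window = new ArrayList<>();
      for (Path file : e.getValue()) {
        try (Stream<String> lines = Files.lines(file)) {
          lines.filter(l -> e.getKey().equals(dateOf(l))).forEach(window::add);
        }
      }
      // sanitize 'window' and write the output file for e.getKey() here
    }
  }
}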

comment:7 in reply to: 6; Changed 2 years ago by iwakeh

True, so far we didn't trade memory for time, but got some improvements that could be picked up easily, even winning some time here.
Keeping counts of different sanitized lines in memory could also help and might be only a small change; I'm looking into this next.

But first, we should make sure that the performance tuning focuses on the usual scenario (not the rare bulk import) before starting bigger changes.

  1. The usual import amount will be logs of a few days, not a whole year's logs, right?
  2. Major bulk imports like the initial one should work, but will happen very rarely. Correct?

Do you have some reasonable figures as examples for each?

Last edited 2 years ago by iwakeh

comment:8 in reply to: 7; Changed 2 years ago by karsten

Replying to iwakeh:

> True, so far we didn't trade memory for time, but got some improvements that could be picked up easily, even winning some time here.
> Keeping counts of different sanitized lines in memory could also help and might be only a small change; I'm looking into this next.

Aha! That sounds very promising, too. Maybe even leave out the date part from sanitized lines and keep, for each sanitized line, a bag of the dates it occurs on. Something like Map<String, Bag<LocalDate>> (yes, I know that there's no Bag type in Java; time to add Apache Commons Collections?). And later, when we write sanitized logs, we simply put the date back in.
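
For illustration, roughly like this (the class and method names are made up; the point is just the Map/Bag shape, using commons-collections4):

import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.collections4.Bag;
import org.apache.commons.collections4.bag.HashBag;

/** Sketch: store each distinct sanitized line once, with a Bag that
 * counts how often the line occurs on each date; writing output later
 * just re-attaches the date and repeats the line getCount() times. */
class LineCounts {

  private final Map<String, Bag<LocalDate>> datesByLine = new HashMap<>();

  void count(String sanitizedLineWithoutDate, LocalDate date) {
    datesByLine
        .computeIfAbsent(sanitizedLineWithoutDate, k -> new HashBag<>())
        .add(date); // a Bag keeps duplicates as (element, count) pairs
  }

  int occurrences(String sanitizedLineWithoutDate, LocalDate date) {
    Bag<LocalDate> dates = datesByLine.get(sanitizedLineWithoutDate);
    return dates == null ? 0 : dates.getCount(date);
  }
}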

> But first, we should make sure that the performance tuning focuses on the usual scenario (not the rare bulk import) before starting bigger changes.
>
>   1. The usual import amount will be logs of a few days, not a whole year's logs, right?
>   2. Major bulk imports like the initial one should work, but will happen very rarely. Correct?
>
> Do you have some reasonable figures as examples for each?

Agreed that bulk imports are rare. Still, they may happen. Maybe the suggestion above resolves this relatively easily.

comment:9 in reply to: 8; Changed 2 years ago by iwakeh

Replying to karsten:

> Replying to iwakeh:
>
>> True, so far we didn't trade memory for time, but got some improvements that could be picked up easily, even winning some time here.
>> Keeping counts of different sanitized lines in memory could also help and might be only a small change; I'm looking into this next.
>
> Aha! That sounds very promising, too. Maybe even leave out the date part from sanitized lines and keep, for each sanitized line, a bag of the dates it occurs on. Something like Map<String, Bag<LocalDate>> (yes, I know that there's no Bag type in Java; time to add Apache Commons Collections?). And later, when we write sanitized logs, we simply put the date back in.

Depending on the target scenarios, it might also be very fruitful, and a reusable approach for other CollecTor modules, not to implement 'compression' (which the above is) by hand, but rather to use some in-memory database that compresses the highly redundant data at hand. Reasoning: the above-mentioned 8867 logs from weschniakowii and meronense combined are just 60M when xz-compressed and roughly 20G (plus/minus x) decompressed. Even if the in-memory db achieves a compression about ten times less efficient than xz, only 600M would be needed. Plus, we'd get some SQL(-like) query support.

If it works, we'd have a useful approach to recycle widely in the Metrics code base.
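
To make the idea concrete, here is a minimal sketch with H2 as one candidate embedded in-memory database; the schema and names are invented here, just to show the shape and the SQL support we'd get for free:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.time.LocalDate;

/** Sketch: push sanitized lines into an embedded in-memory database
 * instead of holding them in Java collections, then query per-date
 * aggregates with plain SQL. */
class InMemoryDbSketch {

  public static void main(String[] args) throws SQLException {
    try (Connection db = DriverManager.getConnection("jdbc:h2:mem:webstats")) {
      try (Statement st = db.createStatement()) {
        st.execute("CREATE TABLE log_lines (log_date DATE, line VARCHAR)");
      }
      try (PreparedStatement ins =
          db.prepareStatement("INSERT INTO log_lines VALUES (?, ?)")) {
        ins.setObject(1, LocalDate.of(2018, 1, 31));
        ins.setString(2, "0.0.0.0 - - [...] \"GET / HTTP/1.1\" 200 -");
        ins.execute();
      }
      // SQL(-like) query support comes for free:
      try (Statement st = db.createStatement();
          ResultSet rs = st.executeQuery(
              "SELECT log_date, line, COUNT(*) FROM log_lines "
                  + "GROUP BY log_date, line ORDER BY log_date")) {
        while (rs.next()) {
          System.out.printf("%s %s x%d%n",
              rs.getDate(1), rs.getString(2), rs.getLong(3));
        }
      }
    }
  }
}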

Thoughts?

comment:10 Changed 2 years ago by karsten

I think that an in-memory database is the second-best solution. The "manual compression" sounds more promising to me, because it leverages a specific redundancy of web server logs. Of course, we could further normalize the data and store request line parts in separate tables. But I'd say that the effort to make those code changes and get them reviewed is several times as high as using the suggested data structure, whereas we'd already achieve 2/3 or 3/4 of the possible improvement just from that, without using a database.

comment:11 Changed 2 years ago by iwakeh

Be aware that the compression achieved by the in-memory db will not be based on normalization and/or a complex db schema, but rather on an efficient, transparent underlying storage mechanism for strings and repeated values. It won't need the review time that deploying a real persistent db would require.

The "manual compression" also increases review and later maintenance time (I know that from reviewing older Metrics' code bases).

There would still be some one-time overhead in investigating which in-memory solution offers the wanted features and in getting it running.

I'll take a look at both the "manual compression" and the in-memory db in parallel; either one of them gets done very quickly, or a show-stopper comes up for the other approach.

comment:12 Changed 2 years ago by karsten

Uhh, sounds like we'd be relying on the database to implicitly do magic for us, whereas we could explicitly do magic ourselves. I think I'd want to see both solutions if we really want to go the internal database route without a normalized schema. In my (possibly naive) understanding, the improved data structure is a 20-line patch.

comment:13 Changed 2 years ago by iwakeh

Summary: Make CollecTor's webstats module use less RAM and CPU time → Make CollecTor's webstats module use less RAM and wall time

comment:14 Changed 2 years ago by iwakeh

Status: needs_revision → needs_review

Please review two small commits applying parallelization tweaks and making use of the metrics-lib changes.

With the metrics-lib task-25103 branch, meronense's 516 logs can be processed using 2G (heap usage stayed mostly below 1.5G, with three peaks slightly above 1.5G) in 2.5 min.

Test runs processing larger imports using 8G are still under way. So far, the memory handling looks fine, mostly well below 4G; only the processing of the aus1.tp.o logs, which are quite large, needed the full 8G. Considering that these are the logs of almost a year (327 days), this should be OK; CollecTor should now handle the regular daily imports, and the occasional import of a few months, with modest RAM and time demands.

I'll test further and post results, but I think the changes here and those in metrics-lib can be reviewed now.

comment:15 Changed 2 years ago by karsten

My Java 8 is not good enough to say much about commit 6264477, but at least it doesn't look wrong to me. And commit ae12c2e looks obviously correct. I say if it compiles and passes tests, let's do it.

Thanks for putting in this effort! I'm positive that it will be worth it.

Let me know when I should do something. Otherwise I'll just wait for more commits or a green light to merge.

comment:16 in reply to: 15; Changed 2 years ago by iwakeh

Replying to karsten:

> My Java 8 is not good enough to say much about commit 6264477, but at least it doesn't look wrong to me. And commit ae12c2e looks obviously correct. I say if it compiles and passes tests, let's do it.

Thanks for the quick review!

> Thanks for putting in this effort! I'm positive that it will be worth it.
>
> Let me know when I should do something. Otherwise I'll just wait for more commits or a green light to merge.

I didn't only mean review, but also testing, or whatever you did that led to this ticket :-)

Unless any bugs show up, I don't intend to push more commits.

comment:17 Changed 2 years ago by karsten

Alright, I'll run more tests on Monday, or maybe over the weekend as time permits.

comment:18 Changed 2 years ago by iwakeh

Using the two latest branches from this and the child ticket, I didn't find any more bugs while processing the following benchmarks, which should help determine how to approach the import.

The logs from weschniakowii for the second quarter of 2017 amount to 32M (compressed) and can be processed in 36 min using 8G. The entire year won't work with just 8G.

85 min and 16G are needed for the entire available archives of meronense and weschniakowii together (59M compressed). The median heap usage is 8.5G, the max 15.8G.

So, depending on the hardware, a conservative import strategy might be to import quarterly slices in successive CollecTor runs.

comment:19 Changed 2 years ago by karsten

Status: needs_reviewmerge_ready

Sounds good! Pushed to master! We can leave this open for additional tests or close it, up to you. Thanks!

comment:20 Changed 2 years ago by iwakeh

Resolution: fixed
Status: merge_ready → closed

If anything new comes up, it should be a new ticket.

Closing. Thanks!
