#25317 closed defect (fixed)

Enable webstats to process large (> 2G) logfiles

Reported by: iwakeh Owned by: iwakeh
Priority: High Milestone:
Component: Metrics/CollecTor Version:
Severity: Normal Keywords:
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: Sponsor:


Quote from #25161, comment 12:

Looking at the stack trace and the input log files, I noticed that two log files are larger than 2G when decompressed:

3.2G in/webstats/archeotrichon.torproject.org/dist.torproject.org-access.log-20160531
584K in/webstats/archeotrichon.torproject.org/dist.torproject.org-access.log-20160531.xz
2.1G in/webstats/archeotrichon.torproject.org/dist.torproject.org-access.log-20160601
404K in/webstats/archeotrichon.torproject.org/dist.torproject.org-access.log-20160601.xz

I just ran another bulk import with just those two files as import and ran into the same exception.

It seems like we shouldn't attempt to decompress these files into a byte[] in FileType.decompress, because Java can only handle arrays with up to 2 billion elements: https://en.wikipedia.org/wiki/Criticism_of_Java#Large_arrays . Maybe we should work with streams there, not byte[].

Child Tickets

#25329closediwakehEnable metrics-lib to process large (> 2G) logfilesMetrics/Library

Change History (9)

comment:1 Changed 16 months ago by iwakeh

See #25329 for the metrics-lib changes. When these are done the CollecTor changes are only a few.

Last edited 16 months ago by iwakeh (previous) (diff)

comment:2 Changed 16 months ago by iwakeh

Status: assignedaccepted

comment:3 Changed 16 months ago by iwakeh

Status: acceptedneeds_review

Please review two commits this patch branch based on the current master branch. The first commit adapts collector to the changes introduced to metrics-lib when adding the log line sub-interfaces. The next tackles the memory issues. See commit comment for details. There is also a speed-up of 50% compared to the previous version.

This depends on the metrics-lib patch of #25329.

comment:4 Changed 16 months ago by iwakeh

Status: needs_reviewneeds_revision

This patch needs revision. One more memory issue to clean out.

Last edited 16 months ago by iwakeh (previous) (diff)

comment:5 Changed 16 months ago by iwakeh

Status: needs_revisionneeds_review

Please review another commit on the current patch branch.

Timing with the test batch is a little faster and memory usage looks way healthier now (stays between 2G and 9G mostly around 5G with very few peaks at 11G). A more concave build-up, that drops down again to even less than 2G often. The higher memory usage is during the compressing and writing phase which makes sense.

Processing yearly batches for larger imports is much faster by about a factor of three on the aroides logs.

comment:6 Changed 16 months ago by karsten

Status: needs_reviewneeds_revision

Two issues:

  • Should return new LocalDate[]{LocalDate.MAX, LocalDate.MIN}; have MAX and MIN exchanged? If not, can you document why this is correct?
  • In the block after if (count >= LISTLIMIT) {, the local variable count is not reset. This means that this code will be run after each line! Should be sufficient to set count = 0; in that block.

comment:7 Changed 16 months ago by iwakeh

Status: needs_revisionneeds_review

Thanks for catching that!
Please also review the fixup commit.

comment:8 Changed 16 months ago by karsten

Status: needs_reviewmerge_ready

Looks good! Ready to be merged.

comment:9 Changed 16 months ago by karsten

Resolution: fixed
Status: merge_readyclosed

Merged, closing. Thanks!

Note: See TracTickets for help on using tickets.