#25259 closed enhancement (fixed)

Tune advbwdist module of metrics-web

Reported by: iwakeh
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Statistics
Version:
Severity: Normal
Keywords:
Cc: metrics-team
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

A first step to improve performance, mostly memory usage: processing the current 350M advbwdist-validafter.csv input already uses up to 7G.

(I'll post a branch once there is a ticket number.)

Change History (6)

comment:1 Changed 14 months ago by iwakeh

Please review these tuning steps.
From the commit comment:
Processing advbwdist-validafter.csv (350M) took 150 seconds and used up to 7G.
Performing pre-processing separately, helping R by defining read types, and avoiding multiple casting operations halved the processing time (to 77 seconds) and reduced the necessary memory to about 25% (approx. 1.8G). The resulting advbwdist.csv files are identical.
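
To illustrate the "define read types, cast once" idea in a language-neutral way, here is a minimal Python sketch. It is not the R code from the commit. The column names (`date`, `isexit`, `advbw`) and the helper `read_typed` are hypothetical, and the real layout of advbwdist-validafter.csv may differ.

```python
import csv
import io

# Hypothetical column layout; the real advbwdist-validafter.csv may differ.
COLUMN_TYPES = {"date": str, "isexit": str, "advbw": int}


def read_typed(handle, column_types=COLUMN_TYPES):
    """Yield rows with each field converted exactly once, at read time."""
    reader = csv.DictReader(handle)
    for row in reader:
        # Declaring the target type per column up front avoids repeated
        # casting of the same values later in the pipeline.
        yield {name: cast(row[name]) for name, cast in column_types.items()}


# Small in-memory sample standing in for the real input file.
sample = io.StringIO(
    "date,isexit,advbw\n"
    "2018-02-14,t,1048576\n"
    "2018-02-14,,524288\n"
)
rows = list(read_typed(sample))
```

Because `read_typed` is a generator, a caller that aggregates while iterating never has to hold the full input in memory. The `list(...)` call above is only there to keep the sample easy to inspect.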

In the future it will be necessary to split the aggregation process by year, or to store the data for years that won't change anymore and combine that existing data with freshly computed data.
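
The second option (keep results for finished years and recompute only the rest) could look roughly like the following Python sketch. It rests on several assumptions: the `date` and `advbw` columns are hypothetical, the median stands in for whatever statistics the module actually computes, and `aggregate_by_year` is an illustrative helper, not part of metrics-web.

```python
import csv
import io
from collections import defaultdict
from statistics import median


def aggregate_by_year(handle, cache):
    """Compute a per-year median, skipping years already present in cache."""
    pending = defaultdict(list)
    for row in csv.DictReader(handle):
        year = row["date"][:4]  # assumes ISO dates such as 2018-01-03
        if year not in cache:
            # Only rows from years without a stored result are kept in memory.
            pending[year].append(int(row["advbw"]))
    fresh = {year: median(values) for year, values in pending.items()}
    # Stored results for finished years are combined with the fresh ones.
    return {**cache, **fresh}


# Small in-memory sample standing in for the real input file.
sample = io.StringIO(
    "date,advbw\n"
    "2017-05-01,100\n"
    "2017-06-01,300\n"
    "2018-01-01,10\n"
    "2018-01-02,20\n"
    "2018-01-03,60\n"
)
cache = {"2017": 200}  # result stored by an earlier run (hypothetical)
result = aggregate_by_year(sample, cache)
```

In this sketch memory use scales with the years still being recomputed rather than with the whole input. A real implementation would also need to decide when a year counts as final and where the stored results live.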

comment:2 Changed 14 months ago by iwakeh

Status: new → needs_review

comment:3 Changed 14 months ago by karsten

Those look like good tweaks. I copied the R file to the server and will let it run tonight. If that succeeds (and I think it should), I'll merge to master tomorrow. Thanks!

comment:4 Changed 14 months ago by iwakeh

Please find another commit tweaking memory usage and processing time a little more. The result is identical to the result from the current master branch.

Anyway, future changes should either split the input data (as suggested in comment:1) or move the module to Java. R is just always 'pieces' that don't scale well together.

Last edited 14 months ago by iwakeh

comment:5 in reply to: 4 Changed 14 months ago by karsten

Replying to iwakeh:

Please find another commit tweaking memory usage and processing time a little more. The result is identical to the result from the current master branch.

Great! I copied the new R file to the server and will let it run this afternoon. If that succeeds, I'll squash and merge to master. Thanks!

Anyway, future changes should either split the input data (as suggested in comment:1) or move the module to Java. R is just always 'pieces' that don't scale well together.

Full ack.

comment:6 Changed 14 months ago by karsten

Resolution: fixed
Status: needs_review → closed

Squashed, rebased, and pushed to master. Closing. Thanks!
