Opened 8 months ago

Closed 7 months ago

#32747 closed defect (fixed)

Avoid reprocessing webstats files

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/CollecTor Version:
Severity: Normal Keywords:
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Web servers typically provide us with the last 14 days of request logs. We shouldn't process the whole 14 days over and over. Instead we should only process new logs files and any other log files containing log lines from newly written dates.

In some cases web servers stop serving a given virtual host or stop acting as web server at all. However, in these cases we're left with 14 days of logs per virtual host. Ideally, these logs would get cleaned up, but until that's the case, we should at least not reprocess these files over and over.

Child Tickets

Change History (5)

comment:1 Changed 8 months ago by karsten

Status: assignedneeds_review

I spent pretty much the whole week on this ticket to understand what exactly the current code is doing, writing tests for it, figuring out why tests broke due to using a security manager (!), and finally coding a patch that does what this ticket is about.

I obviously didn't run this code on the server yet, but I expect that runtime of this module will shrink from 45 minutes every six hours to something around 10 minutes. And maybe we're lucky and it resolves #31901 for us, though I didn't look into that yet.

Please review:

comment:2 Changed 7 months ago by irl

Status: needs_reviewneeds_revision

The metrics-base commit is ok.

The tests look good.

Why have you removed WebServerAccessLogPersistence to duplicate the code now in collector/webstats/SanitizeWeblogs.java? WebServerAccessLogPersistence is still used by the sync code so we get two versions of the same thing both in use.

comment:3 in reply to:  2 Changed 7 months ago by karsten

Status: needs_revisionneeds_review

Replying to irl:

The metrics-base commit is ok.

The tests look good.

Great! I'll merge these when the other commit looks good, too.

Why have you removed WebServerAccessLogPersistence to duplicate the code now in collector/webstats/SanitizeWeblogs.java? WebServerAccessLogPersistence is still used by the sync code so we get two versions of the same thing both in use.

The reason was that we now have to calculate the output path of a web server access log before sanitizing, and it seemed easier to just pull the rest of the path-computing logic over from the persistence class. But I see your point about having this code in two places now.

I took an alternative approach by adding another constructor to the persistence class that we can use when we don't have a sanitized web server access log yet. Please review squash commit 94b8709 when you get the chance. And please also review squash commit 76db495 which fixes a minor bug that I found while testing today. Thanks!

comment:4 Changed 7 months ago by irl

Status: needs_reviewmerge_ready

Both commits lgtm.

comment:5 Changed 7 months ago by karsten

Resolution: fixed
Status: merge_readyclosed

Thanks for look! Squashed and pushed all commits. I'll put out a new release today or tomorrow to get this deployed soon. Closing.

Note: See TracTickets for help on using tickets.