Opened 2 years ago

Closed 16 months ago

#22428 closed enhancement (fixed)

Add webstats module

Reported by: iwakeh Owned by: iwakeh
Priority: High Milestone: CollecTor 1.5.0
Component: Metrics/CollecTor Version:
Severity: Normal Keywords: metrics-2018
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Port the python code for sanitizing weblogs and add a webstat module to collector.

Child Tickets

Change History (60)

comment:1 Changed 2 years ago by iwakeh

Status: newneeds_information

Could some weblog samples (clean and real) be made available under NDA?
This involves a bit regex tweaking and I need examples for tests.

comment:2 Changed 2 years ago by iwakeh

Owner: changed from metrics-team to iwakeh
Status: needs_informationaccepted

comment:3 Changed 23 months ago by iwakeh

Currently, treated weblogs contain lines like this:

0.0.0.0 - - [30/May/2017:00:00:00 +0000] "GET /images/android-icon.png HTTP/1.1" 200 21630 "-" "-" -

Could/should that not be reduced to the following?

30/May/2017 "GET /images/android-icon.png HTTP/1.1" 200 21630

comment:4 Changed 23 months ago by iwakeh

Status: acceptedneeds_information

comment:5 Changed 23 months ago by iwakeh

And, second question: It would probably be more convenient for processing to use numbers instead of names for the months.

comment:6 Changed 23 months ago by karsten

No, the goal is to keep Apache's Combined Log Format, so that the resulting sanitized logs can still be evaluated using standard tools.

comment:7 Changed 23 months ago by iwakeh

Status: needs_informationaccepted

Alright, that makes sense. I'll add that to the comments for future reference.

comment:8 Changed 23 months ago by iwakeh

Status: acceptedneeds_information

Comments and suggestions welcome!

Input-Output-Specification

Input: How are logs provided to CollecTor

Current assumption:
local directory w/o subdirectories; files are compressed; the filename contains the host's name (e.g. metrics.torproject.org) and the date.

Output: How to store cleaned logs in out and recent

Current assumption:
The hostname of the collecting server (e.g. meronense.torproject.org) is not used for path creation.

recent

recent/webstats/<hostname>-access.log-<year-month-day>.xz,
e.g.,
recent/webstats/metrics.torproject.org-access.log-20160829.xz

out

out/webstats/<year>/<month>/<day>/<hostname>-access.log-<year-month-day>.xz
for example:
out/webstats/2016/08/29/metrics.torproject.org-access.log-20160829.xz.

comment:9 Changed 23 months ago by karsten

First bit of feedback (more may follow): wouldn't it be problematic to leave out the hostname of the collecting server in the output path? There are several hosts serving the main Tor website, and each of them will create a log file for that domain name for a given day.

Note that I don't have access to the current webstats host, but I would imagine that not all files are stored in a directory without subdirectories. The existing code might reveal something about that, but I haven't looked.

comment:10 in reply to:  9 Changed 23 months ago by iwakeh

Replying to karsten:

First bit of feedback (more may follow): wouldn't it be problematic to leave out the hostname of the collecting server in the output path? There are several hosts serving the main Tor website, and each of them will create a log file for that domain name for a given day.

Good point! The hostname is used as directory name on webstats.torproject.org, but doesn't seem to be part of current shell/python processing, which takes place in a directory containing the files to be cleaned (afaict).

Note that I don't have access to the current webstats host, but I would imagine that not all files are stored in a directory without subdirectories. The existing code might reveal something about that, but I haven't looked.

As mentioned above, but it is very easy to simply look for files in subfolder, too.

So all in all, the main question left is how the access logs and the log meta data are provided. So far, meta data seems to be in the log-filename and the path (i.e. the hostname).
Could the logs be 'delivered' as compressed files having meta data in their filename, e.g.,
<measuring-host>-<served-hostname>-access.log-<year-month-day>.xz
This is just an example any other combination will be fine, important is what and how meta-data is communicated.

comment:11 Changed 23 months ago by karsten

Hmmm, the metadata question is really a good one. Ideally, we'd include metadata in the first lines, but I didn't find anything about comments or metadata or ignored lines in the Apache documentation that we could use here.

Putting metadata in paths and file names is something we're just giving up in the case of bridge network statuses. Ideally, file contents would be self-contained, if that's possible in this case.

Another option could be to put sanitized web logs into tarballs of their own and include a small metadata file. This might have other downsides, though.

Maybe there are other ways?

comment:12 in reply to:  11 Changed 23 months ago by iwakeh

Replying to karsten:

Hmmm, the metadata question is really a good one. Ideally, we'd include metadata in the first lines, but I didn't find anything about comments or metadata or ignored lines in the Apache documentation that we could use here.

Putting metadata in paths and file names is something we're just giving up in the case of bridge network statuses. Ideally, file contents would be self-contained, if that's possible in this case.

Another option could be to put sanitized web logs into tarballs of their own and include a small metadata file. This might have other downsides, though.

Maybe there are other ways?

The output side is something we can build and your suggestion here is one solution.
The open question is how the 'real' access logs and the log meta data are provided to CollecTor?

comment:13 Changed 23 months ago by iwakeh

Regarding the output of meta-data:

Shouldn't applications processing Apache access logs ignore 'funny' lines?
If yes, we could include meta-data lines, i.e.:

measuring meronense.torproject.org
measured metrics.torproject.org
date 20160829

comment:14 Changed 23 months ago by iwakeh

Input will be compressed files in a file system, most likely in/...

comment:15 Changed 23 months ago by karsten

Regarding metadata in output files, I think the least bad solution is to prefix existing filenames with the-measuring.host-. Adding funny/invalid data lines just for metadata seems worse, and wrapping files into tarballs together with a metadata file might make it unnecessary complex to process files. Maybe we can't give up on parsing metadata from filenames just yet, though we can still try to reduce it to cases like this where adding metadata to files is not a good option.

An upside of this approach is that we can easily copy over existing sanitized web logs without opening them, adding metadata, and closing them. The only thing we need to do is rename them, like https://webstats.torproject.org/out/meronense.torproject.org/metrics.torproject.org-access.log-20170704.xz to meronense.torproject.org-metrics.torproject.org-access.log-20170704.xz.

comment:16 Changed 23 months ago by iwakeh

Status: needs_informationaccepted

Ok, the metadata-path seems the most pragmatic approach. I'm fine with this.

comment:17 Changed 23 months ago by iwakeh

Milestone: CollecTor 1.3.0

comment:18 Changed 22 months ago by iwakeh

The question about a new Descriptor type in metrics-lib was moved to its own ticket #22983.

comment:19 Changed 22 months ago by iwakeh

Status: acceptedneeds_review

Please review this branch with the webstats implementation.
It depends on the functionality provided in metrics-lib task #22983.

It implements local import with log-sanitation as well as sync-functionality.
The commits try to make the different steps clear.

comment:20 Changed 22 months ago by iwakeh

Status: needs_reviewaccepted

This will be revised soon, b/c of the new changes in metrics-lib (#22983).

comment:21 Changed 22 months ago by iwakeh

Status: acceptedneeds_review

Please review this branch, which relies on the latest branch of #22983.

(The four commits just separate the code into more review-able chunks, but all commits and a metrics-lib-2.0.0-dev.jar are necessary for compile.)

comment:22 Changed 21 months ago by iwakeh

Status: needs_reviewaccepted

Setting to accepted as this is waiting on #22983, which might lead to changes.

comment:23 Changed 21 months ago by iwakeh

Milestone: CollecTor 1.3.0CollecTor 1.4.0

Move to next milestone in order to accommodate interim release.

comment:24 Changed 20 months ago by iwakeh

Status: acceptedneeds_information

Two more properties are necessary; here a list of all properties for the webstats module:

######## Tor Weblogs ########
#
## Define descriptor sources
#  possible values: Local, Sync
WebstatsSources = Local
## Relative path to directory to import logfiles from.
WebstatsLocalOrigins = in/webstats
# Location for keeping temporary log before publication.
WebstatsTemporaryPath = temp-webstats
# Date from which to calculate publishable sanitized logs.
# Default value 'Now' uses the current system date.
# Format for fixed dates: yyyymmdd
# Example: WebstatsReferenceDate = 20170801
WebstatsReferenceDate = Now

WebstatsSources and WebstatsSources are the usual ones for CollecTorMain modules and just added here for completeness.

WebstatsTemporaryPath: specifies the work directory for CollecTor, because logs have to be two days old before being published.

WebstatsReferenceDate: enables bulk imports up to a certain date. Default setting uses the system date.

Does this make sense?

comment:25 Changed 20 months ago by karsten

I'm not too deep into this topic right now, so handle the following comment with care.

I wonder if we can avoid having that directory for temporary log files that cannot be published yet. It seems like a possible source for trouble when processing breaks at some point and we need to fix that, with half of a log file being written to the temporary directory and the rest still being in the import directory.

Maybe we can simplify that by keeping a text file in stats/ where we keep some state which files we already read or wrote. And we only write a file to out/ and recent/ when it's ready for publication. Not sure if this will solve all cases, but it seems potentially easier to understand for future operators of this service (including ourselves when we don't remember these design discussions anymore).

Regarding WebstatsReferenceDate, it would be good to explain in the comments when this value needs to be changed, and to what value. The comment alone should be sufficient to know how to use the property, without further looking at the code.

comment:26 in reply to:  25 ; Changed 20 months ago by iwakeh

Replying to karsten:

I'm not too deep into this topic right now, so handle the following comment with care.

Will do :-) but this is really asked from a meta perspective and the main difference to all other descriptors is that logfiles should only be published after their completion (i.e., two days after their dates).

I wonder if we can avoid having that directory for temporary log files that cannot be published yet. It seems like a possible source for trouble when processing breaks at some point and we need to fix that, with half of a log file being written to the temporary directory and the rest still being in the import directory.

I would want to have a separation here, b/c log files from the import directory are not sanitized and should not be published. On the other hand, log descriptors in the temporary location are possibly not yet complete and already published log descriptors should not be altered (e.g., by appending and resorting them). If anything breaks, the incompletely written files in the working dir ought to be removed.
I would also want to avoid the stats file.

Maybe we can simplify that by keeping a text file in stats/ where we keep some state which files we already read or wrote. And we only write a file to out/ and recent/ when it's ready for publication. Not sure if this will solve all cases, but it seems potentially easier to understand for future operators of this service (including ourselves when we don't remember these design discussions anymore).

The explanation could be more elaborate and maybe the property renamed to WebstatsSanitizingPath or some better name?
The stats file option could be misleading. For example, if another local re-import leads to overwriting a sanitized not yet published log. A temporary sanitizing-working directory makes clear that only CollecTor touches files in there and stuff from 'in' could be removed after a processing round. That ought to be easier for operation: "don't touch the sanitizing-working directory" and treat the input directory as with other modules?

Regarding WebstatsReferenceDate, it would be good to explain in the comments when this value needs to be changed, and to what value. The comment alone should be sufficient to know how to use the property, without further looking at the code.

I was thinking that it might be useful to be able to have partial imports of older logs, hmm. This might be trickier than just documenting the property. Example: add all July 2017 logs to in and set reference date to 20170801 means that only logs up to (incl.) 20170730 are published in that round.

Let's extend this discussion:
What other operation scenarios the webstats module will have to be prepared to deal with and do these have to be available with the initial release of webstats?

comment:27 Changed 20 months ago by iwakeh

Priority: MediumHigh

The open topics block this issue.

comment:28 in reply to:  26 ; Changed 20 months ago by karsten

Replying to iwakeh:

Replying to karsten:

I'm not too deep into this topic right now, so handle the following comment with care.

Will do :-) but this is really asked from a meta perspective and the main difference to all other descriptors is that logfiles should only be published after their completion (i.e., two days after their dates).

Well, discussing these things, even from a meta perspective, requires (sometimes: deep) thinking. Just saying, keep in mind that you're much, much deeper into this topic than I am at this time.

I wonder if we can avoid having that directory for temporary log files that cannot be published yet. It seems like a possible source for trouble when processing breaks at some point and we need to fix that, with half of a log file being written to the temporary directory and the rest still being in the import directory.

I would want to have a separation here, b/c log files from the import directory are not sanitized and should not be published. On the other hand, log descriptors in the temporary location are possibly not yet complete and already published log descriptors should not be altered (e.g., by appending and resorting them).

I think I agree on all statements above.

If anything breaks, the incompletely written files in the working dir ought to be removed.

Wait, is that correct? What if the sanitizing process aborts at a random point, possibly due to the host losing power in the middle of a run? Would the operator simply delete everything from the working directory in that case?

I'm pointing this out, because I was in the situation with the old webstats code. The code seemed plausible while writing it. But that changed a few months later when something broke and I had to figure out how to recover. I'd like to avoid such situations with this new code.

I would also want to avoid the stats file.

Yes, if we can avoid it, let's avoid it. Maybe we can use one just to avoid re-processing files unnecessarily. So, if it's empty or gone, nothing bad happens.

Maybe we can simplify that by keeping a text file in stats/ where we keep some state which files we already read or wrote. And we only write a file to out/ and recent/ when it's ready for publication. Not sure if this will solve all cases, but it seems potentially easier to understand for future operators of this service (including ourselves when we don't remember these design discussions anymore).

The explanation could be more elaborate and maybe the property renamed to WebstatsSanitizingPath or some better name?
The stats file option could be misleading. For example, if another local re-import leads to overwriting a sanitized not yet published log.

Not sure I understand what you mean here.

A temporary sanitizing-working directory makes clear that only CollecTor touches files in there

... except when the operator needs to repair something and has to touch these files, too.

and stuff from 'in' could be removed after a processing round.

No, we shouldn't remove anything from in/. We didn't put files there, so we shouldn't remove them, either.

That ought to be easier for operation: "don't touch the sanitizing-working directory" and treat the input directory as with other modules?

A few thoughts on how this might work using a stats file:

  • We read files from in/.
  • We write fully sanitized files to out/ and recent/, but only if we're certain that we won't have more data later on that would require us to update files there, because we wouldn't do that.
  • If there's a file in in/ that contains lines that we couldn't put into a file in out/ and recent/, we will simply process that in/-file again next time.
  • We might want to use a file in stats/ to remember which files in in/ are already completely processed, so that we can skip them.
  • We never delete anything from in/ but let the script do that that also places files in there.

Please note that I'm not sure yet whether this will work. It just seems like something that is relatively easy to operate, in particular when something breaks.

Regarding WebstatsReferenceDate, it would be good to explain in the comments when this value needs to be changed, and to what value. The comment alone should be sufficient to know how to use the property, without further looking at the code.

I was thinking that it might be useful to be able to have partial imports of older logs, hmm. This might be trickier than just documenting the property. Example: add all July 2017 logs to in and set reference date to 20170801 means that only logs up to (incl.) 20170730 are published in that round.

Let's extend this discussion:
What other operation scenarios the webstats module will have to be prepared to deal with and do these have to be available with the initial release of webstats?

I think the following scenarios are most common:

  • Initialize by processing log files from the past 2 weeks up to now. Similarly, re-process in case of change to sanitizing steps.
  • Do a periodic run every few hours.

Note that there are no archives of non-sanitized logs that would reach back more than 2 weeks. That's different with bridge descriptor tarballs.

comment:29 Changed 20 months ago by karsten

Summary: add webstats module to collectorAdd webstats module

Tweak the summary (start with capital letter, remove redundant component name).

comment:30 in reply to:  28 Changed 20 months ago by iwakeh

Replying to karsten:

Replying to iwakeh:

Replying to karsten:

I'm not too deep into this topic right now, so handle the following comment with care.

Will do :-) but this is really asked from a meta perspective and the main difference to all other descriptors is that logfiles should only be published after their completion (i.e., two days after their dates).

Well, discussing these things, even from a meta perspective, requires (sometimes: deep) thinking. Just saying, keep in mind that you're much, much deeper into this topic than I am at this time.

Thanks for your thoughts! I intended to hear the outside operationally focused view.

...

Now I have something to think and work about/with. A bunch of real logfiles for testing would be really great! Roughly the two week amount of successive logs should be sufficient to encountered odd edge cases and have a good testing ground. These also will help testing the metrics-lib part #22983 well.
Could you provide these?

comment:31 Changed 20 months ago by karsten

Keywords: metrics-2018 added

comment:32 Changed 20 months ago by karsten

Keywords: metrics-2017 added; metrics-2018 removed

comment:33 Changed 20 months ago by iwakeh

Status: needs_informationneeds_review

Please review [​https://gitweb.torproject.org/user/iwakeh/collector.git/?h=task-22428-3 this task branch], which is based on the current master and already uses java 8 features.

The implementation changed quite a bit as it could build on the web server log specification (#23243). That's why I didn't continue on the earlier branch, because I think the code is easier to review in total here.
It builds on 2.1.0-dev version of metrics-lib.
Besides the (j)unit tests I ran system tests for sync and local import functionality with the supplied real logs.
The metrics-lib version needs to be adapted as soon as a release with log descriptors is available.

comment:34 Changed 20 months ago by iwakeh

Milestone: CollecTor 1.4.0CollecTor 1.5.0

Move to next version.

comment:35 in reply to:  33 Changed 20 months ago by iwakeh

Please review [​https://gitweb.torproject.org/user/iwakeh/collector.git/?h=task-22428-4 this task branch], which is based on the current master and already uses java 8 features and the latest metrics-lib.

comment:36 Changed 19 months ago by karsten

Status: needs_reviewneeds_revision

Alright, I finished an initial review of commit 086e904 in your task-22428-4 branch. I have several trivial or minor findings, but I'd like to postpone them until we have resolved one that I consider major:

I'm unclear whether the sibling approach is robust enough to cover all cases and edge cases. Maybe even worse, I'm unclear whether we'd notice if we'd be running into an uncovered edge case or if we'd silently not process and therefore lose data.

For example, what happens if we sanitize logs from a server that receives very few requests, maybe only a few requests per week? Consider these original log files (where I scrubbed the virtual host name):

  • scrubbed.torproject.org-access.log-20171001.gz contains requests from 2017-09-30 and 2017-10-01.
  • scrubbed.torproject.org-access.log-20171002.gz contains requests from 2017-10-01 only.
  • scrubbed.torproject.org-access.log-20171004.gz contains requests from 2017-10-03 only.
  • scrubbed.torproject.org-access.log-20171006.gz contains requests from 2017-10-05 and 2017-10-06.

Would the existing code produce logs for 2017-10-01, -03, -05, and -06 with exactly the sanitized log lines from these original log files? (I didn't run it, I only read the code and am unclear about this.)

Here's another, related question: what happens if a web server rotates logs more often than once per day? At least that's something that we write in the specification. I'm not sure how this would work with file names, so maybe we in fact require that logs are rotated exactly once per day, and we just didn't write that in the specification yet. However, it seems rather restrictive to prescribe exact log rotation intervals in order to sanitize logs subsequently. Maybe we should be less restrictive here.

Is there a way to make this approach more robust? And is there a way to ensure that we'll learn about any broken assumptions as early as possible?

Ah, and do you mind doing another round of JavaDoc editing and variable renaming towards finding a middle ground between 2-characters-is-almost-verbose and 80-characters-can-fit-in-a-line-so-let-us-not-use-more-than-79? As a fixup/squash commit without rebasing, please. :) Thank you!

comment:37 in reply to:  36 Changed 19 months ago by iwakeh

Replying to karsten:

Alright, I finished an initial review of commit 086e904 in your task-22428-4 branch. I have several trivial or minor findings, but I'd like to postpone them until we have resolved one that I consider major:

Good, it's better to address the small stuff at the very end :-)

I'm unclear whether the sibling approach is robust enough to cover all cases and edge cases. Maybe even worse, I'm unclear whether we'd notice if we'd be running into an uncovered edge case or if we'd silently not process and therefore lose data.

This is not the siblings approach question, but the general question: When is the log for a certain day done?
I'll address this in more detail on #23243, because this is a specification issue and this ticket here should only be concerned with implementation, imo.

For example, what happens if we sanitize logs from a server that receives very few requests, maybe only a few requests per week? Consider these original log files (where I scrubbed the virtual host name):

  • scrubbed.torproject.org-access.log-20171001.gz contains requests from 2017-09-30 and 2017-10-01.
  • scrubbed.torproject.org-access.log-20171002.gz contains requests from 2017-10-01 only.
  • scrubbed.torproject.org-access.log-20171004.gz contains requests from 2017-10-03 only.
  • scrubbed.torproject.org-access.log-20171006.gz contains requests from 2017-10-05 and 2017-10-06.

Would the existing code produce logs for 2017-10-01, -03, -05, and -06 with exactly the sanitized log lines from these original log files? (I didn't run it, I only read the code and am unclear about this.)

The result does not depend on the contents of an input log. The above files would lead to a single sanitized log for 2017-10-01. The implementation relies on having the sibling, which could be provided by a simple touch scrubbed.torproject.org-access.log-20171003 command. The application needs an outside cue. I'm stating this here for completeness. As there is more to this (including the below questions), let's move the discussion to the spec ticket.

Here's another, related question: what happens if a web server rotates logs more often than once per day? At least that's something that we write in the specification. I'm not sure how this would work with file names, so maybe we in fact require that logs are rotated exactly once per day, and we just didn't write that in the specification yet. However, it seems rather restrictive to prescribe exact log rotation intervals in order to sanitize logs subsequently. Maybe we should be less restrictive here.

Is there a way to make this approach more robust? And is there a way to ensure that we'll learn about any broken assumptions as early as possible?

Will move all these questions and possible answers to on #23243.

Ah, and do you mind doing another round of JavaDoc editing and variable renaming towards finding a middle ground between 2-characters-is-almost-verbose and 80-characters-can-fit-in-a-line-so-let-us-not-use-more-than-79? As a fixup/squash commit without rebasing, please. :) Thank you!

I'll take another look, no problem ;-)

comment:38 Changed 19 months ago by karsten

Alright, let's move to #23243 first and come back here once the open questions there are answered.

comment:39 Changed 19 months ago by iwakeh

Just for the record there is now a name and javadoc tweaking fixup commit. But, review is not necessary immediately as the changes in #23243 will lead to more changes here.

comment:40 Changed 18 months ago by iwakeh

Now, the as the spec is overhauled and using the conclusions in #23421 comments8&9 the work can continue.

comment:41 Changed 17 months ago by iwakeh

Keywords: metrics-2018 added; metrics-2017 removed

Will be completed in 2018.

comment:42 Changed 17 months ago by iwakeh

Status: needs_revisionneeds_information

Regarding import of logs the spec states

All access log files written by Tor's web servers follow the naming convention <virtual-host>-access.log-YYYYMMDD, where "YYYYMMDD" is the date of the rotation and finalization of the log file, which is not used in the further sanitizing process.

Should a mangled or missing date part of the log's filename be ignored or lead to discarding the log?
I'd vote for ignoring the date completely and think this is the best interpretation of the spec.
On the other hand the last sentence of section 2 says:

As first safeguard against publishing log files that are too sensitive, we discard all files not matching the naming convention for access logs. This is to prevent, for example, error logs from slipping through.

Last edited 17 months ago by iwakeh (previous) (diff)

comment:43 Changed 16 months ago by iwakeh

Status: needs_informationassigned

Because of spec section 2:

As first safeguard against publishing log files that are too sensitive, we discard all files not matching the naming convention for access logs. This is to prevent, for example, error logs from slipping through.

all input log files not complying to the naming pattern will be discarded.

comment:44 Changed 16 months ago by iwakeh

Section 4.1 (spec) states:

In addition, log lines are treated differently according to the date they contain:

During an import process the sanitizer takes all log line dates into account and determines the reference interval as stretching from the oldest date to the youngest date encountered. Depending on the reference interval log lines are not yet processed, if their date is on the edges of the reference interval, i.e., the date is not at least a day younger than the older endpoint or the date is only LIMIT days older than the younger endpoint, where LIMIT is initially set to two, but this might change if necessary.
If the younger endpoint of the reference interval coincides with the current system date, the day before is used as the new younger reference interval endpoint, which ensures that the sanitizer won't publish logs prematurely, i.e., before there is a chance that they are complete. Thus, processing of log lines carrying such date is postponed.
All log lines with dates for which the sanitizer already published a log file are discarded in order to avoid altering published logs.

While testing I noticed it might be useful to add a `WebstatsNoLimit' property defaulting to false and, if set to true, not applying the limits, i.e., writing all logs regardless. This would make sense for a bulk import where it is known that the data are complete and ready to be published.
(Of course, one 'workaround' is to add fake lines to enlarge the interval as needed.)
Thoughts?

PS: Well, actually it might be better to name the property `WebstatsLimits'; defaulting to true.

Last edited 16 months ago by iwakeh (previous) (diff)

comment:45 Changed 16 months ago by iwakeh

Status: assignedneeds_review

Please review this branch. The most recent commit adapts to the new spec.

The revision was necessary because the specification turned the reference date into a reference interval. This change is also the reason for moving the sanitization logic into CollecTor's webstats module.

This branch depends on the latest metrics-lib branch.

comment:46 Changed 16 months ago by karsten

Status: needs_reviewneeds_revision

I just reviewed the last commits (086e904, d7a2835, 45dd764) altogether. Here's what I found:

  • We'll have to clean up CHANGELOG.md before this branch gets merged. It contains duplicate/overlapping entries for the new changes. I can do that as part of merging.
  • Let's also not forget to release metrics-lib 2.2.0 before merging this branch and updating to 2.2.0 rather than 2.2.0-dev. I can do that as part of merging, too.
  • New copyrights should be changed to 2018. I can do this, too.
  • Can I ask you to reference all member attributes or methods with this.? I know it's syntactically irrelevant for the compiler, but for me as a human being it improves readability a lot.
  • There's a line in SanitizeWeblogs.java , namely SortedSet<LocalDate> sorted = new TreeSet();, which is missing a pair of <>. This is the only actual code change I have, which is why I didn't include this change in a patch.
  • The default values WebstatsPeriodMinutes = 31 and WebstatsOffsetMinutes = 0 seem a bit unusual. Wouldn't it make more sense to start with an offset of 1 minute and then run every 30 minutes? And regardless of the period being unusual, isn't that far too often? How about 360 as period and, say, 31 as offset?
  • Other than that, I'd say let's try it out when you say it's ready. I think that's more efficient to find remaining bugs than re-reading and tweaking the code. Just let me know when I should do some tests.

comment:47 Changed 16 months ago by karsten

Cc: metrics-team added

(For some reason metrics-bugs@ doesn't get updates for this ticket. Trying to copy metrics-team.)

comment:48 in reply to:  46 Changed 16 months ago by iwakeh

Replying to karsten:

I just reviewed the last commits (086e904, d7a2835, 45dd764) altogether. Here's what I found:

Thanks for the quick review! No objections to the suggested changes and fine to start testing more.
A few comments inline:

  • We'll have to clean up CHANGELOG.md before this branch gets merged. It contains duplicate/overlapping entries for the new changes. I can do that as part of merging.

+1

  • Let's also not forget to release metrics-lib 2.2.0 before merging this branch and updating to 2.2.0 rather than 2.2.0-dev. I can do that as part of merging, too.

+1

  • New copyrights should be changed to 2018. I can do this, too.

Thanks, I tried to catch all places, but didn't succeed.

  • Can I ask you to reference all member attributes or methods with this.? I know it's syntactically irrelevant for the compiler, but for me as a human being it improves readability a lot.

Sure. Do you intend to add the missing ones before testing or should I check and supply a branch?

  • There's a line in SanitizeWeblogs.java , namely SortedSet<LocalDate> sorted = new TreeSet();, which is missing a pair of <>. This is the only actual code change I have, which is why I didn't include this change in a patch.

An oversight, please change.

  • The default values WebstatsPeriodMinutes = 31 and WebstatsOffsetMinutes = 0 seem a bit unusual. Wouldn't it make more sense to start with an offset of 1 minute and then run every 30 minutes? And regardless of the period being unusual, isn't that far too often? How about 360 as period and, say, 31 as offset?

That was arbitrarily chosen. The larger intervals make more sense in production.

  • Other than that, I'd say let's try it out when you say it's ready. I think that's more efficient to find remaining bugs than re-reading and tweaking the code. Just let me know when I should do some tests.

Fine to go!

comment:49 in reply to:  47 Changed 16 months ago by iwakeh

Replying to karsten:

(For some reason metrics-bugs@ doesn't get updates for this ticket. Trying to copy metrics-team.)

True, sometimes these messages seem to not be sent :-/

comment:50 Changed 16 months ago by karsten

Alright, I made a few tweaks to your branch in my task-22428-4 branch with the latest commit being 1f1adec.

I also ran a first round of tests. Here's what I found:

  • The recent/ directory contains only 1 log file per physical/virtual host from 3 days ago. The reason is that the last 2 days are excluded as too recent. But the recent/ directory is supposed to contain the last 72 hours of data produced by CollecTor, not data originally published within the last 72 hours. The goal is to give CollecTor clients 72 hours to obtain newly provided data before they need to fall back to archives. I think the other modules only look at last-modified time of provided files and delete them after 72 hours. Maybe we should do that here, too.
  • The regular expression in metrics-lib's WebServerAccessLogLine is too strict. It does not consider the following log line to be valid, but it should: 0.0.0.0 - - [22/Jan/2018:00:00:00 +0000] "GET /collector/archive HTTP/1.1" 301 -. We should probably also compare the regular expression to the Apache specification for similar cases where we're being too strict.
  • There are subtle differences between files provided by webstats.tp.o and the ones produced here. We should probably write a simple (shell) script to convert files from webstats.tp.o to the format we'd have produced. Things like cutting off columns at the end or cutting off the ? from parameters. We'd run that script once when copying over files.
  • I guess we'll need to extend create-tarballs script to produce tarballs containing webstats files. I haven't looked, though.

Do you want to look into these issues?

Remaining changes after those above are:

  • Run another round of tests with most or even all original log files as input.
  • Clean up CHANGELOG.md.
  • Release metrics-lib 2.2.0 and update build.xml.
  • Update copyrights to 2018.

comment:51 in reply to:  50 ; Changed 16 months ago by iwakeh

Replying to karsten:

Alright, I made a few tweaks to your branch in my task-22428-4 branch with the latest commit being 1f1adec.

Changes look fine, thanks! I will do further work based on this branch.

I also ran a first round of tests. Here's what I found:

  • The recent/ directory contains only 1 log file per physical/virtual host from 3 days ago. The reason is that the last 2 days are excluded as too recent. But the recent/ directory is supposed to contain the last 72 hours of data produced by CollecTor, not data originally published within the last 72 hours. The goal is to give CollecTor clients 72 hours to obtain newly provided data before they need to fall back to archives. I think the other modules only look at last-modified time of provided files and delete them after 72 hours. Maybe we should do that here, too.

This is related to comment:44, which quotes the relevant section of our spec (and still waits for comments).
The spec allows only earliest logs from two days before the current date. Providing data for the last three days of collecting results would mean to extend the logic to allow data from up to five days ago. Example for the intended functionality:

Log files from (omitting the year) 1/11, 1/12, 1/13, 1/14, 1/15, 1/16, 1/17 should result in having the logs of 1/13, 1/14, 1/15 in 'recent'.

If yes I'll add a patch commit.

  • The regular expression in metrics-lib's WebServerAccessLogLine is too strict. It does not consider the following log line to be valid, but it should: 0.0.0.0 - - [22/Jan/2018:00:00:00 +0000] "GET /collector/archive HTTP/1.1" 301 -. We should probably also compare the regular expression to the Apache specification for similar cases where we're being too strict.

If such lines are contained and valid we should make sure it is also reflected in the spec.
I will provide a patch.

  • There are subtle differences between files provided by webstats.tp.o and the ones produced here. We should probably write a simple (shell) script to convert files from webstats.tp.o to the format we'd have produced. Things like cutting off columns at the end or cutting off the ? from parameters. We'd run that script once when copying over files.

I'm objecting against the script solution.
The easiest way and another way of testing might be to simply import these using CollecTor. Setting log level to debug will tell us what gets dropped and provide another good test scenario with a useful result.

  • I guess we'll need to extend create-tarballs script to produce tarballs containing webstats files. I haven't looked, though.

Oh, true.

Do you want to look into these issues?

Yep.
Will you later import the old logs? I could also do that on corsicum or on my machine and upload there.

Remaining changes after those above are:

  • Run another round of tests with most or even all original log files as input.

karsten's task

  • import the old logs (iwakeh or karsten see above)
  • Clean up CHANGELOG.md.
  • Release metrics-lib 2.2.0 and update build.xml.
  • Update copyrights to 2018.

I could rebase your most recent branch on master and then add these changes?

comment:52 in reply to:  51 Changed 16 months ago by karsten

Replying to iwakeh:

Replying to karsten:

Alright, I made a few tweaks to your branch in my task-22428-4 branch with the latest commit being 1f1adec.

Changes look fine, thanks! I will do further work based on this branch.

Great!

I also ran a first round of tests. Here's what I found:

  • The recent/ directory contains only 1 log file per physical/virtual host from 3 days ago. The reason is that the last 2 days are excluded as too recent. But the recent/ directory is supposed to contain the last 72 hours of data produced by CollecTor, not data originally published within the last 72 hours. The goal is to give CollecTor clients 72 hours to obtain newly provided data before they need to fall back to archives. I think the other modules only look at last-modified time of provided files and delete them after 72 hours. Maybe we should do that here, too.

This is related to comment:44, which quotes the relevant section of our spec (and still waits for comments).

Huh, I did not see that comment. First thought is that we probably don't need another config option as long as we clearly document how the processing works. But I'll think more about this.

The spec allows only earliest logs from two days before the current date. Providing data for the last three days of collecting results would mean to extend the logic to allow data from up to five days ago. Example for the intended functionality:

Log files from (omitting the year) 1/11, 1/12, 1/13, 1/14, 1/15, 1/16, 1/17 should result in having the logs of 1/13, 1/14, 1/15 in 'recent'.

If yes I'll add a patch commit.

Well, no. 1/11 and 1/12 would be included in the recent/ directory, too, at least after processing the stated files in bulk. When processing them in an ongoing way, we'd indeed only expect 1/13, 1/14, and 1/15 in recent/.

Maybe it helps to ignore the 2-days logic entirely here and only look at the last-modified time of sanitized files. This may produce a gigantic recent/ directory after the first bulk import and the 72 hours after that, but after that everything should stabilize.

  • The regular expression in metrics-lib's WebServerAccessLogLine is too strict. It does not consider the following log line to be valid, but it should: 0.0.0.0 - - [22/Jan/2018:00:00:00 +0000] "GET /collector/archive HTTP/1.1" 301 -. We should probably also compare the regular expression to the Apache specification for similar cases where we're being too strict.

If such lines are contained and valid we should make sure it is also reflected in the spec.
I will provide a patch.

Sounds good. Maybe also see what the Apache log file specification says here.

  • There are subtle differences between files provided by webstats.tp.o and the ones produced here. We should probably write a simple (shell) script to convert files from webstats.tp.o to the format we'd have produced. Things like cutting off columns at the end or cutting off the ? from parameters. We'd run that script once when copying over files.

I'm objecting against the script solution.
The easiest way and another way of testing might be to simply import these using CollecTor. Setting log level to debug will tell us what gets dropped and provide another good test scenario with a useful result.

Ah, sure, that would work, too.

  • I guess we'll need to extend create-tarballs script to produce tarballs containing webstats files. I haven't looked, though.

Oh, true.

Do you want to look into these issues?

Yep.
Will you later import the old logs? I could also do that on corsicum or on my machine and upload there.

I can do that.

Remaining changes after those above are:

  • Run another round of tests with most or even all original log files as input.

karsten's task

Yep, I'll do that.

  • import the old logs (iwakeh or karsten see above)

And this.

  • Clean up CHANGELOG.md.
  • Release metrics-lib 2.2.0 and update build.xml.
  • Update copyrights to 2018.

I could rebase your most recent branch on master and then add these changes?

I'm happy to do this. Please keep working on the current branch.

comment:53 in reply to:  44 ; Changed 16 months ago by karsten

Replying to iwakeh:

PS: Well, actually it might be better to name the property `WebstatsLimits'; defaulting to true.

Ah, well, contrary to what I just wrote, let's add a new config option. How about we make it an integer config option that expects the number of days and that defaults to 2?

comment:54 in reply to:  53 Changed 16 months ago by iwakeh

Replying to karsten:

Replying to iwakeh:

PS: Well, actually it might be better to name the property `WebstatsLimits'; defaulting to true.

Ah, well, contrary to what I just wrote, let's add a new config option. How about we make it an integer config option that expects the number of days and that defaults to 2?

This integer would only refer to the upper limit, correct? The lower limit of one day from the earliest processed date is not being altered. Maybe add integer properties for each, i.e. WebstatsLowerLimit and WebstatsUpperLimit? The boolean property was supposed to either use both or none of these (hard-coded) limits.

Thoughts?

comment:55 Changed 16 months ago by karsten

Ah, very good point. In that case, WebstatsLimits with a boolean value sounds like the better solution.

comment:56 Changed 16 months ago by iwakeh

Status: needs_revisionneeds_review

Please review two commit on top of your branch, which together with the latest branch/commit for #22983 solve all open issues (on my list).

comment:57 Changed 16 months ago by karsten

Commit d18424a:

  • It looks like we're adding sorted.first().minusDays(1) but not sorted.last().plusDays(1). I didn't try this out, but from looking at the code using this method it seems like we're excluding interval ends. Can you take another look?
  • Rest looks good.

Commit 2dee32c looks good.

comment:58 in reply to:  57 Changed 16 months ago by iwakeh

Replying to karsten:

Commit d18424a:

  • It looks like we're adding sorted.first().minusDays(1) but not sorted.last().plusDays(1). I didn't try this out, but from looking at the code using this method it seems like we're excluding interval ends. Can you take another look?

Good catch! Amended the fix to the last fixup-commit.

  • Rest looks good.

Commit 2dee32c looks good.

Seems this module approaches finalization :-)

comment:59 Changed 16 months ago by karsten

Amended commit e9f6c4e looks good.

Great, I'll run another test tomorrow.

comment:60 Changed 16 months ago by karsten

Resolution: fixed
Status: needs_reviewclosed

Local tests pass! I squashed all commits related to this change, updated the change log, updated build.xml to use metrics-lib 2.2.0, updated copyrights, rebased to master, and pushed. I'll run some more tests with files from webstats.tp.o, but I'll open new tickets in case I run into any bugs. Closing. Thanks! :)

Note: See TracTickets for help on using tickets.