== Input-Output-Specification
=== Input: How are logs provided to CollecTor
Current assumption:
local directory w/o subdirectories; files are compressed; the filename contains the host's name (e.g. metrics.torproject.org) and the date.
=== Output: How to store cleaned logs in out and recent
Current assumption:
The hostname of the collecting server (e.g. meronense.torproject.org) is not used for path creation.
==== out
out/webstats/<year>/<month>/<day>/<hostname>-access.log-<year-month-day>.xz
for example:
out/webstats/2016/08/29/metrics.torproject.org-access.log-20160829.xz.
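For illustration, the path scheme above could be computed like this (a hypothetical sketch; class and method names are made up, not CollecTor's actual code):

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical sketch of the proposed out/ path scheme. */
public class WebstatsOutPath {

  /** Formats dates as yyyyMMdd, e.g., 20160829. */
  private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

  /** Builds out/webstats/<year>/<month>/<day>/<hostname>-access.log-<date>.xz. */
  static Path outputPath(String hostname, LocalDate date) {
    return Paths.get("out", "webstats",
        String.format("%04d", date.getYear()),
        String.format("%02d", date.getMonthValue()),
        String.format("%02d", date.getDayOfMonth()),
        hostname + "-access.log-" + date.format(DAY) + ".xz");
  }

  public static void main(String[] args) {
    System.out.println(outputPath("metrics.torproject.org",
        LocalDate.of(2016, 8, 29)));
    // prints out/webstats/2016/08/29/metrics.torproject.org-access.log-20160829.xz
  }
}
```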
First bit of feedback (more may follow): wouldn't it be problematic to leave out the hostname of the collecting server in the output path? There are several hosts serving the main Tor website, and each of them will create a log file for that domain name for a given day.
Note that I don't have access to the current webstats host, but I would imagine that not all files are stored in a directory without subdirectories. The existing code might reveal something about that, but I haven't looked.
First bit of feedback (more may follow): wouldn't it be problematic to leave out the hostname of the collecting server in the output path? There are several hosts serving the main Tor website, and each of them will create a log file for that domain name for a given day.
Good point! The hostname is used as directory name on webstats.torproject.org, but doesn't seem to be part of current shell/python processing, which takes place in a directory containing the files to be cleaned (afaict).
Note that I don't have access to the current webstats host, but I would imagine that not all files are stored in a directory without subdirectories. The existing code might reveal something about that, but I haven't looked.
As mentioned above, but it is very easy to simply look for files in subfolders, too.
So all in all, the main question left is how the access logs and the log metadata are provided. So far, metadata seems to be in the log filename and the path (i.e., the hostname).
Could the logs be 'delivered' as compressed files having meta data in their filename, e.g.,
<measuring-host>-<served-hostname>-access.log-<year-month-day>.xz
This is just an example; any other combination would be fine. What matters is what metadata is communicated and how.
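For illustration, such a convention would be straightforward to parse. This is a hypothetical sketch only: the class and pattern names are made up, and the reluctant quantifiers assume that host names themselves contain no hyphens:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the suggested filename convention. */
public class LogFileNameParser {

  /** Matches <measuring-host>-<served-hostname>-access.log-<yyyyMMdd>.xz,
   *  assuming host names contain no hyphens. */
  static final Pattern NAME =
      Pattern.compile("(.+?)-(.+?)-access\\.log-(\\d{8})\\.xz");

  public static void main(String[] args) {
    Matcher m = NAME.matcher(
        "meronense.torproject.org-metrics.torproject.org-access.log-20170704.xz");
    if (m.matches()) {
      System.out.println("measuring host:  " + m.group(1));
      System.out.println("served hostname: " + m.group(2));
      System.out.println("log date:        " + m.group(3));
    }
  }
}
```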
Hmmm, the metadata question is really a good one. Ideally, we'd include metadata in the first lines, but I didn't find anything about comments or metadata or ignored lines in the Apache documentation that we could use here.
Putting metadata in paths and file names is something we're just giving up in the case of bridge network statuses. Ideally, file contents would be self-contained, if that's possible in this case.
Another option could be to put sanitized web logs into tarballs of their own and include a small metadata file. This might have other downsides, though.
Hmmm, the metadata question is really a good one. Ideally, we'd include metadata in the first lines, but I didn't find anything about comments or metadata or ignored lines in the Apache documentation that we could use here.
Putting metadata in paths and file names is something we're just giving up in the case of bridge network statuses. Ideally, file contents would be self-contained, if that's possible in this case.
Another option could be to put sanitized web logs into tarballs of their own and include a small metadata file. This might have other downsides, though.
Maybe there are other ways?
The output side is something we can build and your suggestion here is one solution.
The open question is how the 'real' access logs and the log meta data are provided to CollecTor?
Regarding metadata in output files, I think the least bad solution is to prefix existing filenames with <measuring-host>-. Adding funny/invalid data lines just for metadata seems worse, and wrapping files into tarballs together with a metadata file might make it unnecessarily complex to process files. Maybe we can't give up on parsing metadata from filenames just yet, though we can still try to reduce it to cases like this where adding metadata to files is not a good option.
An upside of this approach is that we can easily copy over existing sanitized web logs without opening them, adding metadata, and closing them. The only thing we need to do is rename them, like https://webstats.torproject.org/out/meronense.torproject.org/metrics.torproject.org-access.log-20170704.xz to meronense.torproject.org-metrics.torproject.org-access.log-20170704.xz.
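The renaming step could look roughly like this; a sketch under the assumption that sanitized logs live in per-collecting-host directories such as out/meronense.torproject.org/, with all class and method names being illustrative:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of renaming existing sanitized logs without opening them. */
public class PrefixCollectingHost {

  /** meronense.torproject.org + metrics.torproject.org-access.log-20170704.xz
   *  -> meronense.torproject.org-metrics.torproject.org-access.log-20170704.xz */
  static String prefixedName(String collectingHost, String fileName) {
    return collectingHost + "-" + fileName;
  }

  /** Renames all sanitized logs below out/<collecting-host>/, taking the
   *  collecting host's name from the directory name. */
  static void renameAll(Path hostDir) throws IOException {
    String collectingHost = hostDir.getFileName().toString();
    try (DirectoryStream<Path> logs =
        Files.newDirectoryStream(hostDir, "*-access.log-*.xz")) {
      for (Path log : logs) {
        Files.move(log, log.resolveSibling(
            prefixedName(collectingHost, log.getFileName().toString())),
            StandardCopyOption.ATOMIC_MOVE);
      }
    }
  }
}
```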
Two more properties are necessary; here is a list of all properties for the webstats module:
######## Tor Weblogs ##########
# Define descriptor sources
# possible values: Local, Sync
WebstatsSources = Local
## Relative path to directory to import logfiles from.
WebstatsLocalOrigins = in/webstats
# Location for keeping temporary log before publication.
WebstatsTemporaryPath = temp-webstats
# Date from which to calculate publishable sanitized logs.
# Default value 'Now' uses the current system date.
# Format for fixed dates: yyyymmdd
# Example: WebstatsReferenceDate = 20170801
WebstatsReferenceDate = Now
WebstatsSources and WebstatsLocalOrigins are the usual ones for CollecTorMain modules and are just added here for completeness.
WebstatsTemporaryPath: specifies the work directory for CollecTor, because logs have to be two days old before being published.
WebstatsReferenceDate: enables bulk imports up to a certain date. Default setting uses the system date.
I'm not too deep into this topic right now, so handle the following comment with care.
I wonder if we can avoid having that directory for temporary log files that cannot be published yet. It seems like a possible source for trouble when processing breaks at some point and we need to fix that, with half of a log file being written to the temporary directory and the rest still being in the import directory.
Maybe we can simplify that by keeping a text file in stats/ where we keep some state which files we already read or wrote. And we only write a file to out/ and recent/ when it's ready for publication. Not sure if this will solve all cases, but it seems potentially easier to understand for future operators of this service (including ourselves when we don't remember these design discussions anymore).
Regarding WebstatsReferenceDate, it would be good to explain in the comments when this value needs to be changed, and to what value. The comment alone should be sufficient to know how to use the property, without further looking at the code.
I'm not too deep into this topic right now, so handle the following comment with care.
Will do :-) but this is really asked from a meta perspective and the main difference to all other descriptors is that logfiles should only be published after their completion (i.e., two days after their dates).
I wonder if we can avoid having that directory for temporary log files that cannot be published yet. It seems like a possible source for trouble when processing breaks at some point and we need to fix that, with half of a log file being written to the temporary directory and the rest still being in the import directory.
I would want to have a separation here, b/c log files from the import directory are not sanitized and should not be published. On the other hand, log descriptors in the temporary location are possibly not yet complete and already published log descriptors should not be altered (e.g., by appending and resorting them). If anything breaks, the incompletely written files in the working dir ought to be removed.
I would also want to avoid the stats file.
Maybe we can simplify that by keeping a text file in stats/ where we keep some state which files we already read or wrote. And we only write a file to out/ and recent/ when it's ready for publication. Not sure if this will solve all cases, but it seems potentially easier to understand for future operators of this service (including ourselves when we don't remember these design discussions anymore).
The explanation could be more elaborate and maybe the property renamed to WebstatsSanitizingPath or some better name?
The stats file option could be misleading. For example, if another local re-import leads to overwriting a sanitized but not yet published log. A temporary sanitizing-working directory makes clear that only CollecTor touches files in there and stuff from 'in' could be removed after a processing round. That ought to be easier for operation: "don't touch the sanitizing-working directory" and treat the input directory as with other modules?
Regarding WebstatsReferenceDate, it would be good to explain in the comments when this value needs to be changed, and to what value. The comment alone should be sufficient to know how to use the property, without further looking at the code.
I was thinking that it might be useful to be able to have partial imports of older logs, hmm. This might be trickier than just documenting the property. Example: adding all July 2017 logs to in/ and setting the reference date to 20170801 means that only logs up to (and including) 20170730 are published in that round.
Let's extend this discussion:
What other operational scenarios will the webstats module have to be prepared to deal with, and do these have to be available with the initial release of webstats?
I'm not too deep into this topic right now, so handle the following comment with care.
Will do :-) but this is really asked from a meta perspective and the main difference to all other descriptors is that logfiles should only be published after their completion (i.e., two days after their dates).
Well, discussing these things, even from a meta perspective, requires (sometimes: deep) thinking. Just saying, keep in mind that you're much, much deeper into this topic than I am at this time.
I wonder if we can avoid having that directory for temporary log files that cannot be published yet. It seems like a possible source for trouble when processing breaks at some point and we need to fix that, with half of a log file being written to the temporary directory and the rest still being in the import directory.
I would want to have a separation here, b/c log files from the import directory are not sanitized and should not be published. On the other hand, log descriptors in the temporary location are possibly not yet complete and already published log descriptors should not be altered (e.g., by appending and resorting them).
I think I agree on all statements above.
If anything breaks, the incompletely written files in the working dir ought to be removed.
Wait, is that correct? What if the sanitizing process aborts at a random point, possibly due to the host losing power in the middle of a run? Would the operator simply delete everything from the working directory in that case?
I'm pointing this out, because I was in the situation with the old webstats code. The code seemed plausible while writing it. But that changed a few months later when something broke and I had to figure out how to recover. I'd like to avoid such situations with this new code.
I would also want to avoid the stats file.
Yes, if we can avoid it, let's avoid it. Maybe we can use one just to avoid re-processing files unnecessarily. So, if it's empty or gone, nothing bad happens.
Maybe we can simplify that by keeping a text file in stats/ where we keep some state which files we already read or wrote. And we only write a file to out/ and recent/ when it's ready for publication. Not sure if this will solve all cases, but it seems potentially easier to understand for future operators of this service (including ourselves when we don't remember these design discussions anymore).
The explanation could be more elaborate and maybe the property renamed to WebstatsSanitizingPath or some better name?
The stats file option could be misleading. For example, if another local re-import leads to overwriting a sanitized but not yet published log.
Not sure I understand what you mean here.
A temporary sanitizing-working directory makes clear that only CollecTor touches files in there
... except when the operator needs to repair something and has to touch these files, too.
and stuff from 'in' could be removed after a processing round.
No, we shouldn't remove anything from in/. We didn't put files there, so we shouldn't remove them, either.
That ought to be easier for operation: "don't touch the sanitizing-working directory" and treat the input directory as with other modules?
A few thoughts on how this might work using a stats file:
We read files from in/.
We write fully sanitized files to out/ and recent/, but only if we're certain that we won't have more data later on that would require us to update files there, because we wouldn't do that.
If there's a file in in/ that contains lines that we couldn't put into a file in out/ and recent/, we will simply process that in/-file again next time.
We might want to use a file in stats/ to remember which files in in/ are already completely processed, so that we can skip them.
We never delete anything from in/ but leave that to the script that also places files there.
Please note that I'm not sure yet whether this will work. It just seems like something that is relatively easy to operate, in particular when something breaks.
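To make the idea concrete, here is a minimal sketch of such a stats file, assuming one fully processed filename per line; all class and file names are made up. If the file is empty or gone, files are merely re-processed and nothing bad happens:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of stats/ bookkeeping for processed in/ files. */
public class ProcessedFilesState {

  private final Path statsFile;
  private final Set<String> processed = new TreeSet<>();

  /** Reads previously recorded filenames, if the stats file exists. */
  ProcessedFilesState(Path statsFile) throws IOException {
    this.statsFile = statsFile;
    if (Files.exists(statsFile)) {
      this.processed.addAll(Files.readAllLines(statsFile));
    }
  }

  /** True if this in/ file was completely processed in an earlier round. */
  boolean alreadyProcessed(Path inFile) {
    return this.processed.contains(inFile.getFileName().toString());
  }

  /** Records a file, but only after its sanitized lines reached out/ and recent/. */
  void markProcessed(Path inFile) throws IOException {
    this.processed.add(inFile.getFileName().toString());
    Files.write(this.statsFile, this.processed);
  }
}
```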
Regarding WebstatsReferenceDate, it would be good to explain in the comments when this value needs to be changed, and to what value. The comment alone should be sufficient to know how to use the property, without further looking at the code.
I was thinking that it might be useful to be able to have partial imports of older logs, hmm. This might be trickier than just documenting the property. Example: adding all July 2017 logs to in/ and setting the reference date to 20170801 means that only logs up to (and including) 20170730 are published in that round.
Let's extend this discussion:
What other operational scenarios will the webstats module have to be prepared to deal with, and do these have to be available with the initial release of webstats?
I think the following scenarios are most common:
Initialize by processing log files from the past 2 weeks up to now. Similarly, re-process in case of change to sanitizing steps.
Do a periodic run every few hours.
Note that there are no archives of non-sanitized logs that would reach back more than 2 weeks. That's different with bridge descriptor tarballs.
I'm not too deep into this topic right now, so handle the following comment with care.
Will do :-) but this is really asked from a meta perspective and the main difference to all other descriptors is that logfiles should only be published after their completion (i.e., two days after their dates).
Well, discussing these things, even from a meta perspective, requires (sometimes: deep) thinking. Just saying, keep in mind that you're much, much deeper into this topic than I am at this time.
Thanks for your thoughts! I intended to hear the outside operationally focused view.
...
Now I have something to think and work about/with. A bunch of real logfiles for testing would be really great! Roughly two weeks' worth of successive logs should be sufficient to encounter odd edge cases and provide a good testing ground. These will also help with testing the metrics-lib part #22983 (moved).
Could you provide these?
The implementation changed quite a bit as it could build on the web server log specification (#23243 (moved)). That's why I didn't continue on the earlier branch; I think the code is easier to review in total here.
It builds on 2.1.0-dev version of metrics-lib.
Besides the (j)unit tests I ran system tests for sync and local import functionality with the supplied real logs.
The metrics-lib version needs to be adapted as soon as a release with log descriptors is available.
Alright, I finished an initial review of commit 086e904 in your task-22428-4 branch. I have several trivial or minor findings, but I'd like to postpone them until we have resolved one that I consider major:
I'm unclear whether the sibling approach is robust enough to cover all cases and edge cases. Maybe even worse, I'm unclear whether we'd notice if we'd be running into an uncovered edge case or if we'd silently not process and therefore lose data.
For example, what happens if we sanitize logs from a server that receives very few requests, maybe only a few requests per week? Consider these original log files (where I scrubbed the virtual host name):
scrubbed.torproject.org-access.log-20171001.gz contains requests from 2017-09-30 and 2017-10-01.
scrubbed.torproject.org-access.log-20171002.gz contains requests from 2017-10-01 only.
scrubbed.torproject.org-access.log-20171004.gz contains requests from 2017-10-03 only.
scrubbed.torproject.org-access.log-20171006.gz contains requests from 2017-10-05 and 2017-10-06.
Would the existing code produce logs for 2017-10-01, -03, -05, and -06 with exactly the sanitized log lines from these original log files? (I didn't run it, I only read the code and am unclear about this.)
Here's another, related question: what happens if a web server rotates logs more often than once per day? At least that's something that we write in the specification. I'm not sure how this would work with file names, so maybe we in fact require that logs are rotated exactly once per day, and we just didn't write that in the specification yet. However, it seems rather restrictive to prescribe exact log rotation intervals in order to sanitize logs subsequently. Maybe we should be less restrictive here.
Is there a way to make this approach more robust? And is there a way to ensure that we'll learn about any broken assumptions as early as possible?
Ah, and do you mind doing another round of JavaDoc editing and variable renaming towards finding a middle ground between 2-characters-is-almost-verbose and 80-characters-can-fit-in-a-line-so-let-us-not-use-more-than-79? As a fixup/squash commit without rebasing, please. :) Thank you!
Alright, I finished an initial review of commit 086e904 in your task-22428-4 branch. I have several trivial or minor findings, but I'd like to postpone them until we have resolved one that I consider major:
Good, it's better to address the small stuff at the very end :-)
I'm unclear whether the sibling approach is robust enough to cover all cases and edge cases. Maybe even worse, I'm unclear whether we'd notice if we'd be running into an uncovered edge case or if we'd silently not process and therefore lose data.
This is not the siblings approach question, but the general question: When is the log for a certain day done?
I'll address this in more detail on #23243 (moved), because this is a specification issue and this ticket here should only be concerned with implementation, imo.
For example, what happens if we sanitize logs from a server that receives very few requests, maybe only a few requests per week? Consider these original log files (where I scrubbed the virtual host name):
scrubbed.torproject.org-access.log-20171001.gz contains requests from 2017-09-30 and 2017-10-01.
scrubbed.torproject.org-access.log-20171002.gz contains requests from 2017-10-01 only.
scrubbed.torproject.org-access.log-20171004.gz contains requests from 2017-10-03 only.
scrubbed.torproject.org-access.log-20171006.gz contains requests from 2017-10-05 and 2017-10-06.
Would the existing code produce logs for 2017-10-01, -03, -05, and -06 with exactly the sanitized log lines from these original log files? (I didn't run it, I only read the code and am unclear about this.)
The result does not depend on the contents of an input log. The above files would lead to a single sanitized log for 2017-10-01. The implementation relies on having the sibling, which could be provided by a simple touch scrubbed.torproject.org-access.log-20171003 command. The application needs an outside cue. I'm stating this here for completeness. As there is more to this (including the below questions), let's move the discussion to the spec ticket.
Here's another, related question: what happens if a web server rotates logs more often than once per day? At least that's something that we write in the specification. I'm not sure how this would work with file names, so maybe we in fact require that logs are rotated exactly once per day, and we just didn't write that in the specification yet. However, it seems rather restrictive to prescribe exact log rotation intervals in order to sanitize logs subsequently. Maybe we should be less restrictive here.
Is there a way to make this approach more robust? And is there a way to ensure that we'll learn about any broken assumptions as early as possible?
Will move all these questions and possible answers to on #23243 (moved).
Ah, and do you mind doing another round of JavaDoc editing and variable renaming towards finding a middle ground between 2-characters-is-almost-verbose and 80-characters-can-fit-in-a-line-so-let-us-not-use-more-than-79? As a fixup/squash commit without rebasing, please. :) Thank you!
Just for the record, there is now a name and javadoc tweaking fixup commit. But review is not necessary immediately, as the changes in #23243 (moved) will lead to more changes here.
Regarding import of logs the spec states
All access log files written by Tor's web servers follow the naming convention <virtual-host>-access.log-YYYYMMDD, where "YYYYMMDD" is the date of the rotation and finalization of the log file, which is not used in the further sanitizing process.
Should a mangled or missing date part of the log's filename be ignored or lead to discarding the log?
I'd vote for ignoring the date completely and think this is the best interpretation of the spec.
On the other hand the last sentence of section 2 says:
As first safeguard against publishing log files that are too sensitive, we discard all files not matching the naming convention for access logs. This is to prevent, for example, error logs from slipping through.
Because of spec section 2:
As first safeguard against publishing log files that are too sensitive, we discard all files not matching the naming convention for access logs. This is to prevent, for example, error logs from slipping through.
all input log files not complying with the naming pattern will be discarded.
In addition, log lines are treated differently according to the date they contain:
During an import process the sanitizer takes all log line dates into account and determines the reference interval as stretching from the oldest date to the youngest date encountered.

Depending on the reference interval, log lines are not yet processed if their date is on the edges of the reference interval, i.e., the date is not at least a day younger than the older endpoint, or the date is only LIMIT days older than the younger endpoint, where LIMIT is initially set to two, but this might change if necessary.

If the younger endpoint of the reference interval coincides with the current system date, the day before is used as the new younger reference interval endpoint, which ensures that the sanitizer won't publish logs prematurely, i.e., before there is a chance that they are complete. Thus, processing of log lines carrying such a date is postponed.

All log lines with dates for which the sanitizer already published a log file are discarded in order to avoid altering published logs.
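These interval rules can be sketched as follows; the class and method names are mine, not the actual SanitizeWeblogs implementation:

```java
import java.time.LocalDate;
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

/** Sketch of the reference-interval rules above; names are illustrative only. */
public class ReferenceInterval {

  /** Initially set to two, but this might change if necessary. */
  static final int LIMIT = 2;

  /** Returns the log line dates that may be published in this round. */
  static SortedSet<LocalDate> publishableDates(
      Collection<LocalDate> allLineDates, LocalDate systemDate) {
    SortedSet<LocalDate> sorted = new TreeSet<>(allLineDates);
    SortedSet<LocalDate> result = new TreeSet<>();
    if (sorted.isEmpty()) {
      return result;
    }
    LocalDate oldest = sorted.first();
    LocalDate youngest = sorted.last();
    if (youngest.equals(systemDate)) {
      // Don't publish logs prematurely, i.e., before they can be complete.
      youngest = systemDate.minusDays(1);
    }
    for (LocalDate date : sorted) {
      // Skip the edges: the date must be at least a day younger than the
      // older endpoint and at least LIMIT days older than the younger one.
      if (date.isAfter(oldest) && !date.isAfter(youngest.minusDays(LIMIT))) {
        result.add(date);
      }
    }
    return result;
  }
}
```

With the example from earlier in this thread, all July 2017 dates plus a reference date of 20170801 yield publishable dates up to and including 20170730.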
While testing I noticed it might be useful to add a `WebstatsNoLimit' property defaulting to false and, if set to true, not applying the limits, i.e., writing all logs regardless. This would make sense for a bulk import where it is known that the data are complete and ready to be published.
(Of course, one 'workaround' is to add fake lines to enlarge the interval as needed.)
Thoughts?
PS: Well, actually it might be better to name the property `WebstatsLimits'; defaulting to true.
Please review this branch. The most recent commit adapts to the new spec.
The revision was necessary because the specification turned the reference date into a reference interval. This change is also the reason for moving the sanitization logic into CollecTor's webstats module.
I just reviewed the last commits (086e904, d7a2835, 45dd764) altogether. Here's what I found:
We'll have to clean up CHANGELOG.md before this branch gets merged. It contains duplicate/overlapping entries for the new changes. I can do that as part of merging.
Let's also not forget to release metrics-lib 2.2.0 before merging this branch and updating to 2.2.0 rather than 2.2.0-dev. I can do that as part of merging, too.
New copyrights should be changed to 2018. I can do this, too.
Can I ask you to reference all member attributes or methods with this.? I know it's syntactically irrelevant for the compiler, but for me as a human being it improves readability a lot.
There's a line in SanitizeWeblogs.java, namely SortedSet<LocalDate> sorted = new TreeSet();, which is missing a pair of <>. This is the only actual code change I have, which is why I didn't include this change in a patch.
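For the record, the fix is just the diamond operator on that line, shown here in a standalone snippet:

```java
import java.time.LocalDate;
import java.util.SortedSet;
import java.util.TreeSet;

public class DiamondExample {

  public static void main(String[] args) {
    // The raw-type version compiles, but with an unchecked-conversion warning:
    //   SortedSet<LocalDate> sorted = new TreeSet();
    // With the diamond operator the element type is inferred and checked:
    SortedSet<LocalDate> sorted = new TreeSet<>();
    sorted.add(LocalDate.of(2017, 10, 3));
    sorted.add(LocalDate.of(2017, 10, 1));
    System.out.println(sorted.first()); // prints 2017-10-01
  }
}
```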
The default values WebstatsPeriodMinutes = 31 and WebstatsOffsetMinutes = 0 seem a bit unusual. Wouldn't it make more sense to start with an offset of 1 minute and then run every 30 minutes? And regardless of the period being unusual, isn't that far too often? How about 360 as period and, say, 31 as offset?
Other than that, I'd say let's try it out when you say it's ready. I think that's more efficient to find remaining bugs than re-reading and tweaking the code. Just let me know when I should do some tests.
I just reviewed the last commits (086e904, d7a2835, 45dd764) altogether. Here's what I found:
Thanks for the quick review! No objections to the suggested changes and fine to start testing more.
A few comments inline:
We'll have to clean up CHANGELOG.md before this branch gets merged. It contains duplicate/overlapping entries for the new changes. I can do that as part of merging.
+1
Let's also not forget to release metrics-lib 2.2.0 before merging this branch and updating to 2.2.0 rather than 2.2.0-dev. I can do that as part of merging, too.
+1
New copyrights should be changed to 2018. I can do this, too.
Thanks, I tried to catch all places, but didn't succeed.
Can I ask you to reference all member attributes or methods with this.? I know it's syntactically irrelevant for the compiler, but for me as a human being it improves readability a lot.
Sure. Do you intend to add the missing ones before testing or should I check and supply a branch?
There's a line in SanitizeWeblogs.java, namely SortedSet<LocalDate> sorted = new TreeSet();, which is missing a pair of <>. This is the only actual code change I have, which is why I didn't include this change in a patch.
An oversight, please change.
The default values WebstatsPeriodMinutes = 31 and WebstatsOffsetMinutes = 0 seem a bit unusual. Wouldn't it make more sense to start with an offset of 1 minute and then run every 30 minutes? And regardless of the period being unusual, isn't that far too often? How about 360 as period and, say, 31 as offset?
That was arbitrarily chosen. The larger intervals make more sense in production.
Other than that, I'd say let's try it out when you say it's ready. I think that's more efficient to find remaining bugs than re-reading and tweaking the code. Just let me know when I should do some tests.