#23243 closed enhancement (implemented)

Write a specification for Tor web server logs

Reported by: iwakeh
Owned by: metrics-team
Priority: High
Milestone:
Component: Metrics/Website
Version:
Severity: Normal
Keywords: metrics-2017
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

This document should answer the following questions:

  • What will the raw input data look like?
    • compressed logs
    • varying dates in log-lines despite the file being tagged with a single date
    • are there only GET log-lines of 200 responses to be expected?
    • size could be huge (in future)
    • exact input format (if possible to define)
    • meta-data is provided in paths and filenames
    • ...
  • What will sanitized stored (on disk) logs look like?
    • cleaned log-lines, define exact format, give examples (as this might deviate from the current python sanitation)
    • meta-data is provided in paths and filenames
    • should files be reassembled, i.e., only log lines of a given date in a descriptor for that log date?
    • should storage (on disk) be in compressed files (opposed to storing other descriptors uncompressed)?
    • Should such log be stored (on disk) in reasonably sized chunks (once a GB size is reached)?
    • ...

Please add more.

Child Tickets

Attachments (7)

weblog-spec.html.xz (3.2 KB) - added by iwakeh 18 months ago.
webstats-spec.txt (6.3 KB) - added by karsten 18 months ago.
karsten's first draft
webstats-spec.2.txt (6.2 KB) - added by karsten 18 months ago.
Third draft based on karsten's first draft and a discussion with iwakeh
webstats-spec.3.txt (6.8 KB) - added by karsten 18 months ago.
Fourth draft
web-spec-4.zip (2.9 KB) - added by iwakeh 18 months ago.
draft five, numbered 4 (trac keeps rejecting unzipped uploads as spam :-/)
web-server-logs.pdf (115.9 KB) - added by karsten 18 months ago.
web-server-logs-2.pdf (118.2 KB) - added by karsten 18 months ago.


Change History (81)

comment:1 Changed 19 months ago by iwakeh

Status: new → needs_information

Added to metrics-web, as the spec will likely also be visible to the public there.
This is the basis for metrics-lib ticket #22983.

comment:2 in reply to:  description Changed 19 months ago by karsten

Replying to iwakeh:

This document should answer the following questions:

Good idea to start such a document! I'll start filling information below.

  • What will the raw input data look like?
    • compressed logs

Very likely, though compression shouldn't be a strict requirement.

  • varying dates in log-lines despite the file being tagged with a single date

Yes, to a certain degree. We'll have to ask the admins for details, but I believe that the date in the file name is put in when rotating logs and that the date per line is when the host started processing a request. Now, it's possible that some requests are received before midnight and completed after midnight. And depending on when the log is rotated it's possible that some requests are started on the day before the log was rotated and finished after rotating the log.

  • are there only GET log-lines of 200 responses to be expected?

No, there might be other methods and other response codes.

  • size could be huge (in future)

Yes.

  • exact input format (if possible to define)

Good question. We should ideally support Apache's Combined Log Format, even though we'd currently only receive Tor's privacy* log formats:

LogFormat "0.0.0.0 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacy
LogFormat "0.0.0.1 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyssl
LogFormat "0.0.0.2 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyhs

And there's already the first contradiction: The %{Age}o part is not contained in the Combined Log Format:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

Maybe we require lines to start with the Common Log Format and ignore any further fields? Needs discussion.
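
For illustration only, here is a minimal Java sketch of that idea: accept any line that begins with the Common Log Format and ignore trailing fields. The class name and regular expression are hypothetical and not taken from any existing sanitizer code.

import java.util.regex.Pattern;

public class ClfPrefixMatcher {

  /* "%h %l %u %t \"%r\" %>s %b" as a line prefix; any trailing content is
   * permitted but ignored. Groups: host, timestamp, request, status, size. */
  private static final Pattern CLF_PREFIX = Pattern.compile(
      "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-).*$");

  static boolean isParseable(String logLine) {
    return CLF_PREFIX.matcher(logLine).matches();
  }
}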

  • meta-data is provided in paths and filenames

Yep.

  • ...
  • What will sanitized stored (on disk) logs look like?
    • cleaned log-lines, define exact format, give examples (as this might deviate from the current python sanitation)
    • meta-data is provided in paths and filenames
    • should files be reassembled, i.e., only log lines of a given date in a descriptor for that log date?

Yes! That's important! Otherwise we'd leak information about which lines for a given date were logged before/after rotating logs, which narrows things down to a much shorter time frame than 24 hours. We'll have to do this.

  • should storage (on disk) be in compressed files (opposed to storing other descriptors uncompressed)?

Yes. Configurable by the application, but yes.

  • Should such log be stored (on disk) in reasonably sized chunks (once a GB size is reached)?

No, compression should already reduce the size enough so that we'll never run into such sizes. Never!

  • ...

Please add more.

Looks like a good start! Will add more as more comes to mind.

comment:3 Changed 18 months ago by iwakeh

Some questions before I begin to distill a small description:

It seems the sanitized clean log lines should adhere to the Common Log Format:

LogFormat "%h %l %u %t \"%r\" %>s %b"

Thus, the sanitized log format will be changed from currently:
0.0.0.0 - - [10/Mar/2017:00:00:00 +0000] "GET / HTTP/1.0" 200 3018 "-" "-" -
to
0.0.0.0 - - [10/Mar/2017:00:00:00 +0000] "GET / HTTP/1.0" 200 3018.
Can we agree on this?
When parsing, all lines whose beginning doesn't fit the Common Log Format are considered unparseable. Lines whose beginning matches CLF are considered parseable, and all trailing content is ignored.

POST requests are to be dropped?

comment:4 in reply to:  3 Changed 18 months ago by karsten

Replying to iwakeh:

Some questions before I begin to distill a small description:

It seems the sanitized clean log lines should adhere to the Common Log Format:

LogFormat "%h %l %u %t \"%r\" %>s %b"

Thus, the sanitized log format will be changed from currently:
0.0.0.0 - - [10/Mar/2017:00:00:00 +0000] "GET / HTTP/1.0" 200 3018 "-" "-" -
to
0.0.0.0 - - [10/Mar/2017:00:00:00 +0000] "GET / HTTP/1.0" 200 3018.
Can we agree on this?

Yes.

When parsing, all lines whose beginning doesn't fit the Common Log Format are considered unparseable. Lines whose beginning matches CLF are considered parseable, and all trailing content is ignored.

Yes.

POST requests are to be dropped?

Yes.

comment:5 Changed 18 months ago by iwakeh

Please review a first draft in this branch.
I also attach a simple html version here.

Changed 18 months ago by iwakeh

Attachment: weblog-spec.html.xz added

comment:6 Changed 18 months ago by iwakeh

Status: needs_information → needs_review

The status change somehow didn't 'arrive' at trac yesterday. Here it is.

Changed 18 months ago by karsten

Attachment: webstats-spec.txt added

karsten's first draft

comment:7 Changed 18 months ago by karsten

So, I started reviewing your draft and made a few tweaks here and there. But then I moved things around to better reflect the order of sanitizing steps and to better reason about why we're doing these steps. In the end I decided to start over and write a new draft based mostly on my memory plus a few other sources, rather than on your draft. The idea was to start from scratch. It's not supposed to replace your draft entirely, because your draft describes some parts better than mine. Maybe we can somehow combine our drafts. Though we'll first have to discuss a few open questions (marked with XXX in my draft). Maybe something for a pad meeting early next week?

Changed 18 months ago by karsten

Attachment: webstats-spec.2.txt added

Third draft based on karsten's first draft and a discussion with iwakeh

comment:8 Changed 18 months ago by karsten

Here's another draft based on today's discussion.

comment:9 Changed 18 months ago by iwakeh

Answer to open question
'How does the script handle dates differing from the log-files date?' (cf. python implementation)
Currently, log lines that have a date after the log-file's date or more than a day before the log-file's date are dropped and the dates are logged by the script. Lines from a day earlier are written to the earlier log file.

Thus, sanitized logs can be published once they are two days old.

comment:10 Changed 18 months ago by karsten

Do you mind giving an example with exact timestamps?

comment:11 Changed 18 months ago by iwakeh

For example, in a log file with date 20170903 (=: x) lines are processed according to their dates:

  • line a with date 20170903 (x) is written to the sanitized log file with date 20170903
  • line b with date 20170902 (x - 1) is written to the sanitized log file with date 20170902
  • line c with date 20170904 (x + y, where y > 0) is dropped and the date is logged on debug level
  • line d with date 20170901 (x - 1 - y, where y > 0) is dropped and the date is logged on debug level

And, system date 20170903 (x) is the earliest publishing date for a sanitized log file with date 20170901 (x - 2).
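
A hypothetical Java sketch of these rules, assuming the request date has already been parsed from the line and the reference date from the file name (class and method names are made up for illustration):

import java.time.LocalDate;

public class ReferenceDateRules {

  /** Keep lines dated on the reference date or exactly one day before it;
   *  everything else (future dates, older dates) is dropped. */
  static boolean keepLine(LocalDate lineDate, LocalDate referenceDate) {
    return lineDate.equals(referenceDate)
        || lineDate.equals(referenceDate.minusDays(1));
  }

  /** Earliest system date on which the sanitized log for a given line date
   *  may be published: two days later, e.g. lines dated 20170901 become
   *  publishable on 20170903. */
  static LocalDate earliestPublicationDate(LocalDate lineDate) {
    return lineDate.plusDays(2);
  }
}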

Changed 18 months ago by karsten

Attachment: webstats-spec.3.txt added

Fourth draft

comment:12 Changed 18 months ago by karsten

Okay, I tried to specify that, but please review carefully. The part that made this a bit more complex was that there are actually 2 places where we need to look at dates/times: 1) when deciding about discarding lines that are too old or too new and 2) when deciding when to publish a sanitized file and never ever touch it again. Maybe I overcomplicated this, so if you see a way to simplify what I wrote, please say so!

Here's the diff, if that helps reviewing:

diff --git a/webstats-spec.txt b/webstats-spec.txt
index 7e46449..48c0287 100644
--- a/webstats-spec.txt
+++ b/webstats-spec.txt
@@ -3,7 +3,6 @@ Tor webserver logs
 
 Next steps:
  - Replace webserver with web server which seems to be Less Bad English (karsten).
- - Find out what exact delay we'll need for publishing sanitized logs (iwakeh?)
  - Turn this document into XML (karsten)
  - Code the decisions (iwakeh)
  - Try out the code on actual logs (iwakeh; karsten can make more logs available)
@@ -30,6 +29,8 @@ LogFormat "0.0.0.2 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\"
 
 The main difference to Apache's Common Log Format is that request IP addresses are removed and the field is instead used to encode whether the request came in via http:// (0.0.0.0), via https:// (0.0.0.1), or via the site's onion service (0.0.0.2).
 
+Tor's webservers are configured to use UTC as timezone, which is also highly recommended when rewriting request times to "00:00:00" in order for the subsequent sanitizing steps to work correctly. Alternatively, if the system timezone is not set to UTC, webservers should keep request times unchanged and let them be handled by the subsequent sanitizing steps.
+
 Tor's webservers are configured to rotate logs at least once per day, which does not necessarily happen at 00:00:00 UTC. As a result, log files may contain requests from up to two UTC days and several log files may contain requests that have been started on the same UTC day.
 
 All access log files written by Tor's webservers follow the naming convention <hostname>.torproject.org-access.log-YYYYMMDD.
@@ -48,6 +49,8 @@ Log files are expected to contain exactly 1 request per line. We process these f
 
  - Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s %b") or a compatible format like one of Tor's privacy formats. It is acceptable if lines start with a format that is compatible to the Common Log Format and continue with additional fields. Those additional fields will later be discarded, but the line will not be discarded because of them.
  - The request IP address starts with "0.0.0.", followed by any number between 0 and 255.
+ - The time the request was received does not lie in the future.
+ - The date the request was received, after converting the request time to UTC, does not lie more than 1 day in the past. (Bulk imports of archived logs are exempt from this requirement.)
  - The request protocol is HTTP.
  - The request method is either GET or HEAD.
  - The final status of the request is neither 400 ("Bad Request") nor 404 ("Not Found").
@@ -80,9 +83,7 @@ Sanitized log files may additionally be sorted into directories by virtual host
 
 <virtual-host>/YYYY/MM/<virtual-host>-<physical-host>-access.log-YYYYMMDD[.xz]
 
-Due to the fact that the date when a log file was rotated and the start date of contained requests may not always overlap, we need to delay publishing sanitized log files until all log files containing requests from that date are guaranteed to be processed. After this delay, the sanitized log files are published and not further modified.
-
-XXX What's the delay? End of UTC day + 24 hours? Check current script!
+Due to the fact that the date when a log file was rotated and the start date of contained requests may not always overlap, we need to delay publishing sanitized log files until the start date of requests in UTC plus 2 days. After this delay, all log files containing requests from that date are assumed to be processed. Sanitized log files are published and not further modified in the future. (Again, bulk imports of archived logs are exempt from this.)
 
 As last and certainly not least important sanitizing step, all rewritten log lines are sorted alphabetically, so that request order cannot be inferred from sanitized log files.

If you think it's good, I'll continue with the remaining next steps. Thanks!
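
As a rough illustration of the non-date criteria listed in the diff above, here is a hypothetical sketch of the per-line checks; field parsing and the date checks are assumed to happen elsewhere, and all names are made up.

public class LineCriteria {

  /** Accept only lines with a 0.0.0.x address, HTTP protocol, GET or HEAD
   *  method, and a final status other than 400 and 404. */
  static boolean acceptLine(String ipAddress, String protocol, String method,
      int status) {
    return ipAddress.startsWith("0.0.0.")
        && protocol.startsWith("HTTP")
        && ("GET".equals(method) || "HEAD".equals(method))
        && status != 400 && status != 404;
  }
}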

comment:13 Changed 18 months ago by iwakeh

The actual date (or system date) is only of concern for publishing the logs. All other dates refer to the date the (original) log is finalized. I introduced the term 'reference date' for this.
The diff:

--- webstats-spec.3.txt
+++ webstats-spec.4.txt
@@ -33,7 +33,7 @@
 
 Tor's webservers are configured to rotate logs at least once per day, which does not necessarily happen at 00:00:00 UTC. As a result, log files may contain requests from up to two UTC days and several log files may contain requests that have been started on the same UTC day.
 
-All access log files written by Tor's webservers follow the naming convention <hostname>.torproject.org-access.log-YYYYMMDD.
+All access log files written by Tor's webservers follow the naming convention <hostname>.torproject.org-access.log-YYYYMMDD, where 'YYYYMMDD' is the date of the rotation and finalization of the log file.  This date will be referred to as 'reference date' in the following sections.
 
 # Sanitizing steps
 
@@ -41,16 +41,16 @@
 
 ## Discarding non-matching files
 
-As first safeguard against publishing log files that are too sensitive, we discard all files not matching the naming convention for access logs. This is to prevent, for example, error logs from slipping through.
+As first safeguard against publishing log files that are too sensitive, we discard all files not matching the naming convention for access logs. This is to prevent, for example, error logs from slipping through.  In addition, the log file's name is supposed to contain the reference date, which is used to determine the validity of log lines.  If the log file's name doesn't end in a date string of the format 'YYYYMMDD' the entire file is discarded.  
 
 ## Discarding non-matching lines
 
-Log files are expected to contain exactly 1 request per line. We process these files line by line and discard any lines not matching the following criteria:
+Log files are expected to contain exactly 1 request per line.  We process these files line by line and discard any lines not matching the following criteria:
 
  - Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s %b") or a compatible format like one of Tor's privacy formats. It is acceptable if lines start with a format that is compatible to the Common Log Format and continue with additional fields. Those additional fields will later be discarded, but the line will not be discarded because of them.
  - The request IP address starts with "0.0.0.", followed by any number between 0 and 255.
- - The time the request was received does not lie in the future.
- - The date the request was received, after converting the request time to UTC, does not lie more than 1 day in the past. (Bulk imports of archived logs are exempt from this requirement.)
+ - The time the request was received does not lie in the future of the reference date.
+ - The date the request was received, after converting the request time to UTC, does not lie more than 1 day in the past of the reference date.
  - The request protocol is HTTP.
  - The request method is either GET or HEAD.
  - The final status of the request is neither 400 ("Bad Request") nor 404 ("Not Found").
@@ -83,7 +83,7 @@
 
 <virtual-host>/YYYY/MM/<virtual-host>-<physical-host>-access.log-YYYYMMDD[.xz]
 
-Due to the fact that the date when a log file was rotated and the start date of contained requests may not always overlap, we need to delay publishing sanitized log files until the start date of requests in UTC plus 2 days. After this delay, all log files containing requests from that date are assumed to be processed. Sanitized log files are published and not further modified in the future. (Again, bulk imports of archived logs are exempt from this.)
+Due to the fact that the date when a log file was rotated and the start date of contained requests may not always overlap, we need to delay publishing sanitized log files until the start date of requests in UTC plus 2 days. After this delay, all log files containing requests from that date are assumed to be processed.
 
 As last and certainly not least important sanitizing step, all rewritten log lines are sorted alphabetically, so that request order cannot be inferred from sanitized log files.

And, I don't see the necessity for stating that the files won't be changed in the future. This doesn't seem to be part of a spec here. Anyway, we might want to re-sanitize these files if suddenly there is a privacy issue with fields that seem benign now (as with bridge descriptors, for example).

Changed 18 months ago by iwakeh

Attachment: web-spec-4.zip added

draft five, numbered 4 (trac keeps rejecting unzipped uploads as spam :-/)

comment:14 Changed 18 months ago by iwakeh

Another addition:

Even though Tor's Apache webservers are configured to only provide three ip addresses (e.g. 0.0.0.{0,1,2}) all lines with different ips are accepted and sanitized to ip 0.0.0.0.

Or, should such lines be discarded?

comment:15 Changed 18 months ago by iwakeh

Similarly, only lines for 400 and 404 response codes are discarded. Any bogus response code will be kept.

comment:16 in reply to:  13 Changed 18 months ago by karsten

Replying to iwakeh:

The actual date (or system date) is only of concern for publishing the logs. All other dates refer to the date the (original) log is finalized. I introduced the term 'reference date' for this.
The diff: [...]

And, I don't see the necessity for stating that the files won't be changed in future. This doesn't seem part of a spec here. Anyway, we might want to re-sanitize these files, if suddenly there is a privacy issue with fields that seem benign now (as with bridge descriptors, for example).

Agreed, those changes look good to me.

comment:17 in reply to:  14 ; Changed 18 months ago by karsten

Replying to iwakeh:

Another addition:

Even though Tor's Apache webservers are configured to only provide three ip addresses (e.g. 0.0.0.{0,1,2}) all lines with different ips are accepted and sanitized to ip 0.0.0.0.

Or, should such lines be discarded?

Right now addresses are kept as long as they start with 0.0.0., which seems plausible to me. The spec draft should also say that.

comment:18 in reply to:  15 ; Changed 18 months ago by karsten

Replying to iwakeh:

Similarly, only lines for 400 and 404 response codes are discarded. Any bogus response code will be kept.

Right, but I think 400 and 404 were the most problematic ones when we came up with the original plan for sanitizing these logs. Do you have any others in mind that we'd have to sanitize? If not, I'd say let's stick with 400 and 404 for now.

So, should I move forward with turning draft number five, numbered 4, into XML?

comment:19 in reply to:  17 ; Changed 18 months ago by iwakeh

Replying to karsten:

Replying to iwakeh:

Another addition:

Even though Tor's Apache webservers are configured to only provide three ip addresses (e.g. 0.0.0.{0,1,2}) all lines with different ips are accepted and sanitized to ip 0.0.0.0.

Or, should such lines be discarded?

Right now addresses are kept as long as they start with 0.0.0., which seems plausible to me. The spec draft should also say that.

Agreed. My question was a different one: what about log lines that contain other IPs (e.g., in case Apache suddenly logs an address like 11.22.33.44)? Currently these would be replaced by 0.0.0.0 and the sanitized lines kept.

comment:20 in reply to:  18 ; Changed 18 months ago by iwakeh

Replying to karsten:

Replying to iwakeh:

Similarly, only lines for 400 and 404 response codes are discarded. Any bogus response code will be kept.

Right, but I think 400 and 404 were the most problematic ones when we came up with the original plan for sanitizing these logs. Do you have any others in mind that we'd have to sanitize? If not, I'd say let's stick with 400 and 404 for now.

No, the other way around. As said above, invalid (= bogus) response codes will be kept. A better way to phrase this: we don't check whether a response code is a valid one; we only sift out the valid 400 and 404.

So, should I move forward with turning draft number five, numbered 4, into XML?

Yes :-)

comment:21 in reply to:  19 ; Changed 18 months ago by karsten

Replying to iwakeh:

Replying to karsten:

Replying to iwakeh:

Another addition:

Even though Tor's Apache webservers are configured to only provide three ip addresses (e.g. 0.0.0.{0,1,2}) all lines with different ips are accepted and sanitized to ip 0.0.0.0.

Or, should such lines be discarded?

Right now addresses are kept as long as they start with 0.0.0., which seems plausible to me. The spec draft should also say that.

Agreed. My question was a different one: what about log lines that contain other IPs (e.g., in case Apache suddenly logs an address like 11.22.33.44)? Currently these would be replaced by 0.0.0.0 and the sanitized lines kept.

Ah, hmm. I think that both the current script and the specification draft say that we drop any lines not starting with 0.0.0., but if a line matches that, we keep the 0.0.0.x address unchanged.

But I wonder if we should change that to "keep any address starting with 0.0.0. and replace everything else with 0.0.0.0". That way we could easily sanitize logs from web servers using different log formats that are compliant with Apache's Common Log Format. What do you think?

comment:22 in reply to:  20 ; Changed 18 months ago by karsten

Replying to iwakeh:

Replying to karsten:

Replying to iwakeh:

Similarly, only lines for 400 and 404 response codes are discarded. Any bogus response code will be kept.

Right, but I think 400 and 404 were the most problematic ones when we came up with the original plan for sanitizing these logs. Do you have any others in mind that we'd have to sanitize? If not, I'd say let's stick with 400 and 404 for now.

No, the other way around. As said above, invalid (= bogus) response codes will be kept. A better way to phrase this: we don't check whether a response code is a valid one; we only sift out the valid 400 and 404.

I'm not sure where this is going. Do you think we should check if status code is one that is currently defined by the HTTP protocol and only accept the ones that are, except for 400 and 404?

So, should I move forward with turning draft number five, numbered 4, into XML?

Yes :-)

Will do as soon as the two open questions are resolved.

comment:23 in reply to:  21 Changed 18 months ago by iwakeh

Replying to karsten:

Replying to iwakeh:

Replying to karsten:

Replying to iwakeh:

Another addition:

Even though Tor's Apache webservers are configured to only provide three ip addresses (e.g. 0.0.0.{0,1,2}) all lines with different ips are accepted and sanitized to ip 0.0.0.0.

Or, should such lines be discarded?

Right now addresses are kept as long as they start with 0.0.0., which seems plausible to me. The spec draft should also say that.

Agreed. My question was a different one: what about log lines that contain other IPs (e.g., in case Apache suddenly logs an address like 11.22.33.44)? Currently these would be replaced by 0.0.0.0 and the sanitized lines kept.

Ah, hmm. I think that both the current script and the specification draft say that we drop any lines not starting with 0.0.0., but if a line matches that, we keep the 0.0.0.x address unchanged.

But I wonder if we should change that to "keep any address starting with 0.0.0. and replace everything else with 0.0.0.0". That way we could easily sanitize logs from web servers using different log formats that are compliant with Apache's Common Log Format. What do you think?

Yes, that's what I would choose.
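
For illustration, a one-method sketch of the rule agreed on here (hypothetical, not the actual sanitizer code):

public class AddressRule {

  /** Keep addresses that already start with "0.0.0." unchanged and rewrite
   *  everything else to "0.0.0.0". */
  static String sanitizeAddress(String ipAddress) {
    return ipAddress.startsWith("0.0.0.") ? ipAddress : "0.0.0.0";
  }
}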

comment:24 in reply to:  22 Changed 18 months ago by iwakeh

Replying to karsten:

...

Similarly, only lines for 400 and 404 response codes are discarded. Any bogus response code will be kept.

Right, but I think 400 and 404 were the most problematic ones when we came up with the original plan for sanitizing these logs. Do you have any others in mind that we'd have to sanitize? If not, I'd say let's stick with 400 and 404 for now.

No, the other way around. As said above, invalid (= bogus) response codes will be kept. A better way to phrase this: we don't check whether a response code is a valid one; we only sift out the valid 400 and 404.

I'm not sure where this is going. Do you think we should check if status code is one that is currently defined by the HTTP protocol and only accept the ones that are, except for 400 and 404?

That's the question. Intuitively, I would not check for validity. After all, the server sets the status code, and that should not really jeopardize privacy, assuming that faulty codes simply happen by error, not by attack.

So, should I move forward with turning draft number five, numbered 4, into XML?

Yes :-)

Will do as soon as the two open questions are resolved.

comment:25 Changed 18 months ago by karsten

Alright, I'll turn the last draft we have into XML then.

comment:26 Changed 18 months ago by karsten

(incorporating the last feedback above -- just wanted to say that there are no open questions anymore)

Changed 18 months ago by karsten

Attachment: web-server-logs.pdf added

comment:27 Changed 18 months ago by karsten

Please review my task-23243 branch and this PDF print-out of the compiled web page.

comment:28 Changed 18 months ago by karsten

Remaining next steps are:

  • Code the decisions (iwakeh)
  • Try out the code on actual logs (iwakeh; karsten can make more logs available)
  • Send draft to tor-dev@ and ask for feedback (karsten)

comment:29 in reply to:  28 Changed 18 months ago by iwakeh

Replying to karsten:
From simply reading, the spec looks fine!

Replying to karsten:

Remaining next steps are:

  • Code the decisions (iwakeh)

This is not a step here, but tracked in #22836 and #22428.

  • Try out the code on actual logs (iwakeh; karsten can make more logs available)

Moved to the metrics-lib ticket, as logs are not needed here.

  • Send draft to tor-dev@ and ask for feedback (karsten)

The next step here. Should this wait until #22836 and #22428 are done or could this go out now?

comment:30 Changed 18 months ago by karsten

Did you mean #22836 above, or some other ticket?

Regarding the timing of asking for feedback, I could see us sending the request now or later. Do you think the implementation will reveal any important changes that would require sending an updated request for feedback? If not, we could send now (well, tomorrow). But we could as well wait a few more days or even a week if we have to.

When we ask for feedback, should we simply put what we have on the Metrics website together with a visible "beta" notice? Might be easier than sending out a PDF.

comment:31 Changed 18 months ago by karsten

(I believe you meant #22983 above, not #22836.)

comment:32 in reply to:  30 Changed 18 months ago by iwakeh

Replying to karsten:

Did you mean #22836 above, or some other ticket?

Yes, you guessed correctly. I removed my comment on the wrong ticket and added it to #22983.

Regarding the timing of asking for feedback, I could see us sending the request now or later. Do you think the implementation will reveal any important changes that would require sending an updated request for feedback? If not, we could send now (well, tomorrow). But we could as well wait a few more days or even a week if we have to.

True, then just wait and add this topic to the team meeting.
The changes could go in both directions, i.e., changes to the spec could need implementation changes. But, very likely these will be small.

When we ask for feedback, should we simply put what we have on the Metrics website together with a visible "beta" notice? Might be easier than sending out a PDF.

This is the best way to go about this. Maybe also add a link to the metrics mailing list on the beta-page for questions and concerns, in case users without a tor-dev-ml subscription view the page.

comment:33 Changed 18 months ago by karsten

Makes sense. Let's talk more on Thursday. Thanks!

comment:34 Changed 18 months ago by iwakeh

Status: needs_review → needs_information

There are two open questions:

  1. Should it be mentioned in section 2 of the spec that log files come in directories named as the physical host, i.e., meronense.torproject.org/metrics.torproject.org-access.log.20170707.log?
  2. As already visible in 1.: the files are expected to have ending '.log' or '.log.bz2' or some other compression?

Especially a clear answer for 2. is important for the implementation.

comment:35 in reply to:  34 ; Changed 18 months ago by karsten

Replying to iwakeh:

There are two open questions:

  1. Should it be mentioned in section 2 of the spec that log files come in directories named as the physical host, i.e., meronense.torproject.org/metrics.torproject.org-access.log.20170707.log?

Wait, there's no .log at the end of the file name. Example (from the server):

metrics.torproject.org-access.log-20170912.gz

Also note the - between access.log and the date.

  2. As already visible in 1.: the files are expected to have ending '.log' or '.log.bz2' or some other compression?

Especially a clear answer for 2. is important for the implementation.

I'd say the exact compression type is an implementation detail. See also the very last paragraph in the spec where we said: "Sanitized log files are typically compressed before publication. In particular the sorting step allows for highly efficient compression rates. We typically use XZ for compression, which is indicated by appending ".xz" to log file names, but this is subject to change." -- We could say something similar for logs that are provided to the sanitizer.

How about we add a new first paragraph to Section 3.1 (Discarding non-matching files):

"""
Log files are made available to the sanitizer in a separate directory per physical web server host. Log files are typically gz-compressed, which is indicated by appending ".gz" to log file names, but this is subject to change. Overall, the sanitizer expects log files to use the following path format:

<physical-host>/<virtual-host>.torproject.org-access.log-YYYYMMDD[.gz]
"""

And while we're at it, let's change "<hostname>.torproject.org-access.log-YYYYMMDD" in the last paragraph of Section 2 to "<virtual-host>.torproject.org-access.log-YYYYMMDD".
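
A hypothetical sketch of how the sanitizer might recognize the path format proposed in this comment (the regular expression is illustrative, not taken from any existing code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InputPathFormat {

  /* <physical-host>/<virtual-host>.torproject.org-access.log-YYYYMMDD[.gz] */
  private static final Pattern INPUT_PATH = Pattern.compile(
      "^([^/]+)/(.+)\\.torproject\\.org-access\\.log-(\\d{8})(\\.gz)?$");

  /** Returns a matcher whose groups are physical host, virtual host,
   *  reference date, and optional compression suffix. */
  static Matcher match(String relativePath) {
    return INPUT_PATH.matcher(relativePath);
  }
}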

comment:36 in reply to:  35 Changed 18 months ago by iwakeh

Thanks for the quick reply!

Replying to karsten:

Replying to iwakeh:

There are two open questions:

  1. Should it be mentioned in section 2 of the spec that log files come in directories named as the physical host, i.e., meronense.torproject.org/metrics.torproject.org-access.log.20170707.log?

Wait, there's no .log at the end of the file name. Example (from the server):

metrics.torproject.org-access.log-20170912.gz

Also note the - between access.log and the date.

  2. As already visible in 1.: the files are expected to have ending '.log' or '.log.bz2' or some other compression?

Especially a clear answer for 2. is important for the implementation.

I'd say the exact compression type is an implementation detail. See also the very last paragraph in the spec where we said: "Sanitized log files are typically compressed before publication. In particular the sorting step allows for highly efficient compression rates. We typically use XZ for compression, which is indicated by appending ".xz" to log file names, but this is subject to change." -- We could say something similar for logs that are provided to the sanitizer.

How about we add a new first paragraph to Section 3.1 (Discarding non-matching files):

"""
Log files are made available to the sanitizer in a separate directory per physical web server host. Log files are typically gz-compressed, which is indicated by appending ".gz" to log file names, but this is subject to change. Overall, the sanitizer expects log files to use the following path format:

<physical-host>/<virtual-host>.torproject.org-access.log-YYYYMMDD[.gz]
"""

And while we're at it, let's change "<hostname>.torproject.org-access.log-YYYYMMDD" in the last paragraph of Section 2 to "<virtual-host>.torproject.org-access.log-YYYYMMDD".

Sounds good. Minor adaptions:

Maybe change "<hostname>.torproject.org-access.log-YYYYMMDD" in the last paragraph of Section 2 to "<virtual-host>-access.log-YYYYMMDD" (analogously <physical-host>)? If not, should that be tested for and non-complying files rejected? I'd find that too restrictive.

And, an addition to 3.1: files with unknown compression format are discarded.

comment:37 Changed 18 months ago by karsten

Taking out the .torproject.org part makes sense, as does the addition about unknown compression formats.

I'll make those changes later today. Feel free to continue working on the implementation with these two adaptions in mind.

Changed 18 months ago by karsten

Attachment: web-server-logs-2.pdf added

comment:38 Changed 18 months ago by karsten

Status: needs_information → needs_review

There, please find my updated task-23243 branch and this PDF print-out. Hope this includes all changes as discussed above.

comment:39 Changed 18 months ago by iwakeh

Thanks for providing the PDF!

The 'contact' links in the 'beta-notice' commit contain http://localhost:8888?

Other than that I set this to merge-ready assuming it won't be published immediately ;-)

Last edited 18 months ago by iwakeh

comment:40 Changed 18 months ago by iwakeh

Status: needs_review → merge_ready

comment:41 in reply to:  39 Changed 18 months ago by karsten

Replying to iwakeh:

Thanks for providing the PDF!

The 'contact' links in the 'beta-notice' commit contain http://localhost:8888?

Uhmmm, right, fixed in a fixup-commit.

Other than that I set this to merge-ready assuming it won't be published immediately ;-)

Sure. Just ping me here when you think it should be published.

comment:42 Changed 17 months ago by karsten

Summary: write a spec for web-server-access log descriptors → Write a specification for Tor web server logs

Tweak summary.

comment:43 Changed 17 months ago by karsten

Keywords: metrics-2018 added

comment:44 Changed 17 months ago by karsten

Keywords: metrics-2017 added; metrics-2018 removed

comment:45 Changed 16 months ago by iwakeh

Status: merge_ready → needs_revision

The spec might need to be extended:

The implementation of the CollecTor webstats module triggered more questions about the way original logs are supplied. One piece of information that is so far only supplied indirectly is the cue for when a log is finished. In detail:

  • Functionality for bulk imports of log files is necessary. Thus, the implementation cannot rely on the system date anymore to decide when a log day is complete (distinguishing between the reference date as defined in the spec and the 'log for a day', which means all log lines for a given date are available).
  • Implicit assumption: input log files can be empty or not contain any valid lines as long as their naming pattern matches the rules.
  • The current spec allows only for one input log per reference date (per virtual plus physical host).
  • Log lines for a particular log day could be spread over two successive log files (as defined in the current spec).
  • Implicit cue: all log lines are available for a certain reference date when the log for the reference date and its successor are available. This also means a log for a day without an immediate successor is not complete, i.e. won't be processed. The cue in the form of a successor could be given as an empty successor log file. This cue has to be supplied from outside and cannot be determined by the implementation.

Related is another question from #22428 comment:36

Here's another, related question: what happens if a web server rotates logs more often than once per day? At least that's something that we write in the specification. I'm not sure how this would work with file names, so maybe we in fact require that logs are rotated exactly once per day, and we just didn't write that in the specification yet. However, it seems rather restrictive to prescribe exact log rotation intervals in order to sanitize logs subsequently. Maybe we should be less restrictive here.

It doesn't really matter if the log lines for a certain day are spread over two or more input files. Currently, only one input file per reference date is possible (the first wins).
More input files could be supplied by extending the input log name pattern with a dash followed by an integer, e.g., scrubbed.torproject.org-access.log-20171006-77.gz. In such a case it should be required that

  • counting starts with one (arbitrary).
  • there are no gaps, i.e., if there is a file with 3, there have to be files with 2 and 1 for the same virtual, physical host, and date combination.

Again, a cue is needed for when the log day is complete. As above, this could be the input file for the immediate successor by reference date, numbered 1. And this cue could be an empty file.

Remarks:
The way the cue is given is arbitrary, but the current implementation suggestion already works with the method described above.
The naming pattern is just an arbitrary suggestion. So improvements are welcome.
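
For illustration, a hypothetical sketch of the successor cue described above: a reference date counts as complete once a (possibly empty) input file for the following date exists in the same per-host directory. The file naming and method names are made up, not taken from the implementation.

import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class SuccessorCue {

  private static final DateTimeFormatter YYYYMMDD
      = DateTimeFormatter.ofPattern("yyyyMMdd");

  /** True once an input file for the day after the reference date exists,
   *  which signals that the log for the reference date is complete. */
  static boolean isComplete(Path hostDirectory, String virtualHost,
      LocalDate referenceDate) {
    String successor = virtualHost + "-access.log-"
        + referenceDate.plusDays(1).format(YYYYMMDD) + ".gz";
    return Files.exists(hostDirectory.resolve(successor));
  }
}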

comment:46 Changed 16 months ago by karsten

I'm not sure if we can resolve these questions by hard thinking. A few ideas:

  • Would it help to know the log and log rotation configuration used on the various Tor web servers?
  • Would it help to have access to the current host that sanitizes web server logs?
  • Does the existing code for sanitizing web server logs contain any more hints on the input data?

comment:47 Changed 16 months ago by iwakeh

Replying to karsten:

I'm not sure if we can resolve these questions by hard thinking.

Well, we need to work on thoughtful decision making. There're not that many questions above except yours:

... what happens if a web server rotates logs more often than once per day? At least that's something that we write in the specification. I'm not sure how this would work with file names, so maybe we in fact require that logs are rotated exactly once per day, and we just didn't write that in the specification yet. However, it seems rather restrictive to prescribe exact log rotation intervals in order to sanitize logs subsequently. Maybe we should be less restrictive here.

The current webstat code and the spec require a log per day. So, if someone decides to change the log rotation to more than that, the spec and code will have to be adapted. Thus, it seems this question refers to a hypothetical change (afaik). In comment:45 I point out that this is a small issue for implementation based on the reasoning that rotated logs usually add a number or a time or both to the log file name. Either way is easily adapted.

  • Would it help to know the log and log rotation configuration used on the various Tor web servers?

Unless you have reason to think that current logging procedures are going to change, or have even changed already, from the one-log-per-day scheme, this is not necessary.

  • Would it help to have access to the current host that sanitizes web server logs?

I think there are no questions regarding the current process.

  • Does the existing code for sanitizing web server logs contain any more hints on the input data?

We put all the information from the current code into the spec and the current implementation suggestion. The old code also uses a 'cue' (as mentioned in comment:45): sanitize.py returns the sanitized log file name for the day before the processed log file, which is the cue for the calling shell script that this file is now complete and can be published.

Both the old and suggested new version of webstats need an outside cue and without this cue an input log day would not be published.

Now focussing on the new implementation of the webstats module for CollecTor there are several ways of preventing log file loss:

  1. Make sure by outside means that there is no day without a log (e.g. by providing an empty file for that day using 'touch'). This would work without additional implementation for CollecTor and this works for bulk imports as well as daily processing. As a result there will be a sanitized log for each day offered by CollecTor, some might be empty.
  2. For bulk processing a property could signal CollecTor to use all logs without insisting on an uninterrupted chain. This still requires outside measures for making sure no log lines are lost and might result in days without any logs, unless CollecTor creates empty ones.
  3. Think out a mechanism that enables more automated processing of an interrupted chain of logs. This seems error prone and will result in many edge cases.

I think 1. is the easiest in terms of operation, i.e., providing input logs, and implementation (it's there already). In addition, the uninterrupted chain of (possibly empty) sanitized logs is also easy to verify and understand. An empty file could result from no log line being valid or no log being available for that day.

So, in order to move forward, one of the above methods needs to be chosen (or a new one made up).
The other question about shorter log rotation intervals is only relevant if that is put into practice. If so, it should be a straightforward task to adapt the code.

Hope this makes some sense. Is there anything else missing here?

comment:48 in reply to:  47 ; Changed 16 months ago by karsten

Replying to iwakeh:

Replying to karsten:

I'm not sure if we can resolve these questions by hard thinking.

Well, we need to work on thoughtful decision making.

Unfortunately, I'm not a good person these days to dive deep enough into this topic to make thoughtful decisions. Too many topics, too little time to do any of them well enough. That's why I hoped you'd just solve all problems here and I could then review the solution. :)

There're not that many questions above except yours:

... what happens if a web server rotates logs more often than once per day? At least that's something that we write in the specification. I'm not sure how this would work with file names, so maybe we in fact require that logs are rotated exactly once per day, and we just didn't write that in the specification yet. However, it seems rather restrictive to prescribe exact log rotation intervals in order to sanitize logs subsequently. Maybe we should be less restrictive here.

The current webstat code and the spec require a log per day.

Well, no. The spec says "Tor's web servers are configured to rotate logs at least once per day". If we didn't mean that, let's phrase it differently. But how?

And we should write down possible failure modes for the case that logs are rotated less often or more often.

In any case, we should warn in case we run into one of these cases, rather than silently continuing operation and simply producing fewer/smaller sanitized logs.

[...]

  1. Make sure by outside means that there is no day without a log (e.g. by providing an empty file for that day using 'touch'). This would work without additional implementation for CollecTor and this works for bulk imports as well as daily processing. As a result there will be a sanitized log for each day offered by CollecTor, some might be empty.

I'd say we need to do something that doesn't require any upstream changes. In other words, whatever ends up in in/webstats/ is what we should be able to work with. We shouldn't require upstream to touch files for us.

  2. For bulk processing a property could signal CollecTor to use all logs without insisting on an uninterrupted chain. This still requires outside measures for making sure no log lines are lost and might result in days without any logs, unless CollecTor creates empty ones.
  3. Think out a mechanism that enables more automated processing of an interrupted chain of logs. This seems error prone and will result in many edge cases.

I don't know, maybe we can do something with system time or state files. Or we could process everything in in/webstats/ and write everything to out/ and recent/ except the first and last encountered UTC days. Just some ideas.

Again, I'm not deep enough into this to make a good decision. I just hope that whatever thing we'll build here is robust enough to either handle all of the cases or warns loudly whenever it runs into an unforeseen case.

I'm very concerned about silently losing data. That's the worst thing that could happen to us, in particular given that we don't keep archives of the input data in this case.

comment:49 in reply to:  48 Changed 16 months ago by iwakeh

Replying to karsten:

Replying to iwakeh:

Replying to karsten:

I'm not sure if we can resolve these questions by hard thinking.

Well, we need to work on thoughtful decision making.

Unfortunately, I'm not a good person these days to dive deep enough into this topic to make thoughtful decisions. Too many topics, too little time to do any of them well enough. That's why I hoped you'd just solve all problems here and I could then review the solution. :)

There are always many solutions. So, you'd also need to be involved in choosing ;-)

...

The current webstat code and the spec require a log per day.

Well, no. The spec says "Tor's web servers are configured to rotate logs at least once per day". If we didn't mean that, let's phrase it differently. But how?

The spec indirectly allows at most one log per day and virtual-physical host combination by its naming requirement <physical-host>/<virtual-host>-access.log-YYYYMMDD[.gz]. Combining this with the at-least-once-per-day rule yields exactly one log per day. It might be useful to write this explicitly.
(If more logs for the same date are supplied, the first wins.)

And we should write down possible failure modes for the case that logs are rotated less often or more often.

The naming requirements don't allow for more than one log per day.
Logs that don't match the requirements are dropped (which can be logged as warning but that's part of the implementation not part of the spec).

In any case, we should warn in case we run into one of these cases, rather than silently continuing operation and simply producing fewer/smaller sanitized logs.

There will be warnings in the logs if the naming of logs suddenly no longer complies with the spec's requirements. And that would inevitably happen if logs are rotated more often than once per day.

[...]

  1. Make sure by outside means that there is no day without a log (e.g. by providing an empty file for that day using 'touch'). This would work without additional implementation for CollecTor and this works for bulk imports as well as daily processing. As a result there will be a sanitized log for each day offered by CollecTor, some might be empty.

I'd say we need to do something that doesn't require any upstream changes. In other words, whatever ends up in in/webstats/ is what we should be able to work with. We shouldn't require upstream to touch files for us.

  2. For bulk processing a property could signal CollecTor to use all logs without insisting on an uninterrupted chain. This still requires outside measures for making sure no log lines are lost and might result in days without any logs, unless CollecTor creates empty ones.
  3. Think out a mechanism that enables more automated processing of an interrupted chain of logs. This seems error prone and will result in many edge cases.

I don't know, maybe we can do something with system time or state files. Or we could process everything in in/webstats/ and write everything to out/ and recent/ except the first and last encountered UTC days. Just some ideas.

Again, I'm not deep enough into this to make a good decision. I just hope that whatever thing we'll build here is robust enough to either handle all of the cases or warns loudly whenever it runs into an unforeseen case.

I'm very concerned about silently losing data. That's the worst thing that could happen to us, in particular given that we don't keep archives of the input data in this case.

From the above I infer that you think we should opt for number 3, i.e., automate as much as possible, warn whenever there is some unforeseen naming etc.
Is this correct?

If that's the case, my next steps would be to provide a patch for the spec for making the one log per day requirement explicit and also discuss what happens if logs are missing; and secondly, based on that I will extend the implementation in #22428 to address the new requirements.

Does that sound like a plan?

comment:50 Changed 16 months ago by karsten

While reviewing our discussion above I discovered another weakness in our specification: our naming convention for sanitized log files does not take into account that host names may include dashes. For example, there are virtual hosts like cdn-fastly-backend.torproject.org and physical hosts like oo-hetzner-03.torproject.org, which we would combine to cdn-fastly-backend.torproject.org-oo-hetzner-03.torproject.org-access.log-20171030.xz. Where does the virtual host name end and where does the physical host name begin?

We might consider changing the naming convention to something like <virtual-host>-access.log-<physical-host>-YYYYMMDD[.xz], but even for that we might find host names producing file names that cannot be parsed unambiguously. Maybe we'll have to return to putting physical host names in parent directory names and only virtual host names in file names. Hmm.

But going back to the discussion above, I don't think we can make assumptions that would allow us to implement your suggestions 1 and 2 above. Yet I don't see how suggestion 3 would be more error prone. It's up to us to design something that is robust, so we'll have to go through the possible edge cases and be prepared to handle them.

And talking about assumptions, I feel like our one-log-per-day assumption is unnecessarily strong. I do agree that the naming requirement only permits one log file per physical host, virtual host, and date. But it seems like it should be up to the web server operators to decide to rotate logs only once per day or more often to keep log files small.

Here's an idea to reduce the number of edge cases: how about we simply ignore the date in the input log file name and only rely on the date given in each of the contained log lines?

As stated earlier, we could process everything in in/webstats/ and write everything to out/ and recent/ except the first and last encountered UTC days. In theory, we could even drop the log rotation requirement entirely.

This approach should work just fine for bulk processing. And for running several times per day we could keep a state file to avoid re-processing input files that we already processed before, by storing file name, last-modified time, and last contained UTC date. Worst case if we lose that state file is that we'll read everything in in/webstats/ once again.

Do you see any conceptual weaknesses in ignoring the date in input files? Do you want to give this implementation a try? Otherwise I'd be willing to write some proof-of-concept code to see whether this can work.
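
A hypothetical sketch of such a state file's in-memory form, keyed by input file name with last-modified time and last contained UTC date (all names are made up for illustration, not taken from the implementation):

import java.time.Instant;
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

public class ProcessedInputState {

  /** One entry per previously processed input file. */
  static class Entry {
    final Instant lastModified;
    final LocalDate lastContainedDate;

    Entry(Instant lastModified, LocalDate lastContainedDate) {
      this.lastModified = lastModified;
      this.lastContainedDate = lastContainedDate;
    }
  }

  private final Map<String, Entry> entries = new HashMap<>();

  /** Skip files that were processed before and have not changed since. */
  boolean alreadyProcessed(String fileName, Instant lastModified) {
    Entry entry = entries.get(fileName);
    return entry != null && entry.lastModified.equals(lastModified);
  }

  void markProcessed(String fileName, Instant lastModified,
      LocalDate lastContainedDate) {
    entries.put(fileName, new Entry(lastModified, lastContainedDate));
  }
}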

comment:51 in reply to:  50 Changed 16 months ago by iwakeh

Replying to karsten:

While reviewing our discussion above I discovered another weakness in our specification: our naming convention for sanitized log files does not take into account that host names may include dashes. For example, there are virtual hosts like cdn-fastly-backend.torproject.org and physical hosts like oo-hetzner-03.torproject.org, which we would combine to cdn-fastly-backend.torproject.org-oo-hetzner-03.torproject.org-access.log-20171030.xz. Where does the virtual host name end and where does the physical host name begin?

Yikes! This is really bad, but good to become aware of it now.

We might consider changing the naming convention to something like <virtual-host>-access.log-<physical-host>-YYYYMMDD[.xz], but even for that we might find host names producing file names that cannot be parsed unambiguously. Maybe we'll have to return to putting physical host names in parent directory names and only virtual host names in file names. Hmm.

How much influence do we have on the naming of input files?
Who decides about input file naming and structuring?

For the output files, we could use other separators, maybe the underscore?

But going back to the discussion above, I don't think we can make assumptions that would allow us to implement your suggestions 1 and 2 above. Yet I don't see how suggestion 3 would be more error prone. It's up to us to design something that is robust, so we'll have to go through the possible edge cases and be prepared to handle them.

I just want to point out that many questions or assumptions I asked about are not based on my preferences, but only on the fact that implementation and design need unambiguous information. Thus, a valid answer to 1 to 3 above is that the assumptions of 1 and 2 are not valid. That's perfectly fine.

And talking about assumptions, I feel like our one-log-per-day assumption is unnecessarily strong. I do agree that the naming requirement only permits one log file per physical host, virtual host, and date. But it seems like it should be up to the web server operators to decide to rotate logs only once per day or more often to keep log files small.

As above:
How much influence do we have on the structuring of logs?

Again, I pointed out above (comment:45) that the implementation could accommodate this easily. But, for a start there has to be a valid description of the log names to be expected.

Here's an idea to reduce the number of edge cases: how about we simply ignore the date in the input log file name and only rely on the date given in each of the contained log lines?

As stated earlier, we could process everything in in/webstats/ and write everything to out/ and recent/ except the first and last encountered UTC days. In theory, we could even drop the log rotation requirement entirely.

I'm not really worried about the edge cases.

Initially, the date in the log name was used to ignore log lines that actually belong to older logs. This was introduced because of the current way of processing in the shell-python implementation.

A second requirement is that log files shouldn't change once published. This makes accepting the last UTC date more difficult. There has to be information about when a log file is ready to be published. And so far we used the log's reference date for that purpose.
Again, I don't have a preference, but the topic needs to be solved before implementation.

This approach should work just fine for bulk processing. And for running several times per day we could keep a state file to avoid re-processing input files that we already processed before, by storing file name, last-modified time, and last contained UTC date. Worst case if we lose that state file is that we'll read everything in in/webstats/ once again.

Do you see any conceptual weaknesses in ignoring the date in input files? Do you want to give this implementation a try? Otherwise I'd be willing to write some proof-of-concept code to see whether this can work.

How is the "when is a log ready for publication" question solved here?
Thinking about the processing details or introducing a performance measure like the state file is taking the third step before the first.
Or, are you implying that we should change the requirement of not altering sanitized logs once they're published?

Conclusion:
The implementation is not difficult either way. Currently the specification (on which the implementation needs to rely) is a moving target. I think we should make sure that our specification is correct and answers all questions. When there is a solid specification, the implementation follows easily.

comment:52 Changed 16 months ago by karsten

A few thoughts:

  • I don't have a good answer to the question of putting virtual and physical host name into a file name. Underscores might work. Trying to solve the bigger issue first.
  • We have little influence on the naming of input files. We can define reasonable requirements that are ideally met by existing input files, and we can reject future input files not matching these requirements.
  • I don't imply that we should alter sanitized log files after publication. We shouldn't. That would be pretty bad.
  • The specification is a moving target, because it was ambiguous and the implementation would have been fragile. It's a valid use case that a physical host does not see a single request for a given virtual host for days, and the implementation (and specification) did not cover that case.

But let's look at the idea to process all input files and produce output files for all dates except the first and last UTC days. And let's ignore performance considerations for now. Why would it not solve the "when is a log ready for publication" question? Can you give an example?

Not sure if this is what you have in mind, but I think we cannot handle the case of log files "in the middle" being missing in one run and being present in a subsequent run. For example, if we receive input files with requests from the following dates:

  • 2017-11-01 and 2017-11-02
  • 2017-11-02 and 2017-11-03
  • (gap)
  • 2017-11-04 and 2017-11-05
  • 2017-11-05 and 2017-11-06

We would produce output files for:

  • (skip 2017-11-01, because first UTC date)
  • 2017-11-02
  • 2017-11-03
  • 2017-11-04
  • 2017-11-05
  • (skip 2017-11-06, because last UTC date)

Now, if we later find another file filling the gap with the following contained request dates:

  • 2017-11-03 and 2017-11-04

We couldn't update the output files for 2017-11-03 and 2017-11-04 anymore! We would simply leave them unchanged, containing just the requests we processed earlier.

But is this a bug we should be able to handle? It seems like a bug in the log-copying script combined with bad timing. During normal operation and in the bulk-import case this should not happen.

Note that if you think that cutting off the first and last days is not enough, we could easily change that to cutting off the first and last two days. Or the first and last three. Whatever we think works best.
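
To illustrate the cut-off idea (just a sketch with made-up names; the number of skipped days sits in a single constant so it's easy to adjust):

import java.time.LocalDate;
import java.util.SortedSet;
import java.util.TreeSet;

public class CutoffSketch {

  /** Days to skip at each end of the covered interval; easy to adjust. */
  static final int CUTOFF_DAYS = 1;

  /** Returns the dates for which sanitized output files may be written. */
  static SortedSet<LocalDate> writableDates(SortedSet<LocalDate> containedDates) {
    SortedSet<LocalDate> result = new TreeSet<>(containedDates);
    for (int i = 0; i < CUTOFF_DAYS; i++) {
      if (!result.isEmpty()) {
        result.remove(result.first());
      }
      if (!result.isEmpty()) {
        result.remove(result.last());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    SortedSet<LocalDate> contained = new TreeSet<>();
    for (int day = 1; day <= 6; day++) {
      contained.add(LocalDate.of(2017, 11, day));
    }
    // Prints [2017-11-02, 2017-11-03, 2017-11-04, 2017-11-05].
    System.out.println(writableDates(contained));
  }
}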

comment:53 in reply to: 52; Changed 16 months ago by iwakeh

Replying to karsten:

A few thoughts:

  • I don't have a good answer to the question of putting virtual and physical host name into a file name. Underscores might work. Trying to solve the bigger issue first.

Yes, just using a placeholder for spec and implementation is fine here.

  • We have little influence on the naming of input files. We can define reasonable requirements that are ideally met by existing input files, and we can reject future input files not matching these requirements.

Ok, I see. From that it seems we should draw as little information as possible from the combination of input file name and path; the two host names seem to be the absolute minimum, as they cannot be inferred from the logs' contents.
So far, the reference date from the log name was used for deciding which log line dates were valid. If there was no reference date, all log line dates from a single input log would be accepted.

  • I don't imply that we should alter sanitized log files after publication. We shouldn't. That would be pretty bad.

True.

  • The specification is a moving target, because it was ambiguous and the implementation would have been fragile. It's a valid use case that a physical host does not see a single request for a given virtual host for days, and the implementation (and specification) did not cover that case.

No offense intended, I didn't mean 'moving target spec' as an accusation. It's normal that requirements change and grow. Currently it is important to put emphasis on the spec, because that is where our open questions are. And the implementation should cover exactly the spec, neither more nor less.

But let's look at the idea to process all input files and produce output files for all dates except the first and last UTC days. And let's ignore performance considerations for now. Why would it not solve the "when is a log ready for publication" question? Can you give an example?

The 'gap(s)' (as you name it below) would be the problem.

Not sure if this is what you have in mind, but I think we cannot handle the case of log files "in the middle" being missing in one run and being present in a subsequent run. For example, if we receive input files with requests from the following dates:

  • 2017-11-01 and 2017-11-02
  • 2017-11-02 and 2017-11-03
  • (gap)
  • 2017-11-04 and 2017-11-05
  • 2017-11-05 and 2017-11-06

We would produce output files for:

  • (skip 2017-11-01, because first UTC date)
  • 2017-11-02
  • 2017-11-03
  • 2017-11-04
  • 2017-11-05
  • (skip 2017-11-06, because last UTC date)

Now, if we later find another file filling the gap with the following contained request dates:

  • 2017-11-03 and 2017-11-04

We couldn't update the output files for 2017-11-03 and 2017-11-04 anymore! We would simply leave them unchanged, containing just the requests we processed earlier.

How would we find out that 2017-11-04 is already there? Only a lookup in 'out' and 'recent' could tell.

But is this a bug we should be able to handle? It seems like a bug in the log-copying script combined with bad timing. During normal operation and in the bulk-import case this should not happen.

During a bulk import it might be harder to guarantee the correct order. Hmm, but that should be manageable somehow ...

Note that if you think that cutting off the first and last days is not enough, we could easily change that to cutting off the first and last two days. Or the first and last three. Whatever we think works best.

That cut-off time could be kept variable and be adjusted later, true.

Summary:

  • Only hostnames are inferred from the logs' names and paths.
  • The 'reference date' used in the current spec is dropped.
  • Only the log line dates covered in one run form the reference interval, of which a certain number of days at the beginning and end are not processed (aka cut-off time).
  • Sanitized files for dates that are already available in 'out' are not overwritten or amended, and the corresponding log lines are ignored.
  • Gaps in imported logs cannot be filled in later.
  • File provision for (bulk) imports has to ensure proper order.
  • Use a placeholder separator for sanitized log file names (starting with the underscore, but easily changeable).

Does this seem solid?
Shall I amend the spec with these changes?

Last edited 16 months ago by iwakeh

comment:54 in reply to:  53 Changed 16 months ago by karsten

Replying to iwakeh:

Replying to karsten:

We couldn't update the output files for 2017-11-03 and 2017-11-04 anymore! We would simply leave them unchanged, containing just the requests we processed earlier.

How to find out that 2017-11-04 is already there: only the lookup in 'out' and 'recent' could tell.

Correct, plus maybe a 'state' file with previously sanitized logs (implementation detail). But how else would we find out that we already sanitized and published a log file? That's something only we can know ourselves.

But is this a bug we should be able to handle? It seems like a bug in the log-copying script combined with bad timing. During normal operation and in the bulk-import case this should not happen.

During a bulk import it might be harder to guarantee the correct order. Hmm, but that should be manageable somehow ...

Oh! Now I understand what you mean by bulk import. Like all log files for August, then all log files for September, etc.? Yes, that would be problematic. I'd say that one would have to supply some days from the previous and the next interval and discard any output files for days outside of the currently processed interval. That could work. But it's error-prone. And it's not exactly our use case, because we only have logs from the past 2 weeks.

But regardless of bulk import or not, I think we'll have to parse input log files twice: once to find out which dates are contained and a second time to sanitize them. But that's an implementation detail.
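
Roughly like this (just a sketch; the date extraction assumes a timestamp like "[30/Oct/2017:00:00:00 +0000]" in each line, and the actual per-line rewriting is left out):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TwoPassSketch {

  private static final DateTimeFormatter LOG_DATE =
      DateTimeFormatter.ofPattern("dd/MMM/yyyy", Locale.ENGLISH);

  /** Extracts the request date from a timestamp like
   *  "[30/Oct/2017:00:00:00 +0000]". */
  static LocalDate requestDate(String line) {
    int start = line.indexOf('[') + 1;
    return LocalDate.parse(line.substring(start, line.indexOf(':', start)),
        LOG_DATE);
  }

  /** First pass: collect all request dates contained in an input log. */
  static SortedSet<LocalDate> collectDates(Path inputLog) throws IOException {
    SortedSet<LocalDate> dates = new TreeSet<>();
    try (Stream<String> lines = Files.lines(inputLog)) {
      lines.map(TwoPassSketch::requestDate).forEach(dates::add);
    }
    return dates;
  }

  /** Second pass: keep only lines whose date may be written out; the actual
   *  rewriting of each line is omitted from this sketch. */
  static List<String> linesToSanitize(Path inputLog,
      SortedSet<LocalDate> writableDates) throws IOException {
    try (Stream<String> lines = Files.lines(inputLog)) {
      return lines.filter(line -> writableDates.contains(requestDate(line)))
          .collect(Collectors.toList());
    }
  }
}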

Note that if you think that cutting off the first and last days is not enough, we could easily change that to cutting off the first and last two days. Or the first and last three. Whatever we think works best.

That cut-off time could be kept variable and be adjusted later, true.

Good idea.

Summary:

  • Only hostnames are inferred from the logs' names and paths.
  • The 'reference date' used in the current spec is dropped.
  • Only the log line dates covered in one run form the reference interval, of which a certain number of days at the beginning and end are not processed (aka cut-off time).
  • Sanitized files for dates that are already available in 'out' are not overwritten or amended, and the corresponding log lines are ignored.
  • Gaps in imported logs cannot be filled in later.
  • File provision for (bulk) imports has to ensure proper order.
  • Use a placeholder separator for sanitized log file names (starting with the underscore, but easily changeable).

Does this seem solid?
Shall I amend the spec with these changes?

It seems solid! Yes, please! Thanks!

comment:55 Changed 16 months ago by karsten

Quick thought while reading an unrelated ticket: maybe we should change the output file name format to <physical-host>_<virtual-host>-access.log-YYYYMMDD[.xz] to stay closer to the expected input file name format <physical-host>/<virtual-host>-access.log-YYYYMMDD[.gz]. I know that we previously said that most people would want to sort by virtual host. But this scheme is a lot easier to implement and will have potentially fewer bugs, because we'll just have to cut off the input file name after -access.log and prefix it with the parent directory name plus _. And if somebody really just cares about one virtual host, they can easily grep the list of files. End of random thought.
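
A quick sketch of what I mean (made-up method names; the date in the output name would come from the contained request lines, so it's passed in separately):

import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class OutputNameSketch {

  /** Derives a sanitized log file name from an input path like
   *  <physical-host>/<virtual-host>-access.log-YYYYMMDD[.gz]. */
  static String outputFileName(Path inputLog, LocalDate contentDate) {
    String physicalHost = inputLog.getParent().getFileName().toString();
    String fileName = inputLog.getFileName().toString();
    // Cut off everything after "-access.log", i.e. the input date and suffix.
    String virtualHostPart = fileName.substring(0,
        fileName.indexOf("-access.log") + "-access.log".length());
    return physicalHost + "_" + virtualHostPart + "-"
        + contentDate.format(DateTimeFormatter.BASIC_ISO_DATE) + ".xz";
  }

  public static void main(String[] args) {
    // Prints oo-hetzner-03.torproject.org_cdn-fastly-backend.torproject.org
    //     -access.log-20171030.xz (all on one line).
    System.out.println(outputFileName(
        Paths.get("oo-hetzner-03.torproject.org",
            "cdn-fastly-backend.torproject.org-access.log-20171030.gz"),
        LocalDate.of(2017, 10, 30)));
  }
}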

comment:56 Changed 16 months ago by iwakeh

Please review the changes in my spec revision branch, which is based on your 23243-branch (even the commit from this morning :-)

I introduced the underscore as the placeholder separator between all parts of the file names.
Using the virtual host as the first naming component should stay, because the internal physical hosts are not really known to the public.

Hope this pushes this ticket to completion.

comment:57 Changed 16 months ago by iwakeh

Status: needs_revision → needs_review

comment:58 Changed 16 months ago by iwakeh

Priority: Medium → High

comment:59 Changed 16 months ago by karsten

Looks good! I made some tweaks in this commit which is based on your branch.

I asked Sebastian to take a look at this specification, and he agreed. Should I squash and merge and deploy with an updated beta notice?

comment:60 in reply to:  59 Changed 16 months ago by iwakeh

Replying to karsten:

Looks good! I made some tweaks in this commit which is based on your branch.

Thanks for reviewing!
Tweaks look fine.

I asked Sebastian to take a look at this specification, and he agreed. Should I squash and merge and deploy with an updated beta notice?

Fine!

comment:61 Changed 16 months ago by karsten

Squashed and merged and deployed here. Yay!

Ready to close this ticket, or should we keep it open until we're confident enough that we can remove the beta notice?

comment:62 Changed 16 months ago by karsten

Err, let's maybe at least keep it open for possible feedback.

comment:63 Changed 16 months ago by Sebastian

I read the specification, I didn't spot any errors. I think maybe the sorting deserves a bit more prominence, maybe with a sentence explaining that it's not possible to aggregate identical lines without breaking log analysis tools?

Last edited 16 months ago by Sebastian

comment:64 in reply to:  63 Changed 16 months ago by karsten

Status: needs_review → needs_revision

Replying to Sebastian:

I read the specification, I didn't spot any errors.

Thanks for checking!

I think maybe the sorting deserves a bit more prominence, maybe with a sentence explaining that it's not possible to aggregate identical lines without breaking log analysis tools?

Adding such a sentence sounds doable.

But I'm unsure how to make that sorting step more prominent.

Like, we cannot move that part elsewhere in the spec, because the sorting really happens as one of the last steps of the sanitizing process.

And I'd rather avoid using markup like writing this paragraph in italics or boldface. (In fact, I'm not even sure how to do that in the XML we're using, though I'm not ruling out that it's possible.)

What we could do is be more verbose about this step and explain why it's so important to sort lines.

Would you maybe want to give this a try and rewrite that paragraph and also add the sentence explaining that aggregating lines would break tools?

comment:65 Changed 16 months ago by iwakeh

Assuming Sebastian's suggestion means to make the importance of sorting for guarding user privacy more visible, I can think of the following options here:

  • state the importance of sorting the log lines by a criterion unrelated to the time of the event that caused the entry (e.g., sorting alphabetically)
  • and state this more prominently at the beginning of section 4.3. The paragraph could start:
    Rewritten log lines are re-assembled into sanitized log files based on physical host, virtual host, and request start date.
    
    All rewritten log lines are sorted alphabetically, so that request order cannot be inferred from sanitized log files.
    
    The naming convention for ...
    

And, the sentence starting with "As last and certainly not least ..." can be removed.

If this addresses the concern of comment:63, I can provide a patch.

comment:66 in reply to:  63 Changed 16 months ago by iwakeh

Replying to Sebastian:

I read the specification, I didn't spot any errors. I think maybe the sorting deserves a bit more prominence, maybe with a sentence explaining that it's not possible to aggregate identical lines without breaking log analysis tools?

Do you mean that by removing identical lines, instead of simply sorting alphabetically and keeping them, typical log analyzers would lose too much information?

comment:67 Changed 16 months ago by Sebastian

Yes, that would lose all access to information about the number of requests. We cannot put a number in front of a line to indicate how often it is repeated, because that isn't what the analysis tools expect.

comment:68 in reply to:  67 Changed 16 months ago by iwakeh

Replying to Sebastian:

Yes, that would lose all access to information about the amount of requests. We cannot put a number in front of a line to indicate how often it is repeated because that isn't what the analysis tools expect.

Ah, ok. So, explaining this in addition to what I suggested above (comment:65) would be the change you feel is necessary?
E.g., at the beginning of 4.3:

Rewritten log lines are re-assembled into sanitized log files based on physical host, virtual host, and request start date.

All rewritten log lines are sorted alphabetically, so that request order cannot be inferred from sanitized log files. Many of the sanitized log lines will now be identical. But in order not to remove too much useful information, we keep the identical log lines and thus enable typical web log analyzers to operate on the sanitized log files.
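
Just to illustrate the intended behavior (a minimal sketch with made-up example lines; reading and writing of actual files is omitted):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SortKeepDuplicates {
  public static void main(String[] args) {
    List<String> rewrittenLines = Arrays.asList(
        "0.0.0.0 - - [30/Oct/2017:00:00:00 +0000] \"GET /index.html HTTP/1.1\" 200 1024",
        "0.0.0.0 - - [30/Oct/2017:00:00:00 +0000] \"GET / HTTP/1.1\" 200 512",
        "0.0.0.0 - - [30/Oct/2017:00:00:00 +0000] \"GET /index.html HTTP/1.1\" 200 1024");
    // Sort alphabetically so that the original request order is lost, but keep
    // identical lines so that request counts remain available to log analyzers.
    List<String> sanitized = rewrittenLines.stream().sorted()
        .collect(Collectors.toList());
    sanitized.forEach(System.out::println);
  }
}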

comment:69 Changed 16 months ago by Sebastian

I don't think it's *necessary*, just additionally helpful, yes.

comment:70 in reply to:  69 Changed 16 months ago by iwakeh

Replying to Sebastian:

I don't think it's *necessary*, just additionally helpful, yes.

Necessary for improving the spec :-)
Thanks!

I'll prepare a patch.

comment:71 Changed 16 months ago by iwakeh

Status: needs_revision → needs_review

Please find and review the changes in this patch commit.

comment:72 Changed 16 months ago by karsten

Looks good, merged and deployed. Thanks! Ready to close this ticket?

comment:73 Changed 16 months ago by iwakeh

Closing seems fine; if there are complaints, additions, corrections, or extensions, there should be a new ticket.
Disclaimer: I provided the latest patch.

comment:74 Changed 16 months ago by karsten

Resolution: implemented
Status: needs_review → closed

Okay, great. Closing!
