Ticket #23243: webstats-spec.txt

File webstats-spec.txt, 6.3 KB (added by karsten, 3 years ago)

karsten's first draft

Tor webserver logs
Tor's webservers, like most webservers, keep request logs for maintenance and informational purposes.
However, unlike most other webservers, Tor's webservers use a privacy-aware log format that avoids logging overly sensitive data about their users.
Also unlike most other webserver logs, Tor's logs are neither archived nor analyzed until a number of postprocessing steps have further reduced any privacy-sensitive parts.
This document describes 1) the privacy-aware log format used on Tor's webservers and 2) the subsequent sanitizing steps that are applied before archiving and analyzing these log files.
As a convention for this document, all format strings conform to the format strings used by Apache's mod_log_config module (http://httpd.apache.org/docs/current/mod/mod_log_config.html).
# Privacy-aware log format
Tor's Apache webservers are configured to write log files that extend Apache's Combined Log Format with a couple of privacy tweaks. For example, the following Apache configuration lines were in use at the time of writing (subject to change):
LogFormat " - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacy
LogFormat " - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyssl
LogFormat " - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyhs
The main difference from Apache's Common Log Format is that request IP addresses are removed and the field is instead used to encode whether the request came in via http://, via https://, or via the site's onion service.
Tor's webservers are configured to rotate logs at least once per day, which does not necessarily happen at 00:00:00 UTC. As a result, a log file may contain requests from up to two UTC days, and several log files may contain requests that were started on the same UTC day.
All access log files written by Tor's webservers follow the naming convention <hostname>.torproject.org-access.log-YYYYMMDD.
# Sanitizing steps
The request logs written by Tor's webservers still contain too many details that we are uncomfortable publishing. Therefore, we apply a couple of sanitizing steps to these log files before making them public and analyzing them ourselves. Some of these steps could just as well be performed directly by Apache, but others can only be performed with a delay.
## Discarding non-matching files
As a first safeguard against publishing log files that are too sensitive, we discard all files not matching the naming convention for access logs. This is to prevent, for example, error logs from slipping through.
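
The file name check could be sketched as follows. This is only an illustration, not the actual implementation: the function name `is_access_log` and the set of characters allowed in `<hostname>` are assumptions, since this document does not define them.

```python
import re

# Assumed pattern for "<hostname>.torproject.org-access.log-YYYYMMDD".
# The characters permitted in <hostname> are a guess; the date part is only
# checked for being eight digits, not for being a valid calendar date.
ACCESS_LOG_NAME = re.compile(
    r"[A-Za-z0-9.-]+\.torproject\.org-access\.log-\d{8}"
)


def is_access_log(filename: str) -> bool:
    """Return True if the file name follows the access log naming convention."""
    return ACCESS_LOG_NAME.fullmatch(filename) is not None
```

With this check, a file named like an error log (for example one ending in `-error.log-YYYYMMDD`) would be discarded before any of its lines are read.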
## Discarding non-matching lines
Log files are expected to contain exactly one request per line. We process these files line by line and check each line against the following criteria:
 - Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s %b") or a compatible format like one of Tor's privacy formats. It is acceptable if lines start with a format that is compatible with the Common Log Format and continue with additional fields. Those additional fields will later be discarded, but the line will not be discarded because of them.
 - The request IP address starts with "0.0.0.", followed by any number between 0 and 255.
 - The request protocol is HTTP.
 - The request method is either GET or HEAD.
 - The final status of the request is neither 400 ("Bad Request") nor 404 ("Not Found").
Any lines not meeting all these criteria will be discarded, and processing continues with the next line.
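
As a rough sketch, the criteria above could be checked like this. The regular expression, the function name `matches_criteria`, and the interpretation of "the request protocol is HTTP" as "the protocol field starts with HTTP" are all assumptions made for illustration.

```python
import re

# Common Log Format prefix (%h %l %u %t "%r" %>s %b), optionally followed by
# additional fields. This expression is an assumption, not taken from the spec;
# it does not handle request lines containing spaces or escaped quotes.
CLF_LINE = re.compile(
    r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: .*)?$'
)
PLACEHOLDER_HOST = re.compile(r"0\.0\.0\.(\d{1,3})")


def matches_criteria(line: str) -> bool:
    """Return True if the line should be kept for rewriting."""
    m = CLF_LINE.match(line)
    if m is None:
        return False  # not compatible with the Common Log Format
    host = PLACEHOLDER_HOST.fullmatch(m.group("host"))
    if host is None or int(host.group(1)) > 255:
        return False  # address is not "0.0.0." followed by 0..255
    if not m.group("protocol").startswith("HTTP"):
        return False  # assumed reading of "the request protocol is HTTP"
    if m.group("method") not in ("GET", "HEAD"):
        return False
    if m.group("status") in ("400", "404"):
        return False
    return True
```

A caller would simply skip every line for which `matches_criteria` returns False and continue with the next one.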
## Rewriting matching lines
All matching lines, which have already been checked to match the format "%h %l %u %t \"%r\" %>s %b", are rewritten following these rules:
 - %h: The remote hostname is kept unchanged. (The previous sanitizing step already made sure that only addresses starting with "0.0.0." are kept in sanitized logs.)
 - %l: The remote logname, if present, is rewritten to "-".
 - %u: The remote user, if present, is rewritten to "-".
 - %t: All time and time zone components of when the request was received are rewritten to "00:00:00 +0000", while the date components are kept unchanged.
 - %r: If the first line of the request contains a query string, that query string is removed. Otherwise the first line of the request is kept unchanged.
 - %>s: The final status is kept unchanged.
 - %b: The size of the response in bytes is kept unchanged.
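
The rewriting rules above might be sketched as follows. The function name `rewrite_line` and the regular expression are hypothetical. Two choices in this sketch are assumptions rather than decisions made by this document: the time is set to 00:00:00 +0000 without converting to UTC first, and the query string is removed together with its leading "?".

```python
import re
from typing import Optional

# Common Log Format prefix; anything after %b is ignored and thereby dropped.
# Assumption: the request line contains no escaped double quotes.
CLF_PREFIX = re.compile(
    r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<date>\d{2}/[A-Za-z]{3}/\d{4}):[^\]]*\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def rewrite_line(line: str) -> Optional[str]:
    """Rewrite one matching line, or return None if it does not parse."""
    m = CLF_PREFIX.match(line)
    if m is None:
        return None
    # %r: drop the query string, here including the "?" itself.
    request = re.sub(r"\?\S*", "", m.group("request"), count=1)
    # %l and %u become "-", %t keeps only its date components.
    return '{} - - [{}:00:00:00 +0000] "{}" {} {}'.format(
        m.group("host"), m.group("date"), request,
        m.group("status"), m.group("size"),
    )
```

Because only the named fields are written back, any additional fields after %b (such as the referer) are discarded as a side effect.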
XXX Should we really just set times to 00:00:00 +0000, or should we first convert the given time to UTC? AFAIK, Tor's webservers are all configured to use UTC for their system time, so this doesn't matter. (And if they are not, this could be pretty bad, because `01/Sep/2017 00:00:00 +0700 == 31/Aug/2017 17:00:00 +0000`, which becomes either `01/Sep/2017 00:00:00 +0000` if we simply set the time to 0 or `31/Aug/2017 00:00:00 +0000` if we convert and then set the time to 0.) Hmmm. But if we want to keep sanitizing steps generic, we might have to consider this.
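
To make the two options in this question concrete, here is a small sketch that applies both to a timestamp with a +0700 offset. The function names are made up for this example, and the sketch assumes the default C locale so that month abbreviations like "Sep" are parsed and printed in English.

```python
from datetime import datetime, timezone

# Apache's %t layout without the surrounding brackets.
APACHE_TIME = "%d/%b/%Y:%H:%M:%S %z"


def truncate_only(timestamp: str) -> str:
    """Keep the date as logged and overwrite time and time zone."""
    logged = datetime.strptime(timestamp, APACHE_TIME)
    return logged.strftime("%d/%b/%Y") + ":00:00:00 +0000"


def convert_then_truncate(timestamp: str) -> str:
    """Convert to UTC first, then overwrite the time components."""
    logged = datetime.strptime(timestamp, APACHE_TIME)
    in_utc = logged.astimezone(timezone.utc)
    return in_utc.strftime("%d/%b/%Y") + ":00:00:00 +0000"


example = "01/Sep/2017:00:00:00 +0700"
print(truncate_only(example))          # 01/Sep/2017:00:00:00 +0000
print(convert_then_truncate(example))  # 31/Aug/2017:00:00:00 +0000
```

For a server that already logs in UTC both functions return the same result, which is why the question only matters if the sanitizing steps are meant to be generic.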
XXX Should we keep the ? of a query string to indicate that there has been a query string, or should we simply truncate at the first ? we can find?
The result is still supposed to be fully compatible with the Common Log Format and can be processed by any tool capable of processing that format.
## Re-assembling log files
Rewritten log lines are re-assembled into sanitized log files based on physical host, virtual host, and request start date.
XXX Should we use a different path convention for sanitized files, like:
/webstats/<webserver>/<webserver>-<hostname>.torproject.org-access.log-YYYYMMDD? The goal would be to have all metadata in the filename, rather than including the parent filename.
Because the date when a log file was rotated and the start date of the requests it contains do not always coincide, we need to delay publishing sanitized log files until all log files containing requests from that date are guaranteed to have been processed.
As the last, and certainly not least important, sanitizing step, all rewritten log lines are sorted alphabetically, so that request order cannot be inferred from sanitized log files.
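
A minimal sketch of re-assembling and sorting could look like this. The function name `reassemble`, the dictionary key layout, and the use of Python's default string ordering (by code point) as "alphabetical" are assumptions. The sketch also leaves out the delay logic, which depends on knowing when all input files for a date have been processed.

```python
import re
from collections import defaultdict
from datetime import datetime

# Date part of the rewritten %t field, e.g. "[01/Sep/2017:00:00:00 +0000]".
DATE_FIELD = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def reassemble(rewritten_lines, physical_host, virtual_host):
    """Group rewritten lines by (physical host, virtual host, YYYYMMDD)
    and sort the lines within each group."""
    groups = defaultdict(list)
    for line in rewritten_lines:
        m = DATE_FIELD.search(line)
        if m is None:
            continue  # should not happen for lines that were rewritten
        day = datetime.strptime(m.group(1), "%d/%b/%Y").strftime("%Y%m%d")
        groups[(physical_host, virtual_host, day)].append(line)
    # Sorting removes any trace of the original request order.
    return {key: sorted(lines) for key, lines in groups.items()}
```

Each key of the returned dictionary would correspond to one sanitized log file, whatever path convention is eventually chosen.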
Sanitized log files are typically compressed before publication. The sorting step in particular allows for high compression ratios. We typically use XZ for compression, which is indicated by appending ".xz" to log file names, but this is subject to change.
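
Writing an XZ-compressed sanitized file can be sketched with Python's standard `lzma` module. The function name `write_sanitized` is hypothetical, and the sketch assumes the lines passed in are already rewritten and sorted.

```python
import lzma


def write_sanitized(path: str, sorted_lines) -> str:
    """Write lines to path + ".xz" using XZ compression and return that path."""
    xz_path = path + ".xz"
    # "wt" opens the compressed file in text mode; newline="\n" avoids
    # platform-specific line endings in the published file.
    with lzma.open(xz_path, "wt", encoding="utf-8", newline="\n") as out:
        for line in sorted_lines:
            out.write(line + "\n")
    return xz_path
```

Reading such a file back works the same way with `lzma.open(xz_path, "rt", encoding="utf-8")`.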