Ticket #23243: webstats-spec.3.txt

File webstats-spec.3.txt, 6.8 KB (added by karsten, 3 years ago)

Fourth draft

Tor webserver logs


Next steps:
 - Replace webserver with web server which seems to be Less Bad English (karsten).
 - Turn this document into XML (karsten)
 - Code the decisions (iwakeh)
 - Try out the code on actual logs (iwakeh; karsten can make more logs available)
 - Send draft to tor-dev@ and ask for feedback (karsten)


Tor's webservers, like most webservers, keep request logs for maintenance and informational purposes.

However, unlike most other webservers, Tor's webservers use a privacy-aware log format that avoids logging overly sensitive data about their users.

Also unlike most other webserver logs, Tor's logs are neither archived nor analyzed until a number of postprocessing steps have further reduced any privacy-sensitive parts.

This document describes 1) the privacy-aware log format used on Tor's webservers and 2) the subsequent sanitizing steps that are applied before archiving and analyzing these log files.

As a convention for this document, all format strings conform to the format strings used by Apache's mod_log_config module (http://httpd.apache.org/docs/current/mod/mod_log_config.html).

# Privacy-aware log format

Tor's Apache webservers are configured to write log files that extend Apache's Combined Log Format with a couple of tweaks towards privacy. For example, the following Apache configuration lines were in use at the time of writing (subject to change):

LogFormat "0.0.0.0 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacy
LogFormat "0.0.0.1 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyssl
LogFormat "0.0.0.2 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyhs

The main difference from Apache's Common Log Format is that request IP addresses are removed and the field is instead used to encode whether the request came in via http:// (0.0.0.0), via https:// (0.0.0.1), or via the site's onion service (0.0.0.2).
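
To illustrate how a consumer of these logs might decode that first field, here is a minimal Python sketch. The helper name transport_for and the labels "http", "https" and "onion" are assumptions of this example and not part of the specification.

```python
# Hypothetical lookup table: placeholder address in the first log field
# mapped to the transport it encodes, as described in the paragraph above.
TRANSPORTS = {
    "0.0.0.0": "http",
    "0.0.0.1": "https",
    "0.0.0.2": "onion",
}


def transport_for(log_line):
    """Return the transport encoded in the first field, or None if unknown."""
    first_field = log_line.split(" ", 1)[0]
    return TRANSPORTS.get(first_field)
```

Any other value in the first field yields None here, which a caller could treat as a line that was not written with one of the privacy formats.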

Tor's webservers are configured to use UTC as their time zone, which is also highly recommended when rewriting request times to "00:00:00" in order for the subsequent sanitizing steps to work correctly. Alternatively, if the system time zone is not set to UTC, webservers should keep request times unchanged and let them be handled by the subsequent sanitizing steps.

Tor's webservers are configured to rotate logs at least once per day, which does not necessarily happen at 00:00:00 UTC. As a result, log files may contain requests from up to two UTC days, and several log files may contain requests that started on the same UTC day.

All access log files written by Tor's webservers follow the naming convention <hostname>.torproject.org-access.log-YYYYMMDD.

# Sanitizing steps

The request logs written by Tor's webservers still contain too many details that we are uncomfortable publishing. Therefore, we apply a couple of sanitizing steps to these log files before making them public and analyzing them ourselves. Some of these steps could just as well be performed directly by Apache, but others can only be performed with a delay.

## Discarding non-matching files

As a first safeguard against publishing log files that are too sensitive, we discard all files not matching the naming convention for access logs. This is to prevent, for example, error logs from slipping through.
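
The file name check could be sketched in Python as follows. This is only an illustration: the function name is_access_log and the set of characters allowed in <hostname> (letters, digits, dots and hyphens) are assumptions, and the additional check that YYYYMMDD is a real calendar date goes slightly beyond what the naming convention states.

```python
import re
from datetime import datetime

# Assumed pattern for <hostname>.torproject.org-access.log-YYYYMMDD.
ACCESS_LOG_NAME = re.compile(
    r"(?P<hostname>[A-Za-z0-9.-]+)\.torproject\.org-access\.log-(?P<date>\d{8})"
)


def is_access_log(file_name):
    """Return True if file_name follows the access log naming convention."""
    match = ACCESS_LOG_NAME.fullmatch(file_name)
    if match is None:
        return False
    try:
        # Reject eight-digit suffixes that are not valid calendar dates.
        datetime.strptime(match.group("date"), "%Y%m%d")
    except ValueError:
        return False
    return True
```

With this sketch, a name such as www.torproject.org-error.log-20170813 is rejected because it lacks the "-access.log-" part.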

## Discarding non-matching lines

Log files are expected to contain exactly one request per line. We process these files line by line and discard any lines not matching the following criteria:

 - Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s %b") or a compatible format like one of Tor's privacy formats. It is acceptable if lines start with a format that is compatible with the Common Log Format and continue with additional fields. Those additional fields will later be discarded, but the line will not be discarded because of them.
 - The request IP address starts with "0.0.0.", followed by any number between 0 and 255.
 - The time the request was received does not lie in the future.
 - The date the request was received, after converting the request time to UTC, does not lie more than 1 day in the past. (Bulk imports of archived logs are exempt from this requirement.)
 - The request protocol is HTTP.
 - The request method is either GET or HEAD.
 - The final status of the request is neither 400 ("Bad Request") nor 404 ("Not Found").

Any lines not meeting all these criteria will be discarded, and processing continues with the next line.
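
The criteria above can be sketched as a single Python predicate. This is a hedged illustration rather than the implementation: the names CLF_PREFIX, HOST and keep_line are made up for this example, "the protocol is HTTP" is read as the protocol token starting with "HTTP/", and "not more than 1 day in the past" is compared at the granularity of UTC dates. Month names are parsed with %b, which assumes the default C/English locale.

```python
import re
from datetime import datetime, timedelta, timezone

# Common Log Format prefix; anything after %b is tolerated and ignored.
CLF_PREFIX = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>[^"\s]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: .*)?'
)
# "0.0.0." followed by a number between 0 and 255.
HOST = re.compile(r"0\.0\.0\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)")


def keep_line(line, now, bulk_import=False):
    """Return True if line meets all criteria; now must be timezone-aware."""
    match = CLF_PREFIX.fullmatch(line.rstrip("\n"))
    if match is None:
        return False
    if HOST.fullmatch(match.group("host")) is None:
        return False
    try:
        received = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return False
    if received > now:
        return False  # request time lies in the future
    if not bulk_import:
        oldest = now.astimezone(timezone.utc).date() - timedelta(days=1)
        if received.astimezone(timezone.utc).date() < oldest:
            return False  # more than 1 day in the past
    if not match.group("protocol").startswith("HTTP/"):
        return False
    if match.group("method") not in ("GET", "HEAD"):
        return False
    return match.group("status") not in ("400", "404")
```

A caller would simply skip every line for which keep_line returns False and continue with the next one.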

## Rewriting matching lines

All matching lines, which have already been checked to match Apache's Common Log Format ("%h %l %u %t \"%r\" %>s %b"), are rewritten following these rules:

 - %h: The remote hostname is kept unchanged. (The previous sanitizing step already made sure that only addresses in 0.0.0.0/24 are kept in sanitized logs.)
 - %l: The remote logname, if present, is rewritten to "-".
 - %u: The remote user, if present, is rewritten to "-".
 - %t: The time the request was received is converted to UTC, unless the time is already given in UTC, and time and time zone components are rewritten to "00:00:00 +0000". Date components are kept unchanged.
 - %r: If the first line of the request contains a query string, that query string is removed from "?" to the end of the request string. Otherwise the first line of the request is kept unchanged.
 - %>s: The final status is kept unchanged.
 - %b: The size of the response in bytes is kept unchanged.

The result is still supposed to be fully compatible with the Common Log Format and can be processed by any tool capable of processing that format.
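
The rewriting rules could look roughly like this in Python. The function name rewrite_line is hypothetical, and the sketch makes one interpretive assumption that should be checked against the intended behavior: it reads "the request string" as the requested URL, so it removes everything from "?" up to the next space and keeps the trailing protocol token. If the whole first line of the request is meant instead, the substitution would have to cut to the end of the quoted field.

```python
import re
from datetime import datetime, timezone

# Common Log Format prefix; fields after %b are dropped by the rewrite.
CLF_PREFIX = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def rewrite_line(line):
    """Return the sanitized Common Log Format version of a matching line."""
    match = CLF_PREFIX.match(line)
    if match is None:
        raise ValueError("line does not start with the Common Log Format")
    # %t: convert to UTC first, then keep only the date components.
    received = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    utc_date = received.astimezone(timezone.utc).strftime("%d/%b/%Y")
    # %r: drop the query string (assumption: up to the next space only).
    request = re.sub(r"\?\S*", "", match.group("request"), count=1)
    # %l and %u are always written as "-"; %h, %>s and %b are kept.
    return '{} - - [{}:00:00:00 +0000] "{}" {} {}'.format(
        match.group("host"), utc_date, request,
        match.group("status"), match.group("size"),
    )
```

Note that a request logged at 23:30:00 -0500 on 13 August ends up under 14 August, because the conversion to UTC happens before the time components are discarded.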

## Re-assembling log files

Rewritten log lines are re-assembled into sanitized log files based on physical host, virtual host, and request start date.

The naming convention for sanitized log files is:

<virtual-host>-<physical-host>-access.log-YYYYMMDD[.xz]

Sanitized log files may additionally be sorted into directories by virtual host and date as in:

<virtual-host>/YYYY/MM/<virtual-host>-<physical-host>-access.log-YYYYMMDD[.xz]
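
As a small illustration, the directory layout and file name could be built like this in Python. The function name sanitized_log_path and the host names used in the usage note are placeholders invented for this sketch.

```python
from datetime import date
from pathlib import PurePosixPath


def sanitized_log_path(virtual_host, physical_host, request_date, compressed=True):
    """Build <virtual-host>/YYYY/MM/<virtual-host>-<physical-host>-access.log-YYYYMMDD[.xz]."""
    file_name = "{}-{}-access.log-{}".format(
        virtual_host, physical_host, request_date.strftime("%Y%m%d")
    )
    if compressed:
        file_name += ".xz"
    return PurePosixPath(
        virtual_host,
        request_date.strftime("%Y"),
        request_date.strftime("%m"),
        file_name,
    )
```

For example, sanitized_log_path("www.example.org", "host1.example.org", date(2017, 8, 14)) yields www.example.org/2017/08/www.example.org-host1.example.org-access.log-20170814.xz.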

Because the date on which a log file was rotated and the start dates of the requests it contains do not always coincide, we need to delay publishing sanitized log files until the start date of requests in UTC plus 2 days. After this delay, all log files containing requests from that date are assumed to have been processed. Sanitized log files are then published and not modified any further. (Again, bulk imports of archived logs are exempt from this.)
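
The publication delay amounts to simple date arithmetic, sketched below in Python. The function names are hypothetical, and comparing at the granularity of UTC dates (rather than exact timestamps) is an assumption of this example.

```python
from datetime import date, datetime, timedelta, timezone


def earliest_publication_date(request_date):
    """Return the first UTC date on which logs for request_date may be published."""
    return request_date + timedelta(days=2)


def is_publishable(request_date, now, bulk_import=False):
    """Return True once the 2-day delay has passed; now must be timezone-aware."""
    if bulk_import:
        return True  # bulk imports of archived logs are exempt
    today_utc = now.astimezone(timezone.utc).date()
    return today_utc >= earliest_publication_date(request_date)
```

Under these assumptions, requests started on 14 August become publishable on 16 August UTC.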

As the last, and certainly not least important, sanitizing step, all rewritten log lines are sorted alphabetically, so that request order cannot be inferred from sanitized log files.

Sanitized log files are typically compressed before publication. In particular, the sorting step allows for highly efficient compression. We typically use XZ for compression, which is indicated by appending ".xz" to log file names, but this is subject to change.
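
Sorting and compressing could be sketched with Python's standard lzma module as follows. The function name sort_and_compress is made up for this example, and "alphabetically" is interpreted here as Python's default string ordering (by code point), which is an assumption.

```python
import lzma


def sort_and_compress(lines):
    """Sort rewritten log lines and return them as XZ-compressed bytes."""
    # Sorting removes any trace of the original request order.
    sorted_lines = sorted(line.rstrip("\n") for line in lines)
    payload = "".join(line + "\n" for line in sorted_lines).encode("utf-8")
    return lzma.compress(payload, format=lzma.FORMAT_XZ)
```

The returned bytes can be written to a file whose name ends in ".xz". Since every line of one sanitized file shares the same host field and date, sorted neighbors tend to have long common prefixes, which is what makes the compression so effective.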