A few random thoughts. In discussions with the EFF, they recommended this plan from a legal perspective: https://www.eff.org/wp/osp. It suggests we could log in the default Common Log Format (CLF) and purge the logs within 48 hours. This plan would save space and still let us see summarized results.
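A minimal sketch of what that could look like with Apache and logrotate; the paths, the exact retention window, and the reload command are placeholders rather than a worked-out proposal:

```
# Hypothetical logrotate fragment, e.g. /etc/logrotate.d/apache2-www.
# Assumes Apache already writes CLF via something like:
#   CustomLog /var/log/apache2/access.log common
/var/log/apache2/access.log {
    daily
    rotate 1          # keep one rotated file, so raw entries are gone after roughly 48 hours
    compress
    missingok
    notifempty
    postrotate
        apache2ctl graceful > /dev/null 2>&1 || true    # or the local equivalent
    endscript
}
```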
I'd like to be able to see:
visits and unique visitor counts by day of the week and day of the month
top 25 pages by page view (see the sketch after this list)
country of origin
most viewed, entry, and exit pages
downloads by package
HTTP errors
referrers (sanitized if they include PII)
search engines, keyphrases and keywords
overall traffic by webserver, if possible (is our load balancing working evenly?)
top 10 paths through the site
number of pages viewed per visit
percentage and count of requests through tor exits
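A rough sketch of how a couple of the items above (top 25 pages, HTTP error counts) could be pulled out of plain CLF logs; the log path is a placeholder and the regex only handles the default format, so treat it as an illustration rather than the actual tooling:

```python
# Sketch: summarize a Common Log Format file into two of the wishlist items.
import re
from collections import Counter

CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

page_views = Counter()
errors = Counter()

with open("/var/log/apache2/access.log") as log:  # placeholder path
    for line in log:
        match = CLF.match(line)
        if not match:
            continue
        host, timestamp, method, path, status, size = match.groups()
        page_views[path] += 1
        if status.startswith(("4", "5")):
            errors[status] += 1

print("Top 25 pages by page views:")
for path, count in page_views.most_common(25):
    print("  %6d  %s" % (count, path))

print("HTTP errors:")
for status, count in sorted(errors.items()):
    print("  %s: %d" % (status, count))
```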
Another idea is to see what we can provide from the feature list of http://awstats.sourceforge.net/ without disclosing PII.
And we should remember that this is more than just the logs for www.tpo, we have check, svn, gitweb, metrics, bridges, and trac websites to analyze.
To quote a recent message to or-dev, 'we're trying to only make use of data for statistics that we're giving out to everyone'. This plan would contradict our practice of publishing summaries only of datasets that are themselves safe to publish.
Yup, you're right. I was just quoting phobos' idea here to split tickets. I see how using detailed data and not publishing it afterwards contradicts the stated principle.
I'm also quoting arma here who has stated his concerns about phobos' idea in the other ticket.
Yeah, it does sound like phobos should open a new ticket under 'website' or maybe 'infrastructure' with a title like 'change our website logging format and data retention habits'. What we change it to will require quite a bit of discussion. I would be uncomfortable having IP addresses of every user for every service even for 24 hours. (I think EFF's recommendation is a really crappy compromise that they were forced to make because otherwise none of the huge datamining corporations would listen to them at all.)
Search queries and other 'Referer' strings can easily be quite sensitive. They will also be particularly hard to sanitize, so whatever process we use to sanitize them will need a thorough review on or-dev.
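For what it's worth, one first cut at sanitizing referrers would be to keep only the scheme and hostname and drop the path, query string, and fragment entirely, since that is where search terms and other PII tend to live. A sketch of that idea only; it is not a settled design and would still need the or-dev review mentioned above:

```python
# Sketch: reduce a Referer value to scheme and host, dropping path, query
# string, and fragment, which is where search terms and other PII tend to live.
from urllib.parse import urlsplit

def sanitize_referrer(referrer):
    if not referrer or referrer == "-":
        return "-"
    parts = urlsplit(referrer)
    if not parts.scheme or not parts.netloc:
        return "-"  # malformed or relative referrer: drop it entirely
    return "%s://%s/" % (parts.scheme, parts.netloc.lower())

# e.g. "http://www.google.com/search?q=something+sensitive"
# becomes "http://www.google.com/"
```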
check.tpo currently states: "This server does not log any information about visitors." This published policy for check.tpo should not be changed lightly, if at all.
Logs from gitweb.tpo and svn.tpo may disclose that someone is researching a security bug in a particular piece of code; if sanitized logs from those domains are published at all, they should be delayed by at least 24 hours.
As I understand it, the logs currently collected by BridgeDB/bridges.tpo are quite dangerous. We should also look into reducing the amount of sensitive information which that server stores.
A few thoughts here. I'm looking to publish the summarized data for the world to see, but mainly for us to use. If we can also publish the raw logs, great.
This desire to publish raw logs may also highlight services that already collect too much data in their logfiles. There is also a risk of an organization asking, bribing, or paying for access to the logs. Another thought is that even if we don't log anything, our ISP may be forced to do so. Perhaps we should offer every service over a hidden service as well.
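On the hidden service point, the torrc side of it is small; a sketch with placeholder paths and the assumption that the web server also listens on localhost:

```
# torrc fragment; directory and ports are placeholders
HiddenServiceDir /var/lib/tor/www_hidden_service/
HiddenServicePort 80 127.0.0.1:80
```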
Most of what I'm trying to learn with the websites is how people find us, what people read, what paths they take through the site, where in the world they come from, and how long they stay on any particular pages.
I'm mostly ignoring this task because, as I hear, it's on Runa's list. A few thoughts anyway:
I think for metrics we made a reasonable decision to use only the data that we publish. (I'm a terrible fortuneteller, so I cannot promise we'll never have to break this principle. So far we haven't, and if we ever have to, we'll tell the world/or-dev beforehand.) This decision ensures that we're not collecting too much data in the first place. We may even get feedback from the community if we do collect too much data. Also, we're less susceptible to attacks, because we don't have any secret data. And it's a question of fairness towards other researchers who don't happen to run a Tor network and who want to do Tor research.
The situation with web logs may be comparable to the bridge descriptor situation. We cannot publish raw bridge descriptors. Instead, we're spending significant resources, mostly developer time, on sanitizing bridge descriptors before publishing and analyzing them. Most importantly, we're not analyzing the original descriptors at all.
How about we implement a similar sanitizing process for web logs? Maybe we can replace sensitive data with other data that allows us to keep track of user sessions without giving away any other user data. I admit it will eat a lot of development resources (here: Runa). In addition to that, we should look at the original logs and reduce the details that are too sensitive and that we're not going to use anyway. What we should not do, IMHO, is use the original logs for analysis and publish only sanitized logs. I'm aware that this means we won't be able to answer all the questions we would like to answer, simply because the data is too sensitive. We should also present and discuss the complete sanitizing process on or-dev before implementing it.
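One way to do the "keep track of sessions without keeping the user data" part might be to replace each client address with a keyed hash whose key is generated fresh and thrown away every day, so entries from the same visitor can still be grouped within a day but cannot be linked back to an address or across days. A sketch of that idea only, not a reviewed design:

```python
# Sketch: replace the client address in CLF lines with a keyed, daily-rotating
# pseudonym so sessions can still be grouped without retaining the address.
# The key must be regenerated (and forgotten) every day; nothing here has been
# reviewed on or-dev.
import hashlib
import hmac
import os

DAILY_KEY = os.urandom(32)  # regenerate and discard this every day

def pseudonymize(address):
    digest = hmac.new(DAILY_KEY, address.encode("ascii"), hashlib.sha256)
    return digest.hexdigest()[:16]

def sanitize_line(line):
    # CLF puts the client address in the first whitespace-separated field.
    fields = line.split(" ", 1)
    if len(fields) != 2:
        return line
    return pseudonymize(fields[0]) + " " + fields[1]
```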
We could do an experiment where we collect normal apache logs for a day, stare at them really hard for a few days, and then delete them. That would give us a better sense of what we're missing out on, might answer a few of Andrew's questions, and probably wouldn't put our users at much additional risk.
(Not that it matters, but we'd still be doing better than most industry standards even then.)
Or said another way, I'd be much happier if the default were to keep nothing sensitive; I could then better stomach some brief (and transient) divergences from that plan.
Moving this to the Analysis component, because this ticket first needs analysis before we can go implement something. And even when we implement something, this is more likely going to be part of Metrics Data Processor than Website.
Just a note to self, really; I wonder if we can get Piwik to happily chew on our logs and output something fancy. We could also write plugins for Piwik to display info such as package downloads per OS etc. We'd need space on a server to set up piwik.tpo or something similar, though.
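If I remember correctly, Piwik ships a log importer under misc/log-analytics/ that can replay Apache access logs into a Piwik instance, so the first experiment might be as simple as something along these lines (script path, options, and hostname are from memory and placeholders, so they would need checking):

```
# Rough idea only; not verified against the current Piwik release.
python misc/log-analytics/import_logs.py \
    --url=https://piwik.torproject.org/ \
    --idsite=1 \
    /var/log/apache2/access.log
```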