A few random thoughts. In discussions with the EFF, they recommended this plan from a legal perspective: https://www.eff.org/wp/osp. It suggests we could log in the default Common Log Format (CLF) and purge the logs within 48 hours. This plan would save space and still let us see summarized results.
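A minimal sketch of what that could look like with Apache and logrotate; the paths, the exact retention window, and the reload command are placeholders rather than a worked-out proposal:

```
# Hypothetical logrotate fragment, e.g. /etc/logrotate.d/apache2-www.
# Assumes Apache already writes CLF via something like:
#   CustomLog /var/log/apache2/access.log common
/var/log/apache2/access.log {
    daily
    rotate 1          # keep one rotated file, so raw entries are gone after roughly 48 hours
    compress
    missingok
    notifempty
    postrotate
        apache2ctl graceful > /dev/null 2>&1 || true    # or the local equivalent
    endscript
}
```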
I'd like to be able to see:
visits and unique visitor counts by day of the week and day of the month
top 25 pages by page view (see the sketch after this list)
country of origin
most viewed, entry, and exit pages
downloads by package
HTTP errors
referrers (sanitized if they include PII)
search engines, keyphrases and keywords
overall traffic by webserver, if possible (is our load balancing working evenly?)
top 10 paths through the site
number of pages viewed per visit
percentage and count of requests through tor exits
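A rough sketch of how a couple of the items above (top 25 pages, HTTP error counts) could be pulled out of plain CLF logs; the log path is a placeholder and the regex only handles the default format, so treat it as an illustration rather than the actual tooling:

```python
# Sketch: summarize a Common Log Format file into two of the wishlist items.
import re
from collections import Counter

CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

page_views = Counter()
errors = Counter()

with open("/var/log/apache2/access.log") as log:  # placeholder path
    for line in log:
        match = CLF.match(line)
        if not match:
            continue
        host, timestamp, method, path, status, size = match.groups()
        page_views[path] += 1
        if status.startswith(("4", "5")):
            errors[status] += 1

print("Top 25 pages by page views:")
for path, count in page_views.most_common(25):
    print("  %6d  %s" % (count, path))

print("HTTP errors:")
for status, count in sorted(errors.items()):
    print("  %s: %d" % (status, count))
```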
Another idea is to see what we can provide from the feature list of http://awstats.sourceforge.net/ without disclosing PII.
And we should remember that this is more than just the logs for www.tpo, we have check, svn, gitweb, metrics, bridges, and trac websites to analyze.
To quote a recent message to or-dev, 'we're trying to only make use of data for statistics that we're giving out to everyone'. This plan would contradict our practice of publishing summaries only of datasets that are themselves safe to publish.
Yup, you're right. I was just quoting phobos' idea here to split tickets. I see how using detailed data and not publishing it afterwards contradicts the stated principle.
I'm also quoting arma here who has stated his concerns about phobos' idea in the other ticket.
Yeah, it does sound like phobos should open a new ticket under 'website' or maybe 'infrastructure' with a title like 'change our website logging format and data retention habits'. What we change it to will require quite a bit of discussion. I would be uncomfortable having IP addresses of every user for every service even for 24 hours. (I think EFF's recommendation is a really crappy compromise that they were forced to make because otherwise none of the huge datamining corporations would listen to them at all.)
Search queries and other 'Referer' strings can easily be quite sensitive. They will also be particularly hard to sanitize, so whatever process we use to sanitize them will need a thorough review on or-dev.
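For what it's worth, one first cut at sanitizing referrers would be to keep only the scheme and hostname and drop the path, query string, and fragment entirely, since that is where search terms and other PII tend to live. A sketch of that idea only; it is not a settled design and would still need the or-dev review mentioned above:

```python
# Sketch: reduce a Referer value to scheme and host, dropping path, query
# string, and fragment, which is where search terms and other PII tend to live.
from urllib.parse import urlsplit

def sanitize_referrer(referrer):
    if not referrer or referrer == "-":
        return "-"
    parts = urlsplit(referrer)
    if not parts.scheme or not parts.netloc:
        return "-"  # malformed or relative referrer: drop it entirely
    return "%s://%s/" % (parts.scheme, parts.netloc.lower())

# e.g. "http://www.google.com/search?q=something+sensitive"
# becomes "http://www.google.com/"
```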
check.tpo currently states: "This server does not log any information about visitors." This published policy for check.tpo should not be changed lightly, if at all.
Logs from gitweb.tpo and svn.tpo may disclose that someone is researching a security bug in a particular piece of code; if sanitized logs from those domains are published at all, they should be delayed by at least 24 hours.
As I understand it, the logs currently collected by BridgeDB/bridges.tpo are quite dangerous. We should also look into reducing the amount of sensitive information which that server stores.
A few thoughts here. I'm looking to publish the summarized data for the world to see, but mainly for us to use. If we can also publish the raw logs, great.
This desire to publish raw logs may also highlight services that already collect too much data in their logfiles. There is also a risk of an organization asking, bribing, or paying for access to the logs. Another thought is that even if we don't log anything, our ISP may be forced to do so. Perhaps we should offer every service over a hidden service as well.
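On the hidden service point, the torrc side of it is small; a sketch with placeholder paths and the assumption that the web server also listens on localhost:

```
# torrc fragment; directory and ports are placeholders
HiddenServiceDir /var/lib/tor/www_hidden_service/
HiddenServicePort 80 127.0.0.1:80
```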
Most of what I'm trying to learn with the websites is how people find us, what people read, what paths they take through the site, where in the world they come from, and how long they stay on any particular pages.
I'm mostly ignoring this task because, as I hear, it's on Runa's list. A few thoughts anyway:
I think for metrics we made a reasonable decision to use only the data that we publish. (I'm a terrible fortuneteller, so I cannot promise we'll never have to break this principle. So far we haven't, and if we ever have to, we'll tell the world/or-dev beforehand.) This decision ensures that we're not collecting too much data in the first place. We may even get feedback from the community if we do collect too much data. Also, we're less susceptible to attacks, because we don't have any secret data. And it's a question of fairness towards other researchers who don't happen to run a Tor network and who want to do Tor research.
The situation with web logs may be comparable to the bridge descriptor situation. We cannot publish raw bridge descriptors. Instead, we're spending significant resources, mostly developer time, on sanitizing bridge descriptors before publishing and analyzing them. Most importantly, we're not analyzing the original descriptors at all.
How about we implement a similar sanitizing process for web logs? Maybe we can replace sensitive data with other data that allows us to keep track of user sessions without giving away any other user data. I admit it will eat a lot of development resources (here: Runa). In addition to that, we should look at the original logs and reduce the details that are too sensitive and that we're not going to use anyway. What we should not do, IMHO, is use the original logs for analysis and publish only sanitized logs. I'm aware that this means we won't be able to answer all the questions we would like to answer, simply because the data is too sensitive. We should also present and discuss the complete sanitizing process on or-dev before implementing it.
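One way to do the "keep track of sessions without keeping the user data" part might be to replace each client address with a keyed hash whose key is generated fresh and thrown away every day, so entries from the same visitor can still be grouped within a day but cannot be linked back to an address or across days. A sketch of that idea only, not a reviewed design:

```python
# Sketch: replace the client address in CLF lines with a keyed, daily-rotating
# pseudonym so sessions can still be grouped without retaining the address.
# The key must be regenerated (and forgotten) every day; nothing here has been
# reviewed on or-dev.
import hashlib
import hmac
import os

DAILY_KEY = os.urandom(32)  # regenerate and discard this every day

def pseudonymize(address):
    digest = hmac.new(DAILY_KEY, address.encode("ascii"), hashlib.sha256)
    return digest.hexdigest()[:16]

def sanitize_line(line):
    # CLF puts the client address in the first whitespace-separated field.
    fields = line.split(" ", 1)
    if len(fields) != 2:
        return line
    return pseudonymize(fields[0]) + " " + fields[1]
```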
We could do an experiment where we collect normal apache logs for a day, stare at them really hard for a few days, and then delete them. That would give us a better sense of what we're missing out on, might answer a few of Andrew's questions, and probably wouldn't put our users at much additional risk.
(Not that it matters, but we'd still be doing better than most industry standards even then.)
Or said another way, I'd be much happier if the default were to keep nothing sensitive; I could then better stomach some brief (and transient) divergences from that plan.
Moving this to the Analysis component, because this ticket first needs analysis before we can go implement something. And even when we implement something, this is more likely going to be part of Metrics Data Processor than Website.
Just a note to self, really; I wonder if we can get Piwik to happily chew on our logs and output something fancy. We could also write plugins for Piwik to display info such as package downloads per OS etc. We'd need space on a server to set up piwik.tpo or something similar, though.
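If I remember correctly, Piwik ships a log importer under misc/log-analytics/ that can replay Apache access logs into a Piwik instance, so the first experiment might be as simple as something along these lines (script path, options, and hostname are from memory and placeholders, so they would need checking):

```
# Rough idea only; not verified against the current Piwik release.
python misc/log-analytics/import_logs.py \
    --url=https://piwik.torproject.org/ \
    --idsite=1 \
    /var/log/apache2/access.log
```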