Opened 9 years ago

Closed 7 years ago

#2489 closed enhancement (implemented)

Set up new web server logging and log analysis infrastructure

Reported by: karsten Owned by: phobos
Priority: Medium Milestone:
Component: Metrics/Analysis Version:
Severity: Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Moving phobos' comment from #1641 here, because it's really a new ticket:

A few random thoughts. In discussions with the EFF, they recommended this plan from a legal perspective, https://www.eff.org/wp/osp. This suggests we could log in the default CLF and purge the logs within 48 hours. This plan would save space and let us see summarized results.

I'd like to be able to see:

  1. visits and unique visitor counts by days of the week, day of the month
  2. top 25 pages by page view
  3. country of origin
  4. most viewed, entry, and exit pages
  5. downloads by package
  6. http errors
  7. referrers (sanitized if it includes PII)
  8. search engines, keyphrases and keywords
  9. overall traffic by webserver, if possible (is our load balancing working evenly?)
  10. top 10 paths through the site
  11. number of pages viewed per visit
  12. percentage and count of requests through tor exits
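Several of the summaries in this wishlist (top pages by page view, HTTP errors) can be derived from plain Common Log Format lines without retaining client addresses. The following is a minimal, hypothetical sketch, not anything implemented for this ticket; the `summarize` helper and sample lines are invented for illustration.

```python
import re
from collections import Counter

# Regex for the Apache Common Log Format (CLF); the "combined" format
# would additionally carry referer and user-agent fields.
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def summarize(lines, top_n=25):
    """Count page views and HTTP errors; client addresses are parsed
    but never stored in the output."""
    pages, errors = Counter(), Counter()
    for line in lines:
        m = CLF.match(line)
        if not m:
            continue  # skip malformed lines rather than guessing
        pages[m.group("path")] += 1
        status = m.group("status")
        if status.startswith(("4", "5")):
            errors[status] += 1
    return pages.most_common(top_n), dict(errors)

sample = [
    '1.2.3.4 - - [10/Feb/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '5.6.7.8 - - [10/Feb/2011:10:00:01 +0000] "GET /download HTTP/1.1" 404 199',
    '1.2.3.4 - - [10/Feb/2011:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 512',
]
top, errs = summarize(sample)
# top[0] == ("/index.html", 2); errs == {"404": 1}
```

Items that need cross-request linkage (visits, paths through the site, pages per visit) cannot be computed this way without some session identifier, which is exactly where the privacy questions below come in.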

Or another idea is to see what we can provide from the feature list of http://awstats.sourceforge.net/ without disclosing PII.

And we should remember that this is more than just the logs for www.tpo, we have check, svn, gitweb, metrics, bridges, and trac websites to analyze.

Child Tickets

Change History (12)

comment:1 in reply to: description; Changed 9 years ago by rransom

Replying to karsten:

A few random thoughts. In discussions with the EFF, they recommended this plan from a legal perspective, https://www.eff.org/wp/osp. This suggests we could log in the default CLF and purge the logs within 48 hours. This plan would save space and let us see summarized results.

To quote a recent message to or-dev, 'we're trying to only make use of data for statistics that we're giving out to everyone'. This plan would contradict our practice of publishing summaries only of datasets that are themselves safe to publish.

comment:2 in reply to:  1 Changed 9 years ago by karsten

Replying to rransom:

To quote a recent message to or-dev, 'we're trying to only make use of data for statistics that we're giving out to everyone'. This plan would contradict our practice of publishing summaries only of datasets that are themselves safe to publish.

Yup, you're right. I was just quoting phobos' idea here to split tickets. I see how using detailed data and not publishing it afterwards contradicts the stated principle.

I'm also quoting arma here who has stated his concerns about phobos' idea in the other ticket.

Yeah, it does sound like phobos should open a new ticket under 'website' or maybe 'infrastructure' with a title like 'change our website logging format and data retention habits'. What we change it to will require quite a bit of discussion. I would be uncomfortable having IP addresses of every user for every service even for 24 hours. (I think EFF's recommendation is a really crappy compromise that they were forced to make because otherwise none of the huge datamining corporations would listen to them at all.)

comment:3 in reply to:  description Changed 9 years ago by rransom

Replying to karsten (quoting phobos):

  1. referrers (sanitized if it includes PII)
  2. search engines, keyphrases and keywords

Search queries and other 'Referer' strings can easily be quite sensitive. They will also be particularly hard to sanitize, so whatever process we use to sanitize them will need a thorough review on or-dev.

And we should remember that this is more than just the logs for www.tpo, we have check, svn, gitweb, metrics, bridges, and trac websites to analyze.

check.tpo currently states: "This server does not log any information about visitors." This published policy for check.tpo should not be changed lightly, if at all.

Logs from gitweb.tpo and svn.tpo may disclose that someone is researching a security bug in a particular piece of code; if sanitized logs from those domains are published at all, they should be delayed by at least 24 hours.

As I understand it, the logs currently collected by BridgeDB/bridges.tpo are quite dangerous. We should also look into reducing the amount of sensitive information which that server stores.

comment:4 Changed 9 years ago by phobos

A few thoughts here.  I'm looking to publish the summarized data for the world to see, but mainly for us to use.  If we can also publish the raw logs, great.

This desire to publish raw logs may also highlight services which collect too much data in their logfiles already.  There is a risk from an organization asking/bribing/paying for access to the logs.  Another thought is that even if we don't log anything, our ISP may be forced to do so.  Perhaps we should offer every service over a hidden service as well.

Most of what I'm trying to learn with the websites is how people find us, what people read, what paths they take through the site, where in the world they come from, and how long they stay on any particular pages.

comment:5 Changed 9 years ago by karsten

I'm mostly ignoring this task, because it's on Runa's list as I hear. A few thoughts anyway:

I think for metrics we made a reasonable decision to use only the data that we publish. (I'm a terrible fortuneteller, so I cannot promise we'll never have to break this principle. So far we haven't, and if we ever have to, we'll tell the world/or-dev first.) This decision ensures that we're not collecting too much data in the first place. We may even get feedback from the community if we do collect too much data. Also, we're less susceptible to attacks, because we don't have any secret data. And it's a question of fairness towards other researchers who don't happen to run a Tor network and who want to do Tor research.

The situation with web logs may be comparable to the bridge descriptor situation. We cannot publish raw bridge descriptors. Instead, we're spending significant resources, mostly developer time, on sanitizing bridge descriptors before publishing and analyzing them. Most importantly, we're not analyzing the original descriptors at all.

How about we implement a similar sanitizing process for web logs? Maybe we can replace sensitive data with other data that allows us to keep track of user sessions without giving away any other user data. I admit it will eat a lot of development resources (here: Runa). In addition to that, we should look at the original logs and reduce the details that are too sensitive and that we're not going to use anyway. What we should not do, IMHO, is use the original logs for analysis and publish only sanitized logs. I'm aware that this means we won't be able to answer all questions we would like to answer, simply because the data is too sensitive. We should also present and discuss the complete sanitizing process on or-dev before implementing it.
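One way to keep sessions trackable without keeping user data, as suggested above, is a keyed hash of the client address with a key that is rotated and then discarded. This is only a hypothetical sketch of that idea (the helper names and the regex are invented, and any real process would need the or-dev review karsten asks for):

```python
import hmac
import hashlib
import os
import re

# Ephemeral key: rotate and discard it periodically so pseudonyms cannot
# be linked across windows or reversed once the key is gone.
KEY = os.urandom(32)

# The first whitespace-delimited field of a CLF line is the client host.
IP_RE = re.compile(r'^(\S+)(?= )')

def pseudonymize(ip: str) -> str:
    """Map an address to a stable pseudonym under the current key."""
    return hmac.new(KEY, ip.encode(), hashlib.sha256).hexdigest()[:16]

def sanitize_line(line: str) -> str:
    """Replace the client address in a CLF log line with its pseudonym."""
    return IP_RE.sub(lambda m: pseudonymize(m.group(1)), line, count=1)

line = '1.2.3.4 - - [10/Feb/2011:10:00:00 +0000] "GET / HTTP/1.1" 200 512'
a = sanitize_line(line)
b = sanitize_line(line)
# Same address gives the same pseudonym within a key window,
# so sessions stay linkable, but the raw address is gone.
assert a == b and "1.2.3.4" not in a
```

The trade-off is the one karsten names: within a rotation window the pseudonyms still link requests together, so the window length and what else remains in each line would need careful discussion.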

comment:6 in reply to:  4 Changed 9 years ago by arma

Replying to phobos:

Most of what I'm trying to learn with the websites is how people find us, what people read, what paths they take through the site, where in the world they come from, and how long they stay on any particular pages.

We could do an experiment where we collect normal apache logs for a day, stare at them really hard for a few days, and then delete them. That would give us a better sense of what we're missing out on, might answer a few of Andrew's questions, and probably wouldn't put our users at much additional risk.

(Not that it matters, but we'd still be doing better than most industry standards even then.)

Or said another way, I'd be much happier if the default is to keep nothing sensitive and could then better stomach some brief (and transient) divergences from that plan.

comment:7 Changed 8 years ago by phobos

Owner: changed from phobos to runa
Status: new → assigned

comment:8 Changed 8 years ago by karsten

Component: Website → Analysis

Moving this to the Analysis component, because this ticket first needs analysis before we can go implement something. And even when we implement something, this is more likely going to be part of Metrics Data Processor than Website.

comment:9 Changed 8 years ago by runa

Just a note to self, really; I wonder if we can get Piwik to happily chew on our logs and output something fancy. We could also write plugins for Piwik to display info such as package downloads per OS etc. We'd need space on a server to set up piwik.tpo or something similar, though.

comment:10 Changed 8 years ago by runa

Owner: changed from runa to karsten

Moving this back to Karsten since the component is Analysis.

comment:11 Changed 7 years ago by karsten

Owner: changed from karsten to phobos

We have webstats.tpo now. Does that mean this ticket is obsolete? Andrew?

comment:12 Changed 7 years ago by phobos

Resolution: implemented
Status: assigned → closed