Opened 9 years ago

Closed 7 years ago

#1641 closed task (fixed)

Make website logs available in the Metrics Portal

Reported by: karsten Owned by: karsten
Priority: Low Milestone:
Component: Metrics/Analysis Version:
Severity: Keywords:
Cc: phobos, runa Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

The website logs of archive, byblos, and vescum (56 GB until Feb 19, 2010) shall be made available in the Metrics Portal. Possible tools are webalizer and awstats. The files can be synced using "rsync --bwlimit=400 -ave ssh aroides:/srv/weblogs/ ~/weblogs/"

Change History (9)

comment:1 Changed 9 years ago by karsten

Owner: changed from Karsten to karsten
Status: new → assigned

I looked at a web log sample from January 30 from one of our currently three www servers. Here's a sample line:

0.0.0.1 - - [30/Jan/2011:00:00:00 +0000] "GET /projects/projects.html.en HTTP/1.1" 200 3029 "https://www.torproject.org/docs/bridges.html.en" "-"

The format is Apache's Combined Log Format with the following exceptions:

  • The client IP address is replaced with either 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests.
  • The request time is set to 00:00:00 +0000.
  • The user-agent string is set to "-".
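As a rough illustration of the format described above, the sample line can be parsed with a regular expression. This is only a sketch: the names `COMBINED_RE` and `parse_line` are made up for this example and are not part of any existing metrics code.

```python
import re

# Apache Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # In these logs, 0.0.0.0 marks an HTTP request and 0.0.0.1 an HTTPS request.
    fields["scheme"] = {"0.0.0.0": "http", "0.0.0.1": "https"}.get(fields["host"])
    return fields


sample = ('0.0.0.1 - - [30/Jan/2011:00:00:00 +0000] '
          '"GET /projects/projects.html.en HTTP/1.1" 200 3029 '
          '"https://www.torproject.org/docs/bridges.html.en" "-"')
fields = parse_line(sample)
print(fields["scheme"], fields["request"], fields["status"], fields["size"])
```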

However, I found CONNECT requests and other non-GET requests in the logs, which are potentially sensitive. Also, the referer string may be sensitive, especially if it's a non-Tor URL. We should remove all log lines except GET requests and set the referer string to "-".

An even better approach is to define the information we want to keep:

We publish only GET requests with the following data fields:

  • 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests,
  • the request date,
  • the requested URL,
  • the HTTP version,
  • the server's HTTP status code, and
  • the size of the returned object.
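The whitelist above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the actual parser: `LINE_RE` and `sanitize_line` are hypothetical names, and dropping lines whose address field is neither 0.0.0.0 nor 0.0.0.1 is my own cautious choice rather than something decided in this ticket.

```python
import re

# Pick out only the fields we want to keep; everything else is ignored.
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<date>[^:\]]+)[^\]]*\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)(?:\s|$)'
)

SANITIZED = ('{host} - - [{date}:00:00:00 +0000] '
             '"GET {url} {version}" {status} {size} "-" "-"')


def sanitize_line(line):
    """Return a sanitized Combined Log Format line, or None to drop the line."""
    m = LINE_RE.match(line)
    if m is None or m.group("method") != "GET":
        return None  # drop CONNECT, other non-GET requests, and unparsable lines
    if m.group("host") not in ("0.0.0.0", "0.0.0.1"):
        return None  # assumption: never pass through an unexpected address field
    # Rebuild the line from the whitelisted fields only; referer and
    # user-agent are always written as "-".
    return SANITIZED.format(**m.groupdict())


raw = ('0.0.0.1 - - [30/Jan/2011:00:00:00 +0000] '
       '"GET /projects/projects.html.en HTTP/1.1" 200 3029 '
       '"https://www.torproject.org/docs/bridges.html.en" "-"')
print(sanitize_line(raw))
```

Building the output from the kept fields, rather than blanking fields in the input, means anything we did not think of is dropped by default.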

We retain Apache's Combined Log Format for the sanitized logs, so that we can use standard web log analysis tools.

Runa has web server log analysis on her TODO list. I explained this approach to her yesterday. She agreed with settling on a format like the one above and said that she'll find a way to work with it.

How do we proceed? Andrew says the sanitizing process cannot take place on the web servers, because they are quite busy already. Can we set up copying our web server logs to yatei to do the sanitizing there? I can write a parser as part of metrics-db and make daily updated sanitized web logs available in the metrics portal. I also want to make a graph on downloaded packages per day available on the metrics website. Once Runa starts her web server log analysis, we can extend this setup to copy the web server logs, either from the web servers or from yatei, to wherever she does the analysis.
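For the downloads-per-day graph, counting over the sanitized logs might look roughly like this. It is a sketch only: `downloads_per_day` is a hypothetical name, and both the list of file suffixes that count as a "package" and the rule of counting only status 200 are assumptions for illustration.

```python
import re
from collections import Counter
from datetime import date

REQUEST_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>\d{2})/(?P<mon>\w{3})/(?P<year>\d{4})[^\]]*\] '
    r'"GET (?P<url>\S+) HTTP/\d\.\d" (?P<status>\d{3}) '
)

MONTHS = {name: number for number, name in enumerate(
    ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), start=1)}

# Assumption: which suffixes count as a package is a guess, not a decision.
PACKAGE_SUFFIXES = (".exe", ".dmg", ".tar.gz", ".rpm", ".deb")


def downloads_per_day(lines):
    """Count successful GET requests for package files, keyed by ISO date."""
    counts = Counter()
    for line in lines:
        m = REQUEST_RE.match(line)
        if m is None or m.group("status") != "200":
            continue  # skip unparsable lines and anything but status 200
        if m.group("mon") not in MONTHS:
            continue  # skip lines with an unrecognized month name
        if not m.group("url").endswith(PACKAGE_SUFFIXES):
            continue
        day = date(int(m.group("year")), MONTHS[m.group("mon")], int(m.group("day")))
        counts[day.isoformat()] += 1
    return dict(sorted(counts.items()))
```

Partial or resumed downloads would need more thought than this; the sketch simply ignores them.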

comment:2 Changed 9 years ago by phobos

A few random thoughts. In discussions with the EFF, they recommended this plan from a legal perspective: https://www.eff.org/wp/osp. This suggests we could log in the default CLF and purge the logs within 48 hours. This plan would save space and still let us see summarized results.

I'd like to be able to see:

  1. visits and unique visitor counts by day of the week and day of the month
  2. top 25 pages by page view
  3. country of origin
  4. most viewed, entry, and exit pages
  5. downloads by package
  6. http errors
  7. referrers (sanitized if it includes PII)
  8. search engines, keyphrases and keywords
  9. overall traffic by webserver, if possible (is our load balancing working evenly?)
  10. top 10 paths through the site
  11. number of pages viewed per visit
  12. percentage and count of requests through tor exits
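Some of these items need only the requested URL and the status code, for example the top pages (item 2) and HTTP errors (item 6). A minimal sketch, assuming the log lines have already been parsed into hypothetical `(url, status)` pairs; the function names are made up for illustration. Items such as unique visitors or country of origin need the client IP address and cannot be computed this way.

```python
from collections import Counter


def top_pages(requests, n=25):
    """Return the n most requested URLs among successful (2xx) requests."""
    pages = Counter(url for url, status in requests if 200 <= status < 300)
    return pages.most_common(n)


def http_errors(requests):
    """Count requests per HTTP error status code (4xx and 5xx)."""
    return Counter(status for _, status in requests if status >= 400)


# Hypothetical input for illustration only.
requests = [
    ("/index.html", 200),
    ("/index.html", 200),
    ("/download.html", 200),
    ("/missing.html", 404),
]
print(top_pages(requests, n=2))
print(http_errors(requests))
```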

Another idea is to see what we can provide from the awstats feature list (http://awstats.sourceforge.net/) without disclosing PII.

And we should remember that this is more than just the logs for www.tpo: we also have the check, svn, gitweb, metrics, bridges, and trac websites to analyze.

comment:3 Changed 9 years ago by karsten

What you suggest sounds very different from what I had in mind. My idea was to make use of the existing months/years of logs to answer simple questions like how many Tor packages were downloaded per day. Your idea means setting up a new logging and log analysis infrastructure, and it will only give us results once we have done that. We could implement my idea within a few days from now. Your idea is likely going to take more time, depending on when Runa gets to it. We could even implement both ideas. Your call.

comment:4 Changed 9 years ago by arma

Yeah, it does sound like phobos should open a new ticket under 'website' or maybe 'infrastructure' with a title like 'change our website logging format and data retention habits'. What we change it to will require quite a bit of discussion. I would be uncomfortable having IP addresses of every user for every service even for 24 hours. (I think EFF's recommendation is a really crappy compromise that they were forced to make because otherwise none of the huge datamining corporations would listen to them at all.)

comment:5 in reply to:  4 Changed 9 years ago by karsten

Replying to arma:

> Yeah, it does sound like phobos should open a new ticket under 'website' or maybe 'infrastructure' with a title like 'change our website logging format and data retention habits'.

See #2489. Discussion about phobos' approach should go there.

comment:6 Changed 9 years ago by karsten

Andrew, how do we proceed here? Can we set up copying the current logs from our wwws to yatei to sanitize and publish them?

comment:7 Changed 8 years ago by karsten

Component: Metrics → Analysis

Moving from Metrics component (which I'm going to rename to Metrics Data Processor soon) to Analysis.

comment:8 Changed 8 years ago by karsten

Cc: phobos runa added
Status: assigned → needs_information

We have webstats.tpo for this. Andrew, Runa, anything else we need to do here? If not, I'll close this ticket.

comment:9 Changed 7 years ago by karsten

Resolution: fixed
Status: needs_information → closed

We have webstats.tpo which Runa handles. Nothing to do here anymore. Closing.
