Opened 8 years ago

Closed 8 years ago

Last modified 7 years ago

#4463 closed project (fixed)

Set up web log analysis tool

Reported by: runa
Owned by: runa
Priority: Medium
Milestone:
Component: Webpages/Website
Version:
Severity:
Keywords: SponsorZ
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The goal is to set up web server log analysis for daily processing of all web server logs (www, blog, trac, and so on). This all started with #2489 Set up new web server logging and log analysis infrastructure.

Karsten is analyzing and sanitizing the logs that we have, and I'm going to research tools we can use to display the logs in a user-friendly way.

Suggested tools are AWStats and Piwik. I'm sure there are more tools out there as well.

Child Tickets

Attachments (1)

awstats.www.conf (59.6 KB) - added by runa 8 years ago.


Change History (18)

comment:1 Changed 8 years ago by runa

Owner: changed from phobos to runa
Status: new → assigned

comment:2 Changed 8 years ago by karsten

Milestone: Sponsor Z: December 31, 2011
Type: task → project

Changing this ticket to a project and adding it to the sponsor Z milestone that's due on December 31, 2011, so that it appears on the sponsor Z deliverable list.

Also note that Webalizer may be another web log analysis tool that we could use. We even have preliminary results generated by Webalizer.

comment:3 Changed 8 years ago by runa

Summary: Research web log analysis tools → Set up web log analysis tool

comment:4 Changed 8 years ago by runa

I looked at four different web log analysis tools, here's what I found:

Piwik looks great, but is not available in Ubuntu or Debian. Setting it up manually is pretty straightforward, but you will not be able to import Apache logs without using a third-party script. Last time I checked, that third-party script had some issues with our sanitized log format.

AWStats is easy to set up and easy to use, but incredibly slow when importing logs. I set up AWStats on an Ubuntu EC2 instance and pulled the sanitized logs for January and February 2010 (you only get 8 GB storage). The import of wiki.torproject.org-access.log was pretty quick, and we have some preliminary results. However, the import of www.torproject.org-access.log does not complete at all. Maybe it's because I tried to do all this in the cloud, or maybe it's just AWStats.

Webalizer is just as easy to set up and use as AWStats. It doesn't look as pretty, but it's a lot faster when it comes to importing existing logs. I managed to set it up and import the Jan+Feb www.torproject.org-access.log without any problems.
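For reference, a minimal Webalizer setup along those lines might look like the sketch below. All paths and the hostname are placeholders, not our actual configuration:

```
# /etc/webalizer/webalizer.conf (sketch; paths and hostname are placeholders)
LogFile   /srv/weblogs/www.torproject.org-access.log
LogType   clf
OutputDir /var/www/webalizer
HostName  www.torproject.org
```

Webalizer then just gets run against the config, and it writes static HTML reports into OutputDir.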

Splunk was recommended to me by someone on Twitter, so I figured I'd look into it. The free version of Splunk allows you to index only 500 megabytes of data per day; we probably want more than that.

Another option is to write our own parser and use R to create graphs similar to what we have on metrics.tpo. Writing our own parser will take some time, so maybe we should just go with Webalizer for now.

comment:5 Changed 8 years ago by karsten

I'm curious why AWStats takes so long to process our sanitized logs. This isn't only relevant for picking a web log analysis tool for ourselves, but also for providing logs that will be useful for others.

I wonder if AWStats is confused that all requests come in at 00:00:00 and from either 0.0.0.0 or 0.0.0.1. Maybe we can teach it not to look at these data fields to reconstruct user sessions.
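For illustration, here is a small Python sketch of the problem. The log line below is made up (the exact sanitized format is an assumption), but it follows the pattern described above: once the client address and time of day are zeroed out, there is nothing left for a tool to reconstruct sessions from.

```python
# Hypothetical sanitized Apache combined-log line: the client IP has been
# rewritten to 0.0.0.0/0.0.0.1 and the time-of-day zeroed out, so every
# request in a day looks identical for session-tracking purposes.
line = '0.0.0.0 - - [19/Jan/2010:00:00:00 +0000] "GET / HTTP/1.1" 200 1234 "-" "-"'

# Pull out the two fields a log analyzer would use to group requests by user.
ip = line.split(" ", 1)[0]
timestamp = line[line.index("[") + 1:line.index("]")]

print(ip)         # → 0.0.0.0 (shared by all sanitized requests)
print(timestamp)  # → 19/Jan/2010:00:00:00 +0000 (time always 00:00:00)
```

Any session-reconstruction heuristic keyed on (IP, time) collapses every request in a day into one giant "visit", which may be exactly what confuses AWStats.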

Can you paste your AWStats config and a little howto somewhere? I'd like to try it on my local Debian machine with 500 GiB disk space.

comment:6 in reply to:  5 Changed 8 years ago by runa

Replying to karsten:

> I wonder if AWStats is confused that all requests come in at 00:00:00 and from either 0.0.0.0 or 0.0.0.1. Maybe we can teach it not to look at these data fields to reconstruct user sessions.

AWStats didn't have any problems with the log for the wiki. I haven't spent too much time looking into fine-tuning AWStats to ignore certain fields, though.

> Can you paste your AWStats config and a little howto somewhere? I'd like to try it on my local Debian machine with 500 GiB disk space.

The Ubuntu AWStats howto covers the basics. I have attached the (almost default) config I used for the www log.
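For anyone following along without the attachment, the key directives in such a config look roughly like this. The values below are illustrative, not the attached file's actual contents:

```
# Excerpt-style sketch of an AWStats config (awstats.www.conf);
# values are placeholders, not the attached file's actual contents.
LogFile="/srv/weblogs/www.torproject.org-access.log"
LogType=W
LogFormat=1
SiteDomain="www.torproject.org"
DNSLookup=0
DirData="/var/lib/awstats"
```

LogFormat=1 is AWStats' setting for the Apache combined log format, and DNSLookup=0 skips reverse-DNS lookups, which would be pointless on sanitized 0.0.0.0 addresses anyway.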

Changed 8 years ago by runa

Attachment: awstats.www.conf added

comment:7 Changed 8 years ago by karsten

So, I ran AWStats tonight to parse the 2010 logs. It took 3 hours and 38 minutes, which is reasonable, IMO. We only have to import a few years of logs once, and if that takes 12 hours, that's fine.

I have no idea why it took forever for you, though. Maybe it was a problem with available disk space, who knows.

comment:8 in reply to: 7 Changed 8 years ago by runa

Replying to karsten:

> So, I ran AWStats tonight to parse the 2010 logs. It took 3 hours and 38 minutes, which is reasonable, IMO. We only have to import a few years of logs once, and if that takes 12 hours, that's fine.

Cool, can you make the results available? That means we have two options: AWStats and Webalizer. Which one's your favorite? Or should we roll our own solution?

> I have no idea why it took forever for you, though. Maybe it was a problem with available disk space, who knows.

Maybe it was related to AWS, not AWStats.

comment:9 in reply to: 8 Changed 8 years ago by karsten

Replying to runa:

> Cool, can you make the results available?

How would I do that? I don't think the results are written to static HTML files, are they? Can I send you a tarball of some directory (that you're going to tell me), and you make the shiny stats available?

> That means we have two options: AWStats and Webalizer. Which one's your favorite? Or should we roll our own solution?

Can we run both? I don't know if we'll run into problems with the daily (?) updates once we have them. I'm waiting for our VM server to return, and then I'm going to set up the sanitizing code. The next step will be to set up a VM with AWStats and/or Webalizer. I could imagine we might run into other problems then, so I'd rather not exclude either of the two tools yet.

If we can avoid it, let's avoid writing something ourselves at this point.

comment:10 in reply to: 9 Changed 8 years ago by runa

Replying to karsten:

> Replying to runa:
>
>> Cool, can you make the results available?
>
> How would I do that? I don't think the results are written to static HTML files, are they? Can I send you a tarball of some directory (that you're going to tell me), and you make the shiny stats available?

No static HTML as far as I know. You'd need to run apache2 on the same host. If you can create a tarball of the following directories (I don't think all of them are necessary, but hey), I'll make the stats available on the EC2 server:

/var/lib/awstats
/usr/share/doc-base/awstats
/usr/share/awstats
/usr/share/doc/awstats
/etc/awstats
/etc/cron.d/awstats

>> That means we have two options: AWStats and Webalizer. Which one's your favorite? Or should we roll our own solution?
>
> Can we run both? I don't know if we'll run into problems with the daily (?) updates once we have them. I'm waiting for our VM server to return, and then I'm going to set up the sanitizing code. The next step will be to set up a VM with AWStats and/or Webalizer. I could imagine we might run into other problems then, so I'd rather not exclude either of the two tools yet.

I don't see a problem with running both, so yes. A lot of people run both because they like something from AWStats that isn't available in Webalizer and vice versa. Should we set up the web log analysis tools on the same VM as the one sanitizing the logs, or should we get a new one?

> If we can avoid it, let's avoid writing something ourselves at this point.

I agree.

comment:11 in reply to: 10 Changed 8 years ago by karsten

Replying to runa:

> No static HTML as far as I know. You'd need to run apache2 on the same host. If you can create a tarball of the following directories (I don't think all of them are necessary, but hey), I'll make the stats available on the EC2 server:
>
> /var/lib/awstats
> /usr/share/doc-base/awstats
> /usr/share/awstats
> /usr/share/doc/awstats
> /etc/awstats
> /etc/cron.d/awstats

I'm going to send you the tarballs later today.

> I don't see a problem with running both, so yes. A lot of people run both because they like something from AWStats that isn't available in Webalizer and vice versa.

Sounds good.

> Should we set up the web log analysis tools on the same VM as the one sanitizing the logs, or should we get a new one?

I think we should get a new one. The VM that has non-sanitized logs shouldn't run a web server. It will also be the VM that sanitizes bridge descriptors.

If you want to start setting up the VM with AWStats and Webalizer, please don't wait for me to set up the VM that sanitizes logs. The 2010 logs should be sufficient to get something running. The connection to the sanitizing VM will be a cronjob rsync'ing the sanitized logs as you find them in the tarballs. Both the AWStats and the Webalizer setup should be able to handle adding new sanitized log files and removing files older than, say, one week. We'll probably want to keep back the logs from a given day until that day is over (the sorting doesn't make much sense if we're sorting requests from just a few hours), so files shouldn't change once you get them.
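That cronjob arrangement could be sketched roughly as follows. The hostname, paths, and schedule below are made up for illustration, not our actual setup:

```
# /etc/cron.d sketch on the stats VM (hostname, user, and paths are placeholders).
# Pull newly sanitized logs each night, then expire files older than a week.
30 3 * * * stats rsync -a logsanitizer.torproject.org:/srv/sanitized-logs/ /srv/weblogs/incoming/
45 3 * * * stats find /srv/weblogs/incoming/ -type f -mtime +7 -delete
```

Since files don't change once delivered, rsync's default size-and-timestamp comparison is enough; no --checksum pass is needed.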

>> If we can avoid it, let's avoid writing something ourselves at this point.
>
> I agree.

Okay.

comment:12 in reply to:  11 Changed 8 years ago by runa

Replying to karsten:

> Replying to runa:
>
>> No static HTML as far as I know. You'd need to run apache2 on the same host. If you can create a tarball of the following directories (I don't think all of them are necessary, but hey), I'll make the stats available on the EC2 server:
>>
>> /var/lib/awstats
>> /usr/share/doc-base/awstats
>> /usr/share/awstats
>> /usr/share/doc/awstats
>> /etc/awstats
>> /etc/cron.d/awstats
>
> I'm going to send you the tarballs later today.

Sounds good.

>> Should we set up the web log analysis tools on the same VM as the one sanitizing the logs, or should we get a new one?
>
> I think we should get a new one. The VM that has non-sanitized logs shouldn't run a web server. It will also be the VM that sanitizes bridge descriptors.

Ok, I have requested a VM for AWStats and Webalizer in #4634.

> If you want to start setting up the VM with AWStats and Webalizer, please don't wait for me to set up the VM that sanitizes logs. The 2010 logs should be sufficient to get something running. The connection to the sanitizing VM will be a cronjob rsync'ing the sanitized logs as you find them in the tarballs. Both the AWStats and the Webalizer setup should be able to handle adding new sanitized log files and removing files older than, say, one week. We'll probably want to keep back the logs from a given day until that day is over (the sorting doesn't make much sense if we're sorting requests from just a few hours), so files shouldn't change once you get them.

Sounds like a good plan.

comment:13 in reply to: 9 Changed 8 years ago by runa

Replying to karsten:

> Replying to runa:
>
>> Cool, can you make the results available?
>
> How would I do that? I don't think the results are written to static HTML files, are they? Can I send you a tarball of some directory (that you're going to tell me), and you make the shiny stats available?

All of the 2010 logs for www.tpo in AWStats can be found here: http://107.22.35.94/awstats/awstats.pl?month=01&year=2010&output=main&config=www&framename=index

Logs for 01-2010 and 02-2010 for wiki.tpo in AWStats can be found here: http://107.22.35.94/awstats/awstats.pl?month=01&year=2010&output=main&config=wiki&framename=index

comment:14 Changed 8 years ago by runa

The logs have been imported into Webalizer (see https://webstats.torproject.org/webalizer/); we're waiting for AWStats to be configured properly before running an update.

comment:16 Changed 8 years ago by runa

Resolution: fixed
Status: assignedclosed

We've got AWStats and Webalizer running, and we're working on adding logs for more torproject.org domains. Closing this ticket now; feel free to reopen it if something related to the web log analysis tools comes up.

comment:17 Changed 7 years ago by karsten

Keywords: SponsorZ added
Milestone: Sponsor Z: December 31, 2011 removed

Switching from using milestones to keywords for sponsor deliverables. See #6365 for details.
