Opened 8 years ago

Closed 7 years ago

#6196 closed task (invalid)

Copy Apache logs to stenodon

Reported by: runa Owned by: phobos
Priority: Medium Milestone:
Component: Webpages/Website Version:
Severity: Keywords:
Cc: runa, karsten Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

As part of #4859, we want to sanitize and process logs for more torproject.org domains. Section 1 in the README lists the steps to prepare the source and destination host, and copy the log files over to stenodon.

Child Tickets

Change History (46)

comment:1 Changed 8 years ago by runa

Summary: Copy www.torproject.org-access.log to stenodon → Copy Apache logs to stenodon

I believe we'll want to sanitize and process logs for the following hosts: www, blog, trac, metrics, gitweb, and bridges.

comment:2 Changed 8 years ago by weasel

Status: new → needs_information

What's the status here? Are you ready to take more?

comment:3 Changed 8 years ago by runa

Status: needs_information → new

Yes, ready for more logs now. Should we start with www.tpo?

comment:4 Changed 8 years ago by phobos

In relation to #7004, having current, daily reporting would be great. I'd like to know what our traffic actually is, as it seems high to me.

comment:5 Changed 8 years ago by runa

Cc: runa added

comment:6 Changed 8 years ago by runa

Peter pointed out that we currently have four machines that are www.torproject.org, and asked what's needed to get them all counted on webstats.tpo. I'm not entirely sure how that part of the process works. Here's what Karsten said:

I'm not entirely sure, either. I *think* you only need to copy the original logs to 1 subdirectory per physical host, e.g., in/vescum/*, and files should only contain the virtual host in their file name, not the physical host, e.g., www.torproject.org-access.log-20111224.gz. So that would be in/vescum/www.torproject.org-access.log-20111224.gz. But I'm not 100% sure.

Can you try out the following?

  • Create a new webstats instance, for example on your laptop (don't test this on the live system).
  • Copy vescum's access.log files to in/vescum/www.torproject.org-access.log-20111224.gz, majus' access.log files to in/majus/www.torproject.org-access.log-20111224.gz, etc. Do that for a week's worth of data that is at least 4 or 5 days old; e.g., Sep 17--23.
  • Run webstats and see if there's just a single file in out/ for www.torproject.org that contains requests from all input files (a shell sketch of these steps follows below).
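
A minimal shell sketch of that test setup, assuming the webstats working copy lives in ~/webstats and the raw logs have already been fetched to ~/rawlogs/<host>/ (both paths are hypothetical):

cd ~/webstats
mkdir -p in/vescum in/majus out
# one subdirectory per physical host; file names carry only the virtual host
cp ~/rawlogs/vescum/www.torproject.org-access.log-201209*.gz in/vescum/
cp ~/rawlogs/majus/www.torproject.org-access.log-201209*.gz in/majus/
# compile and run webstats (see the exact commands in comment:11 below),
# then check that out/ contains a single www.torproject.org file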


comment:7 Changed 8 years ago by runa

Karsten: where is the script that takes logs, sanitizes them, and puts them in out/ for webalizer to copy and process?

comment:8 Changed 8 years ago by runa

Cc: karsten added

comment:9 in reply to:  7 Changed 8 years ago by karsten

Replying to runa:

Karsten: where is the script that takes logs, sanitizes them, and puts them in out/ for webalizer to copy and process?

There's a single Java class doing that: https://gitweb.torproject.org/webstats.git/blob/HEAD:/src/org/torproject/webstats/Main.java

comment:10 Changed 8 years ago by runa

Hm, how do I run this? I tried running java src/org/torproject/webstats/Main.java with in/ and out/ in the current directory, but got "Could not find the main class: src/org/torproject/webstats/Main.java. Program will exit".

comment:11 Changed 8 years ago by karsten

Here's the command from stenodon's /home/webstats/webstats/cron.sh script:

javac -d classes/ -cp lib/commons-compress-1.0.jar src/org/torproject/webstats/Main.java
java -cp classes/:lib/commons-compress-1.0.jar org.torproject.webstats.Main

Note that you'll have to download commons-compress-1.0.jar and put it in the lib/ directory. You could copy stenodon's file, of course.
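
Putting it together, a sketch of a full local run, assuming the jar can still be fetched from Maven Central (the exact URL is an assumption) and that classes/, lib/, in/ and out/ sit in the working directory:

mkdir -p classes lib in out
# assumption: commons-compress 1.0 is available at this Maven Central path
wget -P lib/ https://repo1.maven.org/maven2/org/apache/commons/commons-compress/1.0/commons-compress-1.0.jar
javac -d classes/ -cp lib/commons-compress-1.0.jar src/org/torproject/webstats/Main.java
java -cp classes/:lib/commons-compress-1.0.jar org.torproject.webstats.Main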

comment:12 Changed 8 years ago by runa

webstats will output a single file in out/ for www.torproject.org, but I am not confident that it contains information from all input files. As a quick test, I counted the number of lines with "volunteer.html" in both the output and input files; the files in out/ have a total of 16812 lines, while the input files have a total of 34383 lines. Shouldn't each line in the input files count as one new request?
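
One way to do that comparison, sketched with hypothetical paths (the input files are gzip-compressed; use zcat for the output files too if they are compressed):

# total volunteer.html requests across the raw input files
zcat in/*/*.gz | grep -c volunteer.html
# the same count across the sanitized output files
find out/ -type f -exec cat {} + | grep -c volunteer.html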

comment:13 Changed 8 years ago by karsten

Did you compare log lines with a date at least 4--5 days in the past?

Also, 404's are discarded in the sanitizing process. You'll have to ignore these lines in your comparison, too.

There may be more differences between input and output files that I'm not aware of right now.

comment:14 Changed 8 years ago by phobos

Why not just copy and use the raw log files?

If it's sensitive, we shouldn't be recording it at all. If it's sensitive, it's open to subpoena/theft/leaking and we shouldn't have the data at all.

comment:15 in reply to:  14 Changed 8 years ago by karsten

Replying to phobos:

Why not just copy and use the raw log files?

If it's sensitive, we shouldn't be recording it at all. If it's sensitive, it's open to subpoena/theft/leaking and we shouldn't have the data at all.

It feels like we had this discussion a few times. I strongly suggest we don't use the raw log files directly. The fact that Runa has difficulties setting up webstats and I have hardly any time to help her with it shouldn't make us use raw log files.

comment:16 in reply to:  13 Changed 8 years ago by runa

Replying to karsten:

Did you compare log lines with a date at least 4--5 days in the past?

The logs I have are Sept 30 -- Oct 09.

Also, 404's are discarded in the sanitizing process. You'll have to ignore these lines in your comparison, too.

I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.

There may be more differences between input and output files that I'm not aware of right now.

Can you think of anything that would explain why I am seeing more lines with /css/master.css in the sanitized files? It would be great if you could document the differences between input and output when you have more time.

comment:17 in reply to:  14 Changed 8 years ago by runa

Replying to phobos:

Why not just copy and use the raw log files?

If it's sensitive, we shouldn't be recording it at all. If it's sensitive, it's open to subpoena/theft/leaking and we shouldn't have the data at all.

We had this discussion a year ago. If you want to change our Apache logging format to match what Karsten's sanitization script outputs, then sure, we can include the raw logs. We will still need to sanitize logs for 2011 and 2012, though.

comment:18 Changed 8 years ago by karsten

Replying to runa:

Replying to karsten:

Did you compare log lines with a date at least 4--5 days in the past?

The logs I have are Sept 30 -- Oct 09.

Can you encrypt and upload those logs somewhere for me?

Also, 404's are discarded in the sanitizing process. You'll have to ignore these lines in your comparison, too.

I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.

There may be more differences between input and output files that I'm not aware of right now.

Can you think of anything that would explain why I am seeing more lines with /css/master.css in the sanitized files?

Not yet.

It would be great if you could document the differences between input and output when you have more time.

Will do. That's actually a TODO in the Java file, but I never got around to it.

comment:19 in reply to:  18 ; Changed 8 years ago by karsten

Replying to karsten:

Replying to runa:

I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.

I ran webstats on the files you gave me and got 714379 lines containing "GET /css/master.css" in the input files and 714378 such lines in the output files. That looks normal to me. Can you check again that you have more lines in the input files than in the output files?

comment:20 in reply to:  19 ; Changed 8 years ago by runa

Replying to karsten:

Replying to karsten:

Replying to runa:

I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.

I ran webstats on the files you gave me and got 714379 lines containing "GET /css/master.css" in the input files and 714378 such lines in the output files. That looks normal to me. Can you check again that you have more lines in the input files than in the output files?

I get the same numbers as you. Turns out I was counting things wrong. Now that we have the output that we want, can weasel copy logs to stenodon? :)

comment:21 in reply to:  20 Changed 8 years ago by karsten

Replying to runa:

I get the same numbers as you. Turns out I was counting things wrong. Now that we have the output that we want, can weasel copy logs to stenodon? :)

Sure, sounds fine! (If that question was addressed to me.)

comment:22 in reply to:  18 Changed 8 years ago by karsten

Status: new → needs_review

Replying to karsten:

Replying to runa:

It would be great if you could document the differences between input and output when you have more time.

Will do. That's actually a TODO in the Java file, but I never got around to it.

Done. Please merge branch task-6196 from my public repository.

comment:23 Changed 8 years ago by weasel

There is now an /srv/webstats.tpo/incoming on stenodon.
You probably want to consider this read-only for the webstats user; cronjobs push and remove stuff there with rsync.
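
As a rough sketch of such a push cronjob on one of the web servers (host name, account, and paths are assumptions, and the real setup goes through the restricted ssh-wrap script mentioned below):

#!/bin/sh
# daily cronjob: push rotated Apache logs into stenodon's incoming area,
# which the webstats user only ever reads
rsync -az /var/log/apache2/*-access.log-*.gz \
    webstats@stenodon.torproject.org:/srv/webstats.tpo/incoming/$(hostname)/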

comment:24 Changed 8 years ago by runa

I have updated /home/webstats/bin/ssh-wrap to put files in /srv/webstats.torproject.org/home/webstats/in. weasel says we should not remove any of the logs in in/ even after processing them. Karsten, can you please update your script to not remove these logs?

comment:25 Changed 8 years ago by karsten

Please see branch task-6196-2 in my public repository. Most important changes are:

  1. Files in in/ are no longer deleted after parsing them.
  2. New files in out/ are automatically .gz-compressed.

If you like these changes, please merge them into master.

I suspended the cronjob on stenodon. Once I have your okay, I'll switch to this branch and re-run it on the data that's now in the in/ directory. I expect that to work out fine and not run into problems with disk space, now that we're compressing output files on the fly.
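
A quick sanity check once the branch is live, with paths assumed from the directories mentioned earlier in this ticket:

# confirm the sanitizer now writes gzip-compressed files into out/
find /home/webstats/webstats/out -name '*.gz' | head
# spot-check that a compressed file still contains valid log lines
zcat "$(find /home/webstats/webstats/out -name '*.gz' | head -n 1)" | head -n 3
# keep an eye on free space while the backlog is processed
df -h /srv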

comment:26 Changed 8 years ago by runa

What is the filename of a .gz-compressed log once it's available in out/? The reason for asking is that the Webalizer configuration file has to specify the full path and filename before Webalizer is able to parse and import it, so *.gz won't work.

comment:27 in reply to:  26 ; Changed 8 years ago by karsten

Replying to runa:

What is the filename of a .gz-compressed log once it's available in out/? The reason for asking is that the Webalizer configuration file has to specify the full path and filename before Webalizer is able to parse and import it, so *.gz won't work.

Here's an example: out/2012/10/31/metrics.torproject.org-access.log.gz
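
A sketch of how the Webalizer side could pick up such a file, assuming one configuration file per virtual host and a Webalizer build with gzip support (command and paths are illustrative, not the actual scripts):

#!/bin/sh
# process yesterday's sanitized log for one virtual host
day=$(date -d yesterday +%Y/%m/%d)
log=/home/webstats/webstats/out/$day/metrics.torproject.org-access.log.gz
[ -f "$log" ] && webalizer -c /etc/webalizer/metrics.torproject.org.conf "$log"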

comment:28 in reply to:  27 ; Changed 8 years ago by runa

Replying to karsten:

Replying to runa:

What is the filename of a .gz-compressed log once it's available in out/? The reason for asking is that the Webalizer configuration file has to specify the full path and filename before Webalizer is able to parse and import it, so *.gz won't work.

Here's an example: out/2012/10/31/metrics.torproject.org-access.log.gz

Oh, great, that makes everything really easy. I'll update the scripts now and let you know when it's safe to turn the cronjob back on.

By the way, /home/webstats/in/aroides/media.torproject.org-access.log-20121101.gz contains info from Oct 31st. Bug or by design?

comment:29 Changed 8 years ago by runa

Ok, please turn the cronjob back on. Should I include logs for onionoo at this point or continue to ignore them?

comment:30 in reply to:  28 Changed 8 years ago by karsten

Replying to runa:

By the way, /home/webstats/in/aroides/media.torproject.org-access.log-20121101.gz contains info from Oct 31st. Bug or by design?

That's an input file that was created on November 1st. It's fine to have lines from October 31st in it.

comment:31 in reply to:  29 ; Changed 8 years ago by karsten

Status: needs_review → new

Replying to runa:

Ok, please turn the cronjob back on.

The disk is already full with input files. We'll need about as much free disk space as there are input files, ideally twice that amount. Is there a reason why /home/webstats/webstats/in/ is a copy of /srv/webstats.torproject.org/incoming/ instead of a symbolic link? Can we change that and delete the files in /home/webstats/webstats/in/?

Should I include logs for onionoo at this point or continue to ignore them?

Recent logs should be fine to include, but logs from a few months ago might still have the old URL format that we can't sanitize very well.

comment:32 in reply to:  31 ; Changed 8 years ago by runa

Replying to karsten:

Replying to runa:

Ok, please turn the cronjob back on.

The disk is already full with input files. We'll need about as much free disk space as there are input files, better twice that amount. Is there a reason why /home/webstats/webstats/in/ is a copy of /srv/webstats.torproject.org/incoming/ instead of a symbolic link? Can we change that and delete the files in /home/webstats/webstats/in/?

I first configured the ssh-wrap script to put files in /home/webstats/webstats/in/, and then changed it back to /srv/webstats.torproject.org/incoming/ without cleaning up. I believe it is safe to delete the files in /home/webstats/webstats/in/ and create a symbolic link instead.

Should I include logs for onionoo at this point or continue to ignore them?

Recent logs should be fine to include, but logs from a few months ago might still have the old URL format that we can't sanitize very well.

Ok, I updated the scripts to include onionoo from now on.

comment:33 in reply to:  32 Changed 8 years ago by karsten

Replying to runa:

Replying to karsten:

Replying to runa:

Ok, please turn the cronjob back on.

The disk is already full with input files. We'll need about as much free disk space as there are input files, better twice that amount. Is there a reason why /home/webstats/webstats/in/ is a copy of /srv/webstats.torproject.org/incoming/ instead of a symbolic link? Can we change that and delete the files in /home/webstats/webstats/in/?

I first configured the ssh-wrap script to put files in /home/webstats/webstats/in/, and then changed it back to /srv/webstats.torproject.org/incoming/ without cleaning up. I believe it is safe to delete the files in /home/webstats/webstats/in/ and create a symbolic link instead.

I just looked, and it's not safe to delete the files in /home/webstats/webstats/in/, because yatei still rsyncs its files there, not to /srv/webstats.torproject.org/incoming/. (I can't fix this right now, because I'm doing too many things at once. Just leaving this note here. Should be an easy fix though.)
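
Once yatei's cronjob has been repointed at the incoming directory, the fix described here would roughly amount to the following on stenodon (paths taken from earlier comments; the order matters so that no freshly pushed files are lost):

# 1. on yatei: change the rsync destination from /home/webstats/webstats/in/
#    to /srv/webstats.torproject.org/incoming/
# 2. on stenodon: drop the stale copy and symlink it to the shared incoming area
rm -rf /home/webstats/webstats/in
ln -s /srv/webstats.torproject.org/incoming /home/webstats/webstats/in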

comment:34 Changed 8 years ago by runa

What's the status here? I haven't seen https://webstats.torproject.org/webalizer/ update since November 7th.

comment:35 Changed 8 years ago by karsten

I can reactivate the cronjob whenever you're ready. I hope that sanitized logs compress better than the originals, because there are only 3.4G left on /srv with 4.0G original logs to be processed. Shall I start the processing?

comment:36 Changed 8 years ago by weasel

Resolution: fixed
Status: new → closed

We copy stuff to stenodon now. What you do with it is maybe something that doesn't involve torproject-admin.

comment:37 Changed 8 years ago by runa

Component: Tor Sysadmin Team → Metrics Website
Resolution: fixed
Status: closed → reopened

Please reactivate the cronjob and we'll see how it goes.

comment:38 Changed 8 years ago by runa

Owner: set to karsten
Status: reopened → assigned

comment:39 Changed 8 years ago by karsten

It's been running for 18 hours now, chewing on months or even years of data. I can't monitor the current execution in the next week or two, but I'll keep it running. I started the job manually and did not reactivate the cronjob, because I first want to see the results. This is really a lot of data.

comment:40 Changed 8 years ago by karsten

The first job took 28 hours and succeeded. I just ran another job to process logs from the past week and re-enabled the cronjob. I think that webstats correctly writes sanitized log files to out/, but I don't know what your scripts are doing to those files afterwards. Please check the output of your scripts.

comment:41 Changed 7 years ago by karsten

Component: Metrics Website → Website
Owner: changed from karsten to phobos

I'm not maintaining webstats on stenodon anymore, and I'm not sure who does, or rather if anyone does. Reassigning to phobos in case he knows or has plans for this.

comment:42 Changed 7 years ago by phobos

It appears that webstats stopped updating in June 2013. I'm fine with either finding an owner, tearing it all down, or starting over.

comment:43 Changed 7 years ago by phobos

I suggest tearing it all down.

comment:44 in reply to:  43 Changed 7 years ago by karsten

Replying to phobos:

I suggest tearing it all down.

Just to be clear: tearing it all down means stopping all cronjobs copying weblogs to stenodon and then shutting down stenodon, right? Sounds good to me. But are you sure?

We'll probably want to create a new ticket for this and include weasel.

comment:45 Changed 7 years ago by phobos

Right. They haven't been working since June and no one has noticed or cared. We need to maintain fewer things, so let's tear it down.

comment:46 Changed 7 years ago by karsten

Resolution: invalid
Status: assigned → closed

Created #10038 for this. Closing.
