Opened 8 years ago

Closed 7 years ago

#6196 closed task (invalid)

Copy Apache logs to stenodon

Reported by: runa Owned by: phobos
Priority: Medium Milestone:
Component: Webpages/Website Version:
Severity: Keywords:
Cc: runa, karsten Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

As part of #4859, we want to sanitize and process logs for more torproject.org domains. Section 1 in the README lists the steps to prepare the source and destination host, and copy the log files over to stenodon.

Child Tickets

Change History (46)

comment:1 Changed 8 years ago by runa

Summary: Copy www.torproject.org-access.log to stenodon → Copy Apache logs to stenodon

I believe we'll want to sanitize and process logs for the following hosts: www, blog, trac, metrics, gitweb, and bridges.

comment:2 Changed 8 years ago by weasel

Status: new → needs_information

What's the status here? Are you ready to take more?

comment:3 Changed 8 years ago by runa

Status: needs_information → new

Yes, ready for more logs now. Should we start with www.tpo?

comment:4 Changed 8 years ago by phobos

In relation to #7004, having current, daily reporting would be great. I'd like to know what our traffic actually is, as it seems high to me.

comment:5 Changed 8 years ago by runa

Cc: runa added

comment:6 Changed 8 years ago by runa

Peter pointed out that we currently have four machines that are www.torproject.org, and asked what's needed to get them all counted on webstats.tpo. I'm not entirely sure how that part of the process works. Here's what Karsten said:

I'm not entirely sure, either. I *think* you only need to copy the original logs to 1 subdirectory per physical host, e.g., in/vescum/*, and files should only contain the virtual host in their file name, not the physical host, e.g., www.torproject.org-access.log-20111224.gz. So that would be in/vescum/www.torproject.org-access.log-20111224.gz. But I'm not 100% sure.

Can you try out the following?

  • Create a new webstats instance, for example on your laptop (don't test this on the live system).
  • Copy vescum's access.log files to in/vescum/www.torproject.org-access.log-20111224.gz, majus' access.log files to in/majus/www.torproject.org-access.log-20111224.gz, etc. Do that for a week's worth of data that is at least 4 or 5 days old; e.g., Sep 17--23.
  • Run webstats and see if there's just a single file in out/ for www.torproject.org that contains requests from all input files (a shell sketch of these steps follows below).
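
A minimal shell sketch of that test setup, assuming the webstats working copy lives in ~/webstats and the raw logs have already been fetched to ~/rawlogs/<host>/ (both paths are hypothetical):

cd ~/webstats
mkdir -p in/vescum in/majus out
# one subdirectory per physical host; file names carry only the virtual host
cp ~/rawlogs/vescum/www.torproject.org-access.log-201209*.gz in/vescum/
cp ~/rawlogs/majus/www.torproject.org-access.log-201209*.gz in/majus/
# compile and run webstats (see the exact commands in comment:11 below),
# then check that out/ contains a single www.torproject.org file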


comment:7 Changed 8 years ago by runa

Karsten: where is the script that takes logs, sanitizes them, and puts them in out/ for webalizer to copy and process?

comment:8 Changed 8 years ago by runa

Cc: karsten added

comment:9 in reply to:  7 Changed 8 years ago by karsten

Replying to runa:

Karsten: where is the script that takes logs, sanitizes them, and puts them in out/ for webalizer to copy and process?

There's a single Java class doing that: https://gitweb.torproject.org/webstats.git/blob/HEAD:/src/org/torproject/webstats/Main.java

comment:10 Changed 8 years ago by runa

Hm, how do I run this? I tried running java src/org/torproject/webstats/Main.java with in/ and out/ in the current directory, but got "Could not find the main class: src/org/torproject/webstats/Main.java. Program will exit".

comment:11 Changed 8 years ago by karsten

Here's the command from stenodon's /home/webstats/webstats/cron.sh script:

javac -d classes/ -cp lib/commons-compress-1.0.jar src/org/torproject/webstats/Main.java
java -cp classes/:lib/commons-compress-1.0.jar org.torproject.webstats.Main

Note that you'll have to download commons-compress-1.0.jar and put it in the lib/ directory. You could copy stenodon's file, of course.
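
Putting it together, a sketch of a full local run, assuming the jar can still be fetched from Maven Central (the exact URL is an assumption) and that classes/, lib/, in/ and out/ sit in the working directory:

mkdir -p classes lib in out
# assumption: commons-compress 1.0 is available at this Maven Central path
wget -P lib/ https://repo1.maven.org/maven2/org/apache/commons/commons-compress/1.0/commons-compress-1.0.jar
javac -d classes/ -cp lib/commons-compress-1.0.jar src/org/torproject/webstats/Main.java
java -cp classes/:lib/commons-compress-1.0.jar org.torproject.webstats.Main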

comment:12 Changed 8 years ago by runa

webstats will output a single file in out/ for www.torproject.org, but I am not confident that it contains information from all input files. As a quick test, I counted the number of lines with "volunteer.html" in both the output and input files; the files in out/ have a total of 16812 lines, while the input files have a total of 34383 lines. Shouldn't each line in the input files count as one new request?
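
One way to do that comparison, sketched with hypothetical paths (the input files are gzip-compressed; use zcat for the output files too if they are compressed):

# total volunteer.html requests across the raw input files
zcat in/*/*.gz | grep -c volunteer.html
# the same count across the sanitized output files
find out/ -type f -exec cat {} + | grep -c volunteer.html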

comment:13 Changed 8 years ago by karsten

Did you compare log lines with a date at least 4--5 days in the past?

Also, 404's are discarded in the sanitizing process. You'll have to ignore these lines in your comparison, too.

There may be more differences between input and output files that I'm not aware of right now.

comment:14 Changed 8 years ago by phobos

Why not just copy and use the raw log files?

If it's sensitive, we shouldn't be recording it at all. If it's sensitive, it's open to subpoena/theft/leaking and we shouldn't have the data at all.

comment:15 in reply to:  14 Changed 8 years ago by karsten

Replying to phobos:

Why not just copy and use the raw log files?

If it's sensitive, we shouldn't be recording it at all. If it's sensitive, it's open to subpoena/theft/leaking and we shouldn't have the data at all.

It feels like we had this discussion a few times. I strongly suggest we don't use the raw log files directly. The fact that Runa has difficulties setting up webstats and I have hardly any time to help her with it shouldn't make us use raw log files.

comment:16 in reply to:  13 Changed 8 years ago by runa

Replying to karsten:

Did you compare log lines with a date at least 4--5 days in the past?

The logs I have are Sept 30 -- Oct 09.

Also, 404's are discarded in the sanitizing process. You'll have to ignore these lines in your comparison, too.

I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.

There may be more differences between input and output files that I'm not aware of right now.

Can you think of anything that would explain why I am seeing more lines with /css/master.css in the sanitized files? It would be great if you could document the differences between input and output when you have more time.

comment:17 in reply to:  14 Changed 8 years ago by runa

Replying to phobos:

Why not just copy and use the raw log files?

If it's sensitive, we shouldn't be recording it at all. If it's sensitive, it's open to subpoena/theft/leaking and we shouldn't have the data at all.

We had this discussion a year ago. If you want to change our Apache logging format to match what Karsten's sanitization script outputs, then sure, we can include the raw logs. We will still need to sanitize logs for 2011 and 2012, though.

comment:18 Changed 8 years ago by karsten

Replying to runa:

Replying to karsten:

Did you compare log lines with a date at least 4--5 days in the past?

The logs I have are Sept 30 -- Oct 09.

Can you encrypt and upload those logs somewhere for me?

Also, 404's are discarded in the sanitizing process. You'll have to ignore these lines in your comparison, too.

I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.

There may be more differences between input and output files that I'm not aware of right now.

Can you think of anything that would explain why I am seeing more lines with /css/master.css in the sanitized files?

Not yet.

It would be great if you could document the differences between input and output when you have more time.

Will do. That's actually a TODO in the Java file, but I never got around to it.

comment:19 in reply to:  18 ; Changed 8 years ago by karsten

Replying to karsten:

Replying to runa:

I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.

I ran webstats on the files you gave me and got 714379 lines containing "GET /css/master.css" in the input files and 714378 such lines in the output files. That looks normal to me. Can you check again that you have more lines in the input files than in the output files?

comment:20 in reply to:  19 ; Changed 8 years ago by runa

Replying to karsten:

Replying to karsten:

Replying to runa:

I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.

I ran webstats on the files you gave me and got 714379 lines containing "GET /css/master.css" in the input files and 714378 such lines in the output files. That looks normal to me. Can you check again that you have more lines in the input files than in the output files?

I get the same numbers as you. Turns out I was counting things wrong. Now that we have the output that we want, can weasel copy logs to stenodon? :)

comment:21 in reply to:  20 Changed 8 years ago by karsten

Replying to runa:

I get the same numbers as you. Turns out I was counting things wrong. Now that we have the output that we want, can weasel copy logs to stenodon? :)

Sure, sounds fine! (If that question was addressed to me.)

comment:22 in reply to:  18 Changed 8 years ago by karsten

Status: new → needs_review

Replying to karsten:

Replying to runa:

It would be great if you could document the differences between input and output when you have more time.

Will do. That's actually a TODO in the Java file, but I never got around to it.

Done. Please merge branch task-6196 from my public repository.

comment:23 Changed 8 years ago by weasel

There is now an /srv/webstats.tpo/incoming on stenodon.
You probably want to consider this read-only for the webstats user; cronjobs push and remove stuff there with rsync.
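
As a rough sketch of such a push cronjob on one of the web servers (host name, account, and paths are assumptions, and the real setup goes through the restricted ssh-wrap script mentioned below):

#!/bin/sh
# daily cronjob: push rotated Apache logs into stenodon's incoming area,
# which the webstats user only ever reads
rsync -az /var/log/apache2/*-access.log-*.gz \
    webstats@stenodon.torproject.org:/srv/webstats.tpo/incoming/$(hostname)/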

comment:24 Changed 8 years ago by runa

I have updated /home/webstats/bin/ssh-wrap to put files in /srv/webstats.torproject.org/home/webstats/in. weasel says we should not remove any of the logs in in/ even after processing them. Karsten, can you please update your script to not remove these logs?

comment:25 Changed 8 years ago by karsten

Please see branch task-6196-2 in my public repository. Most important changes are:

  1. Files in in/ are no longer deleted after parsing them.
  2. New files in out/ are automatically .gz-compressed.

If you like these changes, please merge them into master.

I suspended the cronjob on stenodon. Once I have your okay, I'll switch to this branch and re-run it on the data that's now in the in/ directory. I expect that to work out fine and not run into problems with disk space, now that we're compressing output files on the fly.
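
A quick sanity check once the branch is live, with paths assumed from the directories mentioned earlier in this ticket:

# confirm the sanitizer now writes gzip-compressed files into out/
find /home/webstats/webstats/out -name '*.gz' | head
# spot-check that a compressed file still contains valid log lines
zcat "$(find /home/webstats/webstats/out -name '*.gz' | head -n 1)" | head -n 3
# keep an eye on free space while the backlog is processed
df -h /srv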

comment:26 Changed 8 years ago by runa

What is the filename of a .gz-compressed log once it's available in out/? The reason for asking is that the Webalizer configuration file has to specify the full path and filename before Webalizer is able to parse and import it, so *.gz won't work.

comment:27 in reply to:  26 ; Changed 8 years ago by karsten

Replying to runa:

What is the filename of a .gz-compressed log once it's available in out/? The reason for asking is that the Webalizer configuration file has to specify the full path and filename before Webalizer is able to parse and import it, so *.gz won't work.

Here's an example: out/2012/10/31/metrics.torproject.org-access.log.gz
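
A sketch of how the Webalizer side could pick up such a file, assuming one configuration file per virtual host and a Webalizer build with gzip support (command and paths are illustrative, not the actual scripts):

#!/bin/sh
# process yesterday's sanitized log for one virtual host
day=$(date -d yesterday +%Y/%m/%d)
log=/home/webstats/webstats/out/$day/metrics.torproject.org-access.log.gz
[ -f "$log" ] && webalizer -c /etc/webalizer/metrics.torproject.org.conf "$log"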

comment:28 in reply to:  27 ; Changed 8 years ago by runa

Replying to karsten:

Replying to runa:

What is the filename of a .gz-compressed log once it's available in out/? The reason for asking is that the Webalizer configuration file has to specify the full path and filename before Webalizer is able to parse and import it, so *.gz won't work.

Here's an example: out/2012/10/31/metrics.torproject.org-access.log.gz

Oh, great, that makes everything really easy. I'll update the scripts now and let you know when it's safe to turn the cronjob back on.

By the way, /home/webstats/in/aroides/media.torproject.org-access.log-20121101.gz contains info from Oct 31st. Bug or by design?

comment:29 Changed 8 years ago by runa

Ok, please turn the cronjob back on. Should I include logs for onionoo at this point or continue to ignore them?

comment:30 in reply to:  28 Changed 8 years ago by karsten

Replying to runa:

By the way, /home/webstats/in/aroides/media.torproject.org-access.log-20121101.gz contains info from Oct 31st. Bug or by design?

That's an input file that was created on November 1st. It's fine to have lines from October 31st in it.

comment:31 in reply to:  29 ; Changed 8 years ago by karsten

Status: needs_review → new

Replying to runa:

Ok, please turn the cronjob back on.

The disk is already full with input files. We'll need about as much free disk space as there are input files, ideally twice that amount. Is there a reason why /home/webstats/webstats/in/ is a copy of /srv/webstats.torproject.org/incoming/ instead of a symbolic link? Can we change that and delete the files in /home/webstats/webstats/in/?

Should I include logs for onionoo at this point or continue to ignore them?

Recent logs should be fine to include, but logs from a few months ago might still have the old URL format that we can't sanitize very well.

comment:32 in reply to:  31 ; Changed 8 years ago by runa

Replying to karsten:

Replying to runa:

Ok, please turn the cronjob back on.

The disk is already full with input files. We'll need about as much free disk space as there are input files, better twice that amount. Is there a reason why /home/webstats/webstats/in/ is a copy of /srv/webstats.torproject.org/incoming/ instead of a symbolic link? Can we change that and delete the files in /home/webstats/webstats/in/?

I first configured the ssh-wrap script to put files in /home/webstats/webstats/in/, and then changed it back to /srv/webstats.torproject.org/incoming/ without cleaning up. I believe it is safe to delete the files in /home/webstats/webstats/in/ and create a symbolic link instead.

Should I include logs for onionoo at this point or continue to ignore them?

Recent logs should be fine to include, but logs from a few months ago might still have the old URL format that we can't sanitize very well.

Ok, I updated the scripts to include onionoo from now on.

comment:33 in reply to:  32 Changed 8 years ago by karsten

Replying to runa:

Replying to karsten:

Replying to runa:

Ok, please turn the cronjob back on.

The disk is already full with input files. We'll need about as much free disk space as there are input files, better twice that amount. Is there a reason why /home/webstats/webstats/in/ is a copy of /srv/webstats.torproject.org/incoming/ instead of a symbolic link? Can we change that and delete the files in /home/webstats/webstats/in/?

I first configured the ssh-wrap script to put files in /home/webstats/webstats/in/, and then changed it back to /srv/webstats.torproject.org/incoming/ without cleaning up. I believe it is safe to delete the files in /home/webstats/webstats/in/ and create a symbolic link instead.

I just looked, and it's not safe to delete the files in /home/webstats/webstats/in/, because yatei still rsyncs its files there, not to /srv/webstats.torproject.org/incoming/. (I can't fix this right now, because I'm doing too many things at once. Just leaving this note here. Should be an easy fix though.)
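
Once yatei's cronjob has been repointed at the incoming directory, the fix described here would roughly amount to the following on stenodon (paths taken from earlier comments; the order matters so that no freshly pushed files are lost):

# 1. on yatei: change the rsync destination from /home/webstats/webstats/in/
#    to /srv/webstats.torproject.org/incoming/
# 2. on stenodon: drop the stale copy and symlink it to the shared incoming area
rm -rf /home/webstats/webstats/in
ln -s /srv/webstats.torproject.org/incoming /home/webstats/webstats/in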

comment:34 Changed 8 years ago by runa

What's the status here? I haven't seen https://webstats.torproject.org/webalizer/ update since November 7th.

comment:35 Changed 8 years ago by karsten

I can reactivate the cronjob whenever you're ready. I hope that sanitized logs compress better than the originals, because there are only 3.4G left on /srv with 4.0G original logs to be processed. Shall I start the processing?

comment:36 Changed 8 years ago by weasel

Resolution: fixed
Status: new → closed

We copy stuff to stenodon now. What you do with it is maybe something that doesn't involve torproject-admin.

comment:37 Changed 8 years ago by runa

Component: Tor Sysadmin Team → Metrics Website
Resolution: fixed
Status: closed → reopened

Please reactivate the cronjob and we'll see how it goes.

comment:38 Changed 8 years ago by runa

Owner: set to karsten
Status: reopened → assigned

comment:39 Changed 8 years ago by karsten

It's been running for 18 hours now, chewing on months or even years of data. I can't monitor the current execution in the next week or two, but I'll keep it running. I started the job manually and did not reactivate the cronjob, because I first want to see the results. This is really a lot of data.

comment:40 Changed 8 years ago by karsten

The first job took 28 hours and succeeded. I just ran another job to process logs from the past week and re-enabled the cronjob. I think that webstats correctly writes sanitized log files to out/, but I don't know what your scripts are doing to those files afterwards. Please check the output of your scripts.

comment:41 Changed 7 years ago by karsten

Component: Metrics Website → Website
Owner: changed from karsten to phobos

I'm not maintaining webstats on stenodon anymore, and I'm not sure who does, or rather if anyone does. Reassigning to phobos in case he knows or has plans for this.

comment:42 Changed 7 years ago by phobos

It appears that webstats stopped updating in June 2013. I'm fine with either finding an owner, tearing it all down, or starting over.

comment:43 Changed 7 years ago by phobos

I suggest tearing it all down.

comment:44 in reply to:  43 Changed 7 years ago by karsten

Replying to phobos:

I suggest tearing it all down.

Just to be clear: tearing it all down means stopping all cronjobs copying weblogs to stenodon and then shutting down stenodon, right? Sounds good to me. But are you sure?

We'll probably want to create a new ticket for this and include weasel.

comment:45 Changed 7 years ago by phobos

Right. They haven't been working since June and no one has noticed or cared. We need to maintain fewer things, so let's tear it down.

comment:46 Changed 7 years ago by karsten

Resolution: invalid
Status: assigned → closed

Created #10038 for this. Closing.
