As part of #4859 (moved), we want to sanitize and process logs for more torproject.org domains. Section 1 of the README lists the steps to prepare the source and destination hosts and copy the log files over to stenodon.
In relation to #7004 (moved), having reporting that is current and updated daily would be great. I'd like to know what our traffic actually is, as it seems high to me.
Peter pointed out that we currently have four machines that are www.torproject.org, and asked what's needed to get them all counted on webstats.tpo. I'm not entirely sure how that part of the process works. Here's what Karsten said:
I'm not entirely sure, either. I think you only need to copy the original logs to 1 subdirectory per physical host, e.g., in/vescum/*, and files should only contain the virtual host in their file name, not the physical host, e.g., www.torproject.org-access.log-20111224.gz. So that would be in/vescum/www.torproject.org-access.log-20111224.gz. But I'm not 100% sure.
Can you try out the following?
Create a new webstats instance, for example on your laptop (don't test this on the live system).
Copy vescum's access.log files to in/vescum/www.torproject.org-access.log-20111224.gz, majus' access.log files to in/majus/www.torproject.org-access.log-20111224.gz, etc. Do that for a week's worth of data that is at least 4 or 5 days old; e.g., Sep 17--23.
Run webstats and see if there's just a single file in out/ for www.torproject.org that contains requests from all input files.
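The layout Karsten describes can be sketched like this: one subdirectory per physical host, file names carrying only the virtual host. The host names (vescum, majus) and dates are from this ticket; everything else is a local stand-in, not the live system.

```shell
# Stand-in for the local test setup: one in/<physical-host>/ directory per
# machine, file names containing only the virtual host name.
mkdir -p in/vescum in/majus out

# One week of data, Sep 17--23, as suggested above (empty placeholder files
# here; on a real test these would be the actual rotated access logs):
for day in 17 18 19 20 21 22 23; do
  touch "in/vescum/www.torproject.org-access.log-201209${day}.gz"
  touch "in/majus/www.torproject.org-access.log-201209${day}.gz"
done

ls in/vescum
```

After running webstats over this, the check is simply whether out/ ends up with a single www.torproject.org file combining both hosts' input.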
Hm, how do I run this? I tried java src/org/torproject/webstats/Main.java with in/ and out/ in ., but got "Could not find the main class: src/org/torproject/webstats/Main.java. Program will exit".
webstats will output a single file in out/ for www.torproject.org, but I am not confident that it contains information from all input files. As a quick test, I counted the number of lines with "volunteer.html" in both the output and input files; the files in out/ have a total of 16812 lines, while the input files have a total of 34383 lines. Shouldn't each line in the input files count as one new request?
If it's sensitive, we shouldn't be recording it at all. If it's sensitive, it's open to subpoena/theft/leaking and we shouldn't have the data at all.
It feels like we've had this discussion a few times already. I strongly suggest we don't use the raw log files directly. The fact that Runa has difficulties setting up webstats and I have hardly any time to help her with it shouldn't make us use raw log files.
Did you compare log lines with a date at least 4--5 days in the past?
The logs I have are Sept 30 -- Oct 09.
Also, 404's are discarded in the sanitizing process. You'll have to ignore these lines in your comparison, too.
I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.
There may be more differences between input and output files that I'm not aware of right now.
Can you think of anything that would explain why I am seeing more lines with /css/master.css in the sanitized files? It would be great if you could document the differences between input and output when you have more time.
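Since the sanitizer drops 404s, a fair comparison has to exclude them from the raw side too. A sketch of that count on a tiny fabricated log in combined log format (host and file names are illustrative):

```shell
# Fabricate a small gzipped raw log with two /css/master.css hits,
# one of which is a 404 that the sanitizer would discard:
mkdir -p in/vescum
printf '%s\n' \
  '0.0.0.0 - - [31/Oct/2012:00:00:00 +0000] "GET /css/master.css HTTP/1.1" 200 100' \
  '0.0.0.0 - - [31/Oct/2012:00:00:01 +0000] "GET /css/master.css HTTP/1.1" 404 100' \
  '0.0.0.0 - - [31/Oct/2012:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 100' \
  | gzip > in/vescum/www.torproject.org-access.log-20121031.gz

# Count /css/master.css requests, ignoring 404 responses:
zcat in/vescum/*.gz | grep 'GET /css/master.css' | grep -cv '" 404 '
# → 1
```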
If it's sensitive, we shouldn't be recording it at all. If it's sensitive, it's open to subpoena/theft/leaking and we shouldn't have the data at all.
We had this discussion a year ago. If you want to change our Apache logging format to match what Karsten's sanitization script outputs, then sure, we can include the raw logs. We will still need to sanitize the logs for 2011 and 2012, though.
Did you compare log lines with a date at least 4--5 days in the past?
The logs I have are Sept 30 -- Oct 09.
Can you encrypt and upload those logs somewhere for me?
Also, 404's are discarded in the sanitizing process. You'll have to ignore these lines in your comparison, too.
I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.
There may be more differences between input and output files that I'm not aware of right now.
Can you think of anything that would explain why I am seeing more lines with /css/master.css in the sanitized files?
Not yet.
It would be great if you could document the differences between input and output when you have more time.
Will do. That's actually a TODO in the Java file, but I never got around to it.
I tried counting lines with /css/master.css (using grep "GET /css/master.css"). I get a total of 735,140 lines in the sanitized files and 714,379 in the non-sanitized files.
I ran webstats on the files you gave me and got 714379 lines containing "GET /css/master.css" in the input files and 714378 such lines in the output files. That looks normal to me. Can you check again that you have more lines in the input files than in the output files?
I get the same numbers as you. Turns out I was counting things wrong. Now that we have the output that we want, can weasel copy logs to stenodon? :)
It would be great if you could document the differences between input and output when you have more time.
Will do. That's actually a TODO in the Java file, but I never got around to it.
Done. Please merge branch task-6196 from my public repository.
There is now an /srv/webstats.tpo/incoming on stenodon.
You probably want to consider this read-only for the webstats user; cronjobs push and remove stuff there with rsync.
I have updated /home/webstats/bin/ssh-wrap to put files in /srv/webstats.torproject.org/home/webstats/in. weasel says we should not remove any of the logs in in/ even after processing them. Karsten, can you please update your script to not remove these logs?
Please see branch task-6196-2 in my public repository. Most important changes are:
Files in in/ are no longer deleted after parsing them.
New files in out/ are automatically .gz-compressed.
If you like these changes, please merge them into master.
I suspended the cronjob on stenodon. Once I have your okay, I'll switch to this branch and re-run it on the data that's now in the in/ directory. I expect that to work out fine and not run into problems with disk space, now that we're compressing output files on the fly.
What is the filename of a .gz-compressed log once it's available in out/? The reason for asking is that the Webalizer configuration file has to specify the full path and filename before Webalizer is able to parse and import it, so *.gz won't work.
Here's an example: out/2012/10/31/metrics.torproject.org-access.log.gz
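Given that out/YYYY/MM/DD layout, a wrapper can derive yesterday's concrete filename for Webalizer rather than globbing (GNU date's relative date strings; the webalizer invocation and config path are a hypothetical sketch, commented out):

```shell
# Build yesterday's path from the out/YYYY/MM/DD layout shown above:
logfile="out/$(date -d yesterday +%Y/%m/%d)/metrics.torproject.org-access.log.gz"
echo "$logfile"
# webalizer -c /etc/webalizer/metrics.conf "$logfile"
```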
Oh, great, that makes everything really easy. I'll update the scripts now and let you know when it's safe to turn the cronjob back on.
By the way, /home/webstats/in/aroides/media.torproject.org-access.log-20121101.gz contains info from Oct 31st. Bug or by design?
The disk is already full of input files. We'll need about as much free disk space as there are input files, ideally twice that amount. Is there a reason why /home/webstats/webstats/in/ is a copy of /srv/webstats.torproject.org/incoming/ instead of a symbolic link? Can we change that and delete the files in /home/webstats/webstats/in/?
Should I include logs for onionoo at this point or continue to ignore them?
Recent logs should be fine to include, but logs from a few months ago might still have the old URL format that we can't sanitize very well.
I first configured the ssh-wrap script to put files in /home/webstats/webstats/in/, and then changed it back to /srv/webstats.torproject.org/incoming/ without cleaning up. I believe it is safe to delete the files in /home/webstats/webstats/in/ and create a symbolic link instead.
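The swap itself is a two-liner, sketched here against local stand-in directories (the real paths are /home/webstats/webstats/in and /srv/webstats.torproject.org/incoming, and per Karsten's later check it is only safe once nothing still rsyncs into the old location):

```shell
# Local stand-ins for the two directories involved:
mkdir -p home/webstats/webstats/in srv/webstats.torproject.org/incoming

# Drop the copy and point the old path at the canonical incoming directory:
rm -rf home/webstats/webstats/in
ln -s "$PWD/srv/webstats.torproject.org/incoming" home/webstats/webstats/in
```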
Ok, I updated the scripts to include onionoo from now on.
I just looked, and it's not safe to delete the files in /home/webstats/webstats/in/, because yatei still rsyncs its files there, not to /srv/webstats.torproject.org/incoming/. (I can't fix this right now, because I'm doing too many things at once. Just leaving this note here. Should be an easy fix though.)
I can reactivate the cronjob whenever you're ready. I hope that sanitized logs compress better than the originals, because there are only 3.4G left on /srv with 4.0G of original logs to be processed. Shall I start the processing?
It's been running for 18 hours now, chewing on months or even years of data. I can't monitor the current execution in the next week or two, but I'll keep it running. I started the job manually and did not reactivate the cronjob, because I'll first want to see the results. This is really a lot of data.
The first job took 28 hours and succeeded. I just ran another job to process logs from the past week and re-enabled the cronjob. I think that webstats correctly writes sanitized log files to out/, but I don't know what your scripts are doing to those files afterwards. Please check the output of your scripts.
I'm not maintaining webstats on stenodon anymore, and I'm not sure who does, or rather if anyone does. Reassigning to phobos in case he knows or has plans for this.
Trac: Component: Metrics Website to Website Owner: karsten to phobos
Just to be clear: tearing it all down means stopping all cronjobs copying weblogs to stenodon and then shutting down stenodon, right? Sounds good to me. But are you sure?
We'll probably want to create a new ticket for this and include weasel.