2017-01: Add a visualization of Tor Browser downloads to Metrics. (Sponsor X 6.3. Metrics, part 1)
2017-01: Finalize visualizations of Tor Browser downloads on Metrics based on user feedback. (Sponsor X 6.3. Metrics, part 2)
We can re-use existing work from #20008 (moved) here where we defined a database schema and wrote an importer.
And we already have two prototypes (1 and 2) based on RStudio's Shiny, which we can't use here but can draw on for inspiration. To be clear, what we do here will be less shiny, because we're stuck with our not-very-interactive graphing engine. But we cannot solve that now.
I'll attach early results in a second.
I'm attaching a graph showing downloads and updates between January 10 and 13, 2017:
I believe that the fonts will be a bit nicer after deploying this on the metrics server. This is related to slightly different R versions. But that's not what this is about. This is just to show a sample of what kind of graph to expect here.
Please review my branch task-21236 which contains a new data-processing module "webstats" and a new graph "Tor Browser downloads and updates". That code produced the .csv and .png files above. It also adds a graph page to Tor Metrics.
One known issue that I didn't find time to fix yet is that locale detection is not always as precise as it could be. Example from the .csv file: 2017-01-10,tmup,o,,0.3.0b1-0.3.0b2-linux32-en-US.xml,,3
Oh, and of course that code still contains an if block containing the line "// TODO take me out after testing", which we really should take out before merging. But it's useful for testing.
Ah, and there are no tests yet, but the code is in a relatively testable state. We could write tests similar to the ones for the "connbidirect" module.
A 'meta' question:
Shouldn't we use milestones for all Metrics Web and Atlas tickets related to Sponsor X? These will come in handy when writing the report in June.
As there are no releases yet, maybe milestones metrics-web-X16.17 and Atlas-X16.17?
Regarding your meta question, I was thinking of using the monthly team reports as input for the final Sponsor X report. And this ticket would certainly go into a monthly report as soon as the graph is ready. I agree that this wouldn't apply to all Atlas tickets, so my (secret) plan was to run a Trac query in June/July showing all tickets closed in the Sponsor X time frame and consider which to include in the report. I think that should cover most of what we did, and it takes less effort than manually tagging tickets now. What do you think, would that work?
And thanks in advance for looking at this ticket today. :)
The restructuring task (#19754 (moved)) hasn't begun yet, but new modules should still improve the situation a little.
tl;dr: Please see my branch with some suggestions.
It uses build-base.xml and adapts to our general project structure a little more:
the test task is now part of build-base.xml;
I added a testrun task and command-line parameters for integration testing;
there is a database module, which might make testing easier and might be re-usable when doing #19754 (moved);
the methods in Main are now accessible for unit tests.
Please reuse and maybe add some simple tests, if there is time?
Thanks for the quick response! And thanks a lot for making those build improvements.
I made some tweaks to your commits and pushed them to my task-21236-2 branch. I'd like to squash that commit and yours into my initial commit 597e715 at the end, if you don't mind. Unless there's a reason to keep them separate, in which case, please let me know.
But before we squash or don't squash anything, let's talk about the Database class. I like the idea of encapsulating database access---in theory. I even had a class similar to yours a day or two ago. But I took it out again, because I felt it only added complexity for no particular gain. Let me elaborate.
The Main class still imports types from java.sql, meaning that the Database class does not hide that we're talking to an SQL database via JDBC. And the methods in Database all contain just a single line of code or two, meaning that we're not really hiding much complexity. The reason is that JDBC is already an interface hiding away all the complexity that comes with talking to an SQL database.
So, let's talk about the purpose of having this separate Database class. I assume the main reason for having it is testing. Is it going to be easier to test the Database class than to test the single methods in modules that are related to database access? And maybe even more importantly, is it easier to mock the Database class than to provide a dummy database via a custom JDBC string and just test the methods contained in modules that way? If the answers are "yes and yes", would you mind writing a simple test for the new Database class, just to get us started?
Otherwise, would you mind if we postpone this change and keep all database methods in the Main class for now? It's easy to extract database functionality (and downloading functionality or .csv-file writing functionality) as part of the restructuring. Don't get me wrong, I'm a supporter of encapsulation if something is fully encapsulated and for a good reason. But I'm also a fan of self-contained and therefore often more readable code. (For example, do you think we'll remember in a year when we write the next module that we disable auto-commit when connecting to the database?!)
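For illustration, here is roughly what the mocking question boils down to. All names below are hypothetical and don't match the actual branch; the sketch only shows that if database access sat behind a narrow interface, a unit test could substitute an in-memory fake instead of needing a real Postgres instance:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical narrow interface; the real Database class in the
// branch wraps JDBC calls directly instead.
interface LogStore {
  void insertRequest(String resource, long count);
  long totalRequests();
}

// In-memory fake for tests: no JDBC, no Postgres, trivially inspectable.
class FakeLogStore implements LogStore {
  final List<String> resources = new ArrayList<>();
  long total = 0;
  public void insertRequest(String resource, long count) {
    resources.add(resource);
    total += count;
  }
  public long totalRequests() {
    return total;
  }
}

public class DatabaseMockSketch {
  // Code under test only sees the interface, so it runs unchanged
  // against the fake in a unit test or against a JDBC-backed
  // implementation in production.
  static void importParsedLine(LogStore store, String resource, long count) {
    store.insertRequest(resource, count);
  }

  public static void main(String[] args) {
    FakeLogStore fake = new FakeLogStore();
    importParsedLine(fake, "/dist/torbrowser.tar.xz", 3);
    importParsedLine(fake, "/dist/torbrowser.tar.xz", 2);
    System.out.println(fake.totalRequests()); // prints 5
  }
}
```

Whether that indirection is worth it before the restructuring is exactly the open question above.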
And here are some minor points that need discussion:
If we keep the Database class, we should either move it to the org.torproject.metrics.webstats package, or start a shared/src/ directory for code that is supposed to be shared between modules. My favorite would be to start with the former and consider doing the latter once we have a second module using it.
Should the arguments --filter and --db be switched around, so that people can specify their own database but omit the filter? Alternatively, we might want to say more clearly how to pass an argument for --filter that doesn't filter anything.
I didn't run this code yet. Does writeStatistics() add a newline after writing the header?
As you point out, we should write tests. I'm happy to do that, though I'm afraid the most interesting parts require mocking either the web server or the database. If you have suggestions for doing that, please let me know. Otherwise I'd start with some simple tests for parseLogLines() or writeStatistics().
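As a starting point, here is a plain-Java sketch of what such a simple test could exercise. The pattern and method below are stand-ins I made up for illustration; the real parseLogLines() and its matchers in the branch differ:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseSketch {
  // Stand-in pattern for a sanitized access-log line; the real
  // webstats matcher is more involved.
  static final Pattern LOG_LINE = Pattern.compile(
      "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"GET (\\S+) [^\"]*\" (\\d{3}) .*");

  // Returns the requested resource if the line is a successful GET,
  // or null otherwise.
  static String parseResource(String line) {
    Matcher m = LOG_LINE.matcher(line);
    if (m.matches() && "200".equals(m.group(3))) {
      return m.group(2);
    }
    return null;
  }

  public static void main(String[] args) {
    String line = "0.0.0.0 - - [10/Jan/2017:00:00:00 +0000] "
        + "\"GET /dist/torbrowser/6.0.8/tor-browser-linux32-6.0.8_en-US.tar.xz"
        + " HTTP/1.1\" 200 1234";
    // Prints the extracted resource path for a well-formed line.
    System.out.println(parseResource(line));
  }
}
```

A test for methods like this needs no mocking at all, which is why starting there seems cheapest.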
And here's another remaining issue from above, listed for completeness:
One known issue that I didn't find time to fix yet: locale detection is not always as precise as it could be. Example from the .csv file: 2017-01-10,tmup,o,,0.3.0b1-0.3.0b2-linux32-en-US.xml,,3
Introducing the Database class didn't cause any overwhelming changes, because Main is essentially functional programming. Encapsulating the database methods is a tiny first step here to unclutter Main for further refactoring. And, as you pointed out, for testing.
As metrics-web is a 'living' project, I'd prefer these tiny steps that can easily be verified not to wreak havoc (until there are more tests :-)
(What is needed next are data objects. Off the top of my head, I'd say that with these data objects it is possible to abstract/encapsulate/reuse database persistence, file generation, and also R access to some extent (there is some Java integration for R). The actual redesign should of course be done by considering all modules together in depth.)
Moving Database.java to shared/src is fine (thanks for pointing that out).
(A more radical idea: all Java code could be put into one source folder in the base project already. This would reduce complexity and prepare for adaptation to metrics-base.)
Command line args:
My suggestion was ad hoc, aimed at replacing the "TODO: remove after tests" approach with something lasting, because what that remove-after-tests code facilitates is useful. The command-line args were intended for testing, not at all as a user convenience. Thus, to bake this in more, there are two options:
change the usage note to state that these args are only for testing, or
add code to allow passing the database URL as a single arg, too.
Tests:
The simple tests for parseLogLines() (or only the matchers used?) and writeStatistics() are sufficient and needed for later refactoring. (You're probably right that my code suggestion would miss a newline after the header in csv.)
I think the most pragmatic and efficient way will be to pick code that produces vital output for writing the tests that can be re-used or simply adapted during refactoring. More complicated test coverage should be tackled after/while refactoring (when need arises).
As usual, I'd love to write more code here, but currently it competes for time with my non-code writing tasks.
Regarding graph and web-site text:
The graph and data description should maybe mention that the logs used were cleaned to avoid any privacy issues, and also link to the logs? Otherwise people seeing these graphs might wonder how the particular data were gathered and worry about privacy.
(SQL queries: as there are so many pattern-matching queries, and in case these take really long, an operator class (varchar_pattern_ops, see the PostgreSQL manual), e.g. on resource_string, could improve the situation. Maybe useful once there are more logs to 'crunch'.)
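For reference, such an index would look roughly like this. The index name is made up and the column is assumed from the schema discussed above; untested:

```sql
-- Hypothetical: lets Postgres use the index for LIKE 'prefix%'
-- matching on resource strings even in non-C locales; see the
-- PostgreSQL manual chapter on operator classes.
CREATE INDEX resources_resource_string_pattern
    ON resources (resource_string varchar_pattern_ops);
```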
> I'm attaching a graph showing downloads and updates between January 10 and 13, 2017:
> I believe that the fonts will be a bit nicer after deploying this on the metrics server. This is related to slightly different R versions. But that's not what this is about. This is just to show a sample of what kind of graph to expect here.
Looks good to me. Personally, I would not worry so much about the missing shininess as long as the graphs get the numbers across, and that's what the one you attached does, IMO. The results still confuse me (like: we have 100,000 new downloads each day, but that does not show up in the update pings, which basically stay the same? We still have more than 120,000 update requests every day, almost 6 weeks after the last release?). But that is probably another story. :)
@gk: Thanks for looking! I can't say much about those possibly confusing results. But I'm pretty sure that these particular questions (downloads not showing in update pings, lots of update requests weeks after the last release) will only be the first of a series of questions that we'll have to look into. I'd say let's deploy what we have now and discuss those questions when we have more data. They'll either turn into bugfixes or frequently asked questions. :)
@iwakeh: Thanks for all your input above! I incorporated some of your suggestions, discarded others (see below), and made a few more changes while running tests and reviewing my own code once more. Here's a summary of the changes:
After some back and forth I decided to postpone the Database class change. It needs more work before it's actually useful, and I don't want to share code between modules that's not ready yet. Let's revisit this topic when we have enough time to fully encapsulate database functionality so that we can easily mock it.
I also took out the part where we're parsing command-line arguments and put in jdbc:postgresql:webstats as database connection string. I think that we can expect operators to make sure that the local user who runs the java process has permission to connect to that database without telling us the password. And we can expect contributors to be able to edit the code and skip log files when developing.
I also tweaked the description a bit to avoid immediate concerns about privacy. Not that this would eliminate all concerns, but hey.
Optimizing SQL queries might indeed be useful, though it's not an issue yet. It probably helps that we have a separate resources table with unique resource strings. We should keep this in mind though if module performance deteriorates over time. (Such a thing has happened to other modules in the past, see https://lists.torproject.org/pipermail/tor-dev/2016-January/010139.html.)
I think I got the locale detection issues under control by tweaking regular expressions.
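To illustrate the kind of tweak involved (the pattern, class, and method names here are made up; the actual expressions in the branch differ): the locale has to be captured as the last dash-separated component before the suffix, so that version strings like 0.3.0b1-0.3.0b2 don't confuse the match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocaleSketch {
  // Hypothetical: the locale is a component like "en-US" or "de"
  // right before the .xml suffix of an update-ping resource string.
  static final Pattern LOCALE = Pattern.compile(
      ".*-([a-z]{2}(?:-[A-Z]{2})?)\\.xml$");

  // Returns the locale part of the resource string, or null if the
  // string doesn't end in a recognizable locale.
  static String extractLocale(String resource) {
    Matcher m = LOCALE.matcher(resource);
    return m.matches() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    // prints en-US
    System.out.println(extractLocale("0.3.0b1-0.3.0b2-linux32-en-US.xml"));
  }
}
```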
There, I think that addresses all suggestions and open issues. Please take a look at the fixup commit in my task-21236-2 branch, and please let me know whether that needs more work. After that, I'd want to squash, rebase, merge, deploy, and announce. Thanks!
Good to have some tests for a start!
Wondering if a test should make explicit that the ending really doesn't matter, i.e., that SAMPLE_LOG_FILE_NAME could be almost anything and the tests would still pass?
Or rather change the matcher to only choose .xz files?
The result sets etc. could all be ensured to be closed by using try-with-resources (in most cases w/o any catch).
The extended explanation will be useful, either by preventing questions or by being the place to point to when questions come up.
In general, metrics-web needs quite some work still. But that'll be other tickets.
> Good to have some tests for a start!
> Wondering if a test should make explicit that the ending really doesn't matter, i.e., that SAMPLE_LOG_FILE_NAME could be almost anything and the tests would still pass? Or rather change the matcher to only choose .xz files?
Changing the matcher wouldn't work, because we're using it for .xz files as well as subdirectories. Added a test to point this out.
> The result sets etc. could all be ensured to be closed by using try-with-resources (in most cases w/o any catch).
Changed them all.
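For reference, this is the shape of the pattern now used, as a generic sketch rather than the actual webstats code: resources declared in the try header are closed automatically in reverse declaration order, even when an exception is thrown, and no catch block is needed if exceptions simply propagate:

```java
import java.util.ArrayList;
import java.util.List;

public class TryWithResourcesSketch {
  // Toy AutoCloseable standing in for Statement/ResultSet.
  static class Tracked implements AutoCloseable {
    final String name;
    final List<String> closed;
    Tracked(String name, List<String> closed) {
      this.name = name;
      this.closed = closed;
    }
    public void close() {
      closed.add(name);
    }
  }

  static List<String> run() {
    List<String> closed = new ArrayList<>();
    // In the real code this is e.g.:
    //   try (Statement st = connection.createStatement();
    //        ResultSet rs = st.executeQuery(...)) { ... }
    try (Tracked st = new Tracked("statement", closed);
         Tracked rs = new Tracked("resultset", closed)) {
      // work with st and rs
    }
    return closed;
  }

  public static void main(String[] args) {
    // prints [resultset, statement]: last declared is closed first
    System.out.println(run());
  }
}
```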
> The extended explanation will be useful, either by preventing questions or by being the place to point to when questions come up.
Hopefully!
> In general, metrics-web needs quite some work still. But that'll be other tickets.
Absolutely! Let's keep this in mind for the next funding proposal.
I just ran into this bug while importing logs on the Metrics server:
[java] 18:58:10.454 [main] DEBUG org.torproject.metrics.webstats.Main - Importing log file https://webstats.torproject.org/out/archeotrichon.torproject.org/dist.torproject.org-access.log-20151219.xz.
[java] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
[java]     at java.util.Arrays.copyOfRange(Arrays.java:2694)
[java]     at java.lang.String.<init>(String.java:203)
[java]     at java.io.BufferedReader.readLine(BufferedReader.java:349)
[java]     at java.io.BufferedReader.readLine(BufferedReader.java:382)
[java]     at org.torproject.metrics.webstats.Main.downloadLogFile(Main.java:219)
[java]     at org.torproject.metrics.webstats.Main.importLogFiles(Main.java:175)
[java]     at org.torproject.metrics.webstats.Main.main(Main.java:94)
[java] Java Result: 1
That log file is 282K compressed and 1.0G uncompressed. Looks like we'll have to try harder not to keep the whole file in memory at once. I'm too tired to look into this today, but I'll try to fix it tomorrow, in particular by looking at how we solved this in the ad-hoc analysis of webstats in #20008 (moved).
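The fix will presumably amount to aggregating per line instead of accumulating the whole decompressed file in one String. A minimal sketch of that shape (names and the field layout are made up; the real code additionally wraps the download in an XZ decompressor stream, e.g. from Commons Compress, which I'm leaving out here):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

public class StreamingImportSketch {
  // Counts requests per resource while reading one line at a time,
  // so memory use is bounded by the size of the result map rather
  // than by the (possibly 1.0G) decompressed file.
  static Map<String, Long> countResources(Reader logReader)
      throws IOException {
    Map<String, Long> counts = new TreeMap<>();
    try (BufferedReader br = new BufferedReader(logReader)) {
      String line;
      while ((line = br.readLine()) != null) {
        // Hypothetical: second whitespace-separated field is the resource.
        String[] parts = line.split(" ");
        if (parts.length > 1) {
          counts.merge(parts[1], 1L, Long::sum);
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) throws IOException {
    String sample = "GET /a.xz\nGET /b.xz\nGET /a.xz\n";
    // prints {/a.xz=2, /b.xz=1}
    System.out.println(countResources(new StringReader(sample)));
  }
}
```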