Opened 4 months ago

Last modified 2 months ago

#29425 needs_revision enhancement

Write integration tests for data-processing modules

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/Statistics Version:
Severity: Normal Keywords: metrics-roadmap-2019-q2
Cc: metrics-team Actual Points:
Parent ID: Points: 8
Reviewer: irl Sponsor:

Description

We discussed in Brussels that we'll need at least integration tests for metrics-web in order to make code changes like the Java 8 Date/Time API update.

I started working on this. Here's what I did:

  • Pick a small set of descriptors as test data that are sufficient to produce at least something as .csv files.
  • Write a script that runs all data-processing modules.
  • Run the script once to get output that we would expect from future test runs.

The result is too big (IMHO) to add to the Git repository. That's why I uploaded it here:

https://people.torproject.org/~karsten/volatile/metrics-web-integ-tests.tar

shasum -a 256 metrics-web-integ-tests.tar 
728c4e4ee184f2260cd30286f2925aa627d7b3d572a236b646021cc1b461de10  metrics-web-integ-tests.tar

It would be great if somebody else besides me tries this out and verifies that their run produces the same output.

A next good step after that would be to talk about where/how to put this under version control. If that's impossible, we might be able to reduce the test data size a bit more, but maybe not as substantial as we'd want.

Oh, and we could probably fetch libs from Debian rather than shipping them. I didn't bother for now.

Child Tickets

Change History (15)

comment:1 Changed 4 months ago by karsten

Status: newneeds_review

comment:2 Changed 4 months ago by irl

Keywords: metrics-roadmap-2019-q2 added
Points: 8

comment:3 in reply to:  description Changed 4 months ago by karsten

Replying to karsten:

A next good step after that would be to talk about where/how to put this under version control. If that's impossible, we might be able to reduce the test data size a bit more, but maybe not as substantial as we'd want.

I put some more thoughts on this. How about we create a new metrics-test.git repository for this? Testing metrics-web could be the start, and we could easily extend tests to other metrics code bases in the future. To be clear, unit tests would still exist in the respective repositories, this would just be for integration/system tests.

comment:4 Changed 4 months ago by irl

Reviewer: irl

Planned for review party tomorrow.

comment:5 Changed 4 months ago by irl

Status: needs_reviewneeds_revision

git-annex is what I have used in the past for storing large test data in git repositories. Our git server does not currently support this but we could look at adding support. If it requires us to upgrade the gitolite then that is not an easy project, if we are running a new enough version then it could just be some lines in the config.

I looked at running the script:

Cloning metrics-web Git repository into metrics-web/ subdirectory...
Cloning into 'metrics-web'...
remote: Counting objects: 14553, done.
remote: Compressing objects: 100% (1213/1213), done.
remote: Total 14553 (delta 977), reused 503 (delta 233)
Receiving objects: 100% (14553/14553), 16.55 MiB | 2.64 MiB/s, done.
Resolving deltas: 100% (8303/8303), done.
Bootstrapping development environment...
Submodule 'src/build' (https://git.torproject.org/metrics-base.git) registered for path 'src/build'
Submodule 'src/submods/metrics-lib' (https://git.torproject.org/metrics-lib.git) registered for path 'src/submods/metrics-lib'
Cloning into '/tmp/metrics-web-integ-tests/metrics-web/src/build'...
Cloning into '/tmp/metrics-web-integ-tests/metrics-web/src/submods/metrics-lib'...
Submodule path 'src/build': checked out 'e639c697e9e94c6dbb26e946e5247c20a62c0661'
Submodule path 'src/submods/metrics-lib': checked out '23927c2777f273c42ad3e75fc0a2940ed8eb4bf6'
Submodule 'src/build' (https://git.torproject.org/metrics-base) registered for path 'src/build'
Cloning into '/tmp/metrics-web-integ-tests/metrics-web/src/submods/metrics-lib/src/build'...
Submodule path 'src/build': checked out 'e639c697e9e94c6dbb26e946e5247c20a62c0661'
Replacing absolute paths in build.xml with relative paths...
sed: can't read s/.srv.metrics.torproject.org.metrics/./: No such file or directory

I like the concept though. This is the sort of thing we can run in CI or Vagrant to be able to run it in disposable environments.

comment:6 in reply to:  5 Changed 4 months ago by karsten

Status: needs_revisionneeds_review

Replying to irl:

git-annex is what I have used in the past for storing large test data in git repositories. Our git server does not currently support this but we could look at adding support. If it requires us to upgrade the gitolite then that is not an easy project, if we are running a new enough version then it could just be some lines in the config.

Maybe we don't have to do this with a dedicated metrics-test repository, though. I wouldn't want to dump tons of descriptors into the actual code repositories, but if this repository is just for (integration) testing, people will have to fetch the data anyway. Don't feel strongly, though.

I looked at running the script:

Cloning metrics-web Git repository into metrics-web/ subdirectory...
Cloning into 'metrics-web'...
remote: Counting objects: 14553, done.
remote: Compressing objects: 100% (1213/1213), done.
remote: Total 14553 (delta 977), reused 503 (delta 233)
Receiving objects: 100% (14553/14553), 16.55 MiB | 2.64 MiB/s, done.
Resolving deltas: 100% (8303/8303), done.
Bootstrapping development environment...
Submodule 'src/build' (https://git.torproject.org/metrics-base.git) registered for path 'src/build'
Submodule 'src/submods/metrics-lib' (https://git.torproject.org/metrics-lib.git) registered for path 'src/submods/metrics-lib'
Cloning into '/tmp/metrics-web-integ-tests/metrics-web/src/build'...
Cloning into '/tmp/metrics-web-integ-tests/metrics-web/src/submods/metrics-lib'...
Submodule path 'src/build': checked out 'e639c697e9e94c6dbb26e946e5247c20a62c0661'
Submodule path 'src/submods/metrics-lib': checked out '23927c2777f273c42ad3e75fc0a2940ed8eb4bf6'
Submodule 'src/build' (https://git.torproject.org/metrics-base) registered for path 'src/build'
Cloning into '/tmp/metrics-web-integ-tests/metrics-web/src/submods/metrics-lib/src/build'...
Submodule path 'src/build': checked out 'e639c697e9e94c6dbb26e946e5247c20a62c0661'
Replacing absolute paths in build.xml with relative paths...
sed: can't read s/.srv.metrics.torproject.org.metrics/./: No such file or directory

I like the concept though. This is the sort of thing we can run in CI or Vagrant to be able to run it in disposable environments.

Hrmmm, looks like an incompatibility between macOS and Linux sed... Can you edit your script and use this line instead?

sed -i.bak 's/.srv.metrics.torproject.org.metrics/./' build.xml

If this works okay, I'll request a metrics-test repository and push the current script there. Except if you care more strongly about the git-annex part in which case I'd reach out to hiro and see what we can do there.

Thanks!

comment:7 Changed 4 months ago by irl

-Dfile.encoding=UTF8

https://stackoverflow.com/questions/464874/unmappable-character-for-encoding-warning-in-java

We really need to add this to our ant scripts. Running the test in a throwaway env and I had forgotten that I've set this now in my .profile to get around a load of errors.

comment:8 Changed 4 months ago by irl

Status: needs_reviewneeds_revision
[java] Exception in thread "main" org.postgresql.util.PSQLException: The server requested password-based authentication, but no password was provided.

I'm going to need more detailed instructions on how the postgres server should be set up. Some of the modules run OK, others have null pointer exceptions or psqlexceptions as above.

I'm running as root in a throwaway VM with a "root" postgres role created with login and superuser.

comment:9 Changed 4 months ago by karsten

I'm unclear how to reproduce or fix that UTF8 encoding issue. How would I do this?

Regarding the PostgreSQL server setup, the easiest way is to enable trust authentication for users connecting from localhost, so in other words disable authentication for whoever manages to connect to the database. In order to do this, you'll have to edit your pg_hba.conf file in /etc/postgresql/$postgresql_version/main/ by replacing these lines:

local   all             all                                     peer
host    all             all             127.0.0.1/32            md5
host    all             all             ::1/128                 md5

with these lines:

local   all             all                                     trust
host    all             all             127.0.0.1/32            trust
host    all             all             ::1/128                 trust

That's what I'm doing on my development machine, and it might well be that it was brew setting this up as default. I know that this setting is different on Debian, and it's likely going to be different on any production system. Still, for running tests this should be sufficient. And yes, if we require this, we should document it.

comment:10 Changed 4 months ago by karsten

Status: needs_revisionneeds_review

Please give it another try!

comment:11 Changed 4 months ago by karsten

Owner: changed from metrics-team to karsten
Status: needs_reviewaccepted

Taking this out of review after running it on a Linux machine and finding differences. Will try to fix these differences first and will put it back into needs_review afterwards.

comment:12 Changed 3 months ago by notirl

Reviewer: irl

Removing myself as reviewer for now. I'll probably be the reviewer when it comes back around but there is no reason a hypothetical third member of the metrics team couldn't also be a reviewer.

comment:13 Changed 3 months ago by karsten

I finally got it all working! There are still minor differences between macOS and Linux:

  • bandwidth.csv relies on the import order of descriptors, because either the first or last descriptor wins (I did not look at the code in detail). The differences in the output are tiny, but detectable. However, fixing this requires extensive changes to the bwhist module.
  • platforms.csv and userstats-combined.csv rely on sorting order, which is apparently platform-dependent. After sorting these files, they're equivalent. This is unfortunate, but acceptable, because we do not guarantee a specific sorting order. We could fix this by changing SQL statements to something like ORDER BY LOWER(x).

I decided to ignore these differences and expect the output as Linux would produce it.

I also made a temporary decision regarding git-annex vs. adding large files to git by creating this relatively large repository on GitHub for now: https://github.com/kloesing/metrics-test

I'd like to use these integration tests for some more actual metrics-web changes, including the Java 8 Date/Time updates. It's okay for me to just keep it on GitHub without review and later get it reviewed and on Tor's Git. For example, we could do that in April with more time on our hands. Hope that works for you!

comment:14 Changed 3 months ago by karsten

Reviewer: irl
Status: acceptedneeds_review

As discussed today, please give it another try now, so that we can resolve remaining issues with the script. See also my comment above regarding database authentication. Thanks!

comment:15 Changed 2 months ago by irl

Status: needs_reviewneeds_revision
Dropping any existing metrics-web related databases...
NOTICE:  database "userstats" does not exist, skipping
NOTICE:  database "tordir" does not exist, skipping
NOTICE:  database "totalcw" does not exist, skipping
NOTICE:  database "webstats" does not exist, skipping
NOTICE:  database "onionperf" does not exist, skipping
NOTICE:  database "ipv6servers" does not exist, skipping
(Re-)creating databases...
createdb: database creation failed: ERROR:  permission denied to create database

Still having trouble with postgres. Can we have a Vagrantfile with the script that builds a Debian VM to run in with postgres already configured?

Note: See TracTickets for help on using tickets.