#25523 closed defect (fixed)

Add support for webstats tarballs

Reported by: karsten Owned by: iwakeh
Priority: Medium Milestone: metrics-lib 2.3.0
Component: Metrics/Library Version:
Severity: Normal Keywords:
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by iwakeh)

I started creating tarballs containing .xz-compressed webstats files. When I attempt to feed them into DescriptorReader, it fails with an exception like the following:

Cannot parse descriptor file ’in/webstats-2016-01.tar’.
��s",�����k)�nnq����w؆jG�I�[1��eѰCx%��'.
        at org.torproject.descriptor.impl.DescriptorParserImpl.detectTypeAndParseDescriptors(DescriptorParserImpl.java:136)
        at org.torproject.descriptor.impl.DescriptorParserImpl.parseDescriptors(DescriptorParserImpl.java:33)
        at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readTarball(DescriptorReaderImpl.java:325)
        at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.readTarballs(DescriptorReaderImpl.java:276)
        at org.torproject.descriptor.impl.DescriptorReaderImpl$DescriptorReaderRunnable.run(DescriptorReaderImpl.java:162)
        at java.lang.Thread.run(Thread.java:745)}

The tarballs I created contain files as follows:

$ tar tf webstats-2016-01.tar
[...]
webstats-2016-01/torproject.org/2016/01/25/torproject.org_aroides.torproject.org_access.log_20160125.xz
webstats-2016-01/torproject.org/2016/01/25/torproject.org_archeotrichon.torproject.org_access.log_20160125.xz

When I extract tarball files before reading them with DescriptorReader, this works just fine.

I think that the issue is that DescriptorParserImpl#detectTypeAndParseDescriptors() looks at descriptorFile rather than fileName to obtain the file name. The effect is that it learns the tarball file name, rather than the file name of the contained log file:

-    if (descriptorFile.getName().contains(LogDescriptorImpl.MARKER)
+    if (fileName.contains(LogDescriptorImpl.MARKER)

The above is untested and probably insufficient. It's just supposed to start the bug hunting. Priority is medium, because we can just extract tarballs for now. But it's a bug, and it may confuse users as soon as we provide these tarballs and no working code to process them.

This is also related to #22695.

Assigning to iwakeh who said they'd like to grab it.

Child Tickets

Change History (7)

comment:1 Changed 19 months ago by iwakeh

Description: modified (diff)
Status: assignedaccepted

Taken.

comment:2 Changed 18 months ago by iwakeh

Status: acceptedneeds_review

Your assumption was right.

I first clarified issue #22695 and based on the patch to that ticket, which only changes javadoc and parameter names, added a new test, adapted an old one, and implemented the suggested fix. Please review this commit.

comment:3 Changed 18 months ago by iwakeh

Milestone: metrics-lib 2.3.0

comment:4 Changed 18 months ago by karsten

Status: needs_reviewneeds_revision

This patch does enable metrics-lib to parse webstats tarballs. So far so good!

However, Descriptor#getDescriptorFile now returns a "virtual" File reference for the tarball entry, rather than the File reference of the tarball. For example, here are the results of that method for a webstats log and consensus:

www-staging.torproject.org_stellatum.torproject.org_access.log_20150211.xz
in/consensuses-2017-02.tar.xz

The specification of that method says: "Return the file, tarball or plain file, that contained this descriptor, or null if this descriptor was not read from a file."

Would you like to provide a fix for that? And can you include a change log entry for the change? Thanks!

comment:5 Changed 17 months ago by iwakeh

Status: needs_revisionaccepted

comment:6 Changed 17 months ago by iwakeh

Status: acceptedneeds_review

True, the implementation needs to be changed too.

Please review another commit, which also includes a changelog entry.

comment:7 Changed 17 months ago by karsten

Resolution: fixed
Status: needs_reviewclosed

Looks good! Made two trivial whitespace fixes, rephrased the change log entry a little bit, squashes everything, tested it locally, and pushed to master. Closing. Thanks!

Note: See TracTickets for help on using tickets.