Enable metrics-lib to process large (> 2G) logfiles
Metrics-lib receives compressed logs, usually smaller than 600 kB. Since compressed input of this size can be handled in memory, this ticket is about handling logs that decompress to large files (approx. 2G).
Commons Compress doesn't provide methods for determining the decompressed content size (as the command-line tool xz does). Other compression types that metrics-lib supports have this option, but relying on it would require further changes as well.
Compression can be very effective, so any cut-off based on compressed size is somewhat arbitrary. An example for xz compression: a log that decompresses to 3G has a compressed input array of length 589492; with extreme compression it shrinks further to 405480; on the other hand, a file that decompresses to only 64M can have an input array of length 509212.
Handling larger log files with metrics-lib will require some interface changes. Here is a suggestion:
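To illustrate why compressed size says little about decompressed size, here is a minimal sketch. It is an assumption-laden stand-in: it uses gzip from the JDK instead of xz (to stay free of extra dependencies), and the class and helper names (`CompressedSizeSketch`, `repetitive`, `random`) are made up for this example and are not part of metrics-lib.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class CompressedSizeSketch {

  /** Compresses the given bytes with gzip (stand-in for xz, assumption). */
  public static byte[] gzip(byte[] data) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(data);
    }
    return out.toByteArray();
  }

  /** Builds highly repetitive, log-like content of the given size. */
  public static byte[] repetitive(int size) {
    byte[] line = "0.0.0.0 - - \"GET / HTTP/1.0\" 200 1234\n"
        .getBytes(StandardCharsets.US_ASCII);
    byte[] data = new byte[size];
    for (int i = 0; i < size; i++) {
      data[i] = line[i % line.length];
    }
    return data;
  }

  /** Builds pseudo-random content, which hardly compresses at all. */
  public static byte[] random(int size, long seed) {
    byte[] data = new byte[size];
    new Random(seed).nextBytes(data);
    return data;
  }

  public static void main(String[] args) throws IOException {
    byte[] large = repetitive(1 << 20);     // 1 MiB, very compressible
    byte[] small = random(16 * 1024, 42L);  // 16 KiB, barely compressible
    System.out.println("1 MiB repetitive -> " + gzip(large).length
        + " compressed bytes");
    System.out.println("16 KiB random    -> " + gzip(small).length
        + " compressed bytes");
  }
}
```

The larger input ends up with the smaller compressed array here, which is the same effect as in the xz numbers above, just at a much smaller scale.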
public interface LogDescriptor extends Descriptor {
/**
- * Returns the decompressed raw descriptor bytes of the log.
+ * Returns the compressed raw descriptor bytes of the log.
+ *
+ * <p>For access to the log's decompressed bytes
+ * use method {@code decompressedByteStream}.</p>
+ *
* @since 2.2.0
*/
public byte[] getRawDescriptorBytes();
/**
+ * Returns the decompressed raw descriptor bytes of the log as a stream.
+ *
+ * @since 2.2.0
+ */
+ public InputStream decompressedByteStream();
+
I think this might be easiest to understand and use, and the implementation wouldn't need to process large and 'normal' logs differently. It also avoids having to decide how to determine whether a file is large.
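A rough sketch of how a caller might use the two methods, to show that nothing but the compressed bytes needs to be held in memory. Everything here is hypothetical: `CompressedLog` only mirrors the two proposed method signatures and is not the real `LogDescriptor`, and gzip from the JDK stands in for xz so the example runs without Commons Compress.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StreamingLogSketch {

  /** Hypothetical stand-in mirroring the two proposed methods. */
  public interface CompressedLog {
    byte[] getRawDescriptorBytes();

    InputStream decompressedByteStream();
  }

  /** Wraps compressed bytes; decompression happens lazily on read. */
  public static CompressedLog fromCompressedBytes(final byte[] compressed) {
    return new CompressedLog() {
      @Override
      public byte[] getRawDescriptorBytes() {
        return compressed;
      }

      @Override
      public InputStream decompressedByteStream() {
        try {
          // gzip is an assumption here; the real code would pick the
          // decompressor matching the log's compression type.
          return new GZIPInputStream(new ByteArrayInputStream(compressed));
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }
    };
  }

  /** Counts lines without ever materializing the decompressed log. */
  public static long countLines(CompressedLog log) throws IOException {
    long lines = 0;
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        log.decompressedByteStream(), StandardCharsets.UTF_8))) {
      while (reader.readLine() != null) {
        lines++;
      }
    }
    return lines;
  }

  /** Helper for the demo: gzip-compresses a string. */
  public static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100000; i++) {
      sb.append("0.0.0.0 - - \"GET / HTTP/1.0\" 200 1234\n");
    }
    CompressedLog log = fromCompressedBytes(gzip(sb.toString()));
    System.out.println("compressed bytes: "
        + log.getRawDescriptorBytes().length);
    System.out.println("decompressed lines: " + countLines(log));
  }
}
```

The caller's code path is the same whether the log decompresses to a few kilobytes or several gigabytes, so no size check is needed.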
Thoughts?