Opened 5 months ago

Last modified 4 weeks ago

#22983 needs_revision enhancement

Add a Descriptor subinterface and implementation for Tor web server logs

Reported by: iwakeh Owned by: metrics-team
Priority: Medium Milestone: metrics-lib 2.2.0
Component: Metrics/Library Version:
Severity: Normal Keywords: metrics-2017
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

The webstats-log-files are the only files available on CollecTor (in future) that are not yet covered by metrics-lib.

Should there be a 'LogDescriptor' interface and implementation in metrics-lib? Are there any reasons why not?

The name LogDescriptor is just a working name; better naming suggestions welcome.
The interface should extend Descriptor and have additional methods for retrieving the measuring host and the served host, a method for retrieving the date of the log.
What else?

Child Tickets

Change History (53)

comment:1 Changed 5 months ago by karsten

Random thought: these logs can be really huge and also quite repetitive. Maybe we can try to save a bit of memory by trying out a few things when parsing them. Ideas:

  1. Store indexes into the raw descriptor string rather than copies of strings.
  2. Alternatively, only store fields that have changed compared to the previous line. If a field has not changed, store a reference to the first line that contains the same field contents. This makes use of the fact that lines are sorted.
  3. Maybe don't even store the raw descriptor string and instead throw an exception when it gets called. Only do this if 2 requires a lot less memory than 1 above.

comment:2 Changed 5 months ago by iwakeh

Well, metrics-lib doesn't need to 'read' the contents of the logs. Thus, the 'raw bytes' could be from the compressed content for now. Using the compressed content might be the most efficient in terms of runtime memory and development hassle.

A list of interface methods without the methods from Descriptor:

  • getMeasuringHost (taken from path, i.e. the immediate parent folder name)
  • getServedHost (first part up to a dash of the filename)
  • getLogDate (filename's last part before extension)
  • getCompressionType (the compression type of the raw bytes, taken from the file extension)

Maybe also for convenience

  • getUncompressedData (with a comment warning about the possibly huge deflate size)

What is missing?

comment:3 Changed 5 months ago by iwakeh

To visualize comment:2, please see the following interface:

/* Copyright 2017 The Tor Project
 * See LICENSE for licensing information */

package org.torproject.descriptor;

import java.util.List;

/**
 * Interface for sanitized (add a link to a description of the process)
 * log-file descriptors.
 *
 * @since 2.1.0
 */
public interface LogDescriptor extends Descriptor {

  /**
   * Returns the hostname of the machine on which the logs were gathered.
   *
   * @since 2.1.0
   */
  public String getMeasuringHost();

  /**
   * Returns the name of the host served and for which the logs were gathered.
   *
   * @since 2.1.0
   */
  public String getServedHost();

  /**
   * Returns the date of the log in yyymmdd format.
   *
   * @since 2.1.0
   */
  public String getLogDate();

  /**
   * Return the raw descriptor bytes of the compressed sanitized logs.
   * 
   * <p>The compression type is returned by {@link #getCompressionType()}
   * and is one of 'gz' or 'xz'.  Additional types might be added in
   * future.</p>
   *
   * @since 2.1.0
   */
  @Override
  public byte[] getRawDescriptorBytes();

  /**
   * Returns the compression type as one of 'gz' or 'xz'.
   *
   * <p>Additional types might be added in future.</p>
   *
   * @since 2.1.0
   */
  public byte[] getCompressionType();

  /**
   * Will always return an empty List as logs don't have any annotations.
   *
   * @since 2.1.0
   */
  @Override
  public List<String> getAnnotations();

  /**
   * Will always return an empty List as unrecognized lines are not provided
   * for sanitized logs.
   *
   * @since 2.1.0
   */
  @Override
  public List<String> getUnrecognizedLines();
}

Last edited 5 months ago by iwakeh (previous) (diff)

comment:4 Changed 5 months ago by karsten

Regarding the name, let's try to find something more descriptive. How about WebServerLog or even ApacheHttpServerAccessLog? Otherwise there's the risk of confusion with descriptor types added in the future, like a log file written by BridgeDB containing client requests for bridge addresses.

Regarding the suggested interface, I think there's a short term and a long term part here.

In the long term I think that it would be at least twice as useful if we read the log contents and added methods to read these parsed contents. It's true that this causes some development hassle. But that's why we do it once in the library rather than rely on possibly more than one application to get it right. And we can still include the raw descriptor bytes by storing the compressed bytes and inflate them upon request.

Some comments on the interface:

  • Let's include a subtype Request or similar for each line contained in the log file, and let's include a method getRequests() that returns Iterable<Request>.
  • Due to the fact that we cannot include a @type annotation with a version number, Request should ideally include getters for all fields contained in Apache's Combined Log Format.
  • Ideally, getLogDate() would return the date in milliseconds since the epoch to be conformant to the rest of metrics-lib, in which case it would probably be called getLogMillis().
  • I'm unclear what getCompressionType() returns. I think I'd expect a String that is either "gz" or "gz", but not a byte[]. Was that intended?
  • If we read and parse logs, we'll have to change getUnrecognizedLines() to return any unrecognized lines.

In the short term I can see how we might want to put the Request part on hold and only return metadata and uncompressed raw descriptor contents in this new descriptor type.

comment:5 in reply to:  4 ; Changed 5 months ago by iwakeh

Replying to karsten:

Thanks for the valuable input!

Regarding the name, let's try to find something more descriptive. How about WebServerLog or even ApacheHttpServerAccessLog? Otherwise there's the risk of confusion with descriptor types added in the future, like a log file written by BridgeDB containing client requests for bridge addresses.

I see an interface hierarchy here:
LogDescriptor as parent for all logs (then we drop 'Descriptor' from the names) and have the first extending interface WebServerAccessLog. Later we can add others *Log interfaces like BridgeDbClientLog etc.

So for now, I focus on the access-log integration and keep future extensions in mind for the design.

Regarding the suggested interface, I think there's a short term and a long term part here.

In the long term I think that it would be at least twice as useful if we read the log contents and added methods to read these parsed contents. It's true that this causes some development hassle. But that's why we do it once in the library rather than rely on possibly more than one application to get it right. And we can still include the raw descriptor bytes by storing the compressed bytes and inflate them upon request.

Yes, partially I have this in CollecTor anyway for sanitizing the logs. I'll add generally useful functionality to the metrics-lib code.
Should we have a new package for the implementations like org.torproject.descriptor.logs? The log processing and content differs from usual descriptors quite a bit.

Some comments on the interface:

  • Let's include a subtype Request or similar for each line contained in the log file, and let's include a method getRequests() that returns Iterable<Request>.

There could be a parent interface LogLine that is extended by an appropriate interface for each log type, like a Request interface for access-logs.
I think about it and definitly keep the design open for the addition, but would put it on lesser priority right now.

  • Due to the fact that we cannot include a @type annotation with a version number, Request should ideally include getters for all fields contained in Apache's Combined Log Format.
  • Ideally, getLogDate() would return the date in milliseconds since the epoch to be conformant to the rest of metrics-lib, in which case it would probably be called getLogMillis().

Fine, but we only have the date no time here. Thus, msec signals a precision we don't offer.
I don't feel strongly about that.

  • I'm unclear what getCompressionType() returns. I think I'd expect a String that is either "gz" or "gz", but not a byte[]. Was that intended?

Correct, this should read String getCompressionType(), just a typo. Actually, it might turn into an enum.

  • If we read and parse logs, we'll have to change getUnrecognizedLines() to return any unrecognized lines.

Yes, maybe with an upper limit in case a log got mangled?

In the short term I can see how we might want to put the Request part on hold and only return metadata and uncompressed raw descriptor contents in this new descriptor type.

Fine, as replied above.

Do you have a rough estimate of the future log file sizes metrics-lib will have to deal with?

Last edited 5 months ago by iwakeh (previous) (diff)

comment:6 in reply to:  5 Changed 5 months ago by karsten

Replying to iwakeh:

Replying to karsten:

Thanks for the valuable input!

Regarding the name, let's try to find something more descriptive. How about WebServerLog or even ApacheHttpServerAccessLog? Otherwise there's the risk of confusion with descriptor types added in the future, like a log file written by BridgeDB containing client requests for bridge addresses.

I see an interface hierarchy here:
LogDescriptor as parent for all logs (then we drop 'Descriptor' from the names) and have the first extending interface WebServerAccessLog. Later we can add others *Log interfaces like BridgeDbClientLog etc.

So for now, I focus on the access-log integration and keep future extensions in mind for the design.

Makes sense.

Regarding the suggested interface, I think there's a short term and a long term part here.

In the long term I think that it would be at least twice as useful if we read the log contents and added methods to read these parsed contents. It's true that this causes some development hassle. But that's why we do it once in the library rather than rely on possibly more than one application to get it right. And we can still include the raw descriptor bytes by storing the compressed bytes and inflate them upon request.

Yes, partially I have this in CollecTor anyway for sanitizing the logs. I'll add generally useful functionality to the metrics-lib code.
Should we have a new package for the implementations like org.torproject.descriptor.logs? The log processing and content differs from usual descriptors quite a bit.

As long as we keep all types that are relevant for applications in org.torproject.descriptor, I don't mind adding new subpackages.

Some comments on the interface:

  • Let's include a subtype Request or similar for each line contained in the log file, and let's include a method getRequests() that returns Iterable<Request>.

There could be a parent interface LogLine that is extended by an appropriate interface for each log type, like a Request interface for access-logs.
I think about it and definitly keep the design open for the addition, but would put it on lesser priority right now.

Long term sounds fine.

  • Due to the fact that we cannot include a @type annotation with a version number, Request should ideally include getters for all fields contained in Apache's Combined Log Format.
  • Ideally, getLogDate() would return the date in milliseconds since the epoch to be conformant to the rest of metrics-lib, in which case it would probably be called getLogMillis().

Fine, but we only have the date no time here. Thus, msec signals a precision we don't offer.
I don't feel strongly about that.

Me neither, I just think that it's easier to handle timestamps from different data sources if they all use the same format.

  • I'm unclear what getCompressionType() returns. I think I'd expect a String that is either "gz" or "gz", but not a byte[]. Was that intended?

Correct, this should read String getCompressionType(), just a typo. Actually, it might turn into an enum.

  • If we read and parse logs, we'll have to change getUnrecognizedLines() to return any unrecognized lines.

Yes, maybe with an upper limit in case a log got mangled?

Good idea. First 100?

In the short term I can see how we might want to put the Request part on hold and only return metadata and uncompressed raw descriptor contents in this new descriptor type.

Fine, as replied above.

Do you have a rough estimate of the future log file sizes metrics-lib will have to deal with?

No idea. I think some of the Apache logs are pretty large in uncompressed form. But other descriptors have grown a lot over time, too, like votes. And when we recently pondered appending all votes collected in a singly CollecTor sync run, our original expectation of the size turned out to be pretty useless. So, 20 times the size?

comment:7 Changed 5 months ago by iwakeh

Status: newneeds_review

Please review this branch.

The first commit adds the new interfaces and their implementations.

LogDescriptor contains all methods that will be hopefully applicable for all log-types possible.
WebServerAccessLog is the specialization for access-logs.

LogDescriptor also offers a sub-interface:

   /**
   * Providing a single function for removing sensitive data from a
   * given Apache Access Log log line.
   */
  public interface Sanitizer {

    /** Returns a cleaned log line, i.e., without possibly privacy
     * sensitive values. */
    public String clean(String line);
  }

and a method sanitize(). The latter applies the cleaning procedure to all log lines and sorts the resulting lines. The default sanitizer returns the line w/o any changes. This setup keeps all descriptor parsing, compression, un-compression in metrics-lib; CollecTor is not forced to re-implement parsing functionality and only needs to provide the log cleaning procedure. (A similar approach could be thought up for bridge-sanitation, too.)

The second commit makes DescriptorParser aware of the new types and avoids implementation javadoc comment generation for the new package.

All of the code is covered by tests which are added in this commit. Total coverage even improved by one percent :-)

The addition of another sub-interface LogDescriptor.LogLine (and the extensions to WebServerAccessLogLine) will be part of a new ticket, which will also provide unrecognized lines for access-logs.

comment:8 Changed 5 months ago by iwakeh

Milestone: metrics-lib 2.1.0

Prerequisite for CollecTor 1.3.0

comment:9 Changed 5 months ago by iwakeh

The new ticket for LogLine implementation is #23046.

comment:10 Changed 5 months ago by karsten

I need more time for this review. But here's a first question:

Should we really put the sanitizing code into metrics-lib rather than CollecTor? That's an important design decision and a change to what we have been doing in the past. Where would this code be used other than in CollecTor? So far, metrics-lib has primarily been the client-side library for applications using CollecTor data. But this change would turn it into a library that both the CollecTor server and its clients depend on. I'm yet undecided whether this is a good idea or not. In any case, we should discuss this first.

I'll continue my review later today and post my findings here.

comment:11 in reply to:  10 Changed 5 months ago by iwakeh

Replying to karsten:

I need more time for this review. But here's a first question:

Should we really put the sanitizing code into metrics-lib rather than CollecTor? That's an important design decision and a change to what we have been doing in the past. Where would this code be used other than in CollecTor? So far, metrics-lib has primarily been the client-side library for applications using CollecTor data. But this change would turn it into a library that both the CollecTor server and its clients depend on. I'm yet undecided whether this is a good idea or not. In any case, we should discuss this first.

The sanitizing code is not part of metrics-lib. Thus, we agree here. In the proposed patch metrics-lib enables adding sanitizing code from the 'outside' using method

public void setSanitizer(LogDescriptor.Sanitizer sani);

and Sanitizer is just a functional interface (i.e., having one method) that can be fulfilled by a lambda expression once we go to Java8, but that's an aside.

The given Sanitizer is applied when sanitize() is called. The resulting lines are sorted by metrics-lib.

A choice I made is to have a default identity sanitizer in case none was set instead of raising an exception.

With this approach metrics-lib is sanitizer-code-agnostic, but provides all else (compression, de-compression, etc.), which avoids duplicating code and enables us to implement performance and space saving code 'under the hood' once it is needed. Hope this explains my reasoning.

CollecTor depends on metrics-lib already as it uses Descriptors of all kinds as well as parser and reader from metrics-lib.

(I noticed that I missed adding a comment to LogDescriptor.setSantizer(). I'll add a fixup commit, but that shouldn't hinder review.)

comment:12 Changed 5 months ago by karsten

Status: needs_reviewneeds_revision

Hmm, no, as of today I'm not convinced that this is a good idea. It might be and I'm not seeing it yet, or how it would work for other sanitized descriptors. But I'd rather merge something simple in the next couple of days and not rush this somewhat major design change. I agree that there's no actual sanitizing code in metrics-lib. But except for this new interface, there's also no other interface in metrics-lib where one can plug in sanitizing code. Avoiding duplicate code is good, but keeping interfaces small and intuitive is good, too. Let's open a ticket for that and do it in a separate step when we have more time to think about it.

Here's some more feedback on the LogDescriptor interface:

  • The documentation of TYPE says that the name should include a string. But who should make sure that this is the case? The application developer? Can you rephrase that to say what is expected here?
  • Some of the method descriptions are written in 3rd person ("Returns ..."), some in 2nd person ("Return ..."). I believe that 3rd person is preferred, though we're not doing that consistently in the rest of metrics-lib. But we should at least try to do it consistently in one interface.
  • "yyymmdd" -> "yyyymmdd" in getLogDate()
  • I'm unclear what to expect as return value from getLogType(). What types exist? Do I get a class name, or what?
  • Maybe rename getLogMillis() to getLogDateMillis() to indicate that we're just returning the milliseconds of the date at 00:00:00 UTC, not the milliseconds of whatever time of the day the log was produced.
  • Please avoid abbreviations like "msec" and instead write "milliseconds since the epoch", for consistency.
  • The JavaDoc for getRawDescriptorBytes() should not copy the known compression types, but refer to getCompressionType() for the list. Right now, there's already an inconsistency between the list regarding bz2.
  • Are uncompressed logs supported? The documentation for getRawDescriptorBytes() doesn't indicate that, nor does getCompressionType() say what it would return in that case.
  • Shouldn't getRawDescriptorBytes() return the uncompressed bytes and a separate method getCompressedBytes() return the compressed bytes? Thinking of being consistent with other descriptors where getRawDescriptorBytes() returns uncompressed bytes, too. Not sure about this one.
  • "added in future" -> "added in the future"
  • We briefly discussed above to include the first, say, 100 unrecognized lines in getUnrecognizedLines(), but the documentation says it doesn't reply any. Why? Because it's not implemented yet?
  • As explained above, let's take out the Sanitizer subinterface and related methods.

I'd like to wait for your response and a revised branch before doing another review of the remaining code. I'm not around most of Saturday but can take a look after that. Or on Monday, if you'd rather enjoy your weekend. :) Thanks!

comment:13 Changed 5 months ago by iwakeh

Thanks for review! I'll get to the details later, but first:

A similar discussion came up during the work on JSON handling parts of CollecTor and metrics-lib. Thinking about this for a while, I'd say we should step back a little and tackle the main discussion to find a solution for having a clean API without any (cross-product) code duplication:

Goals:
1) keep metric-lib API a pure client API
2) avoid all code duplication throughout Metrics' products

CollecTor is an unusual metrics-lib client, because CollecTor alters descriptors and even creates the final descriptors (e.g., adding annotations, sanitizing privacy problematic data). Thus, there is some functional overlap (extracting bytes, (de)compression, JSON processing) with metrics-lib.

Adding another Metrics product for these cross-concerns seems too much overhead in terms of dealing with the additional dependency in Metrics' products, committing to maintain another product and another public API.
Extending metrics-lib's API (as with the above sanitize method) breaks 1.
On the other hand, rewriting parsing code (like for bridge-descs-sanitation) or de/compression functionality violates 2.

Suggestion:
(hope the following doesn't sound crazy)
What about granting CollecTor the special role (which it has anyway) by providing an 'internal' API in metrics-lib?
This would mainly consist in adding interfaces in sub-packages of o.tp.descriptor that allow access to implementational functionality, which are explicitly not part of the public (as in published javadoc-API) API.

The 'internal' interfaces serve as the special contract between CollecTor and metrics-lib, i.e., CollecTor code should not access metrics-lib's implementation, but only the 'internal' and public API to avoid code duplication. (For example, I could envision an interface o.tp.d.log.InternalLogDescriptor that enables setting the descriptor's byte array etc. And, in case you worry, I would not add a 'sanitize()' method to such an interface.)

This approach seems to have the benefits a new cross-concerns Metrics product could offer without the hassle of another code-base and what all is attached to it.
I think, it also would finally pave the way for trimming down and streamlining all CollecTor modules.

If that makes some sense, I can provide branches for the web-log related code for both metrics-lib and CollecTor. So, we can see how the approach works in code.

comment:14 Changed 5 months ago by karsten

I'm game. Your suggestion doesn't at all sound crazy. The only reason I pushed back was that such a change requires a lot of consideration and discussion and shouldn't be included together with a new feature but on its own. A few thoughts:

  • How about we include a second package with a second set of interfaces for CollecTor and other hypothetical applications providing and/or sanitizing descriptors? Ideally, applications that simply consume descriptors wouldn't have to worry about that second set. And implementation classes could implement interfaces from both sets.
  • I agree that the discussion of JSON handling parts in CollecTor and metrics-lib was related. Another aspect there is metrics-web that will soon need to parse CollecTor's index.json, too, in order to implement #22836.
  • I'm not opposed to including the sanitizing code in metrics-lib if it's hidden from the descriptor-consuming interfaces. CollecTor will still contain all the logic for timing downloads. But as soon as we put the sanitized bridge descriptor specification into metrics-web, CollecTor won't be the only place anymore that needs changing when we change sanitizing code anyway. So, it doesn't really matter whether the sanitizing code is in CollecTor or metrics-lib. I could imagine that we can remove some duplicate code by putting everything directly related to messing with descriptors in a single place.
  • We should focus on one change at a time. That is, we could either start with adding web logs without any sanitizing support and duplicate some code in CollecTor. Or we could start with this discussion of extending scope of metrics-lib and put the web logs extension on hold. I'm fine with either approach.

There, I didn't fully think this through, but I didn't want to delay this any further in case you want to give this more thoughts. I'll keep thinking about this a little, as time permits, but please let me know how you want to continue, and I'll try to help move it forward in that direction. Thanks!

comment:15 Changed 5 months ago by iwakeh

Replying to karsten:

I'm game. Your suggestion doesn't at all sound crazy. The only reason I pushed back was that such a change requires a lot of consideration and discussion and shouldn't be included together with a new feature but on its own. A few thoughts:

Fine :-)

  • How about we include a second package with a second set of interfaces for CollecTor and other hypothetical applications providing and/or sanitizing descriptors? Ideally, applications that simply consume descriptors wouldn't have to worry about that second set. And implementation classes could implement interfaces from both sets.

Makes sense; I would suggest org.torproject.descriptor.internal.

  • I agree that the discussion of JSON handling parts in CollecTor and metrics-lib was related. Another aspect there is metrics-web that will soon need to parse CollecTor's index.json, too, in order to implement #22836.

Yep.

  • I'm not opposed to including the sanitizing code in metrics-lib if it's hidden from the descriptor-consuming interfaces. CollecTor will still contain all the logic for timing downloads. But as soon as we put the sanitized bridge descriptor specification into metrics-web, CollecTor won't be the only place anymore that needs changing when we change sanitizing code anyway. So, it doesn't really matter whether the sanitizing code is in CollecTor or metrics-lib. I could imagine that we can remove some duplicate code by putting everything directly related to messing with descriptors in a single place.

Agreed. This would also allow tests that could catch problems when sanitation changes etc.

  • We should focus on one change at a time. That is, we could either start with adding web logs without any sanitizing support and duplicate some code in CollecTor. Or we could start with this discussion of extending scope of metrics-lib and put the web logs extension on hold. I'm fine with either approach.

Well, as for the next steps I would go ahead and add a section to metrics-lib README.md about the internal API package, which will condense and finalize the discussion (new ticket?).
Then, I'd also like to start immediately putting the definitions/design there into code for webstats. The code for the webstats module is already there and just needs to be 'distributed' across products.
The public-API addition to metrics-lib is defined (i.e., the above with the suggestions from your comment and w/o any sanitation related). The additional internal-API is internal and only used in CollecTor's new webstats module, which leaves a lot freedom to adapt things, if necessary.

There, I didn't fully think this through, but I didn't want to delay this any further in case you want to give this more thoughts. I'll keep thinking about this a little, as time permits, but please let me know how you want to continue, and I'll try to help move it forward in that direction. Thanks!

Thanks, too!
I think, this is a good step forward.

comment:16 Changed 5 months ago by karsten

I think we agree on most points here.

Just one suggestion for the internal package: we could label it as "beta" with the intention to making it a public API at a later stage, while reserving to break the API even between minor releases until we remove the "beta" label. This would give us some freedom while designing this new API to make mistakes and revise ideas.

comment:17 in reply to:  16 Changed 4 months ago by iwakeh

Replying to karsten:

I think we agree on most points here.

Just one suggestion for the internal package: we could label it as "beta" with the intention to making it a public API at a later stage, while reserving to break the API even between minor releases until we remove the "beta" label. This would give us some freedom while designing this new API to make mistakes and revise ideas.

I implemented the changes along the lines we discussed in the above comments.
There are plenty of tests for the new code (which already helped uncover subtle bugs) and javadoc for both the public and the internal parts, which should also address some of the questions from your first review comments.

Please review seven commits on this branch. Three of these commits (one, two, three) are tiny maintenance commits and the other four implement this task.

As a result of this patch the CollecTor patch for webstats (#22428) will be very small and quite some code reduction will be possible in other CollecTor modules.

comment:18 Changed 4 months ago by iwakeh

Status: needs_revisionneeds_review

comment:19 Changed 4 months ago by karsten

That's quite a few commits there, more than I can handle in a single chunk. But let me start going through them and put reviews here as I finish them.

e0c5774 and e224680 look good. Already merged to master.

Some suggestions for 77b143d:

  • JSON is not exactly a compression type, it's a file content type. Maybe we can remove it now and fall back to the same PLAIN type that we fall back to for uncompressed logs?
  • The (duplicate) documentation in FileType doesn't help that much. I'd say let's either document all types or none of them. If I had to choose, I'd say none.
  • In FileType#findType, I think it's bad to catch RuntimeException, because that covers all kinds of programming errors made by application developers. For example, if they give us ext = null as parameter, we'll tell them it's a PLAIN file, but really they shouldn't give us that parameter. Better catch IllegalArgumentException only and let them catch their NullPointerException if they think that giving us null is a good idea. Related to this, let's not say in the documentation that we're not throwing any exceptions. (Of course, if the plan is to handle null exactly like this, let's put IllegalArgumentException | NullPointerException in the catch clause to indicate that we put it there intentionally.)
  • If you make changes to this commit, please make them as fixup/squash commit on top of the branch, so that I can continue reviewing subsequent commits.

d687f44 looks good.

I didn't finish my review of 76ae1e7, but here's some early feedback:

  • There's a file IndexNode.java.orig which should go away (in a later fixup commit).
  • The documentation of WebServerAccessLogImpl says that "the file is not read." Yet, it says a few sentences later that "it will be compressed to the default compression type" in case it's not compressed yet. That's a contradiction. And should we really do this? Or should we provide a way for the (internal) user to compress the file using the compression type of their choice?

I'll resume the review as soon as I get to it.

comment:20 in reply to:  19 ; Changed 4 months ago by iwakeh

Replying to karsten:

That's quite a few commits there, more than I can handle in a single chunk. But let me start going through them and put reviews here as I finish them.

Yes, thanks for starting!

e0c5774 and e224680 look good. Already merged to master.

Some suggestions for 77b143d:

  • JSON is not exactly a compression type, it's a file content type. Maybe we can remove it now and fall back to the same PLAIN type that we fall back to for uncompressed logs?

I didn't want to impose additional changes for generating the index.json* files. I think it is also useful to state explicitly that files ending in '.json' are not compressed.
Maybe, we can revisit the question later when applying more changes, e.g., also adding enums for 'tar' and 'tar.gz' etc.?

  • The (duplicate) documentation in FileType doesn't help that much. I'd say let's either document all types or none of them. If I had to choose, I'd say none.

Agreed, I removed the comments.

  • In FileType#findType, I think it's bad to catch RuntimeException, because that covers all kinds of programming errors made by application developers. For example, if they give us ext = null as parameter, we'll tell them it's a PLAIN file, but really they shouldn't give us that parameter. Better catch IllegalArgumentException only and let them catch their NullPointerException if they think that giving us null is a good idea. Related to this, let's not say in the documentation that we're not throwing any exceptions. (Of course, if the plan is to handle null exactly like this, let's put IllegalArgumentException | NullPointerException in the catch clause to indicate that we put it there intentionally.)

True, the catch can be more restrictive; a fixup-commit contains the IllegalArgumentException | NullPointerException variation.

  • If you make changes to this commit, please make them as fixup/squash commit on top of the branch, so that I can continue reviewing subsequent commits.

d687f44 looks good.

I didn't finish my review of 76ae1e7, but here's some early feedback:

  • There's a file IndexNode.java.orig which should go away (in a later fixup commit).

Removed.

  • The documentation of WebServerAccessLogImpl says that "the file is not read." Yet, it says a few sentences later that "it will be compressed to the default compression type" in case it's not compressed yet. That's a contradiction. And should we really do this? Or should we provide a way for the (internal) user to compress the file using the compression type of their choice?

Good point! I added a fixup commit, which provides a constructor arg for the default compression, but that will only be used, if the supplied log file doesn't have any compression extension. All other cases, like changing compression type should be handled outside of metrics-lib using FileType enums directly.

There is no contradiction, b/c the given bytes will be compressed, but not the file. The File constructor argument is here to implement Descriptor interface's method getDescriptorFile (and also b/c we need filename and immediate parent path for meta-data).
I think, it doesn't hurt to move the discussion and resulting changes to 'the big picture' of ticket #22695.

I'll resume the review as soon as I get to it.

All changes are provided as fixup commits.

comment:21 in reply to:  20 Changed 4 months ago by karsten

Replying to iwakeh:

Replying to karsten:

That's quite a few commits there, more than I can handle in a single chunk. But let me start going through them and put reviews here as I finish them.

Yes, thanks for starting!

e0c5774 and e224680 look good. Already merged to master.

Some suggestions for 77b143d:

Fixes look good, squashed and merged to master.

d687f44 looks good.

Merged to master.

I didn't finish my review of 76ae1e7, but here's some early feedback:

Fixes so far look good, though I might have more thoughts on the compression part. I assumed that you wouldn't work on this branch without new input, so I went ahead and squashed the fixup commits and pushed everything to my task-22983-3 branch which I'll refer to from now on. Stay tuned! :)

comment:22 Changed 4 months ago by karsten

Status: needs_reviewneeds_revision

Alright, I looked at the remaining commits (in my task-22983-3 branch). I'll start with the major issues/questions in no specific order:

  1. Why do we compress previously uncompressed log files? I see the point of saving memory, but we'd be doing that by sacrificing CPU time. And if we later change to leaving file contents on disk and only storing offsets and lengths into files, it would be wasteful to store a compressed copy of the file in memory just in case the application might need it later. Ideally, we'd just store a reference to the byte[] in whatever compression we're given, including uncompressed.
  1. I already brought this up in comment 12 above but didn't see a response: "Shouldn't getRawDescriptorBytes() return the uncompressed bytes and a separate method getCompressedBytes() return the compressed bytes? Thinking of being consistent with other descriptors where getRawDescriptorBytes() returns uncompressed bytes, too. Not sure about this one." Related to this, should getRawDescriptorLength() return the length of the uncompressed byte array? (This possibly requires uncompressing the file on first invocation and storing the length in an attribute.) What do you think about changing this for the sake of library consistency?
  1. Why do sanitized log lines contain a trailing -, as in: ... 403 294 \"-\" \"-\" -\n? I know that the Python script also added that trailing dash, so I'm asking if you think there's a reason to keep that. The Combined Log Format does not specify one. If you think it can go away we should quickly check with Sebastian and then take it out.
  1. From the tests it seems like POST requests are kept, too. However, we should only keep GET and HEAD requests, just like the Python script. Likewise, 400 and 404 requests should be discarded. Maybe check for other deviations from the script yourself. And in the next review round we should compare the two sanitizers (Python and Java) using some real logs. Or do you still have logs to run some tests yourself?
  1. LogDescriptorImpl should not sort logs by default as part part of the sanitizing step. That's a specific sanitizing technique for web server logs. It might be that a future log format only requires removing certain fields but not re-ordering log lines. Maybe there should be a second method cleanLines(List<String>) in InternalLogDescriptor.Sanitizer, and the existing method should be renamed to cleanLine(String). The default sanitizer should keep the order unchanged and simply return the list it gets.

I also found a few minor issues where it might be easiest if I fix them myself. I'll do that in the next review round as one or more suggested commits, and if you agree with those changes, I'll squash them, and maybe we can include them in the coding conventions afterwards.

comment:23 Changed 4 months ago by iwakeh

Hmm, when looking through your comment:22 I notice many questions/concerns that are due to not having an explicit webserver-access-log specification. As we need that anyway I'll open a ticket for writing such spec and add all the open assumptions about 'real' and sanitized web-server-access-logs there. I think, from such an explicit spec it will be easier to find solution for the implementational/design details.

comment:24 in reply to:  23 Changed 4 months ago by iwakeh

Replying to iwakeh:

Hmm, when looking through your comment:22 I notice many questions/concerns that are due to not having an explicit webserver-access-log specification. As we need that anyway I'll open a ticket for writing such spec and add all the open assumptions about 'real' and sanitized web-server-access-logs there. I think, from such an explicit spec it will be easier to find solution for the implementational/design details.

See ticket #23243 for more.

comment:25 in reply to:  22 Changed 4 months ago by iwakeh

Replying to karsten:

Alright, I looked at the remaining commits (in my task-22983-3 branch). I'll start with the major issues/questions in no specific order:

  1. Why do we compress previously uncompressed log files? I see the point of saving memory, but we'd be doing that by sacrificing CPU time. And if we later change to leaving file contents on disk and only storing offsets and lengths into files, it would be wasteful to store a compressed copy of the file in memory just in case the application might need it later. Ideally, we'd just store a reference to the byte[] in whatever compression we're given, including uncompressed.
  1. I already brought this up in comment 12 above but didn't see a response: "Shouldn't getRawDescriptorBytes() return the uncompressed bytes and a separate method getCompressedBytes() return the compressed bytes? Thinking of being consistent with other descriptors where getRawDescriptorBytes() returns uncompressed bytes, too. Not sure about this one." Related to this, should getRawDescriptorLength() return the length of the uncompressed byte array? (This possibly requires uncompressing the file on first invocation and storing the length in an attribute.) What do you think about changing this for the sake of library consistency?

Here and in 1. I noticed that I had some implicit assumptions about log descriptors that led to the chosen implementation. Once #23243 is answered these concerns can be addressed in a better way.

  1. Why do sanitized log lines contain a trailing -, as in: ... 403 294 \"-\" \"-\" -\n? I know that the Python script also added that trailing dash, so I'm asking if you think there's a reason to keep that. The Combined Log Format does not specify one. If you think it can go away we should quickly check with Sebastian and then take it out.

Also a discussion for the spec ticket #23243. But in general I don't see a need for the trailing dash, I only reproduced the log-lines from the python implementation.

  1. From the tests it seems like POST requests are kept, too. However, we should only keep GET and HEAD requests, just like the Python script. Likewise, 400 and 404 requests should be discarded. Maybe check for other deviations from the script yourself. And in the next review round we should compare the two sanitizers (Python and Java) using some real logs. Or do you still have logs to run some tests yourself?

I did run such tests and some of the test files are taken from the real vs. python-cleaned logs (the real ones without pi info). In #23243 we should craft the input and target formats; once that is done change implementation and tests accordingly.

  1. LogDescriptorImpl should not sort logs by default as part part of the sanitizing step. That's a specific sanitizing technique for web server logs. It might be that a future log format only requires removing certain fields but not re-ordering log lines. Maybe there should be a second method cleanLines(List<String>) in InternalLogDescriptor.Sanitizer, and the existing method should be renamed to cleanLine(String). The default sanitizer should keep the order unchanged and simply return the list it gets.

True, the re-ordering is web-server-access-log specific and should be moved.

I also found a few minor issues where it might be easiest if I fix them myself. I'll do that in the next review round as one or more suggested commits, and if you agree with those changes, I'll squash them, and maybe we can include them in the coding conventions afterwards.

Ok, if these are 'orthogonal' to the above topics, but please give #23243 a higher priority.

Last edited 4 months ago by iwakeh (previous) (diff)

comment:26 Changed 4 months ago by iwakeh

Please also merge this commit to current master, to avoid javadoc for package 'internal' on the metrics.tp.o web site.

comment:27 in reply to:  26 Changed 4 months ago by karsten

Replying to iwakeh:

Please also merge this commit to current master, to avoid javadoc for package 'internal' on the metrics.tp.o web site.

Huh, good catch. Will do tomorrow.

comment:28 Changed 4 months ago by karsten

Cherry-picked and pushed that 1-line commit.

comment:29 Changed 3 months ago by iwakeh

Copied from #23243: Try out the code on actual logs (iwakeh; karsten can make more logs available)

comment:30 Changed 3 months ago by iwakeh

Milestone: metrics-lib 2.1.0metrics-lib 2.2.0

Move to next milestone in order to accommodate interim release.

comment:31 Changed 3 months ago by iwakeh

We agreed somewhere above to store only decompressed bytes in-memory. The method public byte[] getCompressionType(); becomes obsolete in this case and will be removed. Did I overlook any use for this method?

comment:32 Changed 3 months ago by iwakeh

Status: needs_revisionneeds_information

comment:33 in reply to:  31 ; Changed 3 months ago by karsten

Replying to iwakeh:

We agreed somewhere above to store only decompressed bytes in-memory. The method public byte[] getCompressionType(); becomes obsolete in this case and will be removed. Did I overlook any use for this method?

Uhm, good question. A search on this page shows 80 matches of the word "compress". So, not sure if we agreed on this, but couldn't we store the compressed bytes in memory and decompress on-the-fly when they're requested? That would still not make a getCompressionType() mandatory, because the descriptor could as well store this type internally and not give it out. But, this is just off the top of my head, not based on re-reading the entire discussion above.

However, another thought: adding a new method is a minor API change, whereas removing a method is a major API change. Maybe, if we can live without the method for now, let's try that.

comment:34 in reply to:  33 Changed 3 months ago by iwakeh

Replying to karsten:

Replying to iwakeh:

We agreed somewhere above to store only decompressed bytes in-memory. The method public byte[] getCompressionType(); becomes obsolete in this case and will be removed. Did I overlook any use for this method?

Uhm, good question. A search on this page shows 80 matches of the word "compress". So, not sure if we agreed on this, but couldn't we store the compressed bytes in memory and decompress on-the-fly when they're requested? That would still not make a getCompressionType() mandatory, because the descriptor could as well store this type internally and not give it out. But, this is just off the top of my head, not based on re-reading the entire discussion above.

Take a look at your comment:22 number 1; especially the cpu overhead would apply with storing compressed in-memory and decompressing when needed.

However, another thought: adding a new method is a minor API change, whereas removing a method is a major API change. Maybe, if we can live without the method for now, let's try that.

Yes, let's leave it out and store decompressed in-memory, but compressed on disk, but the latter rather is a topic for collector #22428 not metrics-lib.

comment:35 Changed 3 months ago by karsten

Hmm, now I'm confused. In my comment 22 above I'm basically saying we should store what we get. So, I'd say leave out the getCompressionType() method and store in memory whatever is on disk. (Sorry, gotta run now, let's discuss more later today if needed.)

comment:36 Changed 3 months ago by iwakeh

I'm not in favor of one or the other solution, I just want to clarify differences to processing other descriptors and make sure that all the differences are intended.

Usually CollecTor's implementation determines, what is on disk; and, so far, there are only uncompressed descriptors on disk (even though the descriptors were received in compressed format from an authority).
In that sense, CollecTor will receive compressed log files, but there is still the decision to also store these in compressed format on disk.
The list of options:

  1. store logs decompressed on-disk and in-memory, descriptor supplies uncompressed bytes. (no getCompressionType() needed.)
  2. store logs compressed on-disk and decompressed in-memory, descriptor supplies uncompressed bytes. (no getCompressionType() needed.)
  3. store logs compressed on-disk and in-memory, descriptor supplies uncompressed bytes (no getCompressionType(), but internal tracking needed).
  4. store logs compressed on-disk and in-memory, descriptor supplies un- and/or compressed bytes. requires getCompressionType() for the API user to make sense of compressed bytes.
  5. ...

I think 2 is the minimal effort option currently.

comment:37 Changed 3 months ago by karsten

It's true that 2 is probably the option with minimal effort, but I'm not sure if it's efficient enough in practice when metrics-lib attempts to keep a few dozen or even a few hundred logs in memory at a time. It might be that we'll have to switch to 3 in that case. But---that switch would be an implementation change without visible change to the interface, so we could start with 2 and switch to 3 as needed. Premature optimization and all that.

comment:38 Changed 3 months ago by karsten

Summary: add a descriptor interface and implementation for web-logsAdd a Descriptor subinterface and implementation for Tor web server logs

Tweak summary.

comment:39 Changed 3 months ago by iwakeh

repeating comment:30 from #22428:

"... A bunch of real logfiles for testing would be really great! Roughly the two week amount of successive logs should be sufficient to encountered odd edge cases and have a good testing ground. These also will help testing the metrics-lib part #22983 well.
Could you provide these?"

So, this doesn't drown in the recent flood of trac changes ;-)

comment:40 Changed 3 months ago by karsten

Keywords: metrics-2018 added

comment:41 Changed 3 months ago by karsten

Keywords: metrics-2017 added; metrics-2018 removed

comment:42 Changed 3 months ago by iwakeh

Status: needs_informationneeds_review

Please review this task branch, which is based on the current master and already uses java 8 features.

The implementation changed quite a bit as it could build on the web server log specification (#23243). That's why I didn't continue on the earlier branch, because I think the code is easier to review in total here.
This dev version of metrics-lib was also system-tested as part of CollecTor's new module, i.e., used for sync and local imports there.

comment:43 Changed 2 months ago by karsten

FWIW, I'm currently reviewing this branch.

comment:44 Changed 2 months ago by karsten

Status: needs_reviewneeds_revision

Alright, here's what I found:

Commit c6c805c looks trivially correct. It comes a bit out of nowhere, so may you could add a remark to the commit message of future commits like this saying why you're making this change now. Like, did it violate a checkstyle rule? And was this the only place that was lacking a space? In any case, ready to merge this one.

Commit 3b426bf contains a couple of code style violations: the summary fragment should describe the package and not be a warning; the summary fragment is not supposed to contain markup like <h1>; there needs to be an empty line before the next <p>; and the line starting with "descriptors.</p>" is wrongly indented. I'm providing some tweaks in a fixup commit. Please review and tweak further as needed.

Commit 7822a27 has several issues that require several fixup commits and/or discussion:

  • My commit e4d6e90 in my task-22983-4 branch contains a bunch of whitespace fixes and other trivial tweaks which I believe are mostly oversights. (Unless they are not, in which case see the next item.)
  • My commit 664f540 is a mix of suggestions and tweaks. Some of the tweaks are an attempt to make your code fit better into existing code, which is either based on our style guide or on implicit conventions in existing code for which we don't have a written style guide yet. Please review that commit carefully. If you believe that some of these are non-issues, let's discuss that. I'm not saying that anything or anyone is wrong, I'm just trying to keep (heh, or make?) the overall code base as readable as possible and at the same time make future review processes even smoother.
  • I wonder if we should take out the LogDescriptor interface for several reasons: there is just one subinterface extending that, so YAGNI; the only fields/methods that it adds are the log date, and it's unclear if future logs will cover a single date or have entirely different requirements. Looks like premature interface hierarchy optimization to me.
  • I'm unclear whether this new code will cover both original and sanitized web server logs. Possibly related to that question, why is validation separated from sanitization, and is validation performed before sanitization or not? Can you clarify?
  • I may have further comments based on your answers to my questions/remarks above.

I haven't reviewed commit 929e265 in detail yet.

When you make changes, please make them as fixup commits on top of my branch. Thanks!

comment:45 Changed 2 months ago by iwakeh

While the topics raised in the above comment are worked on, I think it is save to review commit 929e265 b/c especially with regard to the test data and to check that the tests confirm to the specification.

comment:46 in reply to:  44 Changed 2 months ago by iwakeh

Replying to karsten:

Alright, here's what I found:

Thanks for the detailed review!

Commit c6c805c looks trivially correct. It comes a bit out of nowhere, so may you could add a remark to the commit message of future commits like this saying why you're making this change now. Like, did it violate a checkstyle rule? And was this the only place that was lacking a space? In any case, ready to merge this one.

The misterious space results from a rule we don't have for spaces between cast operator and variable. I made the usage in this file consistent by adding the space as the earlier cast did have one. But, we should add a checkstyle rule to make it consistent everywhere; I don't have a preference with or without space is fine.

Commit 3b426bf contains a couple of code style violations: the summary fragment should describe the package and not be a warning; the summary fragment is not supposed to contain markup like <h1>; there needs to be an empty line before the next <p>; and the line starting with "descriptors.</p>" is wrongly indented. I'm providing some tweaks in a fixup commit. Please review and tweak further as needed.

These seem fine. Thanks for fixing them and rewording the javadoc.

Commit 7822a27 has several issues that require several fixup commits and/or discussion:

  • My commit e4d6e90 in my task-22983-4 branch contains a bunch of whitespace fixes and other trivial tweaks which I believe are mostly oversights. (Unless they are not, in which case see the next item.)

I think the javadoc reformulations are fine. Some tweaks are in my new branch.
I also agree to the other few tweaks to the code or came up with something else, which I added in tiny fixup commits to make reviewing easier (hopefully)

  • My commit 664f540 is a mix of suggestions and tweaks. Some of the tweaks are an attempt to make your code fit better into existing code, which is either based on our style guide or on implicit conventions in existing code for which we don't have a written style guide yet. Please review that commit carefully. If you believe that some of these are non-issues, let's discuss that. I'm not saying that anything or anyone is wrong, I'm just trying to keep (heh, or make?) the overall code base as readable as possible and at the same time make future review processes even smoother.

Ok, things should match the old code base, but at the same time evolve to the better, or be modernized.
I admid that I have a knack for terse variable names (just two letters seem verbose sometimes ;-) but you're right in some cases, thanks for fixing.
I'd say there are also very many non-issues here, which I know are different in the existing older code base, but seem to make the code less readable to me, when I started to become familiar with the old codebase. For example, there are almost line filling variable names (not in this patch) that differ only in one to five letters. And there is the word 'Descriptor' all over, which often feels like cluttering when reading the code for the first time. The following is an example:
(all following diffs are from the 664f540 commit)

-  public static List<Descriptor> parse(byte[] raw, File file)
...
+  public static List<Descriptor> parse(byte[] rawDescriptorBytes,
+      File descriptorFile) ... 

Maybe, 'raw' alone is too terse, but 'rawBytes' seems fine whereas in 'rawDescriptorBytes' the word part 'Descriptor' overwhelms the important information. The method 'parse' receives raw bytes and tries to find a descriptor.

Here some other examples from the current patch&fixup round (we could recycle them for the guide lines, so I try to be verbose):

Renaming of isValid, here:

-    public boolean isValid(String line);
+    public boolean validateLine(String line);

makes the code less readable. For example:

-          -> null != line && !line.isEmpty() && !validator.isValid(line))
  ...
+          -> null != line && !line.isEmpty() && !validator.validateLine(line))

validateLine doesn't say that the result of the validation is returned (as a boolean). In addition, isValid(line) is more 'readable' as it 'translates' (e.g., read aloud) to "is valid line", whereas validateLine(line) results in "validate line line", which even without the duplication of line doesn't hint what happens. Similarly, sanitizeLine(line) vs. sanitize(line) (where I had clean(line) initially, but I don't mind the renaming) and postProcessLines(lines) vs. postProcessing(lines).

Why rename logBytes to rawDescriptorBytes? logBytes seems fine in a log descriptor
implementation. If I read this code for a first time I would wonder if rawDescriptorBytes
is inherited because of its generic name.

(Instead of renaming 'extension' to 'fileExtension' I'd suggest 'fileType', because that's what it is, i.e., not only an extension, which could be mistaken to be a string.)

In general, a method name shouldn't contain the parameter (e.g., better sanitize than sanitizeLine).

  • I wonder if we should take out the LogDescriptor interface for several reasons: there is just one subinterface extending that, so YAGNI; the only fields/methods that it adds are the log date, and it's unclear if future logs will cover a single date or have entirely different requirements. Looks like premature interface hierarchy optimization to me.

Logs are different from other descriptors: they are only kept compressed in the filesystem; they don't have annotations; etc. So, basing them on another interface and not directly on Descriptor is useful, because almost all Descriptor inherited methods have a different or no meaning for logs.
Actually, I would suggest to even drop getLogDateMillis() and move getLogDate() to WebServerAccessLog (see the commit in my branch). This way it is a clear marker interface for logs and for future logs the methods in common with WebServerAccessLog could be added to LogDescriptor.

  • I'm unclear whether this new code will cover both original and sanitized web server logs. Possibly related to that question, why is validation separated from sanitization, and is validation performed before sanitization or not? Can you clarify?

Some points about log handling:

  • Sanitized as well as original log lines have to be valid and have to comply to the same pattern.
  • metrics-lib cannot distinguish between sanitized and original logs. An original or sanitized log can be read using metrics-lib as long as it complies to the line pattern defined in the specification.
  • metrics-lib won't attempt sanitization of log lines.
  • CollecTor will only supply sanitized logs and uses the sanitization functionality provided by metrics-lib internal functionality. This happens by reading the log, which is automatically validated, and then calling 'sanitize()'.

When you make changes, please make them as fixup commits on top of my branch. Thanks!

All additional commits are based on your new branch and contain small changes that correspond to the findings/explanations above.

Please review seven new commits on this branch, which is rebased on the current master.

This turned into a huge comment; please try to split the reply into smaller comments, otherwise we loose track of all the different topics. Thanks!

comment:47 Changed 2 months ago by karsten

Heh, that's indeed a huge comment! And this one is going to be huge, too. However, rather than splitting it into multiple smaller comment I'll try to keep it as organized as I can by using the output of git log --reverse --pretty=oneline origin/master..iwakeh/task-22983-4 as section headers in this little essaycomment.

1232ff767375993f65a4e78d3a9383b23681925b Added a space.

This was commit c6c805c before your rebase to master.

In the future, please avoid rebasing the branch under review if you can, because it changes all commit IDs and makes it harder to refer to commits here. In most cases I'll be able to do the rebase as part of merging to master just fine.

Other than that, the explanation in your previous comment makes sense to me and this commit is good to go in.

4434316a769ea6403d2872073bc76f10a2822cf2 Added package-info.

This was commit 3b426bf before your rebase to master.

We agree on this one plus the fixup commit below, so it's good to go in.

04997b1d9b3a73d2724a2e9d39a144895ded460f Add new descriptor type for web server access logs.

This was commit 7822a27 before your rebase to master.

It's the commit that most of the subsequent fixup and squash commits will be squashed into. I'll comment on your feedback under the fixup commits below.

444b55709bef99b1f63a20e3595c1439aaa68775 Add tests for log descriptor functionality.

This was commit 929e265 before your rebase to master.

This one still needs review. That review was so far blocking on my question regarding validation vs. sanitization. I'll get to this in the next round after that question is resolved. So, still needs review!

4b3f27e93761b172dd3ee102f89c66059bd3da7a squash! Added package-info.

This was commit 1bb7a43 before your rebase to master.

You say that you agree with these tweaks, so I'll take that as a go.

be986cc88e66cb1df9772e65421462ca0570e190 fixup! Add new descriptor type for web server access logs.

This was commit e4d6e90 before your rebase to master.

You say that you're fine with these changes and have more tweaks in your branch. I'll assume this one is good to be merged then, and I'll comment on your subsequent commits further down below.

3de717268c3d27ad096cf669fc73a52f049acb59 fixup! Add new descriptor type for web server access logs.

This was commit 664f540 before your rebase to master.

This commit brought up a few topics that we need to resolve before moving on.

Variable names

Thanks for writing down your thinking about variable names in the given detail. It helps a lot, not only for this ticket but also as a guide for future tickets. Let's go through the examples:

  • Leaving out the somewhat redundant "descriptor" from rawDescriptorBytes and descriptorFile is fine with me. It's indeed obvious what's meant here.
  • I quite strongly disagree with isValid(line) as a name for a method that takes a line, tries to validate it, and returns whether that was successful. To me, isX() is the name for a getter, not the name for a method that does something with a given parameter. If this were a Line class with a method isValid() that returns whether the line is valid or not, that would be something different. But that's not the case here. For another example, consider File.delete() which deletes the file and returns whether that was successful. We wouldn't argue about renaming that method to isDeleted() just because it returns true or false. As a general rule I'd say that the name of a method that performs something should be the verb describing the action, whereas is is typically reserved for getters. Ah, and in this case it's up to the documentation to say what validate(line) returns, though that's relatively obvious.
  • Leaving out "line" in sanitizeLine(line) and friends is okay, too.
  • You're probably right about keeping logBytes rather than rawDescriptorBytes.
LogDescriptor interface

I agree with keeping LogDescriptor. But let me explain my earlier hesitation:

  • I'm not a big fan of marker interfaces. :) However, it seems that we'll soon add a LogDescriptor.LogLine subinterface, and that is (to me) a good reason to keep the LogDescriptor interface.
  • In the future we might create an AnnotatedDescriptor interface with a getAnnotations() method, a RawDescriptor interface with a getRawDescriptorBytes() and a getRawDescriptorLength() method, and a ParsedDescriptor interface with a getUnrecognizedLines() method, and a ReadDescriptor interface with a getDescriptorFile() method. And then we'll turn Descriptor into a marker interface! :) And LogDescriptor would only extend maybe ParsedDescriptor and ReadDescriptor. And we'd create a TorDescriptor interface for all relay and bridge descriptors and a RelayDescriptor and BridgeDescriptor for the various types of relay and bridge descriptors. And so on. Not part of this ticket, obviously. But it seems okay to start going in this direction now with a LogDescriptor interface.
Validation vs. sanitization

I'm still confused what validation means in this context.

Is a line containing a POST request a valid line, or one that uses FTP as protocol or that returns HTTP 404 as status code. It's okay that CollecTor skips these lines as part of the sanitizing process. But that doesn't make them invalid.

I'm also a bit uncleear if the separate validate and sanitize steps have a negative impact on performance. In theory, it should be sufficient to touch each line once. But I could be convinced that we're trading performance for better design, if this is the case.

However, I'd really want us to be clear what it means for a line to be valid!

e8a2d57b3fad2fdb94d98ba472314fd2a0fb0ada fixup! fixup! Add new descriptor type for web server access logs.

This one is fine.

3864d14c8491c6a76514a28c702c3da5d1eb5775 fixup! fixup! Add new descriptor type for web server access logs.

This one, too.

3e3dd56e317ef61296261291dc3ee29d42ebf3b8 Make code more readable by renaming method names.

The commit message does not indicate this, so I'll ask: Is this commit and are subsequent commits supposed to be squashed with 04997b1?

See my rantremarks about isValid from above. That's something I'd like us to change.

5b20be449785b1e28fd10bcb814e697cb9de178d Rename rawDescriptorBytes to logBytes.

Okay.

One thing we should do is document whether private byte[] logBytes might be compressed or not. We have been discussing that many times now, and I'm deeply confused already what we're doing there.

f5456c474f6a5a9683c6cf6ba3f791bc5d6278ec Avoid missleading name for variable.

Sounds good.

629ef152be1fd2f5a00d203b614fc01e946c518d Tweak pattern for logline validation.

One question about the pattern: Does the ?: in "^((?:\\d{1,3}\\.){3}\\d{1,3}) ..." mean that we're now ignoring the first 3 octets regardless of whether they're 0.0.0. or not? That would not follow the specification then where we claim that "If the remote hostname starts with "0.0.0.", it is kept unchanged, otherwise it's rewritten to "0.0.0.0"." Example: 1.2.3.4 would be rewritten to 0.0.0.4 if we simply ignore the first three octets, rather than 0.0.0.0. In a way, I like that approach even better, but we'd have to change the spec for that. And we should both be certain that this is what we want to do.

e24dda11613f340da3fbd6f1a93ba07d857f0b16 Remove getLogDateMillis and move getLogDate to WebServerAccessLog.

I wonder why there's no getLogDateMillis() anymore. But I can be convinced that users should just extract millisecond from LocalDate if they need them. Is that the plan? If so, green light.

6a6e93a2fd8b3f3b912ce9a258f4b32069c18ef8 (iwakeh/task-22983-4) Set dev version.

Let's revert this commit. It bumps to the wrong dev version (2.2.0-dev rather than 2.1.1-dev), and it's something we should do as part of merging to master. In fact, I bumped to 2.1.1-dev this morning. So, this commit can go away, by reverting it, not by simply dropping it.

Closing remarks

Whee! This was hard work. And we're not done yet. But we're getting closer. :) Thanks for working on this!

comment:48 Changed 4 weeks ago by iwakeh

Thanks for hanging in and working through all!

I try to shorten my reply and simply leave out all points of agreement withou any further comment. Feel free to raise them again if necessary.

Replying to karsten:

Variable names

I copied the comments about variable names to ticket #24370, which is for defining some naming guidelines all over metrics. Let's continue the discussion there.

Validation vs. sanitization

I'm still confused what validation means in this context.

Metrics-lib provides web log parsing of sanitized logs as available on CollecTor. When parsing such a log file lines need to be validated, i.e., metrics-lib verifies that these are standard sanitized log lines. Metrics-lib does not supply a general log parser.

Sanitization is only supplied as internal metrics-lib feature and used by CollecTor to sanitize the logs to be published.

Is a line containing a POST request a valid line, or one that uses FTP as protocol or that returns HTTP 404 as status code. It's okay that CollecTor skips these lines as part of the sanitizing process. But that doesn't make them invalid.

From the point of a consumer that expects sanitized log lines the POST and FTP lines are invalid.

I'm also a bit uncleear if the separate validate and sanitize steps have a negative impact on performance. In theory, it should be sufficient to touch each line once. But I could be convinced that we're trading performance for better design, if this is the case.

Touching the lines once is only a concern during sanitization. The current code only operates once on each line. Simply parsing santized logs only needs the validation part anyway.

However, I'd really want us to be clear what it means for a line to be valid!

Yes!

...
One thing we should do is document whether private byte[] logBytes might be compressed or not. We have been discussing that many times now, and I'm deeply confused already what we're doing there.

Already addressed in the respective javadoc comments. All decompressed.

629ef152be1fd2f5a00d203b614fc01e946c518d Tweak pattern for logline validation.

One question about the pattern: Does the ?: in "^((?:\\d{1,3}\\.){3}\\d{1,3}) ..." mean that we're now ignoring the first 3 octets regardless of whether they're 0.0.0. or not? That would not follow the specification ...

I will re-check that with the new spec we have from #23243.

e24dda11613f340da3fbd6f1a93ba07d857f0b16 Remove getLogDateMillis and move getLogDate to WebServerAccessLog.

I wonder why there's no getLogDateMillis() anymore. But I can be convinced that users should just extract millisecond from LocalDate if they need them. Is that the plan? If so, green light.

Yep.

6a6e93a2fd8b3f3b912ce9a258f4b32069c18ef8 (iwakeh/task-22983-4) Set dev version.

Let's revert this commit. ...

I reverted this, but instead would like to add the '2.1.1-dev' commit. (Mainly for CollecTor #22428 implementation and testing.)

Next steps:

The latest changes in #23243 also makes a revision necessary. I'll use the current branch here to add the changes.
So far the only open topic is your question about validation&sanitization. The above answer might have resolved it a bit. Maybe, it'll also be more clear in the upcoming commit(s) for the spec changes.

comment:49 Changed 4 weeks ago by iwakeh

Clarification regarding the pattern question:

The pattern start ^((?:\\d{1,3}\\.){3}\\d{1,3}) ... results in one group

              ^((?:\\d{1,3}\\.){3}\\d{1,3})
               ^--------------------------^

The ?: prevents sub-groups.

The pattern is used for extracting the various parts of a log line and these are further checked.
The code follows the spec

if (!res.ip.startsWith("0.0.0.")) {
          res.ip = "0.0.0.0";
}

Spec:

 If the remote hostname starts with "0.0.0.", it is kept unchanged, otherwise it's rewritten to "0.0.0.0".

(cf.test cases)

comment:50 Changed 4 weeks ago by karsten

Agreed with all points above except one:

When parsing sanitized log lines metrics-lib should not reject log lines that it would discard when sanitizing original log lines.

It's not the job of the parser to ensure that its input is properly sanitized or to do some sort of post-sanitizing. Of course it needs to perform some basic format verifications to perform its job. But dropping lines because the sanitizer would drop them seems out of place.

Imagine a hypothetical situation where we decide at some point that HEAD requests are too sensitive and we take them out in the parser. However, previously sanitized logs would still contain them, including archives that people keep locally and that we can't update. If somebody then takes a recent metrics-lib version to parse their data, they'd suddenly don't get the HEAD lines anymore. That would be rather confusing.

I think sanitizing and parsing should be separate things. In this case, discarding lines because of certain field contents should be left to the sanitizer.

Does that mean we should provide a general-purpose log parser? Probably not. In the parser we don't have to provide getters for fields that we don't care about, like user-agent string. But we should be prepared to find request methods GET, HEAD, POST, or really anything else in log lines we're given.

Does that make sense, or am I overlooking something?

(By the way, it's a good thing that we're keeping the spec unchanged with regard to IP addresses not starting with 0.0.0.. I think it would have been pretty bad to just rewrite the first three octets to 0.0.0 and keep the fourth unchanged. Not very privacy-preserving.)

comment:51 in reply to:  50 Changed 4 weeks ago by iwakeh

Replying to karsten:

Agreed with all points above except one:

Fine.

...
Does that mean we should provide a general-purpose log parser? Probably not.

This is my main concern here.

... In the parser we don't have to provide getters for fields that we don't care about, like user-agent string. But we should be prepared to find request methods GET, HEAD, POST, or really anything else in log lines we're given.

Maybe, the general rule here could be that the parser/validator accepts the format in a backward compatible way, i.e., log lines that have been valid at some point will be parsed/valid in future? This would enable parsing of HEAD from your example above and also accommodate changes like accepting FTP in future with the benefit of having a defined pattern of valid log lines that is more restrictive than a general log parser.

Last edited 4 weeks ago by iwakeh (previous) (diff)

comment:52 Changed 4 weeks ago by karsten

Hmm, why are you concerned about providing a general-purpose log parser? It doesn't claim to be that, so if it doesn't parse FTP lines correctly, so what. It's supposed to be a tool that helps with parsing sanitized log lines. If it doesn't support parsing something in logs that never occurs in sanitized logs, we can fix that if we want, but we don't have to. I'm just saying that we shouldn't make it overly restrictive by rejecting lines that it could easily handle.

comment:53 in reply to:  52 Changed 4 weeks ago by iwakeh

Replying to karsten:

Hmm, why are you concerned about providing a general-purpose log parser? It doesn't claim to be that, so if it doesn't parse FTP lines correctly, so what. It's supposed to be a tool that helps with parsing sanitized log lines. If it doesn't support parsing something in logs that never occurs in sanitized logs, we can fix that if we want, but we don't have to.

This is the distinction I'm concerned about. So, all settled.

Note: See TracTickets for help on using tickets.