Add new descriptor source to fetch descriptors from CollecTor via https

added component::metrics/library owner::karsten priority::medium resolution::implemented severity::normal status::closed type::enhancement labels

Please somebody review branch task-16151 in my public repository.

Trac:
Cc: N/A to iwakeh
Status: new to needs_review

I'll take a look at it this weekend.

Is this commit https://gitweb.torproject.org/user/karsten/metrics-lib.git/commit/?h=task-16151&id=44a90638ffb9f41741ef1b60189441a636ba5ca0 the most up-to-date?

Yes, that's the most up-to-date commit. I'm planning to add another class, DescriptorWriter, that generalizes writing descriptors to a local directory structure, which would be necessary for Onionoo to fetch descriptors from CollecTor and cache them locally before processing them with DescriptorReader. But I'm not planning to change the DescriptorCollector code in the next few days except for incorporating feedback. Thanks!

Is there a task already for switching to java 7 and adding some logging framework? These two comprise my "ceterum censeo ..." :-) In general, could this change be also used to softly introduce some logging into metrics-lib?

Now the code comments/questions: (I hope I didn't miss any new test cases ...) It might be helpful to add some tests for DescriptorCollectorImpl that explicit test for certain behavior. These would be a very good source of documentation: e.g. take the comment inside "putTogetherHistory". A test could explicitly verify the functionality described in the comments.

DescriptorCollector interface: Why not turn the comments into javadoc? They are there already and very readable. Just adding the tags for return values and parameters is not a big deal.

Concerning DescriptorCollectorImpl: Let's try to avoid the TODOs here from the beginning ;-) For the various constants (COLLECTOR_BASE_URL, CONNECT_TIMEOUT_MILLIS, READ_TIMEOUT_MILLIS) I would suggest using a system property (defaulting to the current values) as for the class names in DescriptorSourceFactory using

onionoo.collector.url
onionoo.collector.timeout.read
onionoo.collector.timeout.write

as property names. Then it will be configurable immediately. The retrieval of the 'int' properties needs just one extra method that returns either the converted property value or logs an error and returns the default. The URL could also be verified, if there is time to implement it now.

The other TODO is the Apache's directory list parsing: the regex parsing approach is straigthforward and ok, but I'd make it testable and definitly log any problems. So a method 'addDirsFromListing(SortedMap<String, Long> list)' with the code from lines 177-200 and the additional logging might be useful.

Well, and the test for parsing directory string needs to be added. Tests for the property handling above would be helpful, too.

What about adding the @Override annotation where appropriate?

Replying to iwakeh:

Thanks for the extensive review!

Before I respond, I must confess that I'm currently thinking about making a major change to DescriptorCollector: the way it works right now is that it fetches descriptors, parses them, hands them over to you, and you do with them whatever you want. But most applications would want to cache descriptors locally before processing them, and in order to do that, we'll need a DescriptorWriter.

I started writing such a class and ran into three problems:

We need to distinguish whether a ServerDescriptor was published by a relay or a bridge. I think we can do that by looking at router-signature vs. router-digest, and I even wrote code for that, but that needs testing. So, not a big deal.
In order to write a Descriptor to disk, we sometimes need additional information. For example, a Microdescriptor doesn't contain a publication time, so we'd have to provide that seperately if we want to write descriptors to separate directories per month. Also, BridgeNetworkStatus doesn't contain the fingerprint of the bridge authority. Solving these issues is possible, but not pretty.
Here's the party killer: if we want to use DescriptorCollector to mirror CollecTor's descriptor archive tarballs, we'd instead receive all descriptors contained in tarballs and would then have to write them to new tarballs. In that use case we don't want to parse descriptors, which would even slow us down a lot.

So, I wonder if we should change DescriptorCollector to simply synchronize files from the CollecTor website to a local directory without parsing descriptors at all. Then we can use DescriptorReader as usual to process them. The only use case that this wouldn't cover is when we want to process CollecTor's descriptors on-the-fly without having them touch the disk, but that doesn't seem important enough to force the other use cases into this model.

Thoughts?

Let me also reply inline.

Is there a task already for switching to java 7 and adding some logging framework? These two comprise my "ceterum censeo ..." :-) In general, could this change be also used to softly introduce some logging into metrics-lib?

Both are good ideas, but let's do them when all services are migrated away from the dying yatei host.

Now the code comments/questions: (I hope I didn't miss any new test cases ...) It might be helpful to add some tests for DescriptorCollectorImpl that explicit test for certain behavior. These would be a very good source of documentation: e.g. take the comment inside "putTogetherHistory". A test could explicitly verify the functionality described in the comments.

Agreed. For now, I'd risk doing the testing by setting up the services locally with the new metrics-lib and seeing if they still work okay. But yes, we should add tests.

DescriptorCollector interface: Why not turn the comments into javadoc? They are there already and very readable. Just adding the tags for return values and parameters is not a big deal.

Sure, why not. I'll fix that. We might later want to go through the other classes and clean up docs there, too. But there's no harm in starting to do this with newly added code.

Concerning DescriptorCollectorImpl: Let's try to avoid the TODOs here from the beginning ;-) For the various constants (COLLECTOR_BASE_URL, CONNECT_TIMEOUT_MILLIS, READ_TIMEOUT_MILLIS) I would suggest using a system property (defaulting to the current values) as for the class names in DescriptorSourceFactory using {{{ onionoo.collector.url onionoo.collector.timeout.read onionoo.collector.timeout.write }}} as property names. Then it will be configurable immediately. The retrieval of the 'int' properties needs just one extra method that returns either the converted property value or logs an error and returns the default. The URL could also be verified, if there is time to implement it now.

The current way of configuring DescriptorCollector is by calling methods on it after it's created and before it starts collecting. I'd rather want to make things configurable that way. It's actually easy to do that, so we can change those TODOs into methods.

I'd also want to have a more comprehensive plan for using properties in metrics-lib. It's quite possible that using them is a better idea than the current way of configuring things. But I'd want to have a high-level overview of that first.

The other TODO is the Apache's directory list parsing: the regex parsing approach is straigthforward and ok, but I'd make it testable and definitly log any problems. So a method 'addDirsFromListing(SortedMap<String, Long> list)' with the code from lines 177-200 and the additional logging might be useful.

Will look into that. Making the code more testable sounds great.

Well, and the test for parsing directory string needs to be added.

Agreed.

Tests for the property handling above would be helpful, too.

(See above.)

What about adding the @Override annotation where appropriate?

Sure, if it's missing that's an oversight. I'll add that.

So, I'm curious to hear what you think about my suggestion above. Sorry for letting you review the current code and then suggesting to change it, but I didn't envision to run into all those problems. Thanks again!

There, I pushed another commit to the task-16151 branch in my public repository that implements my previous suggestion and incorporates some of your feedback. I also "tested" it with a modified ExoneraTor version which worked just fine.

For reference, here's how the modified ExoneraTor uses this new code:

  private static void fetchDescriptors() {
    DescriptorCollector collector =
        DescriptorSourceFactory.createDescriptorCollector();
    collector.setLocalDirectory(new File(importDirString));
    collector.addRemoteDirectory(
        "/recent/relay-descriptors/consensuses/");
    collector.addRemoteDirectory(
        "/recent/relay-descriptors/server-descriptors/");
    collector.addRemoteDirectory("/recent/exit-lists/");
    collector.collectRemoteFiles();
    collector.deleteExtraneousLocalFiles();
  }

Another review would be much appreciated.

Just mirroring the directory and handling the data later does seem to be the better approach.

Thanks, for providing the ExoneraTor use-case!

Here some comments and questions:

In my opinion/experience it usually pays off very well to write tests immediately when introducing regex-matching. This would document precisely what type of directory is expected and help with furture additions and changes.
As far a I can tell, you only want the "DescriptorCollector" to run once per instance? Why not just offer an interface consisting of one possibly overloaded method maybe with an additional boolean for the extraneous-file-cleanup (directories could be handled as List or as String array)? If the method signature is too long, one could introduce a simple configuration class "CollectorConfiguration" as parameter for "collectRemoteFiles".
Is the repeated call of "collectRemoteFiles" not a valid use case?
Should the file-cleanup not be performed after each collector run?
The return value of "parseDirectoryListing" is not used; and in case of a problem nothing is logged (or printed to System.err). Same with "fetchRemoteFile", here it might be even more valuable to know why the download failed.
If the above suggestion (2.) is not used, the code might be a little more readable having the following method

private void ifRunningThrowIllegalState(String msg){
  if (this.hasStartedCollecting) {
    throw new IllegalStateException(msg);
  }
}

replace all of the if (this.hasStartedCollecting) ... calls; it would also be easier to maintain.

Those are good suggestions!

I changed the interface towards a single method collectRemoteFiles() with five parameters. I took out the timeouts entirely, because I don't consider them important at all; having them in the previous interface was okay, but this interface is better with as few parameters as possible. We could have used overloaded methods, but these five parameters all looked reasonable to have in the interface.

Please find the latest commit in the updated branch task-16151 in my public repository. It comes with tests!

And here's the updated ExoneraTor method that uses this code:

  private static void fetchDescriptors() {
    DescriptorCollector collector =
        DescriptorSourceFactory.createDescriptorCollector();
    collector.collectRemoteFiles("https://collector.torproject.org",
        new String[] { "/recent/relay-descriptors/consensuses/",
        "/recent/relay-descriptors/server-descriptors/",
        "/recent/exit-lists/" }, 0L, new File(importDirString), true);
  }

I'm going to merge soon. If you have more suggestions, in particular on the interface (which would be harder to change later), please do tell! Thanks!

Yeah, the changes look much cleaner now!

Just one issue and one question about the test cases:

The changes to the "remoteDirectories" array cannot be inferred from the interface-contract and could cause unwanted side effects for the caller of "collectRemoteFiles".

Maybe, rather introduce a new method 'slashBeginAndEnd' (I cannot think of a better name):

private final static String SLASH = "/";
private final static String EMPTY = ""; 

public static String slashBeginAndEnd(String dir){
  return (dir.startsWith(SLASH) ? EMPTY : SLASH)  + dir
      + (dir.endsWith(SLASH) ? EMPTY : SLASH);
}

remove lines 39-44, and change line 105 to

String directoryUrl = collecTorBaseUrl + slashBeginAndEnd(remoteDirectory);

Good to have the tests! What should happen if the same file is listed twice? And, a missmatch between directory date and file date? E.g. combined in one test string something like:

"<a href=\"2015-05-24-12-00-00-consensus\">"
        + "2015-05-24-12-00-00-consensus</a></td>"
        + "<td align=\"right\">27-May-2015 14:07   </td>"
        + "<a href=\"2015-05-24-12-00-00-consensus\">"
        + "2015-05-24-12-00-01-consensus</a></td>"
        + "<td align=\"right\">27-May-2015 14:19 </td>"
        + "<a href=\"2015-05-24-12-00-00-consensus\">"
        + "2015-05-24-12-00-00-consensus</a></td>"
        + "<td align=\"right\">26-May-2015 14:07   </td>"

Replying to iwakeh:

The changes to the "remoteDirectories" array cannot be inferred from the interface-contract and could cause unwanted side effects for the caller of "collectRemoteFiles".

Oh, good point, didn't think about that. Fixed.

What should happen if the same file is listed twice?

That would be a bug in the input data. The current behavior is that the last entry wins. I wrote a test case for this.

And, a missmatch between directory date and file date?

We only look at the link in <a href="...">, not at the displayed filename. If the two differ, that's also a bug in the input data and nothing we should worry about.

Okay, sounds like we agree on the interface. Will merge tomorrow. Happy to add more tests later. Thanks again for your reviews! Very helpful!

Squashed and merged to master. I'll leave this ticket open for another day or two in case you have more ideas. Thanks again!

Trac:
Status: needs_review to needs_information

All seems fine.

Just one more question: When receiving something other than a 200 response code while fetching the directories or files shouldn't that be reported in some way? Things like redirects, bad requests, not authorized, or others might be important for the caller?

Good question. I just opened the more general ticket #16225 (moved) to come up with an answer. Let's leave this ticket open until we have one.

see #20323 (moved)

Trac:
Severity: N/A to Normal
Reviewer: N/A to N/A
Sponsor: N/A to N/A

Replying to karsten:

Good question. I just opened the more general ticket #16225 (moved) to come up with an answer. Let's leave this ticket open until we have one.

While going through all open metrics-lib tickets I found that keeping tickets open that have lots of comments doesn't really serve us well. The question above is still a good one, but it's unnecessary to read the bulk of the discussion above to find an answer to it. I copied over the question to #16225 (moved), so that we don't miss it, and now I'm finally closing this ticket as implemented.

Trac:
Status: needs_information to closed
Resolution: N/A to implemented

closed

mentioned in issue #16225 (moved)

mentioned in issue #20323 (moved)

Add new descriptor source to fetch descriptors from CollecTor via https

Child items 0

Activity