Opened 4 years ago

Closed 2 years ago

#16151 closed enhancement (implemented)

Add new descriptor source to fetch descriptors from CollecTor via https

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/Library Version:
Severity: Normal Keywords:
Cc: iwakeh Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

We changed CollecTor almost a year ago to provide descriptors via https rather than rsync. We even turned off rsync a few months later. But we failed to provide a means for metrics-lib based applications to fetch descriptors via https. We should add that now.

Child Tickets

Change History (15)

comment:1 Changed 4 years ago by karsten

Cc: iwakeh added
Status: newneeds_review

comment:3 Changed 4 years ago by karsten

Yes, that's the most up-to-date commit. I'm planning to add another class, DescriptorWriter, that generalizes writing descriptors to a local directory structure, which would be necessary for Onionoo to fetch descriptors from CollecTor and cache them locally before processing them with DescriptorReader. But I'm not planning to change the DescriptorCollector code in the next few days except for incorporating feedback. Thanks!

comment:4 Changed 4 years ago by iwakeh

Is there a task already for switching to java 7 and adding some logging framework?
These two comprise my "ceterum censeo ..." :-)
In general, could this change be also used to softly introduce some logging into metrics-lib?

Now the code comments/questions:
(I hope I didn't miss any new test cases ...)
It might be helpful to add some tests for DescriptorCollectorImpl that explicit test for certain behavior.
These would be a very good source of documentation: e.g. take the comment
inside "putTogetherHistory". A test could explicitly verify the functionality
described in the comments.

DescriptorCollector interface:
Why not turn the comments into javadoc? They are there already and very
readable. Just adding the tags for return values and parameters is not a big deal.

Concerning DescriptorCollectorImpl:
Let's try to avoid the TODOs here from the beginning ;-)
For the various constants (COLLECTOR_BASE_URL, CONNECT_TIMEOUT_MILLIS, READ_TIMEOUT_MILLIS)
I would suggest using a system property (defaulting to the current values) as for the class
names in DescriptorSourceFactory using

onionoo.collector.url
onionoo.collector.timeout.read
onionoo.collector.timeout.write

as property names. Then it will be configurable immediately.
The retrieval of the 'int' properties needs just one extra method that returns
either the converted property value or logs an error and returns the default.
The URL could also be verified, if there is time to implement it now.

The other TODO is the Apache's directory list parsing: the regex parsing
approach is straigthforward and ok, but I'd make it testable and definitly
log any problems. So a method 'addDirsFromListing(SortedMap<String, Long> list)'
with the code from lines 177-200 and the additional logging might be useful.

Well, and the test for parsing directory string needs to be added.
Tests for the property handling above would be helpful, too.

What about adding the @Override annotation where appropriate?

comment:5 Changed 4 years ago by karsten

Replying to iwakeh:

Thanks for the extensive review!

Before I respond, I must confess that I'm currently thinking about making a major change to DescriptorCollector: the way it works right now is that it fetches descriptors, parses them, hands them over to you, and you do with them whatever you want. But most applications would want to cache descriptors locally before processing them, and in order to do that, we'll need a DescriptorWriter.

I started writing such a class and ran into three problems:

  1. We need to distinguish whether a ServerDescriptor was published by a relay or a bridge. I think we can do that by looking at router-signature vs. router-digest, and I even wrote code for that, but that needs testing. So, not a big deal.
  2. In order to write a Descriptor to disk, we sometimes need additional information. For example, a Microdescriptor doesn't contain a publication time, so we'd have to provide that seperately if we want to write descriptors to separate directories per month. Also, BridgeNetworkStatus doesn't contain the fingerprint of the bridge authority. Solving these issues is possible, but not pretty.
  3. Here's the party killer: if we want to use DescriptorCollector to mirror CollecTor's descriptor archive tarballs, we'd instead receive all descriptors contained in tarballs and would then have to write them to new tarballs. In that use case we don't want to parse descriptors, which would even slow us down a lot.

So, I wonder if we should change DescriptorCollector to simply synchronize files from the CollecTor website to a local directory without parsing descriptors at all. Then we can use DescriptorReader as usual to process them. The only use case that this wouldn't cover is when we want to process CollecTor's descriptors on-the-fly without having them touch the disk, but that doesn't seem important enough to force the other use cases into this model.

Thoughts?

Let me also reply inline.

Is there a task already for switching to java 7 and adding some logging framework?
These two comprise my "ceterum censeo ..." :-)
In general, could this change be also used to softly introduce some logging into metrics-lib?

Both are good ideas, but let's do them when all services are migrated away from the dying yatei host.

Now the code comments/questions:
(I hope I didn't miss any new test cases ...)
It might be helpful to add some tests for DescriptorCollectorImpl that explicit test for certain behavior.
These would be a very good source of documentation: e.g. take the comment
inside "putTogetherHistory". A test could explicitly verify the functionality
described in the comments.

Agreed. For now, I'd risk doing the testing by setting up the services locally with the new metrics-lib and seeing if they still work okay. But yes, we should add tests.

DescriptorCollector interface:
Why not turn the comments into javadoc? They are there already and very
readable. Just adding the tags for return values and parameters is not a big deal.

Sure, why not. I'll fix that. We might later want to go through the other classes and clean up docs there, too. But there's no harm in starting to do this with newly added code.

Concerning DescriptorCollectorImpl:
Let's try to avoid the TODOs here from the beginning ;-)
For the various constants (COLLECTOR_BASE_URL, CONNECT_TIMEOUT_MILLIS, READ_TIMEOUT_MILLIS)
I would suggest using a system property (defaulting to the current values) as for the class
names in DescriptorSourceFactory using

onionoo.collector.url
onionoo.collector.timeout.read
onionoo.collector.timeout.write

as property names. Then it will be configurable immediately.
The retrieval of the 'int' properties needs just one extra method that returns
either the converted property value or logs an error and returns the default.
The URL could also be verified, if there is time to implement it now.

The current way of configuring DescriptorCollector is by calling methods on it after it's created and before it starts collecting. I'd rather want to make things configurable that way. It's actually easy to do that, so we can change those TODOs into methods.

I'd also want to have a more comprehensive plan for using properties in metrics-lib. It's quite possible that using them is a better idea than the current way of configuring things. But I'd want to have a high-level overview of that first.

The other TODO is the Apache's directory list parsing: the regex parsing
approach is straigthforward and ok, but I'd make it testable and definitly
log any problems. So a method 'addDirsFromListing(SortedMap<String, Long> list)'
with the code from lines 177-200 and the additional logging might be useful.

Will look into that. Making the code more testable sounds great.

Well, and the test for parsing directory string needs to be added.

Agreed.

Tests for the property handling above would be helpful, too.

(See above.)

What about adding the @Override annotation where appropriate?

Sure, if it's missing that's an oversight. I'll add that.

So, I'm curious to hear what you think about my suggestion above. Sorry for letting you review the current code and then suggesting to change it, but I didn't envision to run into all those problems. Thanks again!

comment:6 Changed 4 years ago by karsten

There, I pushed another commit to the task-16151 branch in my public repository that implements my previous suggestion and incorporates some of your feedback. I also "tested" it with a modified ExoneraTor version which worked just fine.

For reference, here's how the modified ExoneraTor uses this new code:

  private static void fetchDescriptors() {
    DescriptorCollector collector =
        DescriptorSourceFactory.createDescriptorCollector();
    collector.setLocalDirectory(new File(importDirString));
    collector.addRemoteDirectory(
        "/recent/relay-descriptors/consensuses/");
    collector.addRemoteDirectory(
        "/recent/relay-descriptors/server-descriptors/");
    collector.addRemoteDirectory("/recent/exit-lists/");
    collector.collectRemoteFiles();
    collector.deleteExtraneousLocalFiles();
  }

Another review would be much appreciated.

comment:7 Changed 4 years ago by iwakeh

Just mirroring the directory and handling the data later does seem to be the better approach.

Thanks, for providing the ExoneraTor use-case!

Here some comments and questions:

  1. In my opinion/experience it usually pays off very well to write tests immediately when introducing regex-matching. This would document precisely what type of directory is expected and help with furture additions and changes.
  1. As far a I can tell, you only want the "DescriptorCollector" to run once per instance? Why not just offer an interface consisting of one possibly overloaded method maybe with an additional boolean for the extraneous-file-cleanup (directories could be handled as List<String> or as String array)? If the method signature is too long, one could introduce a simple configuration class "CollectorConfiguration" as parameter for "collectRemoteFiles".
  1. Is the repeated call of "collectRemoteFiles" not a valid use case?
  1. Should the file-cleanup not be performed after each collector run?
  1. The return value of "parseDirectoryListing" is not used; and in case of a problem nothing is logged (or printed to System.err). Same with "fetchRemoteFile", here it might be even more valuable to know why the download failed.
  1. If the above suggestion (2.) is not used, the code might be a little more readable having the following method
    private void ifRunningThrowIllegalState(String msg){
      if (this.hasStartedCollecting) {
        throw new IllegalStateException(msg);
      }
    }
    

replace all of the if (this.hasStartedCollecting) ... calls; it would also be easier to maintain.

comment:8 Changed 4 years ago by karsten

Those are good suggestions!

I changed the interface towards a single method collectRemoteFiles() with five parameters. I took out the timeouts entirely, because I don't consider them important at all; having them in the previous interface was okay, but this interface is better with as few parameters as possible. We could have used overloaded methods, but these five parameters all looked reasonable to have in the interface.

Please find the latest commit in the updated branch task-16151 in my public repository. It comes with tests!

And here's the updated ExoneraTor method that uses this code:

  private static void fetchDescriptors() {
    DescriptorCollector collector =
        DescriptorSourceFactory.createDescriptorCollector();
    collector.collectRemoteFiles("https://collector.torproject.org",
        new String[] { "/recent/relay-descriptors/consensuses/",
        "/recent/relay-descriptors/server-descriptors/",
        "/recent/exit-lists/" }, 0L, new File(importDirString), true);
  }

I'm going to merge soon. If you have more suggestions, in particular on the interface (which would be harder to change later), please do tell! Thanks!

comment:9 Changed 4 years ago by iwakeh

Yeah, the changes look much cleaner now!

Just one issue and one question about the test cases:

The changes to the "remoteDirectories" array cannot be inferred from the interface-contract and could cause unwanted side effects for the caller of "collectRemoteFiles".

Maybe, rather introduce a new method 'slashBeginAndEnd' (I cannot think of a better name):

private final static String SLASH = "/";
private final static String EMPTY = ""; 

public static String slashBeginAndEnd(String dir){
  return (dir.startsWith(SLASH) ? EMPTY : SLASH)  + dir
      + (dir.endsWith(SLASH) ? EMPTY : SLASH);
}

remove lines 39-44, and change line 105 to

String directoryUrl = collecTorBaseUrl + slashBeginAndEnd(remoteDirectory);

Good to have the tests!
What should happen if the same file is listed twice?
And, a missmatch between directory date and file date?
E.g. combined in one test string something like:

"<a href=\"2015-05-24-12-00-00-consensus\">"
        + "2015-05-24-12-00-00-consensus</a></td>"
        + "<td align=\"right\">27-May-2015 14:07   </td>"
        + "<a href=\"2015-05-24-12-00-00-consensus\">"
        + "2015-05-24-12-00-01-consensus</a></td>"
        + "<td align=\"right\">27-May-2015 14:19 </td>"
        + "<a href=\"2015-05-24-12-00-00-consensus\">"
        + "2015-05-24-12-00-00-consensus</a></td>"
        + "<td align=\"right\">26-May-2015 14:07   </td>"


comment:10 in reply to:  9 Changed 4 years ago by karsten

Replying to iwakeh:

The changes to the "remoteDirectories" array cannot be inferred from the interface-contract and could cause unwanted side effects for the caller of "collectRemoteFiles".

Oh, good point, didn't think about that. Fixed.

What should happen if the same file is listed twice?

That would be a bug in the input data. The current behavior is that the last entry wins. I wrote a test case for this.

And, a missmatch between directory date and file date?

We only look at the link in <a href="...">, not at the displayed filename. If the two differ, that's also a bug in the input data and nothing we should worry about.

Okay, sounds like we agree on the interface. Will merge tomorrow. Happy to add more tests later. Thanks again for your reviews! Very helpful!

comment:11 Changed 4 years ago by karsten

Status: needs_reviewneeds_information

Squashed and merged to master. I'll leave this ticket open for another day or two in case you have more ideas. Thanks again!

comment:12 Changed 4 years ago by iwakeh

All seems fine.

Just one more question:
When receiving something other than a 200 response code while fetching the directories or files shouldn't that be reported in some way? Things like redirects, bad requests, not authorized, or others might be important for the caller?

comment:13 Changed 4 years ago by karsten

Good question. I just opened the more general ticket #16225 to come up with an answer. Let's leave this ticket open until we have one.

comment:14 Changed 3 years ago by iwakeh

Severity: Normal

see #20323

comment:15 in reply to:  13 Changed 2 years ago by karsten

Resolution: implemented
Status: needs_informationclosed

Replying to karsten:

Good question. I just opened the more general ticket #16225 to come up with an answer. Let's leave this ticket open until we have one.

While going through all open metrics-lib tickets I found that keeping tickets open that have lots of comments doesn't really serve us well. The question above is still a good one, but it's unnecessary to read the bulk of the discussion above to find an answer to it. I copied over the question to #16225, so that we don't miss it, and now I'm finally closing this ticket as implemented.

Note: See TracTickets for help on using tickets.