Opened 3 years ago

Closed 3 years ago

#19791 closed defect (implemented)

Use CollecTor's index.json for download ('''Alpha''' version); adapt current download to use new date format

Reported by: karsten Owned by: iwakeh
Priority: High Milestone: metrics-lib 1.4.0
Component: Metrics/Library Version:
Severity: Normal Keywords:
Cc: iwakeh Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

We're parsing dates from the CollecTor's Apache directory listings to decide whether or not to fetch a remote file. The CollecTor host was recently upgraded from wheezy to jessie, which apparently changed the date format from dd-MMM-yyyy to yyyy-MM-dd. Adapt to this change.

Obviously, parsing dates like this is very fragile. We should soon switch to using CollecTor's index.json file instead, ideally before the next release.

Child Tickets

Change History (22)

comment:1 Changed 3 years ago by karsten

Status: newneeds_review

Please find my branch task-19791 for a quick fix. Switching to index.json would require a somewhat bigger change, but we should prioritize this over other things and then release 1.3.1 this week.

comment:2 Changed 3 years ago by iwakeh

Status: needs_reviewmerge_ready

The change looks ok as first band-aid.

The switch to using index.json is indeed urgent.

comment:3 Changed 3 years ago by karsten

Status: merge_readyassigned

Thanks, pushed and re-setting to assigned to work on the switch to index.json.

comment:4 Changed 3 years ago by iwakeh

Owner: changed from karsten to iwakeh

comment:5 Changed 3 years ago by iwakeh

Birds-view description of approach for reading from index.json:

  • move currently private node-classes IndexNode, FileNode, DirectoryNode to metrics-lib org.torproject.descriptor.index.
  • provide metrics-lib with a reading interface and implementation
  • make collector use metrics-lib node-classes

What needs to be added/altered?

comment:6 Changed 3 years ago by iwakeh

Milestone: metrics-lib 1.4.0

Added to milestone 1.4.0

comment:7 Changed 3 years ago by iwakeh

Second part of reading from index.json:

  • create another implementation of DescriptorCollector, e.g. org.torproject.descriptor.index.IndexCollector (the index package contains all the necessary classes, cf. comment:5)
  • keep the old implementation as default; use descriptor.collector for setting the new implementation
  • switch default setting to the new implementation.
  • remove current implementation.

The final two steps of switching the implementation and removing the current one are not necessarily part of this ticket. The implementations can co-exist, but removing the current one reduces maintenance.

comment:8 Changed 3 years ago by iwakeh

Summary: Adapt to CollecTor's new date formatUse CollecTor's index.json for download; adapt current download to use new date format

Adapted title to describe what this ticket is about.

comment:9 Changed 3 years ago by iwakeh

Status: assignedneeds_review

Please review this branch (the four commits on top of slf4j).

There are also many tests now and coverage is reasonably high everywhere regarding new classes.

Please review carefully as I couldn't base the implementation on existing tests.

Setting -Ddescriptor.collector=org.torproject.descriptor.index.DescriptorIndexCollector will use the new descriptor collector method. Without the property setting the old method is used.

The json implementation is encapsulated in IndexNode. One test reads and writes a current index.json from the main CollecTor instance (see test.json in test/resources). The various compression methods are contained in FileType, where it is easy to add more in future, if necessary.

This json writing/reading will be used in the next CollecTor, too.

comment:10 Changed 3 years ago by karsten

Status: needs_reviewneeds_revision

Alright, let's go through the last four commits in that branch:

91fdb9b is already contained in master as 1dab40d.

a74563a looks good and is merged to my branch (see below) as 7a0e109.

1276c14 looks good, too, and is merged to my branch as e9c0731. Thanks for caring about tests!

32f628f is preliminary merged to my branch as 82eb1cf, but I have a couple of questions and change requests, as follows:

Commit bf5ba78 in my task-19791 branch contains a few trivial fixes. Please take a look and let me know if you're okay with those.

Regarding the new subpackage org.torproject.descriptor.index, I'm yet unclear whether that's supposed to be "visible" to API users or not, that is, if they're supposed to import and/or use those classes directly. If it is, that's going to grow the overall interface quite a bit, requiring us to keep the public parts stable in subsequent releases. For example, I'm unclear whether CollecTor is supposed to instantiate its own *Node classes when writing its index.json* files. Ideally, metrics-lib would hide away those classes by providing some kind of DescriptorIndexGenerator interface that takes a local directory as parameter and writes one or more index.json[*] files, but I'm not sure what your plan is there. If the plan is to hide that package from users, how about we rename it to org.torproject.descriptor.impl.index to make that clearer? We could even create similar implementation subpackages in the future to clean up the impl package a bit, but that's not something that users would have to care about.

Can we avoid changing the parameter usage of DescriptorIndexCollector#collectDescriptors() by overloading that method in the interface? How about we add a new method collectDescriptor(String collecTorIndexUrl, String collecTorIndexPath, String[] remoteDirectories, ...) and use an implementation-specific default for collecTorIndexPath in the current method, such as "/index.html" for DescriptorCollectorImpl and "/index.json.xz" for DescriptorIndexCollector? If we retain backward compatibility here, I'd say we could switch to DescriptorIndexCollector in 1.5.0 when this class has seen some more testing. Otherwise I'd suggest waiting for 2.0.0.

The condition if (!filepath.exists() && filepath.mkdirs()) in DescriptorIndexCollector#fetchRemoteFiles() looks wrong. Shouldn't it be !filepath.mkdirs() there? I didn't test this, but the API says that mkdirs() returns "true if and only if the directory was created, along with all necessary parent directories; false otherwise".

Is FileNode#lastModifiedMillis() called often? If so, it may be better to use ParseHelper#getDateFormat() to obtain a thread-safe DateFormat instance for the given format and store it in a map. That avoids creating single-use instances of that class over and over. We'd have to make that method public for this, but that shouldn't be an issue. However, if it's not called very often, we can skip this.

In IndexNode#retrieveFiles() I'm not yet clear what situations lead to returning the half-done map in the middle of the loop. If one invalid remote directory leads to that case, I think we should either warn and skip that remote directory or return an empty map. Otherwise, the order of remote directories would affect the result, and that might lead to bugs in user applications that are quite hard to find.

Can we pick a smaller test.json and compressed equivalents in src/test/resources/? For example, we could take the index.json from the main CollecTor instance and throw out all files that haven't been modified in the past, say, four weeks. Or did you include a large file on purpose to check something that cannot be tested with a smaller file?

I'd say once we have resolved these questions, it's good to go into 1.4.0, even today. Thanks for working on this!

comment:11 in reply to:  10 ; Changed 3 years ago by iwakeh

Status: needs_revisionneeds_review

Replying to karsten:

Alright, let's go through the last four commits in that branch:

Thanks for looking at all that!

91fdb9b is already contained in master as 1dab40d.

correct.

a74563a looks good and is merged to my branch (see below) as 7a0e109.

1276c14 looks good, too, and is merged to my branch as e9c0731. Thanks for caring about tests!

Without tests a codebase becomes unmaintainable very soon ...

32f628f is preliminary merged to my branch as 82eb1cf, but I have a couple of questions and change requests, as follows:

Commit bf5ba78 in my task-19791 branch contains a few trivial fixes. Please take a look and let me know if you're okay with those.

Does look fine.

Regarding the new subpackage org.torproject.descriptor.index, I'm yet unclear whether that's supposed to be "visible" to API users or not, that is, if they're supposed to import and/or use those classes directly. If it is, that's going to grow the overall interface quite a bit, requiring us to keep the public parts stable in subsequent releases. For example, I'm unclear whether CollecTor is supposed to instantiate its own *Node classes when writing its index.json* files. Ideally, metrics-lib would hide away those classes by providing some kind of DescriptorIndexGenerator interface that takes a local directory as parameter and writes one or more index.json[*] files, but I'm not sure what your plan is there. If the plan is to hide that package from users, how about we rename it to org.torproject.descriptor.impl.index to make that clearer? We could even create similar implementation subpackages in the future to clean up the impl package a bit, but that's not something that users would have to care about.

I'd like to add the package as alpha. Warning that it can change still in unexpected ways. Thus, it can be used in CollecTor and from the findings there we can decide to remove it from the public interface, or change it w/o being backward compatible.
I provide a commit with the package-info and more javadoc below.

The *Node classes are of use on their own for clients that want to know about a CollecTor's instance files. They are there anyway and maintained, so others can just use them as well.

Can we avoid changing the parameter usage of DescriptorIndexCollector#collectDescriptors() by overloading that method in the interface? How about we add a new method collectDescriptor(String collecTorIndexUrl, String collecTorIndexPath, String[] remoteDirectories, ...) and use an implementation-specific default for collecTorIndexPath in the current method, such as "/index.html" for DescriptorCollectorImpl and "/index.json.xz" for DescriptorIndexCollector? If we retain backward compatibility here, I'd say we could switch to DescriptorIndexCollector in 1.5.0 when this class has seen some more testing. Otherwise I'd suggest waiting for 2.0.0.

I wouldn't want to change the interface for that. Both are URL strings and whoever uses the non-default implementation must know what their doing, i.e. at least read javadoc of the class they explicitly request to be used. At the point DescriptorIndexCollector becomes default we might not want to keep maintaining the old implementation and simply remove it. Then the additional method wouldn't make sense any more.

The condition if (!filepath.exists() && filepath.mkdirs()) in DescriptorIndexCollector#fetchRemoteFiles() looks wrong. Shouldn't it be !filepath.mkdirs() there? I didn't test this, but the API says that mkdirs() returns "true if and only if the directory was created, along with all necessary parent directories; false otherwise".

Thanks for spotting this!
I'll added a test and the fix in a new commit.

Is FileNode#lastModifiedMillis() called often? If so, it may be better to use ParseHelper#getDateFormat() to obtain a thread-safe DateFormat instance for the given format and store it in a map. That avoids creating single-use instances of that class over and over. We'd have to make that method public for this, but that shouldn't be an issue. However, if it's not called very often, we can skip this.

I'm aware of ParseHelper#getDateFormat(), but I didn't want to introduce dependencies to the impl package.
Actually, ParseHelper should be split into a utility class that is part of the interface and the implementation specific part (should be part of the redesign ticket #19640). Once, this class exists the getDateFormat method should be used; until then it wont break I think.

In IndexNode#retrieveFiles() I'm not yet clear what situations lead to returning the half-done map in the middle of the loop. If one invalid remote directory leads to that case, I think we should either warn and skip that remote directory or return an empty map. Otherwise, the order of remote directories would affect the result, and that might lead to bugs in user applications that are quite hard to find.

Good catch! Indeed there should be a continue. Adapted the tests to catch this and corrected the method an also another bug.

Can we pick a smaller test.json and compressed equivalents in src/test/resources/? For example, we could take the index.json from the main CollecTor instance and throw out all files that haven't been modified in the past, say, four weeks. Or did you include a large file on purpose to check something that cannot be tested with a smaller file?

I wanted to use the 'real' file as it is provided by the current main CollecTor without any edits, because I only have these small toy test files otherwise. There can be a timeout to the test method and we'd notice if that takes too long?

I'd say once we have resolved these questions, it's good to go into 1.4.0, even today. Thanks for working on this!

Fine, thanks for all the things you spotted!

Please review the new commits based on your branch above.

comment:12 in reply to:  11 Changed 3 years ago by karsten

Status: needs_reviewmerge_ready

Replying to iwakeh:

Replying to karsten:
[...]

Regarding the new subpackage org.torproject.descriptor.index, I'm yet unclear whether that's supposed to be "visible" to API users or not, that is, if they're supposed to import and/or use those classes directly. If it is, that's going to grow the overall interface quite a bit, requiring us to keep the public parts stable in subsequent releases. For example, I'm unclear whether CollecTor is supposed to instantiate its own *Node classes when writing its index.json* files. Ideally, metrics-lib would hide away those classes by providing some kind of DescriptorIndexGenerator interface that takes a local directory as parameter and writes one or more index.json[*] files, but I'm not sure what your plan is there. If the plan is to hide that package from users, how about we rename it to org.torproject.descriptor.impl.index to make that clearer? We could even create similar implementation subpackages in the future to clean up the impl package a bit, but that's not something that users would have to care about.

I'd like to add the package as alpha. Warning that it can change still in unexpected ways. Thus, it can be used in CollecTor and from the findings there we can decide to remove it from the public interface, or change it w/o being backward compatible.
I provide a commit with the package-info and more javadoc below.

The *Node classes are of use on their own for clients that want to know about a CollecTor's instance files. They are there anyway and maintained, so others can just use them as well.

Okay, I'm fine postponing this decision, and adding that alpha notice should be clear enough.

However, I'm yet unconvinced by the reasons you're giving. Ideally, clients shouldn't have to deal with the details of an index.json file. We should give them interfaces that are powerful enough to do all the things they'd want to do. Also, my past experience is that maintaining code can be very different depending on whether it's exposed in an API or not. I'd rather want to provide fewer details in interfaces than what's already there. But, no need to find a conclusion for this today, we can still think about this for 1.5.0.

Can we avoid changing the parameter usage of DescriptorIndexCollector#collectDescriptors() by overloading that method in the interface? How about we add a new method collectDescriptor(String collecTorIndexUrl, String collecTorIndexPath, String[] remoteDirectories, ...) and use an implementation-specific default for collecTorIndexPath in the current method, such as "/index.html" for DescriptorCollectorImpl and "/index.json.xz" for DescriptorIndexCollector? If we retain backward compatibility here, I'd say we could switch to DescriptorIndexCollector in 1.5.0 when this class has seen some more testing. Otherwise I'd suggest waiting for 2.0.0.

I wouldn't want to change the interface for that. Both are URL strings and whoever uses the non-default implementation must know what their doing, i.e. at least read javadoc of the class they explicitly request to be used. At the point DescriptorIndexCollector becomes default we might not want to keep maintaining the old implementation and simply remove it. Then the additional method wouldn't make sense any more.

My point is that we could do the switch to the new implementation in a version 1.5.0 if it remains backward-compatible, whereas we'd have to call that version 2.0.0 otherwise. So, we can merge this code as is for 1.4.0, but we should aim for a backward-compatible version if we want to make it the new default in 1.5.0.

Here's an even better suggestion: we accept both, base URLs (like "https://collector.torproject.org") and full URLs to index files (like "https://collector.torproject.org/index.html" or "https://collector.torproject.org/index.json.xz"), in the current collectDescriptors() method, and if no file is specified, we'll use an implementation-specific default ("/index.html" in DescriptorCollectorImpl and "/index.json.xz" in DescriptorIndexCollector). That doesn't break current applications that just provide a base URL, and it enables future applications to select their favorite compression type. Likewise, no need to make a decision today, but let's consider this for 1.5.0.

[...]

Is FileNode#lastModifiedMillis() called often? If so, it may be better to use ParseHelper#getDateFormat() to obtain a thread-safe DateFormat instance for the given format and store it in a map. That avoids creating single-use instances of that class over and over. We'd have to make that method public for this, but that shouldn't be an issue. However, if it's not called very often, we can skip this.

I'm aware of ParseHelper#getDateFormat(), but I didn't want to introduce dependencies to the impl package.
Actually, ParseHelper should be split into a utility class that is part of the interface and the implementation specific part (should be part of the redesign ticket #19640). Once, this class exists the getDateFormat method should be used; until then it wont break I think.

Ah, fair enough. Happy to discuss this more on that other ticket.

[...]

Can we pick a smaller test.json and compressed equivalents in src/test/resources/? For example, we could take the index.json from the main CollecTor instance and throw out all files that haven't been modified in the past, say, four weeks. Or did you include a large file on purpose to check something that cannot be tested with a smaller file?

I wanted to use the 'real' file as it is provided by the current main CollecTor without any edits, because I only have these small toy test files otherwise. There can be a timeout to the test method and we'd notice if that takes too long?

Okay to leave these files in. No need to add a timeout to the test, as we'll notice slow tests pretty easily, and then we can always take them out. I was also more concerned about the file size than the test execution time, but I admit that file size doesn't really matter here. Never mind.

I'd say once we have resolved these questions, it's good to go into 1.4.0, even today. Thanks for working on this!

Fine, thanks for all the things you spotted!

Please review the new commits based on your branch above.

All changes look good. I'll squash your latest commits and mine into 32f628f, run some tests, and release 1.4.0. Let's just not forget the issues I'm raising above for 1.5.0.

comment:13 Changed 3 years ago by karsten

Status: merge_readyneeds_revision

Bad news for the 1.4.0 release! I ran into a bug that I couldn't resolve easily. Or rather, I shouldn't fix this bug quickly and then put out the 1.4.0 release, because that would make a 1.4.1 release quite likely. Let's postpone that 1.4.0 release until we're sure what the fix is.

Please look at my task-19791-2 branch which contains your #19791 commit and all subsequent tweaks squashed together, plus some tweaks to the change log. That branch also includes the following fix that I made after realizing that DescriptorIndexCollector doesn't download actual descriptor files but rather the index.html file:

diff --git a/src/main/java/org/torproject/descriptor/index/DescriptorIndexCollector.java b/src/main/java/org/torproject/descriptor/index/DescriptorIndexCollector.java
index 5d6bcdd..4f62be6 100644
--- a/src/main/java/org/torproject/descriptor/index/DescriptorIndexCollector.java
+++ b/src/main/java/org/torproject/descriptor/index/DescriptorIndexCollector.java
@@ -83,7 +83,8 @@ public class DescriptorIndexCollector implements DescriptorCollector {
       }
       File destinationFile = new File(filepath, filename);
       File tempDestinationFile = new File(filepath, "." + filename);
-      try (InputStream is = new URL(baseUrl).openStream()) {
+      try (InputStream is = new URL(baseUrl + "/" + filepathname)
+          .openStream()) {
         Files.copy(is, tempDestinationFile.toPath());
         if (tempDestinationFile.length() == entry.getValue().size) {
           tempDestinationFile.renameTo(destinationFile);

However, that change makes the tests fail, and that makes me less optimistic that the change was correct, despite the fact that it's working fine in a quickly written test application. I think you should take a look.

comment:14 Changed 3 years ago by iwakeh

Status: needs_revisionneeds_review

Didn't have time to verify much, but this commit might help.

comment:15 Changed 3 years ago by karsten

So, I looked a bit more into the tests here, and your suggestion indeed fixes the tests. I suspect that it's because the test files that are created with File#createNewFile() are empty. But then I wonder why the other tests run without trouble with w2 and z2 having non-zero sizes in that same newIndexFile() method. I think we'll have to change them to zero, too, and we'll have to adapt the testNormalCollectingWithDeletion() test. Mind taking a look at the last commit in my updated task-19791-2 branch?

comment:16 Changed 3 years ago by iwakeh

The other tests ran without trouble b/c they tested a different behavior:

  • a file that is too old shouldn't be fetched (z2)
  • a file with wrong size shouldn't be fetched (w2)

I added some more inline comments and also added another correct file 'a/b/c/u2' for demonstrating what is tested in testNormalCollectingWithDeletion, i.e. why your suggested 'correction' is not correct.

Please see this branch based on your latest branch above.

comment:17 Changed 3 years ago by karsten

Status: needs_reviewmerge_ready

Alright, pushed to master and uploaded a yet unsigned 1.4.0 pre-release tarball. Mind taking a final look before I upload the signature file, push the signed Git tag, and announce the new release?

comment:18 Changed 3 years ago by iwakeh

Looks fine.
Maybe add the minor change that the jars can now be found in generated/dist and a hint about where to find the jars to the README.md?

comment:19 Changed 3 years ago by karsten

Thanks for looking!

And those are good ideas. Would you want to submit a patch for that? But regardless of who writes this text, I shouldn't do the release dance today but rather tomorrow morning with a fresh brain.

comment:20 Changed 3 years ago by iwakeh

Wednesdays are a good release days, imo. here are the suggestions for readme and changelog.

comment:21 Changed 3 years ago by karsten

Milestone: metrics-lib 1.4.0metrics-lib 1.5.0
Status: merge_readyassigned

Whee, it's released. Congratulations!

Moving to milestone metrics-lib 1.5.0 to discuss how to integrate DescriptorIndexCollector in a fully backward-compatible way.

comment:22 Changed 3 years ago by iwakeh

Milestone: metrics-lib 1.5.0metrics-lib 1.4.0
Resolution: implemented
Status: assignedclosed
Summary: Use CollecTor's index.json for download; adapt current download to use new date formatUse CollecTor's index.json for download ('''Alpha''' version); adapt current download to use new date format

As milestones also document what went into a certain release I reset this to 1.4.0 and close the ticket at the same time. Title is changed to reflect the alpha-version of the json-parsing-implementation and a follow-up ticket #20039.

Note: See TracTickets for help on using tickets.