Opened 3 months ago

Closed 4 weeks ago

#22836 closed enhancement (implemented)

Parse CollecTor's index.json and provide our own directory listing

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/Website Version:
Severity: Normal Keywords:
Cc: iwakeh Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

From #21551: "Should we let users browser CollecTor files on the Tor Metrics website rather than in Apache-generated directory listings on the CollecTor page? My suggestion: I'd say yes, but that can be done in a subsequent patch. Also a fun project." -- "Yes, new ticket!"

Child Tickets

Attachments (1)

task-22836-screenshot.png (262.1 KB) - added by karsten 2 months ago.

Download all attachments as: .zip

Change History (29)

comment:1 Changed 3 months ago by karsten

Owner: changed from metrics-team to karsten
Status: newaccepted

I'd like to grab this one, unless somebody else really wants to do it.

Changed 2 months ago by karsten

Attachment: task-22836-screenshot.png added

comment:2 Changed 2 months ago by karsten

Status: acceptedneeds_review

I implemented this. Please review commit 12a42e61 in my task-22836 branch and admire this screenshot for an example of it will look like.

Once this is merged and deployed we can remove all web elements from CollecTor that have the sole purpose of making its directory listings prettier.

comment:3 in reply to:  2 Changed 2 months ago by iwakeh

Replying to karsten:

I implemented this. Please review commit 12a42e61 in my task-22836 branch and admire this screenshot for an example of it will look like.

The screenshot looks good!

Once this is merged and deployed we can remove all web elements from CollecTor that have the sole purpose of making its directory listings prettier.

Yes, that will be good.

I'll take a look at the 350+ new/changed lines of code soon.

comment:4 in reply to:  2 Changed 2 months ago by iwakeh

Replying to karsten:

I implemented this. Please review commit 12a42e61 in my task-22836 branch and admire this screenshot for an example of it will look like.

Once this is merged and deployed we can remove all web elements from CollecTor that have the sole purpose of making its directory listings prettier.

This gain does not justify implementing the JSON-index-handling a third time, in addition with slight differences. Even now as we're working toward using an internal API of metrics-lib and metrics-lib already supplies convenient methods for retrieving index.json*.
(metrics-lib has a way better coverage in addition.)

The change to the FileNode class only seems to concern the comparison. This leads to the question of either using this type of comparison also in metrics-lib (small new ticket) or simply providing a Comparator for the sorting needed here.

comment:5 Changed 2 months ago by karsten

Heh, the good old index-handling classes discussion. We really disagree how to move forward there, don't we? :)

So, I went back and forth about this, too.

On the one hand the corresponding classes in metrics-lib are not part of the API yet, and they're not ready for that. (For example, I'd like to avoid that applications instantiate their own *Node classes. My preferred model would be that we provide some kind of IndexGenerator that takes one or more directories from the file system and generates a *Node structure and index.json* files from that. And an IndexParser would take an index.json file as input (or download it from a server) and provide a *Node structure as output. And all *Node classes would be interfaces, not classes, and they'd be located in packages that are part of the API.) But I can't really work on finalizing these classes enough to make them part of the API at this point. I'm trying to clean up a few things before the next Tor meeting, and making Tor Metrics the only user-facing website is one of them.

On the other hand I agree that duplicating or triplicating code is bad. To be fair, we're talking about 20 lines of code here, so the risk for bugs is relatively low (though not zero).

I can see two ways forward: a) We accept these 20 lines of duplicate code for now and plan to replace those classes with ones from metrics-lib once they're part of the API. b) We use types from metrics-lib despite them not being part of the API yet and put up a big warning sign that things might break if metrics-lib changes.

What's your preference?

Did you find anything else, or did you not look further after finding these internal *Node classes? :)

comment:6 Changed 2 months ago by karsten

Hmmmmmm. While reviewing #23286 I thought about an alternative c): We move the org.torproject.descriptor.index package from metrics-lib to metrics-base where we call it org.torproject.metrics.index and re-use those classes in all code bases relying on metrics-base.

We might add even more commonly used classes to metrics-base as long as they are used by more than one code base. The difference to metrics-lib is that metrics-base is not an API to be used by products outside of the metrics space. If somebody uses that code, we change something, and their code breaks, it's not our fault. In contrast to that, keeping index things in metrics-lib and exposing them in the API, and maybe even adding more parts in the future, would lead to making it harder to maintain that interface in the future.

On the positive side, I'd be a lot less picky about the model for the index package if it's only contained in metrics-base and not in metrics-lib.

What do you think?

comment:7 in reply to:  6 Changed 8 weeks ago by iwakeh

Replying to karsten:

Hmmmmmm. While reviewing #23286 I thought about an alternative c): We move the org.torproject.descriptor.index package from metrics-lib to metrics-base where we call it org.torproject.metrics.index and re-use those classes in all code bases relying on metrics-base.

We might add even more commonly used classes to metrics-base as long as they are used by more than one code base. The difference to metrics-lib is that metrics-base is not an API to be used by products outside of the metrics space. If somebody uses that code, we change something, and their code breaks, it's not our fault. In contrast to that, keeping index things in metrics-lib and exposing them in the API, and maybe even adding more parts in the future, would lead to making it harder to maintain that interface in the future.

On the positive side, I'd be a lot less picky about the model for the index package if it's only contained in metrics-base and not in metrics-lib.

What do you think?

Most of the time trac tickets are useful for discussion. This time I feel we have many tickets with the same discussion, dropping different parts of the various threads. Let's keep this discussion for the next team meeting.

I'll add a summary and more thoughts to the pad :-)

comment:8 Changed 7 weeks ago by karsten

FWIW, I'm working on a patch based on our discussion from yesterday.

comment:9 Changed 7 weeks ago by karsten

Please review commit 59fa501 in my task-22836 branch which now uses the index.json-handling classes from metrics-lib.

comment:10 Changed 6 weeks ago by iwakeh

Cc: iwakeh added

comment:11 Changed 5 weeks ago by iwakeh

I'm wondering about the wait-construct when no index is available:
The thread could be interrupted for any reason a short time (e.g. 5ms) after calling wait and the index preparation might not be done. And, more likely, it is not interrupted and then there is a wait of 10 seconds, but I'd guesstimate index creation takes way less than one second. Thus, it might be useful to 1) embed the wait in a loop checking if the waiting time elapsed or if there is an index available.
2) Secondly, the actual call to wait should use only 200 ms (or some other amount way smaller than 10000) as value.

There are three format* methods together with extractDirectoryListings, which do not depend on the actual object and should be made static. As these format* plus extractDirectoryListings methods prepare the output, I really would like to see tests here. This is not an empty exercise, but facilitates easier maintenance in future and enables review in first place.

Last edited 5 weeks ago by iwakeh (previous) (diff)

comment:12 in reply to:  11 ; Changed 5 weeks ago by karsten

Status: needs_reviewneeds_revision

Replying to iwakeh:

I'm wondering about the wait-construct when no index is available:
The thread could be interrupted for any reason a short time (e.g. 5ms) after calling wait and the index preparation might not be done. And, more likely, it is not interrupted and then there is a wait of 10 seconds, but I'd guesstimate index creation takes way less than one second. Thus, it might be useful to 1) embed the wait in a loop checking if the waiting time elapsed or if there is an index available.
2) Secondly, the actual call to wait should use only 200 ms (or some other amount way smaller than 10000) as value.

Agreed about the loop.

But keep in mind that the timeout is for the time between sending a request for an index.json file until receiving and processing the response. If this were a local operation I'd say 200 ms are enough, but with the possible network delay I'd say we need to give it at least a few seconds. If 10000 ms are too much, how about 2000 ms?

There are three format* methods together with extractDirectoryListings, which do not depend on the actual object and should be made static. As these format* plus extractDirectoryListings methods prepare the output, I really would like to see tests here. This is not an empty exercise, but facilitates easier maintenance in future and enables review in first place.

Agreed.

This is on my list now. Will post a revised branch once I have one. (Thanks!)

comment:13 in reply to:  12 Changed 5 weeks ago by iwakeh

Replying to karsten:

Replying to iwakeh:

I'm wondering about the wait-construct when no index is available:
The thread could be interrupted for any reason a short time (e.g. 5ms) after calling wait and the index preparation might not be done. And, more likely, it is not interrupted and then there is a wait of 10 seconds, but I'd guesstimate index creation takes way less than one second. Thus, it might be useful to 1) embed the wait in a loop checking if the waiting time elapsed or if there is an index available.
2) Secondly, the actual call to wait should use only 200 ms (or some other amount way smaller than 10000) as value.

Agreed about the loop.

But keep in mind that the timeout is for the time between sending a request for an index.json file until receiving and processing the response. If this were a local operation I'd say 200 ms are enough, but with the possible network delay I'd say we need to give it at least a few seconds. If 10000 ms are too much, how about 2000 ms?

I was referring to the wait-statement not the entire waiting time:

do
   wait(200)
until(10 sec are over or an index is available)

The checks in the 'until' part are cheap and will improve response times. I assume we'll hardly ever reach the ten second limit.
In addition, there could be some logging (on debug) in order to adjust the values later.

There are three format* methods together with extractDirectoryListings, which do not depend on the actual object and should be made static. As these format* plus extractDirectoryListings methods prepare the output, I really would like to see tests here. This is not an empty exercise, but facilitates easier maintenance in future and enables review in first place.

Agreed.

This is on my list now. Will post a revised branch once I have one. (Thanks!)

Looking forward to the new branch :-)

comment:14 Changed 5 weeks ago by iwakeh

One other thought about the format* methods plus extractDirectoryListings:

This code could be encapsulated in a simple java object DirectoryListing that will receive a JSON string and provides methods returning listings.

comment:15 Changed 5 weeks ago by karsten

I haven't looked at the code yet, but I think we don't have to reduce the wait time inside the loop to 200 ms, because we're being notified as soon as the index.json is successfully parsed.

Adding some logging there seems reasonable. In particular the total wait time when downloading and processing index.json for the first time after startup is something we should learn by this.

Will keep your idea of encapsulating things in mind when looking at this code.

Thanks! Stay tuned! :)

comment:16 Changed 5 weeks ago by karsten

Status: needs_revisionneeds_review

Please review my updated task-22836 branch with changes discussed above. Not rebased yet, will do that when merging.

comment:17 Changed 4 weeks ago by iwakeh

I cannot find a test for static List<String[]> formatTableEntries, did you forget a commit?
The waiting time shouldn't be extended unnecessarily:

-  } while (System.currentTimeMillis() < waitingSinceMillis + 10000L);
+  } while (null == this.index
+      && System.currentTimeMillis() < waitingSinceMillis + 10000L);

In addition, it might be safer and shorter to simply use AtomicReference instead of the synchronized methods.
Actually, I was hinting for a separate DirectoryListing class (in comment:14), but separating the retrieval of the index is important, too.
The separation you provided made it easy to introduce a CollectorDirectoryProvider, which fetches and provides the index incl. the suggested use of AtomicReference and the second while-condition. DirectoryListing is a Map<String, List<String[]>> and encapsulates the conversion from Index* to lists of strings. CollectorDirectoryProvider only manages the scheduler and handing out the index.

Another test testListing needs close inspection; it uses a fantasy index.json and checks the resulting map.

Please, review the commit on top of this branch.

(I didn't change the byte-calculations, b/c they are only 'decoration' and the compiler will make sure that a constant is created for the Math.log(1024).)

Last edited 4 weeks ago by iwakeh (previous) (diff)

comment:18 in reply to:  17 ; Changed 4 weeks ago by karsten

Replying to iwakeh:

I cannot find a test for static List<String[]> formatTableEntries, did you forget a commit?

I did not include one, because I didn't find an easy way to test this method. But I like your new test method and think it already covers this method quite well. Do you think we need to test more?

The waiting time shouldn't be extended unnecessarily:

-  } while (System.currentTimeMillis() < waitingSinceMillis + 10000L);
+  } while (null == this.index
+      && System.currentTimeMillis() < waitingSinceMillis + 10000L);

Ouch. Yes, I meant to write that. Good catch!

In addition, it might be safer and shorter to simply use AtomicReference instead of the synchronized methods.

Works for me.

Actually, I was hinting for a separate DirectoryListing class (in comment:14), but separating the retrieval of the index is important, too.
The separation you provided made it easy to introduce a CollectorDirectoryProvider, which fetches and provides the index incl. the suggested use of AtomicReference and the second while-condition. DirectoryListing is a Map<String, List<String[]>> and encapsulates the conversion from Index* to lists of strings. CollectorDirectoryProvider only manages the scheduler and handing out the index.

Looks good to me!

Another test testListing needs close inspection; it uses a fantasy index.json and checks the resulting map.

Looks good!

Please, review the commit on top of this branch.

(I didn't change the byte-calculations, b/c they are only 'decoration' and the compiler will make sure that a constant is created for the Math.log(1024).)

I only made a few trivial tweaks in my updated task-22836 branch that is based on yours. If you like them, do you mind me squashing all commits into one before rebasing and pushing to master?

Thanks for the review and improvements!

comment:19 in reply to:  18 Changed 4 weeks ago by iwakeh

Status: needs_reviewmerge_ready

Replying to karsten:

Replying to iwakeh:

I cannot find a test for static List<String[]> formatTableEntries, did you forget a commit?

I did not include one, because I didn't find an easy way to test this method. But I like your new test method and think it already covers this method quite well. Do you think we need to test more?

Yes, only a different structure made ease of testing possible. If there had been another test, I would have suggested merging the test data.
In general, there are never enough tests ;-) All fine, until the next bug is discovered.

...

I only made a few trivial tweaks in my updated task-22836 branch that is based on yours. If you like them, do you mind me squashing all commits into one before rebasing and pushing to master?

All fine.
Ready for merge!

Thanks for the review and improvements!

comment:20 Changed 4 weeks ago by karsten

Resolution: implemented
Status: merge_readyclosed

Great! Squashed, merged, and deployed. Closing. Thanks again!

comment:21 Changed 4 weeks ago by karsten

FWIW, I just kicked out all of CollecTor's webapp stuff, which was one of the goals of this ticket. CollecTor's directory listings are still available, but they're generated by Apache without any additional styling provided by us. This seems fine to me, given that we don't expect the average user to browse that website anyway.

comment:22 in reply to:  21 Changed 4 weeks ago by iwakeh

Replying to karsten:

FWIW, I just kicked out all of CollecTor's webapp stuff, which was one of the goals of this ticket. CollecTor's directory listings are still available, but they're generated by Apache without any additional styling provided by us. This seems fine to me, given that we don't expect the average user to browse that website anyway.

That's true, but <nitpick> I would mention the removal of all styling resources (fonts, css, etc.) rather than calling it 'un-prettify' (which is a matter of taste) in the changelog. </nitpick>

comment:23 in reply to:  20 Changed 4 weeks ago by iwakeh

Replying to karsten:

Great! Squashed, merged, and deployed. Closing. Thanks again!

Should this be tagged as 'internal release'?

comment:24 Changed 4 weeks ago by karsten

Resolution: implemented
Status: closedreopened

Heh, right, the change log entry could be better. How about: "- Remove all styling resources (fonts, CSS, etc.) from directory listings."?

Regarding the internal release, I'm not sure. Wouldn't that be confusing with external releases? Should we have both in parallel?

And isn't this just a deployment decision? For example, collector2 didn't use any of these styling resources at all.

comment:25 in reply to:  24 Changed 4 weeks ago by iwakeh

Replying to karsten:

Heh, right, the change log entry could be better. How about: "- Remove all styling resources (fonts, CSS, etc.) from directory listings."?

Regarding the internal release, I'm not sure. Wouldn't that be confusing with external releases? Should we have both in parallel?

And isn't this just a deployment decision? For example, collector2 didn't use any of these styling resources at all.

My comment:23 was in reply to your merge to metrics-web git not collector. And, in metrics-web we wanted to start with 'internal releases'.

comment:26 Changed 4 weeks ago by karsten

Ah, sure, happy to start with internal releases for metrics-web. So, what remains to be done?

  • Create CHANGELOG.md in metrics-web and tag current HEAD as 1.0.0 with a generic comment "This it the initial release..."?
  • Update the CollecTor change log entry to "Remove all styling resources (fonts, CSS, etc.) from directory listings."?
  • Anything else or differently?

comment:27 in reply to:  26 Changed 4 weeks ago by iwakeh

Replying to karsten:

Ah, sure, happy to start with internal releases for metrics-web. So, what remains to be done?

  • Create CHANGELOG.md in metrics-web and tag current HEAD as 1.0.0 with a generic comment "This it the initial release..."?
  • Update the CollecTor change log entry to "Remove all styling resources (fonts, CSS, etc.) from directory listings."? ...

Sounds good!

comment:28 Changed 4 weeks ago by karsten

Resolution: implemented
Status: reopenedclosed

Done and done. Closing. Thanks!

Note: See TracTickets for help on using tickets.