Opened 22 months ago

Closed 22 months ago

Last modified 22 months ago

#21095 closed enhancement (implemented)

Accept more values for the "order" parameter

Reported by: lukechilds Owned by: metrics-team
Priority: Medium Milestone: Onionoo 3.2-1.1.0
Component: Metrics/Onionoo Version:
Severity: Normal Keywords: metrics-help
Cc: iwakeh Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Unless I'm mistaken the only value accepted on the "order" parameter is "consensus_weight".

It would be really helpful and allow for more complex apps to be built if there were more values we could sort by. Specifically "first_seen", "last_seen", "last_restarted" and "bandwidth_rate".

I'm currently trying to order by "first_seen" and it's looking like the only way I could achieve this would be to query the details document of every single node and then reorder them based on the "first_seen" date in my application.

Child Tickets

Change History (30)

comment:1 Changed 22 months ago by karsten

Keywords: metrics-help added

Let's see.

If you're interested in details documents, what you can always do is send one query for all details documents and order them locally. That's okay for services that cache responses locally (Compass does this), but prohibitively expensive for applications that make requests on behalf of their user (like Atlas).

Are we talking about Atlas-like use cases here? If so, can you describe those in more detail?

If you'd want to download something else than details documents, things are a bit more complicated, even in Compass-like settings. In that case you'll have to do the thing above and then send one query per node in your ordered list. This should be fine for 10 or 20 hits, but anything beyond that seems like a bad idea. Or you also download all documents that you're interested in and keep them together with all the details documents.

Let's assume we have a use case where extending Onionoo makes sense. In such a case, adding "first_seen" and "last_seen" to the "order" parameter would be relatively easy.

(Implementation hints: See this part of the code for details: https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/onionoo/server/NodeIndexer.java#n202. Use SummaryDocument#getFirstSeenMillis() and SummaryDocument#getLastSeenMillis(), and consider turning orderRelaysByConsensusWeight into a map with the current list entries as map keys and (hashed) fingerprints as map values. Also make sure that ordering by multiple parameters gets implements, which was not necessary so far.)

The other two fields that you mention, "last_restarted" and "bandwidth_rate", would be more difficult, though doable if the use case is really convincing. (Implementation hint: the trouble is that the necessary data is not available in SummaryDocument, so any extension would have to take into account that currently running Onionoo instances don't have the required data yet.)

This could be a ticket for a new volunteer. Be sure to write tests, this can break in funny ways!

comment:2 Changed 22 months ago by lukechilds

Re use case, it was requested that there be a way to list the newest relays on the network on Tor Explorer as these relays have a lower chance of being blacklisted:
https://github.com/lukechilds/tor-explorer/issues/5

I've got it working by getting all details docs from https://onionoo.torproject.org/details and ordering locally. The thing is Onionoo takes around 15 seconds to respond with all this info and it's over 2MB of JSON to sort through. I don't think it's really a viable solution.

I only need the top ten for the first page, so it would be much more elegant if I could tell Onionoo to order by first_seen and then limit to 10. Then for the next page do the same query but offset by 10 etc.

The main value I think would be useful would be first_seen, however I was thinking of adding a drop down box to Tor Explorer to allow the user to order by whichever properties they want. Obviously if that's going to be a lot of effort on the Onionoo end then it's probably not worth it.

comment:3 Changed 22 months ago by karsten

Hmm, I wonder if "?order=-first_seen&limit=10" would really give you (or your users) a useful list of relays. There are probably dozens or hundreds of relays in that list, most of them tiny and some not even around anymore.

Another option would be "?first_seen_days=-1&order=-consensus_weight&limit=10" to obtain the 10 fastest relays that showed up in the past 24 hours. Want to take a look and maybe give that a try? (It's already implemented!)

But if the first option seems more promising, let's implement that one, too. As I said, "first_seen" is relatively easy.

comment:4 Changed 22 months ago by lukechilds

Would ?order=-first_seen&limit=10 not return the 10 most recent relays to join the network?

?first_seen_days=-1&order=-consensus_weight&limit=10 seems like it could definitely be helpful, however the first method would be ideal. If there haven't been many new relays on the current day we might need to go back to yesterday. And for users specifically trying to get a clean IP, age would be more important than performance.

If we could get sorting by first_seen that would be awesome!

I've had a look through the file you linked. I've never touched a line of Java in my life and to be honest I'm kind of struggling to grok it.

It looks like I could just copy and past the orderRelaysByConsensusWeight bits for first_seen but I'm not quite sure how that wires up with the order URL parameter.

comment:5 Changed 22 months ago by karsten

Alright, I had a look at the actual data, and it looks like you're right. I had assumed that there are dozens of new relays joining every hour, but that's not the case. Here are the "first_seen" times of today (until 15:00 UTC):

{"first_seen":"2017-01-03 15:00:00"},
{"first_seen":"2017-01-03 15:00:00"},
{"first_seen":"2017-01-03 15:00:00"},
{"first_seen":"2017-01-03 14:00:00"},
{"first_seen":"2017-01-03 14:00:00"},
{"first_seen":"2017-01-03 14:00:00"},
{"first_seen":"2017-01-03 14:00:00"},
{"first_seen":"2017-01-03 14:00:00"},
{"first_seen":"2017-01-03 13:00:00"},
{"first_seen":"2017-01-03 13:00:00"},
{"first_seen":"2017-01-03 13:00:00"},
{"first_seen":"2017-01-03 12:00:00"},
{"first_seen":"2017-01-03 11:00:00"},
{"first_seen":"2017-01-03 11:00:00"},
{"first_seen":"2017-01-03 11:00:00"},
{"first_seen":"2017-01-03 11:00:00"},
{"first_seen":"2017-01-03 10:00:00"},
{"first_seen":"2017-01-03 10:00:00"},
{"first_seen":"2017-01-03 09:00:00"},
{"first_seen":"2017-01-03 09:00:00"},
{"first_seen":"2017-01-03 08:00:00"},
{"first_seen":"2017-01-03 06:00:00"},
{"first_seen":"2017-01-03 05:00:00"},
{"first_seen":"2017-01-03 05:00:00"},
{"first_seen":"2017-01-03 04:00:00"},
{"first_seen":"2017-01-03 04:00:00"},
{"first_seen":"2017-01-03 04:00:00"},
{"first_seen":"2017-01-03 04:00:00"},
{"first_seen":"2017-01-03 03:00:00"},
{"first_seen":"2017-01-03 01:00:00"},
{"first_seen":"2017-01-03 01:00:00"},
{"first_seen":"2017-01-03 01:00:00"},
{"first_seen":"2017-01-03 01:00:00"},
{"first_seen":"2017-01-03 00:00:00"},
{"first_seen":"2017-01-03 00:00:00"},
{"first_seen":"2017-01-03 00:00:00"},

I'll give this a try and see if I can build this by dinner.

comment:6 Changed 22 months ago by karsten

Status: newneeds_review

comment:7 Changed 22 months ago by lukechilds

Thanks Karsten, this looks great!

Is there a live version I can test or do I need to run this locally?

comment:8 Changed 22 months ago by karsten

There, I just deployed this version here: https://onionoo.thecthulhu.com/details?order=-first_seen&fields=fingerprint,first_seen&limit=5. Please let us know how this works for you.

comment:9 Changed 22 months ago by lukechilds

Perfect, testing now.

I knew adding a configurable baseUrl option to my API client was gonna come in useful!

comment:10 Changed 22 months ago by karsten

Cc: iwakeh added

Copying iwakeh here, also as a reminder that this code still needs review, in addition to testing, before we merge it.

comment:11 Changed 22 months ago by iwakeh

Have this on my review-list.

comment:12 Changed 22 months ago by lukechilds

All working well on my end :)

comment:13 Changed 22 months ago by iwakeh

All fine and good to see all those tests :-)
(Although they only test summary docs, but that's not really part of this ticket.)

Maybe, add a test or change test data to accommodate testing the different order of order parameters. Currently -first_seen causes the same ordering of the test data as consensus_weight.

The protocol states Results are first ordered by the first list element, then by the second, and so on., but more than four order fields will cause the invalid request response 400, which is tested for in testOrderConsensusWeightFiveTimes but not mentioned in the protocol.

How is the impact on memory when this new list is added?

Unrelated: Should this:

    /* This variable can go away once all Onionoo services had their
     * hourly updater write effective families to summary documents at
     * least once.  Remove this code after September 8, 2015. */

be done now with an additional commit?

comment:14 in reply to:  13 ; Changed 22 months ago by karsten

Replying to iwakeh:

All fine and good to see all those tests :-)

Great, thanks for looking!

(Although they only test summary docs, but that's not really part of this ticket.)

Right, though I wouldn't expect much better coverage from including other document types. Details documents with the fields parameter maybe. But all in all there are lower-hanging fruit for increasing actual test coverage.

Maybe, add a test or change test data to accommodate testing the different order of order parameters. Currently -first_seen causes the same ordering of the test data as consensus_weight.

Done in a new commit in the same branch.

The protocol states Results are first ordered by the first list element, then by the second, and so on., but more than four order fields will cause the invalid request response 400, which is tested for in testOrderConsensusWeightFiveTimes but not mentioned in the protocol.

Right, I put that in to protect against potentially expensive queries asking us to sort results over and over. What we could also do is specify that each parameter can only be used once. Is that better?

By the way, is there a valid use case for ordering by the same parameter more than once? After all, the second (and subsequent) order parameters are only applied when two entries have the exact same value with respect to the first order parameter.

Does the above become clear from the specification, or how would we explain that more clearly? And is it what people would expect?

Example:

[ {"a": 1, "b": 2, "c": 3},
  {"a": 1, "b": 1, "c": 4} ]

ordered by a, b:
[ {"a": 1, "b": 1, "c": 4},
  {"a": 1, "b": 2, "c": 3} ]

ordered by b, a:
[ {"a": 1, "b": 1, "c": 3},
  {"a": 1, "b": 2, "c": 4} ]

How is the impact on memory when this new list is added?

I didn't check, but in theory we're keeping another list in memory that contains 12k fingerprints times 40 bytes = 480 kB, plus some overhead for the list. Not really an issue, I think.

Unrelated: Should this:

    /* This variable can go away once all Onionoo services had their
     * hourly updater write effective families to summary documents at
     * least once.  Remove this code after September 8, 2015. */

be done now with an additional commit?

Yes, let's do that in an additional commit, unrelated to this ticket. There maybe other parts in the code that we can take out. But let's move that to another ticket.

comment:15 in reply to:  14 Changed 22 months ago by iwakeh

Replying to karsten:

...

(Although they only test summary docs, but that's not really part of this ticket.)

Right, though I wouldn't expect much better coverage from including other document types. Details documents with the fields parameter maybe. But all in all there are lower-hanging fruit for increasing actual test coverage.

That's ok.

Maybe, add a test or change test data to accommodate testing the different order of order parameters. Currently -first_seen causes the same ordering of the test data as consensus_weight.

Done in a new commit in the same branch.

There needs to be a tie in the first order parameter in order to see the second being effective.
Or is that the case in the test data?

The protocol states Results are first ordered by the first list element, then by the second, and so on., but more than four order fields will cause the invalid request response 400, which is tested for in testOrderConsensusWeightFiveTimes but not mentioned in the protocol.

Right, I put that in to protect against potentially expensive queries asking us to sort results over and over. What we could also do is specify that each parameter can only be used once. Is that better?

At most specifying each parameter once (either with or w/o the minus sign) is the optimal choice, I think.

By the way, is there a valid use case for ordering by the same parameter more than once? After all, the second (and subsequent) order parameters are only applied when two entries have the exact same value with respect to the first order parameter.

There is no use-case for having two times the same parameter.

Does the above become clear from the specification, or how would we explain that more clearly? And is it what people would expect?

Without changing code, the protocol should state in addition that more than four order parameters will cause an error; it states that the first of one type takes precedence.

Example:

[ {"a": 1, "b": 2, "c": 3},
  {"a": 1, "b": 1, "c": 4} ]

ordered by a, b:
[ {"a": 1, "b": 1, "c": 4},
  {"a": 1, "b": 2, "c": 3} ]

ordered by b, a:
[ {"a": 1, "b": 1, "c": 3},
  {"a": 1, "b": 2, "c": 4} ]
random order and maybe order by a:
 [ {"a": 1, "x": 2, "c": 3},
   {"a": 1, "x": 1, "c": 4} ]
order a,x or just x:
 [ {"a": 1, "x": 1, "c": 4},
   {"a": 1, "x": 2, "c": 3} ]
order a,c or just c:
 [ {"a": 1, "x": 2, "c": 3},
   {"a": 1, "x": 1, "c": 4} ]

But, I don't know if these examples might confuse rather than help.

How is the impact on memory when this new list is added?

I didn't check, but in theory we're keeping another list in memory that contains 12k fingerprints times 40 bytes = 480 kB, plus some overhead for the list. Not really an issue, I think.

OK.

Unrelated: Should this:

    /* This variable can go away once all Onionoo services had their
     * hourly updater write effective families to summary documents at
     * least once.  Remove this code after September 8, 2015. */

be done now with an additional commit?

Yes, let's do that in an additional commit, unrelated to this ticket. There maybe other parts in the code that we can take out. But let's move that to another ticket.

OK.

comment:16 Changed 22 months ago by karsten

Oh well, your suggestion to modify the test data revealed a bug. (Sorting was not stable...) Please review this commit in my updated task-21095 branch.

Regarding the example, that was just meant for our discussion here, not for the protocol page. It would certainly confuse users.

comment:17 Changed 22 months ago by iwakeh

Yes! Comparator is the way to go about this task.

While looking at the code, I came up with suggestions/wishes:
I'd like to make Onionoo more testable, slowly but surely starting with this task. The monolithic test approach needs to be broken down, and it should be way easier to add test case data.
ResourceServletTest is fine for testing the url parameters, but it is cumbersome for testing the ordering scheme thoroughly.
The Comparator should be a separate class and have it's own test class (also for not growing RequestHandler more).
And, field names (used more than once) should be constants as well as the ordering names (for readability and compile time checks). This task can be broken down into small steps.
I have a branch with suggestions and as proof of the one-by-one concept (It's not a patch, but compiles and passes the existing tests.) Please, see also the commit comment there.

The remaining field constants could be added one by one and in a metrics-help ticket. Same for the remaining java7 issues (maybe also look for try-with-resources and similar).

Thoughts?

comment:18 Changed 22 months ago by karsten

Very quick response before running off for a few hours:

  • Using constants for field names makes sense, though "ascending" and "descending" don't really belong there but rather in the server package. Still, starting such a Fields interface or enum in the docs package sound fine.
  • Separate comparator class makes sense, though I'm not sure whether it rather belongs in the server package than docs.
  • Changes like copyright updates or Java 7 changes should be done in separate commits and for the entire code base.

Feel free to hack on this, and I'll review it later. Or I can start making changes myself this afternoon and ask you to review them afterwards. Up to you. Thanks!

comment:19 in reply to:  18 Changed 22 months ago by iwakeh

Replying to karsten:

Very quick response before running off for a few hours:

  • Using constants for field names makes sense, though "ascending" and "descending" don't really belong there but rather in the server package. Still, starting such a Fields interface or enum in the docs package sound fine.

The asc/des-fields are used by the comparators and should be in their package, maybe in a different interface, e.g., Order.java or the comparator class itself?

  • Separate comparator class makes sense, though I'm not sure whether it rather belongs in the server package than docs.

The comparator class is only concerned with docs, the server package uses docs and their comparators. Thus, the comparator class should be in docs.

  • Changes like copyright updates or Java 7 changes should be done in separate commits and for the entire code base.

Separate commits for these changes true (that's why I didn't call my branch patch).
Copyright can be done via commandline and for all classes at the same time, agreed.
The java 7 changes require more fine-grained code changes and could also be done whenever classes are touched anyway. So, it's more a question of available time to do these at once or class by class.

Feel free to hack on this, and I'll review it later. Or I can start making changes myself this afternoon and ask you to review them afterwards. Up to you. Thanks!

Having other high priority tasks at the moment, I could sneak that in next week at some point, if you don't get to it.

comment:20 Changed 22 months ago by iwakeh

Milestone: Onionoo 3.1-1.1.0

Beginning to collect more tickets for the next Onionoo milestone.

comment:21 Changed 22 months ago by iwakeh

Milestone: Onionoo 3.1-1.1.0Onionoo 3.2-1.1.0

Argh, because of the protocol change it's 3.2-1.1.0

comment:22 Changed 22 months ago by karsten

Updated copyright in all source files, will look into remaining issues this afternoon.

comment:23 Changed 22 months ago by karsten

Please review my branch task-21095-2 which is rebased to master and which contains separate commits for the more general changes (copyright update, which is already in master, and diamond operator for the entire codebase) and my interpretation of the changes you suggested above. :)

comment:24 Changed 22 months ago by iwakeh

The source for org.torproject.onionoo.docs.OrderParameterValues seems to be missing.

comment:25 Changed 22 months ago by karsten

Argh, that's the new "forgot the attachment", right? Pushed a fixup commit to my branch. Sorry!

comment:26 Changed 22 months ago by iwakeh

Thanks for implementing all that!
It looks good.

Please find three more commits on that branch.

One introducing a constant for the descending marker.
Another for using the OrderParameterValues constants also in the tests.
And, a third with a new test for SummaryDocumentComparator, which led to a bugfix.

comment:27 Changed 22 months ago by karsten

Status: needs_reviewmerge_ready

Those three commits look good. Merged, fixed a few trivial whitespace issues, squashed, rebased, and pushed to master. Changing status to merge_ready until this is deployed (maybe we'd need a deploy_ready, hmmm). Thanks!

comment:28 Changed 22 months ago by iwakeh

Thanks!

Regarding the "not closing before release" problem:
During the recent releases the open milestone signified the "not released" stage of the tickets.
I think it is fine to close done tickets that have a valid release milestone.
Also looks good in the milestone overview :-)

comment:29 Changed 22 months ago by karsten

Resolution: implemented
Status: merge_readyclosed

Alright, closing, but whoever watches this ticket: it'll take another week or so until this new feature will be available! :)

comment:30 Changed 22 months ago by lukechilds

Awesome!

Let me know when it's done, I'm still using https://onionoo.thecthulhu.com on my dev install :)

Note: See TracTickets for help on using tickets.