Without having reviewed the patch in detail, I'm worried about using types from metrics-lib's org.torproject.descriptor.index package which are still in alpha stage. Should we rather try to hide the details from CollecTor by providing a DescriptorIndexGenerator interface that takes a directory as input as well as a list of desired file extensions and that produces index.json* files accordingly?
The implementation of #18910 (moved) requires CollecTor to have a Java representation of index.json to choose the documents to download from the partner-synch-Collector instance(s), especially with the pick-and-choose requirements from comment:11 in #18910 (moved).
Shouldn't a CollecTor instance have more fine grained control when creating index.json? More than just specifying a directory. Currently it includes already recent and archive.
The package was designated as alpha to prevent too early reliance on the new API to have more flexibility when implementing #18910 (moved). It is well tested in metrics-lib.
The implementation of #18910 (moved) requires CollecTor to have a Java representation of index.json to choose the documents to download from the partner-synch-Collector instance(s), especially with the pick-and-choose requirements from comment:11 in #18910 (moved).
So, this is about using metrics-lib *Node classes for obtaining descriptors, not for providing index.json* files, right?
I don't recall the exact requirements we discussed for #18910 (moved), and I think we discussed quite a few variations there. But what we can already do is specify an array of directories to synchronize. The local CollecTor instance would then decide locally from looking at synchronized files which ones to keep and copy over and which ones to ignore.
What we could do is pass a list of excluded paths to DescriptorCollector, which would contain paths of consensuses and votes, possibly even with last modified times, that we already have and that we don't want to synchronize. However, this feels a bit like premature optimization. There's no big harm in downloading the entire recent/ folder from a remote CollecTor instance and decide locally what to do with the data. We're moving around larger chunks of bytes than that.
Shouldn't a CollecTor instance have more fine grained control when creating index.json? More than just specifying a directory. Currently it includes already recent and archive.
Well, we could accept an array of directories instead of just one, if that helps. Or a base directory and an array of contained subdirectories to include. Whatever is most intuitive and does the job.
The package was designated as alpha to prevent too early reliance on the new API to have more flexibility when implementing #18910 (moved). It is well tested in metrics-lib.
Oh, I'm not worried that it might not be tested well enough. I'm worried about making the API bigger and having to maintain these parts in the future. This whole index.json* stuff is something that library users ideally shouldn't have to worry about. That's why I'm still trying hard to hide it away as best as I can. If this turns out to be impossible or impracticable, so be it. But I'm not there yet. :)
Please review this branch containing the changes discussed in Berlin.
Set descriptor dependency version to 1.4.0-dev. For release it should be changed to the released version including the changes of #20039 (moved). I'm not that happy about keeping the same code in two products; more in #20039 (moved).
A small necessary tweak to SchedulerTest is included.