Transform into appropriate format and also add path descriptions for webstats.
Old description: It looks like the CollecTor mirror serves the index.json file at a different URL (/index/index.json) than the main CollecTor instance (/index.json). In theory, we could agree on a common place for that file and all other files and consider that part of the "protocol". And I'm happy to consider changing paths on the main CollecTor instance if different paths make more sense.
Here are the current directories and files on the main CollecTor instance:
/                  # start page with all the content for humans
/index.html        # same as /
/css/              # web stuff
/images/           # web stuff
/header.html       # used to style directory listings
/footer.html       # used to style directory listings
/formats.html      # not used anymore, could go away if we wanted
/archive/          # archived descriptors
/recent/           # recent descriptors
/index.json        # JSON file with all files in archive/ and recent/
/index.json.bz2    # same as /index.json, but compressed
/index.json.gz     # same as /index.json, but compressed
/index.json.xz     # same as /index.json, but compressed
I guess my original intention to put index.json directly in the root directory was to place it next to index.html and in the parent directory of archive/ and recent/ which are further described by index.json. But I guess your motivation for putting it in /index/ was to avoid cluttering the root directory any further, right?
What do you think, should we unify this and keep it unified? And if yes, who moves their index.json files? ;) I don't think they're used by anything yet, so we're unlikely to break anything. Again, happy to move mine if this makes more sense. Maybe we can briefly think of other files/directories we might be adding in the near future?
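Until the location is unified, a client that wants to work against both instances would have to probe both paths. A minimal sketch of that, assuming only the two locations and the compressed variants named in this description; the function and constant names are made up for illustration and are not part of CollecTor:

```python
from urllib.parse import urljoin

# The two index locations mentioned in this ticket (main instance, mirror).
INDEX_PATHS = ("index.json", "index/index.json")
# Uncompressed file plus the compressed variants listed for the main instance.
SUFFIXES = ("", ".xz", ".gz", ".bz2")


def candidate_index_urls(base_url, suffixes=SUFFIXES):
    """Return every URL at which an instance might serve its index file."""
    if not base_url.endswith("/"):
        base_url += "/"
    return [
        urljoin(base_url, path + suffix)
        for path in INDEX_PATHS
        for suffix in suffixes
    ]
```

Agreeing on one path as part of the protocol would make this kind of probing unnecessary.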
I would suggest changing the title of this ticket to "Define CollecTor's file-structure protocol 1.0".
For distribution we depend on that file structure and third parties downloading data also need something to rely on. So, this protocol description should be part of the CollecTor data description.
The three folders recent, archive, and index are a good basis for the protocol. Thus, we don't need to write about any other files in the root directory. And, these folders can be liberally placed wherever on the physical/virtual machine a CollecTor admin sees fit, as long as they show up under the CollecTor root. And, the html-files can even be in a different place (which is the case on the mirror).
Summary:
define CollecTor protocol 1.0
using the three folders recent, archive, and index as top folders
describe the substructure as it is hard-coded now.
I will start writing version 0.9 in order to reserve 1.0 for the change in #20228 (moved).
As #18910 (moved) depends heavily on this protocol I'm setting prio to high.
Trac: Owner: N/A to iwakeh Status: new to assigned Priority: Medium to High
Thanks for starting this! Here are some answers and some feedback:
It makes sense to specify the web-visible directories in this protocol, but what's the reason for also specifying the web-invisible out/ directory there? If the audience is developers who rely on the directory structure provided via HTTP, I'd say it's fine and even better to leave out that last directory. And if the audience is operators and contributors, then we might have to include even more directories, including the stats/ directory and others. For comparison, the Onionoo protocol specification doesn't say anything about the status/ directory which would be important for operators and contributors but which Onionoo client developers don't need to worry about.
"Shouldn't 'exit-list' be changed to 'exit-lists'?" -- Yes, we can do that. In fact, I had this on my local TODO list for years and only recently dropped it, because meh, but if you also found this confusing, then it gets above the meh threshold again. Let's do it.
"Shouldn't there be different markers for different torperf sources?" -- Maybe, but I'd rather not touch anything with the label Torperf on it unless it breaks apart or explodes. Let's wait for the switch to OnionPerf and do something reasonable there.
"The 'compression-type' is one element of "xz", "gz", or "zip". XXXX Is this true?" -- No, the only compression type that is currently in use is "xz". We did use "bz2" until a few years ago, but we recompressed all tarballs, because "xz" compresses much better. Of course, there's no guarantee that we'll stick with "xz" forever, so it might be fine to mention all possible compression types there.
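Since the compression type may change again, a client could pick the decompressor from the file suffix rather than hard-coding "xz". A rough sketch using only Python's standard library; the suffix table covers the suffixes that appear in this thread (".xz", ".gz", ".bz2"), and the function name is hypothetical:

```python
import bz2
import gzip
import lzma

# Suffix-to-decompressor table; extend it if another compression type is adopted.
DECOMPRESSORS = {
    ".xz": lzma.decompress,
    ".gz": gzip.decompress,
    ".bz2": bz2.decompress,
}


def decompress_by_suffix(file_name, data):
    """Decompress data according to file_name's suffix; pass through if unknown."""
    for suffix, decompress in DECOMPRESSORS.items():
        if file_name.endswith(suffix):
            return decompress(data)
    return data
```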
Section 2.4 says that server descriptors are sorted into tarballs by download date. That's not true, we're using published dates just like we're sorting extra-info descriptors into tarballs.
In Section 4.1.1, you ask: "Shouldn't the seconds be dropped?" -- No, because it's just coincidence that seconds are always zero. That's because the new scheduler is super precise compared to the cron-based scheduling which put a 01 or 02 there at times.
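To illustrate why the seconds field matters to parsers, here is a small sketch that splits a file name of the form seen in recent/ (for example 2016-09-28-09-05-00-server-descriptors, as used later in this thread) into its timestamp and descriptor type. The function name is made up, and the assumption that every file in recent/ follows this exact pattern is mine, not something the protocol draft states:

```python
from datetime import datetime


def split_recent_file_name(file_name):
    """Split 'YYYY-MM-DD-hh-mm-ss-<type>' into (datetime, type)."""
    # The first six dash-separated fields form the timestamp; the remainder
    # is the descriptor type, which may itself contain dashes.
    parts = file_name.split("-", 6)
    if len(parts) < 7:
        raise ValueError("unexpected file name: " + file_name)
    timestamp = datetime.strptime("-".join(parts[:6]), "%Y-%m-%d-%H-%M-%S")
    return timestamp, parts[6]
```

A parser written this way keeps working whether the seconds are 00 or, as with the old cron-based scheduling, 01 or 02.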
Also in Section 4.1.1, "Why not group extra-info according to published time?" -- I don't understand that question. Can you rephrase?
In Section 4.2.1, "What is the reason not to group according to published time?" -- This question is very related to my recent thoughts on appending multiple votes to a single file: https://trac.torproject.org/projects/tor/ticket/20228#comment:2. Basically, if we were to store server descriptors and extra-info descriptors in hourly files, I'd expect that we update a couple of those files during a single update run. (In fact, see the command and output below.) And a client who wants to stay up to date would have to download all files that have changed. Therefore it's much easier to append everything we learn in a single execution to a single file.
wget -O - https://collector.torproject.org/recent/relay-descriptors/server-descriptors/2016-09-28-09-05-00-server-descriptors | grep "^published " | cut -c1-23 | sort | uniq -c
      1 published 2016-09-28 04 # <- this comes quite late
      7 published 2016-09-28 07 # <- these, too
    786 published 2016-09-28 08 # <- one would only expect those
     16 published 2016-09-28 09 # <- and maybe a few of those
      3 published 2016-09-28 10 # <- hello, future
      1 published 2016-09-28 11 # <- and future
      1 published 2016-09-28 16 # <- and future
      1 published 2016-09-28 18 # <- hello, wrong clock
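For anyone who wants to repeat this check on an already downloaded file, the grep/cut/sort/uniq part of the pipeline can be approximated in Python. This is only a sketch of the same counting step, with a made-up function name; it assumes, like the shell command, that the first 23 characters of a "published" line cover the date and hour:

```python
from collections import Counter


def count_published_hours(descriptor_text):
    """Count 'published' lines per hour, like grep | cut -c1-23 | sort | uniq -c."""
    counts = Counter()
    for line in descriptor_text.splitlines():
        if line.startswith("published "):
            # "published " + "YYYY-MM-DD" + " " + "hh" is exactly 23 characters.
            counts[line[:23]] += 1
    return sorted(counts.items())
```

Each distinct hour in the result corresponds to one hourly file that would have to be updated, and re-downloaded by clients, if descriptors were grouped by published time.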
I didn't look at Section 5 yet, because it's not yet clear whether that section belongs in the protocol.
I've only skimmed through so far, but want to quickly reply to the question about section 5.
As we're relying on the 'out' structure to produce tars this should be documented, and the part of the directory structure inside the tars is visible to clients.
The two use-cases I have in mind are:
When parsing unpacked tar-balls part of the structure of 'out' is part of the tar-balls' structure, i.e. below month there is the 'out' structure to be found.
When running a CollecTor instance for getting access to the data, it could make sense to use the 'out' structure to further operate on the data. Here the CollecTor instance's purpose would tend to be data collection not mirroring.
So, it is useful to describe 'out', I think.
And, I think you're right that also 'stats' and 'sync' (introduced with #18910 (moved)) should be part of the document. That won't be much more text, but really clarifies what all the directories are about and gives operators an idea where they should place these directories etc. And, it will help to get new developers started, or help us when debugging or changing things in a few months.
Please find two more commits on the above branch.
The first removes the questions that are answered or in discussion and corrects the issues you noticed.
And, it adds two small sections for 'sync' and 'stats' (the latter still a placeholder).
The second commit corrects some directory names in section 5.
Regarding your question in no. 7: this is similar to no. 8. Both refer to grouping by published date vs. download date. I moved the discussion to #20228 (moved), as the question was first raised there.
Wait. Let me go back one step and ask: why are we writing this document now? Is this for ourselves, for future contributors, for operators, or even for users? And can't we update or extend the existing documentation on /index.html with the most relevant missing parts?
When I created this ticket I was thinking of coming up with a common structure for the web-facing parts of CollecTor, so that we can move forward with synchronization between CollecTor instances. I was not thinking of an implementation-level documentation of how we're using the file system, and I don't really see the urgent need for that. (When I mentioned the stats/ directory and others, I basically wanted to give an example of something that, IMHO, does not fit into the protocol rather than suggest to include it. I should have phrased that more clearly.)
Can we, for now, focus on any open questions you have about CollecTor's file structure and postpone the decision what documentation of the local file system structure we need?
And can we make a decision how we're changing existing web-facing directories like moving /index.json* to /index/index.json* on the main CollecTor instance?
Don't get me wrong, I do see the value of documentation, but I also see the cost of writing, reviewing, revising, and maintaining documentation, and in this case I don't yet see how the value is greater than the costs.
Please find my task-20234 branch with a few tweaks. Other than that I hope that we can integrate this into index.html or at least find a more compact notation. But let's merge this for now and make it better later. Let me know if you agree with my edits or want to edit more.
With the new webstats module the path description should be adapted.
Yes! Should we create a new ticket for that issue, though? Maybe "Extend CollecTor's file structure protocol by web server logs"? And do you want to prepare a patch?
Should we use the newfound spec format here?
And, add this spec (once it comes in the new format and is adapted) to Metrics web?
Yes, that's a good idea. However, with all the other open issues I'd prefer if we can put this one on hold until we have resolved at least some of them. How about we update the summary of this ticket to reflect that the only remaining task here is to "Prettify CollecTor's file structure protocol and put it on Tor Metrics"?
Trac: Description: changed from the original text (preserved above as "Old description") to "Transform into appropriate format and also add path descriptions for webstats.", followed by that old description. Component: Metrics/CollecTor to Metrics/Metrics website Priority: High to Low Summary: Define CollecTor's file-structure protocol to add CollecTor's file-structure protocol to Metrics-web
Please review the additions to CollecTor's file protocol description for webstats on this branch.
This mainly refers to the webstats spec in order to avoid duplication.