Add CollecTor's file-structure protocol

added component::metrics/website metrics-2018 owner::iwakeh priority::low resolution::implemented severity::normal status::closed type::enhancement labels

I would suggest changing the title of this ticket to "Define CollecTor's file-structure protocol 1.0".

For distribution we depend on that file structure, and third parties downloading data also need something to rely on. So this protocol description should be part of the CollecTor data description.

The three folders recent, archive, and index are a good basis for the protocol, so we don't need to write about any other files in the root directory. These folders can be placed wherever on the physical or virtual machine a CollecTor admin sees fit, as long as they show up under the CollecTor root. The HTML files can even be in a different place (which is the case on the mirror).

Summary:

define CollecTor protocol 1.0
using the three folders recent, archive, and index as top folders
describe the substructure as it is hard-coded now.

Works for me! Want to make it happen?

Trac:
Summary: Agree on common directory structure for CollecTor's web page to Define CollecTor's file-structure protocol 1.0

I will start writing version 0.9 in order to reserve 1.0 for the change in #20228 (moved). As #18910 (moved) depends heavily on this protocol I'm setting prio to high.

Trac:
Owner: N/A to iwakeh
Status: new to assigned
Priority: Medium to High

Please review my branch with the first file-structure protocol, version 0.9.

I chose plain text as the format and prefixed my questions with XXXX.

This is an important basis for completing the merge-part of collector-sync.

Trac:
Status: assigned to needs_review

Thanks for starting this! Here are some answers and some feedback:

1. It makes sense to specify the web-visible directories in this protocol, but what's the reason for also specifying the web-invisible out/ directory there? If the audience is developers who rely on the directory structure provided via HTTP, I'd say it's fine and even better to leave out that last directory. And if the audience is operators and contributors, then we might have to include even more directories, including the stats/ directory and others. For comparison, the Onionoo protocol specification doesn't say anything about the status/ directory which would be important for operators and contributors but which Onionoo client developers don't need to worry about.
2. "Shouldn't 'exit-list' be changed to 'exit-lists'?" -- Yes, we can do that. In fact, I had this on my local TODO list for years and only recently dropped it, because meh, but if you also found this confusing, then it gets above the meh threshold again. Let's do it.
3. "Shouldn't there be different markers for different torperf sources?" -- Maybe, but I'd rather not touch anything with the label Torperf on it unless it breaks apart or explodes. Let's wait for the switch to OnionPerf and do something reasonable there.
4. "The 'compression-type' is one element of "xz", "gz", or "zip". XXXX Is this true?" -- No, the only compression type that is currently in use is "xz". We did use "bz2" until a few years ago, but we recompressed all tarballs, because "xz" compresses much better. Of course, there's no guarantee that we'll stick with "xz" forever, so it might be fine to mention all possible compression types there.
5. Section 2.4 says that server descriptors are sorted into tarballs by download date. That's not true; we're using published dates, just as we do when sorting extra-info descriptors into tarballs.
6. In Section 4.1.1, you ask: "Shouldn't the seconds be dropped?" -- No, because it's just coincidence that seconds are always zero. That's because the new scheduler is super precise compared to the cron-based scheduling which put a 01 or 02 there at times.
7. Also in Section 4.1.1, "Why not group extra-info according to published time?" -- I don't understand that question. Can you rephrase?
8. In Section 4.2.1, "What is the reason not to group according to published time?" -- This question is closely related to my recent thoughts on appending multiple votes to a single file: https://trac.torproject.org/projects/tor/ticket/20228#comment:2. Basically, if we were to store server descriptors and extra-info descriptors in hourly files, I'd expect that we update a couple of those files during a single update run. (In fact, see the command and output below.) And a client who wants to stay up to date would have to download all files that have changed. Therefore it's much easier to append everything we learn in a single execution to a single file.

wget -O - https://collector.torproject.org/recent/relay-descriptors/server-descriptors/2016-09-28-09-05-00-server-descriptors | grep "^published " | cut -c1-23 | sort | uniq -c
   1 published 2016-09-28 04   # <- this comes quite late
   7 published 2016-09-28 07   # <- these, too
 786 published 2016-09-28 08   # <- one would only expect those
  16 published 2016-09-28 09   # <- and maybe a few of those
   3 published 2016-09-28 10   # <- hello, future
   1 published 2016-09-28 11   # <- and future
   1 published 2016-09-28 16   # <- and future
   1 published 2016-09-28 18   # <- hello, wrong clock
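
The same tally can be reproduced offline. What follows is a minimal Python sketch, assuming the descriptors file has already been downloaded and that its name starts with a YYYY-MM-DD-HH-MM-SS timestamp as in the URL above. The function names and the sample lines are made up for illustration and are not part of CollecTor.

```python
from collections import Counter
from datetime import datetime


def published_hours(lines):
    """Count "published" lines per hour, like grep | cut -c1-23 | sort | uniq -c."""
    counts = Counter()
    for line in lines:
        if line.startswith("published "):
            # The first 23 characters are "published YYYY-MM-DD HH".
            counts[line[:23]] += 1
    return sorted(counts.items())


def file_timestamp(file_name):
    """Parse the leading YYYY-MM-DD-HH-MM-SS part of a recent/ file name."""
    return datetime.strptime(file_name[:19], "%Y-%m-%d-%H-%M-%S")


# Made-up sample lines; a real run would iterate over the downloaded file.
sample = [
    "published 2016-09-28 08:12:45",
    "published 2016-09-28 08:59:01",
    "published 2016-09-28 10:03:17",
    "router-signature",
]
for key, count in published_hours(sample):
    print(f"{count:4d} {key}")
print(file_timestamp("2016-09-28-09-05-00-server-descriptors"))
```

Comparing the parsed file timestamp with the published hours is one way to spot descriptors that claim to come from the future.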

I didn't look at Section 5 yet, because it's still unclear whether that section belongs in the protocol.

Again, thanks for writing this document!

Trac:
Status: needs_review to needs_revision

Thanks for the thorough review!

I have only skimmed through it so far, but I want to quickly reply to the question about Section 5.

As we're relying on the 'out' structure to produce tars, this should be documented; part of the directory structure inside the tars is visible to clients.

The two use-cases I have in mind are:

When parsing unpacked tar-balls, part of the structure of 'out' is part of the tar-balls' structure, i.e., the 'out' structure is to be found below month.
When running a CollecTor instance to get access to the data, it could make sense to use the 'out' structure to operate further on the data. Here the CollecTor instance's purpose would tend to be data collection, not mirroring.

So, it is useful to describe 'out', I think.

And, I think you're right that also 'stats' and 'sync' (introduced with #18910 (moved)) should be part of the document. That won't be much more text, but really clarifies what all the directories are about and gives operators an idea where they should place these directories etc. And, it will help to get new developers started, or help us when debugging or changing things in a few months.

Please find two more commits on the above branch. The first removes the questions that are answered or in discussion and corrects the issues you noticed. It also adds two small sections for 'sync' and 'stats' (the latter still a placeholder).

The second commit corrects some directory names in section 5.

Regarding your question in no. 7: this is similar to no. 8. Both refer to grouping by published date vs. download date. I moved the discussion to #20228 (moved), as the question was first raised there.

Wait. Let me go back one step and ask: why are we writing this document now? Is this for ourselves, for future contributors, for operators, or even for users? And can't we update or extend the existing documentation on /index.html with the most relevant missing parts?

When I created this ticket I was thinking of coming up with a common structure for the web-facing parts of CollecTor, so that we can move forward with synchronization between CollecTor instances. I was not thinking of an implementation-level documentation of how we're using the file system, and I don't really see the urgent need for that. (When I mentioned the stats/ directory and others, I basically wanted to give an example of something that, IMHO, does not fit into the protocol rather than suggest to include it. I should have phrased that more clearly.)

Can we, for now, focus on any open questions you have about CollecTor's file structure and postpone the decision on what documentation of the local file system structure we need?

And can we decide how we're changing existing web-facing directories, like moving /index.json* to /index/index.json* on the main CollecTor instance?

Don't get me wrong, I do see the value of documentation, but I also see the cost of writing, reviewing, revising, and maintaining documentation, and in this case I don't yet see how the value is greater than the costs.

Yes, the audience sort of grew while writing and it's good to take a step back and answer this question first.
Suggestions for next steps:

We agree that we should write this for a client audience and not an internal one.

Keep the first four sections? Maybe adapt 5 to one about tar-ball contents (below month), as most of it is written already?
Remove 6. and 7.; these should later be part of the operator documentation.

Then there is a mixture of change proposal and documentation for the 'index*' part to be resolved.

Have version 0.9 with the "index.json*" as the main CollecTor serves it and reserve the change for 1.0 (together with the vote-issue #20228 (moved))?

Thoughts?

As discussed on IRC and described above, please find the new version 0.9 in my new branch for review.

Trac:
Status: needs_revision to needs_review

Removed the version number from ticket title, as this task is about writing this protocol for the first time, not about a particular version of it.

Trac:
Summary: Define CollecTor's file-structure protocol 1.0 to Define CollecTor's file-structure protocol

Please find my task-20234 branch with a few tweaks. Other than that I hope that we can integrate this into index.html or at least find a more compact notation. But let's merge this for now and make it better later. Let me know if you agree with my edits or want to edit more.

Trac:
Status: needs_review to merge_ready

That looks all fine. Thanks!

Yes, a more compact and nicer notation would be good. Could it be a volunteer task to think up a representation in HTML?

Great, rebased, squashed, pushed. Thanks!

I'll think more about possible representations in HTML, but if you or somebody else comes up with something first, please mention it here.

With the new webstats module the path description should be adapted.

Should we use the newfound spec format here?

And, add this spec (once it comes in the new format and is adapted) to Metrics web?

Trac:
Status: merge_ready to needs_revision

Trac:
Status: needs_revision to needs_information

Replying to iwakeh:

With the new webstats module the path description should be adapted.

Yes! Should we create a new ticket for that issue, though? Maybe "Extend CollecTor's file structure protocol by web server logs"? And do you want to prepare a patch?

Should we use the newfound spec format here?

And, add this spec (once it comes in the new format and is adapted) to Metrics web?

Yes, that's a good idea. However, with all the other open issues I'd prefer if we can put this one on hold until we have resolved at least some of them. How about we update the summary of this ticket to reflect that the only remaining task here is to "Prettify CollecTor's file structure protocol and put it on Tor Metrics"?

Example paths for webstat webserver logs:

 recent/webstats/metrics.torproject.org-meronense.torproject.org-access.log-20170905.xz
 archive/webstats/metrics.torproject.org/2017/09/05/metrics.torproject.org-meronense.torproject.org-access.log-20170905.xz
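
The relationship between the two example paths can be sketched in Python. This is only an illustration derived from the two paths above, not the webstats spec: the function name is hypothetical, and the pattern assumes that neither host name contains a hyphen, which real host names may.

```python
import re

# Assumption: <virtual host>-<physical host>-access.log-<YYYYMMDD>.xz, with
# no hyphens inside either host name. The webstats spec is the authority.
LOG_NAME = re.compile(
    r"^(?P<vhost>[^-]+)-(?P<phost>[^-]+)-access\.log-(?P<date>\d{8})\.xz$"
)


def archive_path(recent_path):
    """Derive the archive/webstats/ path from a recent/webstats/ path."""
    name = recent_path.rsplit("/", 1)[-1]
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name}")
    date = match.group("date")
    year, month, day = date[:4], date[4:6], date[6:]
    vhost = match.group("vhost")
    # The archive path groups logs by virtual host, then by year/month/day.
    return f"archive/webstats/{vhost}/{year}/{month}/{day}/{name}"


recent = (
    "recent/webstats/"
    "metrics.torproject.org-meronense.torproject.org-access.log-20170905.xz"
)
print(archive_path(recent))
```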

Trac:
Description: It looks like the CollecTor mirror serves the index.json file at a different URL (/index/index.json) than the main CollecTor instance (/index.json). In theory, we could agree on a common place for that file and all other files and consider that part of the "protocol". And I'm happy to consider changing paths on the main CollecTor instance if different paths make more sense.

Here are the current directories and files on the main CollecTor instance:

/                # start page with all the content for humans
/index.html      # same as /
/css/            # web stuff
/images/         # web stuff
/header.html     # used to style directory listings
/footer.html     # used to style directory listings
/formats.html    # not used anymore, could go away if we wanted
/archive/        # archived descriptors
/recent/         # recent descriptors
/index.json      # JSON file with all files in archive/ and recent/
/index.json.bz2  # same as /index.json, but compressed
/index.json.gz   # same as /index.json, but compressed
/index.json.xz   # same as /index.json, but compressed

I guess my original intention to put index.json directly in the root directory was to place it next to index.html and in the parent directory of archive/ and recent/ which are further described by index.json. But I guess your motivation for putting it in /index/ was to avoid cluttering the root directory any further, right?

What do you think, should we unify this and keep it unified? And if yes, who moves their index.json files? ;) I don't think they're used by anything yet, so we're unlikely to break anything. Again, happy to move mine if this makes more sense. Maybe we can briefly think of other files/directories we might be adding in the near future?

to

Transform into appropriate format and also add path descriptions for webstats.

Component: Metrics/CollecTor to Metrics/Metrics website
Priority: High to Low
Summary: Define CollecTor's file-structure protocol to add CollecTor's file-structure protocol to Metrics-web
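
The description above lists index.json next to .bz2, .gz, and .xz variants. A client that has already fetched one of these files could decode it along the following lines. This is a minimal Python sketch using only the standard library; the function name and the sample content are made up, and nothing here says what the real index contains.

```python
import bz2
import gzip
import json
import lzma

# Map the file-name suffixes from the listing to stdlib decompressors.
DECOMPRESSORS = {
    ".bz2": bz2.decompress,
    ".gz": gzip.decompress,
    ".xz": lzma.decompress,
}


def load_index(file_name, data):
    """Decode index.json or one of its compressed variants from raw bytes."""
    for suffix, decompress in DECOMPRESSORS.items():
        if file_name.endswith(suffix):
            data = decompress(data)
            break
    return json.loads(data.decode("utf-8"))


# Made-up placeholder content, only to exercise the function.
raw = json.dumps({"example": True}).encode("utf-8")
print(load_index("index.json", raw))
print(load_index("index.json.xz", lzma.compress(raw)))
```

Choosing the decompressor from the suffix keeps the client independent of whether the file was found at /index.json or /index/index.json.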

Capitalize and simplify summary.

Trac:
Summary: add CollecTor's file-structure protocol to Metrics-web to Add CollecTor's file-structure protocol

Trac:
Keywords: N/A deleted, metrics-2018 added

Trac:
Keywords: metrics-2018 deleted, metrics-2017 added

Will be completed in 2018.

Trac:
Keywords: metrics-2017 deleted, metrics-2018 added

Move to metrics-team as these are not worked on by me during the next week.

Trac:
Owner: iwakeh to metrics-team
Status: needs_information to assigned

Trac:
Status: assigned to accepted
Owner: metrics-team to iwakeh

Please review the additions to CollecTor's file protocol description for webstats on this branch. This mainly refers to the webstats spec in order to avoid duplication.

Trac:
Status: accepted to needs_review

Looks good. Merged!

It looks like there's nothing to review at the moment. Not sure what remains to be done.

Trac:
Status: needs_review to new

Adding metrics-team to cc

Trac:
Cc: iwakeh to iwakeh, metrics-team

It seems that closing this ticket was simply forgotten after the merge (cf. comment:26).

Closing now.

Trac:
Status: new to closed
Resolution: N/A to implemented

closed

mentioned in issue #20287 (moved)
