Opened 3 years ago

Closed 15 months ago

#20234 closed enhancement (implemented)

Add CollecTor's file-structure protocol

Reported by: karsten
Owned by: iwakeh
Priority: Low
Milestone:
Component: Metrics/Website
Version:
Severity: Normal
Keywords: metrics-2018
Cc: iwakeh, metrics-team
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by iwakeh)

Transform the file-structure protocol into an appropriate format and also add path descriptions for webstats.

Old description: It looks like the CollecTor mirror serves the index.json file at a different URL (/index/index.json) than the main CollecTor instance (/index.json). In theory, we could agree on a common place for that file and all other files and consider that part of the "protocol". And I'm happy to consider changing paths on the main CollecTor instance if different paths make more sense.

Here are the current directories and files on the main CollecTor instance:

/                # start page with all the content for humans
/index.html      # same as /
/css/            # web stuff
/images/         # web stuff
/header.html     # used to style directory listings
/footer.html     # used to style directory listings
/formats.html    # not used anymore, could go away if we wanted
/archive/        # archived descriptors
/recent/         # recent descriptors
/index.json      # JSON file with all files in archive/ and recent/
/index.json.bz2  # same as /index.json, but compressed
/index.json.gz   # same as /index.json, but compressed
/index.json.xz   # same as /index.json, but compressed

I guess my original intention to put index.json directly in the root directory was to place it next to index.html and in the parent directory of archive/ and recent/ which are further described by index.json. But I guess your motivation for putting it in /index/ was to avoid cluttering the root directory any further, right?

What do you think, should we unify this and keep it unified? And if yes, who moves their index.json files? ;) I don't think they're used by anything yet, so we're unlikely to break anything. Again, happy to move mine if this makes more sense. Maybe we can briefly think of other files/directories we might be adding in the near future?
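Until the location is unified, a client that wants to read the index from either instance would have to try both paths. A minimal sketch in Python, assuming only the two locations named above (/index.json and /index/index.json); the function names are hypothetical and not part of CollecTor:

```python
import json
import urllib.request

# The two index locations mentioned in this ticket; nothing else is assumed.
INDEX_PATHS = ("/index.json", "/index/index.json")


def candidate_index_urls(base_url):
    """Return the URLs where an instance might serve its index.json."""
    base = base_url.rstrip("/")
    return [base + path for path in INDEX_PATHS]


def fetch_index(base_url, timeout=30):
    """Try each candidate URL in order and return the first parsed index.

    Hypothetical helper: raises LookupError if no candidate can be fetched.
    """
    for url in candidate_index_urls(base_url):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return json.load(response)
        except (OSError, ValueError):
            continue  # not served here (or not valid JSON), try the next path
    raise LookupError("no index.json found under " + base_url)
```

With a common location agreed on as part of the protocol, the loop over candidates would no longer be needed.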

Child Tickets

Ticket   Type    Status  Owner   Summary
#20287   defect  closed  iwakeh  Perform another review of CollecTor's file protocol and fix any remaining differences to the code

Change History (29)

comment:1 Changed 3 years ago by iwakeh

I would suggest changing the title of this ticket to "Define CollecTor's file-structure protocol 1.0".

For distribution we depend on that file structure, and third parties downloading data also need something to rely on. So, this protocol description should be part of the CollecTor data description.

The three folders recent, archive, and index are a good basis for the protocol. Thus, we don't need to write about any other files in the root directory. These folders can be placed anywhere on the physical or virtual machine that a CollecTor admin sees fit, as long as they show up under the CollecTor root. The HTML files can even be in a different place (which is the case on the mirror).

Summary:

  • define CollecTor protocol 1.0
  • using the three folders recent, archive, and index as top folders
  • describe the substructure as it is hard-coded now.

comment:2 Changed 3 years ago by karsten

Summary: changed from "Agree on common directory structure for CollecTor's web page" to "Define CollecTor's file-structure protocol 1.0"

Works for me! Want to make it happen?

comment:3 Changed 3 years ago by iwakeh

Owner: set to iwakeh
Priority: changed from Medium to High
Status: changed from new to assigned

I will start writing version 0.9 in order to reserve 1.0 for the change in #20228.
As #18910 depends heavily on this protocol I'm setting prio to high.

comment:4 Changed 3 years ago by iwakeh

Status: changed from assigned to needs_review

Please review my branch with the first file-structure protocol, version 0.9.

I chose plain text as the format and prefixed my questions with XXXX.

This is an important basis for completing the merge-part of collector-sync.

comment:5 Changed 3 years ago by karsten

Status: changed from needs_review to needs_revision

Thanks for starting this! Here are some answers and some feedback:

  • It makes sense to specify the web-visible directories in this protocol, but what's the reason for also specifying the web-invisible out/ directory there? If the audience is developers who rely on the directory structure provided via HTTP, I'd say it's fine and even better to leave out that last directory. And if the audience is operators and contributors, then we might have to include even more directories, including the stats/ directory and others. For comparison, the Onionoo protocol specification doesn't say anything about the status/ directory which would be important for operators and contributors but which Onionoo client developers don't need to worry about.
  • "Shouldn't 'exit-list' be changed to 'exit-lists'?" -- Yes, we can do that. In fact, I had this on my local TODO list for years and only recently dropped it, because meh, but if you also found this confusing, then it gets above the meh threshold again. Let's do it.
  • "Shouldn't there be different markers for different torperf sources?" -- Maybe, but I'd rather not touch anything with the label Torperf on it unless it breaks apart or explodes. Let's wait for the switch to OnionPerf and do something reasonable there.
  • "The 'compression-type' is one element of "xz", "gz", or "zip". XXXX Is this true?" -- No, the only compression type that is currently in use is "xz". We did use "bz2" until a few years ago, but we recompressed all tarballs, because "xz" compresses much better. Of course, there's no guarantee that we'll stick with "xz" forever, so it might be fine to mention all possible compression types there.
  • Section 2.4 says that server descriptors are sorted into tarballs by download date. That's not true; we sort them by published date, just as we sort extra-info descriptors into tarballs.
  • In Section 4.1.1, you ask: "Shouldn't the seconds be dropped?" -- No, because it's just coincidence that seconds are always zero. That's because the new scheduler is super precise compared to the cron-based scheduling which put a 01 or 02 there at times.
  • Also in Section 4.1.1, "Why not group extra-info according to published time?" -- I don't understand that question. Can you rephrase?
  • In Section 4.2.1, "What is the reason _not_ to group according to published time?" -- This question is closely related to my recent thoughts on appending multiple votes to a single file: https://trac.torproject.org/projects/tor/ticket/20228#comment:2. Basically, if we were to store server descriptors and extra-info descriptors in hourly files, I'd expect that we would update a couple of those files during a single update run. (In fact, see the command and output below.) And a client who wants to stay up to date would have to download all files that have changed. Therefore it's much easier to append everything we learn in a single execution to a single file.
wget -O - https://collector.torproject.org/recent/relay-descriptors/server-descriptors/2016-09-28-09-05-00-server-descriptors | grep "^published " | cut -c1-23 | sort | uniq -c
   1 published 2016-09-28 04   # <- this comes quite late
   7 published 2016-09-28 07   # <- these, too
 786 published 2016-09-28 08   # <- one would only expect those
  16 published 2016-09-28 09   # <- and maybe a few of those
   3 published 2016-09-28 10   # <- hello, future
   1 published 2016-09-28 11   # <- and future
   1 published 2016-09-28 16   # <- and future
   1 published 2016-09-28 18   # <- hello, wrong clock
  • I didn't look at Section 5 yet, because it's not yet clear whether that section belongs in the protocol.
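
The shell pipeline above can also be sketched in Python, which makes the per-hour counts easier to reuse. This is only an illustration: it assumes, as the grep pattern does, that each descriptor carries a line starting with "published " followed by a date and time. The function name and the sample lines are made up for this example and are not real descriptor data.

```python
from collections import Counter


def count_published_hours(descriptor_text):
    """Count 'published' lines per hour, like grep | cut -c1-23 | sort | uniq -c."""
    counts = Counter()
    for line in descriptor_text.splitlines():
        if line.startswith("published "):
            # The first 23 characters are "published YYYY-MM-DD HH".
            counts[line[:23]] += 1
    return sorted(counts.items())


# Made-up sample input, only to show the call.
sample = (
    "published 2016-09-28 08:12:45\n"
    "published 2016-09-28 08:59:01\n"
    "published 2016-09-28 09:00:30\n"
)
for key, count in count_published_hours(sample):
    print(count, key)
```

Run against a real file from recent/, a wide spread of hours in the result would show the same effect as the output above: one update run touches many published hours.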

Again, thanks for writing this document!

comment:6 Changed 3 years ago by iwakeh

Thanks for the thorough review!

I have only skimmed through it so far, but I want to quickly reply to the question about Section 5.

As we're relying on the 'out' structure to produce tars this should be documented, and the part of the directory structure inside the tars is visible to clients.

The two use-cases I have in mind are:

  1. When parsing unpacked tar-balls, part of the 'out' structure is also part of the tar-balls' structure, i.e., the 'out' structure is found below the month level.
  2. When running a CollecTor instance for getting access to the data, it could make sense to use the 'out' structure to further operate on the data. Here the CollecTor instance's purpose would tend to be data collection not mirroring.

So, it is useful to describe 'out', I think.

And, I think you're right that also 'stats' and 'sync' (introduced with #18910) should be part of the document. That won't be much more text, but really clarifies what all the directories are about and gives operators an idea where they should place these directories etc. And, it will help to get new developers started, or help us when debugging or changing things in a few months.

comment:7 Changed 3 years ago by iwakeh

Please find two more commits on the above branch.
The first removes the questions that are answered or in discussion and corrects the issues you noticed.
It also adds two small sections for 'sync' and 'stats' (the latter still a placeholder).

The second commit corrects some directory names in section 5.

Regarding your question in no. 7: this is similar to no. 8. Both refer to grouping by published date vs. download date. I moved the discussion to #20228, as the question was first raised there.

comment:8 Changed 3 years ago by karsten

Wait. Let me go back one step and ask: why are we writing this document now? Is this for ourselves, for future contributors, for operators, or even for users? And can't we update or extend the existing documentation on /index.html with the most relevant missing parts?

When I created this ticket I was thinking of coming up with a common structure for the web-facing parts of CollecTor, so that we can move forward with synchronization between CollecTor instances. I was not thinking of an implementation-level documentation of how we're using the file system, and I don't really see the urgent need for that. (When I mentioned the stats/ directory and others, I basically wanted to give an example of something that, IMHO, does not fit into the protocol rather than suggest to include it. I should have phrased that more clearly.)

Can we, for now, focus on any open questions you have about CollecTor's file structure and postpone the decision what documentation of the local file system structure we need?

And can we make a decision how we're changing existing web-facing directories like moving /index.json* to /index/index.json* on the main CollecTor instance?

Don't get me wrong, I do see the value of documentation, but I also see the cost of writing, reviewing, revising, and maintaining documentation, and in this case I don't yet see how the value is greater than the costs.

comment:9 Changed 3 years ago by iwakeh

Yes, the audience sort of grew while writing and it's good to take a step back and answer this question first.
Suggestions for next steps:

  • We agree that we should write this for a client audience, not an internal one.
    • Keep the first four sections? Maybe adapt 5 to one about tar-ball contents (below month), as most of it is written already?
    • Remove 6 and 7; these should later be part of the operator documentation.
  • Then there is a mixture of change proposal and documentation for the 'index*' part to be resolved.
    • Have version 0.9 with the "index.json*" as the main CollecTor serves it and reserve the change for 1.0 (together with the vote-issue #20228)?

Thoughts?

comment:10 Changed 3 years ago by iwakeh

Status: changed from needs_revision to needs_review

As discussed on IRC and described above, please find the new version 0.9 in my new branch for review.

comment:11 Changed 3 years ago by iwakeh

Summary: changed from "Define CollecTor's file-structure protocol 1.0" to "Define CollecTor's file-structure protocol"

Removed the version number from ticket title, as this task is about writing this protocol for the first time, not about a particular version of it.

comment:12 Changed 3 years ago by karsten

Status: changed from needs_review to merge_ready

Please find my task-20234 branch with a few tweaks. Other than that I hope that we can integrate this into index.html or at least find a more compact notation. But let's merge this for now and make it better later. Let me know if you agree with my edits or want to edit more.

comment:13 Changed 3 years ago by iwakeh

That looks all fine. Thanks!

Yes, a more compact and nicer notation would be good.
Could it be a volunteer task to think up a representation in HTML?

comment:14 Changed 3 years ago by karsten

Great, rebased, squashed, pushed. Thanks!

I'll think more about possible representations in HTML, but if you or somebody else comes up with something first, please mention it here.

comment:15 Changed 22 months ago by iwakeh

Status: changed from merge_ready to needs_revision

With the new webstats module the path description should be adapted.

Should we use the newfound spec format here?

And, add this spec (once it comes in the new format and is adapted) to Metrics web?

comment:16 Changed 22 months ago by iwakeh

Status: changed from needs_revision to needs_information

comment:17 in reply to:  15 Changed 22 months ago by karsten

Replying to iwakeh:

> With the new webstats module the path description should be adapted.

Yes! Should we create a new ticket for that issue, though? Maybe "Extend CollecTor's file structure protocol by web server logs"? And do you want to prepare a patch?

> Should we use the newfound spec format here?
>
> And, add this spec (once it comes in the new format and is adapted) to Metrics web?

Yes, that's a good idea. However, with all the other open issues I'd prefer if we can put this one on hold until we have resolved at least some of them. How about we update the summary of this ticket to reflect that the only remaining task here is to "Prettify CollecTor's file structure protocol and put it on Tor Metrics"?

comment:18 Changed 22 months ago by iwakeh

Component: changed from Metrics/CollecTor to Metrics/Metrics website
Description: modified (diff)
Priority: changed from High to Low
Summary: changed from "Define CollecTor's file-structure protocol" to "add CollecTor's file-structure protocol to Metrics-web"

Example paths for webstats web server logs:

 recent/webstats/metrics.torproject.org-meronense.torproject.org-access.log-20170905.xz
 archive/webstats/metrics.torproject.org/2017/09/05/metrics.torproject.org-meronense.torproject.org-access.log-20170905.xz
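
The relation between the two example paths can be sketched as follows. This is inferred from this single pair of paths only, not taken from the webstats spec; in particular, the pattern assumes that neither host name contains a hyphen, and the names LOG_NAME and archive_path are hypothetical.

```python
import re

# Pattern inferred from the single example above; assumes that neither host
# name contains a hyphen, which may not hold for other hosts.
LOG_NAME = re.compile(
    r"^(?P<vhost>[^-]+)-(?P<phost>[^-]+)-access\.log-"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})\.xz$"
)


def archive_path(recent_path):
    """Map a recent/webstats/ path to the matching archive/webstats/ path."""
    prefix = "recent/webstats/"
    if not recent_path.startswith(prefix):
        raise ValueError("not a recent webstats path: " + recent_path)
    name = recent_path[len(prefix):]
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError("unexpected log file name: " + name)
    # Archive layout: virtual host, then year/month/day taken from the file name.
    return "archive/webstats/{vhost}/{year}/{month}/{day}/{name}".format(
        name=name, **match.groupdict()
    )
```

Applied to the recent/ path above, the function is meant to produce the archive/ path shown next to it.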

comment:19 Changed 21 months ago by karsten

Summary: changed from "add CollecTor's file-structure protocol to Metrics-web" to "Add CollecTor's file-structure protocol"

Capitalize and simplify summary.

comment:20 Changed 21 months ago by karsten

Keywords: metrics-2018 added

comment:21 Changed 21 months ago by karsten

Keywords: metrics-2017 added; metrics-2018 removed

comment:22 Changed 18 months ago by iwakeh

Keywords: metrics-2018 added; metrics-2017 removed

Will be completed in 2018.

comment:23 Changed 17 months ago by iwakeh

Owner: changed from iwakeh to metrics-team
Status: changed from needs_information to assigned

Moving to metrics-team, as I won't be working on these during the next week.

comment:24 Changed 16 months ago by iwakeh

Owner: changed from metrics-team to iwakeh
Status: changed from assigned to accepted

comment:25 Changed 16 months ago by iwakeh

Status: changed from accepted to needs_review

Please review the additions to CollecTor's file protocol description for webstats on this branch.
This mainly refers to the webstats spec in order to avoid duplication.

comment:26 Changed 16 months ago by karsten

Looks good. Merged!

comment:27 Changed 15 months ago by karsten

Status: changed from needs_review to new

It looks like there's nothing to review at the moment. Not sure what remains to be done.

comment:28 Changed 15 months ago by irl

Cc: metrics-team added

Adding metrics-team to cc

comment:29 Changed 15 months ago by iwakeh

Resolution: implemented
Status: changed from new to closed

It seems that after the merge (cf. comment:26), closing this ticket was simply forgotten.

Closing now.
