Opened 5 weeks ago

Last modified 2 weeks ago

#31204 accepted enhancement

Extend file objects in index.json to include descriptor types, publication times, and file digests

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/CollecTor Version:
Severity: Normal Keywords:
Cc: metrics-team, atagar Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

atagar suggested to extend file objects in CollecTor's index.json to include descriptor types, publication times, and file digests.

As of now, file objects in the index.json file have the following fields:

  • "path": Relative path of the file.
  • "size": Size of the file in bytes.
  • "last_modified": Timestamp when the file was last modified using pattern "YYYY-MM-DD HH:MM" in the UTC timezone.

The new fields could be defined as follows, though this is very much subject to discussion on this ticket:

  • "types": List of descriptor types as found in @type annotations of contained descriptors (optional).
  • "first_published": Earliest published timestamp (or similar) of contained descriptors (optional).
  • "last_published": Latest published timestamp (or similar) of contained descriptors (optional).
  • "sha256": SHA-256 digest of the file, encoded as base64 (optional).

All these new fields seem reasonable things to add, and I don't see why we wouldn't want to add them. The index will get bigger, but that sounds acceptable. The coding effort is non-zero, which is something we'll have to admit. But all in all, I don't see a blocker for doing this.

Implementation note: All these new fields have in common that they're not just file attributes that we can easily obtain from Java's File class. We'll have to open and read files in order to obtain these fields, and that's very time-consuming. I could see how we do this in a background thread (or thread pool) started by CollecTor's CreateIndexJson.java with a state file of some sort to avoid reprocessing files that haven't changed. And while this thread (pool) hasn't completed processing a file, the index would simply omit these new fields (not files!), which is why fields are defined as optional above.

What else did I miss? atagar, please fill in any thoughts that I left out.

Once we agree on the spec here, this could be a fine little project for a volunteer.

Child Tickets

Change History (3)

comment:1 Changed 4 weeks ago by atagar

This looks great, thanks Karsten!

comment:2 Changed 2 weeks ago by karsten

Owner: changed from metrics-team to karsten
Status: newaccepted

I started working on this today. I do have some code here that supports running in the background using a thread pool, but I'll have to spend at least another day or two on this before it's ready for review.

A few observations from writing this code and testing it locally:

  1. Reading tarballs to find out descriptor types and publication times is really time consuming. A test run with 643M of data took roughly 10 minutes on my laptop. For comparison, our archive is 95G in size, so about 150 times the size. We might want to index the archive on an external machine that is not the CollecTor host. And we need to be clear that the server will be busy for 10-20 minutes after creating new tarballs every 2 to 3 days. Neither of which being a major concern, just stating it.
  1. Interestingly, computing SHA-256 digests of tarballs only took about 5 seconds of these 10 minutes, so that's really, really cheap compared to reading tarballs and extracting descriptor types and publication times.
  1. I wonder how it will work out in practice that these new fields will be blank for 10-20 minutes for newly created tarballs. In many cases, newly created tarballs replace existing tarballs from a few days ago for which these fields were available. One effect would be that the latest published timestamp for a given descriptor type will flap between, say, middle of a month to end of the previous month, only because the tarball for the current month is replaced. Maybe we need to do something more elaborate where we put newly created tarballs into a staging area where we parse them and then move them into place.

I'll think more about these issues (mainly the third one) and work more on the code as time permits. Grabbing the ticket, because it doesn't really make sense for somebody else to re-do what I did so far.

comment:3 Changed 2 weeks ago by atagar

I started working on this today.

Wonderful, thanks Karsten! I've been making progress toward a stem collector module on my end too. The sha256 attribute will be particularly helpful to determine if download requests can be short circuited or not.

https://gitweb.torproject.org/user/atagar/stem.git/tree/stem/descriptor/collector.py?h=collector#n262

Reading tarballs to find out descriptor types and publication times is really time consuming.

What creates the tarballs? Just naive spit balling on my end, but @type annotations get added by metrics code somewhere so I wonder if the process that determines tarball @types and filenames can be shared and used by the indexer.

Feel free to disregard if this is silly. I'm pretty fuzzy on how CollecTor's architected. :P

I wonder how it will work out in practice that these new fields will be blank for 10-20 minutes for newly created tarballs

Ooph. That would definitely be confusing for users.

Note: See TracTickets for help on using tickets.