Opened 5 months ago

Closed 4 weeks ago

Last modified 8 days ago

#31204 closed enhancement (fixed)

Extend file objects in index.json to include descriptor types, publication times, and file digests

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/CollecTor Version:
Severity: Normal Keywords:
Cc: metrics-team, atagar Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

atagar suggested to extend file objects in CollecTor's index.json to include descriptor types, publication times, and file digests.

As of now, file objects in the index.json file have the following fields:

  • "path": Relative path of the file.
  • "size": Size of the file in bytes.
  • "last_modified": Timestamp when the file was last modified using pattern "YYYY-MM-DD HH:MM" in the UTC timezone.

The new fields could be defined as follows, though this is very much subject to discussion on this ticket:

  • "types": List of descriptor types as found in @type annotations of contained descriptors (optional).
  • "first_published": Earliest published timestamp (or similar) of contained descriptors (optional).
  • "last_published": Latest published timestamp (or similar) of contained descriptors (optional).
  • "sha256": SHA-256 digest of the file, encoded as base64 (optional).

All these new fields seem reasonable things to add, and I don't see why we wouldn't want to add them. The index will get bigger, but that sounds acceptable. The coding effort is non-zero, which is something we'll have to admit. But all in all, I don't see a blocker for doing this.

Implementation note: All these new fields have in common that they're not just file attributes that we can easily obtain from Java's File class. We'll have to open and read files in order to obtain these fields, and that's very time-consuming. I could see how we do this in a background thread (or thread pool) started by CollecTor's CreateIndexJson.java with a state file of some sort to avoid reprocessing files that haven't changed. And while this thread (pool) hasn't completed processing a file, the index would simply omit these new fields (not files!), which is why fields are defined as optional above.

What else did I miss? atagar, please fill in any thoughts that I left out.

Once we agree on the spec here, this could be a fine little project for a volunteer.

Child Tickets

Change History (12)

comment:1 Changed 5 months ago by atagar

This looks great, thanks Karsten!

comment:2 Changed 4 months ago by karsten

Owner: changed from metrics-team to karsten
Status: newaccepted

I started working on this today. I do have some code here that supports running in the background using a thread pool, but I'll have to spend at least another day or two on this before it's ready for review.

A few observations from writing this code and testing it locally:

  1. Reading tarballs to find out descriptor types and publication times is really time consuming. A test run with 643M of data took roughly 10 minutes on my laptop. For comparison, our archive is 95G in size, so about 150 times the size. We might want to index the archive on an external machine that is not the CollecTor host. And we need to be clear that the server will be busy for 10-20 minutes after creating new tarballs every 2 to 3 days. Neither of which being a major concern, just stating it.
  1. Interestingly, computing SHA-256 digests of tarballs only took about 5 seconds of these 10 minutes, so that's really, really cheap compared to reading tarballs and extracting descriptor types and publication times.
  1. I wonder how it will work out in practice that these new fields will be blank for 10-20 minutes for newly created tarballs. In many cases, newly created tarballs replace existing tarballs from a few days ago for which these fields were available. One effect would be that the latest published timestamp for a given descriptor type will flap between, say, middle of a month to end of the previous month, only because the tarball for the current month is replaced. Maybe we need to do something more elaborate where we put newly created tarballs into a staging area where we parse them and then move them into place.

I'll think more about these issues (mainly the third one) and work more on the code as time permits. Grabbing the ticket, because it doesn't really make sense for somebody else to re-do what I did so far.

comment:3 Changed 4 months ago by atagar

I started working on this today.

Wonderful, thanks Karsten! I've been making progress toward a stem collector module on my end too. The sha256 attribute will be particularly helpful to determine if download requests can be short circuited or not.

https://gitweb.torproject.org/user/atagar/stem.git/tree/stem/descriptor/collector.py?h=collector#n262

Reading tarballs to find out descriptor types and publication times is really time consuming.

What creates the tarballs? Just naive spit balling on my end, but @type annotations get added by metrics code somewhere so I wonder if the process that determines tarball @types and filenames can be shared and used by the indexer.

Feel free to disregard if this is silly. I'm pretty fuzzy on how CollecTor's architected. :P

I wonder how it will work out in practice that these new fields will be blank for 10-20 minutes for newly created tarballs

Ooph. That would definitely be confusing for users.

comment:4 Changed 4 months ago by karsten

After thinking more about this I believe that we should implement something like the staging area concept mentioned above. Everything else would indeed be too confusing. Unfortunately, this makes the implementation slightly more complex and pushes this ticket back in the list. We should still do it, but it might take a while longer until it's there.

comment:5 Changed 4 months ago by atagar

Thanks Karsten, sounds good. No rush. :)

comment:6 Changed 6 weeks ago by karsten

I now have a running version of the "slightly more complex" implementation mentioned above. It still contains a dozen TODOs and needs more testing, but I don't see any major blockers anymore.

Here's a sample output index.json file produced by my code:

{
  "index_created": "2019-10-24 14:40",
  "build_revision": "f602b218",
  "path": "https://collector.torproject.org",
  "directories": [
    {
      "path": "archive",
      "directories": [
        {
          "path": "relay-descriptors",
          "directories": [
            {
              "path": "consensuses",
              "files": [
                {
                  "path": "consensuses-2019-10.tar.xz",
                  "size": 16798840,
                  "last_modified": "2019-10-23 03:44",
                  "types": [
                    "network-status-consensus-3 1.0"
                  ],
                  "first_published": "2019-10-01 00:00",
                  "last_published": "2019-10-23 03:00",
                  "sha256": "d8fhWyp3Gft/uFD4x1Fwu4IcBJsj6xGb2r/J3UzAZB8="
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Any thoughts on this output before I finalize my code and put it here for review?

comment:7 Changed 6 weeks ago by atagar

Thanks Karsten, that looks great to me!

comment:8 Changed 6 weeks ago by karsten

Status: acceptedneeds_review

Finally, my code is ready for review! It's commit 251f5a8 in my task-31204 branch. As indicated above, it's a bit more complex than originally thought. But I believe the design is robust, and the code is fully documented and tested.

comment:9 Changed 5 weeks ago by irl

Status: needs_reviewmerge_ready

LGTM. Tested with a local build of metrics-lib with a modified junittest.policy, I could test again once metrics-{base,lib} changes are made and released, but it's probably fine.

comment:10 Changed 5 weeks ago by karsten

Status: merge_readyneeds_review

When running this code on the full archive I ran into another bug where we would keep scheduling new indexer tasks for files that are already in the queue for being indexed. I wrote a fix and test in commit 5f3ccc1 in my task-31204 branch. Please take another look! I'd like to release and deploy this together with the latest #19332 fix.

comment:11 Changed 4 weeks ago by karsten

Resolution: fixed
Status: needs_reviewclosed

On second thought this bugfix doesn't require another code review. I just merged this code into master and will release and deloy it later today. Closing. Thanks!

comment:12 Changed 8 days ago by atagar

Thanks Karsten! Pushed the stem adjustments to take advantage of these. Much appreciated.

Note: See TracTickets for help on using tickets.