Extend file objects in index.json to include descriptor types, publication times, and file digests

added component::metrics/collector owner::karsten priority::medium resolution::fixed severity::normal status::closed type::enhancement labels

This looks great, thanks Karsten!

I started working on this today. I do have some code here that supports running in the background using a thread pool, but I'll have to spend at least another day or two on this before it's ready for review.

A few observations from writing this code and testing it locally:

Reading tarballs to find out descriptor types and publication times is really time consuming. A test run with 643M of data took roughly 10 minutes on my laptop. For comparison, our archive is 95G in size, so about 150 times the size. We might want to index the archive on an external machine that is not the CollecTor host. And we need to be clear that the server will be busy for 10-20 minutes after creating new tarballs every 2 to 3 days. Neither of these is a major concern; I'm just stating it.
Interestingly, computing SHA-256 digests of tarballs only took about 5 seconds of these 10 minutes, so that's really, really cheap compared to reading tarballs and extracting descriptor types and publication times.
I wonder how it will work out in practice that these new fields will be blank for 10-20 minutes for newly created tarballs. In many cases, newly created tarballs replace existing tarballs from a few days ago for which these fields were available. One effect would be that the latest published timestamp for a given descriptor type will flap between, say, the middle of a month and the end of the previous month, only because the tarball for the current month is replaced. Maybe we need to do something more elaborate where we put newly created tarballs into a staging area where we parse them and then move them into place.
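To illustrate the first observation, here is a minimal Python sketch of scanning a tarball for `@type` annotations. It is not CollecTor's indexer code; the function name `descriptor_types` and the `max_line` parameter are made up for this example, and it assumes every descriptor file starts with an `@type` line. Even though only the first line of each member is read, the compressed stream still has to be decompressed from start to end, which is one plausible reason this step dominates the run time.

```python
import tarfile


def descriptor_types(tarball_path, max_line=256):
    """Return the sorted, distinct @type annotations found in a tarball.

    Hypothetical helper: assumes each descriptor file begins with a line
    such as "@type network-status-consensus-3 1.0".
    """
    types = set()
    # Mode "r:*" lets tarfile detect the compression (xz, gz, ...) itself.
    with tarfile.open(tarball_path, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # Read at most max_line bytes of the first line of this member.
            first_line = archive.extractfile(member).readline(max_line)
            if first_line.startswith(b"@type "):
                annotation = first_line[len(b"@type "):]
                types.add(annotation.decode("ascii", "replace").strip())
    return sorted(types)
```

Extracting publication times would additionally require parsing each descriptor, which this sketch deliberately leaves out.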

I'll think more about these issues (mainly the third one) and work more on the code as time permits. Grabbing the ticket, because it doesn't really make sense for somebody else to re-do what I did so far.

Trac:
Owner: metrics-team to karsten
Status: new to accepted

> I started working on this today.

Wonderful, thanks Karsten! I've been making progress toward a stem collector module on my end too. The sha256 attribute will be particularly helpful to determine if download requests can be short circuited or not.

https://gitweb.torproject.org/user/atagar/stem.git/tree/stem/descriptor/collector.py?h=collector#n262
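The short-circuit idea could look roughly like the following Python sketch. It is not the stem implementation; `sha256_base64` and `needs_download` are hypothetical names, and it assumes the `sha256` attribute holds a base64-encoded SHA-256 digest of the file, as the sample value posted later in this ticket suggests.

```python
import base64
import hashlib


def sha256_base64(path, chunk_size=1 << 20):
    """Return the base64-encoded SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def needs_download(local_path, index_entry):
    """Decide whether a file listed in index.json must be fetched again.

    index_entry is a file object from index.json; the "sha256" key is
    assumed to be optional (it may be absent while indexing is pending).
    """
    expected = index_entry.get("sha256")
    if expected is None:
        return True  # no digest to compare against, so be conservative
    try:
        return sha256_base64(local_path) != expected
    except FileNotFoundError:
        return True  # nothing cached locally yet
```

Reading in chunks keeps memory use flat even for multi-gigabyte tarballs.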

> Reading tarballs to find out descriptor types and publication times is really time consuming.

What creates the tarballs? Just naive spitballing on my end, but @type annotations get added by metrics code somewhere, so I wonder if the process that determines tarball @types and filenames can be shared and used by the indexer.

Feel free to disregard if this is silly. I'm pretty fuzzy on how CollecTor's architected. :P

> I wonder how it will work out in practice that these new fields will be blank for 10-20 minutes for newly created tarballs

Ooph. That would definitely be confusing for users.

After thinking more about this I believe that we should implement something like the staging area concept mentioned above. Everything else would indeed be too confusing. Unfortunately, this makes the implementation slightly more complex and pushes this ticket back in the list. We should still do it, but it might take a while longer until it's there.
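A rough Python sketch of the staging-area concept follows. It is only an illustration of the idea, not the design that was eventually implemented; `publish_staged`, the `index` dict, and the `index_file` callable are all hypothetical. The point is that the slow parsing happens on the staged copy, so the previously published tarball and its index entry stay visible until the new metadata is ready.

```python
import os


def publish_staged(staged_path, final_path, index, index_file):
    """Index a staged tarball, then move it into place.

    index maps published paths to metadata dicts; index_file is a callable
    that parses a tarball and returns its metadata. Both are assumptions
    made for this sketch.
    """
    # Slow step: runs while the old file and its metadata are still served.
    metadata = index_file(staged_path)
    # os.replace overwrites the destination; on POSIX it is atomic when
    # both paths are on the same file system.
    os.replace(staged_path, final_path)
    # Publish the new metadata only once the new file is in place.
    index[final_path] = metadata
    return metadata
```

There is still a brief window between the move and the index update; a real implementation would have to decide how to handle that.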

Thanks Karsten, sounds good. No rush. :)

I now have a running version of the "slightly more complex" implementation mentioned above. It still contains a dozen TODOs and needs more testing, but I don't see any major blockers anymore.

Here's a sample output index.json file produced by my code:

{
  "index_created": "2019-10-24 14:40",
  "build_revision": "f602b218",
  "path": "https://collector.torproject.org",
  "directories": [
    {
      "path": "archive",
      "directories": [
        {
          "path": "relay-descriptors",
          "directories": [
            {
              "path": "consensuses",
              "files": [
                {
                  "path": "consensuses-2019-10.tar.xz",
                  "size": 16798840,
                  "last_modified": "2019-10-23 03:44",
                  "types": [
                    "network-status-consensus-3 1.0"
                  ],
                  "first_published": "2019-10-01 00:00",
                  "last_published": "2019-10-23 03:00",
                  "sha256": "d8fhWyp3Gft/uFD4x1Fwu4IcBJsj6xGb2r/J3UzAZB8="
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Any thoughts on this output before I finalize my code and put it here for review?
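For readers who want to consume this structure, here is a small Python sketch that walks the nested `directories`/`files` layout shown in the sample and derives the latest publication time per descriptor type. The helper names `iter_files` and `latest_published` are made up for this example, and it assumes the new fields may be missing for files that have not been indexed yet.

```python
def iter_files(node, prefix=""):
    """Yield (full_path, file_object) pairs from an index.json tree."""
    path = node["path"] if not prefix else prefix + "/" + node["path"]
    for file_obj in node.get("files", []):
        yield path + "/" + file_obj["path"], file_obj
    for directory in node.get("directories", []):
        yield from iter_files(directory, path)


def latest_published(index):
    """Map each descriptor type to its newest "last_published" value."""
    latest = {}
    for _, file_obj in iter_files(index):
        last = file_obj.get("last_published")
        if last is None:
            continue  # file not indexed yet, fields are absent
        for descriptor_type in file_obj.get("types", []):
            # "YYYY-MM-DD HH:MM" strings sort chronologically as text.
            if descriptor_type not in latest or last > latest[descriptor_type]:
                latest[descriptor_type] = last
    return latest
```

Usage would be along the lines of `latest_published(json.load(open("index.json")))`, with `json` imported from the standard library.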

Thanks Karsten, that looks great to me!

Finally, my code is ready for review! It's commit 251f5a8 in my task-31204 branch. As indicated above, it's a bit more complex than originally thought. But I believe the design is robust, and the code is fully documented and tested.

Trac:
Status: accepted to needs_review

LGTM. Tested with a local build of metrics-lib with a modified junittest.policy. I could test again once the metrics-{base,lib} changes are made and released, but it's probably fine.

Trac:
Status: needs_review to merge_ready

When running this code on the full archive I ran into another bug where we would keep scheduling new indexer tasks for files that are already in the queue for being indexed. I wrote a fix and test in commit 5f3ccc1 in my task-31204 branch. Please take another look! I'd like to release and deploy this together with the latest #19332 (moved) fix.
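The class of bug described here, scheduling a task again for a file that is already queued, can be avoided by tracking pending paths. The following Python sketch shows one way to do that with a thread pool; it is not the fix from commit 5f3ccc1, and `IndexerScheduler` and its `index_file` callable are hypothetical names used only for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class IndexerScheduler:
    """Run indexer tasks in the background, at most one per path."""

    def __init__(self, index_file, max_workers=2):
        self._index_file = index_file  # callable: path -> metadata
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = set()          # paths queued or currently running
        self._lock = threading.Lock()
        self.results = {}

    def schedule(self, path):
        """Queue a path for indexing; return False if it is already queued."""
        with self._lock:
            if path in self._pending:
                return False
            self._pending.add(path)
        self._pool.submit(self._run, path)
        return True

    def _run(self, path):
        try:
            metadata = self._index_file(path)
            with self._lock:
                self.results[path] = metadata
        finally:
            # Always clear the marker so a later change can be re-indexed.
            with self._lock:
                self._pending.discard(path)

    def shutdown(self):
        """Wait for all queued tasks to finish."""
        self._pool.shutdown(wait=True)
```

Checking and adding to the pending set under the same lock is what prevents two concurrent callers from both queuing the same file.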

Trac:
Status: merge_ready to needs_review

On second thought, this bugfix doesn't require another code review. I just merged this code into master and will release and deploy it later today. Closing. Thanks!

Trac:
Status: needs_review to closed
Resolution: N/A to fixed

Thanks Karsten! Pushed the stem adjustments to take advantage of these. Much appreciated.

closed

mentioned in issue #31866 (moved)

mentioned in issue #32280 (moved)

mentioned in issue #32441 (moved)

mentioned in issue #32660 (moved)
