Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#17321 closed enhancement (implemented)

Index to better support downloaders

Reported by: atagar
Owned by: karsten
Priority: High
Milestone:
Component: Metrics/CollecTor
Version:
Severity: Major
Keywords:
Cc: iwakeh
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Hi Karsten, as discussed at the dev meeting it would be sweeeeeet for Stem to have a CollecTorDownloader class that supports getting and processing descriptors from CollecTor.

Presently you read and process the Apache index html, but it could be a lot nicer to have a machine-readable index.json file instead. This is the ticket for it.

To support future expansion it would be nice to be able to include additional metadata, so I propose a 'contents' hash, such as...

https://collector.torproject.org/recent/index.json

{
  'resource': 'https://collector.torproject.org/recent/index.json',
  'contents': [
    {
      'name': 'bridge-descriptors',
      'type': 'directory',
      'last_modified': '31-May-2014 09:49',  # or a unix timestamp?
      'url': 'https://collector.torproject.org/recent/bridge-descriptors/'
    },
    {
      'name': 'exit-lists',
      'type': 'directory',
      'last_modified': '11-Oct-2015 15:02',
      'url': 'https://collector.torproject.org/recent/exit-lists/'
    },
    {
      'name': 'example-file.txt',
      'type': 'file',
      'last_modified': '31-May-2014 09:50',
      'size': 42901,
      'url': 'https://collector.torproject.org/recent/example-file.txt'
    }
  ]
}

Attachments (2)

index.json (232.3 KB) - added by karsten 4 years ago.
pretty-index.json (421.8 KB) - added by karsten 4 years ago.

Change History (21)

comment:1 Changed 4 years ago by karsten

Thanks for starting this discussion. I spent some time thinking about this in the past days, too, and while I don't have a complete plan in mind yet, I'd want to share some ideas:

You placed your example index.json somewhere in the middle of the directory tree and listed all directly contained directories and files. That's how Apache's index.html works. But that also means that tools like Stem would need to navigate through the directory tree and read multiple of these index.json files. And it means that CollecTor would have to rewrite all these index.json files after an update. While this could work, it's somewhat complex.

How about we write a single index.json, say https://collector.torproject.org/index.json, that contains all directories and files in the directory tree? This would make processing a lot easier.

The obvious downside is that this file could grow quite big. I'm listing all directories and the number of contained files here:

   0 https://collector.torproject.org/
   0 https://collector.torproject.org/archive/
  89 https://collector.torproject.org/archive/bridge-descriptors/
  52 https://collector.torproject.org/archive/bridge-pool-assignments/
  68 https://collector.torproject.org/archive/exit-lists/
   1 https://collector.torproject.org/archive/relay-descriptors/
  96 https://collector.torproject.org/archive/relay-descriptors/consensuses/
  98 https://collector.torproject.org/archive/relay-descriptors/extra-infos/
  21 https://collector.torproject.org/archive/relay-descriptors/microdescs/
 117 https://collector.torproject.org/archive/relay-descriptors/server-descriptors/
  76 https://collector.torproject.org/archive/relay-descriptors/statuses/
  40 https://collector.torproject.org/archive/relay-descriptors/tor/
  96 https://collector.torproject.org/archive/relay-descriptors/votes/
  75 https://collector.torproject.org/archive/torperf/
   0 https://collector.torproject.org/recent/
   0 https://collector.torproject.org/recent/bridge-descriptors/
  72 https://collector.torproject.org/recent/bridge-descriptors/extra-infos/
  72 https://collector.torproject.org/recent/bridge-descriptors/server-descriptors/
  72 https://collector.torproject.org/recent/bridge-descriptors/statuses/
  72 https://collector.torproject.org/recent/exit-lists/
   0 https://collector.torproject.org/recent/relay-descriptors/
  72 https://collector.torproject.org/recent/relay-descriptors/consensuses/
  72 https://collector.torproject.org/recent/relay-descriptors/extra-infos/
   0 https://collector.torproject.org/recent/relay-descriptors/microdescs/
  72 https://collector.torproject.org/recent/relay-descriptors/microdescs/consensus-microdesc/
  72 https://collector.torproject.org/recent/relay-descriptors/microdescs/micro/
  72 https://collector.torproject.org/recent/relay-descriptors/server-descriptors/
 576 https://collector.torproject.org/recent/relay-descriptors/votes/
  37 https://collector.torproject.org/recent/torperf/
2090 (total)

If we assume that each directory or file requires 200 characters/bytes in the index.json, that's an uncompressed file size of 413 KiB. We can probably save a bit here by removing whitespace, not repeating the https://collector.torproject.org/ part over and over, etc. What do you think, is that still reasonable?
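The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the file counts are copied from the listing, the 200 bytes per entry is the assumption stated in the comment, and counting the 29 listed directories as entries of their own is one reading of how the roughly 413 KiB figure comes about.

```python
# File counts per directory, copied from the listing above (29 directories).
FILE_COUNTS = [
    0, 0, 89, 52, 68, 1, 96, 98, 21, 117, 76, 40, 96, 75,     # root and archive/
    0, 0, 72, 72, 72, 72, 0, 72, 72, 0, 72, 72, 72, 576, 37,  # recent/
]
BYTES_PER_ENTRY = 200  # assumed average size of one directory or file entry


def estimate_index_size(file_counts, bytes_per_entry):
    """Return the estimated uncompressed index size in bytes."""
    entries = sum(file_counts) + len(file_counts)  # files plus directories
    return entries * bytes_per_entry


size = estimate_index_size(FILE_COUNTS, BYTES_PER_ENTRY)
print(f"{sum(FILE_COUNTS)} files, {len(FILE_COUNTS)} directories")
print(f"about {size / 1024:.1f} KiB uncompressed")
```

Counting only the 2090 files would give a slightly smaller number, so the exact figure depends on what is treated as an entry.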

comment:2 Changed 4 years ago by atagar

What do you think, is that still reasonable?

Very. I like this idea a lot. Modeling the directory tree as follows would be pretty dense...

{
  'contents': {
    'bridge-descriptors': {
      'extra-infos': {
        '2015-10-09-16-09-02-extra-infos': {
          'last_modified': '09-Oct-2015 16:09',
          'size': 1572864
        },
        '2015-10-09-17-09-02-extra-infos': {
          'last_modified': '09-Oct-2015 17:09',
          'size': 1468006
        }
      }
    }
  }
}

That's 90 characters per entry without whitespace. The last_modified doesn't seem worth the space and we can add it later if we need it, further dropping it to 53 characters per entry. That's ~108 KB. Very reasonable.

comment:3 Changed 4 years ago by karsten

Oh, we'll need the "last_modified" field, so that applications can decide whether they'll need to download a file or not. We could also use a file digest here, but it's probably easier to keep the "last_modified" field. Though we should probably switch to the ISO format.

I think I'd also want to avoid using parts of the path as field names. Maybe we can use one or two object types for directories and files with specified field names. For example, a directory object could contain a "path" and two optional arrays with "directories" and "files", and a file object could contain a "path", a "size", and a "last_modified" time.

And maybe we could add an optional "types" field that helps applications filter files based on the contained descriptor type(s). (Some tarballs may contain more than one descriptor type, like bridge descriptors and microdescriptors.) We could even use that "types" field for directory objects if a directory only contains files of the given descriptor type(s).

How's this?

{
    "path": "https://collector.torproject.org/",
    "directories": [
        {
            "path": "archive/",
            "directories": [
                {
                    "path": "relay-descriptors/",
                    "files": [
                        {
                            "path": "certs.tar.xz",
                            "size": 80400,
                            "last_modified": "2015-10-10 03:39",
                            "types": [
                                "dir-key-certificate-3"
                            ]
                        }
                    ],
                    "directories": [
                        {
                            "path": "consensuses/",
                            "types": [
                                "network-status-consensus-3"
                            ],
                            "files": [
                                {
                                    "path": "consensuses-2007-10.tar.xz",
                                    "size": 1061648,
                                    "last_modified": "2012-05-15 14:35"
                                },
                                {
                                    "path": "consensuses-2007-11.tar.xz",
                                    "size": 6810308,
                                    "last_modified": "2012-05-15 14:35"
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
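A consumer of this nested format might walk it recursively, joining the "path" fields to rebuild full URLs. The sketch below is illustrative only: `iter_files` is a hypothetical helper, the sample is an abbreviated copy of the example above, and letting files inherit "types" from their enclosing directory is an assumption based on the suggestion in this comment, not a defined rule.

```python
import json

# Abbreviated copy of the example above; "types" is optional on every object.
SAMPLE = json.loads("""
{
  "path": "https://collector.torproject.org/",
  "directories": [
    {"path": "archive/", "directories": [
      {"path": "relay-descriptors/",
       "files": [{"path": "certs.tar.xz", "size": 80400,
                  "last_modified": "2015-10-10 03:39",
                  "types": ["dir-key-certificate-3"]}],
       "directories": [
         {"path": "consensuses/",
          "types": ["network-status-consensus-3"],
          "files": [{"path": "consensuses-2007-10.tar.xz", "size": 1061648,
                     "last_modified": "2012-05-15 14:35"}]}]}]}
  ]
}
""")


def iter_files(directory, prefix="", inherited_types=()):
    """Yield (url, file_object, types) for every file below a directory object."""
    base = prefix + directory["path"]
    # Assumption: a directory without "types" passes on what it inherited.
    dir_types = tuple(directory.get("types", inherited_types))
    for entry in directory.get("files", []):
        yield base + entry["path"], entry, tuple(entry.get("types", dir_types))
    for sub in directory.get("directories", []):
        yield from iter_files(sub, base, dir_types)


for url, entry, types in iter_files(SAMPLE):
    print(url, entry["size"], entry["last_modified"], types)
```

Plain string concatenation works here only because every directory path in this example ends with a slash.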

comment:4 Changed 4 years ago by atagar

Oh, we'll need the "last_modified" field, so that applications can decide whether they'll need to download a file or not.

My thinking was that the filenames already contain the timestamp, so it's redundant. Is that not the case?

Changed 4 years ago by karsten

Attachment: index.json added

Changed 4 years ago by karsten

Attachment: pretty-index.json added

comment:5 Changed 4 years ago by karsten

Cc: iwakeh added

Some file names contain the timestamp, but not all of them. Even worse, some file names contain dates, yet the files may still change throughout the day. So, yes, we'll need the timestamp.

Please find the two attached files containing a sample index.json of the entire directory tree served by CollecTor. We'll want to remove a few files like .html from that, but that's pretty much what we'd serve. There's also a pretty-index.json with the same content just printed prettily, for the discussion here. These files don't contain the "types" field yet.

What do you think, should we move forward with this? I could start providing this file (index.json, not pretty-index.json, unless there's a reason for providing that) in the next few days.

I'm also copying iwakeh who helps maintain metrics-lib and Onionoo and who generally has good input on these things.

comment:6 Changed 4 years ago by atagar

What do you think, should we move forward with this?

Yikes, I'm surprised you put this together so quickly. This is great, let's do it!

"path": "/srv/collector.torproject.org/htdocs/",

The filesystem path is being used for the first entry. This threw me off for a sec. In the example above you have "https://collector.torproject.org/" which made a bit more sense for the root.

"last_modified": "2014-07-07 09:15"

You mentioned above that you wanted to switch to ISO timestamps. That, or these timestamps (which tor uses), are both fine to me as long as it's UTC.

Personally I'm still not clear on its use though. I care about 'what time period are the descriptors in this resource for'. I don't have a use case at the moment for last modified timestamps, though if you have need for them then feel free to include them.

If we're certain they're needed, include them. If unsure, I'd suggest leaving them out for now to cut down on the size. We can always add them later.

"path": "exit-lists/",

Very minor but personally I'd opt to not include the trailing slash. It doesn't add anything. But that's definitely in bikeshed territory. :P

One general suggestion: let's put the whole thing in a 'contents' entry so we can include additional metadata. For instance, 'when was this index created' could be very useful to help detect if the script keeping it up to date broke...

{
  "index_created": "2014-06-05 10:52",
  "contents": {
    ... all the stuff...
  }
}

More fields might come into play in the future.
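As a rough illustration of the staleness check that an "index_created" field would allow, a client could compare it against the current time. This is a sketch under assumptions: the timestamp format is the one shown in the example and is treated as UTC, and both `index_is_stale` and the three-hour threshold are hypothetical choices, not anything CollecTor defines.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"  # format used in the example, taken as UTC


def index_is_stale(index, now, max_age=timedelta(hours=3)):
    """Return True if the index was created more than max_age before now."""
    created = datetime.strptime(index["index_created"], TIMESTAMP_FORMAT)
    created = created.replace(tzinfo=timezone.utc)
    return now - created > max_age


index = {"index_created": "2014-06-05 10:52", "contents": {}}
now = datetime(2014, 6, 5, 12, 0, tzinfo=timezone.utc)
print(index_is_stale(index, now))                      # False, about an hour old
print(index_is_stale(index, now + timedelta(days=1)))  # True, more than a day old
```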

comment:7 in reply to:  6 Changed 4 years ago by karsten

Severity: Blocker

Replying to atagar:

What do you think, should we move forward with this?

Yikes, I'm surprised you put this together so quickly. This is great, let's do it!

Great!

"path": "/srv/collector.torproject.org/htdocs/",

The filesystem path is being used for the first entry. This threw me off for a sec. In the example above you have "https://collector.torproject.org/" which made a bit more sense for the root.

Sure, it's supposed to be that. I just didn't edit the output file.

"last_modified": "2014-07-07 09:15"

You mentioned above that you wanted to switch to ISO timestamps. That, or these timestamps (which tor uses), are both fine to me as long as it's UTC.

Oh, and ISO would be "2014-07-07T09:15Z"? Well, I'd say let's use the format that tor uses then.

Personally I'm still not clear on its use though. I care about 'what time period are the descriptors in this resource for'. I don't have a use case at the moment for last modified timestamps, though if you have need for them then feel free to include them.

If we're certain they're needed, include them. If unsure, I'd suggest leaving them out for now to cut down on the size. We can always add them later.

The main use case is to only fetch files that have changed since the last time we fetched. While most files cannot change, some can. For example, Torperf files will be appended to multiple times over the day and monthly tarballs are updated every three days.

So, I think I'll need this field in metrics-lib.
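The use case described here, fetching only what changed since the previous run, might look roughly like the following. It is a sketch, not metrics-lib or Stem code: `files_to_fetch` and `parse_timestamp` are hypothetical helpers, the timestamps are assumed to be UTC in the format discussed above, and the file names and times are made up for illustration.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"  # assumed to be UTC, as discussed above


def parse_timestamp(text):
    """Parse a "last_modified" style timestamp into an aware UTC datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def files_to_fetch(files, last_fetched):
    """Return the file objects modified after the previous fetch time.

    last_fetched is None on the first run, in which case everything is fetched.
    """
    if last_fetched is None:
        return list(files)
    return [f for f in files if parse_timestamp(f["last_modified"]) > last_fetched]


# Hypothetical file objects; names and times are made up for illustration.
files = [
    {"path": "example-a.tpf", "size": 1000, "last_modified": "2015-10-13 20:45"},
    {"path": "example-b.tar.xz", "size": 2000, "last_modified": "2015-10-10 03:39"},
]
last_fetched = parse_timestamp("2015-10-13 12:00")
print([f["path"] for f in files_to_fetch(files, last_fetched)])  # ['example-a.tpf']
```

Because the timestamps only have minute resolution, a cautious client might compare with `>=` or also check "size" so that a change within the same minute is not missed.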

"path": "exit-lists/",

Very minor but personally I'd opt to not include the trailing slash. It doesn't add anything. But that's definitely in bikeshed territory. :P

Bikeshedding is still fine at this stage. Removed the trailing slash.

But let me bikeshed back: also removed the trailing "/" from the "https://collector.torproject.org/" that you suggested. ;)

One general suggestion: let's put the whole thing in a 'contents' entry so we can include additional metadata. For instance, 'when was this index created' could be very useful to help detect if the script keeping it up to date broke...

{
  "index_created": "2014-06-05 10:52",
  "contents": {
    ... all the stuff...
  }
}

More fields might come into play in the future.

Good idea. I tweaked it a tiny bit by turning the top-level object into a special type of directory object with "path", "directories", and (optional) "files", but which may also contain additional fields like "index_created". I think that's easier to process for applications.

Here's an example:

{
  "index_created": "2015-10-13 21:00",
  "path": "https://collector.torproject.org",
  "directories": [
    {
      "path": "archive",
      "directories": [
        {
          "path": "bridge-descriptors",
          "files": [
            {
              "path": "bridge-descriptors-2008-05.tar.xz",
              "size": 624156,
              "last_modified": "2012-05-30 19:41"
            },
            {
              "path": "bridge-descriptors-2008-06.tar.xz",
              "size": 1010648,
              "last_modified": "2012-05-30 19:41"
            },
            {
              "path": "bridge-descriptors-2008-07.tar.xz",
              "size": 1173032,
              "last_modified": "2012-05-30 19:41"
            },

What do you think?

comment:8 Changed 4 years ago by karsten

Severity: Blocker → Major

(Undoing the Severity change. I think something in Trac has changed while I was writing the response. I didn't intend to change anything there.)

comment:9 Changed 4 years ago by atagar

What do you think?

Looks good to me! I was worrying a bit about size (especially if we add new fields like the descriptor type), but on reflection this includes a lot of repeated bits ('bridge-descriptors-' and such). That compresses very, very nicely.

Your example index.json (237.9 kB) can be gz compressed to 39.7 kB, bz2 compressed to 30.5 kB, or xz compressed to 26.6 kB. If reasonably easy, it would be nice if CollecTor provided the index in all three formats. With Python at least, built-in xz support wasn't added until Python 3.3.
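A comparison like this can be made with Python's standard library alone. The sketch below uses a synthetic stand-in for index.json (repeated, similar file names), so its sizes will not match the figures quoted above; it only shows the three codecs and the decompression step a client would need.

```python
import bz2
import gzip
import json
import lzma

# Synthetic stand-in for index.json with many similar file names.
files = [
    {"path": f"bridge-descriptors-2008-{month:02d}.tar.xz",
     "size": 600000 + month,
     "last_modified": "2012-05-30 19:41"}
    for month in range(1, 13)
] * 50
raw = json.dumps({"path": "archive", "files": files}).encode("utf-8")

compressed = {
    "gz": gzip.compress(raw),
    "bz2": bz2.compress(raw),
    "xz": lzma.compress(raw),  # lzma is in the standard library from Python 3.3
}
print(f"uncompressed: {len(raw)} bytes")
for name, data in compressed.items():
    print(f"{name}: {len(data)} bytes")

# A client would reverse the step after downloading, e.g. for a gzipped index:
restored = json.loads(gzip.decompress(compressed["gz"]).decode("utf-8"))
print(len(restored["files"]), "file entries restored")
```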

Cheers! -Damian

comment:10 Changed 4 years ago by karsten

Owner: set to karsten
Status: new → accepted

Great, will try to implement this as part of moving CollecTor to a new host.

And yes, providing different compressed files in addition to the uncompressed file is a fine plan. Those would be available as https://collector.torproject.org/index.json.gz etc.

Thanks for all your input! Will let you know as soon as this is available.

comment:11 Changed 4 years ago by karsten

I just deployed the code to generate four index files:

Please give them a try.

Next step will be to document these files on the website, maybe as part of https://collector.torproject.org/#download.

comment:12 Changed 4 years ago by atagar

Looks great! Parsed it with Python and all looks well. I question a bit if files/directories should be a list as it is now or a map. Presently it takes an O(n) lookup to say "I want the 'relay-descriptors' directory". That said, obviously not a big whoop - it's not a big document.

comment:13 Changed 4 years ago by karsten

Hmm. Part of me agrees because performance is important, but another part of me is hesitant because this format with fixed field names is really easy to parse (and easy to produce). I'd guess that most applications would parse the whole thing to memory anyway before looking at it, and then it should be cheap to just build an internal map of contents for performance reasons. I think I'll leave the format unchanged and give you and me the chance to implement it in Stem and metrics-lib. Thanks for the feedback though!

comment:14 Changed 4 years ago by atagar

I'd guess that most applications would parse the whole thing to memory anyway before looking at it, and then it should be cheap to just build an internal map of contents for performance reasons.

You're right that it's not a big whoop either way but essentially it means applications will be parsing twice:

  • JSON-decode the index. It gives a bunch of lists.
  • Read over the decoded document and convert to maps.

If you agree that maps make more sense here then I question if ease of production is a good reason (it's not making things easier overall - it's just punting that small bit of work to all users).
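The two steps listed above can be sketched as follows. `to_maps` is a hypothetical helper, and the sample is a shortened version of the example index from comment 7; the sketch keeps only "path", "directories" and "files" and drops any other fields.

```python
import json


def to_maps(directory):
    """Convert one directory object so children can be looked up by name."""
    return {
        "path": directory["path"],
        "directories": {d["path"]: to_maps(d)
                        for d in directory.get("directories", [])},
        "files": {f["path"]: f for f in directory.get("files", [])},
    }


# Step 1: JSON-decode the index, which yields nested lists.
decoded = json.loads("""
{"index_created": "2015-10-13 21:00",
 "path": "https://collector.torproject.org",
 "directories": [
   {"path": "archive", "directories": [
     {"path": "bridge-descriptors", "files": [
       {"path": "bridge-descriptors-2008-05.tar.xz", "size": 624156,
        "last_modified": "2012-05-30 19:41"}]}]}]}
""")

# Step 2: walk the decoded document once and build the maps.
root = to_maps(decoded)
bridge = root["directories"]["archive"]["directories"]["bridge-descriptors"]
print(bridge["files"]["bridge-descriptors-2008-05.tar.xz"]["size"])  # 624156
```

The conversion is a single pass over the decoded document, which is the small bit of work the comment says each application would have to repeat.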

comment:15 Changed 4 years ago by karsten

Ease of production is not a good reason, but I think writing a parser is also more complicated if field names are dynamic. That might not be the case with Python, but at least for Java/Gson it seems easier to just define a few classes with (statically named) attributes and let Gson do some magic to parse a JSON string for you. I don't know about other languages/libraries.

It's also not clear that looking up a specific directory or file by name is the only use case for applications. It could also be that applications go through the entire list to see which files have changed, and they could stop at paths they don't care about. Or they could go through the full list once we have "type" information and do something with all files of a given descriptor type. Many choices, so it's not clear whether we should try to optimize for the path lookup use case, especially if it has downsides like more complex parsers.
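For comparison with the Java/Gson approach mentioned above, here is roughly what statically named classes and a full traversal that stops at uninteresting paths could look like in Python. Everything here is an illustrative assumption: `FileNode`, `DirectoryNode`, `parse_directory` and `walk` are hypothetical names, and the sample file names are made up.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FileNode:
    path: str
    size: int
    last_modified: str


@dataclass
class DirectoryNode:
    path: str
    files: List[FileNode] = field(default_factory=list)
    directories: List["DirectoryNode"] = field(default_factory=list)


def parse_directory(obj: dict) -> DirectoryNode:
    """Build typed nodes from a decoded directory object with fixed field names."""
    return DirectoryNode(
        path=obj["path"],
        files=[FileNode(f["path"], f["size"], f["last_modified"])
               for f in obj.get("files", [])],
        directories=[parse_directory(d) for d in obj.get("directories", [])],
    )


def walk(node: DirectoryNode, wanted: Tuple[str, ...],
         parent: str = "") -> Iterator[str]:
    """Yield relative file paths, skipping subtrees outside the wanted prefixes."""
    for sub in node.directories:
        sub_path = f"{parent}{sub.path}/"
        # Descend only if this subtree leads to, or lies within, a wanted prefix.
        if any(w.startswith(sub_path) or sub_path.startswith(w) for w in wanted):
            yield from walk(sub, wanted, sub_path)
    for f in node.files:
        if any(parent.startswith(w) for w in wanted):
            yield parent + f.path


# Hypothetical index contents; the file names are made up for illustration.
root = parse_directory({
    "path": "https://collector.torproject.org",
    "directories": [
        {"path": "archive", "directories": [
            {"path": "torperf", "files": [
                {"path": "example-old.tar.xz", "size": 1,
                 "last_modified": "2015-10-01 00:00"}]}]},
        {"path": "recent", "directories": [
            {"path": "exit-lists", "files": [
                {"path": "example-exit-list", "size": 2,
                 "last_modified": "2015-10-13 21:02"}]}]},
    ],
})
print(list(walk(root, ("recent/exit-lists/",))))
```

With a maps-based format the class definitions would need dynamically named fields, which is the complication this comment is pointing at.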

comment:16 Changed 4 years ago by karsten

What do you think? Are you fine with keeping the current index.json? If not, can you suggest a new maps-based data format? Thanks!

comment:17 Changed 4 years ago by atagar

Hi Karsten. My two cents is still that a map makes more sense but I also don't think it's terribly important so we can go with this if you'd like.

comment:18 Changed 4 years ago by karsten

Resolution: implemented
Status: accepted → closed

Okay. Then I'd say let's leave it as it is now. It's still way better than parsing Apache's directory listings.

This format is also documented on https://collector.torproject.org/index.html#index-json, so I'd say this issue is resolved. Closing.

Thanks!

comment:19 Changed 4 years ago by atagar

Great, thanks for implementing this Karsten!
