Hi Karsten, as discussed at the dev meeting it would be sweeeeeet for Stem to have a CollecTorDownloader class that supports getting and processing descriptors from CollecTor.
Presently you read and process the Apache index html, but it could be a lot nicer to have a machine-readable index.json file instead. This is the ticket for it.
To support future expansion it would be nice to be able to include additional metadata, so I propose a 'contents' hash, such as...
Thanks for starting this discussion. I spent some time thinking about this over the past few days, too, and while I don't have a complete plan in mind yet, I'd like to share some ideas:
You placed your example index.json somewhere in the middle of the directory tree and listed all directly contained directories and files. That's how Apache's index.html works. But that also means that tools like Stem would need to navigate through the directory tree and read multiple of these index.json files. And it means that CollecTor would have to rewrite all these index.json files after an update. While this could work, it's somewhat complex.
How about we write a single index.json, say https://collector.torproject.org/index.json, that contains all directories and files in the directory tree? This would make processing a lot easier.
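For illustration, a client could then grab the whole thing in one request and walk it in memory, something like this (a rough sketch; the exact structure of the file is still up for discussion):

import json
import urllib.request

# One request fetches the entire index; no crawling of per-directory files.
with urllib.request.urlopen("https://collector.torproject.org/index.json") as response:
    index = json.loads(response.read().decode("utf-8"))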
The obvious downside is that this file could grow quite big. I'm listing all directories and the number of contained files here:
If we assume that each directory or file requires 200 characters/bytes in the index.json, that's an uncompressed file size of 413 KiB. We can probably save a bit here by removing whitespace, not repeating the https://collector.torproject.org/ part over and over, etc. What do you think, is that still reasonable?
That's 90 characters without whitespace. The last_modified fields don't seem worth the space and we can add them later if we need 'em, which further drops it to 53 characters per entry. That's ~108 KB. Very reasonable.
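Quick sanity check on that arithmetic (the entry count below is just back-calculated from your 413 KiB figure, not taken from the actual directory listing):

entries = 413 * 1024 // 200    # ~2114 directories and files at 200 bytes each
print(entries * 53 / 1000.0)   # ~112 KB at 53 bytes per entry, i.e. the ~108 KB ballpark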
Oh, we'll need the "last_modified" field, so that applications can decide whether they'll need to download a file or not. We could also use a file digest here, but it's probably easier to keep the "last_modified" field. Though we should probably switch to the ISO format.
I think I'd also want to avoid using parts of the path as field names. Maybe we can use one or two object types for directories and files with specified field names. For example, a directory object could contain a "path" and two optional arrays with "directories" and "files", and a file object could contain a "path", a "size", and a "last_modified" time.
And maybe we could add an optional "types" field that helps applications filter files based on the contained descriptor type(s). (Some tarballs may contain more than one descriptor type, like bridge descriptors and microdescriptors.) We could even use that "types" field for directory objects if a directory only contains files of the given descriptor type(s).
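To illustrate, a directory object and a contained file object along those lines might look roughly like this (illustrative values only; the "tordnsel" label is just an example descriptor type, and this isn't the final format):

{
  "path": "exit-lists",
  "types": ["tordnsel"],
  "files": [
    {
      "path": "2014-07-07-09-02-00",
      "size": 324516,
      "last_modified": "2014-07-07 09:15"
    }
  ]
}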
Some file names contain the timestamp, but not all of them. Even worse, some file names contain dates but the files may still change throughout the day. So, yes, we'll need the timestamp.
Please find attached two files containing a sample index.json of the entire directory tree served by CollecTor. We'll want to remove a few files like the .html ones from that, but it's pretty much what we'd serve. There's also a pretty-index.json with the same content, just pretty-printed, for the discussion here. These files don't contain the "types" field yet.
What do you think, should we move forward with this? I could start providing this file (index.json, not pretty-index.json, unless there's a reason for providing that) in the next few days.
I'm also copying iwakeh who helps maintain metrics-lib and Onionoo and who generally has good input on these things.
What do you think, should we move forward with this?
Yikes, I'm surprised you put this together so quickly. This is great, let's do it!
"path": "/srv/collector.torproject.org/htdocs/",
The filesystem path is being used for the first entry. This threw me off for a sec. In the example above you have "https://collector.torproject.org/" which made a bit more sense for the root.
"last_modified": "2014-07-07 09:15"
You mentioned above that you wanted to switch to ISO timestamps. Either that or this format (which tor uses) is fine with me, as long as it's UTC.
Personally I'm still not clear on its use though. I care about 'what time period are the descriptors in this resource for'. I don't have a use case at the moment for last modified timestamps, though if you have need for them then feel free to include them.
If we're certain they're needed include them. If unsure I'd suggest leaving them out for now to cut down on the size. We can always add them later.
"path": "exit-lists/",
Very minor but personally I'd opt to not include the trailing slash. It doesn't add anything. But that's definitely in bikeshed territory. :P
One general suggestion: let's put the whole thing in a 'contents' entry so we can include additional metadata. For instance, 'when was this index created' could be very useful to help detect if the script keeping it up to date broke...
{ "index_created": "2014-06-05 10:52", "contents": ... all the stuff... }}
What do you think, should we move forward with this?
Yikes, I'm surprised you put this together so quickly. This is great, let's do it!
Great!
"path": "/srv/collector.torproject.org/htdocs/",
The filesystem path is being used for the first entry. This threw me off for a sec. In the example above you have "https://collector.torproject.org/" which made a bit more sense for the root.
Sure, it's supposed to be that. I just didn't edit the output file.
"last_modified": "2014-07-07 09:15"
You mentioned above that you wanted to switch to ISO timestamps. Either that or this format (which tor uses) is fine with me, as long as it's UTC.
Oh, and ISO would be "2014-07-07T09:15Z"? Well, I'd say let's use the format that tor uses then.
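That format is also trivial to parse as UTC, e.g. in python (just a sketch, not necessarily how Stem or metrics-lib would do it):

from datetime import datetime, timezone

# Parse a tor-style "last_modified" value and mark it explicitly as UTC.
last_modified = datetime.strptime("2014-07-07 09:15", "%Y-%m-%d %H:%M")
last_modified = last_modified.replace(tzinfo=timezone.utc)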
Personally I'm still not clear on its use though. I care about 'what time period are the descriptors in this resource for'. I don't have a use case at the moment for last modified timestamps, though if you have need for them then feel free to include them.
If we're certain they're needed include them. If unsure I'd suggest leaving them out for now to cut down on the size. We can always add them later.
The main use case is to only fetch files that have changed since the last time we fetched. While most files cannot change, some can. For example, Torperf files will be appended to multiple times over the day and monthly tarballs are updated every three days.
So, I think I'll need this field in metrics-lib.
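In python terms the check would be something like this (hypothetical helper, assuming the client remembers the last_modified values from its previous run; not actual metrics-lib or Stem code):

def needs_download(file_entry, previous_mtimes):
    # Tor-style timestamps sort lexicographically, so string comparison is
    # enough to spot files that changed since the previous fetch.
    previous = previous_mtimes.get(file_entry["path"])
    return previous is None or file_entry["last_modified"] > previous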
"path": "exit-lists/",
Very minor but personally I'd opt to not include the trailing slash. It doesn't add anything. But that's definitely in bikeshed territory. :P
Bikeshedding is still fine at this stage. Removed the trailing slash.
One general suggestion: let's put the whole thing in a 'contents' entry so we can include additional metadata. For instance, 'when was this index created' could be very useful to help detect if the script keeping it up to date broke...
{ "index_created": "2014-06-05 10:52", "contents": ... all the stuff... }}}}}More fields might come into play in the future.
Good idea. I tweaked it a tiny bit by turning the top-level object into a special type of directory object with "path", "directories", and (optional) "files", but which may also contain additional fields like "index_created". I think that's easier to process for applications.
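So the top level would look roughly like this (a sketch; the directory names are only placeholders):

{
  "index_created": "2014-06-05 10:52",
  "path": "https://collector.torproject.org",
  "directories": [
    { "path": "archive" },
    { "path": "recent" }
  ]
}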
Looks good to me! I was worried a bit about size (especially if we add new fields like the descriptor type), but on reflection this includes a lot of repeated bits ('bridge-descriptors-' and such). That compresses very, very nicely.
Your example index.json (237.9 kB) can be gz compressed to 39.7 kB, bz2 compressed to 30.5 kB, or xz compressed to 26.6 kB. If it's reasonably easy, it would be nice if CollecTor provided the index in all three. With python, at least, builtin xz support wasn't added until python 3.3.
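For reference, the comparison is just a few lines of python (default compression levels, so exact sizes may differ slightly; the lzma module is the part that needs python 3.3+):

import bz2
import gzip
import lzma

# Compare the uncompressed index against the three compressed variants.
data = open("index.json", "rb").read()
print(len(data), len(gzip.compress(data)), len(bz2.compress(data)), len(lzma.compress(data)))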
Great, will try to implement this as part of moving CollecTor to a new host.
And yes, providing different compressed files in addition to the uncompressed file is a fine plan. Those would be available as https://collector.torproject.org/index.json.gz etc.
Thanks for all your input! Will let you know as soon as this is available.
Trac: Status: new to accepted; Owner: N/A to karsten
Looks great! Parsed it with python and all looks well. I wonder a bit if files/directories should be a list, as it is now, or a map. Presently it takes an O(n) lookup to say "I want the 'relay-descriptors' directory". That said, obviously not a big whoop - it's not a big document.
Hmm. Part of me agrees because performance is important, but another part of me is hesitant because this format with fixed field names is really easy to parse (and easy to produce). I'd guess that most applications would parse the whole thing to memory anyway before looking at it, and then it should be cheap to just build an internal map of contents for performance reasons. I think I'll leave the format unchanged and give you and me the chance to implement it in Stem and metrics-lib. Thanks for the feedback though!
I'd guess that most applications would parse the whole thing to memory anyway before looking at it, and then it should be cheap to just build an internal map of contents for performance reasons.
You're right that it's not a big whoop either way, but essentially it means applications will be parsing twice (see the sketch below):
1. JSON-decode the index. That gives a bunch of lists.
2. Read over the decoded document and convert it to maps.
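For what it's worth, that second pass is only a handful of lines (a sketch assuming the field names above, not actual Stem code):

def build_file_map(directory, parent=""):
    # Flatten the nested index into {full_path: file_entry} so lookups
    # afterwards don't need an O(n) scan over the lists.
    base = parent + directory["path"] + "/"
    file_map = {}
    for entry in directory.get("files", []):
        file_map[base + entry["path"]] = entry
    for subdirectory in directory.get("directories", []):
        file_map.update(build_file_map(subdirectory, base))
    return file_map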
If you agree that maps make more sense here, then I question whether ease of production is a good reason (it's not making things easier overall - it's just punting that small bit of work onto all users).
Ease of production is not a good reason, but I think writing a parser is also more complicated if field names are dynamic. That might not be the case with Python, but at least for Java/Gson it seems easier to just define a few classes with (statically named) attributes and let Gson do some magic to parse a JSON string for you. I don't know about other languages/libraries.
It's also not clear that looking up a specific directory or file by name is the only use case for applications. It could also be that applications go through the entire list to see which files have changed, and they could stop at paths they don't care about. Or they could go through the full list once we have "type" information and do something with all files of a given descriptor type. Many choices, so it's not clear whether we should try to optimize for the path lookup use case, especially if it has downsides like more complex parsers.