It would be helpful to parsers for the first line of metrics descriptors to contain the descriptor type and version. Moving this discussion from an email thread...
Speaking of versioning and such, I would really appreciate it if all of these documents came with a header saying what the document type and version is. At present I need to guess based on the filename and first line of content which is both hacky and will break when new versions roll out. Actually, to differentiate between bridge and relay descriptors I need to rely on the '10.x.x.x' scrubbing... :/
Thoughts on how we should format it?
Cheers! -Damian
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items 0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items 0
Link issues together to show that they're related.
Learn more.
How about we add an annotation line @type descriptor-type major.minor as the new first line of a descriptor?
The @type part would be the same for all descriptor type annotations we add, and the descriptor-type major.minor part would be something we agree on for each descriptor type. major would be the version of the format produced by Tor or whichever tool produces the descriptor and minor would be the version of the format that metrics-db makes of it.
We might want to reserve @type in dir-spec.txt, or we may run into trouble once Tor writes such an annotation itself. Maybe that means we'll have to pick something more specific than @type, hmm. The list of descriptor types and versions would go on the metrics formats page.
There are three kinds of descriptors: a) descriptors produced by Tor that we leave unchanged, b) descriptors produced by Tor that we sanitize, c) descriptors produced by Tor services like BridgeDB, GetTor, Tor Check, or Torperf which we may also reformat.
The descriptors in a) could use the version numbers taken from dir-spec.txt where they were first described; we'll want to ask nickm/arma whether the version numbers we choose make sense. In theory, we don't mess with these descriptors at all, but just in case we have to, we can change the .0 part to reflect that:
@type server-descriptor 1.0
@type extra-info 1.0
@type directory 1.0
@type network-status 2.0
@type network-status-consensus 3.0
@type network-status-vote 3.0
The descriptors in b) should probably have version numbers based on dir-spec.txt and our sanitizing approach. The .0 part would be incremented whenever we change something in the sanitizing approach:
@type bridge-network-status 1.0
@type bridge-server-descriptor 1.0
@type bridge-extra-info 1.0
The descriptors in c) are not based on dir-spec.txt and should therefore start at version 1, plus they should have a .0 part that we can increment whenever we change their format.
How about we add an annotation line @type descriptor-type major.minor as the new first line of a descriptor?
Sounds great! In an ideal world the header would answer three questions...
what type of descriptor am I trying to parse?
among my parsing methods, which should I use?
if this isn't a recognized version then should I still parse it, or does the change break backward compatibility?
I'm not sure if the third is something that we want to answer via the header or not. Usually that's distinguished via major versions and minor versions, but here it's a tor version and metrics version instead.
How would you like our libraries to handle unrecognized versions? Abort? Attempt to parse anyway? Somehow decide based on the version?
We might want to reserve @type in dir-spec.txt, or we may run into trouble once Tor writes such an annotation itself.
Probably not needed. Annotations aren't part of the dir-spec since they reside outside of the descriptor itself. When I asked about them I was told that they're a tor implementation detail and hence something liable to change at any time.
That said, this might be an exception since it's used outside of tor itself. Regardless, it would be nice if we added the @type annotation to tor's cached descriptors too...
Hmm. After reading your last comment a couple of times, especially the "break backward compatibility" and "tor implementation detail" parts, I'd like to step back a bit.
What we are trying to achieve here is that whenever we obtain descriptors---produced by Tor or by a Tor-related service and maybe post-processed by metrics-db---we want to get hints for parsing them. The four descriptor sources that come to mind are: 1) request from Tor's control port, 2) request from a (remote) Tor's directory port, 3) read from a cached-* file in a (local) Tor data directory, 4) read from a metrics tarball. For cases 1 to 3 we already have enough context to know what descriptor type and version to expect: for 1) and 2) we know what we requested from Tor, so we can use that as a hint to parse what we get returned; for 3), we can derive what descriptors are contained in a cached-* file from the filename---it's a hack anyway, because Tor's cached-* files are highly implementation-specific and we're messing with them on our own risk. Only for 4) it's not so easy to tell what descriptors are contained in a tarball, especially if tarballs are extracted, so having annotations in files contained in metrics tarballs would make sense. But I think we shouldn't aim for adding these annotations to anything that Tor produces. We also shouldn't mimic Tor's annotation syntax: let's start metrics annotations with "# " instead of "@type".
Your point of using the version number to decide if a change is backward-compatible or not is a good one. Ideally, we should distinguish between changes that add information and are harmless if parsed by an old parser, and changes that cannot be parsed by an old parser in a reasonable way. For example: leaving the original bridge nickname in sanitized descriptors instead of "Unnamed" is a minor change (#5684 (moved)), replacing server descriptor digests with correctly calculated ones would be a major change (#5607 (moved)). We should have major and minor versions just for that. We could encode the dir-spec.txt version in the descriptor type name, because a new dir-spec version is very likely not backward-compatible; and even if it is, we can easily use the same parser. The rule would be that we wouldn't parse a descriptor with unknown type or higher major version than ours, but that we would parse a descriptor with higher minor version, possibly emitting a warning.
So, how about we include the following descriptor-type annotations to files contained in metrics tarballs?
# server-descriptor 1.0
# extra-info 1.0
# directory 1.0
# network-status-2 1.0
# network-status-consensus-3 1.0
# network-status-vote-3 1.0
# bridge-network-status 1.0
# bridge-server-descriptor 1.0
# bridge-extra-info 1.0
# torperf 1.0
# bridge-pool-assignment 1.0
# gettor 1.0
# tordnsel 1.0
Whenever something changes in Tor or in a Tor service in a backward-_in_compatible way, the descriptor type name would change. Whenever we change something in metrics-db, which happens much more frequently, the version number would change.
For cases 1 to 3 we already have enough context to know what descriptor type and version to expect
That assumes that these files will always be, say, v3 server descriptors? Might they be v2 server descriptors in old versions? Or v4 in the future?
for 3), we can derive what descriptors are contained in a cached-* file from the filename---it's a hack anyway, because Tor's cached-* files are highly implementation-specific and we're messing with them on our own risk
Yes, it's a hack and reading directly from the data directory is completely unsupported by tor. However, I think that this is a bad idea and not really necessary.
If tor supported, say, "The data directory may contain descriptor data, identifiable by the first line of its contents." then some use cases will no longer require an open control port at all and tor hasn't painted itself into a corner in terms of future changes.
We also shouldn't mimic Tor's annotation syntax: let's start metrics annotations with "# " instead of "@type".
Could, though mimicking annotations would make sense if we planned to leave the door open for including this with cached descriptors later.
For example: leaving the original bridge nickname in sanitized descriptors instead of "Unnamed" is a minor change (#5684 (moved))
I've been watching that ticket with some interest since it actually does impact stem's parser. On reflection I shouldn't validate bridge scrubbing by default - I should move it to an 'is_scrubbed()' method instead...
The rule would be that we wouldn't parse a descriptor with unknown type or higher major version than ours, but that we would parse a descriptor with higher minor version, possibly emitting a warning.
For cases 1 to 3 we already have enough context to know what descriptor type and version to expect
That assumes that these files will always be, say, v3 server descriptors? Might they be v2 server descriptors in old versions? Or v4 in the future?
Minor nitpick: they have always been v1 server descriptors, not v3. The current server descriptor format was introduced in version 1 of the directory protocol, and it hasn't changed in a backward-incompatible way since then. Keywords have been added, but clients shall ignore keywords they don't understand.
So, if you request a server descriptor via Tor's control or directory port using the command or URL that you always used to get a v1 server descriptor, and Tor gives you a v2 server descriptor that you can't parse, well, then Tor is doing it wrong. And it would break a lot of things, so that's not going to happen, I think. What I'd expect is that there'd be a new command or URL to give you the v2 server descriptor.
for 3), we can derive what descriptors are contained in a cached-* file from the filename---it's a hack anyway, because Tor's cached-* files are highly implementation-specific and we're messing with them on our own risk
Yes, it's a hack and reading directly from the data directory is completely unsupported by tor. However, I think that this is a bad idea and not really necessary.
If tor supported, say, "The data directory may contain descriptor data, identifiable by the first line of its contents." then some use cases will no longer require an open control port at all and tor hasn't painted itself into a corner in terms of future changes.
I see the advantage. The downside, though, is that Tor will create an interface that it may support. I feel that's not going to happen. But I think it's okay to make a best-effort approach for cached-* files.
We also shouldn't mimic Tor's annotation syntax: let's start metrics annotations with "# " instead of "@type".
Could, though mimicking annotations would make sense if we planned to leave the door open for including this with cached descriptors later.
Right, okay. We can do "@type".
For example: leaving the original bridge nickname in sanitized descriptors instead of "Unnamed" is a minor change (#5684 (moved))
I've been watching that ticket with some interest since it actually does impact stem's parser. On reflection I shouldn't validate bridge scrubbing by default - I should move it to an 'is_scrubbed()' method instead...
Yeah, the scrubbing method isn't set in stone. Every now and then there will be changes like this. In addition to the nickname thing, there are also vague plans to include the country code in the contact line of a server descriptor (which doesn't even have a ticket number yet). And some time ago we added the 10.x.x.x thing for IP addresses instead of the 127.0.0.1 as it was before. I understand that it's painful to adapt stem and metrics-lib to those changes, so I'm trying to avoid them if possible.
The rule would be that we wouldn't parse a descriptor with unknown type or higher major version than ours, but that we would parse a descriptor with higher minor version, possibly emitting a warning.
Sounds good
Great.
server-descriptor 1.0
Shouldn't this be "server-descriptor-3 1.0"?
(I don't think so, for the reason stated at the beginning of this comment.)
How do we proceed?
Should we ask Nick and Roger what they think about including @type annotations in descriptors returned by the control port, dir port, and/or included in cached-* files?
And should I specify the annotations for files in metrics tarballs somewhere on metrics.tpo and start including them in the existing tarballs? Or is there anything we overlooked?
What I'd expect is that there'd be a new command or URL to give you the v2 server descriptor.
Right, that's essentially what microdescriptors are I think.
The downside, though, is that Tor will create an interface that it may support.
Yup. On second thought you're right that tor shouldn't vend its data directory's descriptors. However, since they're around anyway (and not likely to go away any time soon) I still think that making this part of the cached files would be helpful.
Yeah, the scrubbing method isn't set in stone.
When it changes do you re-publish old bridge descriptors with the new scheme or are there a variety of scrubbing schemes floating around? My parser will need to change when scrubbing changes are invalid according to the dir-spec. For instance if we drop a mandatory field (like we did for onion-key and such) then we'll need to update...
https://gitweb.torproject.org/stem.git/blob/HEAD:/stem/descriptor/server_descriptor.py#l776
Oh well, that's why I made it a separate subclass.
And some time ago we added the 10.x.x.x thing for IP addresses instead of the 127.0.0.1 as it was before.
If there's descriptors around with the old address then I have an issue since my parser relies on the first line having a 'router Unnamed 10.' prefix to figure out if it's a bridge descriptor or not. Oh well, that's what this ticket will be solving. :)
If the old scrubbing scheme is still around then I'll check for a 'router Unnamed 127.0.0.1' prefix too.
Should we ask Nick and Roger what they think about including @type annotations in descriptors returned by the control port, dir port, and/or included in cached-* files?
Annotations currently only appear in cached descriptors, not via the control or dir port. They don't need the @type header since the caller knows what they're requesting, and including it would likely confuse and break callers.
The cached descriptors is all that I would like to change in tor. And yup, we should ask for input from them next.
And should I specify the annotations for files in metrics tarballs somewhere on metrics.tpo and start including them in the existing tarballs? Or is there anything we overlooked?
Sounds good to me but lets first get Nick's opinion before proceeding.
As Tor stores stuff now, it supports annotations for router descriptors and microdescriptors and (I think) extrainfo docs, but I don't think it supports even storing them for cached-consensus or certificate stuff. (I could be wrong there.) This data is not explicitly typed as it is stored; rather, the type of these documents is inferred (in Tor) from where they are stored. Also, I don't think Tor has a notion of the "version" of these documents; I don't know what a 1.1 or a 1.0 even means in this case.
We could stick more annotations in the stuff that currently allows annotations, but adding annotations to things that don't currently allow them would break backward compatibility for the data directory contents, and we generally try not to do that.
It's not IMO hacky to say that a cached-consensus file contains a consensus directory; that's what it means for it to be in that file. It's supposed to be allowed to infer what stuff is from where it is.
I think I'm missing the actual application here, which was probably more explicit in the original email thread. What are you trying to do, under what constraint?
The goal of this is to allow a descriptor parser to reliably figure out how it should parse files. At present stem and metrics-lib infer a file's type by checking its filename and the first line of its contents, for instance a first line starting with 'router Unnamed 10.' indicates that it's a bridge descriptor. This is pretty hacky.
Admittedly this is mostly just an issue for metrics descriptors since the filename is a reliable indicator for cached files. Using the filename feels like a bit of a hack which is why I wanted to apply this to cached descriptors too, but it's not horrible. I'm fine with proceeding with this as being a metrics-only header. Thoughts, Karsten?
When it changes do you re-publish old bridge descriptors with the new scheme or are there a variety of scrubbing schemes floating around? My parser will need to change when scrubbing changes are invalid according to the dir-spec. For instance if we drop a mandatory field (like we did for onion-key and such) then we'll need to update...
Whenever there are changes to the scrubbing scheme, I'm updating all files. That's one of the reasons why there are at most 1 to 2 changes per year. It takes about week to process all descriptors again. But having more than one scheme around would be too confusing.
If there's descriptors around with the old address then I have an issue since my parser relies on the first line having a 'router Unnamed 10.' prefix to figure out if it's a bridge descriptor or not. Oh well, that's what this ticket will be solving. :)
There are only descriptors with "router Unnamed 10." around.
Admittedly this is mostly just an issue for metrics descriptors since the filename is a reliable indicator for cached files. Using the filename feels like a bit of a hack which is why I wanted to apply this to cached descriptors too, but it's not horrible. I'm fine with proceeding with this as being a metrics-only header. Thoughts, Karsten?
I'm fine with adding annotation headers to metrics tarballs only. I just created a new section on the formats page. I'm going to start adding meta data to server descriptors today.
Hmm, maybe I should wait for your feedback before starting to add meta data to server descriptors. Does the meta-data format on the formats page look good?
Stealing this ticket to note down my 4 points for the discussion so far and the estimated, let's say, 6 points for adding annotations to existing relay descriptors and making sure the resulting tarballs are still valid.
Trac: Points: N/Ato 10 Owner: N/Ato karsten Status: new to assigned
The currently used relay descriptor types (server descriptors, extra-info descriptors, consensuses, votes, and certs) now contain @type annotations. See the formats page for the notation. I updated all existing tarballs on the data page containing these descriptor types.
I'm planning to annotate v1 directories and v2 statuses next, bridge pool assignments and bridge descriptors in two weeks from now, and Torperf files, GetTor statistics files, and exit lists after that.
I didn't add @type annotations for bridge descriptors, because the tarballs don't contain those lines yet. But you're right, there's nothing wrong in adding them. Just did so.
All descriptors available here contain @type annotations and metrics-lib uses these annotations, if available, to decide how to parse a descriptor. Closing.