I've heard others say that the following lines allow non-ASCII content, but I'm not sure if that's actually the case, and if it is, how many relays this would affect:
the "platform" line in relay descriptors, which is a "human-readable string",
the contact "info" line in relay descriptors, which has an undefined format.
Edit: allowing users to spell their names correctly is important. That's why we'll use utf-8 for relay descriptors, votes, and consensuses.
If it is, I'd recommend we make them all ASCII for consistency, and update torspec to clarify, and include it as a "major" change in an 0.2.x tor release.
(This means that some users will be unable to spell their names correctly. But there was never any guarantee that 8-bit characters in "info" would be interpreted as users intended. I think security is more important here.)
If it is, I'd recommend we make them all ASCII for consistency
For what it's worth I'd love for the contact and platform lines to be ASCII. Having them be the one and only 'special snowflakes' in this regard is unnecessary and a pita for parsers.
Migrating the consensus version with this change could be interesting. I suggest we do the following:
if, at vote time, the current consensus version is >= MIN_VERSION_TO_EXCLUDE_NON_ASCII_DIR_DOCS, authorities exclude relays with descriptors containing non-ascii characters from their votes
to avoid a consensus split, authorities always accept uploaded descriptors, containing non-ascii characters, if they are uploaded from other authorities
authorities with this bugfix reject all uploaded documents, including descriptors and extra-info, containing non-ascii characters
this gives relay operators ample warning to modify their contact lines
relays also do this validation before upload
Separately, perhaps in conjunction with prop224:
hidden service directories validate and reject (encrypted) hidden service descriptors containing non-ascii characters
clients validate and reject (decrypted) hidden service descriptors containing non-ascii characters
hidden services also do this validation before upload
Minimal alternative: Authorities just start rejecting these when they are uploaded to the authorities.
(We have to make sure that if you're an authority who rejects these descriptors, you won't go crazy re-fetching them if other authorities list them. But I think we fixed that bug.)
I find the 'fast' comment amusing. I first reported that a relay was publishing malformed extrainfo descriptors over three months ago so... yeah. 'Fast' isn't the word that comes to mind. :P
Directory authorities should not publish malformed documents. This is a gap in the validation tor does. This is why I have a DocTor check that validates everything coming from the DirAuths - because Tor's validation is quite a bit laxer than Stem's.
How many relays does this affect?
At the moment one. A single relay has been publishing malformed ExtraInfo documents for months.
shouldn't we first warn the relay operators about it?
Oops, forgot to mention - we did reach out to this relay three months back. He was puzzled why his tor instance was publishing malformed 'dirreq-v3-reqs' lines and we didn't follow up.
My comment is about an ASCII requirement in contact info. "Moving fast" has nothing to do with how quickly we're patching it; I think we should have a couple of stable releases out that reject such configurations before dirauths reject relays with these kinds of descriptors.
Seems we're moving rather fast with this, shouldn't we first warn the relay operators about it? How many relays does this affect?
It affects no relay descriptors, and the one relay extrainfo that atagar mentioned.
grep "[^A-Za-z0-9:*/@.=+ <>\[\]_,-.]" cached-descriptors returns no lines. And we'd use a less restrictive set of ASCII characters space through ~.
I can't even find the platform line that caused the issue in Atlas, so I wonder if it's an Atlas bug. The most recent descriptor for it has a normal platform line:
I'm happy to reject it on the relay side in one release, and then have authorities reject it later. But given it affects 1-2 relays out of 7000, I would also be happy to fix it simultaneously on authorities and clients.
If we use nickm's simpler fix where authorities reject non-ASCII documents, it would only take effect after a majority of authorities upgraded to 0.2.9, or all authorities upgraded to 0.2.9, depending on where we do the check.
Ahhh! Sorry, misunderstood. Non-ascii contact info certainly does come up a bit. Here ya go...
We have 24 descriptors with non-ascii contact info
We have 0 descriptors with non-ascii platform info

contact:

1wiki <tor(��t)1wiki.de>
Random Person <treacheroussunset ��������� openmailbox ��������� org>
Random Person <trash ��t psy-coding com>
0x6FBAB4BD076683498B71AB812C8A7BBF7B85E1AB Oddbj��rn Norstrand <oddbjorn AT norstrand dot priv dot no>
fogmountain[��t]gMx-D0t-nEt [tor-relay.co]
0x775BFC87 Hloup�� Honza <dumbjack AT seznam dot cz>
Mynameis Nobody <bleckbox ��t ouvaton dodt org>
0xdf0c3d316b7312d5 Alexander Kj��ll <alexander.kjall@gmail.com>
gpg-fingerprint: 294F 4893 913F 208E 98B6 925E 538D 9C7C 219C 76AD ���: olbrichski@gmail.com
0x0D3130F5 Cl��ment F��vrier <clement AT forumanalogue dot fr>
2048R/421F554B renke <renke ��T mobtm PUNKT com>
0x01086FDA Cristian Rasch <cristianrasch AT fastmail dot fm>
��<5p4m ��T gmx d0t de>
Random Nobody Person <kimskrams��hotmail.com>
Adri��n Lavi��s <adrian.lavios AT tutanota dot com>
Node Handler <node �� handler]at[marvid ��� france>
0xDADCA1EE Torexit Wall <torexitwall ��t hitler.rocks>
Elmo M��ntynen <elmo dot mantynen AT iki dot fi>
Random Person <simtim6 �� elitemail point org>
Lightning Rider <syroeska-ru ��T `mail` [ dot] ru>
Miko��aj Florkiewicz <tor@florkiewicz.me>
++����������++ c.m.i(at)mail.ru ++ hkp://keyserver.ubuntu.com:11371 ++ Bitcoin? 153gfzos233LcSnJpDF5u3q76iVAACwTAd
��B`��
0xDADCA1EE Torexit Wall <torexitwall ��t hitler.rocks>

platform:
Script to get this...
from stem.descriptor import remote

non_ascii_contact, non_ascii_platform = [], []

def is_ascii(s):
  # Accept both text and byte strings: iterating bytes yields ints on Python 3.
  return all((c if isinstance(c, int) else ord(c)) < 128 for c in s)

for desc in remote.get_server_descriptors():
  if desc.contact and not is_ascii(desc.contact):
    non_ascii_contact.append(desc.contact)

  if desc.platform and not is_ascii(desc.platform):
    non_ascii_platform.append(desc.platform)

print("We have %i descriptors with non-ascii contact info" % len(non_ascii_contact))
print("We have %i descriptors with non-ascii platform info" % len(non_ascii_platform))

print("\ncontact:\n")

for line in non_ascii_contact:
  print(line)

print("\nplatform:\n")

for line in non_ascii_platform:
  print(line)
I don't like Nick's easy fix I think. The dirauths that upgrade often are also the ones that do the important stuff (badexit, bwauth) so you might be able to ensure you don't get the badexit flag by putting non-ascii into your descriptor.
To the parser argument, I kinda think the ship has sailed for anything that wants to be able to parse historic descriptors. I'm still in favor of not allowing arbitrary bytes in contact info going forward, but I think we should have it in relays before we have it in dirauths.
0.3.0 or 0.3.1: a new consensus method where authorities refuse to vote for relays with non-ASCII descriptors
0.3.2 (or whenever we use that consensus method): Authorities can reject non-ASCII uploads
I assume "printing ASCII" means "space to tilde, tab, and linefeed" but we should also clarify that in the torspec.
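To make that reading of "printing ASCII" concrete, here is a minimal sketch of the check, assuming the allowed set really is space through tilde plus tab and linefeed. The names `ALLOWED` and `is_printing_ascii` are made up for illustration and do not come from tor or torspec.

```python
# Bytes allowed under the "printing ASCII" reading above:
# space (0x20) through tilde (0x7E), plus tab (0x09) and linefeed (0x0A).
ALLOWED = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A}

def is_printing_ascii(data: bytes) -> bool:
    """Return True if every byte of data is in ALLOWED."""
    return all(b in ALLOWED for b in data)
```

Under this definition carriage return and DEL (0x7F) are rejected along with every byte above 0x7E, so the spec text would need to say so explicitly if that is not what we want.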
I'm not sure if it's possible to get non-ASCII content in a hidden service descriptor without memory corruption. But in any case, hidden services, HSDirs, and clients should reject that, too. I've split the HS part off into #19647 (moved).
As one of the people with non-ASCII ContactInfo, I strongly advise against making that config option ASCII-only. It might not be obvious to native English speakers, but in countries whose languages use non-ASCII characters, the introduction of IDN and non-ASCII mail addresses was a major advance. Restricting this would be a step backward, and would probably need to be reversed once non-ASCII mail addresses become more ubiquitous.
I would prefer that all UTF-8 characters be usable in ContactInfo, which would also spare operators from transliterating their names into ASCII.
Currently, the Tor ContactInfo and Platform consist of arbitrary binary data, terminated by an ASCII linefeed byte.
There's no indication of how they should be interpreted - whether they're a particular extended-ASCII codepage, or UTF-8, or something else.
If the ContactInfo and Platform are UTF-8, it's entirely safe to parse the entire file as UTF-8, then restrict all other lines to ASCII. It's also entirely safe to parse the file as ASCII, except for the ContactInfo and Platform, which can be any bytes except ASCII LF. (UTF-8 encodes 0-127 as 0-127, and never maps any other characters to bytes 0-127.)
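A rough sketch of the second approach, assuming the two free-form lines start with the keywords "contact" and "platform" followed by a space. The function name `check_descriptor` and the sample input are hypothetical, and this is not how tor itself parses descriptors.

```python
# Keywords whose values may be UTF-8; every other line must be plain ASCII.
UTF8_KEYWORDS = (b"contact", b"platform")

def check_descriptor(raw: bytes) -> list:
    """Return the 1-based numbers of lines that break the encoding rule."""
    bad = []
    # Splitting on the LF byte is safe: UTF-8 never uses bytes 0-127
    # inside a multi-byte sequence, so 0x0A is always a real linefeed.
    for number, line in enumerate(raw.split(b"\n"), start=1):
        keyword = line.split(b" ", 1)[0]
        try:
            if keyword in UTF8_KEYWORDS:
                line.decode("utf-8")   # strict: rejects malformed sequences
            else:
                line.decode("ascii")   # rejects any byte >= 0x80
        except UnicodeDecodeError:
            bad.append(number)
    return bad
```

A contact line in Latin-1 or another 8-bit codepage will usually fail the UTF-8 decode, which is the point: the parser can tell valid UTF-8 apart from unspecified bytes without touching the ASCII-only lines.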
Some encodings for the ContactInfo and Platform may even produce linefeed bytes inside a character (in UTF-16, for example, the encoding of U+010A contains the byte 0x0A), which is clearly unsuitable.
I think we have 3 options:
We could specify and validate that ContactInfo and Platform are valid UTF-8 instead. But I'd hate to have to import a changing series of Unicode libraries to do this. Or specify a particular Unicode version. Or deal with the character ambiguities or parser security risks Unicode entails. (Yes, there are attacks on Unicode parsers - remember the iPhone emoji bug?)
We could remain with the current spec, which is under-specified, and leave them as arbitrary, unspecified-encoding bytes. But this is not ideal - how can a relay operator be contacted, when the encoding of their address is unclear?
We could require relay operators to have an ASCII email address (it could be another account, an alias, a transliteration, or an IDN-ASCII-encoding). Which means that there's no encoding ambiguity, and people whose descriptor-viewing or mail programs don't understand UTF-8 can still email operators. It's onerous for those whose names are not ASCII, but so is the risk of being uncontactable via non-Unicode descriptor readers and/or mail programs.
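For the "IDN-ASCII-encoding" part of the last option, a small sketch of what the conversion looks like for the domain half of an address, using Python's built-in `idna` codec (which implements the older IDNA 2003 rules). The helper name `ascii_domain` and the example domain are illustrative only, and this does nothing for a non-ASCII local part before the "@".

```python
def ascii_domain(domain: str) -> str:
    """Return the ASCII (punycode) form of an internationalized domain name."""
    # The stdlib "idna" codec converts each label to its xn-- form if needed.
    return domain.encode("idna").decode("ascii")
```

For example, `ascii_domain("bücher.example")` gives `"xn--bcher-kva.example"`, which fits in an ASCII-only ContactInfo line, while a domain that is already ASCII passes through unchanged.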