Opened 3 years ago

Closed 9 months ago

#18938 closed defect (duplicate)

Authorities should reject non-UTF-8 content in ExtraInfo descriptors

Reported by: teor Owned by: neel
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: needs-proposal, tor-dirauth, needs-spec, easy, 034-triage-20180328, 034-removed-20180328
Cc: catalyst, neel@… Actual Points:
Parent ID: #27367 Points: 1
Reviewer: Sponsor:

Description (last modified by teor)

In #18656, we discovered that authorities don't validate that ExtraInfo descriptors are printable ASCII before accepting them.

Authorities (and HSDirs) should check every directory extrainfo document they receive consists only of "printing ASCII" UTF-8, as defined in torspec... prop285:
https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt

I've heard others say that the following lines allow non-ASCII content, but I'm not sure if that's actually the case, and if it is, how many relays this would affect:

  • the "platform" line in relay descriptors, which is a "human-readable string",
  • the contact "info" line in relay descriptors, which has an undefined format.

Edit: allowing users to spell their names correctly is important. That's why we'll use utf-8 for relay descriptors, votes, and consensuses.

If it is, I'd recommend we make them all ASCII for consistency, and update torspec to clarify, and include it as a "major" change in an 0.2.x tor release.

(This means that some users will be unable to spell their names correctly. But there was never any guarantee that 8-bit characters in "info" would be interpreted as users intended. I think security is more important here.)

Child Tickets

Change History (39)

comment:1 Changed 3 years ago by teor

Tor authorities, relays, and onion services should validate directory documents before uploading them as well.

comment:2 Changed 3 years ago by atagar

Thanks for filing this teor!

If it is, I'd recommend we make them all ASCII for consistency

For what it's worth I'd love for the contact and platform lines to be ASCII. Having them be the one and only 'special snowflakes' in this regard is unnecessary and a pita for parsers.

comment:3 Changed 3 years ago by teor

Keywords: needs-proposal-maybe added

Migrating the consensus version with this change could be interesting, I suggest we do the following:

  • if, at vote time, the current consensus version is >= MIN_VERSION_TO_EXCLUDE_NON_ASCII_DIR_DOCS, authorities exclude relays with descriptors containing non-ascii characters from their votes
  • to avoid a consensus split, authorities always accept uploaded descriptors, containing non-ascii characters, if they are uploaded from other authorities
  • authorities with this bugfix reject all uploaded documents, including descriptors and extra-info, containing non-ascii characters
    • this gives relay operators ample warning to modify their contact lines
  • relays also do this validation before upload

Separately, perhaps in conjunction with prop224:

  • hidden service directories validate and reject (encrypted) hidden service descriptors containing non-ascii characters
  • clients validate and reject (decrypted) hidden service descriptors containing non-ascii characters
  • hidden services also do this validation before upload

comment:4 Changed 3 years ago by nickm

Keywords: 029-nickm-unsure added

Marking these tickets as the ones I think I need more feedback about in order to figure out if I think it should go in 0.2.9.

comment:5 Changed 3 years ago by nickm

Keywords: 029-proposed 029-nickm-unsure removed
Milestone: Tor: 0.2.???Tor: 0.2.9.x-final

Minimal alternative: Authorities just start rejecting these when they are uploaded to the authorities.

(We have to make sure that if you're an authority who rejects these descriptors, you won't go crazy re-fetching them if other authorities list them. But I think we fixed that bug.)

comment:6 Changed 3 years ago by isabela

Points: small1

comment:7 Changed 3 years ago by Sebastian

Seems we're moving rather fast with this, shouldn't we first warn the relay operators about it? How many relays does this affect?

comment:8 Changed 3 years ago by atagar

I find the 'fast' comment amusing. I first reported that a relay was publishing malformed extrainfo descriptors over three months ago so... yeah. 'Fast' isn't the word that comes to mind. :P

Directory authorities should not publish malformed documents. This is a gap in the validation tor does. This is why I have a DocTor check that validates everything coming from the DirAuths - because Tor's validation is quite a bit laxer than Stem's.

How many relays does this affect?

At the moment one. A single relay has been publishing malformed ExtraInfo documents for months.

comment:9 Changed 3 years ago by atagar

shouldn't we first warn the relay operators about it?

Oops, forgot to mention - we did reach out to this relay three months back. He was puzzled why his tor instance was publishing malformed 'dirreq-v3-reqs' lines and we didn't follow up.

comment:10 Changed 3 years ago by Sebastian

My comment is about an ASCII-requirement in contact info. Moving fast has nothing to do with how quickly we're patching it, I think we should have a couple of stables out that reject such configurations before dirauths reject relays with these kinds of descriptors.

comment:11 in reply to:  7 Changed 3 years ago by teor

Replying to Sebastian:

Seems we're moving rather fast with this, shouldn't we first warn the relay operators about it? How many relays does this affect?

It affects no relay descriptors, and the one relay extrainfo that agagar mentioned.

grep "[^A-Za-z0-9:*/@.=+ <>\[\]_,-.]" cached-descriptors returns no lines. And we'd use a less restrictive set of ASCII characters space through ~.

I can't even find the platform line that caused the issue in Atlas, so I wonder if it's an Atlas bug. The most recent descriptor for it has a normal platform line:

https://collector.torproject.org/recent/relay-descriptors/server-descriptors/2016-07-06-06-05-14-server-descriptors

I'm happy to reject it on the relay side in one release, and then have authorities reject it later. But given it affects 1-2 relays out of 7000, I would also be happy to fix it simultaneously on authorities and clients.

If we use nickm's simpler fix where authorities reject non-ASCII documents, it would only take effect after a majority of authorities upgraded to 0.2.9, or all authorities upgraded to 0.2.9, depending on where we do the check.

comment:12 Changed 3 years ago by atagar

Ahhh! Sorry, misunderstood. Non-ascii contact info certainly does come up a bit. Here ya go...

We have 24 descriptors with non-ascii contact info
We have 0 descriptors with non-ascii platform info

contact:

1wiki <tor(��t)1wiki.de>
Random Person <treacheroussunset ��������� openmailbox ��������� org>
Random Person <trash ��t psy-coding com>
0x6FBAB4BD076683498B71AB812C8A7BBF7B85E1AB Oddbj��rn Norstrand <oddbjorn AT norstrand dot priv dot no>
fogmountain[��t]gMx-D0t-nEt [tor-relay.co]
0x775BFC87 Hloup�� Honza <dumbjack AT seznam dot cz>
Mynameis Nobody <bleckbox ��t ouvaton dodt org>
0xdf0c3d316b7312d5 Alexander Kj��ll <alexander.kjall@gmail.com>
gpg-fingerprint: 294F 4893 913F 208E 98B6  925E 538D 9C7C 219C 76AD ���: olbrichski@gmail.com
0x0D3130F5 Cl��ment F��vrier <clement AT forumanalogue dot fr>
2048R/421F554B renke <renke ��T mobtm PUNKT com>
0x01086FDA Cristian Rasch <cristianrasch AT fastmail dot fm>��
<5p4m ��T gmx d0t de>
Random Nobody Person <kimskrams��hotmail.com>
Adri��n Lavi��s <adrian.lavios AT tutanota dot com>
Node Handler <node �� handler]at[marvid ��� france>
0xDADCA1EE Torexit Wall <torexitwall ��t hitler.rocks>
Elmo M��ntynen <elmo dot mantynen AT iki dot fi>
Random Person <simtim6 �� elitemail point org>
Lightning Rider <syroeska-ru ��T `mail` [ dot] ru>
Miko��aj Florkiewicz <tor@florkiewicz.me>
++����������++ c.m.i(at)mail.ru    ++ hkp://keyserver.ubuntu.com:11371 ++ Bitcoin? 153gfzos233LcSnJpDF5u3q76iVAACwTAd
��B`�
0xDADCA1EE Torexit Wall <torexitwall ��t hitler.rocks>

platform:

Script to get this...

from stem.descriptor import remote

non_ascii_contact, non_ascii_platform = [], []

def is_ascii(s):
  return all(ord(c) < 128 for c in s)

for desc in remote.get_server_descriptors():
  if desc.contact and not is_ascii(desc.contact):
    non_ascii_contact.append(desc.contact)

  if desc.platform and not is_ascii(desc.platform):
    non_ascii_contact.append(desc.platform)

print "We have %i descriptors with non-ascii contact info" % len(non_ascii_contact)
print "We have %i descriptors with non-ascii platform info" % len(non_ascii_platform)

print "\ncontact:\n"

for line in non_ascii_contact:
  print line

print "\nplatform:\n"

for line in non_ascii_platform:
  print line

comment:13 Changed 3 years ago by atagar

Bah, apologies for the bug (where non-ascii platforms were added to the non_ascii_contact). With that correction there actually is one...

We have 24 descriptors with non-ascii contact info
We have 1 descriptors with non-ascii platform info

contact:

1wiki <tor(��t)1wiki.de>
Random Person <treacheroussunset ��������� openmailbox ��������� org>
Random Person <trash ��t psy-coding com>
0x6FBAB4BD076683498B71AB812C8A7BBF7B85E1AB Oddbj��rn Norstrand <oddbjorn AT norstrand dot priv dot no>
fogmountain[��t]gMx-D0t-nEt [tor-relay.co]
0x775BFC87 Hloup�� Honza <dumbjack AT seznam dot cz>
Mynameis Nobody <bleckbox ��t ouvaton dodt org>
0xdf0c3d316b7312d5 Alexander Kj��ll <alexander.kjall@gmail.com>
gpg-fingerprint: 294F 4893 913F 208E 98B6  925E 538D 9C7C 219C 76AD ���: olbrichski@gmail.com
0xDAC4CCB6722FF7E5 Alexander B��hm <alexander.boehm@malbolge.net>
0x0D3130F5 Cl��ment F��vrier <clement AT forumanalogue dot fr>
2048R/421F554B renke <renke ��T mobtm PUNKT com>
0x01086FDA Cristian Rasch <cristianrasch AT fastmail dot fm>��
<5p4m ��T gmx d0t de>
Random Nobody Person <kimskrams��hotmail.com>
Adri��n Lavi��s <adrian.lavios AT tutanota dot com>
Node Handler <node �� handler]at[marvid ��� france>
0xDADCA1EE Torexit Wall <torexitwall ��t hitler.rocks>
Elmo M��ntynen <elmo dot mantynen AT iki dot fi>
Random Person <simtim6 �� elitemail point org>
Lightning Rider <syroeska-ru ��T `mail` [ dot] ru>
Miko��aj Florkiewicz <tor@florkiewicz.me>
++����������++ c.m.i(at)mail.ru    ++ hkp://keyserver.ubuntu.com:11371 ++ Bitcoin? 153gfzos233LcSnJpDF5u3q76iVAACwTAd
0xDADCA1EE Torexit Wall <torexitwall ��t hitler.rocks>

platform:

��B`�

comment:14 Changed 3 years ago by Sebastian

I don't like Nick's easy fix I think. The dirauths that upgrade often are also the ones that do the important stuff (badexit, bwauth) so you might be able to ensure you don't get the badexit flag by putting non-ascii into your descriptor.

To the parser argument, I kinda think the ship has sailed for anything that wants to be able to parse historic descriptors. I'm still in favor of not allowing arbitrary bytes in contact info going forward, but I think we should have it in relays before we have it in dirauths.

comment:15 Changed 3 years ago by teor

So I propose we do the following:

  • 0.2.9: check all documents published by relays
  • 0.3.0 or 0.3.1: a new consensus method where authorities refuse to vote for relays with non-ASCII descriptors
  • 0.3.2 (or whenever we use that consensus method): Authorities can reject non-ASCII uploads

I assume "printing ASCII" means "space to tilde, tab, and linefeed" but we should also clarify that in the torspec.

I'm not sure if it's possible to get non-ASCII content in a hidden service descriptor without memory corruption. But in any case, hidden services, HSDirs, and clients should reject that, too. I've split the HS part off into #19647.

comment:16 Changed 3 years ago by cypherpunks

As one of the people with non-ascii ContactInfo, I strongly advise against making that config ascii-only. It might not be obvious to english-native speakers, but in countries with non-ascii characters in their language the introduction of IDN and non-ascii mail addresses was a major advance; restricting this would be a step backward, which will probably need to be corrected again in the future when non-ascii mail addresses become more ubiquitous.

I would prefer for all UTF-8 chars to be usable in the ContactInfo, which also allows to not have to transliterate your name into ascii.

comment:17 in reply to:  16 ; Changed 3 years ago by teor

Replying to cypherpunks:

As one of the people with non-ascii ContactInfo, I strongly advise against making that config ascii-only. It might not be obvious to english-native speakers, but in countries with non-ascii characters in their language the introduction of IDN and non-ascii mail addresses was a major advance; restricting this would be a step backward, which will probably need to be corrected again in the future when non-ascii mail addresses become more ubiquitous.

I would prefer for all UTF-8 chars to be usable in the ContactInfo, which also allows to not have to transliterate your name into ascii.

Currently, the Tor ContactInfo and Platform consist of arbitrary binary data, terminated by an ASCII linefeed byte.

There's no indication of how they should be interpreted - whether they're a particular extended-ASCII codepage, or UTF-8, or something else.

If the ContactInfo and Platform are UTF-8, it's entirely safe to parse the entire file as UTF-8, then restrict all other lines to ASCII. It's also entirely safe to parse the file as ASCII, except for the ContactInfo and Platform, which can be any bytes except ASCII LF. (UTF-8 encodes 0-127 as 0-127, and never maps any other characters to bytes 0-127.)

Some encodings for the ContactInfo and Platform may even produce linefeed bytes, which is clearly unsuitable.

I think we have 3 options:

  1. We could specify validate that ContactInfo and Platform are valid UTF-8 instead. But I'd hate to have to import a changing series of Unicode libraries to do this. Or specify a particular Unicode version. Or deal with the character ambiguities or parser security risks Unicode entails. (Yes, there are attacks on Unicode parsers - remember the iPhone emoji bug?)
  1. We could remain with the current spec, which is under-specified, and leave them as arbitrary, unspecified-encoding bytes. But this is not ideal - how can a relay operator be contacted, when the encoding of their address is unclear?
  1. We could require relay operators to have an ASCII email address (it could be another account, an alias, a transliteration, or an IDN-ASCII-encoding). Which means that there's no encoding ambiguity, and people whose descriptor-viewing or mail programs don't understand UTF-8 can still email operators. It's onerous for those whose names are not ASCII, but so is the risk of being uncontactable via non-Unicode descriptor readers and/or mail programs.

What do you think, cypherpunks?

comment:18 in reply to:  17 ; Changed 3 years ago by cypherpunks

Another cypherpunks here.

Replying to teor:

Currently, the Tor ContactInfo and Platform consist of arbitrary binary data, terminated by an ASCII linefeed byte.

That's not what the spec says. It's supposed to be ASCII. dir-spec 1.2:

NL = The ascii LF character (hex value 0x0a).
Document ::= (Item | NL)+
Item ::= KeywordLine Object*
KeywordLine ::= Keyword NL | Keyword WS ArgumentChar+ NL
Keyword = KeywordChar+
KeywordChar ::= 'A' ... 'Z' | 'a' ... 'z' | '0' ... '9' | '-'
ArgumentChar ::= any printing ASCII character except NL.
WS = (SP | TAB)+
Object ::= BeginLine Base64-encoded-data EndLine
BeginLine ::= "-----BEGIN " Keyword "-----" NL
EndLine ::= "-----END " Keyword "-----" NL

Contact info:

"contact" info NL

[At most once]

Describes a way to contact the relay's administrator, preferably
including an email address and a PGP key fingerprint.

comment:19 in reply to:  18 Changed 3 years ago by teor

Keywords: needs-proposal added; needs-proposal-maybe removed

Replying to cypherpunks:

Another cypherpunks here.

Replying to teor:

Currently, the Tor ContactInfo and Platform consist of arbitrary binary data, terminated by an ASCII linefeed byte.

That's not what the spec says. It's supposed to be ASCII. dir-spec 1.2:

NL = The ascii LF character (hex value 0x0a).
Document ::= (Item | NL)+
Item ::= KeywordLine Object*
KeywordLine ::= Keyword NL | Keyword WS ArgumentChar+ NL
Keyword = KeywordChar+
KeywordChar ::= 'A' ... 'Z' | 'a' ... 'z' | '0' ... '9' | '-'
ArgumentChar ::= any printing ASCII character except NL.
WS = (SP | TAB)+

Contact info:

"contact" info NL

[At most once]

Describes a way to contact the relay's administrator, preferably
including an email address and a PGP key fingerprint.

Yes, we have a spec issue. The implementation of platform and contact allows arbitrary data. The format has never been enforced.

That said, there are multiple ways to allow users to spell their names correctly, but mandate ASCII in descriptors:

  • users can provide the ASCII alias for their EAI email address

https://en.wikipedia.org/wiki/Email_address#Internationalization

  • users can provide an ASCII website address (in IDN form if necessary) that links to their name and email as they would like it displayed

https://en.wikipedia.org/wiki/Internationalized_domain_name

  • users can choose a transliteration or nickname they feel most comfortable with, and use that for their name and email address

There are excellent technologies available for accurately displaying names in a variety of character sets. But Tor descriptors and the associated infrastructure can't reliably do that at the moment.

Regardless of what decision we make about the format of these fields, there will be ambiguity. There will be display inconsistency. There will be users who can't email non-ASCII addresses.

And this means that the Contact field can't achieve its purpose: helping people contact relay operators.

I'm also concerned that making unicode a requirement in some other Tor applications could have security impacts, as well as taking some work to implement correctly.

I'm happy to discuss other alternatives. Perhaps we need a proposal. And perhaps we need more time to work this out.

comment:20 Changed 3 years ago by isabela

Keywords: isaremoved added
Milestone: Tor: 0.2.9.x-finalTor: 0.2.???

comment:21 Changed 3 years ago by teor

Milestone: Tor: 0.2.???Tor: 0.3.???

Milestone renamed

comment:22 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.???Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:23 Changed 2 years ago by catalyst

Cc: catalyst added

comment:24 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:25 Changed 2 years ago by nickm

Keywords: isaremoved removed

comment:26 Changed 23 months ago by nickm

Keywords: tor-dirauth needs-spec easy added
Parent ID: #18656

Note that the root cause of #18656 was probably fixed in #22490. But that's no reason to close this ticket.

comment:27 Changed 19 months ago by nickm

Parent ID: #24033

comment:28 Changed 17 months ago by atagar

This is nipping us again with another extrainfo field: #24821.

comment:29 Changed 17 months ago by teor

Milestone: Tor: unspecifiedTor: 0.3.4.x-final

Let's try to make progress on this in 0.3.4

comment:30 Changed 14 months ago by nickm

Keywords: 034-triage-20180328 added

comment:31 Changed 14 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:32 Changed 14 months ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:33 Changed 9 months ago by neel

Cc: neel@… added
Owner: set to neel
Status: newassigned

comment:34 Changed 9 months ago by teor

Description: modified (diff)

Change description based on prop285 utf-8 support.

comment:36 Changed 9 months ago by teor

Description: modified (diff)
Summary: Authorities should reject non-ASCII content in ExtraInfo descriptorsAuthorities should reject non-UTF-8content in ExtraInfo descriptors

comment:37 Changed 9 months ago by teor

Summary: Authorities should reject non-UTF-8content in ExtraInfo descriptorsAuthorities should reject non-UTF-8 content in ExtraInfo descriptors

comment:38 Changed 9 months ago by neel

I have a few questions:

  1. As prop285 already exists, I assume I don't need to make a proposal. Is this correct?
  2. Should I put this in dirserv_add_extrainfo()?
  3. In extrainfo_t, should I check that cache_info.signed_descriptor_body is UTF-8, and if it isn't, I should reject the descriptor?
  4. I don't think there is any library for checking for UTF-8 text in Tor. Can I include external library from GitHub (I am thinking about using https://github.com/chansen/c-utf8-valid) and modify it to fit with Tor (meaning including it in src/ext, not linking to another library an adding a dependency)? Is there going to be a security issue with a third-party library?

comment:39 in reply to:  38 Changed 9 months ago by teor

Parent ID: #24033#27367
Resolution: duplicate
Status: assignedclosed

Hi,

It turns out that the branch in #27367 already implements the extrainfo check.
Would you like to review it? (Please put your review on #27367.)

Please feel free to pick up any of the similar tickets that are children of #24033.

I had answered some of your questions before I realised. I hope the answers are still helpful.

Replying to neel:

I have a few questions:

  1. As prop285 already exists, I assume I don't need to make a proposal. Is this correct?

You don't need to make a proposal. But please read prop285, and tell us if it doesn't make sense.

The proposal also contains some extra rules on top of UTF-8 for:

  • C string compatibility, and
  • compatibility with older Tor versions that expect ASCII

https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt#n70

...

  1. I don't think there is any library for checking for UTF-8 text in Tor. Can I include external library from GitHub (I am thinking about using https://github.com/chansen/c-utf8-valid) and modify it to fit with Tor (meaning including it in src/ext, not linking to another library an adding a dependency)? Is there going to be a security issue with a third-party library?

There's a branch in #27373 that implements a UTF-8 string check function.

Note: See TracTickets for help on using tickets.