If it is meant to return true for all IP addresses, it should return true for IPv6 as well.
If it's meant to return true just for hostnames, it should reject IPv4 addresses.
a valid host name can never have the dotted-decimal form #.#.#.#, since at least the highest-level component label will be alphabetic
It is meant to return true for everything that exits will let you (try to) connect to.
Maybe that means changing the word "hostname" to a word that doesn't make some people think it can't be an IP address.
Hmmm the way I understand it on how it is used in the code, it can not take an IP address. It has to be a DNS name. Also, every unit tests for this function is testing for a name, never an address.
It also sounds like allowing ipv6 addresses is the right thing to do.
So this function is doing what it should imo except returning true for an IPv4. One simple test would be to pass the value through inet_pton and if it returns -1, we know we do not have an IP address?
We have specific functions to check for an IPv4/IPv6, we should use those.
So maybe what we need is new function that does what arma proposes that is anything valid that you can connect to (DNS name + IP) ?
Please update the function comment for string_is_valid_hostname(), to say that it only accepts hostnames that end in an alphabetical TLD.
Comment the two blocks within the inner loop, and explain why they're different.
Add unit tests for alphabetical and non-alphabetical TLDs.
Add a unit test containing a punycode TLD, and modify the function to accept it, and modify the function comment. (Hint: the TLD can contain dashes.)
Explain why it's ok to use ctype.h, because the previous code didn't.
Do we use it elsewhere in Tor?
Is it available and consistent on all supported platforms?
Is this code called when parsing votes or creating the consensus?
I was wrong, it doesn't say that the TLD must be alphabetical:
// Last label of a hostname is required to be alphabetic according to // RFC 1123 Section 2.1.
Instead, it says:
The syntax of a legal Internet host name was specified in RFC-952 [DNS:4]. One aspect of host name syntax is hereby changed: the restriction on the first character is relaxed to allow either a letter or a digit. Host software MUST support this more liberal syntax.... Whenever a user inputs the identity of an Internet host, it SHOULD be possible to enter either (1) a host domain name or (2) an IP address in dotted-decimal ("#.#.#.#") form. The host SHOULD check the string syntactically for a dotted-decimal number before looking it up in the Domain Name System.
Therefore, rather than trying to detect punycode or alphabetical endings, let's use this logic:
If it is a valid IPv4 or IPv6 address, it is not a hostname
Otherwise, if it has alphanumeric, -, or _, in the right format, it is a hostname.
Being lax makes Tor more future proof to domain name format changes.
And it confirms to the RFC.
Therefore, rather than trying to detect punycode or alphabetical endings, let's use this logic:
If it is a valid IPv4 or IPv6 address, it is not a hostname
Otherwise, if it has alphanumeric, -, or _, in the right format, it is a hostname.
Being lax makes Tor more future proof to domain name format changes.
And it confirms to the RFC.
Technically IP address strings are not hostnames, but we do allow them to be used in SOCKS requests. That's why I introduced string_is_valid_dest that checks destination validity by branching out to tor_inet_pton and string_is_valid_hostname. We want string_is_valid_hostname to do exactly what the function name says - DNS hostname validation.
I think supporting punycode TLDs is a good idea. Currently all entries in of valid TLDs [0] are either alphabetic or punycode.
let's not break as soon as IETF slightly changes the IDN or other DNS formats, please.
Please make the hostname validation check for alphanumeric, hyphen, and underscore in every component.
Do not check for the precise punycode format as of 2018, that's fragile, and will break if the IDN format is changed, or a new DNS format is added.
If you wish, you can exclude IPv4 addresses, which would otherwise match the new hostname check.
Please keep the unit tests from the two commits you reverted.
Punycode should pass, as should partially numeric TLDs.
You can add a test for a fully numeric TLD if you'd like, too.
Once you've cleaned up the branch and added the unit tests, please flip this ticket into merge_ready.
The tor_assert(len > 0) check in string_is_valid_dest() makes me quite nervous. There's nothing in the documentation for that function that says you can't call it with "" as an argument. It would make much more sense to me if the function just returns 0 when len == 0.
I don't understand why you changed the check do {} while loop from result to result > 0.
Hrmm. Downside of tons of reviewers: I have a weak preference that the hostname be treated strictly. I think that permitting more things on the socksport itself is fine, but that our function names should reflect current RFC notions, and not leave wiggle room for future potential changes, since in other cases uses of this function may end up causing bugs. This would mean that string_is_valid_hostname() would enforce full RFC hostname strictness, but then string_is_valid_dest() would or that together with string_is_valid_ipv4, string_is_valid_ipv6, and another string_is_valid_extra()...
But my preference is only a weak one. Everything else about this code looks good to me now. I'm going to mark this merge_ready. If someone else decides to agree with me, feel free to change to needs_revision.
nickm: Naming this appropriately is a good idea. I like string_is_valid_nonrfc_hostname() and a warning comment. Technically the function allows things like luck.y13 (which is explicitly tested in the unit tests), but that is not any kind of valid DNS name or IP.