Why the non-greedy globbing (the "?" modifier added to "+" or "*") when you are matching all the rest of the line anyway?
Compare:
520 - let matchResult = string.match(/^250[ -].+?=(.*?)$/mi) ||520 + let matchResult = string.match(/^250[ -].+?=(.*)$/mi) ||
You do this in several other places in this file. Just wondering, I don't think it makes a difference.
Also, based on the regex above, shouldn't the comment in line 515 be like so?
{{{
515 - // or 250 circuit-status...
515 + // or 250 circuit-status=...
}}}
To match the rest of the comment.
Also, I do not know what the assumptions/pre-conditions are on the input string, but you do not check that the first line of the multiline case ("^250\+") ends with "=", as exemplified in the commentary for "info.keyValueStringsFromMessage".
Anyway, sorry for wasting your time, I'm not familiar with this code.
Here are some things I thought about while reading the code in tor-control-port.js. I am confused about other things(, too) (especially while looking at the spec + the controller.c file) but these might be more for a second clean-up (after I had more time to sort them out).
// GETCONF with a single argument returns results that look like // results from GETINFO with multiple arguments."
"kind of parsing for" <- seems the sentence is trailing off
s/displatcher/dispatcher/
I wonder why we only have getInfoMultiple() given that both GETINFO and GETCONF can have multiple keywords (we use them both with just one keyword right now as far as I can see which does not explain the different treatment of both to me)
There are a bunch of comments in getInfoMultiple() and getInfo we probably might want to get rid of.
Tor Launcher has a different way of parsing things (leading to different results it seems (take the case of a 250 Bridge reply for GETCONF bridge, e.g.). I think we should look at it and have our Tor Browser controllers behave the same way even if they are using different parsing means.
// matchResult finds a single-line result for `250-` or a multi-line one for `250+`.
is interesting as you are not mentioning the 250 case although your regex is checking for it. But I think the comment is right here (but not the code) as the 250 case should not make it that far as I guess EndReplyLine does not contain a = (see the 250 Bridge example) and key would therefore be null. That's one of the things I still need to figure out which is a bit tricky as ReplyText is not specified in the spec.
Why the non-greedy globbing (the "?" modifier added to "+" or "*") when you are matching all the rest of the line anyway?
Compare:
520 - let matchResult = string.match(/^250[ -].+?=(.*?)$/mi) ||520 + let matchResult = string.match(/^250[ -].+?=(.*)$/mi) ||
You do this in several other places in this file. Just wondering, I don't think it makes a difference.
Fixed here. It's important that I turned off the multiline flag as well here, so we don't accidentally match multiple lines.
Also, based on the regex above, shouldn't the comment in line 515 be like so?
515 - // or `250 circuit-status...` 515 + // or `250 circuit-status=...`}}}
Fixed.
To match the rest of the comment.
Also, I do not know what the assumptions/pre-conditions are on the input string, but you do not check that the first line of the multiline case ("^250\+") ends with "=", as exemplified in the commentary for "info.keyValueStringsFromMessage".
I've added an end-of-line marker "$" following the "=".
Here are some things I thought about while reading the code in tor-control-port.js. I am confused about other things(, too) (especially while looking at the spec + the controller.c file) but these might be more for a second clean-up (after I had more time to sort them out).
{{{
// GETCONF with a single argument returns results that look like
// results from GETINFO with multiple arguments."
}}}
Does not seem to be so:
{{{
[11-10 15:33:46] Torbutton INFO: controlPort << getconf bridge
That's a good point. Right now, we're not parsing getconf lines such as 250 Bridge. We'd need to write some extra code for this. I tried to clarify the comment to reflect the current situation.
"kind of parsing for" <- seems the sentence is trailing off
Fixed.
s/displatcher/dispatcher/
Fixed.
I wonder why we only have getInfoMultiple() given that both GETINFO and GETCONF can have multiple keywords (we use them both with just one keyword right now as far as I can see which does not explain the different treatment of both to me)
I never implemented getConfMultiple, but it should be pretty straightforward if we need it.
There are a bunch of comments in getInfoMultiple() and getInfo we probably might want to get rid of.
I fixed this error handling so it should correctly return a rejected promise.
Tor Launcher has a different way of parsing things (leading to different results it seems (take the case of a 250 Bridge reply for GETCONF bridge, e.g.). I think we should look at it and have our Tor Browser controllers behave the same way even if they are using different parsing means.
That sounds like a good project, but perhaps on a separate ticket? We have discussed the possibility of merging some of the code.
{{{
// matchResult finds a single-line result for 250- or a multi-line one for 250+.
}}}
is interesting as you are not mentioning the 250 case although your regex is checking for it. But I think the comment is right here (but not the code) as the 250 case should not make it that far as I guess EndReplyLine does not contain a = (see the 250 Bridge example) and key would therefore be null. That's one of the things I still need to figure out which is a bit tricky as ReplyText is not specified in the spec.
If you activate a bridge in Tor Browser, such as obfs3, then you get:
Here are some things I thought about while reading the code in tor-control-port.js. I am confused about other things(, too) (especially while looking at the spec + the controller.c file) but these might be more for a second clean-up (after I had more time to sort them out).
{{{
// GETCONF with a single argument returns results that look like
// results from GETINFO with multiple arguments."
}}}
Does not seem to be so:
{{{
[11-10 15:33:46] Torbutton INFO: controlPort << getconf bridge
That's a good point. Right now, we're not parsing getconf lines such as 250 Bridge. We'd need to write some extra code for this. I tried to clarify the comment to reflect the current situation.
Hmm... Did you really mean 250[+ ]key=value? It should be something like 250[ -]key=value instead, no? At least that is what I would guess reading handle_control_getconf() in tor's control.c.
I wonder why we only have getInfoMultiple() given that both GETINFO and GETCONF can have multiple keywords (we use them both with just one keyword right now as far as I can see which does not explain the different treatment of both to me)
I never implemented getConfMultiple, but it should be pretty straightforward if we need it.
Yes, but as I said we don't need getInfoMultiple() either at the moment as far as I can see. Why do we have this additional code then?
{{{
// matchResult finds a single-line result for 250- or a multi-line one for 250+.
}}}
is interesting as you are not mentioning the 250 case although your regex is checking for it. But I think the comment is right here (but not the code) as the 250 case should not make it that far as I guess EndReplyLine does not contain a = (see the 250 Bridge example) and key would therefore be null. That's one of the things I still need to figure out which is a bit tricky as ReplyText is not specified in the spec.
If you activate a bridge in Tor Browser, such as obfs3, then you get:
Here are some things I thought about while reading the code in tor-control-port.js. I am confused about other things(, too) (especially while looking at the spec + the controller.c file) but these might be more for a second clean-up (after I had more time to sort them out).
{{{
// GETCONF with a single argument returns results that look like
// results from GETINFO with multiple arguments."
}}}
Does not seem to be so:
{{{
[11-10 15:33:46] Torbutton INFO: controlPort << getconf bridge
That's a good point. Right now, we're not parsing getconf lines such as 250 Bridge. We'd need to write some extra code for this. I tried to clarify the comment to reflect the current situation.
Hmm... Did you really mean 250[+ ]key=value? It should be something like 250[ -]key=value instead, no? At least that is what I would guess reading handle_control_getconf() in tor's control.c.
Oops, dumb mistake. Thanks for catching it.
I wonder why we only have getInfoMultiple() given that both GETINFO and GETCONF can have multiple keywords (we use them both with just one keyword right now as far as I can see which does not explain the different treatment of both to me)
I never implemented getConfMultiple, but it should be pretty straightforward if we need it.
Yes, but as I said we don't need getInfoMultiple() either at the moment as far as I can see. Why do we have this additional code then?
Sorry I missed the point first time around. You're right, we don't need it. I think I wrote it thinking it might be useful in some situation, but now I think that's unlikely. I'm removing it for now.
I was trying to track the input back to Tor's output and stumbled across the 6500-lines control.c... So what I was wondering was:
In general, what is the threat expectation here? What has to be considered adversary-controlled input?
Is it worth re-implementing the full control protocol parser in JS so that you can verify each reply?
Hopefully control.c takes a good defensive parsing approach. Does control.c offer some guarantees about its output so that JS can just rely on it?
This is a good point worth considering. What do you consider to be a full control protocol parser.
So, like I said, I'm not familiar with the codebase and so far only skimmed the spec, please bear with me.
After reading gk's message where he described the flaw that caused the bug in the ticked that spawned this one, my thinking went something like:
Aha, so the are 2 kinds of data being handled here: data that comes from the network (from a remote Tor node; e.g. the name thingy gk mentioned), and data generated by the local Tor node.
The first kind is evil, this is data you must not trust and do your best to validate before attempting to use.
One may get away with trusting the second kind, since the program producing it is already running on your system. If you trust this data, this implies you trust the Tor node you are talking to. Since you trust the local Tor node, just make sure you only connect to it.
Now, you get both kinds of data munged into a single protocol: the Tor control protocol. But in fact, the evil data was already parsed by the Tor node (in control.c or even earlier, IIUC).
Hopefully, that parsing is correct, and the result, safe. If it is, then clients of the control protocol, "controllers", actually only get the second kind of data. This is what I meant with "guarantees offered by control.c".
In such case, controllers can get away with being a bit sloppy and parsing with things like, umm, I don't know, regex.
So when I said "full control protocol" I was thinking of not just the "protocol" part of the control protocol, but the data as well (some of which might be evil): a complete, comprehensive parser.
Maybe this is what the current code tries to do already?
The right solution, everyone knows, is to (a) define a complete grammar and then (b) a strict parser for that grammar.
Do we have (a) yet? In "control-spec.txt" I saw several non-terminal undefined (look for "XXX", for example).
Once we have (b), that same parser should be used in all JS code (button, launcher, etc.)
Notice that this must have been done already: there are a few standalone Tor controllers out there (I remember "stem" but I think there are others).
Another possibility is to consider whether the Tor control port should switch to JSON or something similar to simplify code at both ends.
I guess the usefulness of this would be to simply rely on existing, correct, secure parsers of JSON (or whatever else). A library like that would indeed simplify code in both Tor and the controller.
"Tor expects any commands to be terminated by CRLF." -> "Controllers should send any commands terminated by CRLF." maybe? (this would reflect the spec better where it is a "SHOULD")
We have 250[-\ ] and 250[- ]. I guess we should decide whether we need to escape the white space or not... ;)
It seems we only need info parsers for ns/id/, ip-to-country/, circuit-status and bridge at the moment. What should we do with the other ones, especially if they need a parsing function which is not used somewhere else anyway (like the config-text case)?
I think the following two RegExes could get cleaned up as well:
new RegExp("^650." + type, "i")/^650 \S+?\s(.*?)$/m
The latter according to the other clean-up done in your 17568 branch. For the former, according to the spec use-case the reply is "650 STREAM", so nothing case-sensitive should be needed and having a white space after "650" is defined as well.
It seems to me we don't need "m" : data => utils.listMapData(data, []) as this is only available in the votes.
Hmm... this is confusing. What consensus type are we using here? I'd assume a microdescriptor-based one but the log output I get does not convince me, see e.g.
r madiba yG0vPe/ih6Duso1Ih68U41wXJzM Xsg83S1uvP1CQFkzWAhtUFUpkPw 2015-11-17 22:46:34 209.222.8.196 443 80
If we don't do that, why not, given that we have it cached lying around?
Another thing I want to do for this ticket is to simulate errors in various parts of the code, and ensure that we have error handling that allows us to log the errors and then gracefully recover. That should help avoid the situation where a single error causes the tor circuit display to die.
I think taking care of the +-case is fine but I wonder whether we need to do that given that we only watch for STREAM events (currently). If we want to cover the generic case here and not want to bother with the RegEx again in case we add more events to watch for later, then I wonder whether the RegEx is correct at all given e.g. 650+" Severity CRLF Data 650 SP "OK" CRLF (you are checking only for one \s before trying to capture but you get CRLF).
Then there is still the +-case from '(the "?" modifier added to "+" or "*")' in comment:1, no? Could you make an argument for why we (still) need this?
Then there is still the +-case from '(the "?" modifier added to "+" or "*")' in comment:1, no? Could you make an argument for why we (still) need this?
And for the one after the * in string.match(/^250\+.+?=([\s\S]*?)^\.$/m), as well?
Seems I forgot to mention that we don't need "circ" : info.circuitStatusParser either at the moment as it seems. :(
Commented out for now.
One additional thing:
{{{
let dataText = message.match(/^650[ +-]\S+?\s(.*)/m)[1];
return controlSocket.addNotificationCallback(new RegExp("^650[ +-]" + type),
}}}
I think taking care of the +-case is fine but I wonder whether we need to do that given that we only watch for STREAM events (currently). If we want to cover the generic case here and not want to bother with the RegEx again in case we add more events to watch for later, then I wonder whether the RegEx is correct at all given e.g. 650+" Severity CRLF Data 650 SP "OK" CRLF (you are checking only for one \s before trying to capture but you get CRLF).
Good point. I have dropped the "650+" and "650-" cases for now, as we aren't using them.
Then there is still the +-case from '(the "?" modifier added to "+" or "*")' in comment:1, no? Could you make an argument for why we (still) need this?
And for the one after the * in string.match(/^250+.+?=([\s\S]*?)^.$/m), as well?
In general, I think it is safer to use lazy rather than greedy quantifiers. At least it avoids errors where we accidentally match a second (or later) instance of something when we intend to match on the first instance.