Opened 4 years ago

Closed 3 years ago

#17568 closed task (fixed)

Clean up tor-control-port.js in Torbutton

Reported by: gk Owned by: tbb-team
Priority: Medium Milestone:
Component: Applications/Tor Browser Version:
Severity: Normal Keywords: tbb-torbutton, TorBrowserTeam201512R
Cc: arthuredelstein, brade, mcs Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

We should look again over tor-control-port.js Torbutton and add a bit more error handling code + do a general clean-up where deemed necessary.

Child Tickets

Change History (24)

comment:1 Changed 4 years ago by arthuredelstein

Some useful comments, from
https://trac.torproject.org/projects/tor/ticket/16990#comment:6

Why the non-greedy globbing (the "?" modifier added to "+" or "*") when you are matching all the rest of the line anyway?

Compare:

520 - let matchResult = string.match(/^250[ -].+?=(.*?)$/mi) ||
520 + let matchResult = string.match(/^250[ -].+?=(.*)$/mi) ||

You do this in several other places in this file. Just wondering, I don't think it makes a difference.

Also, based on the regex above, shouldn't the comment in line 515 be like so?

  515 -	// or `250 circuit-status...`
  515 +	// or `250 circuit-status=...`

To match the rest of the comment.

Also, I do not know what the assumptions/pre-conditions are on the input string, but you do not check that the first line of the multiline case ("^250\+") ends with "=", as exemplified in the commentary for "info.keyValueStringsFromMessage".

Anyway, sorry for wasting your time, I'm not familiar with this code.

comment:2 Changed 4 years ago by mcs

Cc: brade mcs added

comment:3 Changed 4 years ago by cypherpunks

Why the case-insensitive flag ("i") when the pattern does not contain any alphabetic character?

Seems like you should also drop the multiline flag ("m") when you are only trying to match a single-line reply.

Aside:

I was trying to track the input back to Tor's output and stumbled across the 6500-lines control.c... So what I was wondering was:

  • In general, what is the threat expectation here? What has to be considered adversary-controlled input?
  • Is it worth re-implementing the full control protocol parser in JS so that you can verify each reply?
  • Hopefully control.c takes a good defensive parsing approach. Does control.c offer some guarantees about its output so that JS can just rely on it?

comment:4 Changed 4 years ago by gk

Here are some things I thought about while reading the code in tor-control-port.js. I am confused about other things(, too) (especially while looking at the spec + the controller.c file) but these might be more for a second clean-up (after I had more time to sort them out).

1)

  // GETCONF with a single argument returns results that look like
  // results from GETINFO with multiple arguments."

Does not seem to be so:

[11-10 15:33:46] Torbutton INFO: controlPort << getconf bridge


[11-10 15:33:46] Torbutton INFO: controlPort >> 250 Bridge
[11-10 15:33:46] Torbutton INFO: controlPort << getinfo version config-file


[11-10 15:33:46] Torbutton INFO: controlPort >> 250-version=0.2.7.4-rc (git-f55d23e1e66e9b0f)
250-config-file=/path/to/torbrowser/tor-browser/Browser/TorBrowser/Data/Tor/torrc
250 OK

2)
"kind of parsing for" <- seems the sentence is trailing off

3)
s/displatcher/dispatcher/

4) I wonder why we only have getInfoMultiple() given that both GETINFO and GETCONF can have multiple keywords (we use them both with just one keyword right now as far as I can see which does not explain the different treatment of both to me)

5) There are a bunch of comments in getInfoMultiple() and getInfo we probably might want to get rid of.

6) Tor Launcher has a different way of parsing things (leading to different results it seems (take the case of a 250 Bridge reply for GETCONF bridge, e.g.). I think we should look at it and have our Tor Browser controllers behave the same way even if they are using different parsing means.

comment:5 Changed 4 years ago by gk

  // matchResult finds a single-line result for `250-` or a multi-line one for `250+`.

is interesting as you are not mentioning the 250 case although your regex is checking for it. But I think the comment is right here (but not the code) as the 250 case should not make it that far as I guess EndReplyLine does not contain a = (see the 250 Bridge example) and key would therefore be null. That's one of the things I still need to figure out which is a bit tricky as ReplyText is not specified in the spec.

comment:6 in reply to:  5 ; Changed 4 years ago by arthuredelstein

Keywords: TorBrowserTeam201511R added
Status: newneeds_review

Here's a new patch for review, where I tried to address all of issues mentioned in the above comments:

https://github.com/arthuredelstein/torbutton/commit/17568

Please see my replies below for more details.

(It probably makes sense to leave this ticket open, even if we land the first patch.)


Replying to arthuredelstein:

Some useful comments, from
https://trac.torproject.org/projects/tor/ticket/16990#comment:6

Why the non-greedy globbing (the "?" modifier added to "+" or "*") when you are matching all the rest of the line anyway?

Compare:

520 - let matchResult = string.match(/^250[ -].+?=(.*?)$/mi) ||
520 + let matchResult = string.match(/^250[ -].+?=(.*)$/mi) ||

You do this in several other places in this file. Just wondering, I don't think it makes a difference.

Fixed here. It's important that I turned off the multiline flag as well here, so we don't accidentally match multiple lines.

Also, based on the regex above, shouldn't the comment in line 515 be like so?

  515 -	// or `250 circuit-status...`
  515 +	// or `250 circuit-status=...`

Fixed.

To match the rest of the comment.

Also, I do not know what the assumptions/pre-conditions are on the input string, but you do not check that the first line of the multiline case ("^250\+") ends with "=", as exemplified in the commentary for "info.keyValueStringsFromMessage".

I've added an end-of-line marker "$" following the "=".


Replying to cypherpunks:

Why the case-insensitive flag ("i") when the pattern does not contain any alphabetic character?

Fixed.

Seems like you should also drop the multiline flag ("m") when you are only trying to match a single-line reply.

Done.

Aside:

I was trying to track the input back to Tor's output and stumbled across the 6500-lines control.c... So what I was wondering was:

  • In general, what is the threat expectation here? What has to be considered adversary-controlled input?
  • Is it worth re-implementing the full control protocol parser in JS so that you can verify each reply?
  • Hopefully control.c takes a good defensive parsing approach. Does control.c offer some guarantees about its output so that JS can just rely on it?

This is a good point worth considering. What do you consider to be a full control protocol parser.

Another possibility is to consider whether the Tor control port should switch to JSON or something similar to simplify code at both ends.


Replying to gk:

Here are some things I thought about while reading the code in tor-control-port.js. I am confused about other things(, too) (especially while looking at the spec + the controller.c file) but these might be more for a second clean-up (after I had more time to sort them out).

1)

  // GETCONF with a single argument returns results that look like
  // results from GETINFO with multiple arguments."

Does not seem to be so:

[11-10 15:33:46] Torbutton INFO: controlPort << getconf bridge


[11-10 15:33:46] Torbutton INFO: controlPort >> 250 Bridge
[11-10 15:33:46] Torbutton INFO: controlPort << getinfo version config-file


[11-10 15:33:46] Torbutton INFO: controlPort >> 250-version=0.2.7.4-rc (git-f55d23e1e66e9b0f)
250-config-file=/path/to/torbrowser/tor-browser/Browser/TorBrowser/Data/Tor/torrc
250 OK

That's a good point. Right now, we're not parsing getconf lines such as 250 Bridge. We'd need to write some extra code for this. I tried to clarify the comment to reflect the current situation.

2)
"kind of parsing for" <- seems the sentence is trailing off

Fixed.

3)
s/displatcher/dispatcher/

Fixed.

4) I wonder why we only have getInfoMultiple() given that both GETINFO and GETCONF can have multiple keywords (we use them both with just one keyword right now as far as I can see which does not explain the different treatment of both to me)

I never implemented getConfMultiple, but it should be pretty straightforward if we need it.

5) There are a bunch of comments in getInfoMultiple() and getInfo we probably might want to get rid of.

I fixed this error handling so it should correctly return a rejected promise.

6) Tor Launcher has a different way of parsing things (leading to different results it seems (take the case of a 250 Bridge reply for GETCONF bridge, e.g.). I think we should look at it and have our Tor Browser controllers behave the same way even if they are using different parsing means.

That sounds like a good project, but perhaps on a separate ticket? We have discussed the possibility of merging some of the code.


Replying to gk:

  // matchResult finds a single-line result for `250-` or a multi-line one for `250+`.

is interesting as you are not mentioning the 250 case although your regex is checking for it. But I think the comment is right here (but not the code) as the 250 case should not make it that far as I guess EndReplyLine does not contain a = (see the 250 Bridge example) and key would therefore be null. That's one of the things I still need to figure out which is a bit tricky as ReplyText is not specified in the spec.

If you activate a bridge in Tor Browser, such as obfs3, then you get:

controlPort << getconf bridge

controlPort >> 250-Bridge=obfs3 [ip] [fp]
250-Bridge=obfs3 [ip] [fp]
250-Bridge=obfs3 [ip] [fp]
250-Bridge=obfs3 [ip] [fp]
250 Bridge=obfs3 [ip] [fp]

Note the final line without a -.

comment:7 in reply to:  6 ; Changed 4 years ago by gk

Replying to arthuredelstein:

Replying to gk:

Here are some things I thought about while reading the code in tor-control-port.js. I am confused about other things(, too) (especially while looking at the spec + the controller.c file) but these might be more for a second clean-up (after I had more time to sort them out).

1)

  // GETCONF with a single argument returns results that look like
  // results from GETINFO with multiple arguments."

Does not seem to be so:

[11-10 15:33:46] Torbutton INFO: controlPort << getconf bridge


[11-10 15:33:46] Torbutton INFO: controlPort >> 250 Bridge
[11-10 15:33:46] Torbutton INFO: controlPort << getinfo version config-file


[11-10 15:33:46] Torbutton INFO: controlPort >> 250-version=0.2.7.4-rc (git-f55d23e1e66e9b0f)
250-config-file=/path/to/torbrowser/tor-browser/Browser/TorBrowser/Data/Tor/torrc
250 OK

That's a good point. Right now, we're not parsing getconf lines such as 250 Bridge. We'd need to write some extra code for this. I tried to clarify the comment to reflect the current situation.

Hmm... Did you really mean 250[+ ]key=value? It should be something like 250[ -]key=value instead, no? At least that is what I would guess reading handle_control_getconf() in tor's control.c.

4) I wonder why we only have getInfoMultiple() given that both GETINFO and GETCONF can have multiple keywords (we use them both with just one keyword right now as far as I can see which does not explain the different treatment of both to me)

I never implemented getConfMultiple, but it should be pretty straightforward if we need it.

Yes, but as I said we don't need getInfoMultiple() either at the moment as far as I can see. Why do we have this additional code then?

Replying to gk:

  // matchResult finds a single-line result for `250-` or a multi-line one for `250+`.

is interesting as you are not mentioning the 250 case although your regex is checking for it. But I think the comment is right here (but not the code) as the 250 case should not make it that far as I guess EndReplyLine does not contain a = (see the 250 Bridge example) and key would therefore be null. That's one of the things I still need to figure out which is a bit tricky as ReplyText is not specified in the spec.

If you activate a bridge in Tor Browser, such as obfs3, then you get:

controlPort << getconf bridge

controlPort >> 250-Bridge=obfs3 [ip] [fp]
250-Bridge=obfs3 [ip] [fp]
250-Bridge=obfs3 [ip] [fp]
250-Bridge=obfs3 [ip] [fp]
250 Bridge=obfs3 [ip] [fp]

Note the final line without a -.

Thanks, I missed that one.

comment:8 in reply to:  7 ; Changed 4 years ago by arthuredelstein

Replying to gk:

Replying to arthuredelstein:

Replying to gk:

Here are some things I thought about while reading the code in tor-control-port.js. I am confused about other things(, too) (especially while looking at the spec + the controller.c file) but these might be more for a second clean-up (after I had more time to sort them out).

1)

  // GETCONF with a single argument returns results that look like
  // results from GETINFO with multiple arguments."

Does not seem to be so:

[11-10 15:33:46] Torbutton INFO: controlPort << getconf bridge


[11-10 15:33:46] Torbutton INFO: controlPort >> 250 Bridge
[11-10 15:33:46] Torbutton INFO: controlPort << getinfo version config-file


[11-10 15:33:46] Torbutton INFO: controlPort >> 250-version=0.2.7.4-rc (git-f55d23e1e66e9b0f)
250-config-file=/path/to/torbrowser/tor-browser/Browser/TorBrowser/Data/Tor/torrc
250 OK

That's a good point. Right now, we're not parsing getconf lines such as 250 Bridge. We'd need to write some extra code for this. I tried to clarify the comment to reflect the current situation.

Hmm... Did you really mean 250[+ ]key=value? It should be something like 250[ -]key=value instead, no? At least that is what I would guess reading handle_control_getconf() in tor's control.c.

Oops, dumb mistake. Thanks for catching it.

4) I wonder why we only have getInfoMultiple() given that both GETINFO and GETCONF can have multiple keywords (we use them both with just one keyword right now as far as I can see which does not explain the different treatment of both to me)

I never implemented getConfMultiple, but it should be pretty straightforward if we need it.

Yes, but as I said we don't need getInfoMultiple() either at the moment as far as I can see. Why do we have this additional code then?

Sorry I missed the point first time around. You're right, we don't need it. I think I wrote it thinking it might be useful in some situation, but now I think that's unlikely. I'm removing it for now.

Here is the new version:
https://github.com/arthuredelstein/torbutton/commit/17568+1

comment:9 in reply to:  6 Changed 4 years ago by cypherpunks

Replying to arthuredelstein:

I was trying to track the input back to Tor's output and stumbled across the 6500-lines control.c... So what I was wondering was:

  • In general, what is the threat expectation here? What has to be considered adversary-controlled input?
  • Is it worth re-implementing the full control protocol parser in JS so that you can verify each reply?
  • Hopefully control.c takes a good defensive parsing approach. Does control.c offer some guarantees about its output so that JS can just rely on it?

This is a good point worth considering. What do you consider to be a full control protocol parser.

So, like I said, I'm not familiar with the codebase and so far only skimmed the spec, please bear with me.

After reading gk's message where he described the flaw that caused the bug in the ticked that spawned this one, my thinking went something like:

Aha, so the are 2 kinds of data being handled here: data that comes from the network (from a remote Tor node; e.g. the name thingy gk mentioned), and data generated by the local Tor node.

The first kind is evil, this is data you must not trust and do your best to validate before attempting to use.

One may get away with trusting the second kind, since the program producing it is already running on your system. If you trust this data, this implies you trust the Tor node you are talking to. Since you trust the local Tor node, just make sure you only connect to *it*.

Now, you get both kinds of data munged into a single protocol: the Tor control protocol. But in fact, the evil data was already parsed by the Tor node (in control.c or even earlier, IIUC).

Hopefully, that parsing is correct, and the result, safe. If it is, then clients of the control protocol, "controllers", actually only get the second kind of data. This is what I meant with "guarantees offered by control.c".

In such case, controllers can get away with being a bit sloppy and parsing with things like, umm, I don't know, regex. <insert trite quote about regex here>

So when I said "full control protocol" I was thinking of not just the "protocol" part of the control protocol, but the data as well (some of which might be evil): a complete, comprehensive parser.

Maybe this is what the current code tries to do already?

The right solution, everyone knows, is to (a) define a complete grammar and then (b) a strict parser for that grammar.

Do we have (a) yet? In "control-spec.txt" I saw several non-terminal undefined (look for "XXX", for example).

Once we have (b), that same parser should be used in all JS code (button, launcher, etc.)

Notice that this must have been done already: there are a few standalone Tor controllers out there (I remember "stem" but I think there are others).

Another possibility is to consider whether the Tor control port should switch to JSON or something similar to simplify code at both ends.

I guess the usefulness of this would be to simply rely on existing, correct, secure parsers of JSON (or whatever else). A library like that would indeed simplify code in both Tor and the controller.

comment:10 in reply to:  8 Changed 4 years ago by gk

Replying to arthuredelstein:

Here is the new version:
https://github.com/arthuredelstein/torbutton/commit/17568+1

Alright, we are getting closer!

"Tor expects any commands to be terminated by CRLF." -> "Controllers should send any commands terminated by CRLF." maybe? (this would reflect the spec better where it is a "SHOULD")

We have 250[-\ ] and 250[- ]. I guess we should decide whether we need to escape the white space or not... ;)

It seems we only need info parsers for ns/id/, ip-to-country/, circuit-status and bridge at the moment. What should we do with the other ones, especially if they need a parsing function which is not used somewhere else anyway (like the config-text case)?

Last edited 4 years ago by gk (previous) (diff)

comment:11 Changed 4 years ago by gk

Status: needs_reviewneeds_revision

Another thing: we don't have getInfoMultiple() anymore, thus, getInfoMultiple : keys => info.getInfoMultiple(socket, keys), can go away as well.

comment:12 Changed 4 years ago by gk

I think the following two RegExes could get cleaned up as well:

new RegExp("^650." + type, "i")
/^650 \S+?\s(.*?)$/m

The latter according to the other clean-up done in your 17568 branch. For the former, according to the spec use-case the reply is "650 STREAM", so nothing case-sensitive should be needed and having a white space after "650" is defined as well.

comment:13 Changed 4 years ago by gk

It seems to me we don't need "m" : data => utils.listMapData(data, []) as this is only available in the votes.

comment:14 in reply to:  13 Changed 4 years ago by gk

Replying to gk:

It seems to me we don't need "m" : data => utils.listMapData(data, []) as this is only available in the votes.

Hmm... this is confusing. What consensus type are we using here? I'd assume a microdescriptor-based one but the log output I get does not convince me, see e.g.

r madiba yG0vPe/ih6Duso1Ih68U41wXJzM Xsg83S1uvP1CQFkzWAhtUFUpkPw 2015-11-17 22:46:34 209.222.8.196 443 80

If we don't do that, why not, given that we have it cached lying around?

comment:15 Changed 4 years ago by arthuredelstein

Another thing I want to do for this ticket is to simulate errors in various parts of the code, and ensure that we have error handling that allows us to log the errors and then gracefully recover. That should help avoid the situation where a single error causes the tor circuit display to die.

comment:16 Changed 3 years ago by arthuredelstein

Here's a new version of the patch. I tried to address all comments, except comment:14 and comment:15, which need more study.

https://github.com/arthuredelstein/torbutton/commit/17568+2

comment:17 Changed 3 years ago by arthuredelstein

Status: needs_revisionneeds_review

comment:18 Changed 3 years ago by gk

Seems I forgot to mention that we don't need "circ" : info.circuitStatusParser either at the moment as it seems. :(

One additional thing:

+  let dataText = message.match(/^650[ \+-]\S+?\s(.*)/m)[1];
+  return controlSocket.addNotificationCallback(new RegExp("^650[ \+-]" + type),

I think taking care of the +-case is fine but I wonder whether we need to do that given that we only watch for STREAM events (currently). If we want to cover the generic case here and not want to bother with the RegEx again in case we add more events to watch for later, then I wonder whether the RegEx is correct at all given e.g. 650+" Severity CRLF Data 650 SP "OK" CRLF (you are checking only for one \s before trying to capture but you get CRLF).

Then there is still the +-case from '(the "?" modifier added to "+" or "*")' in comment:1, no? Could you make an argument for why we (still) need this?

comment:19 in reply to:  18 Changed 3 years ago by gk

Replying to gk:

Then there is still the +-case from '(the "?" modifier added to "+" or "*")' in comment:1, no? Could you make an argument for why we (still) need this?

And for the one after the * in string.match(/^250\+.+?=([\s\S]*?)^\.$/m), as well?

comment:20 Changed 3 years ago by mikeperry

Keywords: TorBrowserTeam201512R added; TorBrowserTeam201511R removed

comment:21 in reply to:  18 ; Changed 3 years ago by arthuredelstein

Replying to gk:

Seems I forgot to mention that we don't need "circ" : info.circuitStatusParser either at the moment as it seems. :(

Commented out for now.

One additional thing:

+  let dataText = message.match(/^650[ \+-]\S+?\s(.*)/m)[1];
+  return controlSocket.addNotificationCallback(new RegExp("^650[ \+-]" + type),

I think taking care of the +-case is fine but I wonder whether we need to do that given that we only watch for STREAM events (currently). If we want to cover the generic case here and not want to bother with the RegEx again in case we add more events to watch for later, then I wonder whether the RegEx is correct at all given e.g. 650+" Severity CRLF Data 650 SP "OK" CRLF (you are checking only for one \s before trying to capture but you get CRLF).

Good point. I have dropped the "650+" and "650-" cases for now, as we aren't using them.

Then there is still the +-case from '(the "?" modifier added to "+" or "*")' in comment:1, no? Could you make an argument for why we (still) need this?

And for the one after the * in string.match(/250\+.+?=([\s\S]*?)\.$/m), as well?

In general, I think it is safer to use lazy rather than greedy quantifiers. At least it avoids errors where we accidentally match a second (or later) instance of something when we intend to match on the first instance.

A couple of blog posts make this argument:

Here's the revised patch, for review:
https://github.com/arthuredelstein/torbutton/commit/17568+3

comment:22 in reply to:  21 ; Changed 3 years ago by gk

Replying to arthuredelstein:

Here's the revised patch, for review:
https://github.com/arthuredelstein/torbutton/commit/17568+3

Do you mind pushing this branch? :)

comment:23 in reply to:  22 Changed 3 years ago by arthuredelstein

Replying to gk:

Replying to arthuredelstein:

Here's the revised patch, for review:
https://github.com/arthuredelstein/torbutton/commit/17568+3

Do you mind pushing this branch? :)

Oops, sorry! Pushed now.

comment:24 Changed 3 years ago by gk

Resolution: fixed
Status: needs_reviewclosed

I applied the fix to master, thanks: commit 548c10907863257c636e9393babe707fe79a4f40 has it.

Note: See TracTickets for help on using tickets.