Opened 7 weeks ago

Closed 5 weeks ago

#24826 closed defect (fixed)

consensus diffs stall Tor Browser launch for at least 20s or break it entirely if compiled with --enable-expensive-hardening

Reported by: gk Owned by: nickm
Priority: Medium Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: review-group-30
Cc: ahf, mcs, brade Actual Points:
Parent ID: #22341 Points:
Reviewer: Sponsor:

Description (last modified by gk)

In #22341 we are thinking about picking up LZMA- and Zstandard-support for consensus diffs.

For LZMA-compressed diffs I often encounter boostrap delays of at least 20s like this:

Dec 26 21:01:18.000 [debug] HTTP body from server 'XX.XX.XX.XX:9001' was labeled as LZMA compressed, and it seems to be LZMA compressed.
Dec 26 21:01:38.000 [info] handle_response_fetch_consensus(): Applied consensus diff (size 482897) from server 'XX.XX.XX.XX:9001', resulting in a new consensus document (size 1903167).

But, worse, it might even break Tor Browser bootstrap entirely when blocking a Tor Launcher GETCONF call (throwing an exception as Tor Launcher thinks tor is not working) like so:

Dec 30 23:54:54.000 [debug] HTTP body from server 'XX.XX.XX.XX:9001' was labeled as LZMA compressed, and it seems to be LZMA compressed.
Illegal AddressMatcher: [xpconnect wrapped nsIPrefBranch] -- TypeError: s.split is not a function
Illegal AddressMatcher: [xpconnect wrapped nsIPrefBranch] -- TypeError: s.split is not a function
[12-30 22:54:55] TorLauncher DBUG: readTorSettings ----------------------------------------------
[12-30 22:54:55] TorLauncher DBUG: Sending Tor command: GETCONF UseBridges
[12-30 22:55:10] TorLauncher NOTE: Exception on control port [Exception... "Component returned failure code: 0x804b000e (NS_ERROR_NET_TIMEOUT) [nsIBinaryInputStream.readBytes]"  nsresult: "0x804b000e (NS_ERROR_NET_TIMEOUT)"  location: "JS frame :: jar:file:///home/thomas/Arbeit/TBB/tor-browser_en-US/Browser/TorBrowser/Data/Browser/profile.default/extensions/tor-launcher@torproject.org.xpi!/components/tl-protocol.js :: TorProtocolService.prototype._torReadLine :: line 920"  data: no]

Child Tickets

Change History (29)

comment:1 Changed 6 weeks ago by dgoulet

Milestone: Tor: 0.3.3.x-final

I've observed such load on my relays. Depending on the available CPU, this can take quite a bit of time, I've seen more than a minute on a very loaded machine.

comment:2 Changed 6 weeks ago by cypherpunks

7.0.11 sometimes gives

[01-09 05:25:15] Torbutton NOTE: Exception on control port [Exception... "Component returned failure code: 0x804b000e (NS_ERROR_NET_TIMEOUT) [nsIBinaryInputStream.readBytes]"  nsresult: "0x804b000e (NS_ERROR_NET_TIMEOUT)"  location: "JS frame :: chrome://torbutton/content/torbutton.js :: torbutton_socket_readline :: line 824"  data: no]

on a slow Windows machine (didn't you forget to set the compression thread's priority below normal?).

comment:3 in reply to:  2 ; Changed 6 weeks ago by gk

Replying to cypherpunks:

7.0.11 sometimes gives

[01-09 05:25:15] Torbutton NOTE: Exception on control port [Exception... "Component returned failure code: 0x804b000e (NS_ERROR_NET_TIMEOUT) [nsIBinaryInputStream.readBytes]"  nsresult: "0x804b000e (NS_ERROR_NET_TIMEOUT)"  location: "JS frame :: chrome://torbutton/content/torbutton.js :: torbutton_socket_readline :: line 824"  data: no]

on a slow Windows machine (didn't you forget to set the compression thread's priority below normal?).

That's a different issue. There is no released Tor Browser that supports LZMA compression at the moment.

Last edited 6 weeks ago by gk (previous) (diff)

comment:4 in reply to:  3 Changed 6 weeks ago by cypherpunks

Replying to gk:

Replying to cypherpunks:

7.0.11 sometimes gives

[01-09 05:25:15] Torbutton NOTE: Exception on control port [Exception... "Component returned failure code: 0x804b000e (NS_ERROR_NET_TIMEOUT) [nsIBinaryInputStream.readBytes]"  nsresult: "0x804b000e (NS_ERROR_NET_TIMEOUT)"  location: "JS frame :: chrome://torbutton/content/torbutton.js :: torbutton_socket_readline :: line 824"  data: no]

on a slow Windows machine (didn't you forget to set the compression thread's priority below normal?).

That's a different issue. There is no released Tor Browser that supports LZMA compression at the moment.

Not so different as you think. Without LZMA it stalls for ~15 sec there and breaks Torbutton when more. And it's not funny that Tor periodically causes glitches in all other programs (check with 1 CPU VM). BTW, shouldn't Tor respond to Tor Launcher even when it's parsing diffs in the background? (Is timeout set to 15 sec as your logs show?)

comment:5 in reply to:  1 Changed 6 weeks ago by arma

Replying to dgoulet:

I've observed such load on my relays. Depending on the available CPU, this can take quite a bit of time, I've seen more than a minute on a very loaded machine.

Wuh!

Was this in *generating* the diff, or in *applying* it?

comment:6 Changed 6 weeks ago by arma

(it seems in gk's case, it was *applying* it, which would be very sad indeed.)

comment:7 Changed 6 weeks ago by mcs

For reference:

  • The control port read timeout that is used by Tor Launcher is 15 seconds (used when reading Tor configuration via the control port, to issue commands during startup such as TAKEOWNERSHIP, and so on).
  • The control port read timeout that is used by Torbutton is 2 seconds (used when sending control port commands for the local SOCKS port check during browser startup and for new identity).

Should these timeouts be increased? What value should we use?

As someone pointed out, in an ideal world tor would be responsive to control port commands at all times.

comment:8 Changed 6 weeks ago by gk

Description: modified (diff)
Summary: LZMA-compressed consensus diffs stall Tor Browser launch for at least 20s or break it entirelyLZMA- and Zstandard compressed consensus diffs stall Tor Browser launch for at least 20s or break it entirely

Turns out this is affecting Zstandard-compressed diffs as well:

Jan 10 07:55:33.000 [debug] HTTP body from server 'XX.XX.XX.XX:9001' was labeled as Zstandard compressed, and it seems to be Zstandard compressed.
Jan 10 07:55:53.000 [info] handle_response_fetch_consensus(): Applied consensus diff (size 522839) from server 'XX.XX.XX.XX:9001', resulting in a new consensus document (size 1866898).

This time it hit the 2 second Torbutton timeout breaking the local SOCKS port check.

comment:9 Changed 6 weeks ago by gk

Parent ID: #22341

comment:10 Changed 6 weeks ago by gk

Look what I got:

Jan 10 14:48:55.000 [debug] HTTP body from server 'XX.XX.XX.XX:9001' was labeled as gzipped, and it seems to be gzipped.
Jan 10 14:49:12.000 [info] handle_response_fetch_consensus(): Applied consensus diff (size 179832) from server 'XX.XX.XX.XX:9001', resulting in a new consensus document (size 1885627).

This was on an idle laptop which is usually pretty fast with 8GB RAM. So, I wonder whether something else changed that is causing these delays...

comment:11 Changed 6 weeks ago by nickm

So if it's happening with gzip, maybe it's something in the consensus diff application. But ... 17 seconds seems ridiculously long.

I'm going to try this myself, with some extra log messages.

comment:12 Changed 6 weeks ago by nickm

If you'd like to try too, I've put the extra log messages in a branch extra_consdiff_logs against maint-0.3.2. They're mostly at debug level, and it's mostly the timestamps that should be interesting.

comment:13 Changed 6 weeks ago by nickm

Hm. On 0.3.2, I failed to see this bug happen. The timing was:

Jan 10 09:25:35.000 [debug] dir_client_decompress_response_body(): Successfully 
decompressed body using LZMA compressed
Jan 10 09:25:35.000 [debug] consensus_diff_apply(): About to apply consensus dif
f.
Jan 10 09:25:35.000 [debug] consensus_diff_apply(): Computed digest-as-signed fo
r applying consensus diff.
Jan 10 09:25:35.000 [debug] consdiff_apply_diff(): Extracted digests for consens
us diff.
Jan 10 09:25:35.000 [debug] consdiff_apply_diff(): Digests match; applying diff.
Jan 10 09:25:35.000 [debug] consdiff_apply_diff(): Diff applied.
Jan 10 09:25:35.000 [debug] consdiff_apply_diff(): Computing digest of resulting
 consensus.
Jan 10 09:25:35.000 [debug] consdiff_apply_diff(): Digests are equal; diff appli
cation succeeded.
Jan 10 09:25:35.000 [info] handle_response_fetch_consensus(): Applied consensus 
diff (size 97517) from server '154.35.175.225:443', resulting in a new consensus
 document (size 1885985).

I'll try again with the same patch on 0.3.3.

comment:14 Changed 6 weeks ago by nickm

With that patch on master, I see:

Jan 10 10:17:51.000 [debug] HTTP body from server '154.35.175.225:443' was label
ed as LZMA compressed, and it seems to be LZMA compressed.
Jan 10 10:17:51.000 [debug] dir_client_decompress_response_body(): Successfully 
decompressed body using LZMA compressed
Jan 10 10:17:51.000 [debug] consensus_diff_apply(): About to apply consensus dif
f.
Jan 10 10:17:51.000 [debug] consensus_diff_apply(): Computed digest-as-signed fo
r applying consensus diff.
Jan 10 10:17:51.000 [debug] consdiff_apply_diff(): Extracted digests for consens
us diff.
Jan 10 10:17:51.000 [debug] consdiff_apply_diff(): Digests match; applying diff.
Jan 10 10:17:51.000 [debug] consdiff_apply_diff(): Diff applied.
Jan 10 10:17:51.000 [debug] consdiff_apply_diff(): Computing digest of resulting
 consensus.
Jan 10 10:17:51.000 [debug] consdiff_apply_diff(): Digests are equal; diff appli
cation succeeded.
Jan 10 10:17:51.000 [info] handle_response_fetch_consensus(): Applied consensus 
diff (size 101910) from server '154.35.175.225:443', resulting in a new consensu
s document (size 1884843).

I wonder what could be going wrong here.

comment:15 Changed 6 weeks ago by nickm

(That is to say, I did not see the bug happening in either case.)

comment:16 Changed 6 weeks ago by gk

Okay, here we are:

Jan 10 20:20:31.000 [debug] HTTP body from server 'XX.XX.XX.XX:9001' was labeled as LZMA compressed, and it seems to be LZMA compressed.
Jan 10 20:20:31.000 [debug] dir_client_decompress_response_body(): Successfully decompressed body using LZMA compressed
Jan 10 20:20:31.000 [debug] consensus_diff_apply(): About to apply consensus diff.
Jan 10 20:20:31.000 [debug] consensus_diff_apply(): Computed digest-as-signed for applying consensus diff.
Jan 10 20:20:48.000 [debug] consdiff_apply_diff(): Extracted digests for consensus diff.
Jan 10 20:20:48.000 [debug] consdiff_apply_diff(): Digests match; applying diff.
Jan 10 20:20:48.000 [debug] consdiff_apply_diff(): Diff applied.
Jan 10 20:20:48.000 [debug] consdiff_apply_diff(): Computing digest of resulting consensus.
Jan 10 20:20:48.000 [debug] consdiff_apply_diff(): Digests are equal; diff application succeeded.
Jan 10 20:20:48.000 [info] handle_response_fetch_consensus(): Applied consensus diff (size 235301) from server 'XX.XX.XX.XX:9001', resulting in a new consensus document (size 1897672).

I compile tor with --enable-fragile-hardening but it seems to me that does not explain the 17 seconds for extracting digests for consensus diff. But I might be wrong.

comment:17 Changed 6 weeks ago by nickm

Oh! Actually, fragile-hardening does make some code extremely slow, and the sha3 code might be one such code. Do you get the same result without it?

comment:18 in reply to:  17 Changed 6 weeks ago by gk

Replying to nickm:

Oh! Actually, fragile-hardening does make some code extremely slow, and the sha3 code might be one such code. Do you get the same result without it?

Interesting. So far it looks good. I'll keep testing a bit more, though, to be on the safe side.

comment:19 Changed 6 weeks ago by nickm

Okay. If this turns out to be the case, we should add the sha3 module to the list of stuff we build without the expensive sanitizer flags.

comment:20 Changed 6 weeks ago by gk

Summary: LZMA- and Zstandard compressed consensus diffs stall Tor Browser launch for at least 20s or break it entirelyLZMA- and Zstandard compressed consensus diffs stall Tor Browser launch for at least 20s or break it entirely if compiled with --enable-expensive-hardening

After some testing it does indeed seem to be the case that --enable-expensive-hardening is causing the issue here.

comment:21 Changed 5 weeks ago by nickm

Okay, the plot grows thicker: It seems that we already are compiling the sha3 module without the hardening flags. So maybe the problem here isn't the hardening in sha3, but somewhere else instead.

It could be either in consensus_split_lines, or in consdiff_get_digests. I'll try them both out. I suspect it might be split_lines, though.

comment:22 Changed 5 weeks ago by nickm

okay, I see:

Jan 16 10:28:52.000 [notice] About to apply consensus diff.
Jan 16 10:28:52.000 [notice] Computed digest-as-signed for applying consensus diff.
Jan 16 10:29:04.000 [notice] Split consensus into lines.
Jan 16 10:29:04.000 [notice] Split diff into lines.
Jan 16 10:29:04.000 [notice] Extracted digests for consensus diff.
Jan 16 10:29:04.000 [notice] Digests match; applying diff.

So it seems that consdiff_split_lines is the culprit here.

comment:23 Changed 5 weeks ago by nickm

Status: newneeds_review

In my testing, it helped to change the strchr() in that function to use memchr() instead. I don't know why, but maybe it will help you too. Branch is bug24826_031.

comment:24 Changed 5 weeks ago by nickm

Keywords: review-group-30 added

comment:25 Changed 5 weeks ago by nickm

Owner: set to nickm
Status: needs_reviewassigned

setting owner

comment:26 Changed 5 weeks ago by nickm

Status: assignedneeds_review

comment:27 in reply to:  23 Changed 5 weeks ago by gk

Replying to nickm:

In my testing, it helped to change the strchr() in that function to use memchr() instead. I don't know why, but maybe it will help you too. Branch is bug24826_031.

Seems to work for me, thanks.

comment:28 Changed 5 weeks ago by nickm

Summary: LZMA- and Zstandard compressed consensus diffs stall Tor Browser launch for at least 20s or break it entirely if compiled with --enable-expensive-hardeningconsensus diffs stall Tor Browser launch for at least 20s or break it entirely if compiled with --enable-expensive-hardening

comment:29 Changed 5 weeks ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Okay! I'm merging this to 0.3.1 and forward.

Note: See TracTickets for help on using tickets.