Opened 5 months ago

Closed 5 months ago

Last modified 3 months ago

#24826 closed defect (fixed)

consensus diffs stall Tor Browser launch for at least 20s or break it entirely if compiled with --enable-expensive-hardening

Reported by: gk Owned by: nickm
Priority: Medium Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: review-group-30
Cc: ahf, mcs, brade Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by gk)

In #22341 we are thinking about picking up LZMA- and Zstandard-support for consensus diffs.

For LZMA-compressed diffs I often encounter boostrap delays of at least 20s like this:

Dec 26 21:01:18.000 [debug] HTTP body from server 'XX.XX.XX.XX:9001' was labeled as LZMA compressed, and it seems to be LZMA compressed.
Dec 26 21:01:38.000 [info] handle_response_fetch_consensus(): Applied consensus diff (size 482897) from server 'XX.XX.XX.XX:9001', resulting in a new consensus document (size 1903167).

But, worse, it might even break Tor Browser bootstrap entirely when blocking a Tor Launcher GETCONF call (throwing an exception as Tor Launcher thinks tor is not working) like so:

Dec 30 23:54:54.000 [debug] HTTP body from server 'XX.XX.XX.XX:9001' was labeled as LZMA compressed, and it seems to be LZMA compressed.
Illegal AddressMatcher: [xpconnect wrapped nsIPrefBranch] -- TypeError: s.split is not a function
Illegal AddressMatcher: [xpconnect wrapped nsIPrefBranch] -- TypeError: s.split is not a function
[12-30 22:54:55] TorLauncher DBUG: readTorSettings ----------------------------------------------
[12-30 22:54:55] TorLauncher DBUG: Sending Tor command: GETCONF UseBridges
[12-30 22:55:10] TorLauncher NOTE: Exception on control port [Exception... "Component returned failure code: 0x804b000e (NS_ERROR_NET_TIMEOUT) [nsIBinaryInputStream.readBytes]"  nsresult: "0x804b000e (NS_ERROR_NET_TIMEOUT)"  location: "JS frame :: jar:file:///home/thomas/Arbeit/TBB/tor-browser_en-US/Browser/TorBrowser/Data/Browser/profile.default/extensions/tor-launcher@torproject.org.xpi!/components/tl-protocol.js :: TorProtocolService.prototype._torReadLine :: line 920"  data: no]

Child Tickets

Change History (30)

comment:1 Changed 5 months ago by dgoulet

Milestone: Tor: 0.3.3.x-final

I've observed such load on my relays. Depending on the available CPU, this can take quite a bit of time, I've seen more than a minute on a very loaded machine.

comment:2 Changed 5 months ago by cypherpunks

7.0.11 sometimes gives

[01-09 05:25:15] Torbutton NOTE: Exception on control port [Exception... "Component returned failure code: 0x804b000e (NS_ERROR_NET_TIMEOUT) [nsIBinaryInputStream.readBytes]"  nsresult: "0x804b000e (NS_ERROR_NET_TIMEOUT)"  location: "JS frame :: chrome://torbutton/content/torbutton.js :: torbutton_socket_readline :: line 824"  data: no]

on a slow Windows machine (didn't you forget to set the compression thread's priority below normal?).

comment:3 in reply to:  2 ; Changed 5 months ago by gk

Replying to cypherpunks:

7.0.11 sometimes gives

[01-09 05:25:15] Torbutton NOTE: Exception on control port [Exception... "Component returned failure code: 0x804b000e (NS_ERROR_NET_TIMEOUT) [nsIBinaryInputStream.readBytes]"  nsresult: "0x804b000e (NS_ERROR_NET_TIMEOUT)"  location: "JS frame :: chrome://torbutton/content/torbutton.js :: torbutton_socket_readline :: line 824"  data: no]

on a slow Windows machine (didn't you forget to set the compression thread's priority below normal?).

That's a different issue. There is no released Tor Browser that supports LZMA compression at the moment.

Last edited 5 months ago by gk (previous) (diff)

comment:4 in reply to:  3 Changed 5 months ago by cypherpunks

Replying to gk:

Replying to cypherpunks:

7.0.11 sometimes gives

[01-09 05:25:15] Torbutton NOTE: Exception on control port [Exception... "Component returned failure code: 0x804b000e (NS_ERROR_NET_TIMEOUT) [nsIBinaryInputStream.readBytes]"  nsresult: "0x804b000e (NS_ERROR_NET_TIMEOUT)"  location: "JS frame :: chrome://torbutton/content/torbutton.js :: torbutton_socket_readline :: line 824"  data: no]

on a slow Windows machine (didn't you forget to set the compression thread's priority below normal?).

That's a different issue. There is no released Tor Browser that supports LZMA compression at the moment.

Not so different as you think. Without LZMA it stalls for ~15 sec there and breaks Torbutton when more. And it's not funny that Tor periodically causes glitches in all other programs (check with 1 CPU VM). BTW, shouldn't Tor respond to Tor Launcher even when it's parsing diffs in the background? (Is timeout set to 15 sec as your logs show?)

comment:5 in reply to:  1 Changed 5 months ago by arma

Replying to dgoulet:

I've observed such load on my relays. Depending on the available CPU, this can take quite a bit of time, I've seen more than a minute on a very loaded machine.

Wuh!

Was this in *generating* the diff, or in *applying* it?

comment:6 Changed 5 months ago by arma

(it seems in gk's case, it was *applying* it, which would be very sad indeed.)

comment:7 Changed 5 months ago by mcs

For reference:

  • The control port read timeout that is used by Tor Launcher is 15 seconds (used when reading Tor configuration via the control port, to issue commands during startup such as TAKEOWNERSHIP, and so on).
  • The control port read timeout that is used by Torbutton is 2 seconds (used when sending control port commands for the local SOCKS port check during browser startup and for new identity).

Should these timeouts be increased? What value should we use?

As someone pointed out, in an ideal world tor would be responsive to control port commands at all times.

comment:8 Changed 5 months ago by gk

Description: modified (diff)
Summary: LZMA-compressed consensus diffs stall Tor Browser launch for at least 20s or break it entirelyLZMA- and Zstandard compressed consensus diffs stall Tor Browser launch for at least 20s or break it entirely

Turns out this is affecting Zstandard-compressed diffs as well:

Jan 10 07:55:33.000 [debug] HTTP body from server 'XX.XX.XX.XX:9001' was labeled as Zstandard compressed, and it seems to be Zstandard compressed.
Jan 10 07:55:53.000 [info] handle_response_fetch_consensus(): Applied consensus diff (size 522839) from server 'XX.XX.XX.XX:9001', resulting in a new consensus document (size 1866898).

This time it hit the 2 second Torbutton timeout breaking the local SOCKS port check.

comment:9 Changed 5 months ago by gk

Parent ID: #22341

comment:10 Changed 5 months ago by gk

Look what I got:

Jan 10 14:48:55.000 [debug] HTTP body from server 'XX.XX.XX.XX:9001' was labeled as gzipped, and it seems to be gzipped.
Jan 10 14:49:12.000 [info] handle_response_fetch_consensus(): Applied consensus diff (size 179832) from server 'XX.XX.XX.XX:9001', resulting in a new consensus document (size 1885627).

This was on an idle laptop which is usually pretty fast with 8GB RAM. So, I wonder whether something else changed that is causing these delays...

comment:11 Changed 5 months ago by nickm

So if it's happening with gzip, maybe it's something in the consensus diff application. But ... 17 seconds seems ridiculously long.

I'm going to try this myself, with some extra log messages.

comment:12 Changed 5 months ago by nickm

If you'd like to try too, I've put the extra log messages in a branch extra_consdiff_logs against maint-0.3.2. They're mostly at debug level, and it's mostly the timestamps that should be interesting.

comment:13 Changed 5 months ago by nickm

Hm. On 0.3.2, I failed to see this bug happen. The timing was:

Jan 10 09:25:35.000 [debug] dir_client_decompress_response_body(): Successfully 
decompressed body using LZMA compressed
Jan 10 09:25:35.000 [debug] consensus_diff_apply(): About to apply consensus dif
f.
Jan 10 09:25:35.000 [debug] consensus_diff_apply(): Computed digest-as-signed fo
r applying consensus diff.
Jan 10 09:25:35.000 [debug] consdiff_apply_diff(): Extracted digests for consens
us diff.
Jan 10 09:25:35.000 [debug] consdiff_apply_diff(): Digests match; applying diff.
Jan 10 09:25:35.000 [debug] consdiff_apply_diff(): Diff applied.
Jan 10 09:25:35.000 [debug] consdiff_apply_diff(): Computing digest of resulting
 consensus.
Jan 10 09:25:35.000 [debug] consdiff_apply_diff(): Digests are equal; diff appli
cation succeeded.
Jan 10 09:25:35.000 [info] handle_response_fetch_consensus(): Applied consensus 
diff (size 97517) from server '154.35.175.225:443', resulting in a new consensus
 document (size 1885985).

I'll try again with the same patch on 0.3.3.

comment:14 Changed 5 months ago by nickm

With that patch on master, I see:

Jan 10 10:17:51.000 [debug] HTTP body from server '154.35.175.225:443' was label
ed as LZMA compressed, and it seems to be LZMA compressed.
Jan 10 10:17:51.000 [debug] dir_client_decompress_response_body(): Successfully 
decompressed body using LZMA compressed
Jan 10 10:17:51.000 [debug] consensus_diff_apply(): About to apply consensus dif
f.
Jan 10 10:17:51.000 [debug] consensus_diff_apply(): Computed digest-as-signed fo
r applying consensus diff.
Jan 10 10:17:51.000 [debug] consdiff_apply_diff(): Extracted digests for consens
us diff.
Jan 10 10:17:51.000 [debug] consdiff_apply_diff(): Digests match; applying diff.
Jan 10 10:17:51.000 [debug] consdiff_apply_diff(): Diff applied.
Jan 10 10:17:51.000 [debug] consdiff_apply_diff(): Computing digest of resulting
 consensus.
Jan 10 10:17:51.000 [debug] consdiff_apply_diff(): Digests are equal; diff appli
cation succeeded.
Jan 10 10:17:51.000 [info] handle_response_fetch_consensus(): Applied consensus 
diff (size 101910) from server '154.35.175.225:443', resulting in a new consensu
s document (size 1884843).

I wonder what could be going wrong here.

comment:15 Changed 5 months ago by nickm

(That is to say, I did not see the bug happening in either case.)

comment:16 Changed 5 months ago by gk

Okay, here we are:

Jan 10 20:20:31.000 [debug] HTTP body from server 'XX.XX.XX.XX:9001' was labeled as LZMA compressed, and it seems to be LZMA compressed.
Jan 10 20:20:31.000 [debug] dir_client_decompress_response_body(): Successfully decompressed body using LZMA compressed
Jan 10 20:20:31.000 [debug] consensus_diff_apply(): About to apply consensus diff.
Jan 10 20:20:31.000 [debug] consensus_diff_apply(): Computed digest-as-signed for applying consensus diff.
Jan 10 20:20:48.000 [debug] consdiff_apply_diff(): Extracted digests for consensus diff.
Jan 10 20:20:48.000 [debug] consdiff_apply_diff(): Digests match; applying diff.
Jan 10 20:20:48.000 [debug] consdiff_apply_diff(): Diff applied.
Jan 10 20:20:48.000 [debug] consdiff_apply_diff(): Computing digest of resulting consensus.
Jan 10 20:20:48.000 [debug] consdiff_apply_diff(): Digests are equal; diff application succeeded.
Jan 10 20:20:48.000 [info] handle_response_fetch_consensus(): Applied consensus diff (size 235301) from server 'XX.XX.XX.XX:9001', resulting in a new consensus document (size 1897672).

I compile tor with --enable-fragile-hardening but it seems to me that does not explain the 17 seconds for extracting digests for consensus diff. But I might be wrong.

comment:17 Changed 5 months ago by nickm

Oh! Actually, fragile-hardening does make some code extremely slow, and the sha3 code might be one such code. Do you get the same result without it?

comment:18 in reply to:  17 Changed 5 months ago by gk

Replying to nickm:

Oh! Actually, fragile-hardening does make some code extremely slow, and the sha3 code might be one such code. Do you get the same result without it?

Interesting. So far it looks good. I'll keep testing a bit more, though, to be on the safe side.

comment:19 Changed 5 months ago by nickm

Okay. If this turns out to be the case, we should add the sha3 module to the list of stuff we build without the expensive sanitizer flags.

comment:20 Changed 5 months ago by gk

Summary: LZMA- and Zstandard compressed consensus diffs stall Tor Browser launch for at least 20s or break it entirelyLZMA- and Zstandard compressed consensus diffs stall Tor Browser launch for at least 20s or break it entirely if compiled with --enable-expensive-hardening

After some testing it does indeed seem to be the case that --enable-expensive-hardening is causing the issue here.

comment:21 Changed 5 months ago by nickm

Okay, the plot grows thicker: It seems that we already are compiling the sha3 module without the hardening flags. So maybe the problem here isn't the hardening in sha3, but somewhere else instead.

It could be either in consensus_split_lines, or in consdiff_get_digests. I'll try them both out. I suspect it might be split_lines, though.

comment:22 Changed 5 months ago by nickm

okay, I see:

Jan 16 10:28:52.000 [notice] About to apply consensus diff.
Jan 16 10:28:52.000 [notice] Computed digest-as-signed for applying consensus diff.
Jan 16 10:29:04.000 [notice] Split consensus into lines.
Jan 16 10:29:04.000 [notice] Split diff into lines.
Jan 16 10:29:04.000 [notice] Extracted digests for consensus diff.
Jan 16 10:29:04.000 [notice] Digests match; applying diff.

So it seems that consdiff_split_lines is the culprit here.

comment:23 Changed 5 months ago by nickm

Status: newneeds_review

In my testing, it helped to change the strchr() in that function to use memchr() instead. I don't know why, but maybe it will help you too. Branch is bug24826_031.

comment:24 Changed 5 months ago by nickm

Keywords: review-group-30 added

comment:25 Changed 5 months ago by nickm

Owner: set to nickm
Status: needs_reviewassigned

setting owner

comment:26 Changed 5 months ago by nickm

Status: assignedneeds_review

comment:27 in reply to:  23 Changed 5 months ago by gk

Replying to nickm:

In my testing, it helped to change the strchr() in that function to use memchr() instead. I don't know why, but maybe it will help you too. Branch is bug24826_031.

Seems to work for me, thanks.

comment:28 Changed 5 months ago by nickm

Summary: LZMA- and Zstandard compressed consensus diffs stall Tor Browser launch for at least 20s or break it entirely if compiled with --enable-expensive-hardeningconsensus diffs stall Tor Browser launch for at least 20s or break it entirely if compiled with --enable-expensive-hardening

comment:29 Changed 5 months ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Okay! I'm merging this to 0.3.1 and forward.

comment:30 Changed 3 months ago by cypherpunks

Parent ID: #22341

Unfortunately, it is still happening on Windows XP.

Note: See TracTickets for help on using tickets.