Opened 2 years ago

Closed 4 months ago

#22210 closed defect (fixed)

circuit_is_acceptable is slow due to IP and fingerprint parsing

Reported by: teor Owned by: neel
Priority: Medium Milestone: Tor: 0.4.1.x-final
Component: Core Tor/Tor Version: Tor: 0.2.8.12
Severity: Normal Keywords: tor-hs, 032-unreached
Cc: neel Actual Points:
Parent ID: #30221 Points: 1
Reviewer: mikeperry Sponsor:

Description

(I received this bug report from special. Please feel free to contact me (teor) with questions.)

Using Tor version 0.3.1.0-alpha-dev (git-04f1ddaa2a33194b), also applies to at least all 0.2.8.x and above.

Testcase is a ricochet client with a very large number of contacts. Effectively, it’s a client that attempts to build circuits to >100 <500 hidden services almost simultaneously, and retries over time with increasing delays.

This tor client will constantly use 100% of a CPU core for >15 minutes after start. There’s a variety of buggy behavior going on, and a lot of circuits, so it’s not entirely surprising to see some high load. More interesting is where that load actually comes from.

The vast majority of CPU time — 2.1 minutes out of a 3 minute trace — is spent under circuit_is_acceptable, as called from connection_ap_attach_pending (via circuit_get_open_circ_or_launch and circuit_get_best). The breakdown from there is:

2.10 min 100.0% 6.26 s circuit_is_acceptable

58.93 s 46.7% 1.05 s tor_addr_parse
56.34 s 44.7% 1.84 s tor_inet_pton
52.53 s 41.7% 1.68 s tor_inet_aton
50.71 s 40.2% 914.00 ms tor_sscanf
49.74 s 39.4% 13.78 s tor_vsscanf
29.30 s 23.2% 20.11 s scan_unsigned
5.29 s 4.2% 5.29 s TOR_ISDIGIT
1.13 s 0.8% 1.13 s digit_to_num

43.76 s 34.7% 1.30 s connection_ap_can_use_exit
39.51 s 31.3% 574.00 ms node_get_by_nickname
38.93 s 30.9% 764.00 ms node_get_by_hex_id
23.82 s 18.9% 869.00 ms hex_digest_nickname_decode
21.61 s 17.1% 10.95 s base16_decode
10.07 s 7.9% 10.07 s hex_decode_digit_
14.32 s 11.3% 156.00 ms node_get_by_id

16.39 s 13.0% 1.47 s build_state_get_exit_node
14.93 s 11.8% 119.00 ms node_get_by_id

That is super hard to read in this format, but essentially a majority of time is spent on:

  1. Parsing IP addresses from strings out of node descriptors
  2. Looking up nodes by ID
  3. Decoding hex digests to get node IDs

This implies to me that these could be cached in a much more useful way in the node struct, the algorithmic complexity of some of the circuit selection functions is out of control, and I’m also guessing that the circuit attach functions are being called way too often.

Without circuit_is_acceptable being considered, the hottest path is under choose_good_middle_server for node selection, which seems much more reasonable.

The profile quoted here had —enable-expensive-hardening (including ASAN) and slightly older tor, but I’ve rerun with 04f1ddaa and without hardening with the same conclusions, including 2 out of 3 minutes of CPU time being spent under circuit_is_acceptable.

Child Tickets

Change History (13)

comment:1 in reply to:  description ; Changed 2 years ago by arma

Replying to teor:

2.10 min 100.0% 6.26 s circuit_is_acceptable
58.93 s 46.7% 1.05 s tor_addr_parse

  1. Parsing IP addresses from strings out of node descriptors

To be precise, it looks like this one is from

    const int family = tor_addr_parse(&addr, conn->socks_request->address);

which is actually parsing IP addresses from strings in the socks request, not the node descriptor.

For how that code came to be that way, see commit 2b22c0ae from #7582.

comment:2 in reply to:  1 Changed 2 years ago by arma

Replying to arma:

    const int family = tor_addr_parse(&addr, conn->socks_request->address);

It looks like we could pretty easily only do that tor_addr_parse call in the two places below it where we would actually use the value of family -- that is,

  • (a) for a one-hop circuit where we've set chosen_exit_name and we don't have a digest for the exit, which should be rare if it ever happens, and
  • (b) when origin_circ->prepend_policy is set, meaning we've already received a surprising exit policy failure on that circuit and called adjust_exit_policy_from_exitpolicy_failure() on it.

comment:3 in reply to:  description Changed 2 years ago by arma

Replying to teor:

43.76 s 34.7% 1.30 s connection_ap_can_use_exit
39.51 s 31.3% 574.00 ms node_get_by_nickname
38.93 s 30.9% 764.00 ms node_get_by_hex_id

This case happens because they're hidden service descriptor lookups, so they're general (exit) streams yet they have conn->chosen_exit_name set.

That's a surprising amount of time in hex_digest_nickname_decode(), but I guess with all the strlcpy's, etc, that it does, it shouldn't be that surprising.

You're right that it's inefficient here, since the logic is "consider a given circuit, and then reparse what exit node this conn says it's for, and then see if that circuit ends at that exit node", and in theory we don't need to keep reparsing. I would be tempted to try to take a step farther back though and figure out if there's a way to call the broader "how's it going, are any streams ready to be attached to circuits yet" function more sparingly.

comment:4 Changed 2 years ago by teor

I added a hs-intro chutney network in #22599 that might be a test for this.
It currently doesn't verify (see #22600), so it might be a chutney test for this issue.

comment:5 Changed 2 years ago by nickm

Keywords: 032-unreached added
Milestone: Tor: 0.3.2.x-finalTor: unspecified

Mark a large number of tickets that I do not think we will do for 0.3.2.

comment:6 Changed 22 months ago by teor

Parent ID: #15251

This probably blocks #15251

comment:7 Changed 5 months ago by asn

Parent ID: #15251#30221

comment:8 Changed 5 months ago by neel

Cc: neel added
Owner: set to neel
Status: newassigned

comment:9 Changed 5 months ago by neel

Status: assignedneeds_review

PR is here: https://github.com/torproject/tor/pull/1004

The only changes I have done is calling tor_addr_parse() only where it is needed as arma said in Comment 2.

Keep in mind that AppVeyor failed, but it is not a build error as much as it is that AppVeyor 64-bit was unable to fetch the Git repo:

Build started
git clone -q --depth=50 https://github.com/torproject/tor.git C:\projects\tor
fatal: unable to access 'https://github.com/torproject/tor.git/': Failed to connect to github.com port 443: Timed out
Command exited with code 128
C:\Python27\python.exe %APPVEYOR_BUILD_FOLDER%\scripts\test\appveyor-irc-notify.py irc.oftc.net:6697 tor-ci failure
C:\Python27\python.exe: can't open file 'C:\projects\tor\scripts\test\appveyor-irc-notify.py': [Errno 2] No such file or directory
Command exited with code 2

Source: https://ci.appveyor.com/project/torproject/tor/builds/24367946/job/xu0uiwp0h847t6nu

comment:10 Changed 4 months ago by dgoulet

Reviewer: mikeperry

comment:11 Changed 4 months ago by mikeperry

Status: needs_reviewmerge_ready

Simple code relocation. Looks good to me. Thanks, neel!

comment:12 Changed 4 months ago by nickm

Milestone: Tor: unspecifiedTor: 0.4.1.x-final

comment:13 Changed 4 months ago by nickm

Resolution: fixed
Status: merge_readyclosed

Looks okay to me too. Merging this.

Closing it too, since we have other tickets for profile-driven improvements to onion service performance.

Note: See TracTickets for help on using tickets.