Opened 5 years ago

Closed 4 years ago

#12464 closed defect (implemented)

When Tor 0.2.6.x is closer to done, profile relays running Tor 0.2.6.x and optimize accordingly

Reported by: nickm Owned by: dgoulet
Priority: Medium Milestone: Tor: 0.2.6.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-relay, performance, 026-triaged-1, 027-triaged-1-out
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:


See #11332 for the ticket where we did this in 0.2.5, for experience and ideas.

Child Tickets

Attachments (1)

tor-perf-026-alpha.png (152.1 KB) - added by dgoulet 4 years ago.


Change History (17)

comment:1 Changed 5 years ago by nickm

Keywords: 026-triaged-1 added

Changed 4 years ago by dgoulet

Attachment: tor-perf-026-alpha.png added

comment:2 Changed 4 years ago by dgoulet

Latest results I have with perf on my 50Mb/s relay (non-exit) running the 0.2.6 alpha (see the attached tor-perf-026-alpha.png).

Taken over 15 minutes of sampling, generating 206MB of events.

comment:3 Changed 4 years ago by nickm

Is that before or after we merged #9262?

comment:4 Changed 4 years ago by dgoulet

Yes, this is at commit b3b840443ddbf126c5a9c25533986e9b9ea33585.

Will wait a bit before updating to the latest alpha because I need my relay to report HS stats until Jan 12th at least.

comment:5 Changed 4 years ago by nickm

Owner: set to dgoulet
Status: new → assigned

comment:6 Changed 4 years ago by dgoulet

Latest profiling at commit 99e915dbfea9b2b985a10368721d9c79b358abbe

+  15.86%  tor  [.] fmonty
+   3.79%  tor  [.] circuitmux_find_map_entry
+   2.81%  tor  [.] connection_bucket_refill
+   2.57%  tor  [.] connection_handle_read_impl
+   1.99%  tor  [.] buf_datalen
+   1.79%  tor  [.] assert_connection_ok
+   1.52%  tor  [.] connection_or_process_cells_from_inbuf

I can see a couple of ways of reducing circuitmux_find_map_entry() CPU time by calling it less. For instance, update_circuit_on_cmux_() calls it 3 times in the same function, querying the same (cmux, circ) pair each time.

Clearly the biggest share of CPU time is crypto. Here is another view, without the procname "tor" filter; the first value of 37.04% is kernel work.

+  37.04%  tor  [ctr]                  [k] 0xffffffff8104f45a
+   3.85%  tor  tor                    [.] fmonty
+   2.85%  tor     [.] gcm_ghash_clmul
+   2.82%  tor     [.] aesni_ctr32_encrypt_blocks
+   2.14%  tor     [.] aesni_cbc_sha1_enc_avx
+   2.01%  tor     [.] bn_sqr4x_mont
+   1.62%  tor           [.] _int_malloc
+   1.60%  tor     [.] sha1_block_data_order_avx
+   1.04%  tor           [.] __memcpy_sse2_unaligned
+   0.95%  tor  [.] event_base_loop
+   0.95%  tor          [.] 0x0000000000002a87
+   0.92%  tor  tor                    [.] circuitmux_find_map_entry 

comment:7 in reply to:  6 Changed 4 years ago by yawning

Replying to dgoulet:

+  15.86%  tor  [.] fmonty
+   3.85%  tor  tor                    [.] fmonty

Time to dust off #8897. It's also worth noting (maybe this belongs in the other ticket) that ed25519-donna (also by floodyberry) has code for the faster scalar base mult, so we should give serious consideration to ditching the ref10 code and folding it in if it's up to par.

comment:8 Changed 4 years ago by tmpname0901

The reference to __memcpy_sse2_unaligned() above reminds me that data should always be aligned for more efficient reads and writes.

There are tools (Valgrind?) that can report this. For x86(_64), buffers should always be aligned to at least 16 bytes.

comment:9 in reply to:  8 Changed 4 years ago by yawning

Replying to tmpname0901:

The reference to __memcpy_sse2_unaligned() above reminds me that data should always be aligned for more efficient reads and writes.

There are tools (Valgrind?) that can report this. For x86(_64), buffers should always be aligned to at least 16 bytes.

Note: Discussing Intel and compatible processors here.

To be pedantic:

Assembly/Compiler Coding Rule 46. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.

The performance hit for non-vector access comes when an access straddles a cache line boundary (64 bytes), unless the processor is an iPotato86 that should have been retired a long time ago (P1, PMMX, AMD <= K8). Vector access needs to be 16 byte aligned, unless you are using AVX where it doesn't matter (So looking towards the bright future, this will matter less and less).

The memcpy that I imagine dominates memcpy's runtime would be the one in buffers.c:write_to_buf(), invoked from connection_write_to_buf() from the RELAY_COMMAND_DATA handler logic (a quick skim shows that everything else is infrequent, should be reasonably aligned, or the copy is too small to be interesting).

Here the destination will only always be nicely aligned when a new chunk is allocated for the buffer (or if the data in the buffer happened to end on a 16 byte boundary), and the source will never be aligned correctly (cell->payload + RELAY_HEADER_SIZE).

I'm currently of the opinion that before messing with this, faster crypto will gain more mileage for our development time.

comment:10 Changed 4 years ago by nickm

I tend to agree there, though if we do someday decide to optimize memcpy, I'd like to see call graph info to see which memcpy really matters; I bet we might be surprised.

wrt crypto speed improvements, that's an 0.2.7 issue at earliest.

The other functions that appear above in the profile don't matter a lot individually, and they don't look like bottlenecks per se, but I bet there's some opportunity to make them faster. Not for 0.2.6, though.

Two questions:

  1. Let's think about the elephant in ring 0: the 37% of our time that we're spending in the kernel. I don't think this is a regression, but we should check. And at some point we should investigate where that time is going.
  1. Can anybody profile... an exit? a busy hidden service? Some other relevant configuration?

comment:11 Changed 4 years ago by dgoulet

Here are some perf results from a busy hidden service. Everything above 2% is shown here:

+  34.25%  tor  [.] fmonty
-   6.43%  tor  [.] smartlist_remove
   - smartlist_remove
      - 99.84% smartlist_subtract
         - circuit_launch_by_extend_info
            + 78.39% rend_service_introduce
            + 19.19% circuit_launch
            + 2.42% rend_service_relaunch_rendezvous
-   4.85%  tor  [.] node_get_prim_orport
   - node_get_prim_orport
      - 97.86% node_get_addr
         - nodelist_add_node_and_family
            + 71.40% choose_good_middle_server
            + 28.60% populate_live_entry_guards
      + 2.14% nodelist_add_node_and_family
+   4.84%  tor  [.] tor_addr_compare_masked
+   2.95%  tor  [.] siphash24
+   2.78%  tor  [.] swap_conditional
+   2.51%  tor  [.] nodelist_add_node_and_family         
+   2.41%  tor  [.] tor_memeq
+   2.36%  tor  [.] compute_weighted_bandwidths
+   2.34%  tor  [.] crecip

comment:12 Changed 4 years ago by nickm

Interesting but not FWICT a regression. I say we close this. Could I ask you to make a couple of new tickets in 0.2.7 or later for doing less smartlist_remove(), and for moving client-side ntor into workers, and close this when you're done?

comment:13 Changed 4 years ago by nickm

Milestone: Tor: 0.2.6.x-final → Tor: 0.2.7.x-final

comment:14 Changed 4 years ago by nickm

Keywords: 027-triaged-1-out added

Marking triaged-out items from first round of 0.2.7 triage.

comment:15 Changed 4 years ago by nickm

Milestone: Tor: 0.2.7.x-final → Tor: 0.2.???

Make all non-needs_review, non-needs_revision, 027-triaged-1-out items belong to 0.2.???

comment:16 Changed 4 years ago by nickm

Milestone: Tor: 0.2.??? → Tor: 0.2.6.x-final
Resolution: implemented
Status: assigned → closed