Opened 9 months ago

Last modified 6 weeks ago

#25461 assigned defect

main event-loop spins consuming 100% of a CPU core running scheduler_set_channel_state

Reported by: Dhalgren
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.2.9.14
Severity: Normal
Keywords: performance, 034-roadmap-subtask, 034-triage-20180328, 034-included-20180328, 035-removed
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Lately I have observed my exit hitting 100% CPU on the main event-loop thread, sometimes continuously, sometimes cyclically. Captured a full-debug log of a recent cyclical event where CPU started at 30% and rose to 100%, covering about one cycle. Chopped the 1 GB log into eight slices and took a simple function call-count histogram of each slice (a rough sketch of the extraction step follows the excerpt below). What's notable is not an increase of calls during saturation, but a reduction in several that seem to relate to connection close events (conn_close_if_marked, flush_chunk). The left column is for the first slice, where CPU was 30%; the right column is for the fourth slice, where CPU was 100%. Functions with fewer than 1000 calls are not included below, but complete histograms are attached. Wrote about this on tor-relays:

https://lists.torproject.org/pipermail/tor-relays/2018-March/014730.html

This might be an attack of some kind, or perhaps a misbehavior related to the KIST scheduler.

append_cell_to_circuit_queue 6787                 append_cell_to_circuit_queue 7280
channel_flush_from_first_active_circuit 6781      channel_flush_from_first_active_circuit 7190
channel_process_cell 11904                        channel_process_cell 11813
channel_write_packed_cell 120301                  channel_write_packed_cell 126330
channel_write_to_kernel 8588                      channel_write_to_kernel 10048
circuit_consider_stop_edge_reading 146965         circuit_consider_stop_edge_reading 152665
circuit_get_by_circid_channel_impl 14128          circuit_get_by_circid_channel_impl 13468
circuit_receive_relay_cell 11483                  circuit_receive_relay_cell 11341
circuit_resume_edge_reading 1203                  circuit_resume_edge_reading 1231
conn_close_if_marked 39033                        conn_close_if_marked 779
conn_read_callback 14743                          conn_read_callback 15645
conn_write_callback 4531                          conn_write_callback 4447
connection_add_impl 1023                          connection_add_impl 739
connection_bucket_refill_helper 14787             connection_bucket_refill_helper 15842
connection_buf_read_from_socket 16196             connection_buf_read_from_socket 17152
connection_connect 1016                           connection_connect 732
connection_connect_sockaddr 1016                  connection_connect_sockaddr 732
connection_edge_package_raw_inbuf 237303          connection_edge_package_raw_inbuf 255347
connection_edge_process_relay_cell 22219          connection_edge_process_relay_cell 22332
connection_exit_begin_conn 3165                   connection_exit_begin_conn 2315
connection_exit_connect 1050                      connection_exit_connect 772
connection_handle_write_impl 9240                 connection_handle_write_impl 10539
connection_or_process_cells_from_inbuf 20042      connection_or_process_cells_from_inbuf 20448
flush_chunk 38192                                 flush_chunk 12
flush_chunk_tls 22283                             flush_chunk_tls 24061
free_outbuf_info_by_ent 8588                      free_outbuf_info_by_ent 10047
outbuf_table_add 8588                             outbuf_table_add 10014
read_to_chunk 6856                                read_to_chunk 7254
relay_lookup_conn 8459                            relay_lookup_conn 8525
relay_send_command_from_edge_ 119963              relay_send_command_from_edge_ 128738
rep_hist_note_exit_bytes 13913                    rep_hist_note_exit_bytes 14534
scheduler_set_channel_state 126896                scheduler_set_channel_state 133353
update_socket_info 6719                           update_socket_info 7160
update_socket_written 120297                      update_socket_written 126327
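
For reference, a minimal sketch of how per-function call counts can be pulled out of a debug-log slice. This is illustrative only, not the exact tooling used for the attached histograms, and it assumes the usual "[debug] funcname(): ..." format of Tor debug lines.

/* Rough sketch only: count "funcname():" occurrences in a debug-log slice
 * read from stdin.  Assumes the usual "[debug] funcname(): ..." line format;
 * not the exact tooling used for the attached histograms. */
#include <stdio.h>
#include <string.h>

#define MAX_FUNCS 4096

static struct { char name[128]; long count; } table[MAX_FUNCS];
static int n_entries;

static void bump(const char *name)
{
  for (int i = 0; i < n_entries; i++) {
    if (!strcmp(table[i].name, name)) {
      table[i].count++;
      return;
    }
  }
  if (n_entries < MAX_FUNCS) {
    snprintf(table[n_entries].name, sizeof(table[0].name), "%s", name);
    table[n_entries].count = 1;
    n_entries++;
  }
}

int main(void)
{
  char line[4096];
  while (fgets(line, sizeof(line), stdin)) {
    char *p = strstr(line, "[debug] ");   /* only count debug-level lines */
    if (!p)
      continue;
    p += strlen("[debug] ");
    char *paren = strchr(p, '(');         /* function name ends at '(' */
    if (!paren)
      continue;
    *paren = '\0';
    bump(p);
  }
  for (int i = 0; i < n_entries; i++)
    printf("%-45s %ld\n", table[i].name, table[i].count);
  return 0;
}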

Child Tickets

Attachments (8)

logfile.00.hist.txt (3.6 KB) - added by Dhalgren 9 months ago.
function call histogram of 30% nominal cpu interval
logfile.03.hist.txt (3.3 KB) - added by Dhalgren 9 months ago.
function call histogram of 100% cpu interval
logfile.01.hist.txt (3.0 KB) - added by Dhalgren 9 months ago.
function call histogram
logfile.02.hist.txt (2.9 KB) - added by Dhalgren 9 months ago.
function call histogram
logfile.04.hist.txt (3.3 KB) - added by Dhalgren 9 months ago.
function call histogram
logfile.05.hist.txt (3.2 KB) - added by Dhalgren 9 months ago.
function call histogram
logfile.06.hist.txt (3.5 KB) - added by Dhalgren 9 months ago.
function call histogram
logfile.07.hist.txt (3.2 KB) - added by Dhalgren 9 months ago.
function call histogram


Change History (27)

Changed 9 months ago by Dhalgren

Attachment: logfile.00.hist.txt added

function call histogram of 30% nominal cpu interval

Changed 9 months ago by Dhalgren

Attachment: logfile.03.hist.txt added

function call histogram of 100% cpu interval

comment:1 Changed 9 months ago by teor

Keywords: regression? performance added
Milestone: Tor: 0.3.3.x-final

Changed 9 months ago by Dhalgren

Attachment: logfile.01.hist.txt added

function call histogram

Changed 9 months ago by Dhalgren

Attachment: logfile.02.hist.txt added

function call histogram

Changed 9 months ago by Dhalgren

Attachment: logfile.04.hist.txt added

function call histogram

Changed 9 months ago by Dhalgren

Attachment: logfile.05.hist.txt added

function call histogram

Changed 9 months ago by Dhalgren

Attachment: logfile.06.hist.txt added

function call histogram

Changed 9 months ago by Dhalgren

Attachment: logfile.07.hist.txt added

function call histogram

comment:2 Changed 9 months ago by Dhalgren

attached rest of the slice histograms

times in seconds for each are

00 5.697
01 5.615
02 5.625
03 5.556
04 5.567
05 5.574
06 7.137
07 3.666

can provide grep extracts or entire log

Recalling that I saw this on 0.2.9.14, so it's probably not a (recent) regression or a KIST issue.

My guess is that the event loop is failing to block, rather than some particular function hogging the CPU, but it's just a guess; browsing performance through the exit is fine when this happens.

comment:3 Changed 9 months ago by nickm

Keywords: 033-must? added

If this is a regression, I think it's a must-fix.

comment:4 Changed 9 months ago by nickm

Keywords: regression? 033-must? removed
Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final
Parent ID: #25500

... but the last comment indicates that it is probably not a regression.

I think we might be able to look at this as part of #25500, though, since anything that can make the main loop run wild without actually taking any action is something that could maybe affect clients too.

One hypothesis is that we should be marking some connection as blocked, but that we aren't, so we are busy-polling on it.
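
For illustration, here is a minimal standalone libevent sketch of that failure mode (illustrative only, not Tor's actual connection code): a persistent EV_WRITE event left armed on a connection the application treats as blocked, whose callback never writes anything and never removes the event. A healthy TCP socket is almost always writable, so poll returns immediately on every iteration and the loop spins on one core without doing useful work.

/* Illustrative sketch, not Tor code: a persistent write event that is never
 * serviced or removed keeps the event loop from ever blocking. */
#include <event2/event.h>
#include <sys/socket.h>

static void write_cb(evutil_socket_t fd, short what, void *arg)
{
  /* The application considers this connection "blocked", so it writes
   * nothing, but it also never calls event_del().  The socket stays
   * writable, so this fires on every loop iteration: busy-polling. */
  (void)fd; (void)what; (void)arg;
}

int main(void)
{
  struct event_base *base = event_base_new();
  int fds[2];

  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
    return 1;

  /* Persistent write-readiness event on an always-writable socket. */
  struct event *ev = event_new(base, fds[0], EV_WRITE | EV_PERSIST,
                               write_cb, NULL);
  event_add(ev, NULL);

  event_base_dispatch(base);  /* never blocks; one core goes to 100% */

  event_free(ev);
  event_base_free(base);
  return 0;
}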

comment:5 Changed 9 months ago by Dhalgren

A theory randomly popped into my mind as to the possible cause:

I use a Linux tc police rule to limit bandwidth on ingress, and noticed that the CPU saturation problem shows up, at least at times, when average ingress traffic is near the tc police limit but not in excess of it. Packet discards from tc-police probably stall many TCP connections in the middle of app-layer buffer deliveries. Is it possible that libevent or a tor daemon handler spins, rather than blocking, when incomplete buffers are pending assembly?

I did not investigate the code; this is just an idea.

comment:6 Changed 9 months ago by dgoulet

Owner: set to dgoulet
Status: new → accepted

comment:7 Changed 9 months ago by nickm

Keywords: 034-roadmap-subtask added

comment:8 Changed 9 months ago by nickm

Keywords: 034-triage-20180328 added

comment:9 Changed 9 months ago by nickm

Keywords: 034-included-20180328 added

comment:10 Changed 7 months ago by dgoulet

Milestone: Tor: 0.3.4.x-final → Tor: 0.3.5.x-final
Version: Tor: 0.3.3.3-alpha

Can't pull this off before 034 merge window closes. Deferring to 035 but still part of a roadmap item.

comment:11 Changed 6 months ago by arma

See also #24857 and #18614 for other profiling / cpu load tickets. Should we merge them, or pick our favorite and close the others?

comment:12 Changed 6 months ago by Dhalgren

seeing CPU saturation today with 0.3.4.1-alpha

removed all tc police rules with no effect, so not the comment:5 theory

TAP handshakes skyrocketed and perhaps it's all TAP handshake CPU consumption

also nf_conntrack is dropping connections due to lack of table slots, which implies scanning activity

comment:13 in reply to:  12 Changed 6 months ago by teor

Replying to Dhalgren:

seeing CPU saturation today with 0.3.4.1-alpha

removed all tc police rules with no effect, so not the comment:5 theory

TAP handshakes skyrocketed and perhaps it's all TAP handshake CPU consumption

We can adjust this on a network-wide basis using NumNTorsPerTAP, but there's no torrc option for individual relays.
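
For context, here is a rough sketch (not Tor's actual onion-queue code; the names and numbers are made up) of what a ratio parameter like NumNTorsPerTAP controls: how many queued ntor handshakes get processed for each TAP handshake, so that cheap-to-generate TAP requests cannot monopolize the onion-skin CPU.

/* Rough sketch only; not Tor's implementation.  Drains two handshake
 * queues at a configurable "ntors per TAP" ratio within a per-loop budget. */
#include <stdio.h>

struct hs_queue { int pending; };

static void
process_handshakes(struct hs_queue *ntor, struct hs_queue *tap,
                   int ntors_per_tap, int budget)
{
  while (budget > 0 && (ntor->pending > 0 || tap->pending > 0)) {
    /* Prefer ntor: do up to `ntors_per_tap` of them first... */
    for (int i = 0; i < ntors_per_tap && ntor->pending > 0 && budget > 0; i++) {
      ntor->pending--;
      budget--;
    }
    /* ...then at most one TAP handshake before looping around. */
    if (tap->pending > 0 && budget > 0) {
      tap->pending--;
      budget--;
    }
  }
}

int main(void)
{
  struct hs_queue ntor = { .pending = 50 };
  struct hs_queue tap  = { .pending = 500 };  /* e.g. a TAP flood */

  process_handshakes(&ntor, &tap, 10, 100);   /* ratio 10:1, budget 100 */
  printf("left over: ntor=%d tap=%d\n", ntor.pending, tap.pending);
  return 0;
}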

comment:14 Changed 5 months ago by dgoulet

Milestone: Tor: 0.3.5.x-final → Tor: unspecified

comment:15 Changed 5 months ago by dgoulet

Keywords: 035-removed added

comment:16 Changed 4 months ago by teor

Summary: main event-loop spins consuming 100% of a CPU core at times → main event-loop spins consuming 100% of a CPU core running scheduler_set_channel_state
Version: Tor: 0.2.9.14

comment:17 Changed 6 weeks ago by dgoulet

Status: accepted → new

comment:18 Changed 6 weeks ago by dgoulet

Owner: dgoulet deleted
Status: new → assigned

comment:19 Changed 6 weeks ago by dgoulet

Parent ID: #25500