Opened 3 years ago

Last modified 3 months ago

#16585 new defect

Network activity from scheduler_run() creates multiple side channel information leaks

Reported by: starlight
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.2.6.10
Severity: Normal
Keywords: tor-client guard side-channel
Cc: mikeperry
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

This is most easily observed when one or two guards are manually configured and all guards become unreachable because they are either

a) down

b) enforcing circuit DDOS rate limiting

One also observes brief, sharp drop-offs in forwarded traffic when new circuits are being created rapidly but operation is otherwise nominal.

Log message accompanying the problem:

Failed to find node for hop 0 of our path. Discarding this circuit.

Attached Vidalia bandwidth graphs
show one each of (a) and (b) above.

Easy to reproduce.

Child Tickets

Attachments (10)

tor_relay_cliff1.gif (65.4 KB) - added by starlight 3 years ago.
with two guards, one down, one rate-limiting
tor_relay_cliff2.gif (62.9 KB) - added by starlight 3 years ago.
with two guards, both down
tor_relay_cliff3x.gif (29.8 KB) - added by starlight 3 years ago.
latest relay forwarding implosion
tor_relay_cliff4.gif (34.1 KB) - added by starlight 3 years ago.
Vidalia graph of traffic seize-up
cliff4_logfile_scrub.txt (210.2 KB) - added by starlight 3 years ago.
[info] level log of event, [debug] available on request
cliff5_logfile_scrub_A.txt.xz (2.2 MB) - added by starlight 3 years ago.
[debug] level log of event, first half
cliff5_logfile_scrub_B.txt.xz (1.9 MB) - added by starlight 3 years ago.
[debug] level log of event, second half
tor_relay_cliff5.gif (32.1 KB) - added by starlight 3 years ago.
Vidalia graph of traffic seize-up
cliff5_sched_runs.txt (840 bytes) - added by starlight 3 years ago.
corrected histogram of scheduler_run() vs total events per second
cliff5_logfile_intrst.txt (21.0 KB) - added by starlight 3 years ago.
most interesting section of debug log edited for relevance and brevity

Change History (40)

Changed 3 years ago by starlight

Attachment: tor_relay_cliff1.gif added

with two guards, one down, one rate-limiting

Changed 3 years ago by starlight

Attachment: tor_relay_cliff2.gif added

with two guards, both down

comment:1 Changed 3 years ago by starlight

Thread from tor-relays with further detail and background, including some inaccurate speculation:

https://lists.torproject.org/pipermail/tor-relays/2015-July/007369.html

comment:2 Changed 3 years ago by starlight

Perhaps resolving this issue should
occur as part of the logic rework
under way in ticket #12595.

comment:3 Changed 3 years ago by starlight

Have been seeing the problem even where the guard relationship is stable and good, but new circuits are being established. Possibly new TCP channel connections are created due to inactivity time-out of previous channels, and/or possibly it was due to

[Notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.

This seems a serious bug. It might be possible to use this behavior to deanonymize traffic.

The attached graph shows what happened when two

RVAL=$RANDOM; curl --socks5-hostname <socks_host>:<port> -U $RVAL:$RVAL hXXp://<ip_addr>/<file_path> &

commands were started within a few seconds of each other.

Changed 3 years ago by starlight

Attachment: tor_relay_cliff3x.gif added

latest relay forwarding implosion

comment:4 Changed 3 years ago by starlight

Happened again, got this strange message:

No circuits are opened. Relaxed timeout for circuit 4726 (a Measuring circuit timeout 3-hop circuit in state doing handshakes with channel state open) to 60000ms. However, it appears the circuit has timed out anyway. N guards are live.

Hacked at it a bit and got a full debug of an event, though not the one that caused the above message. Uploaded the log with [debug] lines excluded, as the full version is 1/2 GB (10 MB xz); it is available on request. A few errors appear and I'm not sure which might be most related to the relay forwarding seize-up.

Changed 3 years ago by starlight

Attachment: tor_relay_cliff4.gif added

Vidalia graph of traffic seize-up

Changed 3 years ago by starlight

Attachment: cliff4_logfile_scrub.txt added

[info] level log of event, [debug] available on request

Changed 3 years ago by starlight

Attachment: cliff5_logfile_scrub_A.txt.xz added

[debug] level log of event, first half

Changed 3 years ago by starlight

Attachment: cliff5_logfile_scrub_B.txt.xz added

[debug] level log of event, second half

Changed 3 years ago by starlight

Attachment: tor_relay_cliff5.gif added

Vidalia graph of traffic seize-up

comment:5 Changed 3 years ago by starlight

Priority: normal → major
Summary: relay stops forwarding all traffic when client function cannot establish circuits → relay stops forwarding all traffic when client function cannot establish circuits -- scheduler_run() starvation

Nailed it down.

Problem starts at

Aug 14 22:11:15.000 [debug] new_route_len(): Chosen route length 3 (6239/7105 routers suitable).
.
.
.
Aug 14 22:11:16.000 [debug] circuit_increment_failure_count(): n_circuit_failures now 1.

at which time

scheduler_run(): We have a chance to run the scheduler

goes into starvation, and scheduler_run() is not invoked again for 2000 milliseconds.

Here's a histogram where the left column is total logged events per second and the right column is scheduler_run() invocations:

22:11:12  67873 363
22:11:13  78595 658
22:11:14  93149 242
22:11:15  14383   5
22:11:16   1812   0
22:11:17   1200   0
22:11:18   5547   0
22:11:19  18219   4
22:11:20   3729   0
22:11:21   5063   0
22:11:22   1445   0
22:11:23   6735   0
22:11:24     28   0
22:11:25  63976  17
22:11:26  67298  28
22:11:27  77377  38
22:11:28 100585  80
22:11:29  99899  71
22:11:30  77994  72
22:11:31 101059  99
22:11:32 100829  94
22:11:33  79472  84
22:11:34  99978  98
22:11:35  98906  89
22:11:36  81795  77
22:11:37 100629 170
22:11:38  98391 213
22:11:39  79961 223
22:11:40  81213 659
22:11:41  93819 566
22:11:42  82917 744
22:11:43  85495 538
22:11:44  29038 271
22:11:45  71416 259
22:11:46  58348 656
22:11:47  65918 686
22:11:48  78183 568
22:11:49  70243 651
22:11:50  86030 608
22:11:51  67953 744
22:11:52  74292 662
22:11:53  91937 562

So it's fairly obvious that scheduler_run() starvation, and whatever is causing it, is the problem.

FULL [DEBUG] LEVEL LOG IS ATTACHED TO TICKET IN TWO .XZ PARTS

Changed 3 years ago by starlight

Attachment: cliff5_sched_runs.txt added

corrected histogram of scheduler_run() vs total events per second

comment:6 Changed 3 years ago by starlight

Examining the debug log turned up what appears to be the specific issue. The scheduler event loop is still able to accept incoming cells while performing heavy computation for new circuit onion-skins, but no outbound cells are written. The most interesting section of the log (edited for relevance and attached to the ticket) shows extreme delays between bulk calls to channel_write_packed_cell(). The most interesting of the interesting follows inline:

22:11:15 [info] choose_good_exit_server_general(): Found 894 servers that might support 0/0 pending connections.
22:11:15 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as exit
22:11:15 [info] choose_good_exit_server_general(): Chose exit server '$F1BE15429B3CE696D6807F4D4A58B1BFEC45C822~zwiebeltoralf at 5.9.158.75'
22:11:15 [info] circuit_send_next_onion_skin(): First hop: finished sending CREATE cell to '${scrubbed1}~{scrubbed1} at {scrubbed1}'
22:11:15 [info] connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
22:11:15 [debug] scheduler_evt_callback(): Scheduler event callback called
22:11:15 [debug] scheduler_run(): We have a chance to run the scheduler
22:11:15 [debug] channel_write_packed_cell() { COUNT 119 }
22:11:15 [debug] scheduler_run(): Scheduler handled 23 of 23 pending channels, queue size from 102906952 to 102969213, queue heuristic from 32559 to 94820
22:11:15 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7fa2b5089150 to channel 0x7fa2b1fb6590 with global ID 2574011
22:11:16 [info] choose_good_exit_server_general(): Found 894 servers that might support 0/0 pending connections.
22:11:16 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as exit
22:11:16 [info] choose_good_exit_server_general(): Chose exit server '$21ABB9584677EEB54514265F24756AD0B04E0EFA~spfTOR1e1 at 77.109.141.138'
22:11:16 [info] circuit_send_next_onion_skin(): First hop: finished sending CREATE cell to '${scrubbed1}~{scrubbed1} at {scrubbed1}'
22:11:16 [info] connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
22:11:16 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7fa2ae01be50 to channel 0x7fa2b06218c0 with global ID 2619137
22:11:16 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7fa2b146a180 to channel 0x7fa2b60546f0 with global ID 2632321
22:11:16 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7fa2ab1ec0c0 to channel 0x7fa2b77e85e0 with global ID 2517382
22:11:16 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7fa2b658ac10 to channel 0x7fa2ace699e0 with global ID 2538923
22:11:16 [info] connection_exit_connect_dir(): Opening local connection for anonymized directory exit
22:11:16 [info] connection_exit_connect_dir(): Opening local connection for anonymized directory exit
22:11:16 [info] connection_exit_connect_dir(): Opening local connection for anonymized directory exit
22:11:16 [info] choose_good_exit_server_general(): Found 894 servers that might support 0/0 pending connections.
22:11:16 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as exit
22:11:16 [info] choose_good_exit_server_general(): Chose exit server '$E1E922A20AF608728824A620BADC6EFC8CB8C2B8~TorLand1 at 37.130.227.133'
22:11:16 [info] circuit_send_next_onion_skin(): First hop: finished sending CREATE cell to '${scrubbed1}~{scrubbed1} at {scrubbed1}'
22:11:17 [info] choose_good_exit_server_general(): Found 894 servers that might support 0/0 pending connections.
22:11:17 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as exit
22:11:17 [info] choose_good_exit_server_general(): Chose exit server '$22F9FB518CE880081B3101B70B9D85D5BD4D7967~BabylonNetwork06 at 185.62.188.14'
22:11:17 [info] circuit_send_next_onion_skin(): First hop: finished sending CREATE cell to '${scrubbed1}~{scrubbed1} at {scrubbed1}'
22:11:17 [info] run_connection_housekeeping(): Expiring non-used OR connection to fd 427 (23.239.18.57:9001) [no circuits for 1090; timeout 1089; canonical].
22:11:17 [debug] scheduler_evt_callback(): Scheduler event callback called
22:11:17 [debug] scheduler_run(): We have a chance to run the scheduler
22:11:17 [debug] channel_write_packed_cell() { COUNT 261 }
22:11:17 [debug] scheduler_run(): Scheduler handled 80 of 80 pending channels, queue size from 102876843 to 103013928, queue heuristic from 0 to 137085
22:11:17 [info] rep_hist_note_or_conn_bytes(): 1793602 below threshold, 128897 mostly read, 116725 mostly written, 223398 both read and written.
22:11:17 [debug] channel_write_packed_cell() { COUNT 2 }
22:11:17 [info] circuit_finish_handshake(): Finished building circuit hop:
22:11:17 [info] exit circ (length 3, last hop zwiebeltoralf): ${scrubbed1}(open) $A6F64642F6D4C15E62BC421589E013433193055D(closed) $F1BE15429B3CE696D6807F4D4A58B1BFEC45C822(closed)
22:11:17 [info] circuit_send_next_onion_skin(): Sending extend relay cell.
22:11:17 [debug] channel_write_packed_cell() { COUNT 1 }
22:11:17 [info] connection_edge_process_relay_cell(): sendme cell dropped, unknown stream (streamid 304).
22:11:17 [debug] channel_write_packed_cell() { COUNT 1 }
22:11:17 [info] connection_handle_listener_read(): New SOCKS connection opened from {scrubbed}.
22:11:17 [debug] channel_write_packed_cell() { COUNT 6 }
22:11:17 [info] channel_register(): Channel 0x7fa2b3494de0 (global ID 2657985) in state opening (1) registered with no identity digest
22:11:17 [debug] channel_write_packed_cell() { COUNT 1 }
22:11:18 [info] choose_good_exit_server_general(): Found 894 servers that might support 0/0 pending connections.
22:11:18 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as exit
22:11:18 [info] choose_good_exit_server_general(): Chose exit server '$7DD29A65C370B86B5BE706EA3B1417745714C8AF~dreamatorium at 89.31.57.5'
22:11:18 [info] circuit_send_next_onion_skin(): First hop: finished sending CREATE cell to '${scrubbed1}~{scrubbed1} at {scrubbed1}'
22:11:18 [debug] scheduler_evt_callback(): Scheduler event callback called
22:11:18 [debug] scheduler_run(): We have a chance to run the scheduler
22:11:18 [debug] channel_write_packed_cell() { COUNT 318 }
22:11:18 [debug] scheduler_run(): Scheduler handled 87 of 87 pending channels, queue size from 102864432 to 103031360, queue heuristic from 0 to 166928
22:11:18 [info] directory_handle_command_get(): Got a v2 rendezvous descriptor request for ID '[scrubbed]'
22:11:18 [info] directory_handle_command_get(): Got a v2 rendezvous descriptor request for ID '[scrubbed]'
22:11:18 [info] directory_handle_command_get(): Got a v2 rendezvous descriptor request for ID '[scrubbed]'
22:11:18 [info] circuit_finish_handshake(): Finished building circuit hop:
22:11:18 [info] exit circ (length 3, last hop TorLand1): ${scrubbed1}(open) $3B2757383632D8C20B5E02DB55FEFF6198C8FEF5(closed) $E1E922A20AF608728824A620BADC6EFC8CB8C2B8(closed)
22:11:18 [info] circuit_send_next_onion_skin(): Sending extend relay cell.
22:11:18 [info] circuit_finish_handshake(): Finished building circuit hop:
22:11:18 [info] exit circ (length 3, last hop spfTOR1e1): ${scrubbed1}(open) $28F5601709FDCAA51C60E1D7004F40EFA0C99DE4(closed) $21ABB9584677EEB54514265F24756AD0B04E0EFA(closed)
22:11:18 [info] circuit_send_next_onion_skin(): Sending extend relay cell.
22:11:18 [info] connection_edge_process_relay_cell(): sendme cell dropped, unknown stream (streamid 304).
22:11:18 [debug] channel_write_packed_cell() { COUNT 1 }
22:11:18 [info] connection_handle_listener_read(): New SOCKS connection opened from {scrubbed}.
22:11:18 [debug] channel_write_packed_cell() { COUNT 1 }
22:11:18 [info] channel_register(): Channel 0x7fa2af81b840 (global ID 2657986) in state opening (1) registered with no identity digest
22:11:18 [debug] channel_write_packed_cell() { COUNT 4 }
22:11:18 [debug] scheduler_evt_callback(): Scheduler event callback called
22:11:18 [debug] scheduler_run(): We have a chance to run the scheduler
22:11:18 [debug] channel_write_packed_cell() { COUNT 1 }
22:11:18 [debug] scheduler_run(): Scheduler handled 97 of 97 pending channels, queue size from 102895271 to 103041195, queue heuristic from 0 to 145924

Changed 3 years ago by starlight

Attachment: cliff5_logfile_intrst.txt added

most interesting section of debug log edited for relevence and brevity

comment:7 Changed 3 years ago by starlight

Looking at it more, it seems the biggest delays are
associated with searching the database of relays
for new circuit paths.

Seems to me this massive transaction should be handled as an independent event queued off to a worker thread, but it is presently handled in the main relay worker code path.
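For illustration only, a minimal pthreads sketch of that shape, assuming a simple request queue — path_request_t, select_path_blocking() and notify_main_thread() are invented names, not Tor's actual cpuworker API. The main thread enqueues a request in constant time and keeps forwarding cells; a worker thread performs the expensive search.

/* Hypothetical sketch: move the relay-list search off the main event
 * loop so cell forwarding is never stalled by path selection. */
#include <pthread.h>
#include <stdlib.h>

extern void select_path_blocking(void *circ);  /* hypothetical: the heavy search */
extern void notify_main_thread(void *circ);    /* hypothetical: wake the event loop */

typedef struct path_request_t {
  struct path_request_t *next;
  void *circ;                          /* circuit awaiting a path */
} path_request_t;

static path_request_t *queue_head = NULL;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;

/* Called from the main thread: O(1), never blocks on the search. */
void
queue_path_request(void *circ)
{
  path_request_t *req = malloc(sizeof(*req));
  if (!req)
    return;
  req->circ = circ;
  pthread_mutex_lock(&queue_lock);
  req->next = queue_head;
  queue_head = req;
  pthread_cond_signal(&queue_cond);
  pthread_mutex_unlock(&queue_lock);
}

/* Worker thread: performs the search without stalling cell forwarding. */
void *
path_worker(void *arg)
{
  (void)arg;
  for (;;) {
    pthread_mutex_lock(&queue_lock);
    while (!queue_head)
      pthread_cond_wait(&queue_cond, &queue_lock);
    path_request_t *req = queue_head;
    queue_head = req->next;
    pthread_mutex_unlock(&queue_lock);

    select_path_blocking(req->circ);   /* the multi-millisecond part */
    notify_main_thread(req->circ);
    free(req);
  }
  return NULL;
}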

Probably back when only 2000 relays existed the delays were barely noticeable, but with the population at 7000 and rising, the problem will only get worse with time.

The system where this ran has two cores, but the second worker thread/core never exceeds 1% utilization. Have run top while provoking the issue and it is clear that the problem is a CPU-bound main tor thread.

NOTE: In the attached "interesting" log
extraction, all [debug] lines are removed
from the first section above the line-break.
Below the line-break, only selected [debug]
entries are included, and channel_write_packed_cell()
lines are collapsed to show just the number
of invocations.

comment:8 Changed 3 years ago by starlight

Another idea would be to modify the path-search code so it can pause and resume after some increment of work has completed. Control could then be returned to the main event loop frequently enough to let cell-forwarding processing run at more reasonable intervals.
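A rough sketch of that shape against libevent (which Tor's main loop already uses); path_search_state_t and path_search_step() are invented for illustration, and the real increment would live inside the path-selection code:

/* Hypothetical sketch: run the path search in bounded slices, re-arming
 * a zero-timeout event between slices so cell forwarding is serviced. */
#include <event2/event.h>

#define WORK_UNITS_PER_SLICE 256

typedef struct {
  struct event *ev;
  int next_index;                      /* resume point in the relay list */
  int n_relays;
} path_search_state_t;

/* Hypothetical: advances one relay, returns 1 when a path is chosen. */
extern int path_search_step(path_search_state_t *st);

static void
path_search_cb(evutil_socket_t fd, short what, void *arg)
{
  (void)fd; (void)what;
  path_search_state_t *st = arg;
  for (int i = 0; i < WORK_UNITS_PER_SLICE; i++) {
    if (path_search_step(st)) {        /* done: stop re-arming */
      event_free(st->ev);
      return;
    }
  }
  /* Not done yet: yield to the event loop and resume on the next tick. */
  struct timeval zero = { 0, 0 };
  event_add(st->ev, &zero);
}

void
start_incremental_path_search(struct event_base *base,
                              path_search_state_t *st)
{
  struct timeval zero = { 0, 0 };
  st->ev = event_new(base, -1, 0, path_search_cb, st);
  event_add(st->ev, &zero);
}

Tuning WORK_UNITS_PER_SLICE trades path-build latency against forwarding jitter; the key property is that no single slice holds the loop for long.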

comment:9 Changed 3 years ago by starlight

After some thought, realized that disturbances in cell forwarding caused by client circuit formation can probably be employed in traffic-confirmation analysis. Further, it seems unlikely that improving the transaction granularity of the client code will eliminate the potential for traffic confirmation as long as the work is performed by the same thread.

Created bug #16824 to recommend that simultaneous
client and relay operation be discouraged with
a new configuration parameter.

This ticket may still have relevance, as it appears circuit creation failures incur excessive processing relative to nominal circuit creation, and this area should be improved.

comment:10 Changed 3 years ago by starlight

This flaw is bad for anonymity!

An adversary in a position to analyze the traffic of a relay that also operates as a client can easily identify outbound cells that are part of new circuit establishment. Those cells are written after a pause in general traffic forwarding and immediately ahead of the resumption of forwarding. It is clear from the debug trace that the only cells written during circuit construction are related to the new circuit, while forwarding traffic is held suspended on queues.

comment:11 Changed 3 years ago by nickm

Milestone: Tor: 0.2.7.x-final

Should at least get a warning message into 0.2.7, and ideally a better fix still.

comment:12 Changed 3 years ago by nickm

Keywords: PostFreeze027 added

I'd merge patches for these for 0.2.7 if they come in on time. In some cases, that will require figuring out an as-yet-unsolved bug.

comment:13 Changed 3 years ago by mikeperry

Cc: mikeperry added

comment:14 Changed 3 years ago by mikeperry

What if there are more sources of this type of side-channel caused by other slow and/or sensitive operations? Perhaps the actual channel send+recv handling is what should be done in a worker thread, rather than as a scheduled event, to avoid leaking any such side-channel in such an obvious way to the wire.

Related question: Would the transition to multithreaded crypto help this at all, or would that be mostly orthogonal? (https://trac.torproject.org/projects/tor/wiki/org/projects/Tor/MultithreadedCrypto).

As a stopgap, maybe we can create cached data structures of the relay lists that allow for faster searching for a good relay in frequently-repeated cases?
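One plausible shape for such a cache, sketched here with invented names (weight_of() stands in for the consensus-weight rules; this is not Tor's actual compute_weighted_bandwidths() code): build a cumulative-weight array once per consensus, so each weighted pick becomes an O(log n) binary search instead of an O(n) rescan.

/* Hypothetical sketch: cache cumulative relay weights once per
 * consensus; each weighted selection is then a binary search. */
#include <stdint.h>
#include <stdlib.h>

extern uint64_t weight_of(int relay_index);  /* hypothetical weight rule */

typedef struct {
  uint64_t *cumulative;                /* cumulative[i] = sum of weights 0..i */
  int n;
} weight_cache_t;

/* Rebuild once per new consensus, not once per circuit. */
void
weight_cache_build(weight_cache_t *wc, int n_relays)
{
  wc->n = n_relays;
  wc->cumulative = malloc(n_relays * sizeof(uint64_t));
  uint64_t total = 0;
  for (int i = 0; i < n_relays; i++) {
    total += weight_of(i);
    wc->cumulative[i] = total;
  }
}

/* Pick an index with probability proportional to its weight.  The
 * caller supplies a strong random value; modulo bias is ignored in
 * this sketch. */
int
weight_cache_pick(const weight_cache_t *wc, uint64_t rand_value)
{
  uint64_t target = rand_value % wc->cumulative[wc->n - 1];
  int lo = 0, hi = wc->n - 1;
  while (lo < hi) {                    /* first i with cumulative[i] > target */
    int mid = lo + (hi - lo) / 2;
    if (wc->cumulative[mid] > target)
      hi = mid;
    else
      lo = mid + 1;
  }
  return lo;
}

The O(n) build cost is then paid once per consensus rather than on every circuit attempt, which is exactly the frequently-repeated case the histogram in comment:5 shows.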

Last edited 3 years ago by mikeperry

comment:15 in reply to:  14 Changed 3 years ago by starlight

Replying to mikeperry:

What if there are more sources of this type of side-channel caused by other slow and/or sensitive operations? Perhaps the actual channel send+recv handling is what should be done in a worker thread, rather than as a scheduled event, to avoid leaking any such side-channel in such an obvious way to the wire.

This seems a good solution to me. Have cell forwarding run in a dedicated thread, separate from all the various other transaction types. Perhaps it is not overly difficult, since most everything runs as libevent transactions. A separate libevent thread for cell forwarding, with brief-hold locks to protect the cell queues, could perhaps be the structure.
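A hypothetical sketch of that structure (cell_t and the output step are invented, and a real version would also preserve cell ordering): the forwarding thread runs its own event_base, and the shared queue is touched only inside brief critical sections.

/* Hypothetical sketch: dedicated forwarding thread with its own
 * event_base, fed through a mutex-protected cell queue where the lock
 * is held only long enough to link or unlink cells. */
#include <event2/event.h>
#include <event2/thread.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct cell_s { struct cell_s *next; /* ... payload ... */ } cell_t;

static pthread_mutex_t cell_queue_lock = PTHREAD_MUTEX_INITIALIZER;
static cell_t *cell_queue = NULL;
static struct event *forward_ev;       /* owned by the forwarding thread */

/* Forwarding thread: take the whole batch in one brief lock hold.
 * Heavy client work (path search, onion-skins) never runs here. */
static void
forward_cb(evutil_socket_t fd, short what, void *arg)
{
  (void)fd; (void)what; (void)arg;
  pthread_mutex_lock(&cell_queue_lock);
  cell_t *batch = cell_queue;
  cell_queue = NULL;
  pthread_mutex_unlock(&cell_queue_lock);
  while (batch) {                      /* note: LIFO in this sketch */
    cell_t *next = batch->next;
    /* write_cell_to_conn(batch); -- hypothetical output step */
    free(batch);
    batch = next;
  }
}

void *
forwarding_thread(void *arg)
{
  struct event_base *base = arg;       /* created after calling
                                          evthread_use_pthreads(), so
                                          event_active() is safe
                                          cross-thread */
  forward_ev = event_new(base, -1, 0, forward_cb, NULL);
  event_base_loop(base, EVLOOP_NO_EXIT_ON_EMPTY);
  return NULL;
}

/* Main thread: brief-hold enqueue, then poke the forwarding loop. */
void
hand_off_cell(cell_t *cell)
{
  pthread_mutex_lock(&cell_queue_lock);
  cell->next = cell_queue;
  cell_queue = cell;
  pthread_mutex_unlock(&cell_queue_lock);
  event_active(forward_ev, 0, 1);
}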

comment:16 Changed 3 years ago by mikeperry

Keywords: 0.2.8.x-triage added
Summary: relay stops forwarding all traffic when client function cannot establish circuits -- scheduler_run() starvation → Network activity from scheduler_run() creates multiple side channel information leaks

comment:17 Changed 3 years ago by mikeperry

Keywords: 028-triage added; 0.2.8.x-triage removed

comment:18 Changed 3 years ago by mikeperry

#9241 may help with the original side channel that starlight found here, especially if we can pre-build many more paths than we need (or schedule that on different threads).

comment:19 Changed 3 years ago by nickm

Keywords: 027-backport added
Milestone: Tor: 0.2.7.x-final → Tor: 0.2.8.x-final

comment:20 Changed 2 years ago by nickm

Milestone: Tor: 0.2.8.x-final → Tor: 0.2.9.x-final

Throw most 0.2.8 "NEW" tickets into 0.2.9. I expect that many of them will subsequently get triaged out.

comment:21 Changed 2 years ago by nickm

Priority: High → Medium

comment:22 Changed 2 years ago by isabela

Milestone: Tor: 0.2.9.x-final → Tor: 0.2.???

Tickets marked to be removed from milestone 029.

comment:23 Changed 21 months ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:24 Changed 20 months ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:25 Changed 15 months ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:26 Changed 15 months ago by nickm

Keywords: 028-triage removed

comment:27 Changed 15 months ago by nickm

Keywords: 027-backport removed

These are not ripe for an 027 backport

comment:28 Changed 15 months ago by nickm

Keywords: PostFreeze027 removed

comment:29 Changed 14 months ago by nickm

Keywords: tor-client guard side-channel added
Severity: Normal

comment:30 Changed 3 months ago by arma

This "we don't call the scheduler for seconds at a time" bug looks like something we should fix, right?

And the focus on what exactly our warning should be is a distraction from identifying why the bug happens and how to refactor things so stuff doesn't get starved?
