Opened 2 months ago

Last modified 4 weeks ago

#31653 assigned defect

Padding cells sent with 0ms delay cause circuit failures

Reported by: pulls Owned by: mikeperry
Priority: Medium Milestone: Tor: 0.4.2.x-final
Component: Core Tor/Tor Version: Tor: 0.4.1.5
Severity: Normal Keywords: wtf-pad circpad-researchers-want 042-should
Cc: mikeperry Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Below appears to be a bug that results in a closed circuit due to a relay sending a padding cell (RELAY_COMMAND_DROP) to the client.

At links below you find code for two circuit padding machines:

  1. circpad_machine_client_close_circuit_minimal
  2. circpad_machine_relay_close_circuit_minimal

Machine 1 runs on a client on CIRCUIT_PURPOSE_C_GENERAL circuits with 3 hops as soon as CIRCPAD_CIRC_OPENED: the only thing it does is set a timer 1000s in the future and waits for the timer to expire. The purpose of the machine is to negotiate padding with the relay to activate Machine 2 on the middle relay.

Machine 2 runs at the middle relay and repeatedly sends a padding cell to the client 1 usec after the event CIRCPAD_EVENT_NONPADDING_SENT. In other words, for every non-padding cell we directly add a padding cell. At the client, this causes relay_decrypt_cell(): Incoming cell at client not recognized. Closing. for all circuits triggering Machine 2 at the relay. Closing a circuit by injecting padding cells seems unintended.

Here is part of a log from a client on info level showing how the machine registers, negotiates with the relay (starting Machine 2 at the relay), eventually fails to decrypt, and the circuit closes.

Sep 05 10:32:19.000 [info] circpad_setup_machine_on_circ(): Registering machine client_close_circuit_minimal to origin circ 3 (5)
Sep 05 10:32:19.000 [info] circpad_node_supports_padding(): Checking padding: supported
Sep 05 10:32:19.000 [info] circpad_negotiate_padding(): Negotiating padding on circuit 3 (5), command 2
Sep 05 10:32:19.000 [info] link_apconn_to_circ(): Looks like completed circuit to [scrubbed] does allow optimistic data for connection to [scrubbed]
Sep 05 10:32:19.000 [info] connection_ap_handshake_send_begin(): Sending relay cell 0 on circ 3296464733 to begin stream 20575.
Sep 05 10:32:19.000 [info] connection_ap_handshake_send_begin(): Address/port sent, ap socket 13, n_circ_id 3296464733
Sep 05 10:32:19.000 [info] rep_hist_note_used_port(): New port prediction added. Will continue predictive circ building for 2355 more seconds.
Sep 05 10:32:19.000 [info] connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
Sep 05 10:32:19.000 [info] exit circ (length 3): $[manually-scrubbed](open) $[manually-scrubbed](open) $[manually-scrubbed](open)
Sep 05 10:32:19.000 [info] pathbias_count_use_attempt(): Used circuit 3 is already in path state use attempted. Circuit is a General-purpose client currently open.
Sep 05 10:32:19.000 [info] link_apconn_to_circ(): Looks like completed circuit to [scrubbed] does allow optimistic data for connection to [scrubbed]
Sep 05 10:32:19.000 [info] connection_ap_handshake_send_begin(): Sending relay cell 0 on circ 3296464733 to begin stream 20576.
Sep 05 10:32:19.000 [info] connection_ap_handshake_send_begin(): Address/port sent, ap socket 14, n_circ_id 3296464733
Sep 05 10:32:19.000 [info] circpad_deliver_recognized_relay_cell_events(): Got padding cell on origin circuit 3.
Sep 05 10:32:20.000 [info] relay_decrypt_cell(): Incoming cell at client not recognized. Closing.
Sep 05 10:32:20.000 [info] circuit_receive_relay_cell(): relay crypt failed. Dropping connection.
Sep 05 10:32:20.000 [info] command_process_relay_cell(): circuit_receive_relay_cell (backward) failed. Closing.
Sep 05 10:32:20.000 [info] circpad_circuit_machineinfo_free_idx(): Freeing padding info idx 0 on circuit 3 (23)
Sep 05 10:32:20.000 [info] circpad_node_supports_padding(): Checking padding: supported
Sep 05 10:32:20.000 [info] circpad_negotiate_padding(): Negotiating padding on circuit 3 (23), command 1
Sep 05 10:32:20.000 [info] pathbias_send_usable_probe(): Sending pathbias testing cell to 0.233.9.53:25 on stream 20577 for circ 3.
Sep 05 10:32:20.000 [info] relay_decrypt_cell(): Incoming cell at client not recognized. Closing.
Sep 05 10:32:20.000 [info] circuit_receive_relay_cell(): relay crypt failed. Dropping connection.
Sep 05 10:32:20.000 [info] command_process_relay_cell(): circuit_receive_relay_cell (backward) failed. Closing.
Sep 05 10:32:20.000 [info] pathbias_send_usable_probe(): Got pathbias probe request for circuit 3 with outstanding probe
Sep 05 10:32:20.000 [info] pathbias_check_close(): Circuit 3 closed without successful use for reason 2. Circuit purpose 23 currently 1,open. Len 3.
Sep 05 10:32:20.000 [info] circuit_mark_for_close_(): Circuit 3296464733 (id: 3) marked for close at src/core/or/command.c:582 (orig reason: 2, new reason: 0)
Sep 05 10:32:20.000 [info] connection_edge_destroy(): CircID 0: At an edge. Marking connection for close.
Sep 05 10:32:20.000 [info] connection_edge_destroy(): CircID 0: At an edge. Marking connection for close.
Sep 05 10:32:20.000 [info] circuit_free_(): Circuit 0 (id: 3) has been freed.

If we delay sending the padding from the relay I cannot trigger the bug (see commented out fix in the Machine 2 function). With the code below, the code triggers on every circuit that has the machine negotiated.

Code: https://github.com/pylls/tor/tree/circuit-padding-relay-padding-bug (https://github.com/pylls/tor/blob/circuit-padding-relay-padding-bug/src/core/or/circuitpadding_machines.c#L460, as well as circuitpadding_machines.h and registration in circpad_machines_init() of circuitpadding.c).

Child Tickets

Change History (17)

comment:1 Changed 2 months ago by pulls

Scrubbed comment, demo machine is no longer running at the provided middle.

Last edited 2 months ago by pulls (previous) (diff)

comment:2 Changed 2 months ago by dgoulet

Cc: mikeperry added
Milestone: Tor: 0.4.2.x-final

comment:3 Changed 2 months ago by mikeperry

The codepath in circpad_machine_schedule_padding() for 0 delay uses a direct call instead of a scheduled callback.

This might be causing an out-of-order AES ctr issue where the padding cell is being sent before the cell that triggered it, but somehow the AES counters are not updated correctly for this ordering. This should not happen normally... Are you also using the branch from #29494 by any chance? That might mess with your cell ordering in this case...

As a workaround: Can you try replacing the direct call to circpad_send_padding_cell_for_callback() from circpad_machine_schedule_padding() for the case where in_usec <0 (https://github.com/pylls/tor/blob/40c6f9bd887bdec7ed3bb03c690dd3d560321d48/src/core/or/circuitpadding.c#L1485) with an assignment of either in_usec = 0 or in_usec = 1?

By removing the direct call to circpad_send_padding_cell_for_callback() and instead letting code continue to the timer_set_cb() codepath, this will cause us to unwind back to libevent to call timer_set_cb() on the next event loop.

Hope this helps!

Last edited 2 months ago by mikeperry (previous) (diff)

comment:4 Changed 2 months ago by pulls

Thanks, makes sense. The workaround works, no more closed circuits with:

if (in_usec <= 0) {
    //return circpad_send_padding_cell_for_callback(mi);
    in_usec = 0;
  }

The relay wasn't branched from #29494 though, but tor-0.4.1.5.

comment:5 Changed 2 months ago by arma

Haven't looked at all the code, but upon reading the bug report, I also suspect an out-of-order send. Perhaps the code that is about to send the cell triggers the code to check for padding, and the code to check for padding sneaks in a padding cell right then, and then it returns and the original code sends its cell.

comment:6 Changed 2 months ago by mikeperry

Keywords: circpad-researchers-want added; circpad removed

The #29494 implementation will change this codepath. It is reasonable for now to merge a workaround that just sets in_usec = 0 to ensure correct behavior for 0 delay in the meantime.

There is roughly a 2/1000 chance of this happening during client rend circuit creation in production in 0.4.1.x (because the left edge of the histogram for that circuit type is 0ms and the right edge is 1000, and we roll the dice once on the client side and once on the relay side).

comment:7 Changed 2 months ago by mikeperry

Summary: padding machine sending padding from relay to client closes circuitPadding cells sent with 0ms delay cause circuit failures

comment:8 Changed 2 months ago by nickm

Keywords: 042-should added

comment:9 Changed 8 weeks ago by mikeperry

Owner: set to mikeperry
Status: newassigned

comment:10 Changed 7 weeks ago by mikeperry

This is a tricky thing to fix; not because of the fix itself, but because our unit tests frequently schedule 0usec padding (often with only some probability) and expect a direct call.

Fixing all the tests to ensure that there are no places where they can still flap is more complicated than I have time for right now.

comment:11 Changed 7 weeks ago by mikeperry

Owner: changed from mikeperry to none

I thought I could do this quickly; turns out I can't. Not before Oct 22 (Firefox ESR deadline)

comment:12 Changed 6 weeks ago by nickm

Owner: changed from none to mikeperry

comment:13 Changed 6 weeks ago by nickm

("not before oct 22" is okay if the fix itself is simple. Mike also says he might be able to do a simple fix later next week)

comment:14 Changed 6 weeks ago by mikeperry

Tobias: FYI I have noticed a perf issue with this solution. Going back into libevent for the callback introduces anywhere from 0-10ms delay, at random, on just a client. On relays, it may be much worse.. Or maybe better, if they are not building circuits (client path construction can block the event loop for a long time).

For origin/master, this means we need to fast-path the 0-delay case still somehow without callbacks, and also warn about this in the developer doc. I bet trying to compose a packet train to fake a burst that actually has 9ms delay between packets is going to get seen by classifiers pretty easily :/.

comment:15 Changed 6 weeks ago by pulls

Yeah, that's not great, a delay of 0-10 ms is massive. Even worse, basically how timers behave becomes a function of load? Simulations based on unit-tests (#31788) will be harder to tweak since one should account for relay load as well.

At the relay, this kind of variability may not be all that bad. Time is messy for classifiers and you already have a lot of natural variability at this end together with typically many cells going towards the client (so you can make a machine start to create a cell train from the early cells, working around some possible delay). The client-side delay is the worst I think, because here the natural variability is typically the time it takes for Firefox to queue up more GET requests.

I think adding priority for padding timers is a good #circpad-researchers-want and in the meantime we can recommend that researchers working on defenses focus on Deep Fingerprinting, since it doesn't use time. Deep Fingerprinting shares architecture with Var-CNN and Tik-Tok, so it would be really interesting to see a defense that works on Deep Fingerprinting in the circpad framework but fails to the other attacks.

comment:16 Changed 4 weeks ago by mikeperry

Bleh. It is unfortunate that clients need more accurate timers than relays.

  1. Do you have any sense as to if client-side timing is more important because most test crawls tend to use client-side timings as opposed to guard-side timings (and thus inherently get very high client-side timing resolution and visibility into Firefox delays), or because of something else that is just inherent to the HTTP protocol?
  1. If I were to write a patch that allowed either clients or relays to correctly fast-path this 0ms case to insert bursts of back-to-back packets without the circuit failure, would that help?

comment:17 Changed 4 weeks ago by pulls

For 1, I would guess it's mostly due to how people collect traces and do evaluation, but as is the case for 2, I don't really know. As discussed for the simulator, I think the first order of business is to get reasonably efficient machines that can defend against deep learning attacks without time and then take it from there.

Note: See TracTickets for help on using tickets.