Padding cells sent with 0ms delay cause circuit failures

changed milestone to %Tor: unspecified

added 042-should 043-deferred 044-deferred circpad-researchers-want component::core tor/tor milestone::Tor: unspecified owner::mikeperry priority::medium severity::normal status::assigned type::defect version::tor 0.4.1.5 wtf-pad labels

Scrubbed comment; the demo machine is no longer running at the provided middle relay.

Trac:
Cc: N/A to mikeperry
Milestone: N/A to Tor: 0.4.2.x-final

The codepath in circpad_machine_schedule_padding() for 0 delay uses a direct call instead of a scheduled callback.

This might be causing an out-of-order AES-CTR issue, where the padding cell is sent before the cell that triggered it and the AES counters are somehow not updated correctly for this ordering. This should not happen normally... Are you also using the branch from #29494 (moved) by any chance? That might mess with your cell ordering in this case...
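To illustrate why cell order matters for any counter-mode stream, here is a minimal sketch. It is a toy model, not AES and not Tor's relay crypto: the keystream function, the `toy_` names, and the 8-byte cell size are all made up for illustration. The sender encrypts the data cell first and the padding cell second; the receiver decrypts in arrival order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_CELL_LEN 8

/* Toy counter-mode stream. Hypothetical: NOT AES, NOT Tor's relay crypto. */
typedef struct { uint32_t counter; } toy_stream_t;

/* Arbitrary keystream byte derived from the counter. Bytes that are 8
 * positions apart always differ (8 * 37 mod 256 = 40). */
static uint8_t toy_keystream_byte(uint32_t counter) {
  return (uint8_t)(counter * 37u + 11u);
}

/* XOR the cell with the keystream and advance the counter.
 * Encryption and decryption are the same operation. */
static void toy_crypt_cell(toy_stream_t *s, uint8_t *cell, size_t len) {
  assert(s != NULL && cell != NULL);
  for (size_t i = 0; i < len; i++)
    cell[i] ^= toy_keystream_byte(s->counter++);
}

/* Returns 1 if the receiver recovers both plaintexts, 0 otherwise.
 * The sender always encrypts data first, then padding; the argument
 * only changes the order in which the cells arrive. */
static int toy_cells_decrypt_ok(int padding_first_on_wire) {
  const uint8_t data[TOY_CELL_LEN] = "DATA...";
  const uint8_t pad[TOY_CELL_LEN] = "PADDING";
  uint8_t data_ct[TOY_CELL_LEN], pad_ct[TOY_CELL_LEN];
  toy_stream_t sender = {0}, receiver = {0};

  memcpy(data_ct, data, TOY_CELL_LEN);
  memcpy(pad_ct, pad, TOY_CELL_LEN);
  toy_crypt_cell(&sender, data_ct, TOY_CELL_LEN); /* counters 0..7  */
  toy_crypt_cell(&sender, pad_ct, TOY_CELL_LEN);  /* counters 8..15 */

  uint8_t *first = padding_first_on_wire ? pad_ct : data_ct;
  uint8_t *second = padding_first_on_wire ? data_ct : pad_ct;
  toy_crypt_cell(&receiver, first, TOY_CELL_LEN);  /* counters 0..7  */
  toy_crypt_cell(&receiver, second, TOY_CELL_LEN); /* counters 8..15 */

  return memcmp(data_ct, data, TOY_CELL_LEN) == 0 &&
         memcmp(pad_ct, pad, TOY_CELL_LEN) == 0;
}
```

In this model, `toy_cells_decrypt_ok(0)` succeeds and `toy_cells_decrypt_ok(1)` fails, because the receiver applies keystream positions 0..7 to a cell that was encrypted with positions 8..15. Whether this is what actually happens in tor here is only the suspicion stated above.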

As a workaround, can you try replacing the direct call to circpad_send_padding_cell_for_callback() in circpad_machine_schedule_padding(), for the case where in_usec <= 0 (https://github.com/pylls/tor/blob/40c6f9bd887bdec7ed3bb03c690dd3d560321d48/src/core/or/circuitpadding.c#L1485), with an assignment of either in_usec = 0 or in_usec = 1?

With the direct call to circpad_send_padding_cell_for_callback() removed, the code continues to the timer_set_cb() codepath, so we unwind back to libevent and the padding callback runs on the next event loop iteration.

Hope this helps!

Thanks, makes sense. The workaround works, no more closed circuits with:

  if (in_usec <= 0) {
    //return circpad_send_padding_cell_for_callback(mi);
    in_usec = 0;
  }

The relay wasn't built from the #29494 (moved) branch, though; it was built from tor-0.4.1.5.

Haven't looked at all the code, but upon reading the bug report, I also suspect an out-of-order send. Perhaps the code that is about to send the cell triggers the code to check for padding, and the code to check for padding sneaks in a padding cell right then, and then it returns and the original code sends its cell.
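The suspected ordering can be sketched with a toy event loop. Everything here is hypothetical: the `toy_` names are invented, and the assumption that the send path invokes the padding hook before writing its own cell is the guess from this thread, not something confirmed in the tor source. The sketch compares a direct call with deferring the padding to the next event loop iteration, as in the workaround.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CELLS 8

typedef enum { TOY_PAD_DIRECT, TOY_PAD_DEFERRED } toy_pad_mode_t;

typedef struct {
  toy_pad_mode_t mode;
  char wire[TOY_MAX_CELLS + 1]; /* cells in the order they left: 'D' or 'P' */
  size_t wire_len;
  int pending_padding;          /* padding callbacks waiting for the loop */
} toy_circuit_t;

static void toy_wire_put(toy_circuit_t *c, char cell) {
  assert(c->wire_len < TOY_MAX_CELLS);
  c->wire[c->wire_len++] = cell;
  c->wire[c->wire_len] = '\0';
}

/* Zero-delay padding decision: send right now, or leave it to the loop. */
static void toy_schedule_padding(toy_circuit_t *c) {
  if (c->mode == TOY_PAD_DIRECT)
    toy_wire_put(c, 'P');
  else
    c->pending_padding++;
}

/* Assumed send path: the padding hook runs before the cell is written. */
static void toy_send_data_cell(toy_circuit_t *c) {
  toy_schedule_padding(c);
  toy_wire_put(c, 'D');
}

/* One event loop iteration: run the deferred padding callbacks. */
static void toy_run_event_loop_once(toy_circuit_t *c) {
  while (c->pending_padding > 0) {
    c->pending_padding--;
    toy_wire_put(c, 'P');
  }
}

/* Send one data cell in the given mode and return the wire order. */
static const char *toy_simulate(toy_circuit_t *c, toy_pad_mode_t mode) {
  memset(c, 0, sizeof(*c));
  c->mode = mode;
  toy_send_data_cell(c);
  toy_run_event_loop_once(c);
  return c->wire;
}
```

With `TOY_PAD_DIRECT` the wire order is "PD" (padding ahead of the cell that triggered it); with `TOY_PAD_DEFERRED` it is "DP", since the padding only goes out after the send path has returned to the loop.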

The #29494 (moved) implementation will change this codepath. It is reasonable for now to merge a workaround that just sets in_usec = 0 to ensure correct behavior for 0 delay in the meantime.

There is roughly a 2/1000 chance of this happening during client rend circuit creation in production in 0.4.1.x (because the left edge of the histogram for that circuit type is 0ms and the right edge is 1000, and we roll the dice once on the client side and once on the relay side).
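The "roughly 2/1000" figure can be reproduced under one assumption that the comment above does not spell out: each of the two rolls independently lands on a 0ms delay with probability p = 1/1000 (for example, a uniform draw at 1ms granularity between the histogram edges). With that assumption:

```latex
P(\text{at least one 0ms delay}) = 1 - (1 - p)^2 = 2p - p^2,
\qquad p = \tfrac{1}{1000} \;\Rightarrow\; P = 0.001999 \approx \tfrac{2}{1000}
```

If the actual sampling granularity or distribution differs, p changes, and the estimate changes with it.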

Trac:
Keywords: wtf-pad circpad deleted, wtf-pad circpad-researchers-want added

Trac:
Summary: padding machine sending padding from relay to client closes circuit to Padding cells sent with 0ms delay cause circuit failures

Trac:
Keywords: wtf-pad circpad-researchers-want deleted, wtf-pad circpad-researchers-want 042-should added

Trac:
Owner: N/A to mikeperry
Status: new to assigned

This is a tricky thing to fix; not because of the fix itself, but because our unit tests frequently schedule 0usec padding (often with only some probability) and expect a direct call.

Fixing all the tests to ensure that there are no places where they can still flap is more complicated than I have time for right now.

I thought I could do this quickly; turns out I can't. Not before Oct 22 (Firefox ESR deadline).

Trac:
Owner: mikeperry to none

Trac:
Owner: none to mikeperry

("not before oct 22" is okay if the fix itself is simple. Mike also says he might be able to do a simple fix later next week)

Tobias: FYI, I have noticed a perf issue with this solution. Going back into libevent for the callback introduces a random delay of anywhere from 0 to 10ms, on just a client. On relays, it may be much worse. Or maybe better, if they are not building circuits (client path construction can block the event loop for a long time).

For origin/master, this means we need to fast-path the 0-delay case still somehow without callbacks, and also warn about this in the developer doc. I bet trying to compose a packet train to fake a burst that actually has 9ms delay between packets is going to get seen by classifiers pretty easily :/.

Yeah, that's not great, a delay of 0-10 ms is massive. Even worse, basically how timers behave becomes a function of load? Simulations based on unit-tests (#31788 (moved)) will be harder to tweak since one should account for relay load as well.

At the relay, this kind of variability may not be all that bad. Time is messy for classifiers and you already have a lot of natural variability at this end together with typically many cells going towards the client (so you can make a machine start to create a cell train from the early cells, working around some possible delay). The client-side delay is the worst I think, because here the natural variability is typically the time it takes for Firefox to queue up more GET requests.

I think adding priority for padding timers is a good #circpad-researchers-want and in the meantime we can recommend that researchers working on defenses focus on Deep Fingerprinting, since it doesn't use time. Deep Fingerprinting shares architecture with Var-CNN and Tik-Tok, so it would be really interesting to see a defense that works on Deep Fingerprinting in the circpad framework but fails to the other attacks.

Bleh. It is unfortunate that clients need more accurate timers than relays.

1. Do you have any sense of whether client-side timing is more important because most test crawls tend to use client-side timings as opposed to guard-side timings (and thus inherently get very high client-side timing resolution and visibility into Firefox delays), or because of something else that is just inherent to the HTTP protocol?
2. If I were to write a patch that allowed either clients or relays to correctly fast-path this 0ms case to insert bursts of back-to-back packets without the circuit failure, would that help?

For 1, I would guess it's mostly due to how people collect traces and do evaluation, but as is the case for 2, I don't really know. As discussed for the simulator, I think the first order of business is to get reasonably efficient machines that can defend against deep learning attacks without time and then take it from there.

In his writeup, Nate documented a crash when using the always-schedule workaround: https://github.com/notem/tor-rbp-padding-machine-doc (Issues section).

Trac:
Milestone: Tor: 0.4.2.x-final to Tor: 0.4.3.x-final

All 0.4.3.x tickets without 043-must, 043-should, or 043-can are about to be deferred.

Trac:
Keywords: wtf-pad circpad-researchers-want 042-should deleted, 042-should, 043-deferred, wtf-pad, circpad-researchers-want added

Trac:
Milestone: Tor: 0.4.3.x-final to Tor: 0.4.4.x-final

Bulk-remove tickets from 0.4.4. Add the 044-deferred label to them.

Trac:
Keywords: N/A deleted, 044-deferred added
Milestone: Tor: 0.4.4.x-final to Tor: unspecified

moved to tpo/core/tor#31653
