Opened 6 months ago

Last modified 6 weeks ago

#30992 assigned defect

circpadding machines have shutdown sync issues (with intro circ NACKs and other cases)

Reported by: asn Owned by: mikeperry
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.4.1.1-alpha
Severity: Normal Keywords: wtf-pad circpad-researchers-maybe-want
Cc: dgoulet, mikeperry Actual Points:
Parent ID: Points: 3
Reviewer: Sponsor:

Description

When client-side intro circuits get NACKed they re-extend to a different intro point using the same circuit.

This seems to mess up the circsetup machines and we get "padding came from wrong hop" log warns.

Child Tickets

TicketStatusOwnerSummaryComponent
#32140assignedasnsingle onion: Padding negotiated cell from wrong hopCore Tor/Tor

Attachments (2)

log.tgz (695.3 KB) - added by dgoulet 6 months ago.
debug.log.tgz (2.5 MB) - added by dgoulet 5 months ago.

Change History (20)

comment:1 Changed 6 months ago by mikeperry

Very odd.. not sure why this would change the hop where the padding comes from...

Is there an easy way/test service to reproduce this with? Or do you just have to wait for it to happen?

comment:2 Changed 6 months ago by asn

I think dgoulet got this easily when he was testing the HS DoS. David, you got any logs?

comment:3 in reply to:  2 Changed 6 months ago by dgoulet

Replying to asn:

I think dgoulet got this easily when he was testing the HS DoS. David, you got any logs?

Yes I got that not "often" but enough to notice during my HS DoS testing for #15516.

Attached is an info.log that I managed to reproduce. I can maybe get debug logs if needed but might take me some times since my experiment fill the disk very fast with debug logs :S.

Changed 6 months ago by dgoulet

Attachment: log.tgz added

comment:4 Changed 5 months ago by asn

Thanks for the logs.

Can't quite figure it out. Maybe it has something to do with the mark for close right above? But these logs seem so high-velocity that it's not easy to correlate things.

comment:5 Changed 5 months ago by mikeperry

Here's a branch with some additional logs. I suspect this *might* actually be caused by us trying to negotiate twice in a row with Padding=1 0.4.0 relays.. But we should not be doing that, either.

https://github.com/torproject/tor/pull/1191

Changed 5 months ago by dgoulet

Attachment: debug.log.tgz added

comment:6 Changed 5 months ago by dgoulet

Debug logs client side using mikeperry-tor:bug30992-logs branch attached. It takes few second until I can get that on the DoS testbed. So I can reproduce easily.

debug.log is 34M.

comment:7 Changed 4 months ago by mikeperry

Ok let me try to summarize things here. There are two bugs. The first is that 0.4.0.x middle nodes are listing padding support, which results in the log:

Jul 31 14:25:29.971 [warn] Middle node did not accept our padding request on circuit 7 (7)

I filed #31356 for this.

The second bug *also* has that warn, and *also* to 0.4.0.x relays not supporting padding. This is what happens:

  1. We try to negotiate padding with a 0.4.0 relay that doesn't actually support it
  2. Before the relay can tell us "no", the introduction point fails, and the purpose changes from CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT back to CIRCUIT_PURPOSE_C_INTRODUCING.
  3. Because the client intro padding machines do not support CIRCUIT_PURPOSE_C_INTRODUCING purpose circuits, they shut themselves down, sending a STOP command to the 0.4.0 relay
  4. Then, the original padding reponse from step 1 above finally comes back, as an error response (since the 0.4.0.x relay did not support padding).
  5. Because we got an error response back, we tear down our all of our padding machine state from the circuit.
  6. Then, the response to the STOP finally comes back, but there is no padding machine anymore, so the padding is from the "wrong hop", causing this warn:
    Jul 31 14:25:30.467 [warn] Padding negotiated cell from wrong hop on circuit 406
    

Fixing #31356 will stop this chain of events from happening.. But because this cell pattern is different, we're not properly concealing these failed intro point requests by our circuit padding, because we negotiate ourselves on and off, and because there are extra intro cells and extends here..

Worse than this, if we fix #31356, the chain of events will become:

  1. We try to negotiate padding with a 0.4.1.x relay
  2. Before the relay can tell us "yes", the introduction point fails, and the purpose changes to CIRCUIT_PURPOSE_C_INTRODUCING.
  3. The client padding machines sends the STOP command
  4. Then the original padding negotiate response comes back ok. We accept it, but do nothing.
  5. Before the STOP response comes back, the introduction point switches back to CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT, causing us to try to negotiate padding again.
  6. The STOP response comes back, causing us to free all machine state.
  7. The negotiated OK response comes back, again from the "wrong hop".

I didn't actually see the 0.4.1 version of the sequence in the logs, probably because there are not enough 0.4.1 relays yet. but it can happen.

One fix is to add a check in circpad_handle_padding_negotiated() to ignore STOP commands if we're not expecting them.. But that removes the ability for relay-side circuit padding machines to shut down the client ones. Hrmm..

comment:8 Changed 4 months ago by gaba

Owner: set to mikeperry
Status: newassigned

comment:9 Changed 4 months ago by mikeperry

Just double-checked all instances in the logs so far, and circuit 188 in dgoulet's debug logs has the "wrong hop" log message after the purpose change from CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT to CIRCUIT_PURPOSE_C_INTRODUCING, and it *does not* change back to CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT, but it still has the message. This one is a real mystery..

comment:10 in reply to:  9 Changed 4 months ago by mikeperry

Replying to mikeperry:

Just double-checked all instances in the logs so far, and circuit 188 in dgoulet's debug logs has the "wrong hop" log message after the purpose change from CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT to CIRCUIT_PURPOSE_C_INTRODUCING, and it *does not* change back to CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT, but it still has the message. This one is a real mystery..

Ok, in this case, the sequence is:

  1. Client tries to negotiate padding with a 0.4.1 relay
  2. The negotiation comes back ok; padding commences normally
  3. The relay side finishes padding (sends 10 cells or so) and sends a STOP command
  4. The intropoint fails, causing the circuit purpose to change from CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT to CIRCUIT_PURPOSE_C_INTRODUCING
  5. Client padding machine sends a STOP command
  6. The relay gets the STOP command, responds with STOP,ERROR (because it had no machines at this point)
  7. Client gets the relay's STOP command and tear down.
  8. Client gets the relay's STOP,ERROR response to our stop command, and print out the "wrong hop" message (ie since we have no padding machines, any hop source is the wrong hop).

There's probably quite a few variations on the above..

comment:11 Changed 4 months ago by mikeperry

I am most concerned about the second case in comment:7. The other two cases are degenerate and the machines were shutting down anyway.

Possible fixes include:

  1. Add a sequence number, packet number, or padding machine instantiation counter to the commands, so we can appropriately match the most recent ones. This requires a new padding protocol version.
  2. Instead of a protocol version bump, we can use the field's absence (aka "0" value) to mean "yolo, accept it always" because this bug is not that severe.
  3. Not start up a machine until the STOP response comes back from the previous one.

I don't especially like 3. The optimization that got us into this mess is supposed to let us immediately replace one machine with another depending on circuit conditions. That should include tearing the same machine down and spinning it back up in succession.

I think my preference is for 2, but maybe a protover bump is cleaner anyway, since we already have to mess with it for #31356, and bumping it to 2 would avoid the need for any 0.4.0 patching.

Any other ideas?

Last edited 4 months ago by mikeperry (previous) (diff)

comment:12 in reply to:  11 ; Changed 4 months ago by asn

Replying to mikeperry:

I am most concerned about the second case in comment:7. The other two cases are degenerate and the machines were shutting down anyway.

Thanks for the investigation here. That was a good dig up of bugs.

Possible fixes include:

  1. Add a sequence number, packet number, or padding machine instantiation counter to the commands, so we can appropriately match the most recent ones. This requires a new padding protocol version.
  2. Instead of a protocol version bump, we can use the field's absence (aka "0" value) to mean "yolo, accept it always" because this bug is not that severe.
  3. Not start up a machine until the STOP response comes back from the previous one.

I don't especially like 3. The optimization that got us into this mess is supposed to let us immediately replace one machine with another depending on circuit conditions. That should include tearing the same machine down and spinning it back up in succession.

I admit I don't understand exactly our "machine replace" ideal behavior and what goals we have for it. Can you please expand on the above and what other goals we have here?

I think my preference is for 2, but maybe a protover bump is cleaner anyway, since we already have to mess with it for #31356, and bumping it to 2 would avoid the need for any 0.4.0 patching.

I agree that '2' seems like a plausible solution here. Are the PADDING_NEGOTIATE(D) cells extensible enough to add such a thing? Or how do we go about it?

Also, can you explain how the "machine counter" logic of '2' would work with multiple desynched START/STOP commands flying in the air like in comment:7 and comment:10? Would we ignore any START/STOP commands that don't correspond to our latest state? And how would that work for the issue in comment:10? Could that create more issues?

comment:13 Changed 4 months ago by mikeperry

Keywords: 042-proposed added; 041-should removed
Milestone: Tor: 0.4.1.x-finalTor: unspecified
Points: 0.43
Summary: circpadding: Circsetup machines give out warnings when client-side intro gets NACKedcircpadding machines have shutdown sync issues (with intro circ NACKs and other cases)

I am taking this out of 0.4.1 because none of the above cases are catastrophic, and fixing them with a sequence number fix will still have cell count issues that cause these circuits to be uniquely recognizable (due to the extra padding negotiation *and* cannibalization onionskin cells each time there is a purpose change due to a NACK). We need both the sequence number fix and padding machine changes to make these cannibalized NACKed intro circuits not stand out. :/

comment:14 in reply to:  12 Changed 4 months ago by mikeperry

Replying to asn:

Replying to mikeperry:
I admit I don't understand exactly our "machine replace" ideal behavior and what goals we have for it. Can you please expand on the above and what other goals we have here?

The machine replacement is meant to allow us to use the machine conditions to apply different padding machines to different phases of circuit lifetime.

For example, our current padding machines only deal with circuit setup. We may want to add future padding machines that apply to REND circuits that are built and have streams on them. To do this, we would specify the additional circuit state condition CIRCPAD_HAS_STREAMS on the machines that add padding to rend circuits that already have streams on them. These machines would not apply during the setup phase, but would apply after we have padded the rend and it has been joined. So our current machines would shut down, and these would start up.

Another example might be something like "We only want to pad on circuits that have active streams on them to port 22, to try to make SSH circuits look more like web circuits". For these, we would implement #29083 as a machine condition. Then, machines could spin up and shut down on a circuit only while it has port 22 streams on it (which could be opened and closed repeatedly).

I think my preference is for 2, but maybe a protover bump is cleaner anyway, since we already have to mess with it for #31356, and bumping it to 2 would avoid the need for any 0.4.0 patching.

I agree that '2' seems like a plausible solution here. Are the PADDING_NEGOTIATE(D) cells extensible enough to add such a thing? Or how do we go about it?

Yes. The circpad negotiation cells have version fields, so we can add extra fields for them that are only checked if the version value is high enough for them to be present.

Also, can you explain how the "machine counter" logic of '2' would work with multiple desynched START/STOP commands flying in the air like in comment:7 and comment:10? Would we ignore any START/STOP commands that don't correspond to our latest state? And how would that work for the issue in comment:10? Could that create more issues?

Yes, we would ignore any commands that did not correspond to a machine counter for our most recent machine. This will solve the comment:7 sequence. For the comment:10 sequence, the key implementation detail is that we have to update the "current machine counter" upon shutdown, so that after we tear our machines down, all previous in-flight stops, etc will still be recognized as from the previous machine.

A related question is if we should treat these cells as "dropped" cells for side channel defenses like vanguards... I think the right answer here is to still retain the hop number of the previous machine, as well, so we can verify that these invalid commands are not injected from say, the exit node as a side channel vector.

Anyway, that is my current thinking.. The side channel piece of this might be solved with another approach, though..

All of this complication is why it's smartest just to let this be for now, especially since fixing it for 0.4.1 won't change any of the fingerprintability of these circuits..

comment:15 Changed 3 months ago by gaba

Owner: mikeperry deleted

comment:16 Changed 3 months ago by mikeperry

Keywords: circpad-researchers-maybe-want added; circpad 042-proposed removed

comment:17 Changed 7 weeks ago by asn

This bug seems to echo in chutney as well (#32140)

comment:18 Changed 6 weeks ago by gaba

Owner: set to mikeperry
Note: See TracTickets for help on using tickets.