I think dgoulet got this easily when he was testing the HS DoS. David, you got any logs?
Yes I got that not "often" but enough to notice during my HS DoS testing for #15516 (moved).
Attached is an info.log that I managed to reproduce. I can maybe get debug logs if needed but might take me some times since my experiment fill the disk very fast with debug logs :S.
Can't quite figure it out. Maybe it has something to do with the mark for close right above? But these logs seem so high-velocity that it's not easy to correlate things.
Here's a branch with some additional logs. I suspect this might actually be caused by us trying to negotiate twice in a row with Padding=1 0.4.0 relays.. But we should not be doing that, either.
Debug logs client side using mikeperry-tor:bug30992-logs branch attached. It takes few second until I can get that on the DoS testbed. So I can reproduce easily.
Ok let me try to summarize things here. There are two bugs. The first is that 0.4.0.x middle nodes are listing padding support, which results in the log:
Jul 31 14:25:29.971 [warn] Middle node did not accept our padding request on circuit 7 (7)
The second bug also has that warn, and also to 0.4.0.x relays not supporting padding. This is what happens:
We try to negotiate padding with a 0.4.0 relay that doesn't actually support it
Before the relay can tell us "no", the introduction point fails, and the purpose changes from CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT back to CIRCUIT_PURPOSE_C_INTRODUCING.
Because the client intro padding machines do not support CIRCUIT_PURPOSE_C_INTRODUCING purpose circuits, they shut themselves down, sending a STOP command to the 0.4.0 relay
Then, the original padding reponse from step 1 above finally comes back, as an error response (since the 0.4.0.x relay did not support padding).
Because we got an error response back, we tear down our all of our padding machine state from the circuit.
Then, the response to the STOP finally comes back, but there is no padding machine anymore, so the padding is from the "wrong hop", causing this warn:
Jul 31 14:25:30.467 [warn] Padding negotiated cell from wrong hop on circuit 406
Fixing #31356 (moved) will stop this chain of events from happening.. But because this cell pattern is different, we're not properly concealing these failed intro point requests by our circuit padding, because we negotiate ourselves on and off, and because there are extra intro cells and extends here..
Worse than this, if we fix #31356 (moved), the chain of events will become:
We try to negotiate padding with a 0.4.1.x relay
Before the relay can tell us "yes", the introduction point fails, and the purpose changes to CIRCUIT_PURPOSE_C_INTRODUCING.
The client padding machines sends the STOP command
Then the original padding negotiate response comes back ok. We accept it, but do nothing.
Before the STOP response comes back, the introduction point switches back to CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT, causing us to try to negotiate padding again.
The STOP response comes back, causing us to free all machine state.
The negotiated OK response comes back, again from the "wrong hop".
I didn't actually see the 0.4.1 version of the sequence in the logs, probably because there are not enough 0.4.1 relays yet. but it can happen.
One fix is to add a check in circpad_handle_padding_negotiated() to ignore STOP commands if we're not expecting them.. But that removes the ability for relay-side circuit padding machines to shut down the client ones. Hrmm..
Just double-checked all instances in the logs so far, and circuit 188 in dgoulet's debug logs has the "wrong hop" log message after the purpose change from CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT to CIRCUIT_PURPOSE_C_INTRODUCING, and it does not change back to CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT, but it still has the message. This one is a real mystery..
Just double-checked all instances in the logs so far, and circuit 188 in dgoulet's debug logs has the "wrong hop" log message after the purpose change from CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT to CIRCUIT_PURPOSE_C_INTRODUCING, and it does not change back to CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT, but it still has the message. This one is a real mystery..
Ok, in this case, the sequence is:
Client tries to negotiate padding with a 0.4.1 relay
The negotiation comes back ok; padding commences normally
The relay side finishes padding (sends 10 cells or so) and sends a STOP command
The intropoint fails, causing the circuit purpose to change from CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT to CIRCUIT_PURPOSE_C_INTRODUCING
Client padding machine sends a STOP command
The relay gets the STOP command, responds with STOP,ERROR (because it had no machines at this point)
Client gets the relay's STOP command and tear down.
Client gets the relay's STOP,ERROR response to our stop command, and print out the "wrong hop" message (ie since we have no padding machines, any hop source is the wrong hop).
There's probably quite a few variations on the above..
I am most concerned about the second case in comment:7. The other two cases are degenerate and the machines were shutting down anyway.
Possible fixes include:
Add a sequence number, packet number, or padding machine instantiation counter to the commands, so we can appropriately match the most recent ones. This requires a new padding protocol version.
Instead of a protocol version bump, we can use the field's absence (aka "0" value) to mean "yolo, accept it always" because this bug is not that severe.
Not start up a machine until the STOP response comes back from the previous one.
I don't especially like 3. The optimization that got us into this mess is supposed to let us immediately replace one machine with another depending on circuit conditions. That should include tearing the same machine down and spinning it back up in succession.
I think my preference is for 2, but maybe a protover bump is cleaner anyway, since we already have to mess with it for #31356 (moved), and bumping it to 2 would avoid the need for any 0.4.0 patching.
I am most concerned about the second case in comment:7. The other two cases are degenerate and the machines were shutting down anyway.
Thanks for the investigation here. That was a good dig up of bugs.
Possible fixes include:
Add a sequence number, packet number, or padding machine instantiation counter to the commands, so we can appropriately match the most recent ones. This requires a new padding protocol version.
Instead of a protocol version bump, we can use the field's absence (aka "0" value) to mean "yolo, accept it always" because this bug is not that severe.
Not start up a machine until the STOP response comes back from the previous one.
I don't especially like 3. The optimization that got us into this mess is supposed to let us immediately replace one machine with another depending on circuit conditions. That should include tearing the same machine down and spinning it back up in succession.
I admit I don't understand exactly our "machine replace" ideal behavior and what goals we have for it. Can you please expand on the above and what other goals we have here?
I think my preference is for 2, but maybe a protover bump is cleaner anyway, since we already have to mess with it for #31356 (moved), and bumping it to 2 would avoid the need for any 0.4.0 patching.
I agree that '2' seems like a plausible solution here. Are the PADDING_NEGOTIATE(D) cells extensible enough to add such a thing? Or how do we go about it?
Also, can you explain how the "machine counter" logic of '2' would work with multiple desynched START/STOP commands flying in the air like in comment:7 and comment:10? Would we ignore any START/STOP commands that don't correspond to our latest state? And how would that work for the issue in comment:10? Could that create more issues?
I am taking this out of 0.4.1 because none of the above cases are catastrophic, and fixing them with a sequence number fix will still have cell count issues that cause these circuits to be uniquely recognizable (due to the extra padding negotiation and cannibalization onionskin cells each time there is a purpose change due to a NACK). We need both the sequence number fix and padding machine changes to make these cannibalized NACKed intro circuits not stand out. :/
Trac: Keywords: wtf-pad circpad 041-should deleted, wtf-pad circpad 042-proposed added Summary: circpadding: Circsetup machines give out warnings when client-side intro gets NACKed to circpadding machines have shutdown sync issues (with intro circ NACKs and other cases) Points: 0.4 to 3 Milestone: Tor: 0.4.1.x-final to Tor: unspecified
Replying to mikeperry:
I admit I don't understand exactly our "machine replace" ideal behavior and what goals we have for it. Can you please expand on the above and what other goals we have here?
The machine replacement is meant to allow us to use the machine conditions to apply different padding machines to different phases of circuit lifetime.
For example, our current padding machines only deal with circuit setup. We may want to add future padding machines that apply to REND circuits that are built and have streams on them. To do this, we would specify the additional circuit state condition CIRCPAD_HAS_STREAMS on the machines that add padding to rend circuits that already have streams on them. These machines would not apply during the setup phase, but would apply after we have padded the rend and it has been joined. So our current machines would shut down, and these would start up.
Another example might be something like "We only want to pad on circuits that have active streams on them to port 22, to try to make SSH circuits look more like web circuits". For these, we would implement #29083 (moved) as a machine condition. Then, machines could spin up and shut down on a circuit only while it has port 22 streams on it (which could be opened and closed repeatedly).
I think my preference is for 2, but maybe a protover bump is cleaner anyway, since we already have to mess with it for #31356 (moved), and bumping it to 2 would avoid the need for any 0.4.0 patching.
I agree that '2' seems like a plausible solution here. Are the PADDING_NEGOTIATE(D) cells extensible enough to add such a thing? Or how do we go about it?
Yes. The circpad negotiation cells have version fields, so we can add extra fields for them that are only checked if the version value is high enough for them to be present.
Also, can you explain how the "machine counter" logic of '2' would work with multiple desynched START/STOP commands flying in the air like in comment:7 and comment:10? Would we ignore any START/STOP commands that don't correspond to our latest state? And how would that work for the issue in comment:10? Could that create more issues?
Yes, we would ignore any commands that did not correspond to a machine counter for our most recent machine. This will solve the comment:7 sequence. For the comment:10 sequence, the key implementation detail is that we have to update the "current machine counter" upon shutdown, so that after we tear our machines down, all previous in-flight stops, etc will still be recognized as from the previous machine.
A related question is if we should treat these cells as "dropped" cells for side channel defenses like vanguards... I think the right answer here is to still retain the hop number of the previous machine, as well, so we can verify that these invalid commands are not injected from say, the exit node as a side channel vector.
Anyway, that is my current thinking.. The side channel piece of this might be solved with another approach, though..
All of this complication is why it's smartest just to let this be for now, especially since fixing it for 0.4.1 won't change any of the fingerprintability of these circuits..
Plan from today's network team meeting: we should do what we can to figure out a real solution here and make sure we know what the bug is. That can land in 0.4.3 if it's simple,and 0.4.4 if it's not.
We should land something in 0.4.3 (and backport?) to limit the warnings here.
Thanks for the reminder. #32040 (moved) seems like it could exacerbate the shutdown/restart problem, but I think is a slightly different cause than this.
If we end up needing to bump the padding protocol version, we should fix both issues at once. I will have to spend some time pondering what we can do with the existing protocol. There may also be a subtle way to fix both and still preserve compatibility with relays without bumping padding protocol versions on them. We should consider as many of these kinds of issues in that analysis.
Hello, I tested the branch on chutney and verified that it doesn't emit any padding from wrong hop or any #32044 (moved)-style restarts. I then tested without the branch and verified both behaviors.
I left some code-related comments on the branch. Can you fix those, and then open a PR so that we have CI running on this too?
I am going to break #32040 (moved) off because I think we should clean up the apply vs keep flags for both intro and rend, but the unit tests disagree. So we should do a little more work on that one before merge.