Should I close the intro circuit once INTRODUCE2 cells have been parsed and processed? If so, is hs_circ_handle_introduce2() a good place to close the intro circuits?
Never mind Comment 2. After reading rend-spec-v2.txt in torspec, I think it should be done in hs_service_receive_intro_established() instead. Is this correct?
UPDATE: To be more descriptive: should I call circuit_mark_for_close() in hs_service_receive_intro_established() on a successful call and not just on a failure?
Hey neel. I think that's not right. We want the service to keep its intro circuits open so that it receives multiple introductions through them.
On the other hand, on the client-side, after the client completes the introduction with the service, we want the client to close the client-side part of the introduction circuit.
So in C <-> IP <-> HS we want the link from C to IP to close after the introduction has been completed. Good luck! :)
I looked at handle_introduce_ack_success() and found this code:
end: /* We don't need the intro circuit anymore. It did what it had to do! */ circuit_change_purpose(TO_CIRCUIT(intro_circ), CIRCUIT_PURPOSE_C_INTRODUCE_ACKED); circuit_mark_for_close(TO_CIRCUIT(intro_circ), END_CIRC_REASON_FINISHED); /* XXX: Close pending intro circuits we might have in parallel. */ return;
Isn't this already being done after the first comment above (or in handle_introduce_ack_success())?
Yes client side we are fine. I think this ticket was more on the introduction point side?
Now I just realized something that is maybe bad in v3 (not in v2).
See handle_introduce1() (hs_intropoint.c). Notice at the end that we only close the circuit if we send a NACK but not a ACK. Actually, it should be the opposite! The reason is that if you ACK, then the client will close that circuits so instead of waiting for another round trip for the DESTROY cell, the IP can just send it after the ACK and the client will likely close it much faster.
Now, why we shouldn't close with a NACK? Because, in case of a NACK, the client will use the same circuit to re-extend to a new IP. If the current IP is closing the circuits, that re-extend is most likely failing... So the whole "reextend on NACK" optimization is rendered useless by closing the circuit on NACK on the intro side.
To summarize (all of this intro point side):
Close circuit on ACK
Keep circuit on NACK.
Thoughts?
Trac: Status: assigned to needs_information Cc: neel to neel, asn
Yes client side we are fine. I think this ticket was more on the introduction point side?
Now I just realized something that is maybe bad in v3 (not in v2).
See handle_introduce1() (hs_intropoint.c). Notice at the end that we only close the circuit if we send a NACK but not a ACK. Actually, it should be the opposite! The reason is that if you ACK, then the client will close that circuits so instead of waiting for another round trip for the DESTROY cell, the IP can just send it after the ACK and the client will likely close it much faster.
Now, why we shouldn't close with a NACK? Because, in case of a NACK, the client will use the same circuit to re-extend to a new IP. If the current IP is closing the circuits, that re-extend is most likely failing... So the whole "reextend on NACK" optimization is rendered useless by closing the circuit on NACK on the intro side.
To summarize (all of this intro point side):
Close circuit on ACK
Keep circuit on NACK.
Thoughts?
Hm, that's interesting.
I think your suggestion makes sense!
However, I think it would be great after we write this patch, we also test that this "reextend on NACK" optimization works as intended.
So this is a fix that probably applies back to 0.3.2.1-alpha I believe. Meaning the changes file is wrong and the branch should be based on maint-0.3.3 because I realized we are most likely to backport this fix since the re-extend optimization is busted.
Now, on another note, I've discuss it with arma further on IRC and maybe actively closing the circuit is not ideal as in "we want to let the client decide what to do".
Thus, it would simplify the patch, we just need to NOT close the circuit on NACK and that is it. We'll backport that fix.
Sorry neel for the confusion. Thanks!
Trac: Status: needs_review to needs_revision Keywords: tor-hs dos deleted, 034-backport, tor-hs dos 033-backport added
This is a pretty important fix for v3 onion services, because in 0.3.4.x, if you try an intro point and it has rotated away, then simultaneously (a) you will try to extend that intro circuit to the next intro point, and (b) the first one will start tearing down the circuit. I could easily believe that the client doesn't handle it well and waits for something to time out before launching a new one.
I'm going to retitle the ticket to be what we actually did. And I'm going to go back in time and change this entry in the ChangeLog to be a major bugfix, so it will be noted that way in the 0.3.5 ReleaseNotes.
Trac: Keywords: 034-backport, tor-hs dos 033-backport deleted, tor-hs dos added Milestone: Tor: 0.3.4.x-final to Tor: 0.3.5.x-final Summary: Close intro circuit after introduction has been completed to Surprise race: Intro point closes circuit after NACK, at the same time as client tries to extend circuit to new intro point
I would like us to strongly reconsider the backport this back down to 035. Reason is that it is really badly affecting tor clients and thus HS reachability. Here is how/why:
(The following considers that every time the client reaches the intro point, it gets NACKed because it has the old descriptor.)
The obvious issue is that tor clients gets the intro circuit destroyed while it is trying to re-extend to a new IP. This itself, requires the client to do many round trips before noticing and then re-opening a new circuit to see the same again until all 3 IP have failed.
This one is a more serious issue.
I've experienced during my testing a client looping over all IPs trying to establish an intro point but instead getting its circuit TRUNCATED for internal reason just after sending the INTRODUCE1 cell. This is not seen as an "intro failure" by the client so the intro point will be retried.
However, how can we get a TRUNCATEDbefore the INTRODUCE_ACK nack-ing our request? This behavior I've seen a lot where a client can make 20-30 tries before it finally gets a NACK.
The reason I believe is in our cell scheduler: When selecting an active channel, we ask the cmux subsystem to give the first active circuit queue, this is done in circuitmux_get_first_active_circuit(). But, we alternate between DESTROY cell queue and the RELAY cell queue.
When the intro point sends a NACK, it first queues the INTRODUCE_ACK cell and then, because of this bug still everywhere in the network, it queues a DESTROY just after. Then our scheduler, at that point in time, decides to send the DESTROYbefore the ack resulting in our client receiving a truncated cell, not noticing the NACK and thus retrying the same intro point after.
If no DESTROY cells were sent on the channel cmux yet, then it is prioritized from the relay cell. So, it is not actually 1/2 chance of hitting this, I believe it is much high probability to hit the issue I just described especially on smaller relays.
Considering the above, I'm strongly asking for 035, 040 and 041 backport.
Trac: Status: closed to reopened Resolution: fixed toN/A Keywords: tor-hs dos deleted, tor-hs dos 035-backport 040-backport 041-backport added
I remember that yesterday you were showing me logs that involved an intropoint on a relay that still had 0.3.4. Maybe it's time to tell the authorities to stop listing these?
I remember that yesterday you were showing me logs that involved an intropoint on a relay that still had 0.3.4. Maybe it's time to tell the authorities to stop listing these?
That is how I found this issue, by running intro points on 0.3.4.9 ... so either that or we patch clients to stop selecting them.
We would need to remove HSIntro=4 basically from the 032, 033 and 034 relays so the service will go in legacy mode and this bug is not in the legacy code.
We would need to remove HSIntro=4 basically from the 032, 033 and 034 relays so the service will go in legacy mode and this bug is not in the legacy code.
We could make working versions declare a new HSIntro version, and then make clients require that version.
When we fix high-impact bugs in future, we should consider adding a new protocol version, and making clients require it. That's what protocol versions are for,