Surprise race: Intro point closes circuit after NACK, at the same time as client tries to extend circuit to new intro point

changed milestone to %Tor: 0.3.5.x-final

added 035-backport 040-backport 041-backport component::core tor/tor dos milestone::Tor: 0.3.5.x-final owner::neel parent::25882 priority::medium resolution::fixed reviewer::dgoulet severity::normal sponsor::27 status::closed tor-hs type::defect labels

Trac:
Owner: N/A to neel
Status: new to assigned
Cc: N/A to neel

Should I close the intro circuit once INTRODUCE2 cells have been parsed and processed? If so, is hs_circ_handle_introduce2() a good place to close the intro circuits?

Never mind Comment 2. After reading rend-spec-v2.txt in torspec, I think it should be done in hs_service_receive_intro_established() instead. Is this correct?

UPDATE: To be more descriptive: should I call circuit_mark_for_close() in hs_service_receive_intro_established() on a successful call and not just on a failure?

Hey neel, currently stuck in post-meeting backlog. Will try to get to this late next week. Thanks!

Hey neel. I think that's not right. We want the service to keep its intro circuits open so that it receives multiple introductions through them.

On the other hand, on the client-side, after the client completes the introduction with the service, we want the client to close the client-side part of the introduction circuit.

So in C <-> IP <-> HS we want the link from C to IP to close after the introduction has been completed. Good luck! :)

Thanks for the feedback.

I looked at handle_introduce_ack_success() and found this code:

 end:
  /* We don't need the intro circuit anymore. It did what it had to do! */
  circuit_change_purpose(TO_CIRCUIT(intro_circ),
                         CIRCUIT_PURPOSE_C_INTRODUCE_ACKED);
  circuit_mark_for_close(TO_CIRCUIT(intro_circ), END_CIRC_REASON_FINISHED);

  /* XXX: Close pending intro circuits we might have in parallel. */
  return;

Isn't this already being done after the first comment above (or in handle_introduce_ack_success())?

Yes client side we are fine. I think this ticket was more on the introduction point side?

Now I just realized something that is maybe bad in v3 (not in v2).

See handle_introduce1() (hs_intropoint.c). Notice at the end that we only close the circuit if we send a NACK but not a ACK. Actually, it should be the opposite! The reason is that if you ACK, then the client will close that circuits so instead of waiting for another round trip for the DESTROY cell, the IP can just send it after the ACK and the client will likely close it much faster.

Now, why we shouldn't close with a NACK? Because, in case of a NACK, the client will use the same circuit to re-extend to a new IP. If the current IP is closing the circuits, that re-extend is most likely failing... So the whole "reextend on NACK" optimization is rendered useless by closing the circuit on NACK on the intro side.

To summarize (all of this intro point side):

Close circuit on ACK
Keep circuit on NACK.

Thoughts?

Trac:
Status: assigned to needs_information
Cc: neel to neel, asn

Replying to dgoulet:

Yes client side we are fine. I think this ticket was more on the introduction point side?

Now I just realized something that is maybe bad in v3 (not in v2).

See handle_introduce1() (hs_intropoint.c). Notice at the end that we only close the circuit if we send a NACK but not a ACK. Actually, it should be the opposite! The reason is that if you ACK, then the client will close that circuits so instead of waiting for another round trip for the DESTROY cell, the IP can just send it after the ACK and the client will likely close it much faster.

Now, why we shouldn't close with a NACK? Because, in case of a NACK, the client will use the same circuit to re-extend to a new IP. If the current IP is closing the circuits, that re-extend is most likely failing... So the whole "reextend on NACK" optimization is rendered useless by closing the circuit on NACK on the intro side.

To summarize (all of this intro point side):

Close circuit on ACK

Keep circuit on NACK.

Thoughts?

Hm, that's interesting.

I think your suggestion makes sense!

However, I think it would be great after we write this patch, we also test that this "reextend on NACK" optimization works as intended.

Trac:
Status: needs_information to new

Trac:
Status: new to assigned

PR is here: https://github.com/torproject/tor/pull/426

Trac:
Milestone: Tor: unspecified to Tor: 0.3.6.x-final

Trac:
Status: assigned to needs_review

Trac:
Reviewer: N/A to dgoulet

So this is a fix that probably applies back to 0.3.2.1-alpha I believe. Meaning the changes file is wrong and the branch should be based on maint-0.3.3 because I realized we are most likely to backport this fix since the re-extend optimization is busted.

Now, on another note, I've discuss it with arma further on IRC and maybe actively closing the circuit is not ideal as in "we want to let the client decide what to do".

Thus, it would simplify the patch, we just need to NOT close the circuit on NACK and that is it. We'll backport that fix.

Sorry neel for the confusion. Thanks!

Trac:
Status: needs_review to needs_revision
Keywords: tor-hs dos deleted, 034-backport, tor-hs dos 033-backport added

A PR based on maint-3.3 which includes your suggestions is here: https://github.com/torproject/tor/pull/487

If you want it based on master, tell me and I can base the patch off that instead.

Tor 0.3.6.x has been renamed to 0.4.0.x.

Trac:
Milestone: Tor: 0.3.6.x-final to Tor: 0.4.0.x-final

Trac:
Status: needs_revision to needs_review

Big thanks neel!

ACK.

Trac:
Status: needs_review to merge_ready

LGTM!

Merged into 035 and master.

Trac:
Milestone: Tor: 0.4.0.x-final to Tor: 0.3.4.x-final

Going to suggest "no backport" here. Any objections?

Replying to nickm:

Going to suggest "no backport" here. Any objections?

Agreed.

Also agreed, no need to backport before 0.3.5.

This is a pretty important fix for v3 onion services, because in 0.3.4.x, if you try an intro point and it has rotated away, then simultaneously (a) you will try to extend that intro circuit to the next intro point, and (b) the first one will start tearing down the circuit. I could easily believe that the client doesn't handle it well and waits for something to time out before launching a new one.

I'm going to retitle the ticket to be what we actually did. And I'm going to go back in time and change this entry in the ChangeLog to be a major bugfix, so it will be noted that way in the 0.3.5 ReleaseNotes.

Trac:
Keywords: 034-backport, tor-hs dos 033-backport deleted, tor-hs dos added
Milestone: Tor: 0.3.4.x-final to Tor: 0.3.5.x-final
Summary: Close intro circuit after introduction has been completed to Surprise race: Intro point closes circuit after NACK, at the same time as client tries to extend circuit to new intro point

No backport then we are good here.

Trac:
Status: merge_ready to closed
Resolution: N/A to fixed

Trac:
Sponsor: N/A to Sponsor27

I would like us to strongly reconsider the backport this back down to 035. Reason is that it is really badly affecting tor clients and thus HS reachability. Here is how/why:

(The following considers that every time the client reaches the intro point, it gets NACKed because it has the old descriptor.)

The obvious issue is that tor clients gets the intro circuit destroyed while it is trying to re-extend to a new IP. This itself, requires the client to do many round trips before noticing and then re-opening a new circuit to see the same again until all 3 IP have failed.
This one is a more serious issue.

I've experienced during my testing a client looping over all IPs trying to establish an intro point but instead getting its circuit TRUNCATED for internal reason just after sending the INTRODUCE1 cell. This is not seen as an "intro failure" by the client so the intro point will be retried.

However, how can we get a TRUNCATED before the INTRODUCE_ACK nack-ing our request? This behavior I've seen a lot where a client can make 20-30 tries before it finally gets a NACK.

The reason I believe is in our cell scheduler: When selecting an active channel, we ask the cmux subsystem to give the first active circuit queue, this is done in circuitmux_get_first_active_circuit(). But, we alternate between DESTROY cell queue and the RELAY cell queue.

When the intro point sends a NACK, it first queues the INTRODUCE_ACK cell and then, because of this bug still everywhere in the network, it queues a DESTROY just after. Then our scheduler, at that point in time, decides to send the DESTROY before the ack resulting in our client receiving a truncated cell, not noticing the NACK and thus retrying the same intro point after.

If no DESTROY cells were sent on the channel cmux yet, then it is prioritized from the relay cell. So, it is not actually 1/2 chance of hitting this, I believe it is much high probability to hit the issue I just described especially on smaller relays.

Considering the above, I'm strongly asking for 035, 040 and 041 backport.

Trac:
Status: closed to reopened
Resolution: fixed to N/A
Keywords: tor-hs dos deleted, tor-hs dos 035-backport 040-backport 041-backport added

Merged into 035 and master. Wasn't that you?

Replying to cypherpunks:

Merged into 035 and master. Wasn't that you?

It was indeed... hmm it appears the "no backport" discussion was about <= 034 ... And they are EOL... damn.

Trac:
Status: reopened to closed
Resolution: N/A to fixed

I remember that yesterday you were showing me logs that involved an intropoint on a relay that still had 0.3.4. Maybe it's time to tell the authorities to stop listing these?

Replying to nickm:

I remember that yesterday you were showing me logs that involved an intropoint on a relay that still had 0.3.4. Maybe it's time to tell the authorities to stop listing these?

That is how I found this issue, by running intro points on 0.3.4.9 ... so either that or we patch clients to stop selecting them.

Is there a flag we can stop giving them?

Replying to nickm:

Is there a flag we can stop giving them?

We would need to remove HSIntro=4 basically from the 032, 033 and 034 relays so the service will go in legacy mode and this bug is not in the legacy code.

Not sure the authorities can do that :S

It's possible, but would require a new consensus method.

I opened #31549 (moved) for just removing old tors from the directory at last.

Replying to dgoulet:

Replying to nickm:

Is there a flag we can stop giving them?

We would need to remove HSIntro=4 basically from the 032, 033 and 034 relays so the service will go in legacy mode and this bug is not in the legacy code.

We could make working versions declare a new HSIntro version, and then make clients require that version.

When we fix high-impact bugs in future, we should consider adding a new protocol version, and making clients require it. That's what protocol versions are for,

Trac:
Parent: N/A to #25882 (moved)

Related #31964 (moved)

closed

mentioned in issue #31549 (moved)

mentioned in issue #31964 (moved)

moved to tpo/core/tor#27841 (closed)

mentioned in issue tpo/core/tor#31549 (closed)

mentioned in issue tpo/core/tor#31964 (closed)

Surprise race: Intro point closes circuit after NACK, at the same time as client tries to extend circuit to new intro point

Child items ...

Activity