Opened 14 months ago

Closed 3 months ago

Last modified 5 weeks ago

#27841 closed defect (fixed)

Surprise race: Intro point closes circuit after NACK, at the same time as client tries to extend circuit to new intro point

Reported by: asn Owned by: neel
Priority: Medium Milestone: Tor: 0.3.5.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs dos 035-backport 040-backport 041-backport
Cc: neel, asn Actual Points:
Parent ID: #25882 Points:
Reviewer: dgoulet Sponsor: Sponsor27

Description

We are currently leaving inroduction circuits open after intro has been done, even tho we ignore any further introductions on that circuit.

In the future, we can consider closing that circuit to decrease the load on the network.

Child Tickets

Change History (37)

comment:1 Changed 14 months ago by neel

Cc: neel added
Owner: set to neel
Status: newassigned

comment:2 Changed 13 months ago by neel

Should I close the intro circuit once INTRODUCE2 cells have been parsed and processed? If so, is hs_circ_handle_introduce2() a good place to close the intro circuits?

comment:3 Changed 13 months ago by neel

Never mind Comment 2. After reading rend-spec-v2.txt in torspec, I think it should be done in hs_service_receive_intro_established() instead. Is this correct?

UPDATE: To be more descriptive: should I call circuit_mark_for_close() in hs_service_receive_intro_established() on a successful call and not just on a failure?

Last edited 13 months ago by neel (previous) (diff)

comment:4 Changed 13 months ago by asn

Hey neel, currently stuck in post-meeting backlog. Will try to get to this late next week. Thanks!

comment:5 Changed 13 months ago by asn

Hey neel. I think that's not right. We want the service to keep its intro circuits open so that it receives multiple introductions through them.

On the other hand, on the client-side, after the client completes the introduction with the service, we want the client to close the client-side part of the introduction circuit.

So in C <-> IP <-> HS we want the link from C to IP to close after the introduction has been completed. Good luck! :)

comment:6 Changed 13 months ago by neel

Thanks for the feedback.

I looked at handle_introduce_ack_success() and found this code:

 end:
  /* We don't need the intro circuit anymore. It did what it had to do! */
  circuit_change_purpose(TO_CIRCUIT(intro_circ),
                         CIRCUIT_PURPOSE_C_INTRODUCE_ACKED);
  circuit_mark_for_close(TO_CIRCUIT(intro_circ), END_CIRC_REASON_FINISHED);

  /* XXX: Close pending intro circuits we might have in parallel. */
  return;

Isn't this already being done after the first comment above (or in handle_introduce_ack_success())?

comment:7 Changed 13 months ago by dgoulet

Cc: asn added
Status: assignedneeds_information

Yes client side we are fine. I think this ticket was more on the introduction point side?

Now I just realized something that is maybe bad in v3 (not in v2).

See handle_introduce1() (hs_intropoint.c). Notice at the end that we only close the circuit if we send a NACK but not a ACK. Actually, it should be the opposite! The reason is that if you ACK, then the client will close that circuits so instead of waiting for another round trip for the DESTROY cell, the IP can just send it after the ACK and the client will likely close it much faster.

Now, why we shouldn't close with a NACK? Because, in case of a NACK, the client will use the same circuit to re-extend to a new IP. If the current IP is closing the circuits, that re-extend is most likely failing... So the whole "reextend on NACK" optimization is rendered useless by closing the circuit on NACK on the intro side.

To summarize (all of this intro point side):

  • Close circuit on ACK
  • Keep circuit on NACK.

Thoughts?

Last edited 13 months ago by dgoulet (previous) (diff)

comment:8 in reply to:  7 Changed 13 months ago by asn

Status: needs_informationnew

Replying to dgoulet:

Yes client side we are fine. I think this ticket was more on the introduction point side?

Now I just realized something that is maybe bad in v3 (not in v2).

See handle_introduce1() (hs_intropoint.c). Notice at the end that we only close the circuit if we send a NACK but not a ACK. Actually, it should be the opposite! The reason is that if you ACK, then the client will close that circuits so instead of waiting for another round trip for the DESTROY cell, the IP can just send it after the ACK and the client will likely close it much faster.

Now, why we shouldn't close with a NACK? Because, in case of a NACK, the client will use the same circuit to re-extend to a new IP. If the current IP is closing the circuits, that re-extend is most likely failing... So the whole "reextend on NACK" optimization is rendered useless by closing the circuit on NACK on the intro side.

To summarize (all of this intro point side):

  • Close circuit on ACK
  • Keep circuit on NACK.

Thoughts?

Hm, that's interesting.

I think your suggestion makes sense!

However, I think it would be great after we write this patch, we also test that this "reextend on NACK" optimization works as intended.

comment:9 Changed 13 months ago by neel

Status: newassigned

comment:11 Changed 13 months ago by nickm

Milestone: Tor: unspecifiedTor: 0.3.6.x-final

comment:12 Changed 13 months ago by neel

Status: assignedneeds_review

comment:13 Changed 12 months ago by dgoulet

Reviewer: dgoulet

comment:14 Changed 12 months ago by dgoulet

Keywords: 033-backport 034-backport added
Status: needs_reviewneeds_revision

So this is a fix that probably applies back to 0.3.2.1-alpha I believe. Meaning the changes file is wrong and the branch should be based on maint-0.3.3 because I realized we are most likely to backport this fix since the re-extend optimization is busted.

Now, on another note, I've discuss it with arma further on IRC and maybe actively closing the circuit is not ideal as in "we want to let the client decide what to do".

Thus, it would simplify the patch, we just need to NOT close the circuit on NACK and that is it. We'll backport that fix.

Sorry neel for the confusion. Thanks!

comment:15 Changed 12 months ago by neel

A PR based on maint-3.3 which includes your suggestions is here: https://github.com/torproject/tor/pull/487

If you want it based on master, tell me and I can base the patch off that instead.

comment:16 Changed 12 months ago by nickm

Milestone: Tor: 0.3.6.x-finalTor: 0.4.0.x-final

Tor 0.3.6.x has been renamed to 0.4.0.x.

comment:17 Changed 12 months ago by neel

Status: needs_revisionneeds_review

comment:18 Changed 12 months ago by dgoulet

Status: needs_reviewmerge_ready

Big thanks neel!

ACK.

comment:19 Changed 12 months ago by asn

LGTM!

comment:20 Changed 12 months ago by dgoulet

Milestone: Tor: 0.4.0.x-finalTor: 0.3.4.x-final

Merged into 035 and master.

comment:21 Changed 10 months ago by nickm

Going to suggest "no backport" here. Any objections?

comment:22 in reply to:  21 Changed 10 months ago by asn

Replying to nickm:

Going to suggest "no backport" here. Any objections?

Agreed.

comment:23 Changed 10 months ago by arma

Keywords: 033-backport 034-backport removed
Milestone: Tor: 0.3.4.x-finalTor: 0.3.5.x-final
Summary: Close intro circuit after introduction has been completedSurprise race: Intro point closes circuit after NACK, at the same time as client tries to extend circuit to new intro point

Also agreed, no need to backport before 0.3.5.

This is a pretty important fix for v3 onion services, because in 0.3.4.x, if you try an intro point and it has rotated away, then simultaneously (a) you will try to extend that intro circuit to the next intro point, and (b) the first one will start tearing down the circuit. I could easily believe that the client doesn't handle it well and waits for something to time out before launching a new one.

I'm going to retitle the ticket to be what we actually did. And I'm going to go back in time and change this entry in the ChangeLog to be a major bugfix, so it will be noted that way in the 0.3.5 ReleaseNotes.

comment:24 Changed 10 months ago by dgoulet

Resolution: fixed
Status: merge_readyclosed

No backport then we are good here.

comment:25 Changed 8 months ago by pili

Sponsor: Sponsor27

comment:26 Changed 3 months ago by dgoulet

Keywords: 035-backport 040-backport 041-backport added
Resolution: fixed
Status: closedreopened

I would like us to strongly reconsider the backport this back down to 035. Reason is that it is really badly affecting tor clients and thus HS reachability. Here is how/why:

(The following considers that every time the client reaches the intro point, it gets NACKed because it has the old descriptor.)

  1. The obvious issue is that tor clients gets the intro circuit destroyed while it is trying to re-extend to a new IP. This itself, requires the client to do many round trips before noticing and then re-opening a new circuit to see the same again until all 3 IP have failed.
  1. This one is a more serious issue.

I've experienced during my testing a client looping over all IPs trying to establish an intro point but instead getting its circuit TRUNCATED for internal reason just _after_ sending the INTRODUCE1 cell. This is not seen as an "intro failure" by the client so the intro point will be retried.

However, how can we get a TRUNCATED _before_ the INTRODUCE_ACK nack-ing our request? This behavior I've seen a lot where a client can make 20-30 tries before it finally gets a NACK.

The reason I believe is in our cell scheduler: When selecting an active channel, we ask the cmux subsystem to give the first active circuit queue, this is done in circuitmux_get_first_active_circuit(). But, we alternate between DESTROY cell queue and the RELAY cell queue.

When the intro point sends a NACK, it first queues the INTRODUCE_ACK cell and then, because of this bug still everywhere in the network, it queues a DESTROY just after. Then our scheduler, at that point in time, decides to send the DESTROY _before_ the ack resulting in our client receiving a truncated cell, not noticing the NACK and thus retrying the same intro point after.

If no DESTROY cells were sent on the channel cmux yet, then it is prioritized from the relay cell. So, it is not actually 1/2 chance of hitting this, I believe it is much high probability to hit the issue I just described especially on smaller relays.

Considering the above, I'm strongly asking for 035, 040 and 041 backport.

comment:27 Changed 3 months ago by cypherpunks

Merged into 035 and master.

Wasn't that you?

comment:28 in reply to:  27 Changed 3 months ago by dgoulet

Resolution: fixed
Status: reopenedclosed

Replying to cypherpunks:

Merged into 035 and master.

Wasn't that you?

It was indeed... hmm it appears the "no backport" discussion was about <= 034 ... And they are EOL... damn.

comment:29 Changed 3 months ago by nickm

I remember that yesterday you were showing me logs that involved an intropoint on a relay that still had 0.3.4. Maybe it's time to tell the authorities to stop listing these?

comment:30 in reply to:  29 Changed 3 months ago by dgoulet

Replying to nickm:

I remember that yesterday you were showing me logs that involved an intropoint on a relay that still had 0.3.4. Maybe it's time to tell the authorities to stop listing these?

That is how I found this issue, by running intro points on 0.3.4.9 ... so either that or we patch clients to stop selecting them.

comment:31 Changed 3 months ago by nickm

Is there a flag we can stop giving them?

comment:32 in reply to:  31 ; Changed 3 months ago by dgoulet

Replying to nickm:

Is there a flag we can stop giving them?

We would need to remove HSIntro=4 basically from the 032, 033 and 034 relays so the service will go in legacy mode and this bug is not in the legacy code.

Not sure the authorities can do that :S

comment:33 Changed 3 months ago by nickm

It's possible, but would require a new consensus method.

comment:34 Changed 3 months ago by nickm

I opened #31549 for just removing old tors from the directory at last.

comment:35 in reply to:  32 Changed 2 months ago by teor

Replying to dgoulet:

Replying to nickm:

Is there a flag we can stop giving them?

We would need to remove HSIntro=4 basically from the 032, 033 and 034 relays so the service will go in legacy mode and this bug is not in the legacy code.

We could make working versions declare a new HSIntro version, and then make clients require that version.

When we fix high-impact bugs in future, we should consider adding a new protocol version, and making clients require it. That's what protocol versions are for,

comment:36 Changed 7 weeks ago by asn

Parent ID: #25882

comment:37 Changed 5 weeks ago by dgoulet

Related #31964

Note: See TracTickets for help on using tickets.