Opened 10 months ago

Last modified 4 months ago

#21621 needs_information defect

Intro points can get stuck in CIRCUIT_PURPOSE_S_ESTABLISH_INTRO

Reported by: teor Owned by: teor
Priority: Medium Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor Version: Tor: 0.2.7.2-alpha
Severity: Normal Keywords: tor-hs
Cc: Actual Points:
Parent ID: #21446 Points: 1
Reviewer: dgoulet Sponsor: SponsorR-can

Description

When a hidden service opens an introduction point circuit, there's nothing that checks that it actually receive an INTRO_ESTABLISHED cell within a reasonable timeout.

So in remove_invalid_intro_points(), we should close circuits that have been in CIRCUIT_PURPOSE_S_ESTABLISH_INTRO for too long (60 seconds? 5 minutes? Some dynamic interval?).

Child Tickets

Change History (16)

comment:1 Changed 10 months ago by teor

Parent ID: #21446

This is more cleanup after #21599: covering another edge case where circuits could get stuck.

comment:2 Changed 10 months ago by dgoulet

Status: newneeds_information

I think we do.

intro->circuit_established = 1; is what signals that the intro circuit is established and ready to be used?

comment:3 in reply to:  2 ; Changed 10 months ago by teor

Status: needs_informationnew

Replying to dgoulet:

I think we do.

intro->circuit_established = 1; is what signals that the intro circuit is established and ready to be used?

We only do that when the INTRO_ESTABLISHED cell is received.
Until then, intro->circuit_established is 0, and it can remain in that state forever:

  • What if the introduction point has a bug and never sends it?
    • What do introduction points do when they detect a replay?
  • What if there's a deliberate attack where an intro point maintains the circuit but never sends the INTRO_ESTABLISHED cell?

comment:4 in reply to:  3 Changed 10 months ago by teor

Replying to teor:

Replying to dgoulet:

I think we do.

intro->circuit_established = 1; is what signals that the intro circuit is established and ready to be used?

We only do that when the INTRO_ESTABLISHED cell is received.
Until then, intro->circuit_established is 0, and it can remain in that state forever:

  • What if the introduction point has a bug and never sends it?

This can't happen, at least in the latest version: the intro point either sends INTRO_ESTABLISHED or closes the circuit.

It's also not possible for an OOM to cause this: on OOM, the circuit is closed when the cells are discarded.

  • What do introduction points do when they detect a replay?

There is no replay cache: that's rend points.

(If the key is the same, the old circuit is replaced with the new circuit, and the old circuit is closed.)

  • What if there's a deliberate attack where an intro point maintains the circuit but never sends the INTRO_ESTABLISHED cell?

Or another bug?

Then we'd need to fix this issue.

comment:5 Changed 10 months ago by teor

So let's try something like:

If a circuit is stuck in CIRCUIT_PURPOSE_S_ESTABLISH_INTRO for more than N seconds, close the circuit and treat it as a failure.

This should be fixed along with #21600, where we wait T seconds between detecting failure and retrying the connection.

We have the following approximate constraints:

  • Detecting failure (#21600) takes 1-2 first-hop latencies (it's likely to be the slow hop),
  • Building a 3-hop circuit (this ticket) and sending and receiving a reply takes about 8 first-hop latencies (3*2 + 2),
  • The total potential failure delay is approximately 10 first-hop latencies,
  • multiplied by 3 retry attempts, this gives ~30 first-hop latencies,
  • We reset the intro attempt count every 600 seconds (5 minutes), giving an approximate first-hop latency of ~20 seconds.

But we don't want to penalise hidden services on slow connections (our connection timeout is ~30 seconds), which would give us 900 seconds (15 minutes) for a full set of retries. That's probably acceptable, unless we've just killed off all of our intro points and will never get any of them back. But as long as they are in the consensus, they should be reachable (and if they're not, we drop them straight away).

So that makes:

  • N = 8*30 = 240 (or maybe 8*15 = 120 if the timeout is the rtt)
  • T = 2*30 = 60 (or maybe 2*15 = 30)

Given we've had no limit for N before this ticket, I'd be conservative and make it 240 seconds.

Given that T has been 1 second before this ticket, I'd be conservative and make it 30 seconds.

So that's a total of 3 * (240 + 30) = 810 seconds (13.5 minutes) to exhaust retries.

Edit: sometimes I just can't maths

Last edited 10 months ago by teor (previous) (diff)

comment:6 Changed 10 months ago by teor

(Oh, and we should randomise each interval between 0.5 and 1.5 times, to avoid thundering herds.)

comment:7 Changed 10 months ago by teor

So let's make it a random value in [0.5*8*CircuitTimeout, 1.5*8*CircuitTimeout]

comment:8 Changed 10 months ago by teor

Status: newneeds_review

Ok, so it turns out this is simpler than I expected: this should never happen with correctly-behaving intro points. And as long as our retries are randomised, we don't need to randomise the intro timeout.

Please see my branch bug21621.
It also adds some logging diagnostics for #21446 that include a new field added to fix this bug.

comment:9 Changed 10 months ago by nickm

Status: needs_reviewneeds_revision

Thanks, Teor! Two requests.

  1. If you're going to use BUG(intro->time_launched > now), you should probably use one of the monotonic time functions, not time().
  1. Could you rebase this onto master, without the changes from other tickets that have already been (squashed and) merged? I tried to do it myself, but I ran into conflicts.

comment:10 Changed 10 months ago by asn

Hey hey, good catch and I agree this is something to be fixed.

Some review points:

  • How come we ended up with 5 minutes as the timeout? The math in comment:5 do not seem to indicate 5 minutes anywhere? Shouldn't it be 240 seconds? Also not sure what the 13.5 minutes value is.

FWIW, personally I feel like 5 minutes is a reasonable/conservative timeout value for what we are aiming here.

  • I think it would be better if the new logic added in remove_invalid_intro_points() was actually a function of its own. This way we don't polute the host function this much, and we can also reuse that function (with mods) for prop224.
  • Maybe rename MAX_INTRO_TIMEOUT to MAX_INTRO_NOT_ESTABLISHED_CIRC_TIMEOUT or something a bit more meaningful?

comment:11 Changed 9 months ago by dgoulet

Sponsor: SponsorR-can

comment:12 Changed 8 months ago by teor

Owner: teor deleted
Status: needs_revisionassigned

I will not have time to revise this before the 0.3.1 code freeze.
It would be good if someone else fixed this bug in 0.3.1, because it affects hidden service reliability.

If you defer to 0.3.2, please reassign to me.

comment:13 Changed 8 months ago by dgoulet

Owner: set to dgoulet
Status: assignedaccepted

comment:14 Changed 7 months ago by dgoulet

Status: acceptedneeds_information

So I was studying carefully this ticket for the service implementation of prop224 (#20657) in order to implement this feature. However, I *think* the service intro circuit do expire normally and don't have a super long expiration like a rendezvous circuit has.

Looking at circuit_expire_building() (which actually looks at open circuits contrary to its comment), the timeout cutoff is "normal" that is not super long:

SET_CUTOFF(s_intro_cutoff, get_circuit_build_timeout_ms() * (9/6.0) + 1000);

And then later, there is the if (victim->state == CIRCUIT_STATE_OPEN) { condition that just breaks if we have a CIRCUIT_PURPOSE_S_ESTABLISH_INTRO and ultimately close it with timeout reason.

So, I believe tor "naturally" closes too old intro circuit that *are* established but then nothing happens on them. By design, they are suppose to be short live so this seems already integrated?

comment:15 Changed 7 months ago by dgoulet

Milestone: Tor: 0.3.1.x-finalTor: 0.3.2.x-final
Owner: changed from dgoulet to teor
Reviewer: dgoulet
Status: needs_informationassigned

This introduces many thing to HS that could not play well now that we've freeze 031. Furthermore, I don't think we do *need* to fix that asap if my previous comment is accurate.

Deferring and back to teor as requested by him.

comment:16 Changed 4 months ago by dgoulet

Milestone: Tor: 0.3.2.x-finalTor: 0.3.3.x-final
Status: assignedneeds_information
Note: See TracTickets for help on using tickets.