When a hidden service opens an introduction point circuit, there's nothing that checks that it actually receive an INTRO_ESTABLISHED cell within a reasonable timeout.
So in remove_invalid_intro_points(), we should close circuits that have been in CIRCUIT_PURPOSE_S_ESTABLISH_INTRO for too long (60 seconds? 5 minutes? Some dynamic interval?).
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
If a circuit is stuck in CIRCUIT_PURPOSE_S_ESTABLISH_INTRO for more than N seconds, close the circuit and treat it as a failure.
This should be fixed along with #21600 (moved), where we wait T seconds between detecting failure and retrying the connection.
We have the following approximate constraints:
Detecting failure (#21600 (moved)) takes 1-2 first-hop latencies (it's likely to be the slow hop),
Building a 3-hop circuit (this ticket) and sending and receiving a reply takes about 8 first-hop latencies (3*2 + 2),
The total potential failure delay is approximately 10 first-hop latencies,
multiplied by 3 retry attempts, this gives ~30 first-hop latencies,
We reset the intro attempt count every 600 seconds (5 minutes), giving an approximate first-hop latency of ~20 seconds.
But we don't want to penalise hidden services on slow connections (our connection timeout is ~30 seconds), which would give us 900 seconds (15 minutes) for a full set of retries. That's probably acceptable, unless we've just killed off all of our intro points and will never get any of them back. But as long as they are in the consensus, they should be reachable (and if they're not, we drop them straight away).
So that makes:
N = 830 = 240 (or maybe 815 = 120 if the timeout is the rtt)
T = 230 = 60 (or maybe 215 = 30)
Given we've had no limit for N before this ticket, I'd be conservative and make it 240 seconds.
Given that T has been 1 second before this ticket, I'd be conservative and make it 30 seconds.
So that's a total of 3 * (240 + 30) = 810 seconds (13.5 minutes) to exhaust retries.
Ok, so it turns out this is simpler than I expected: this should never happen with correctly-behaving intro points. And as long as our retries are randomised, we don't need to randomise the intro timeout.
Please see my branch bug21621.
It also adds some logging diagnostics for #21446 (moved) that include a new field added to fix this bug.
If you're going to use BUG(intro->time_launched > now), you should probably use one of the monotonic time functions, not time().
Could you rebase this onto master, without the changes from other tickets that have already been (squashed and) merged? I tried to do it myself, but I ran into conflicts.
Hey hey, good catch and I agree this is something to be fixed.
Some review points:
How come we ended up with 5 minutes as the timeout? The math in comment:5 do not seem to indicate 5 minutes anywhere? Shouldn't it be 240 seconds? Also not sure what the 13.5 minutes value is.
FWIW, personally I feel like 5 minutes is a reasonable/conservative timeout value for what we are aiming here.
I think it would be better if the new logic added in remove_invalid_intro_points() was actually a function of its own. This way we don't polute the host function this much, and we can also reuse that function (with mods) for prop224.
Maybe rename MAX_INTRO_TIMEOUT to MAX_INTRO_NOT_ESTABLISHED_CIRC_TIMEOUT or something a bit more meaningful?
I will not have time to revise this before the 0.3.1 code freeze.
It would be good if someone else fixed this bug in 0.3.1, because it affects hidden service reliability.
If you defer to 0.3.2, please reassign to me.
Trac: Owner: teor toN/A Status: needs_revision to assigned
So I was studying carefully this ticket for the service implementation of prop224 (#20657 (moved)) in order to implement this feature. However, I think the service intro circuit do expire normally and don't have a super long expiration like a rendezvous circuit has.
Looking at circuit_expire_building() (which actually looks at open circuits contrary to its comment), the timeout cutoff is "normal" that is not super long:
And then later, there is the if (victim->state == CIRCUIT_STATE_OPEN) { condition that just breaks if we have a CIRCUIT_PURPOSE_S_ESTABLISH_INTRO and ultimately close it with timeout reason.
So, I believe tor "naturally" closes too old intro circuit that are established but then nothing happens on them. By design, they are suppose to be short live so this seems already integrated?
This introduces many thing to HS that could not play well now that we've freeze 031. Furthermore, I don't think we do need to fix that asap if my previous comment is accurate.
Deferring and back to teor as requested by him.
Trac: Reviewer: N/Ato dgoulet Milestone: Tor: 0.3.1.x-final to Tor: 0.3.2.x-final Owner: dgoulet to teor Status: needs_information to assigned
I do not think this is an issue in HSv3 because a service intro circuit that never establishes ultimately get removed by circuit_building_expire() and the HS subsystem notices it at that point which results in opening a new one.
No descriptors are published without all IP established.
As for v2, I believe it should also behave like that but uncertain. In any case, not sure we'll fix this in v2 unless major stability or reachability issue arise?
Please re-open if I'm wrong but in the end, I think circuit_building_expire() does its job properly.
Trac: Status: new to closed Resolution: N/Ato not a bug