Opened 4 months ago

Closed 4 months ago

#30921 closed defect (fixed)

hs-v3: Close intro circuits when cleaning up the client descriptor cache

Reported by: dgoulet Owned by: dgoulet
Priority: Medium Milestone: Tor: 0.4.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs, tor-client
Cc: Actual Points: 0.1
Parent ID: #28970 Points: 0.1
Reviewer: asn Sponsor: Sponsor27-must

Description

In #28970, one of the assert indicates that we are missing the descriptor object when the intro point circuit opened:

Jan 01 05:31:33.000 [warn] tor_bug_occurred_(): Bug: ../src/or/hs_client.c:624: setup_intro_circ_auth_key: Non-fatal assertion !(desc == NULL) failed. (on Tor 0.3.4.9 )
Jan 01 05:31:33.000 [warn] Bug: Non-fatal assertion !(desc == NULL) failed in setup_intro_circ_auth_key at ../src/or/hs_client.c:624. Stack trace: (on Tor 0.3.4.9 )
Jan 01 05:31:33.000 [warn] Bug:     /usr/bin/tor(log_backtrace+0x42) [0x56345cea26e2] (on Tor 0.3.4.9 )
Jan 01 05:31:33.000 [warn] Bug:     /usr/bin/tor(tor_bug_occurred_+0xb7) [0x56345cebd587] (on Tor 0.3.4.9 )
Jan 01 05:31:33.000 [warn] Bug:     /usr/bin/tor(hs_client_circuit_has_opened+0x2ca) [0x56345ce8027a] (on Tor 0.3.4.9 )

When a descriptor is removed from the client cache, the intro circuits aren't closed so there is a race where if it happens in the same main loop run that the client has an opened intro circuit for it, then it could lead to that assert.

Regardless of the cause of the assert or not, we should always close pending intro circuits when cleaning up a descriptor since once it opens, the client requires access to the descriptor object to complete the introduction (see setup_intro_circ_auth_key()).

Funny thought that we do that when we replace a descriptor from the client cache but not when we purge it...

This is a possible backport contender in order to avoid BUG() and failure of reachability client side.

Child Tickets

Change History (11)

comment:1 Changed 4 months ago by dgoulet

Actual Points: 0.1
Reviewer: asn
Status: assignedneeds_review

Branch: ticket30921_042_01
PR: https://github.com/torproject/tor/pull/1120

I added a unit test that indeed confirms the issue without the fix. No backport afterall since this uses some code from #27471 earlier patch that was not backported which is the "close all client intro circuit" function.

comment:2 Changed 4 months ago by nickm

Is this a backport candidate? If so, how far?

comment:3 Changed 4 months ago by dgoulet

Quoting from previous comment:1: No backport afterall since this uses some code from #27471 earlier patch that was not backported which is the "close all client intro circuit" function.

These issues happen very rarely so in #27471 we were OK to not backport due to the amount of code required to fix this.

comment:4 Changed 4 months ago by asn

Status: needs_reviewneeds_revision

Hmm, some questions about this fix:

  • I'm not sure if the identified race condition is the cause of this assert. Looking at #28970, there are two asserts and the second one is on hs_ident_intro_circ_is_valid() which imples that the hs_ident exists but is zeroed out. IIUC, this seems to be the reason that we can't find a descriptor here, and not a race condition.
  • Just noting that Regardless of the cause of the assert or not, we should always close pending intro circuits when cleaning up a descriptor, seems like a step backwards from #22893. Am I right to say that we need to do this anyway so that we maintain our current suboptimal design, and there is no way around it as long as we don't do #22893?
  • I now see a bit of code dup between cache_store_as_client() and cache_clean_v3_as_client() perhaps we can tidy it up in one func.

comment:5 in reply to:  4 ; Changed 4 months ago by dgoulet

Replying to asn:

Hmm, some questions about this fix:

  • I'm not sure if the identified race condition is the cause of this assert. Looking at #28970, there are two asserts and the second one is on hs_ident_intro_circ_is_valid() which imples that the hs_ident exists but is zeroed out. IIUC, this seems to be the reason that we can't find a descriptor here, and not a race condition.

I would argue that the "!desc" assert could be caused by what I outlined and fixed in the patch. The "has opened" code path can't return an error so after that assert is hit, we end up trying to send an INTRODUCE1 on the circuit and then you hit the second assert.

The reason why the intro_auth_pk key is zeroed is because the setup_intro_circ_auth_key() failed before it could be filled (by not finding the descriptor).

  • Just noting that Regardless of the cause of the assert or not, we should always close pending intro circuits when cleaning up a descriptor, seems like a step backwards from #22893. Am I right to say that we need to do this anyway so that we maintain our current suboptimal design, and there is no way around it as long as we don't do #22893?

Yes, in an ideal world, we would implement #22893 definitely.

comment:6 in reply to:  5 ; Changed 4 months ago by asn

Replying to dgoulet:

Replying to asn:

Hmm, some questions about this fix:

  • I'm not sure if the identified race condition is the cause of this assert. Looking at #28970, there are two asserts and the second one is on hs_ident_intro_circ_is_valid() which imples that the hs_ident exists but is zeroed out. IIUC, this seems to be the reason that we can't find a descriptor here, and not a race condition.

I would argue that the "!desc" assert could be caused by what I outlined and fixed in the patch. The "has opened" code path can't return an error so after that assert is hit, we end up trying to send an INTRODUCE1 on the circuit and then you hit the second assert.

The reason why the intro_auth_pk key is zeroed is because the setup_intro_circ_auth_key() failed before it could be filled (by not finding the descriptor).

ACK agreed.

One last thing: Can you please outline the exact steps to enter this edge case? IIUC, it's about a client who gets their intro circ opened at the exact same second that their descriptor expired? This seems extremely rare to me, and I wonder if there is a simpler explanation.

In particular, descriptors expire at "the start of the next time period", and the logs in question were at 05:31, so not sure how expiry could work this way.

comment:7 in reply to:  6 Changed 4 months ago by dgoulet

Replying to asn:

Replying to dgoulet:

Replying to asn:

Hmm, some questions about this fix:

  • I'm not sure if the identified race condition is the cause of this assert. Looking at #28970, there are two asserts and the second one is on hs_ident_intro_circ_is_valid() which imples that the hs_ident exists but is zeroed out. IIUC, this seems to be the reason that we can't find a descriptor here, and not a race condition.

I would argue that the "!desc" assert could be caused by what I outlined and fixed in the patch. The "has opened" code path can't return an error so after that assert is hit, we end up trying to send an INTRODUCE1 on the circuit and then you hit the second assert.

The reason why the intro_auth_pk key is zeroed is because the setup_intro_circ_auth_key() failed before it could be filled (by not finding the descriptor).

ACK agreed.

One last thing: Can you please outline the exact steps to enter this edge case? IIUC, it's about a client who gets their intro circ opened at the exact same second that their descriptor expired? This seems extremely rare to me, and I wonder if there is a simpler explanation.

In particular, descriptors expire at "the start of the next time period", and the logs in question were at 05:31, so not sure how expiry could work this way.

Well the callback for cleaning the client cache is every 30 minutes. So at 05:31, we might have simply gotten the consensus earlier that invalidates the client cache since the "valid_after" time is used. Also I have no idea what is the timezone of that log :S...

The use case is that when the client cache expires a descriptor, it could be in the same main loop that an introduction circuit is now opened.

It is very rare indeed! This is why I think I've never seen that and only active clients let say trying to connect to the HS every 30 seconds might have more chance to hit it. Considering thousands of tor users out there with very different use case, it is a case that I can see happening but very rarely.

This is also why, for this family of bugs, we decided to not backport because it is so rare.

Nevertheless, when cleaning descriptor from the client cache, I think it is correct behavior to then always close any pending intro circuits. Exactly what we did when we *replace* an existing descriptor.

Also, re-reading this ticket, this appears to be very wrong and imo we should address this...:

The "has opened" code path can't return an error so after that assert is hit, we end up trying to send an INTRODUCE1 on the circuit and then you hit the second assert.

If the subsystem says "well your has_opened is not working for me", it is saying that the circuit won't work so we should stop right there and close it. Going into the "sending a cell code path" is a bit crazy imo.

comment:8 Changed 4 months ago by asn

Sounds good. Let's get this done. :)

How about a small fixup to unite the code dup between cache_store_as_client() and cache_clean_v3_as_client()? I think it's gonna be cleaner and also will be easier to clean when we do #22893.

After the fixup, I think we are good to go!

PS: I was assuming that the logs were on UTC because I thought that's what Tor does. But perhaps that's not the case if syslog is used.

comment:9 in reply to:  8 Changed 4 months ago by dgoulet

Status: needs_revisionneeds_review

Replying to asn:

How about a small fixup to unite the code dup between cache_store_as_client() and cache_clean_v3_as_client()? I think it's gonna be cleaner and also will be easier to clean when we do #22893.

Actually, the clean and store functions are kind of very different? What do you have in mind here to unify? ...

comment:10 Changed 4 months ago by asn

Status: needs_reviewmerge_ready

LGTM!

(Figured out with David over IRC that my suggestion was not worth it)

Last edited 4 months ago by asn (previous) (diff)

comment:11 Changed 4 months ago by nickm

Resolution: fixed
Status: merge_readyclosed

merged to master!

Note: See TracTickets for help on using tickets.