Opened 2 years ago

Closed 2 years ago

#23603 closed defect (fixed)

hs: Cleanup race between circuit close and free with the HS circuitmap

Reported by: dgoulet Owned by: dgoulet
Priority: High Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs, review-group-27, prop224
Cc: Actual Points:
Parent ID: Points:
Reviewer: asn Sponsor:

Description (last modified by dgoulet)

Here is the scenario that has been encountered by asn. The following is in chronological order:

  1. [02:21:27] An intro point (IP) is picked and circuit A is launched.
  1. [02:23:47] Circuit A never completes and thus is marked for close which is NOT free thus still in our HS circuitmap.
  1. [02:23:48] A second later, the HS housekeeping kicks in (every 1 sec) and we notice that we are missing a circuit for this IP so we launch a new circuit B. What happens then is this new circuit B takes the place of circuit A in the HS circuitmap with the *same* hs_token (the token in this case is the IP authentication public key).

At this point, we have circuit A marked for close and we have circuit B that was launched and replaced circuit A with the same hs_token which identify the IP. So far so good.

  1. [02:23:48] At the same second, we get a circuit_build_failed() which is only called from circuit_about_to_free()->circuit_close_all_marked() which means it is freeing circuit A leading to a hs_circuitmap_remove_circuit() which will use its hs_token to remove circuit B because it is the same token.

We end up with no circuit for the intro point IP in our circuitmap.

  1. [02:23:49] A second after our circuit build failed which removed circuit B from the circuitmap, we cleanup the IP from our list through the house keeping and we launch a new circuit to a new intro point. The reason why we removed the IP is either because it has expire, no node_t could be found or we retried too many times on that circuit which the allowed number of retries is 3 right now. When we do remove an IP from our service list, we also try to close the circuit but because circuit B got removed from the circuitmap, we can't find it thus it is not closed.
  1. [02:24:54] Circuit B completes but we are unable to find the IP in our service list. And boom we get the warning:

In summary, we end up with an intro circuit being established but not in our circuitmap leading to the failed lookup when we remove that IP from our service list for whatever reason.

Seems like one thing we could do is actually remove a circuit from the HS circuitmap when we mark for close because at that point, the circuit is good as dead. We avoid this race between mark for close and actually being freed where in between a new circuit to the same destination with the same hs_token can be launched.

Child Tickets

Attachments (1)

tryintofind.log.gz (360.1 KB) - added by asn 2 years ago.

Download all attachments as: .zip

Change History (28)

comment:1 Changed 2 years ago by dgoulet

Description: modified (diff)

comment:2 Changed 2 years ago by nickm

So, here's a patch I came up with when I sorta believed the above theory: http://paste.debian.net/987018/

But I found a problem with the theory above: when we call hs_circuitmap_remove_circuit() in step 4, shouldn't we _either_ hit the "Could not find circuit" warning, or fail the "tmp == circ" assertion?

comment:3 Changed 2 years ago by dgoulet

#23604 was merged some hours ago so running master now should help us get more information if this bug comes again.

comment:4 Changed 2 years ago by asn

OK I had this happen to me again on my unstable service.

Here is the creation of the circuit:

Sep 23 02:33:29.000 [info] hs_circ_launch_intro_point(): Launching a circuit to intro point $809814F480D046EF977004BD545914BFFA47C06D~uhkakw at 185.148.145.74 for service 4acth47i6kxnvkewtm6q7ib2s3ufpo5sqbsnzjpbi7utijcltosqemad.
Sep 23 02:33:29.000 [info] origin_circuit_new(): Circuit 697 chose an idle timeout of 681 based on 652 seconds of predictive building remaining.
Sep 23 02:33:29.000 [info] onion_pick_cpath_exit(): Using requested exit node '$809814F480D046EF977004BD545914BFFA47C06D~uhkakw at 185.148.145.74'
Sep 23 02:33:29.000 [info] select_entry_guard_for_circuit(): Selected primary guard Unnamed ($3E53D3979DB07EFD736661C934A1DED14127B684) for circuit.
...

and then after a second:

Sep 23 02:33:30.000 [warn] Unknown introduction point auth key on circuit 3571433146 for service 4acth47i6kxnvkewtm6q7ib2s3ufpo5sqbsnzjpbi7utijcltosqemad
Sep 23 02:33:30.000 [info] circuit_mark_for_close_(): Circuit 3571433146 (id: 697) marked for close at src/or/hs_service.c:2711 (orig reason: 12, new reason: 0)
...
Sep 23 02:33:30.000 [info] circuit_free(): Circuit 0 (id: 697) has been freed.

Note the 697 id which is the only identifier. There are no other references to 3571433146 or 697 in the logs.

There are no cleanup_intro_points() logs between creation and destruction (it only took a second). I'm not sure where the circuit went, or maybe the circuitmap was never updated? :/


BTW note that this is an unstable hidden service where internet goes up and down a few times a day. I have 200 intro circuit creations in 3 days.

My cleanup intro point logs always have maxed out retries:

Sep 23 01:46:41.000 [info] cleanup_intro_points(): Intro point $9C900A7F6F5DD034CFFD192DAEC9CCAA813DB022 (retried: 4 times). Removing it.
Sep 23 01:46:41.000 [info] cleanup_intro_points(): Intro point $CC865174BE13444B251CCB2F82B3F0F9A614DC53 (retried: 4 times). Removing it.
Sep 23 01:46:41.000 [info] cleanup_intro_points(): Intro point $60AEEBDC21E6D5411853BDA03478604FFBA00B20 (retried: 4 times). Removing it.
Sep 23 02:09:19.000 [info] cleanup_intro_points(): Intro point $F1D71314781D23800C3C62CB720BBCF449B27399 (retried: 4 times). Removing it.

also i got this once

Sep 23 02:33:30.000 [info] circuit_free(): Circuit 0 (id: 697) has been freed.

Changed 2 years ago by asn

Attachment: tryintofind.log.gz added

comment:5 Changed 2 years ago by asn

Also attached info logs to this ticket!

comment:6 in reply to:  2 Changed 2 years ago by dgoulet

Ok, those latest logs were enlightening and 99.9% certainty now :)

The offending function is hs_service_intro_circ_has_closed() called in circuit_about_to_free().

Same scenario has above but just simpler for the sake of the example:

  1. mark for close intro circuits
  2. launch new ones to same intro points.
  3. about_to_free() the circuits in step (1). --> This calls has_closed() which removes the intro point object if it has been retried too many times.
  4. New circuit opens, boom no intro point object and we get the warning of Unknown auth key....

From asn's log, which suspend/wakeup many times his tor, we happen to try intro points many times (successfully) and in their 18h-24h lifetime, it was done more than 3 times which triggered this condition in the has closed function:

  if (ip->circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES) {
    remember_failing_intro_point(ip, desc, approx_time());
    service_intro_point_remove(service, ip);
    service_intro_point_free(ip);

This can only happen if the service launch the new circuits *before* the free() of the previous circuits happens *and* the circuit retry counter has reached > 3.

Couple of possible fixes here. First, we should NOT clean anything in that function. We should ONLY do that in cleanup_intro_points() which would have cleanup everything properly (circuit included). For this, we need to move the "remember failing intro point" logic in there in case we've tried too many times.

Second, this circuit_retries counter should probably not be incremented when we just go out of hibernation and tor is assuming at that point that all circuits are not working. We either do a callback from hibernation into the HS subsystem to cleanup those counters or another possible option would be to reset to 0 the counter on a successful intro point establishment.

comment:7 Changed 2 years ago by dgoulet

Owner: set to dgoulet
Status: newaccepted

comment:8 Changed 2 years ago by asn

Agreed on only doing cleanup in cleanup_intro_points().

Not sure about the circuit_retries counter situation. A real hibernation callback is probably not easy to do in a cross-platform way, and resetting the retries count is a recipe for guard discovery attacks. However, IIUC this bug is not as impotant as the cleanup issue.

comment:9 Changed 2 years ago by dgoulet

Status: acceptedneeds_review

See branch bug23603_032_01 for the cleanup fix.

I haven't touched the circuit_retries counter here, I'll document the issue in a ticket though.

comment:10 Changed 2 years ago by asn

Status: needs_reviewmerge_ready

LGTM.

comment:11 Changed 2 years ago by nickm

Status: merge_readyneeds_revision

looks okay but needs a changes file. How tested is this?

comment:12 in reply to:  11 Changed 2 years ago by dgoulet

Replying to nickm:

looks okay but needs a changes file. How tested is this?

Not fully tested because the race is difficult to get in chutney or the network.

Although, we can probably pull it off in a unit test to at least test that it works out.

comment:13 Changed 2 years ago by asn

I pushed bug23603_test with a draft unittest here. It needs improvements.

David, I did not understand your suggested fix for cleanup_intro_points to make it remove intro points that will get cleaned up anyway. I think my confusion stems from the fact that we increase intro point circuit_retries when we launch a circuit and not when we fail it.

This means that there are cases (say network down) where we will launch a 4th retry on an intro point, the retry will succeed, and then we will still clean it up since it reached 4 retries. I think the right way to do this is to increase the retry counter when we fail the circuit in hs_service_intro_circ_has_closed(). No?

I think the suggested fix also creates confusion over the semantics of MAX_INTRO_POINT_CIRCUIT_RETRIES since we would then have in cleanup_intro_points:

if (ip->circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES) {
   clean intro point
}

if (ip->circuit_retries >= MAX_INTRO_POINT_CIRCUIT_RETRIES &&
    no_active_intro_circuit_for_this_ip) {
   clean intro point
}

This logic looks confusing to me. What's the meaning of max after all? I think all this confusion stems from the fact that circuit_retries gets incremented on launch and not on fail.

What do you think?

Last edited 2 years ago by asn (previous) (diff)

comment:14 Changed 2 years ago by dgoulet

Indeed. The circuit_retries is really about "How many time have used this IP to build a circuit" regardless of success or fail. Even if the circuit was taken down 5 min later, we still count that as a "try".

Apart from all the numerous issues with this behavior, one of the bug here is that we reach 3 try which is not above the limit of 4, so we open a new circuit and circuit_retries == 4 which makes us clean it up the second after because we've reached our max...

We should:

  1. Allow the circuit creation up to 3 tries.
  2. Deny circuit creation at > 3 and clean it up at > 3.

Kind of tricky here, so I think the easiest way to do that is that if circuit_retries is 3 and we want to launch, ++ it and deny the launch.

OR we only cleanup the intro point if it doesn't have a circuit established and the counter has reached the max? (Which is what v2 does, cleanups only if it realizes that the retry has reached the max and the last circuit is gone thus a retry is soon about to be launched).

If you have a more elegant solution, I'm all ears?

comment:15 in reply to:  14 Changed 2 years ago by asn

Replying to dgoulet:

Indeed. The circuit_retries is really about "How many time have used this IP to build a circuit" regardless of success or fail. Even if the circuit was taken down 5 min later, we still count that as a "try".

Apart from all the numerous issues with this behavior, one of the bug here is that we reach 3 try which is not above the limit of 4, so we open a new circuit and circuit_retries == 4 which makes us clean it up the second after because we've reached our max...

We should:

  1. Allow the circuit creation up to 3 tries.
  2. Deny circuit creation at > 3 and clean it up at > 3.

Kind of tricky here, so I think the easiest way to do that is that if circuit_retries is 3 and we want to launch, ++ it and deny the launch.

OR we only cleanup the intro point if it doesn't have a circuit established and the counter has reached the max? (Which is what v2 does, cleanups only if it realizes that the retry has reached the max and the last circuit is gone thus a retry is soon about to be launched).

If you have a more elegant solution, I'm all ears?

Hmmm, I feel like the first solution you suggested is less racy and more straightforward than the first (at least to my mind). Curious why v2 is doing it the other way. Maybe I have not thought of something?

I'd also like to either rename circuit_retries to circuits_launched, or restore the circuit_retries semantic by only increasing it when we fail and actually retry a circuit. Because right now circuit_retries will also count the first circuit launch which is not a retry. I feel like that will also simplify logic and understanding.

comment:16 Changed 2 years ago by dgoulet

Agree!!

comment:17 Changed 2 years ago by dgoulet

Owner: changed from dgoulet to asn
Status: needs_revisionassigned

Re-assigning to asn since this is branch and code. Let me know if you have no time before 032 stable to fix this.

comment:18 Changed 2 years ago by asn

Milestone: Tor: 0.3.2.x-finalTor: 0.3.3.x-final

Moving to 0.3.3. Likely no time to patch this properly and fix the test, and also a pretty rare bug afaik.

comment:19 Changed 2 years ago by dgoulet

Milestone: Tor: 0.3.3.x-finalTor: 0.3.2.x-final
Owner: changed from asn to dgoulet
Status: assignedaccepted

I hit last week #24346 and turns out *this* ticket is the cause.

In a nutshell:

  1. close(intro_circ)
  2. Launch again to intro (retry)
  3. Register intro circuit that just opened (FAIL because intro_circ is still in circuitmap)
  4. free(intro_circ) leading to a removal from the circuitmap.

And there you have it, this new intro fully established circuit can't be found by the service so the #24346 is stuck with:

Dec 02 15:58:47.000 [info] should_service_upload_descriptor(): Service ONION can't upload its next descriptor. Intro circuits aren't yet all established (2/3). [59 similar message(s) suppressed in last 60 seconds]

Only 2 of them are registered in the circuitmap and the third one is no where to be found.

Turning this one high prio and back in 032 because it can basically disable a service until the next SRV value which can be up to 24h.

comment:20 Changed 2 years ago by dgoulet

Reviewer: asn
Status: acceptedneeds_review

I've taken the old patch and asn's test. Rebased it on latest 032.

This fixes two things that this ticket found as issues. They are in two separate commits. Third commit adds asn's test.

The issue with circuit_retries is *not* addressed in here but it is also not critical, the service will still work and it should be a rare case. But, we should definitely fix it. I propose its own ticket once this is merged and we agree.

Branch: bug23603_032_01

comment:21 Changed 2 years ago by asn

Hello, LGTM.

I pushed a branch bug23603_032_01 on my repo that tries to tidy up the HS circuit cleanup operation by pulling it into a function, so that we don't pollute the circuit_free() and circuit_mark_for_close() functions with various HS-specific stuff. Let me know if you like it, and if you like it a lot squash it up, and put it in merge_ready. Cheers!

comment:22 Changed 2 years ago by dgoulet

I like it *but* now that both the close and free call back the hs service layer using has_closed(), the new test fails to work because we do not have a registered service for it.

comment:23 Changed 2 years ago by dgoulet

Ok I took over. I had to do some moving and put the function in hs_circuit.c as well, thought it would be more accurate there.

Branch: bug23603_032_02

Notice couple of things. First, that cleanup needs to be done on both or and origin circuit. Second, w need to do it before the hs_ident is freed because the has_closed() needs it.

comment:24 Changed 2 years ago by nickm

Keywords: review-group-27 added

comment:25 Changed 2 years ago by asn

Status: needs_reviewmerge_ready

LGTM! Tests pass with ASAN as well.

comment:26 Changed 2 years ago by asn

Keywords: prop224 added

comment:27 Changed 2 years ago by nickm

Resolution: fixed
Status: merge_readyclosed

looks plausible to me too. merging to 0.3.2 and forward!

Note: See TracTickets for help on using tickets.