Opened 3 months ago

Closed 6 weeks ago

#31652 closed defect (fixed)

hs-v3: Service circuit retry limit should not close a valid circuit

Reported by: dgoulet Owned by: dgoulet
Priority: Medium Milestone: Tor: 0.4.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs, tor-circuit, 042-should
Cc: asn, neel Actual Points:
Parent ID: #30200 Points: 0.1
Reviewer: asn Sponsor: Sponsor27-must

Description

Context: Lets say a service has 3 intro circuits opened and stable.

Now, imagine one circuit collapses, like for instance the intro point restarted "tor" after an update. Our code is designed to retry 3 times that is once every second and then give up on the intro point.

What it looks like:

Every second, run_build_circuit_event() is run and launches intro circuit(s) if we are missing any. For each IP, it will increment the circuit_retries counter but does not actually look at it to decide to launch or not.

Before that event, also every 1 second, cleanup_intro_points() checks that every intro point has not expired, fell off the consensus or that circuit_retries is greater than (>) MAX_INTRO_POINT_CIRCUIT_RETRIES = 3.

Putting this together, imagine now that the first 3 attempts failed for whatever reasons and then we launch a 4th one because circuit_retries = 3, it does pass validation of > MAX_INTRO_POINT_CIRCUIT_RETRIES so then a circuit is launched and that very one succeeds.

Because circuit_retries is now 4 then the next second, cleanup_intro_points() removes the IP and closes the valid open established circuit...

I've observed the above a surprising amount of time because booting a tor relay takes some seconds mostly due to the diff-cache parsing.

In a nutshell, we should NOT launch a circuit if we reached the max retries for an intro point. This back and forth of opening a circuit and then deciding that we went over the limit makes no sense in the first place.

Child Tickets

Change History (31)

comment:1 Changed 3 months ago by neel

Cc: neel added
Owner: set to neel
Status: newassigned

comment:2 Changed 3 months ago by neel

Status: assignedneeds_review

comment:3 Changed 2 months ago by dgoulet

Reviewer: asndgoulet

comment:4 Changed 2 months ago by dgoulet

Status: needs_reviewneeds_revision

Comment on PR

comment:5 Changed 2 months ago by neel

Status: needs_revisionneeds_review

comment:6 Changed 2 months ago by nickm

Keywords: 042-should added

comment:7 Changed 2 months ago by dgoulet

Status: needs_reviewneeds_revision

Close but slight problem!

comment:8 Changed 2 months ago by neel

Status: needs_revisionneeds_review

Fixed it!

comment:9 Changed 2 months ago by asn

Status: needs_reviewneeds_revision

Hmm, seems some test failed and CI is sad. (Didn't do any source code review)

comment:10 Changed 2 months ago by neel

Status: needs_revisionneeds_review

comment:11 Changed 2 months ago by asn

Reviewer: dgouletasn

comment:12 Changed 2 months ago by asn

Status: needs_reviewneeds_revision

Thanks for the fixups. I think that if block in cleanup_intro_points() has grown to an unruly complexity which is hard to understand what it's checking and why anymore. Can you please split into a separate function, split all the clauses into different if statements, and document them individually?

Also, any chance for a unittest for this particular case? Both for how we end up with 4 circs, and for how the new code does not clean the fourth one up (if it's valid).

Thanks! :)

Last edited 2 months ago by asn (previous) (diff)

comment:13 Changed 2 months ago by neel

Status: needs_revisionneeds_review

comment:14 Changed 2 months ago by asn

Status: needs_reviewneeds_revision

Unittest looks reasonable, but I have a question on the naming of the function and the overall logic. Maybe some documentation to clarify is missing?

comment:15 Changed 2 months ago by neel

Status: needs_revisionneeds_review

I made the corrections.

comment:16 Changed 8 weeks ago by asn

Status: needs_reviewneeds_revision

Neel, I'm still confused here...

Why does a function called should_not_retry_intro_point() allow the possibility of a retry when we have gone over the number of maximum retries? In my view, it should always return false in that case.

In particular the following block:

  /* If we have gone over the number of retried circuits, make sure we don't
   * already have an established circuit. */
  if (ip->circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES) {
    return !ip->circuit_established || hs_circ_service_get_intro_circ(ip);
  }

is confusing me even tho I've read it a few times. When I'm reading that function my logic would be "If we have gone over the number of retried circuits, we only allow retries if ...", but then why would we allow retries if we are past the max? Also the comment mentions the established circuit clause, but not the hs_circ_service_get_intro_circ(ip); one. Can you please clarify that logic further? :/

We seem to have left some comments in cleanup_intro_points() that are only relevant to should_not_retry_intro_point().

comment:17 Changed 8 weeks ago by neel

Status: needs_revisionneeds_review

I have clarified the code, calling the new function should_remove_intro_point() and simplified the logic.

The code works this way:

If this statement is true:

    if (ip->circuit_established || hs_circ_service_get_intro_circ(ip))

We return false to not destroy the circuit. Otherwise, we return this test case:

  return (ip->circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES);

Setting as needs review.

comment:18 Changed 7 weeks ago by asn

Status: needs_reviewneeds_revision

OK this is much better. I really understand the logic now. Thanks!

That said, you changed the behavior of the code in the latest iteration which I was not expecting (I was just expecting documentation). In particular, the code now avoids removing the intro circ if hs_circ_service_get_intro_circ() is true (which makes sense!) whereas before it was doing the exact opposite.

How could you fix that bug without the unittest noticing at all? Can you please enrich the unittest so that it catches such bugs in the future?

We are almost there!

comment:19 Changed 7 weeks ago by neel

To clarify, is the bug you are talking about related to removing the intro point if hs_circ_service_get_intro_circ() is true? If not, what is it?

comment:20 in reply to:  19 Changed 7 weeks ago by asn

Replying to neel:

To clarify, is the bug you are talking about related to removing the intro point if hs_circ_service_get_intro_circ() is true? If not, what is it?

Yep! Before 861ff757712d008424746e9d1e9bd85b3f472dee we used to return True if hs_circ_service_get_intro_circ(ip) is true. After that commit, we return False.
This seems like a behavior change and not just refactoring.

comment:21 Changed 7 weeks ago by neel

Status: needs_revisionneeds_review

I fixed the issue while allowing tests to pass.

However, I believe my code is tested with the test routine test_service_event():

    /* Now, we'll create an IP with a registered circuit. The IP object
     * shouldn't go away. */
    ip = helper_create_service_ip();
    service_intro_point_add(service->desc_current->intro_points.map, ip);
    ed25519_pubkey_copy(&circ->hs_ident->intro_auth_pk,
                        &ip->auth_key_kp.pubkey);
    hs_circuitmap_register_intro_circ_v3_service_side(
                                         circ, &ip->auth_key_kp.pubkey);
    run_housekeeping_event(now);
    tt_int_op(digest256map_size(service->desc_current->intro_points.map),
              OP_EQ, 1);
    /* We'll mangle the IP object to expire. */
    ip->time_to_expire = now;
    run_housekeeping_event(now);
    tt_int_op(digest256map_size(service->desc_current->intro_points.map),
              OP_EQ, 0);

Setting as needs review.

comment:22 Changed 7 weeks ago by asn

Keywords: dgoulet-merge added
Status: needs_reviewmerge_ready

I think that looks reasonable. Thanks!

That said, if you have a unittest that does not fail after changing the behavior of the function, your unittest is not testing the function well enough. That's what I tried to say at comment:18.

Anyhow, I think that's OK for now tho.

I will mark this as merge_ready, and put David as the merger so that he takes a second look as well.

comment:23 Changed 7 weeks ago by dgoulet

Status: merge_readyneeds_revision

I think we can _not_ do this:

  if (hs_circ_service_get_intro_circ(ip) && over_max_retries) {
    return true;
  }

If we have a pending intro circuit for that ip and we are over the max retries, this means that the pending intro circuit is the one that will be used or will fail.

So returning true means that it will be removed from the service list and will lead to a lost lingering intro circuit that once established, we'll hit an error because we can't find the "ip" object.

If we have one established or in the process of establishing one, NEVER remove the IP is what I think we should do.

If we have no circuits at all and we are over the limit, remove it.

comment:24 Changed 7 weeks ago by asn

Yes, dgoulet I think you are right.

Given your feedback I think the behavior before https://github.com/torproject/tor/pull/1315/commits/5c26c656a2037f3b264d8c7609621450e6e46293 is the right one.

But we need a unittest that actually tests this feature.

comment:25 Changed 7 weeks ago by dgoulet

Yes apology here... I clearly was very confused or made a typo but this was misleading:

https://github.com/torproject/tor/pull/1315#pullrequestreview-287606815

comment:23 is the right way imo.

comment:26 Changed 6 weeks ago by dgoulet

Owner: changed from neel to dgoulet
Status: needs_revisionaccepted

Now that we now what to do, I can take it up.

Thanks neel!

comment:27 Changed 6 weeks ago by dgoulet

Keywords: dgoulet-merge removed

comment:28 Changed 6 weeks ago by dgoulet

Status: acceptedneeds_review

Branch: ticket31652_042_01
PR: https://github.com/torproject/tor/pull/1401

The patch has changed quite a bit but the logic is the same.

comment:29 Changed 6 weeks ago by asn

Status: needs_reviewmerge_ready

LGTM!

comment:30 Changed 6 weeks ago by nickm

Looks good to me too. The code is tricky, but the coverage is complete.

comment:31 Changed 6 weeks ago by nickm

Resolution: fixed
Status: merge_readyclosed

(Merged to master.)

Note: See TracTickets for help on using tickets.