Context: Lets say a service has 3 intro circuits opened and stable.
Now, imagine one circuit collapses, like for instance the intro point restarted "tor" after an update. Our code is designed to retry 3 times that is once every second and then give up on the intro point.
What it looks like:
Every second, run_build_circuit_event() is run and launches intro circuit(s) if we are missing any. For each IP, it will increment the circuit_retries counter but does not actually look at it to decide to launch or not.
Before that event, also every 1 second, cleanup_intro_points() checks that every intro point has not expired, fell off the consensus or that circuit_retries is greater than (>) MAX_INTRO_POINT_CIRCUIT_RETRIES = 3.
Putting this together, imagine now that the first 3 attempts failed for whatever reasons and then we launch a 4th one because circuit_retries = 3, it does pass validation of > MAX_INTRO_POINT_CIRCUIT_RETRIES so then a circuit is launched and that very one succeeds.
Because circuit_retries is now 4 then the next second, cleanup_intro_points() removes the IP and closes the valid open established circuit...
I've observed the above a surprising amount of time because booting a tor relay takes some seconds mostly due to the diff-cache parsing.
In a nutshell, we should NOT launch a circuit if we reached the max retries for an intro point. This back and forth of opening a circuit and then deciding that we went over the limit makes no sense in the first place.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items
0
Link issues together to show that they're related.
Learn more.
Thanks for the fixups. I think that if block in cleanup_intro_points() has grown to an unruly complexity which is hard to understand what it's checking and why anymore. Can you please split into a separate function, split all the clauses into different if statements, and document them individually?
Also, any chance for a unittest for this particular case? Both for how we end up with 4 circs, and for how the new code does not clean the fourth one up (if it's valid).
Why does a function called should_not_retry_intro_point() allow the possibility of a retry when we have gone over the number of maximum retries? In my view, it should always return false in that case.
In particular the following block:
/* If we have gone over the number of retried circuits, make sure we don't * already have an established circuit. */ if (ip->circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES) { return !ip->circuit_established || hs_circ_service_get_intro_circ(ip); }
is confusing me even tho I've read it a few times. When I'm reading that function my logic would be "If we have gone over the number of retried circuits, we only allow retries if ...", but then why would we allow retries if we are past the max? Also the comment mentions the established circuit clause, but not the hs_circ_service_get_intro_circ(ip); one. Can you please clarify that logic further? :/
We seem to have left some comments in cleanup_intro_points() that are only relevant to should_not_retry_intro_point().
OK this is much better. I really understand the logic now. Thanks!
That said, you changed the behavior of the code in the latest iteration which I was not expecting (I was just expecting documentation). In particular, the code now avoids removing the intro circ if hs_circ_service_get_intro_circ() is true (which makes sense!) whereas before it was doing the exact opposite.
How could you fix that bug without the unittest noticing at all? Can you please enrich the unittest so that it catches such bugs in the future?
To clarify, is the bug you are talking about related to removing the intro point if hs_circ_service_get_intro_circ() is true? If not, what is it?
Yep! Before 861ff757712d008424746e9d1e9bd85b3f472dee we used to return True if hs_circ_service_get_intro_circ(ip) is true. After that commit, we return False.
This seems like a behavior change and not just refactoring.
However, I believe my code is tested with the test routine test_service_event():
/* Now, we'll create an IP with a registered circuit. The IP object * shouldn't go away. */ ip = helper_create_service_ip(); service_intro_point_add(service->desc_current->intro_points.map, ip); ed25519_pubkey_copy(&circ->hs_ident->intro_auth_pk, &ip->auth_key_kp.pubkey); hs_circuitmap_register_intro_circ_v3_service_side( circ, &ip->auth_key_kp.pubkey); run_housekeeping_event(now); tt_int_op(digest256map_size(service->desc_current->intro_points.map), OP_EQ, 1); /* We'll mangle the IP object to expire. */ ip->time_to_expire = now; run_housekeeping_event(now); tt_int_op(digest256map_size(service->desc_current->intro_points.map), OP_EQ, 0);
That said, if you have a unittest that does not fail after changing the behavior of the function, your unittest is not testing the function well enough. That's what I tried to say at comment:18.
Anyhow, I think that's OK for now tho.
I will mark this as merge_ready, and put David as the merger so that he takes a second look as well.
Trac: Status: needs_review to merge_ready Keywords: N/Adeleted, dgoulet-merge added
if (hs_circ_service_get_intro_circ(ip) && over_max_retries) { return true; }
If we have a pending intro circuit for that ip and we are over the max retries, this means that the pending intro circuit is the one that will be used or will fail.
So returning true means that it will be removed from the service list and will lead to a lost lingering intro circuit that once established, we'll hit an error because we can't find the "ip" object.
If we have one established or in the process of establishing one, NEVER remove the IP is what I think we should do.
If we have no circuits at all and we are over the limit, remove it.