Potential issue with branch above: What happens if one side closes the circuit anyhow, while the other side (with the machine enabled) keeps it alive even if the other side is closed?
I'm OK with the purpose-based approach, but now I see that we are using both my "special flag" and also the "special purpose" so we moved from one special thing to two. In this case, can you please motivate the new purpose further in the docs of CIRCUIT_PURPOSE_C_CIRCUIT_PADDING. IIUC, the new purpose is so that Tor does not attach
new streams to the circuit, or are there more reasons?
If that's the only reason, perhaps it would be wise to ask feedback from Nick to see if a new purpose is the best way to achieve that goal.
Ok I switched to checking the purpose instead of the flag, and removed the flag.
While thinking about #28634 (moved), I started to realize we may need more here. We may want to flip this back to C_GENERAL isntead of closing it. Or at least think about how we would make this circuit behave more like it was real. Not sure about hidden service circuits, but general circuits are marked dirty and allowed to sit around for 10 minutes after use before being closed. I am not sure the best way to do this, though. Maybe we do need some kind of timer that does not send a packet but just causes a state transition.. Bleh.
I'm also getting more and more pessimistic about doing this or #28634 (moved) by Tuesday.. I think it is more important that what we deploy is sane and good rather than just does something. I think if we act early enough in 0.4.1 to make more reasonable machines, we'll have plenty of testing.
I'm also getting more and more pessimistic about doing this or #28634 (moved) by Tuesday.. I think it is more important that what we deploy is sane and good rather than just does something. I think if we act early enough in 0.4.1 to make more reasonable machines, we'll have plenty of testing.
Agreed. Let's relax and focus on #28142 (moved) till then.
Ok I think the best way to handle comment:8 is to not mark circuit for close upon transition to the end state. Then, circuit_expire_old_circuits_clientside() should cull them as per existing dirtyness timeouts. This is done in an extra fixup commit on https://github.com/torproject/tor/pull/644.
However, I am still going back and forth between purpose vs flag. Here's the Pro/Con:
Purpose:
Pros:
We don't have to worry about hidden service code and circuit prediction code getting confused by outstanding circuits of a particular type, with particular state, that they thought they closed.
In logs, controller handling, and everywhere else in the code, it is very clear what this circuit is doing and what it is for. Code is very unlikely to accidentally use it for something else.
Cons:
State machines must remember to include CIRCUIT_PURPOSE_C_PADDING in their purpose masks, or they will shut down as soon as the purpose changes and cause chaos when other machines pick up this purpose.
Changing purposes of things is a little scary. But we already do it; the code should support it (but: #29034 (moved) is one major outstanding instance where it does not properly do so).
Flag:
Pros:
Our padding machine conditions are simpler. They can apply to just the purpose they are for, and not worrying about adding this extra padding purpose if they happen to use the "please keep circ open" feature.
Cons:
Lots of wide-ranging code can be confused because it tried to close a circuit, but then that circuit stays around with the same purpose, and does not have its marked for close flag set. It may try to re-use these circuits in surprising places.
After this analysis, I am inclined to say that the purpose-based way is the way to go here. So I'm going to set this to needs_review. Please do let me know if there are other pros/cons.
ACK on the analysis. Can you please provide a clean branch rebased on master, so that it's easier to review it? That way I can also easily apply it as part of my #28634 (moved) to make sure it works fine.
What I see happening is that the client will send an INTRODUCE_ACK, then the above log will trigger and the circuit will stay open, but no more padding will be sent whatsoever. BTW, due to the purposes approach I also had to add the CIRCUIT_PURPOSE_C_CIRCUIT_PADDING purpose to the intro machine otherwise the machine would be shutdown when the purpose change happened.
You can test this in chutney, and if you want to test it on the real net you can use this MiddleNodes 56927E61B51E6F363FB55498150A6DDFCF7077F2 which is running an old version of the #28634 (moved) branch (commit 15e0f36). Testing this on chutney will illustrate the problem, but note that the histogram latencies are too slow for localhost so you won't see much padding.
Finally, I also get pathbias_should_count(): Bug: Circuit 3 is now being counted despite being ignored in the past. Purpose is Circuit kept open for padding, path state is already counted (on Tor 0.4.1.0-alpha-dev 2a0158f4a3873845).
Ok, I think I figured out some of your problems. That check for CIRCPAD_TOKEN_REMOVAL_NONE in circpad_machine_setup_tokens() is needed for the padding to work correctly. The root issue was misusing machineinfo in one of the places to inspect mi->histogram_len instead of state->histogram_len (hidden behind the CIRCPAD_INFINITY_BIN() macro).
I fixed these and a simple test works: it pads after not closing the intro circ. And I don't get pathbias errors (though I don't know why).
I do have a very strange unit test failure in your tests.. I think it's due to you switching to shutting down the machine.. But oddly if I check for null on the failing line that checks client_side->padding_info[0]->current_state, the check passes. Heisenbug.. Yay.
Anyway I didn't have to make any changes to the PR for this branch for this, just additional fixes effectively to your #28632 (moved) code. So I'm setting this back to needs_review.
I rebased your branch on origin/master and added a fixup commit to make it compile and add some more comments (and a useful log): https://github.com/torproject/tor/pull/933
Ok, I added a test for circuits_expire_old_circuits_clientside(), and also realized that we want to mark the circuit dirty if it wasn't already, so that we use the dirtiness timers instead of the idle ones (we want to mimic an in-use/used circuit for this).
I want to talk about this approach more before we merge it.
As I understand it, the idea here is that some circuits that would otherwise close should instead stay open longer, so that padding can be sent on them. This patch achieves that by adding a call inside circuit_mark_for_close to circpad_circuit_should_be_marked_for_close, which potentially overrides the decision to mark the circuit, and changes its purpose instead. Later, the circuit is supposed to get expired naturally when it times out.
Here are some things that worry me about the logic:
Intercepting circuit_mark_for_close() is pretty risky IMO. It no longer 'does what it says on the label' and instead might keep the circuit open. There are over 100 callers to circuit_mark_for_close(); I think it will be hard to audit them all.
The function circpad_circuit_should_be_marked_for_close() is named as if it were a predicate function, but in fact it changes the circuit's purpose in some cases. (Also, it doesn't answer the question "should this circuit be marked for close"; it answers "can this circuit be allowed to close, given that somebody else wants to mark it.")
The function circuit_expire_old_circuits_clientside(), which we are relying on to kill of circuits eventually, uses circuit_mark_for_close() to close the circuits it wants to expire, and circuit_mark_for_close is now potentially overridden.
Here's one possibility of we could do instead:
Rename the new purpose from C_CIRCUIT_PADDING to CIRCUIT_PADDING_SHUTDOWN to make it clear that the circuit's purpose is "sending padding until it dies."
Leave circuit_mark_for_close() untouched from the current codebase. Instead add a new function, circuit_transition_to_shutdown(). This function should either mark the circuit for close immediately by calling circuit_mark_for_close(), or should change the circuit's purpose to CIRCUIT_PADDING_SHUTDOWN.
Audit the codebase so that we change some calls from circuit_mark_for_close to circuit_transition_to_shutdown. We should only do this for calls that use REASON_FINISHED or REASON_NONE or REASON_IP_NOW_REDUNDANT today.
I want to talk about this approach more before we merge it.
As I understand it, the idea here is that some circuits that would otherwise close should instead stay open longer, so that padding can be sent on them. This patch achieves that by adding a call inside circuit_mark_for_close to circpad_circuit_should_be_marked_for_close, which potentially overrides the decision to mark the circuit, and changes its purpose instead. Later, the circuit is supposed to get expired naturally when it times out.
Here are some things that worry me about the logic:
Intercepting circuit_mark_for_close() is pretty risky IMO. It no longer 'does what it says on the label' and instead might keep the circuit open. There are over 100 callers to circuit_mark_for_close(); I think it will be hard to audit them all.
This same reasoning made us prefer to hook circuit_mark_for_close(). If we want to keep a circuit open for padding, preventing current and future code from somehow closing it via one of these 100 other codepaths seemed daunting... This is also the reasoning that caused me to prefer purpose changes to flags: when a purpose changes, both existing code and not-yet-written code will stop trying to use that circuit. But when a flag gets added, not everybody knows to check that flag before proceeding.
The function circpad_circuit_should_be_marked_for_close() is named as if it were a predicate function, but in fact it changes the circuit's purpose in some cases. (Also, it doesn't answer the question "should this circuit be marked for close"; it answers "can this circuit be allowed to close, given that somebody else wants to mark it.")
This is already the case since we added pathbias code.. It changes the purpose and sends a probe in some cases. Our circuit timeout code also changes circuit purpose to MEASUREMENT in a similar way, to hold them open long enough to get a better timeout estimate (at Roger's suggestion).
I see circuit purpose as a signifying component ownership of a circuit. By doing it this way, we're sending a loud and clear signal to the rest of the code that we've taken over responsibility to use, update, and free this circuit.
The function circuit_expire_old_circuits_clientside(), which we are relying on to kill of circuits eventually, uses circuit_mark_for_close() to close the circuits it wants to expire, and circuit_mark_for_close is now potentially overridden.
This was also a deliberate design artifact: to avoid dead parrot issues, we wanted to use the exact same timeout logic (and conditional timeout values based on that logic) to tear down the circuit, once circuit padding was done padding on it. (Said another way, we're telling the circuit_expire_old_circuits_clientside() code, other timeout code, and all other current and future components: "Hey, this is now the padding component's circuit. Please don't time it out or close it).
Here's one possibility of we could do instead:
Rename the new purpose from C_CIRCUIT_PADDING to CIRCUIT_PADDING_SHUTDOWN to make it clear that the circuit's purpose is "sending padding until it dies."
Hrmm.. I would rather have a purpose devoted to the padding subsystem as a whole, again as signifying ownership. In ticket:30092#comment:7, we realized that we may need to launch new circuits from the padding code eventually. The logic for keeping those alive will be the same as for these converted circuits (keep them alive until the padding machine is gone, and then use our normal circuit_expire_old_circuits_clientside() timers after that), and so using the same purpose for any padding-subsystem launched circuits also seems wise.
Leave circuit_mark_for_close() untouched from the current codebase. Instead add a new function, circuit_transition_to_shutdown(). This function should either mark the circuit for close immediately by calling circuit_mark_for_close(), or should change the circuit's purpose to CIRCUIT_PADDING_SHUTDOWN.
Future callers would have to know to use this new function in all relevant future patches. The number of people and components that have to learn about this change to maintain correctness for circuitpadding will become very large. This seems failure prone.
If we went this way, I would want to add asserts to circuit_mark_for_close() to prevent it from being called without a call to circuit_transition_to_shutdown() for those cases to future-proof it, which seems a little backwards. And I bet we'd still have breakage in the future.
Audit the codebase so that we change some calls from circuit_mark_for_close to circuit_transition_to_shutdown. We should only do this for calls that use REASON_FINISHED or REASON_NONE or REASON_IP_NOW_REDUNDANT today.
Again, we'd need lots of BUG macros and future-proof checks to guard against reverts in behavior.. This seems more fragile an approach.
How does that sound?
Can we back up and discuss the root reasons why you feel this approach risky? My read of the above is that your three concerns are: code clarity, memory leaks, and purpose changes. Is that right?
If we try to step back and look at the places where this code might cause memory leaks and other ownership-related issues, maybe we can come up with either tests or refactoring that directly addresses those two concerns?
Can we back up and discuss the root reasons why you feel this approach risky? My read of the above is that your three concerns are: code clarity, memory leaks, and purpose changes. Is that right?
Yes, I think these are legitimate reasons to worry about. I think another worry is that because of overriding mark_for_close() we might get into a situation where the body of the function actually never gets called, and circuits stay alive for ever. I find this worry reasonable because circpad_circuit_should_be_marked_for_close() is a non-trivial function and we should make sure it does not permanently block a circuit from getting closed. Given that we have no functional machine right now, testing that this code works as intended in the real network is hard and hence this worry is even more reasonable.
That said, I'm also not sure if Nick's suggested solution of using an intermediate function circuit_transition_to_shutdown() is good enough. As has been said, mark_for_close() is a function that is consumed all over the codebase and by every developer, and if we introduce our own intermediate function this means that all developers need to be aware of when the intermediate function should be used instead of mark_for_close(). I'm already having trouble thinking about where mark_for_close() should be replaced by transition_to_shutdown() in the current codebase, let alone having all developers keep this in mind for ever in the future. That's why I think having the circuitpadding subsystem keep track of this might make more sense, in a similar way that the pathbias subsystem uses pathbias_check_close() to track the same thing.
Here is a suggestion of how to meet in the middle here. We keep the current logic, but we also add an assert-like function that checks whether the invariants here are preserved as we wish them to be. We can call that function from assert_circuit_ok() and also from circuit_expire_old_circuits_clientside() and we make sure that if something is going wrong it warns loud enough that we get bug reports and notice it quickly.
Here is an example of a scenario that we should be checking: "If a circuit is alive and has a machine that manages its own lifetime, the machine should still be running, or if has ENDed and the circuit must have not expired yet.". Or, to catch errors that would let vanilla circuits never get marked for close, we could add an extra debug-field to a circuit that gets set in the beginning of mark_for_close() and we make sure that a circuit cannot linger around with that field if it does not have a machine that manages its own lifetime". These are just quick examples: to make good ones, we should think of the behavior we want from our circuits, and come up with the right invariants to preserve those behaviors.
Here is a suggestion and implementation plan for an invariant we could use here to minimize unseen bugs:
Make a new soft-assert function (e.g. assert_circuit_expiry_ok()) which gets called at the end of circuit_expire_old_circuits_clientside().
Also abstract the "has this circuit expired?" logic of circuit_expire_old_circuits_clientside() into its own function so that we can use it.
Go through the list of circuits: If a circuit is in CIRCUIT_PURPOSE_C_CIRCUIT_PADDING purpose, then examine it further.
Soft-assert that for a circuit to be in that purpose, it means that:
If it has no machine, then the circuit has not expired yet (using helper function above). With this we want to catch PADDING circuits whose machine got shutdown.
If there is a machine, then:
manage_circ_lifetime == 1
Machine has either not ENDed, or if it has ENDed the circuit has not expired yet (using helper function).
I think the above should guard us from most bugs that could result in PADDING circuits staying around for ever, as long as circuit_expire_old_circuits_clientside() indeed gets called periodically. Perhaps we can add another safeguard to make sure that the expiry function indeed gets called periodically.
Finally, the above logic is not particularly optimized for performance, as it does another loop over the circuit list. We could optimize it by doing it inline the circuit_expire_old_circuits_clientside() but we should make sure that it does not increase the tech-debt and complexity of the function.
Finally finally, let's add some robust documentation about this feature and how exactly it works, so that we dont forget in a few months.
May 08 13:34:07.858 [info] circpad_circuit_should_be_marked_for_close(): Circuit 4 is not marked for close because of a pending padding machine.May 08 13:21:01.602 [info] pathbias_should_count(): Bug: Circuit 4 is now being counted despite being ignored in the past. Purpose is Circuit kept open for padding, path state is already counted (on Tor 0.4.1.0-alpha-dev 4b09a5063381fc1c)
May 08 17:41:12.611 [notice] pathbias_mark_use_success(): Bug: Used circuit 18 is in strange path state new. Circuit is a Circuit kept open for padding currently open. (on Tor 0.4.1.0-alpha-dev 4b09a5063381fc1c)May 08 17:41:12.611 [notice] pathbias_count_use_attempt(): Bug: Used circuit is in strange path state new. Circuit is a Circuit kept open for padding currently open. (on Tor 0.4.1.0-alpha-dev 4b09a5063381fc1c)
I'm also getting this Bug on my #28634 (moved) branch with hte latest #28780 (moved) revision:
{{{
May 08 13:34:07.858 [info] circpad_circuit_should_be_marked_for_close(): Circuit 4 is not marked for close because of a pending padding machine.
May 08 13:21:01.602 [info] pathbias_should_count(): Bug: Circuit 4 is now being counted despite being ignored in the past. Purpose is Circuit kept open for padding, path
state is already counted (on Tor 0.4.1.0-alpha-dev 4b09a5063381fc1c)
}}}
{{{
May 08 17:41:12.611 [notice] pathbias_mark_use_success(): Bug: Used circuit 18 is in strange path state new. Circuit is a Circuit kept open for padding currently open. (on Tor 0.4.1.0-alpha-dev 4b09a5063381fc1c)
May 08 17:41:12.611 [notice] pathbias_count_use_attempt(): Bug: Used circuit is in strange path state new. Circuit is a Circuit kept open for padding currently open. (on Tor 0.4.1.0-alpha-dev 4b09a5063381fc1c)
}}}
Are these two log messages preceded by a similar pathbias_should_count() message as the one above? This looks like another consequence of the first issue.
I'm still thinking about overall approach here. I am leaning towards using approx_time() to cheaply assert that PADDING circuits don't ever sit around unused for more than their 1.25 hour delay period, and about how to refactor circuit_expire_old_circuits_clientside() and/or circuit_mark_for_close without risking more pathbias or other issues.
With these two commits, it should be much easier to verify that it is not possible for circpad to hold open a circuit if more than CIRCPAD_DELAY_INFINITE==UNIT32_MAX microseconds (or about 1.25 hours) pass with no circuit cell delivery events happening. I have not written tests for this yet, but if we like this approach, I can.
I am open to adding additional checks to circuit_expire_old_circuits_clientside(), but I want to temperature check how people felt about this handbrake-style lifespan check in the first place, because that's the type of thing I'd add to circuit_expire_old_circuit_clientside() first.
I can still be persuaded to eliminate circuit_mark_for_close() and make it super clear for callers that they must pick between a new error-close version and a differently-named normal-close version, and have each version assert if they are called with reason codes that should be used with the other version, but that change will be invasive and I am not sure that will actually save us from circuit-leak errors (which will actually arise from circuit_expire_old_circuits_clientside()).
With these two commits, it should be much easier to verify that it is not possible for circpad to hold open a circuit if more than CIRCPAD_DELAY_INFINITE==UNIT32_MAX microseconds (or about 1.25 hours) pass with no circuit cell delivery events happening. I have not written tests for this yet, but if we like this approach, I can.
I am open to adding additional checks to circuit_expire_old_circuits_clientside(), but I want to temperature check how people felt about this handbrake-style lifespan check in the first place, because that's the type of thing I'd add to circuit_expire_old_circuit_clientside() first.
I can still be persuaded to eliminate circuit_mark_for_close() and make it super clear for callers that they must pick between a new error-close version and a differently-named normal-close version, and have each version assert if they are called with reason codes that should be used with the other version, but that change will be invasive and I am not sure that will actually save us from circuit-leak errors (which will actually arise from circuit_expire_old_circuits_clientside()).
Hey Mike,
I like this approach but I would like to hear Nick's opinion too.
I think the general concept here is general enough to detect all types of "circuit left open for ever" failures. Unittests on this specific functionality would be useful to get more confidence, and we should also test it on the real net.
FWIW, I did some testing with the latest of this branch and #28634 (moved), and I can confirm that this branch will eventually close the padding circuit as intended.
FWIW, I did some testing with the latest of this branch and #28634 (moved), and I can confirm that this branch will eventually close the padding circuit as intended.
Out of curiosity, did you discover this through a malfunctioning machine that deadlocked and hit the 1.25 hour timer, or with a well-behaved machine?
I have been brainstorming for other failure modes, and I realized that malfunctioning machines can get caught in infinite padding loops -- as long as cells are being sent to each other, they will stick around (because that last_cell_time_sec field will keep getting updated due to the sent and/or received cells). I think this type of error is best safeguarded against by specifying padding limits on the machines, though, as I just suggested in ticket:28634#comment:23.
I will add unit tests to cover this new expiry timer, and I will think more if anything can be tweaked in circuit_expire_old_circuits_clientside(), but right now I'm not seeing anything that will further address the immediate concerns of additional circuit-leak risk and code clarity, but I'm open to suggestions.
I am tempted to say that refactoring how circuit_mark_for_close vs circuit expiry vs circuit build timeout vs measurement circuits vs pathbias vs padding takeover vs cannibalization vs other purpose changes all interact with circuit shutdown should be done in a modularization-sponsored ticket. There are lot of things that could benefit from some refactoring wrt circuit ownership, timeout, expiry, and shutdown paths, and I don't think it's the best move to rabbit hole down them under Sponsor2 if we can keep this particular change low-risk.
Still hooks circuit_mark_for_close(), but the circpad_is_using_circuit_for_padding() function name and mechanism are much clearer.
Only one return branch in the circpad_is_using_circuit_for_padding() function keeps the circuit open. All others result in relinquishing ownership to circuit_mark_for_close()/circuit_expire_old_circuits_clientside().
It is easy to see that this one branch cannot hold the circuit open for more than CIRCPAD_DELAY_MAX_SECS (1.25hr) of inactivity on a circuit, even if it is deadlocked. This is also verified in the unit tests, by testing circuit_expire_old_circuits_clientside() directly.
If padding machines deadlock by ping-ponging (sending padding back and forth forever), then they will hit their machine-specific padding overhead limits, and cease padding. At which point, network activity on the circuit will cease, and circpad_is_using_circuit_for_padding() will return 0 and allow circuit_expire_old_circuits_clientside() to close the circuit (machine-specific padding limits are tested in a previously existing test).
Thanks Nick! I replied with the specific fixup commit hashes on your review in the old PR, but I had to open a new PR due to merge/practracker conflicts: https://github.com/torproject/tor/pull/1023
(I think we should make a new ticket for figuring out a good pattern to refactor circuit build timeout, circuit idle expiry, circuit dirty expiry, normal circuit close, cannibalization, purpose/ownership changes, and error-driven close into a new module/design pattern, and then that one could support some kind of API for stuff like this, but I still think this approach is pretty narrow and controlled given all of the other snakes in this pile).