begindir connections advance directly to the CLIENT_SENDING state, even before their TLS connection is created.
So we need to modify the logic from #4483 (moved) that checks for consensuses that are downloading:
is it a direct connection in state after connecting, or
a linked connection that's attached, or
a linked connection that's in a state after sending
And then do the standard "extra consensus download check" when we're about to link a directory connection to a TLS connection.
Or arma may have a better idea.
I am not sure whether we should fix this in 0.2.8, because the current code works, but it's not optimal when too many TLS connections hang.
Tagging it just in case.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
I believe the current code "works" in that it never generates any extra consensus fetches. So the functionality intended from #4483 (moved) is not present in 0.2.8.
I am not sure whether we will want to fix this in 0.2.8 -- I think implementing #4483 (moved) will be quite some lines of diff.
But I'm going to figure that one out later. In the mean time I'm accumulating patches on my bug18809 branch.
I believe the current code "works" in that it never generates any extra consensus fetches. So the functionality intended from #4483 (moved) is not present in 0.2.8.
This isn't quite correct: the current code schedules additional consensus fetches more often when connections fail fast.
But for slow or blackholed connections, it won't generate simultaneous connections.
So it's an improvement over the 0.2.7 behaviour, but it doesn't fully implement the proposal.
I am not sure whether we will want to fix this in 0.2.8 -- I think implementing #4483 (moved) will be quite some lines of diff.
I think we can fix most of this by checking for an existing consensus downloads in connection_dir_finished_flushing (#18808 (moved)) and before we call connection_ap_attach_pending (this bug).
Then we modify the code that counts consensus downloads to account for begindir requests, as per the description in the ticket. (This probably means some refactoring to do this counting in one place.)
This should leave us with much cleaner code, because the complex checking & counting logic will be in one place.
I think we can fix anything else in 0.2.9.
But I'm going to figure that one out later. In the mean time I'm accumulating patches on my bug18809 branch.
I tested it in a variety of ways, and it seems to work as intended.
I left the unit tests a broken mess though. :/ If we like the changes here, we can clean up the unit tests.
My next trick is going to be locking down all the situations where clients that ought to be doing begindirs might instead end up doing a naked dirport connection (e.g. #18483 (moved)). I think each of those cases is a mistake and we should fix it. The reason I want to do it here is because now #4483 (moved) plus this branch likely has some edge cases where we could open naked dirport consensus fetches in parallel, and not notice to cancel them. I would rather fix that bug at the root rather than complicate this patch with it.
DG2: There is an awful lot of code removal in there but the commit message doesn't tell me why nor what is being removed here. This is an interesting change, we go from a callback every second trying to cleanup to a function call only done when we successfully receive a consensus.
The whole logic of this patch seems OK to me. Let's fix those unit tests :).
DG1: Just after we set answer = 1, we should break. There is no need to continue the loop.
Sounds good. I've pushed another commit that does this.
Commit a9012a61fafae3ca8c6ae40ba5f433dcf12aad71
DG2: There is an awful lot of code removal in there but the commit message doesn't tell me why nor what is being removed here. This is an interesting change, we go from a callback every second trying to cleanup to a function call only done when we successfully receive a consensus.
Yes, correct. The comments for the new connection_dir_close_consensus_fetches() function are probably the best description of what's going on here. Should I try to clarify more?
The whole logic of this patch seems OK to me. Let's fix those unit tests :).
If somebody wants to grab this and fix the unit tests, please do. :)
Begin code review for arma's bug18809 branch:525307c0ea5717adba9f35cf7ccb1c5108ac3440: - Looks fine, just typo fixes33335b803c346a49513e4c64848342eedb6de77c: - I think this is correct. - Maybe break out of the loop in networkstatus_consensus_is_already_downloading() for efficiency after answer = 1?5c81825104f063c69ba17a5b2545f43756d534d6: - Looks good to me.a9012a61fafae3ca8c6ae40ba5f433dcf12aad71: - IIUC, we're replacing a complicated algorithm for picking which active connection to keep meant to run periodically with one that takes which one to keep passed in as an argument and triggered from various places that can cause us to want to kill existing connections - like successfully downloading the consensus, leading to all this simplification and deleted code. - I think this is all okay.7df6a08ff84a3925ce4fe3c49b86d2033c4e3d58: - We're removing a lot of places a connection could get closed; objective is to make sure they hang around until something succeeds in downloading the consensus, as per the invocation of connection_dir_close_consensus_fetches() in a9012a61fafae3ca8c6ae40ba5f433dcf12aad71, I presume? - I suppose if everything just hangs and never succeeds, they eventually go away due to network timeouts anyway. - I think this is okay.6642bc346edb30f068c554fa674062930ad315c0: - This looks fine.015472016cb6c328a80b02da14f16be05f208ebb: - Looks good to me.26b64a79bed253bf08b3d65fa360419af2c0065c: - This looks okay.005617c92142dcaa7c76152d3e1c007585bf81cc: - This looks okay.0a7c177336dab1fdb04e46e33230191bb4834793: - Okay, this covered my suggestion in 33335b803c346a49513e4c64848342eedb6de77c.I think this is fine to merge provided it's been tested in Chutney or equivalentthat covers all the roles a Tor process can bootstrap in.End code review
ce8266d fix typos/etc before i go nuts on #18809 (moved)
function rename only
91c5801 avoid following through on a consensus fetch if we have one already arriving
Suffers from a race condition where two connections could attach at the same time. But this is unlikely and relatively benign: we'll just fetch the consensus twice.
59da060 use the new function here too
this is refactoring I should have done
a7665df close other consensus fetches when we get a consensus
There were two race conditions avoided by the previous code:
the old function closed the connection immediately to avoid wasting data. But this meant that if a connection was receiving data or undergoing other operations, there could be errors. I think calling connection_mark_for_close is better, even if it's less efficient.
connections can be initiated after connection_dir_list_by_purpose_and_resource() is called. These connections are never cleaned up. I think this is fixed by checking when we attach a stream in 91c5801.
e230e80 get rid of the scattered checks to cancel a consensus fetch
these look good, because we catch it in attach now
bcae392 avoid another redundant check
good, but I wonder if we'll want networkstatus_consensus_is_downloading_usable_flavor in future. probably not worth keeping it around
c98fbd4 remove some more unused code
probably not worth keeping it around, but I do wonder if we'll want these functions
d5a9628 simplify more -- we only call these funcs when bootstrapping
makes sense, but you'd better document that those functions should only ever be called when bootstrapping
1f72653 fix a bug where relays would use the aggressive client bootstrapping retry number
I think the correct fix for this is to check server_mode(options) in consensus_max_download_tries(), and return options->TestingConsensusMaxDownloadTries if it's true.
Otherwise, this bug is just waiting to happen again.
aa6341d stop looping once we know what the answer will be
early exit FTW
53aaed8 get rid of another no-longer-used function
looks ok
507883e rip out the unit tests for the functions i removed
Please leave the parts of the unit tests that test existing functions, or remove those functions if they are unused:
91c5801 avoid following through on a consensus fetch if we have one already arriving
Suffers from a race condition where two connections could attach at the same time. But this is unlikely and relatively benign: we'll just fetch the consensus twice.
Can you elaborate on this one? This part of Tor is single-threaded, so they can't attach at the same time. One of them attaches before the other, and the first one to win gets immediately moved to state AP_CONN_STATE_CONNECT_WAIT as soon as it's chosen, so the second one will know that there's already a winner.
a7665df close other consensus fetches when we get a consensus
There were two race conditions avoided by the previous code:
the old function closed the connection immediately to avoid wasting data. But this meant that if a connection was receiving data or undergoing other operations, there could be errors. I think calling connection_mark_for_close is better, even if it's less efficient.
The trouble is that connection_mark_for_close() just closes the dir conn -- not the underlying tls connection, and so it doesn't undo any begin cell already sent onto the network. So we can close the consensus request, but if it already sent its begin cell, we will still get a consensus coming back at us -- we'll just drop all the data cells because we won't be expecting it anymore. This behavior is similar to the problem in bug #16844 (moved).
But to be clear, this new function connection_dir_close_consensus_fetches() is aiming to find and kill directory requests that haven't attached to a circuit yet. There should not be any directory requests that are attached to a circuit (i.e. fetching a consensus) at this point -- they would have been noticed inside 91c5801 and stopped there.
connections can be initiated after connection_dir_list_by_purpose_and_resource() is called. These connections are never cleaned up. I think this is fixed by checking when we attach a stream in 91c5801.
You mean the connection_dir_list_by_purpose_and_resource() call inside connection_dir_close_consensus_fetches()? That function is only called after we've called networkstatus_set_current_consensus(), so we should have a consensus in place. But! What about the case where we got a consensus, but we don't have the certs yet for verifying it? Then networkstatus_consensus_is_bootstrapping() will return 1, yes? Crappo. Does this mean we want to modify networkstatus_consensus_is_bootstrapping() so it says no if there is a consensus but it's in waiting-for-certs limbo? After all, we don't want to start launching more consensus fetches in that situation.
bcae392 avoid another redundant check
good, but I wonder if we'll want networkstatus_consensus_is_downloading_usable_flavor in future. probably not worth keeping it around
Right, we can bring it back if we want it one day.
d5a9628 simplify more -- we only call these funcs when bootstrapping
makes sense, but you'd better document that those functions should only ever be called when bootstrapping
The functions are named update_consensus_bootstrap_multiple_downloads and update_consensus_bootstrap_attempt_downloads(), and also their function comments start with "When we're bootstrapping, ". Should we do more?
1f72653 fix a bug where relays would use the aggressive client bootstrapping retry number
I think the correct fix for this is to check server_mode(options) in consensus_max_download_tries(), and return options->TestingConsensusMaxDownloadTries if it's true.
Otherwise, this bug is just waiting to happen again.
Hm? I was actually thinking of just removing consensus_max_download_tries() entirely, since it is only called from two places, and one of the places wants the first clause of the function, and the other place wants the last clause of the function. I guess this is my instinct to simplify rather than complexify.
507883e rip out the unit tests for the functions i removed
Please leave the parts of the unit tests that test existing functions, or remove those functions if they are unused:
connection_dir_list_by_purpose_and_resource
connection_dir_count_by_purpose_and_resource
They are still used. Would you find it easy to add back whichever parts of the unit tests you want to keep? Or throw out this last commit and do whatever you think is smart? :) Thanks!
91c5801 avoid following through on a consensus fetch if we have one already arriving
Suffers from a race condition where two connections could attach at the same time. But this is unlikely and relatively benign: we'll just fetch the consensus twice.
Can you elaborate on this one? This part of Tor is single-threaded, so they can't attach at the same time. One of them attaches before the other, and the first one to win gets immediately moved to state AP_CONN_STATE_CONNECT_WAIT as soon as it's chosen, so the second one will know that there's already a winner.
Oops, I always imagine more threads in Tor than what we actually have.
a7665df close other consensus fetches when we get a consensus
There were two race conditions avoided by the previous code:
the old function closed the connection immediately to avoid wasting data. But this meant that if a connection was receiving data or undergoing other operations, there could be errors. I think calling connection_mark_for_close is better, even if it's less efficient.
The trouble is that connection_mark_for_close() just closes the dir conn -- not the underlying tls connection, and so it doesn't undo any begin cell already sent onto the network. So we can close the consensus request, but if it already sent its begin cell, we will still get a consensus coming back at us -- we'll just drop all the data cells because we won't be expecting it anymore. This behavior is similar to the problem in bug #16844 (moved).
But to be clear, this new function connection_dir_close_consensus_fetches() is aiming to find and kill directory requests that haven't attached to a circuit yet. There should not be any directory requests that are attached to a circuit (i.e. fetching a consensus) at this point -- they would have been noticed inside 91c5801 and stopped there.
Ok, sounds sensible.
connections can be initiated after connection_dir_list_by_purpose_and_resource() is called. These connections are never cleaned up. I think this is fixed by checking when we attach a stream in 91c5801.
You mean the connection_dir_list_by_purpose_and_resource() call inside connection_dir_close_consensus_fetches()? That function is only called after we've called networkstatus_set_current_consensus(), so we should have a consensus in place. But! What about the case where we got a consensus, but we don't have the certs yet for verifying it? Then networkstatus_consensus_is_bootstrapping() will return 1, yes? Crappo. Does this mean we want to modify networkstatus_consensus_is_bootstrapping() so it says no if there is a consensus but it's in waiting-for-certs limbo? After all, we don't want to start launching more consensus fetches in that situation.
Yes, I think we'll have to modify this part.
d5a9628 simplify more -- we only call these funcs when bootstrapping
makes sense, but you'd better document that those functions should only ever be called when bootstrapping
The functions are named update_consensus_bootstrap_multiple_downloads and update_consensus_bootstrap_attempt_downloads(), and also their function comments start with "When we're bootstrapping, ". Should we do more?
Nope.
1f72653 fix a bug where relays would use the aggressive client bootstrapping retry number
I think the correct fix for this is to check server_mode(options) in consensus_max_download_tries(), and return options->TestingConsensusMaxDownloadTries if it's true.
Otherwise, this bug is just waiting to happen again.
Hm? I was actually thinking of just removing consensus_max_download_tries() entirely, since it is only called from two places, and one of the places wants the first clause of the function, and the other place wants the last clause of the function. I guess this is my instinct to simplify rather than complexify.
Then let's split it up.
507883e rip out the unit tests for the functions i removed
Please leave the parts of the unit tests that test existing functions, or remove those functions if they are unused:
connection_dir_list_by_purpose_and_resource
connection_dir_count_by_purpose_and_resource
They are still used. Would you find it easy to add back whichever parts of the unit tests you want to keep? Or throw out this last commit and do whatever you think is smart? :) Thanks!
I'd keep the bits of the unit tests that test these functions.
I'll let you make your changes first.
I have added 2 minor changes requested by arma to arma's bug18809_028. I also modified and added unit tests, and added a changes file.
The commits are:
d7029ed Stop downloading consensuses when a consensus has been downloaded
arma's "What about the case where we got a consensus, but we don't have the certs yet for verifying it?"
249f39d Changes file for bug 18809
0c23a91 Remove consensus_max_download_tries by refactoring
arma's "I was actually thinking of just removing consensus_max_download_tries() entirely"
38e52c6 Revert "rip out the unit tests for the functions i removed"
arma's "Would you find it easy to add back whichever parts of the unit tests you want to keep? Or throw out this last commit and do whatever you think is smart?"
reverts arma's 507883e
24915a1 Update unit tests for multiple bootstrap connections
removes some of the unit tests, fixes others
8e70ca8 Restore and improve download schedule unit tests
someone #if 0'd out these tests, maybe it was even me!
4ce6a94 fixup! Update unit tests for multiple bootstrap connections
that test line actually belongs in the next commit in a different place
2cd4b2b Add unit tests for networkstatus_consensus_is_bootstrapping
6a52f0c fixup! fixup! Update unit tests for multiple bootstrap connections
make check-spaces has its revenge
We should squash this branch and remove the revert and its original before merging.
(I haven't looked into fixing #19074 (moved) yet. And depending on how hard it is, we might want to deal with it separately.)
As the bulk of this patch has had 3 reviews (dgoulet, andrea, teor), and arma is busy, I'm throwing it straight to nickm.
Trac: Actualpoints: medium to 4 Status: needs_revision to merge_ready
I hate making this function more complex. It is already far too complex. But I won't hold off the patch for that.
Here's a case I needed to think about:
+ * As soon as we have received a consensus, return 0, even if we don't have+ * enough certificates to validate it. */
Suppose that a fallback gives us a bogus consensus whose certificates are all bad. When will we notice, throw the consensus away, and try to fetch another one? Having a comment that explains the process here would ease my mind, and the mind of the next person who tries to read the code. :)
Ug, there is a huge bunch of conflicts when I try to merge this to master. They seem to be caused by the fact that we already fixed the "boostrapping" spelling mistake on master. Please have a look at the big merge commit (d718c717a67051cb1da2a1f80c70401d0d10b374) before I close this ticket and make sure I got it right?
I think everything is ok, because git show d718c717a67051cb1da2a1f80c70401d0d10b374 gives me an empty patch. git show on a merge commit uses git diff-tree —cc, which removes hunks which match either parent. So I think we're ok here.
(Is there another git command I should use? I tried git diff-tree but that gave me the same results.)
Also, the functions are named correctly and all the unit tests pass, as does make test-network-all.
My branch fix18809-comment explains that a consensus that's waiting for certificates fails after 20 minutes. And then we try again. It's a comment-only change.
Jenkins mingw fails with unused-but-set-variable warnings:
15:34:57 src/test/test_connection.c:660:17: error: variable 'ap_conn4' set but not used [-Werror=unused-but-set-variable]15:34:57 src/test/test_connection.c:658:17: error: variable 'ap_conn2' set but not used [-Werror=unused-but-set-variable]15:34:57 src/test/test_connection.c:655:21: error: variable 'conn3' set but not used [-Werror=unused-but-set-variable]
Please try my branch fix18809-warnings, which includes the warning fix and the comment fix.
(I don't have a compiler that does -Wunused-but-set-variable, so we'll have to throw this one at jenkins.)