Opened 3 years ago

Closed 3 years ago

#18809 closed defect (fixed)

Handle linked connections better during bootstrap

Reported by: teor Owned by: teor
Priority: Medium Milestone: Tor: 0.2.8.x-final
Component: Core Tor/Tor Version: Tor: 0.2.8.1-alpha
Severity: Normal Keywords: must-fix-before-028-rc, TorCoreTeam201605, review-group-1, must-fix-before-0283
Cc: Actual Points: 4
Parent ID: Points: 3
Reviewer: andrea Sponsor: SponsorS-can

Description

begindir connections advance directly to the CLIENT_SENDING state, even before their TLS connection is created.

So we need to modify the logic from #4483 that checks for consensuses that are downloading:

  • is it a direct connection in state after connecting, or
  • a linked connection that's attached, or
  • a linked connection that's in a state after sending

And then do the standard "extra consensus download check" when we're about to link a directory connection to a TLS connection.

Or arma may have a better idea.

I am not sure whether we should fix this in 0.2.8, because the current code works, but it's not optimal when too many TLS connections hang.
Tagging it just in case.

Child Tickets

TicketStatusOwnerSummaryComponent
#18483closedClients should always tunnel connections, and never fall back to a DirPortCore Tor/Tor
#18808closedteorCheck if connection is excess in connection_dir_finished_flushingCore Tor/Tor

Change History (37)

comment:1 Changed 3 years ago by teor

We'll want to do the check in circuit_has_opened, before we call circuit_try_attaching_streams, which calls connection_ap_attach_pending.

Last edited 3 years ago by teor (previous) (diff)

comment:2 Changed 3 years ago by arma

I believe the current code "works" in that it never generates any extra consensus fetches. So the functionality intended from #4483 is not present in 0.2.8.

I am not sure whether we will want to fix this in 0.2.8 -- I think implementing #4483 will be quite some lines of diff.

But I'm going to figure that one out later. In the mean time I'm accumulating patches on my bug18809 branch.

comment:3 in reply to:  2 Changed 3 years ago by teor

Replying to arma:

I believe the current code "works" in that it never generates any extra consensus fetches. So the functionality intended from #4483 is not present in 0.2.8.

This isn't quite correct: the current code schedules additional consensus fetches more often when connections fail fast.

But for slow or blackholed connections, it won't generate simultaneous connections.

So it's an improvement over the 0.2.7 behaviour, but it doesn't fully implement the proposal.

I am not sure whether we will want to fix this in 0.2.8 -- I think implementing #4483 will be quite some lines of diff.

I think we can fix most of this by checking for an existing consensus downloads in connection_dir_finished_flushing (#18808) and before we call connection_ap_attach_pending (this bug).

Then we modify the code that counts consensus downloads to account for begindir requests, as per the description in the ticket. (This probably means some refactoring to do this counting in one place.)

This should leave us with much cleaner code, because the complex checking & counting logic will be in one place.

I think we can fix anything else in 0.2.9.

But I'm going to figure that one out later. In the mean time I'm accumulating patches on my bug18809 branch.

Looking forward to it.

comment:4 Changed 3 years ago by arma

Status: newneeds_review

Ok, my bug18809 branch is ready for review.

I tested it in a variety of ways, and it seems to work as intended.

I left the unit tests a broken mess though. :/ If we like the changes here, we can clean up the unit tests.

My next trick is going to be locking down all the situations where clients that ought to be doing begindirs might instead end up doing a naked dirport connection (e.g. #18483). I think each of those cases is a mistake and we should fix it. The reason I want to do it here is because now #4483 plus this branch likely has some edge cases where we could open naked dirport consensus fetches in parallel, and not notice to cancel them. I would rather fix that bug at the root rather than complicate this patch with it.

comment:5 Changed 3 years ago by nickm

Keywords: TorCoreTeam201605 added

These 0.2.8 items really should get handled in May. Or earlier.

comment:6 Changed 3 years ago by teor

Reviewer: teor

comment:7 Changed 3 years ago by nickm

Calling all non-needs_information tickets for May.

comment:8 Changed 3 years ago by dgoulet

  • Commit 33335b8

DG1: Just after we set answer = 1, we should break. There is no need to continue the loop.

+        !AP_CONN_STATE_IS_UNATTACHED(base->linked_conn->state))
+      answer = 1;
  • Commit a9012a61fafae3ca8c6ae40ba5f433dcf12aad71

DG2: There is an awful lot of code removal in there but the commit message doesn't tell me why nor what is being removed here. This is an interesting change, we go from a callback every second trying to cleanup to a function call only done when we successfully receive a consensus.

The whole logic of this patch seems OK to me. Let's fix those unit tests :).

comment:9 Changed 3 years ago by andrea

Owner: set to andrea
Status: needs_reviewassigned

comment:10 in reply to:  8 Changed 3 years ago by arma

Replying to dgoulet:

DG1: Just after we set answer = 1, we should break. There is no need to continue the loop.

Sounds good. I've pushed another commit that does this.

  • Commit a9012a61fafae3ca8c6ae40ba5f433dcf12aad71

DG2: There is an awful lot of code removal in there but the commit message doesn't tell me why nor what is being removed here. This is an interesting change, we go from a callback every second trying to cleanup to a function call only done when we successfully receive a consensus.

Yes, correct. The comments for the new connection_dir_close_consensus_fetches() function are probably the best description of what's going on here. Should I try to clarify more?

The whole logic of this patch seems OK to me. Let's fix those unit tests :).

If somebody wants to grab this and fix the unit tests, please do. :)

comment:11 Changed 3 years ago by teor

Status: assignedneeds_review

comment:12 Changed 3 years ago by andrea

Begin code review for arma's bug18809 branch:

525307c0ea5717adba9f35cf7ccb1c5108ac3440:
 - Looks fine, just typo fixes

33335b803c346a49513e4c64848342eedb6de77c:
 - I think this is correct.
 - Maybe break out of the loop in networkstatus_consensus_is_already_downloading()
   for efficiency after answer = 1?

5c81825104f063c69ba17a5b2545f43756d534d6:
 - Looks good to me.

a9012a61fafae3ca8c6ae40ba5f433dcf12aad71:
 - IIUC, we're replacing a complicated algorithm for picking which active
   connection to keep meant to run periodically with one that takes which
   one to keep passed in as an argument and triggered from various places
   that can cause us to want to kill existing connections - like successfully
   downloading the consensus, leading to all this simplification and deleted
   code.
 - I think this is all okay.

7df6a08ff84a3925ce4fe3c49b86d2033c4e3d58:
 - We're removing a lot of places a connection could get closed; objective
   is to make sure they hang around until something succeeds in downloading
   the consensus, as per the invocation of connection_dir_close_consensus_fetches()
   in a9012a61fafae3ca8c6ae40ba5f433dcf12aad71, I presume?
 - I suppose if everything just hangs and never succeeds, they eventually go
   away due to network timeouts anyway.
 - I think this is okay.

6642bc346edb30f068c554fa674062930ad315c0:
 - This looks fine.

015472016cb6c328a80b02da14f16be05f208ebb:
 - Looks good to me.

26b64a79bed253bf08b3d65fa360419af2c0065c:
 - This looks okay.

005617c92142dcaa7c76152d3e1c007585bf81cc:
 - This looks okay.

0a7c177336dab1fdb04e46e33230191bb4834793:
 - Okay, this covered my suggestion in 33335b803c346a49513e4c64848342eedb6de77c.

I think this is fine to merge provided it's been tested in Chutney or equivalent
that covers all the roles a Tor process can bootstrap in.

End code review

comment:13 Changed 3 years ago by arma

Status: needs_reviewmerge_ready

Great. I think it's ready to merge then.

There's more to be done later, but this ticket will be a good move forward.

comment:14 Changed 3 years ago by arma

Owner: changed from andrea to arma
Reviewer: teordgoulet andrea
Status: merge_readyassigned

comment:15 Changed 3 years ago by arma

Status: assignedneeds_review

comment:16 Changed 3 years ago by arma

Status: needs_reviewmerge_ready

comment:17 Changed 3 years ago by arma

Actual Points: medium
Points: medium
Sponsor: SponsorS-can

comment:18 Changed 3 years ago by arma

Status: merge_readyneeds_revision

Ha ha, not merge ready! We need to fix the unit tests first. And then maybe review those fixes too.

comment:19 Changed 3 years ago by arma

I now have a bug18809_028 branch which has these commits applied to the maint-0.2.8 branch.

This is the branch where we should add the unit test fixes.

Last edited 3 years ago by arma (previous) (diff)

comment:20 Changed 3 years ago by arma

Status: needs_revisionneeds_review

I have added two new commits -- one to remove another function that we no longer use, and one to rip out all the unit tests from the stuff I removed.

I think that means we're ready for more review.

comment:21 Changed 3 years ago by nickm

Keywords: review-group-1 added

comment:22 Changed 3 years ago by dgoulet

Reviewer: dgoulet andreaandrea

comment:23 Changed 3 years ago by teor

Status: needs_reviewneeds_revision

Code review:

ce8266d fix typos/etc before i go nuts on #18809
function rename only

91c5801 avoid following through on a consensus fetch if we have one already arriving
Suffers from a race condition where two connections could attach at the same time. But this is unlikely and relatively benign: we'll just fetch the consensus twice.

59da060 use the new function here too
this is refactoring I should have done

a7665df close other consensus fetches when we get a consensus
There were two race conditions avoided by the previous code:

  • the old function closed the connection immediately to avoid wasting data. But this meant that if a connection was receiving data or undergoing other operations, there could be errors. I think calling connection_mark_for_close is better, even if it's less efficient.
  • connections can be initiated after connection_dir_list_by_purpose_and_resource() is called. These connections are never cleaned up. I think this is fixed by checking when we attach a stream in 91c5801.

e230e80 get rid of the scattered checks to cancel a consensus fetch
these look good, because we catch it in attach now

bcae392 avoid another redundant check
good, but I wonder if we'll want networkstatus_consensus_is_downloading_usable_flavor in future. probably not worth keeping it around

c98fbd4 remove some more unused code
probably not worth keeping it around, but I do wonder if we'll want these functions

d5a9628 simplify more -- we only call these funcs when bootstrapping
makes sense, but you'd better document that those functions should only ever be called when bootstrapping

1f72653 fix a bug where relays would use the aggressive client bootstrapping retry number

I think the correct fix for this is to check server_mode(options) in consensus_max_download_tries(), and return options->TestingConsensusMaxDownloadTries if it's true.

Otherwise, this bug is just waiting to happen again.

aa6341d stop looping once we know what the answer will be
early exit FTW

53aaed8 get rid of another no-longer-used function
looks ok

507883e rip out the unit tests for the functions i removed
Please leave the parts of the unit tests that test existing functions, or remove those functions if they are unused:

  • connection_dir_list_by_purpose_and_resource
  • connection_dir_count_by_purpose_and_resource

comment:24 in reply to:  23 ; Changed 3 years ago by arma

Replying to teor:

Code review:

Thanks for the review!

91c5801 avoid following through on a consensus fetch if we have one already arriving
Suffers from a race condition where two connections could attach at the same time. But this is unlikely and relatively benign: we'll just fetch the consensus twice.

Can you elaborate on this one? This part of Tor is single-threaded, so they can't attach at the same time. One of them attaches before the other, and the first one to win gets immediately moved to state AP_CONN_STATE_CONNECT_WAIT as soon as it's chosen, so the second one will know that there's already a winner.

a7665df close other consensus fetches when we get a consensus
There were two race conditions avoided by the previous code:

  • the old function closed the connection immediately to avoid wasting data. But this meant that if a connection was receiving data or undergoing other operations, there could be errors. I think calling connection_mark_for_close is better, even if it's less efficient.

The trouble is that connection_mark_for_close() just closes the dir conn -- not the underlying tls connection, and so it doesn't undo any begin cell already sent onto the network. So we can close the consensus request, but if it already sent its begin cell, we will still get a consensus coming back at us -- we'll just drop all the data cells because we won't be expecting it anymore. This behavior is similar to the problem in bug #16844.

But to be clear, this new function connection_dir_close_consensus_fetches() is aiming to find and kill directory requests that haven't attached to a circuit yet. There should not be any directory requests that are attached to a circuit (i.e. fetching a consensus) at this point -- they would have been noticed inside 91c5801 and stopped there.

  • connections can be initiated after connection_dir_list_by_purpose_and_resource() is called. These connections are never cleaned up. I think this is fixed by checking when we attach a stream in 91c5801.

You mean the connection_dir_list_by_purpose_and_resource() call inside connection_dir_close_consensus_fetches()? That function is only called after we've called networkstatus_set_current_consensus(), so we should have a consensus in place. But! What about the case where we got a consensus, but we don't have the certs yet for verifying it? Then networkstatus_consensus_is_bootstrapping() will return 1, yes? Crappo. Does this mean we want to modify networkstatus_consensus_is_bootstrapping() so it says no if there is a consensus but it's in waiting-for-certs limbo? After all, we don't want to start launching more consensus fetches in that situation.

bcae392 avoid another redundant check
good, but I wonder if we'll want networkstatus_consensus_is_downloading_usable_flavor in future. probably not worth keeping it around

Right, we can bring it back if we want it one day.

d5a9628 simplify more -- we only call these funcs when bootstrapping
makes sense, but you'd better document that those functions should only ever be called when bootstrapping

The functions are named update_consensus_bootstrap_multiple_downloads and update_consensus_bootstrap_attempt_downloads(), and also their function comments start with "When we're bootstrapping, ". Should we do more?

1f72653 fix a bug where relays would use the aggressive client bootstrapping retry number
I think the correct fix for this is to check server_mode(options) in consensus_max_download_tries(), and return options->TestingConsensusMaxDownloadTries if it's true.

Otherwise, this bug is just waiting to happen again.

Hm? I was actually thinking of just removing consensus_max_download_tries() entirely, since it is only called from two places, and one of the places wants the first clause of the function, and the other place wants the last clause of the function. I guess this is my instinct to simplify rather than complexify.

507883e rip out the unit tests for the functions i removed
Please leave the parts of the unit tests that test existing functions, or remove those functions if they are unused:

  • connection_dir_list_by_purpose_and_resource
  • connection_dir_count_by_purpose_and_resource

They are still used. Would you find it easy to add back whichever parts of the unit tests you want to keep? Or throw out this last commit and do whatever you think is smart? :) Thanks!

comment:25 in reply to:  24 Changed 3 years ago by teor

Replying to arma:

Replying to teor:

Code review:

Thanks for the review!

91c5801 avoid following through on a consensus fetch if we have one already arriving
Suffers from a race condition where two connections could attach at the same time. But this is unlikely and relatively benign: we'll just fetch the consensus twice.

Can you elaborate on this one? This part of Tor is single-threaded, so they can't attach at the same time. One of them attaches before the other, and the first one to win gets immediately moved to state AP_CONN_STATE_CONNECT_WAIT as soon as it's chosen, so the second one will know that there's already a winner.

Oops, I always imagine more threads in Tor than what we actually have.

a7665df close other consensus fetches when we get a consensus
There were two race conditions avoided by the previous code:

  • the old function closed the connection immediately to avoid wasting data. But this meant that if a connection was receiving data or undergoing other operations, there could be errors. I think calling connection_mark_for_close is better, even if it's less efficient.

The trouble is that connection_mark_for_close() just closes the dir conn -- not the underlying tls connection, and so it doesn't undo any begin cell already sent onto the network. So we can close the consensus request, but if it already sent its begin cell, we will still get a consensus coming back at us -- we'll just drop all the data cells because we won't be expecting it anymore. This behavior is similar to the problem in bug #16844.

But to be clear, this new function connection_dir_close_consensus_fetches() is aiming to find and kill directory requests that haven't attached to a circuit yet. There should not be any directory requests that are attached to a circuit (i.e. fetching a consensus) at this point -- they would have been noticed inside 91c5801 and stopped there.

Ok, sounds sensible.

  • connections can be initiated after connection_dir_list_by_purpose_and_resource() is called. These connections are never cleaned up. I think this is fixed by checking when we attach a stream in 91c5801.

You mean the connection_dir_list_by_purpose_and_resource() call inside connection_dir_close_consensus_fetches()? That function is only called after we've called networkstatus_set_current_consensus(), so we should have a consensus in place. But! What about the case where we got a consensus, but we don't have the certs yet for verifying it? Then networkstatus_consensus_is_bootstrapping() will return 1, yes? Crappo. Does this mean we want to modify networkstatus_consensus_is_bootstrapping() so it says no if there is a consensus but it's in waiting-for-certs limbo? After all, we don't want to start launching more consensus fetches in that situation.

Yes, I think we'll have to modify this part.

d5a9628 simplify more -- we only call these funcs when bootstrapping
makes sense, but you'd better document that those functions should only ever be called when bootstrapping

The functions are named update_consensus_bootstrap_multiple_downloads and update_consensus_bootstrap_attempt_downloads(), and also their function comments start with "When we're bootstrapping, ". Should we do more?

Nope.

1f72653 fix a bug where relays would use the aggressive client bootstrapping retry number
I think the correct fix for this is to check server_mode(options) in consensus_max_download_tries(), and return options->TestingConsensusMaxDownloadTries if it's true.

Otherwise, this bug is just waiting to happen again.

Hm? I was actually thinking of just removing consensus_max_download_tries() entirely, since it is only called from two places, and one of the places wants the first clause of the function, and the other place wants the last clause of the function. I guess this is my instinct to simplify rather than complexify.

Then let's split it up.

507883e rip out the unit tests for the functions i removed
Please leave the parts of the unit tests that test existing functions, or remove those functions if they are unused:

  • connection_dir_list_by_purpose_and_resource
  • connection_dir_count_by_purpose_and_resource

They are still used. Would you find it easy to add back whichever parts of the unit tests you want to keep? Or throw out this last commit and do whatever you think is smart? :) Thanks!

I'd keep the bits of the unit tests that test these functions.
I'll let you make your changes first.

comment:26 Changed 3 years ago by isabela

Points: medium3

comment:27 Changed 3 years ago by teor

Owner: changed from arma to teor
Status: needs_revisionassigned

For the record, I will finish the rest of this.

comment:28 Changed 3 years ago by teor

Status: assignedneeds_revision

comment:29 Changed 3 years ago by teor

Keywords: must-fix-before-0283 added

comment:30 Changed 3 years ago by teor

Actual Points: medium4
Status: needs_revisionmerge_ready

Please see my branch bug18809_028 at https://github.com/teor2345/tor.git

I have added 2 minor changes requested by arma to arma's bug18809_028. I also modified and added unit tests, and added a changes file.

The commits are:

  • d7029ed Stop downloading consensuses when a consensus has been downloaded
    • arma's "What about the case where we got a consensus, but we don't have the certs yet for verifying it?"
  • 249f39d Changes file for bug 18809
  • 0c23a91 Remove consensus_max_download_tries by refactoring
    • arma's "I was actually thinking of just removing consensus_max_download_tries() entirely"
  • 38e52c6 Revert "rip out the unit tests for the functions i removed"
    • arma's "Would you find it easy to add back whichever parts of the unit tests you want to keep? Or throw out this last commit and do whatever you think is smart?"
    • reverts arma's 507883e
  • 24915a1 Update unit tests for multiple bootstrap connections
    • removes some of the unit tests, fixes others
  • 8e70ca8 Restore and improve download schedule unit tests
    • someone #if 0'd out these tests, maybe it was even me!
  • 4ce6a94 fixup! Update unit tests for multiple bootstrap connections
    • that test line actually belongs in the next commit in a different place
  • 2cd4b2b Add unit tests for networkstatus_consensus_is_bootstrapping
  • 6a52f0c fixup! fixup! Update unit tests for multiple bootstrap connections
    • make check-spaces has its revenge

We should squash this branch and remove the revert and its original before merging.

(I haven't looked into fixing #19074 yet. And depending on how hard it is, we might want to deal with it separately.)

As the bulk of this patch has had 3 reviews (dgoulet, andrea, teor), and arma is busy, I'm throwing it straight to nickm.

comment:31 Changed 3 years ago by arma

0c23a91550 looks good.

249f39ddc8 looks fine but a pedant would spell aggressive better.

d7029ed1a looks fine but a pedant would spell Previously better.

I didn't look at the unit tests, so maybe somebody should.

Thanks!

comment:32 Changed 3 years ago by nickm

Wow that's big.

For review purposes, I did the squash and revert-pair-removal you describe in bug18809_028_squashed.

comment:33 Changed 3 years ago by nickm

In connection_ap_handshake_attach_circuit():

  • I hate making this function more complex. It is already far too complex. But I won't hold off the patch for that.

Here's a case I needed to think about:

+ * As soon as we have received a consensus, return 0, even if we don't have
+ * enough certificates to validate it. */

Suppose that a fallback gives us a bogus consensus whose certificates are all bad. When will we notice, throw the consensus away, and try to fetch another one? Having a comment that explains the process here would ease my mind, and the mind of the next person who tries to read the code. :)

Merging and applying roger's spelling fixes.

comment:34 Changed 3 years ago by nickm

Ug, there is a huge bunch of conflicts when I try to merge this to master. They _seem_ to be caused by the fact that we already fixed the "boostrapping" spelling mistake on master. Please have a look at the big merge commit (d718c717a67051cb1da2a1f80c70401d0d10b374) before I close this ticket and make sure I got it right?

comment:35 Changed 3 years ago by teor

I think everything is ok, because git show d718c717a67051cb1da2a1f80c70401d0d10b374 gives me an empty patch. git show on a merge commit uses git diff-tree —cc, which removes hunks which match either parent. So I think we're ok here.

(Is there another git command I should use? I tried git diff-tree but that gave me the same results.)

Also, the functions are named correctly and all the unit tests pass, as does make test-network-all.

My branch fix18809-comment explains that a consensus that's waiting for certificates fails after 20 minutes. And then we try again. It's a comment-only change.

comment:36 Changed 3 years ago by teor

Jenkins mingw fails with unused-but-set-variable warnings:

15:34:57 src/test/test_connection.c:660:17: error: variable 'ap_conn4' set but not used [-Werror=unused-but-set-variable]
15:34:57 src/test/test_connection.c:658:17: error: variable 'ap_conn2' set but not used [-Werror=unused-but-set-variable]
15:34:57 src/test/test_connection.c:655:21: error: variable 'conn3' set but not used [-Werror=unused-but-set-variable]

https://jenkins.torproject.org/job/tor-ci-mingwcross-0.2.8/ARCHITECTURE=amd64,SUITE=jessie/27/console

Please try my branch fix18809-warnings, which includes the warning fix and the comment fix.
(I don't have a compiler that does -Wunused-but-set-variable, so we'll have to throw this one at jenkins.)

comment:37 Changed 3 years ago by nickm

Resolution: fixed
Status: merge_readyclosed

Both fixes merged. Thanks!

Note: See TracTickets for help on using tickets.