Opened 4 years ago

Closed 3 years ago

Last modified 3 years ago

#15942 closed defect (implemented)

Make tor connection failures random-exponential-backoff

Reported by: teor Owned by: andrea
Priority: High Milestone: Tor: 0.2.9.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: SponsorS, 028-triage, tor-dos, TorCoreTeam201606, review-group-3, review-group-4
Cc: nickm Actual Points: 3
Parent ID: #17293 Points: 3
Reviewer: nickm Sponsor: SponsorU-can

Description

Every connection request in tor could be modified to random-exponential-backoff on failure.

This would resolve repeated-connection overloading issues in general.

In particular, it would reduce the risk that old versions would DoS the authorities (or fallback directories, see #15775 / #15228) when the clients are switched off.

Connections to authorities are a priority for this change, then directory servers.

Child Tickets

Change History (26)

comment:1 Changed 4 years ago by teor

Included this in:
https://pad.riseup.net/p/deprecating-old-tors2

We should also make protocol failures random-exponential-backoff.

comment:2 Changed 4 years ago by nickm

Milestone: Tor: 0.2.7.x-finalTor: 0.2.8.x-final

comment:3 Changed 4 years ago by nickm

Keywords: 028-triage added

comment:4 Changed 4 years ago by nickm

Keywords: SponsorU removed
Sponsor: SponsorU

Bulk-replace SponsorU keyword with SponsorU field.

comment:5 Changed 4 years ago by nickm

Points: medium

comment:6 Changed 3 years ago by nickm

Milestone: Tor: 0.2.8.x-finalTor: 0.2.9.x-final

It is impossible that we will fix all 226 currently open 028 tickets before 028 releases. Time to move some out. This is my second pass through the "new" and tickets, looking for things to move to 0.2.9.

comment:7 Changed 3 years ago by isabela

Sponsor: SponsorUSponsorU-can

comment:8 Changed 3 years ago by nickm

Keywords: tor-dos added

comment:9 Changed 3 years ago by isabela

Points: medium3

comment:10 Changed 3 years ago by nickm

Parent ID: #15940#17293

comment:11 Changed 3 years ago by andrea

Owner: set to andrea
Severity: Normal
Status: newassigned

Taking ownership for 0.2.9 triage

comment:12 Changed 3 years ago by nickm

Keywords: TorCoreTeam201606 added

comment:13 Changed 3 years ago by andrea

Connection failures are detected in connection_handle_read_impl() / connection_handle_write_impl(), which call, generically, connection_close_immediate()/connection_mark_for_close_internal(), but also in the case of orconns, call connection_or_notify_error(), and call connection_edge_end_errno() for edge connections.

The connection_close_immediate()/connection_mark_for_close_internal() path flows to connection_about_to_close_connection(), which can call connection_dir_about_to_close(), connection_or_about_to_close(), connection_ap_about_to_close() or connection_exit_about_to_close(). In the case of orconns and edge connections everything interesting happens from connection_or_notify_error() and connection_edge_end_errno(), but connection_dir_about_to_close() is the trigger point for retrying downloads from the directory servers.

Edge connections are either outgoing from the exit, in which case we just send an END cell down the circuit on failure, from connection_edge_end_errno() -> connection_edge_end() -> connection_edge_send_command(), or incoming from the client, in which case we don't get any choices about retrying. There's no retry policy to change there.

Orconn failures cause circuits to die or fail to attach, and these flow through circuit_n_chan_done() and circuit_unlink_all_from_channel() from channel_closed(). Ultimately, connection failures end up in circuit_about_to_free(), and then for origin circuits in circuit_build_failed() when handling a circuit closed for error.

Since all of these possible failure cases are ultimately driven from somewhere else (e.g., exit connection fails) and trigger reporting back to the cause of that connection (e.g. send END cell) rather than retrying, or are on the client side and become a matter of general circuit-building policy, for this ticket I'll be focusing attention on retries of failed downloads from the directory servers. We should think about backoffs for circuit building at some point perhaps, but it seems to be largely separable from the question of directories, less critical for DoS-resistance since there aren't analogous heavily loaded elements like the authorities, and more security-sensitive because of potential implications for behavior when we fail to connect to our preferred entry guard.

comment:14 Changed 3 years ago by andrea

Failed directory connections are handled in connection_dir_request_failed(),
which calls:

  • networkstatus_consensus_download_failed() for a consensus
    • calls download_status_failed() / update_consensus_networkstatus_downloads()
      • download_status_failed() is macro for download_status_increment_failure()
  • connection_dir_download_cert_failed() for a certificate
    • calls authority_cert_dl_failed() / update_certificate_downloads()
  • this ultimately uses download_status_t too just like the consensus download; see download_status_is_ready_by_sk_in_cl() and friends in routerlist.c
  • connection_dir_download_routerdesc_failed()

890 /* No need to relaunch descriptor downloads here: we already do it
891 * every 10 or 60 seconds (FOO_DESCRIPTOR_RETRY_INTERVAL) in main.c. */

  • The mechanism here is in launch_descriptor_fetches_callback()/reset_descriptor_failures_callback(); we can realize exponential backoff by suitable adjustments
  • connection_dir_bridge_routerdesc_failed()
    • calls connection_dir_retry_bridges()
      • calls retry_bridge_descriptor_fetch_directly()
        • calls launch_direct_bridge_descriptor_fetch()

At minimum, it should be easy to implement exponential backoffs for consensus and certificate downloads through the download_status_t mechanism, since they already notify it of their successes/failures and ask it whether we're ready to attempt a new download yet. Further ivestigation of the right approach for the bridge descriptor and router descriptor download cases pending.

comment:15 Changed 3 years ago by andrea

Seems we're using the download_status_t mechanism for all four types of dirserver downloads:

  • download_status_t initializers in networkstatus.c control consensus downloads
  • initialized in download_status_cert_init() of routerlist.c for cert downloads
  • fetch_status in bridge_info_t of entrynodes.c for bridge descriptors, set up in bridge_add_from_config()
  • for routerdescs, there's a download_status_t in routerstatus_t, and these are ultimately created by routerstatus_parse_entry_from_string()

comment:16 Changed 3 years ago by andrea

Status: assignedneeds_review

Please review implementation in my bug15942 branch; this has been tested by unit tests for the random exponential backoff download schedule in src/test/test_dir.c, and by using iptables to block outgoing TCP connections while bootstrapping a client to observe backoffs in progress.

comment:17 Changed 3 years ago by nickm

Keywords: review-group-3 added

Move some tickets into review-group-3: they are in 0.2.9, and they are needs_review.

comment:18 Changed 3 years ago by nickm

Reviewer: nickm

comment:19 Changed 3 years ago by nickm

Status: needs_reviewneeds_revision

I tried using gitlab to review your #15942 branch, at https://gitlab.com/nickm_tor/tor/merge_requests/1 . What do you think?

(Except as noted, the branch lgtm.)

comment:20 Changed 3 years ago by nickm

(oh and also: most of my questions on that branch are actual questions, and not sneaky suggestions. "No, that would be a bad idea" is an okay answer in most cases.)

comment:21 Changed 3 years ago by andrea

Status: needs_revisionneeds_review

See rebased and updated bug15942_v2 branch.

comment:22 Changed 3 years ago by nickm

okay, now I get it.

But now that I look at it, I wonder whether the shift-trick isn't a little too clever. What if we just did it like this bug15942_v2_alternative ?

comment:23 Changed 3 years ago by nickm

Keywords: review-group-4 added

comment:24 Changed 3 years ago by andrea

Seems reasonable as long as entropy is sufficiently cheap; I'm fine with that alternative I think.

comment:25 Changed 3 years ago by nickm

Resolution: implemented
Status: needs_reviewclosed

Okay. Merged it. Let's see how it works out! (Please set the "actual points" field to the approx number of coding days you spent on this.)

comment:26 Changed 3 years ago by andrea

Actual Points: 3
Note: See TracTickets for help on using tickets.