Opened 4 years ago

Closed 3 years ago

Last modified 3 years ago

#18640 closed enhancement (implemented)

Use smarter algorithms to handle socket exhaustion

Reported by: nickm Owned by: andrea
Priority: Medium Milestone: Tor: 0.2.9.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-dos, TorCoreTeam201608, review-group-5, review-group-7
Cc: Actual Points: 4
Parent ID: #17293 Points: 3
Reviewer: nickm Sponsor: SponsorU-can

Description

We could do something for socket exhaustion that's smart, like what we currently do for OOM.

Child Tickets

Change History (31)

comment:1 Changed 4 years ago by mikeperry

Keywords: tor-dos added; dos removed

Canonicalize dos tag to tor-dos

comment:2 Changed 4 years ago by isabela

Points: medium3

comment:3 Changed 4 years ago by nickm

Parent ID: #17293

comment:4 Changed 4 years ago by andrea

Owner: set to andrea
Status: newassigned

Taking ownership for 0.2.9 triage

comment:5 Changed 4 years ago by nickm

Keywords: TorCoreTeam201606 added

comment:6 Changed 3 years ago by andrea

Notes on handling OOS:

Sockets probably account for most of the descriptor use in a running Tor
process, but we should make the maximum socket count a little lower than
the maximum number of descriptors so they can't block us from opening files.
We'll need a heuristic to decide which connections are least harmful to
kill analogous to OOM handling, but the first step is building plumbing
to get a chance to run an OOS handler.

Per compat.c, sockets are created (and the socket counter adjusted) in
tor_socketpair() (AF_UNIX only), tor_open_socket() and tor_accept_socket(), and
the _nonblocking/_with_extensions versions of those functions.

Call sites:

  • tor_open_socket_nonblocking() called in three places, all in connection.c
  • connection_listener_new() (2 call sites)
  • connection_connect_sockaddr()
  • tor_accept_socket_nonblocking() called in one place, in connection_handle_listener_read()
  • open_socket() calls in src/ext/eventdns.c ?
  • get_n_open_sockets() returns counter
  • socket exhaustion can be detected by open/accept failing with errno EMFILE, ENFILE, ENOBUFS, ENOMEM.
  • For testing purposes, we can artificially restrict number of descriptors to trigger socket exhaustion using ulimit -n.

We can build an OOS handler analogous to circuits_handle_oom called with
current number of open sockets and an indicator of whether one just failed
with one of the socket exhaustion related errnos from open_socket()/accept
call sites. It can then compare socket count against a new config setting
parallel to the MaxMemInQueues setting the OOM handler relies on to decide
if it needs to act.

  • How should we set the default of the number of sockets? On Linux getrlimit(RLIMIT_NOFILE) works, but what about elsewhere?

comment:7 Changed 3 years ago by andrea

Status: assignedneeds_review

My ticket18640 branch now has basic plumbing to implement a MaxSockets config option and a stub connection_handle_oos() function called in the appropriate places. Seeking design review on this approach while I work on the details of connection_handle_oos().

comment:8 Changed 3 years ago by nickm

Milestone: Tor: 0.2.9.x-final

comment:9 Changed 3 years ago by nickm

Did you see ConnLimit? It's very similar to MaxSockets, and it already ties into the rlimit infrastructure where available, I think. See also the "n_sockets_open" counter in compat.c; that might help too?

comment:10 Changed 3 years ago by andrea

Right, thanks. I already saw n_socket_open and queried it in a bunch of places where I call the OOS handler in connection.c. Fresh version of plumbing without the redundant config variable in my ticket18640_v2_plumbing branch.

For a handling heuristic, I'm thinking something like sort open orconns by number of circuits and kill the ones carrying the fewest until we're down to the target.

comment:11 Changed 3 years ago by andrea

Testing note: empirically, the smallest ulimit -n I can get Tor to bootstrap with is 36, and it can be forced to create new connections with EXTENDCIRCUIT 0 <identity digest> at the control port.

comment:12 Changed 3 years ago by andrea

Implementation of kill-orconns-with-fewest circuits now in my ticket18640_v2_squashed branch. This is mostly plumbing and a simple sort heuristic; it'd be easy to modify it with more sophisticated selection. I will now proceed with unit tests for it. Currently, the easiest way to repro an OOS is to set a low ulimit and send many EXTENDCIRCUIT commands to the control port; ignore the instance of #19535 you'll see along the way; it's a pre-existing bug which doesn't impede demonstrating the OOS code.

comment:13 Changed 3 years ago by andrea

Version with unit tests in my ticket18640_v2_squashed_tests branch.

comment:14 Changed 3 years ago by andrea

Actual Points: 4

comment:15 Changed 3 years ago by nickm

Reviewer: nickm

comment:16 Changed 3 years ago by nickm

Keywords: review-group-5 added

comment:17 Changed 3 years ago by nickm

https://gitlab.com/nickm_tor/tor/merge_requests/4/commits is a gitlab page for review here. I'll make another note when I've done the review.

comment:18 Changed 3 years ago by nickm

Actual-review-points-so-far : 0.2

Looking good! There are a few small issues to fix, and I'd like to talk a little more about the actual heuristics for what to kill: see comments on branch.

(I don't think we need to get those heuristics right before merging. We could instead disable the connection-killer until we believe in it, or make it off-by-default till we believe in it.)

comment:19 Changed 3 years ago by nickm

Status: needs_reviewneeds_revision

comment:20 Changed 3 years ago by nickm

Keywords: review-group-6 added

comment:21 Changed 3 years ago by nickm

Keywords: review-group-7 added; review-group-6 removed

All review-group-6 tickets have had at least one review; moving them to 7.

comment:22 Changed 3 years ago by andrea

Status: needs_revisionneeds_review

My ticket18640_v3 branch now has all the requested fixes except for a heuristic that considers more connection types, and a new DisableOOSCheck flag, currently enabled by default.

comment:23 Changed 3 years ago by nickm

Status: needs_reviewmerge_ready

Looks good to me. There heeds to be documentation for DisableOOSCheck, and a changes file. I can add those when I merge. (Planning to hold the merge till after 0.2.9.2-alpha, so that this can bake on git master for a little while before it ships.)

To do:

  • Add a changes file
  • Document DisableOOSCheck
  • Merge
  • Open a new ticket to make this code look at all edge/directory connections, and turn the check on by default.

comment:24 Changed 3 years ago by nickm

Merged!

comment:25 Changed 3 years ago by nickm

Resolution: implemented
Status: merge_readyclosed

Open a new ticket to make this code look at all edge/directory connections, and turn the check on by default.

This is #19984, just opened.

comment:26 Changed 3 years ago by dgoulet

Resolution: implemented
Status: closedreopened

This made current git upstream fail to build with two issues.

Both fixed in bug18640_029_01.

comment:27 Changed 3 years ago by nickm

Resolution: implemented
Status: reopenedclosed

merged; thank you!

comment:28 Changed 3 years ago by andrea

Resolution: implemented
Status: closedreopened

Test suite failures fixed in oos-test-failures

comment:29 Changed 3 years ago by andrea

Status: reopenedneeds_review

comment:30 Changed 3 years ago by nickm

Resolution: implemented
Status: needs_reviewclosed

Merged.

comment:31 Changed 3 years ago by nickm

Keywords: TorCoreTeam201608 added; TorCoreTeam201606 removed

This was finished in August

Note: See TracTickets for help on using tickets.