Sockets probably account for most of the descriptor use in a running Tor
process, but we should make the maximum socket count a little lower than
the maximum number of descriptors so they can't block us from opening files.
We'll need a heuristic to decide which connections are least harmful to
kill analogous to OOM handling, but the first step is building plumbing
to get a chance to run an OOS handler.
Per compat.c, sockets are created (and the socket counter adjusted) in
tor_socketpair() (AF_UNIX only), tor_open_socket() and tor_accept_socket(), and
the _nonblocking/_with_extensions versions of those functions.
Call sites:
tor_open_socket_nonblocking() called in three places, all in connection.c
connection_listener_new() (2 call sites)
connection_connect_sockaddr()
tor_accept_socket_nonblocking() called in one place, in connection_handle_listener_read()
open_socket() calls in src/ext/eventdns.c ?
get_n_open_sockets() returns counter
socket exhaustion can be detected by open/accept failing with
errno EMFILE, ENFILE, ENOBUFS, ENOMEM.
For testing purposes, we can artificially restrict number of descriptors
to trigger socket exhaustion using ulimit -n.
We can build an OOS handler analogous to circuits_handle_oom called with
current number of open sockets and an indicator of whether one just failed
with one of the socket exhaustion related errnos from open_socket()/accept
call sites. It can then compare socket count against a new config setting
parallel to the MaxMemInQueues setting the OOM handler relies on to decide
if it needs to act.
How should we set the default of the number of sockets? On Linux
getrlimit(RLIMIT_NOFILE) works, but what about elsewhere?
My ticket18640 branch now has basic plumbing to implement a MaxSockets config option and a stub connection_handle_oos() function called in the appropriate places. Seeking design review on this approach while I work on the details of connection_handle_oos().
Did you see ConnLimit? It's very similar to MaxSockets, and it already ties into the rlimit infrastructure where available, I think. See also the "n_sockets_open" counter in compat.c; that might help too?
Right, thanks. I already saw n_socket_open and queried it in a bunch of places where I call the OOS handler in connection.c. Fresh version of plumbing without the redundant config variable in my ticket18640_v2_plumbing branch.
For a handling heuristic, I'm thinking something like sort open orconns by number of circuits and kill the ones carrying the fewest until we're down to the target.
Testing note: empirically, the smallest ulimit -n I can get Tor to bootstrap with is 36, and it can be forced to create new connections with EXTENDCIRCUIT 0 at the control port.
Implementation of kill-orconns-with-fewest circuits now in my ticket18640_v2_squashed branch. This is mostly plumbing and a simple sort heuristic; it'd be easy to modify it with more sophisticated selection. I will now proceed with unit tests for it. Currently, the easiest way to repro an OOS is to set a low ulimit and send many EXTENDCIRCUIT commands to the control port; ignore the instance of #19535 (moved) you'll see along the way; it's a pre-existing bug which doesn't impede demonstrating the OOS code.
Looking good! There are a few small issues to fix, and I'd like to talk a little more about the actual heuristics for what to kill: see comments on branch.
(I don't think we need to get those heuristics right before merging. We could instead disable the connection-killer until we believe in it, or make it off-by-default till we believe in it.)
My ticket18640_v3 branch now has all the requested fixes except for a heuristic that considers more connection types, and a new DisableOOSCheck flag, currently enabled by default.
Looks good to me. There heeds to be documentation for DisableOOSCheck, and a changes file. I can add those when I merge. (Planning to hold the merge till after 0.2.9.2-alpha, so that this can bake on git master for a little while before it ships.)
To do:
Add a changes file
Document DisableOOSCheck
Merge
Open a new ticket to make this code look at all edge/directory connections, and turn the check on by default.