Opened 10 years ago

Last modified 13 months ago

#925 new defect (None)

Tor fails badly when accept(2) returns EMFILE or ENFILE

Reported by: riastradh Owned by:
Priority: Low Milestone: Tor: unspecified
Component: Core Tor/Tor Version: 0.2.0.33
Severity: Normal Keywords: tor-relay, oos, sponsor8-maybe, tor-dos
Cc: riastradh, nickm, Sebastian Actual Points:
Parent ID: #19984 Points: medium
Reviewer: Sponsor:

Description (last modified by nickm)

If accept(2) in connection_handle_listener_read returns EMFILE or
ENFILE, Tor logs a failure and returns to the event loop. The
listening socket remains ready for reading, however, so that Tor again
tries to accept a connection. This leads to tens of thousands of
logged failures per second. Here is an excerpt from my syslog:

Feb 11 05:57:36 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
Feb 11 05:57:54 last message repeated 301536 times
Feb 11 05:57:54 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.
Feb 11 05:57:54 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
Feb 11 05:58:05 last message repeated 184158 times
Feb 11 05:58:05 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.
Feb 11 05:58:05 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
Feb 11 05:58:13 last message repeated 127556 times
Feb 11 05:58:13 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.
Feb 11 05:58:13 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
Feb 11 05:58:26 last message repeated 223556 times
Feb 11 05:58:26 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.

I don't know what the right thing to do here is, but spiking the CPU
and spraying log messages is not a very graceful mode of failure. One
way to mitigate the damage might be to close the listening socket,
which I believe won't be reopened until a minute later. This is no
worse for the Tor network than just wedging, and perhaps better, since
prospective connectors would be refused rather than silently forgotten
in a flurry of furious logging.

Also, it would be nice to document the number of file descriptors
generally required by a Tor relay, or a formula for computing it. For
example, is it proportional to the bandwidth and to the number of
relays in the Tor network? Or to the bandwidth and to the number of
users in the Tor network? This way, prospective operators of Tor
relays would not need to repeatedly restart their relays as they test
incremental bumps in the file descriptor ulimits, unless there is some
way to bump them without restarting the relay (but I doubt whether
there is).

(Apologies if this is duplicated: I hit ^A while editing this, in order
to move to the beginning of the line, but the obnoxious !@#^%&%^& web
form [and my obnoxiously colluding web browser] interpreted it to mean
something else for which I quickly hit the stop button. I don't know
what hitting ^A actually did.)

[Automatically added by flyspray2trac: Operating System: All]

Child Tickets

Change History (20)

comment:1 Changed 10 years ago by nickm

0.2.1.x already rate-limits the messages here. Maybe we should backport the code there. (Roger, what do you think?)

It'd be nice to back off on accepting new connections when we're "full"; we could just delete the appropriate listener
event_read for a while. I wonder what this would break.

For recommended ulimit values, I'd go with "as high as you can".

comment:2 Changed 10 years ago by loafier

I had a chance to test this with a local Tor server and a script to create a lot of
idle connections. It seems that even after rate limiting the log message, a lot of
CPU is still wasted letting accept() fail over and over, so either closing the
listener or removing the read event for a period of time would probably be a good
idea.

Closing the listener looks like the easier of the two options, because, as Taylor
mentioned, Tor already attempts to reopen listeners every 60s.

comment:3 Changed 9 years ago by Sebastian

So it appears that we have two tricky problems here. One is that we can't
really close the listening socket, because we can't know that we'll be able
to open it again. The other one is that we don't give connecting parties
immediate feedback about our current state when we just remove the
read event.

A possible solution might be to notice that we're almost exhausting our
limits, and accept and close immediately in that case.

nickm| Hm. Your idea gives me an icky feeling, but I can't actually think what's wrong with it
nickm| This makes me suspect that it might be ok

comment:4 Changed 9 years ago by nickm

Milestone: Tor: 0.2.2.x-final

comment:5 Changed 9 years ago by nickm

Description: modified (diff)

So part one of the "fix" here, if we want to try it, is for connection_handle_listener_read() to compare get_n_open_sockets() with get_options->ConnLimit [grep through the rest of that file to see how we do it]. If we have too many sockets, then we should immediately close the new connection...

...and part two is, if we ever get an EMFILE/ENFILE, to reset our idea of what our connlimit is based on the number of files we currently have open...

...but first, we need to look through the code that connects to ORs/directories, and make sure that we don't actually treat a completed connect() attempt as meaning that a server is up. I am 97% sure that we don't.

comment:6 Changed 9 years ago by arma

Triage: nickm is welcome to fix this if he wants, but I'm not holding up 0.2.2.x for it.

It might be that it's the sort of fix where we're not totally confident it helps and doesn't hurt, so putting it in right before an rc could be unsmart.

comment:7 Changed 8 years ago by arma

Milestone: Tor: 0.2.2.x-finalTor: 0.2.3.x-final

comment:8 Changed 7 years ago by nickm

Milestone: Tor: 0.2.3.x-finalTor: 0.2.4.x-final

Ugh, wanted to get to this. I still hope I can.

comment:9 Changed 6 years ago by nickm

Keywords: tor-relay added

comment:10 Changed 6 years ago by nickm

Component: Tor RelayTor

comment:11 Changed 6 years ago by nickm

Milestone: Tor: 0.2.4.x-finalTor: unspecified

comment:12 Changed 3 years ago by nickm

Cc: riastradh,nickm,Sebastianriastradh, nickm, Sebastian
Milestone: Tor: unspecifiedTor: 0.2.8.x-final

comment:13 Changed 3 years ago by nickm

Points: medium

comment:14 Changed 3 years ago by nickm

Milestone: Tor: 0.2.8.x-finalTor: 0.2.???

Move a bunch of (but not all) low-priority items to 0.2.???. If you write a patch, it can still get into 0.2.8

comment:15 Changed 2 years ago by teor

Milestone: Tor: 0.2.???Tor: 0.3.???

Milestone renamed

comment:16 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.???Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:17 Changed 22 months ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:18 Changed 22 months ago by nickm

Keywords: dos oos added
Parent ID: #19984
Severity: Normal

comment:19 Changed 22 months ago by nickm

Keywords: sponsor8-maybe added

comment:20 Changed 13 months ago by dgoulet

Keywords: tor-dos added; dos removed

Rename keyword "dos" to "tor-dos"

Note: See TracTickets for help on using tickets.