Opened 20 months ago

Closed 15 months ago

Last modified 10 months ago

#29672 closed defect (fixed)

trac gets overwhelmed

Reported by: anarcat
Owned by: qbi
Priority: High
Milestone:
Component: Internal Services/Service - trac
Version:
Severity: Critical
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:


there seem to be recurring performance problems with trac. diagnose the problem and find a solution.


Change History (5)

comment:1 Changed 20 months ago by anarcat

as part of the monitoring work, i think setting up prometheus / grafana could help in figuring out what the bottleneck is and resolving it. but in the meantime it would be useful to hook the service up in nagios so we at least know when problems occur.

comment:2 Changed 19 months ago by anarcat

today trac hung badly - all requests were returning 503 errors to clients and the machine was maxing out its CPU and memory. i found this in the error log:

[Thu Apr 11 16:30:23.749569 2019] [wsgi:error] [pid 22934:tid 140416296871680] (11)Resource temporarily unavailable: [client [REDACTED]:40900] mod_wsgi (pid=22934): Unable to connect to WSGI daemon process '' on '/var/run/apache2/wsgi.2106.9.1.sock' after multiple attempts as listener backlog limit was exceeded.

The trac.log was full of:

IOError: Apache/mod_wsgi failed to write response data: Broken pipe

CPU and memory had been maxed out for more than two hours already when the outage started:

Apache was also seeing more hits than usual:

But I don't believe it was starved of resources:

It's possible the pgsql database got overwhelmed. We don't have metrics for that in prometheus because, ironically enough, I just decided yesterday it might have been overkill. Maybe we should revise that decision now.

I wonder if our WSGI config could be tweaked. This is what we have right now:

WSGIDaemonProcess user=tracweb group=tracweb home=/ processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 umask=0007

I've decided to make more of those settings explicit to see if some tweaks might be useful:

WSGIDaemonProcess user=tracweb group=tracweb home=/ processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 umask=0007 graceful-timeout=30 restart-interval=30 response-socket-timeout=10
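For reference, here is my reading of what those options do, based on the mod_wsgi documentation; the annotations are interpretation, not measured behavior on this host:

```apache
# maximum-requests=5000       recycle a worker process after 5000 requests
# inactivity-timeout=1800     restart a worker that has been idle for 30 minutes
# graceful-timeout=30         allow 30s for in-flight requests before a forced restart
# restart-interval=30         periodically restart worker processes (seconds, per the docs)
# response-socket-timeout=10  give up writing a response after 10s of a blocked client
WSGIDaemonProcess user=tracweb group=tracweb home=/ processes=6 threads=10 \
    maximum-requests=5000 inactivity-timeout=1800 graceful-timeout=30 \
    restart-interval=30 response-socket-timeout=10 umask=0007
```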

The server was rebooted, which resolved the immediate outage; we'll see whether the tweaks above prevent a recurrence.

Failing that, a good path to take next time is to check whether the database is overloaded - that would explain why the frontend falls over without an obvious cause, although it must be said that most of the CPU was taken by WSGI processes, not pgsql.
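A quick way to tell next time where the CPU is actually going, before blaming the database (a hedged sketch; the commands assume a standard Linux host and were not what was run during this outage):

```shell
# Top CPU consumers: if the WSGI workers dominate, the frontend is the
# bottleneck; if postgres dominates, dig into the database instead.
ps -eo pcpu,comm --sort=-pcpu | head -n 5
# If postgres tops the list, pg_stat_activity is the next stop, e.g.:
#   sudo -u postgres psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
```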

comment:3 Changed 19 months ago by anarcat

clarification: outage actually started right about when the graphs noticed it, so there's a clear correlation between the higher traffic and the downtime.

comment:4 Changed 15 months ago by anarcat

Resolution: fixed
Status: assigned → closed

it looks like the little tweaks I made to the WSGI handlers helped, because I haven't heard of these problems in the 4 months since I last commented on this ticket.

when i asked in #tor-project, the response was that listing users in the admin interface was slow, but that otherwise "it's been working fine".

so I think we can close this. if we have more specific problems, we can reopen.

comment:5 Changed 10 months ago by anarcat

we just had a problem again where trac was overwhelmed. I noticed an IP that was hogging ~30 apache threads all on its own, thanks to this magic command:

ss -t -n 'dport = 80 or sport = 80 or dport = 443 or sport = 443' | awk '{print $NF}' | sed 's/:[0-9]*$//' | sort | uniq -c | sort -n
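To make the aggregation concrete, here is the same pipeline run over a few made-up `ss -t -n` lines (the addresses are documentation placeholders, not the real client):

```shell
# Fake ss output: the last column is the remote address:port
sample='ESTAB 0 0 198.51.100.5:443 192.0.2.1:40900
ESTAB 0 0 198.51.100.5:443 192.0.2.1:40901
ESTAB 0 0 198.51.100.5:443 203.0.113.9:50000'

# Same aggregation as above: take the last field, strip the :port suffix,
# count occurrences per IP, and sort numerically so the busiest IP ends up last
echo "$sample" | awk '{print $NF}' | sed 's/:[0-9]*$//' | sort | uniq -c | sort -n
```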

That showed a single IP that was taking up most of the threads. I killed the IP with:


... and restarted apache (to kill old sockets). Load has returned to normal and things seem generally happier.
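The exact command used isn't recorded above. For the record, a typical way to cut off a single abusive address is an iptables drop rule; this sketch uses a placeholder IP and builds the command as a string, since actually running it needs root:

```shell
# Hypothetical block rule for one abusive address
# (192.0.2.1 is a documentation placeholder, not the real client IP)
block_cmd="iptables -I INPUT -s 192.0.2.1 -j DROP"
echo "$block_cmd"
# after blocking, restart apache to tear down the sockets it already holds:
#   systemctl restart apache2
```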
