Opened 3 months ago

Last modified 2 months ago

#29672 assigned defect

trac gets overwhelmed

Reported by: anarcat Owned by: qbi
Priority: High Milestone:
Component: Internal Services/Service - trac Version:
Severity: Critical Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:


there seem to be recurring performance problems with trac. diagnose the problem and find a solution.


Change History (3)

comment:1 Changed 3 months ago by anarcat

as part of the monitoring work, i think setting up prometheus / grafana could help in figuring out what the bottleneck is and resolving it. but in the meantime it would be useful to hook the service up in nagios so we at least know when problems occur.
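A nagios check for this is essentially "fetch the front page, turn the HTTP status into a plugin exit code". A minimal sketch of that logic (the URL is a placeholder; a production setup would more likely use the stock check_http plugin):

```python
"""Minimal Nagios-style HTTP probe sketch.

The URL below is a placeholder, not the real trac address.
"""
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(http_status):
    """Map an HTTP status code to a Nagios state."""
    if 200 <= http_status < 400:
        return OK
    if http_status in (502, 503, 504):  # the failure mode seen in this outage
        return CRITICAL
    return WARNING

def probe(url, timeout=10):
    """Fetch the URL and return a Nagios exit code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        return classify(err.code)
    except OSError:
        return CRITICAL  # connection refused, DNS failure, or timeout

# Usage (placeholder URL): sys.exit(probe("https://trac.example.org/"))
```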

comment:2 Changed 2 months ago by anarcat

today trac hung badly - all requests were returning 503 errors to clients and the machine was maxing out its CPU and memory. i found this in the error log:

[Thu Apr 11 16:30:23.749569 2019] [wsgi:error] [pid 22934:tid 140416296871680] (11)Resource temporarily unavailable: [client [REDACTED]:40900] mod_wsgi (pid=22934): Unable to connect to WSGI daemon process '' on '/var/run/apache2/wsgi.2106.9.1.sock' after multiple attempts as listener backlog limit was exceeded.
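The "listener backlog limit was exceeded" part means every daemon worker thread was busy *and* the pending-connection queue on the daemon's UNIX socket was full, so Apache's connect() failed with EAGAIN. A back-of-the-envelope sketch of the numbers involved (processes and threads come from the config quoted below; the backlog of 100 is an assumption, believed to be mod_wsgi's listen-backlog default since we don't set it):

```python
# Rough capacity math for the WSGI daemon process group.
# processes/threads are from the WSGIDaemonProcess line in this ticket;
# listen_backlog is an ASSUMED default, tunable via listen-backlog=.
processes = 6
threads = 10
listen_backlog = 100  # assumed mod_wsgi default, not set in our config

concurrent_capacity = processes * threads           # requests served at once
max_pending = concurrent_capacity + listen_backlog  # in flight before connects fail

print(concurrent_capacity)  # 60
print(max_pending)          # 160
```

Anything beyond that in-flight ceiling gets the connect failure above, which Apache then surfaces to clients as a 503.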

The trac.log was full of:

IOError: Apache/mod_wsgi failed to write response data: Broken pipe

CPU and memory had been maxed out for more than two hours already when the outage started:

Apache was also seeing more hits than usual:

But I don't believe it was starved of resources:
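The traffic spike should also be visible in the access log itself, independently of the graphs. A rough sketch of counting hits per minute (combined log format assumed; the sample lines are made up for illustration):

```python
# Sketch: confirm a traffic spike by counting requests per minute
# in an Apache access log (combined log format assumed).
import re
from collections import Counter

# The timestamp lives in the first [...] field, e.g. [11/Apr/2019:16:30:23 +0000];
# capture it down to minute granularity.
TS = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} ')

def hits_per_minute(lines):
    counts = Counter()
    for line in lines:
        m = TS.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Made-up sample lines for illustration:
sample = [
    '1.2.3.4 - - [11/Apr/2019:16:30:23 +0000] "GET /trac/ticket/1 HTTP/1.1" 200 1234',
    '1.2.3.4 - - [11/Apr/2019:16:30:45 +0000] "GET /trac/timeline HTTP/1.1" 503 0',
    '5.6.7.8 - - [11/Apr/2019:16:31:02 +0000] "GET /trac/wiki HTTP/1.1" 503 0',
]
print(hits_per_minute(sample))
```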

It's possible the pgsql database got overwhelmed. We don't have metrics for that in prometheus because, ironically enough, only yesterday I decided it might be overkill. Maybe we should revisit that decision now.

I wonder if our WSGI config could be tweaked. This is what we have right now:

WSGIDaemonProcess user=tracweb group=tracweb home=/ processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 umask=0007

I've decided to make more of those settings explicit to see if some tweaks might be useful:

WSGIDaemonProcess user=tracweb group=tracweb home=/ processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 umask=0007 graceful-timeout=30 restart-interval=30 response-socket-timeout=10

The server was rebooted, which fixed the problem; we'll see whether the above tweaks prevent it from recurring.

Failing that, a good next step is to check whether the database is overloaded - that would explain why the frontend falls over without a clear cause, although it must be said that most of the CPU was taken by WSGI processes, not pgsql.
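If the database is the suspect next time, a quick look at pg_stat_activity would show whether connections are piling up or stuck. A sketch (the psycopg2 connection string is a placeholder; the summarizing helper is the only part exercised here):

```python
# Sketch: summarize PostgreSQL backend states from pg_stat_activity.
# Equivalent one-liner in psql:
#   SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
from collections import Counter

def summarize_states(rows):
    """rows: iterable of (state,) tuples; NULL states become 'unknown'."""
    return Counter(state or "unknown" for (state,) in rows)

# With a live database (placeholder DSN, not our real settings):
#
#   import psycopg2
#   conn = psycopg2.connect("dbname=trac")  # placeholder
#   with conn.cursor() as cur:
#       cur.execute("SELECT state FROM pg_stat_activity")
#       print(summarize_states(cur.fetchall()))

# Offline illustration with made-up rows:
print(summarize_states([("active",), ("active",), ("idle",), (None,)]))
```

A pile-up of "active" or "idle in transaction" backends would point at the database; mostly-idle backends would point back at the frontend.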

comment:3 Changed 2 months ago by anarcat

clarification: the outage actually started right about when the graphs picked it up, so there is a clear correlation between the higher traffic and the downtime.
