Opened 20 months ago

Closed 15 months ago

Last modified 10 months ago

#29672 closed defect (fixed)

trac gets overwhelmed

Reported by: anarcat
Owned by: qbi
Priority: High
Milestone:
Component: Internal Services/Service - trac
Version:
Severity: Critical
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:


there seem to be recurring performance problems with trac. diagnose the problem and find a solution.


Change History (5)

comment:1 Changed 20 months ago by anarcat

as part of the monitoring work, i think setting up prometheus / grafana could help in figuring out what the bottleneck is and resolving it. but in the meantime it would be useful to hook the service up in nagios so we at least know when problems occur.

comment:2 Changed 19 months ago by anarcat

today trac hung badly - all requests were returning 503 errors to clients and the machine was maxing out its CPU and memory. i found this in the error log:

[Thu Apr 11 16:30:23.749569 2019] [wsgi:error] [pid 22934:tid 140416296871680] (11)Resource temporarily unavailable: [client [REDACTED]:40900] mod_wsgi (pid=22934): Unable to connect to WSGI daemon process '' on '/var/run/apache2/wsgi.2106.9.1.sock' after multiple attempts as listener backlog limit was exceeded.

The trac.log was full of:

IOError: Apache/mod_wsgi failed to write response data: Broken pipe

CPU and memory had been maxed out for more than two hours already when the outage started:

Apache was also seeing more hits than usual:

But I don't believe it was starved of resources:

It's possible the pgsql database got overwhelmed. We don't have metrics for that in prometheus because, ironically enough, I just decided yesterday it might have been overkill. Maybe we should revise that decision now.

I wonder if our WSGI config could be tweaked. This is what we have right now:

WSGIDaemonProcess user=tracweb group=tracweb home=/ processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 umask=0007

I've decided to make more of those settings explicit to see if some tweaks might be useful:

WSGIDaemonProcess user=tracweb group=tracweb home=/ processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 umask=0007 graceful-timeout=30 restart-interval=30 response-socket-timeout=10
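For reference, here is my reading of what those options do, based on the mod_wsgi documentation; the annotations are interpretation, not measured behavior on this host:

```apache
# maximum-requests=5000       recycle a worker process after 5000 requests
# inactivity-timeout=1800     restart a worker that has been idle for 30 minutes
# graceful-timeout=30         allow 30s for in-flight requests before a forced restart
# restart-interval=30         periodically restart worker processes (seconds, per the docs)
# response-socket-timeout=10  give up writing a response after 10s of a blocked client
WSGIDaemonProcess user=tracweb group=tracweb home=/ processes=6 threads=10 \
    maximum-requests=5000 inactivity-timeout=1800 graceful-timeout=30 \
    restart-interval=30 response-socket-timeout=10 umask=0007
```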

The server was rebooted, which resolved the immediate outage; we'll see whether the tweaks above prevent a recurrence.

Failing that, a good path to take next time is to check whether the database is overloaded - that would explain why the frontend falls over without an obvious cause, although it must be said that most of the CPU was taken by WSGI processes, not pgsql.
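A quick way to tell next time where the CPU is actually going, before blaming the database (a hedged sketch; the commands assume a standard Linux host and were not what was run during this outage):

```shell
# Top CPU consumers: if the WSGI workers dominate, the frontend is the
# bottleneck; if postgres dominates, dig into the database instead.
ps -eo pcpu,comm --sort=-pcpu | head -n 5
# If postgres tops the list, pg_stat_activity is the next stop, e.g.:
#   sudo -u postgres psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
```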

comment:3 Changed 19 months ago by anarcat

clarification: outage actually started right about when the graphs noticed it, so there's a clear correlation between the higher traffic and the downtime.

comment:4 Changed 15 months ago by anarcat

Resolution: fixed
Status: assigned → closed

it looks like the little tweaks I made to the WSGI handlers helped, because I haven't heard of these problems in the 4 months since I last commented on this ticket.

when i asked in #tor-project, the response was that listing users in the admin interface was slow, but that otherwise "it's been working fine".

so I think we can close this. if we have more specific problems, we can reopen.

comment:5 Changed 10 months ago by anarcat

we just had a problem again where trac was overwhelmed. I noticed an IP that was hogging ~30 apache threads all on its own, thanks to this magic command:

ss -t -n 'dport = 80 or sport = 80 or dport = 443 or sport = 443' | awk '{print $NF}' | sed 's/:[0-9]*$//' | sort | uniq -c | sort -n
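To make the aggregation concrete, here is the same pipeline run over a few made-up `ss -t -n` lines (the addresses are documentation placeholders, not the real client):

```shell
# Fake ss output: the last column is the remote address:port
sample='ESTAB 0 0 198.51.100.5:443 192.0.2.1:40900
ESTAB 0 0 198.51.100.5:443 192.0.2.1:40901
ESTAB 0 0 198.51.100.5:443 203.0.113.9:50000'

# Same aggregation as above: take the last field, strip the :port suffix,
# count occurrences per IP, and sort numerically so the busiest IP ends up last
echo "$sample" | awk '{print $NF}' | sed 's/:[0-9]*$//' | sort | uniq -c | sort -n
```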

That showed a single IP that was taking up most of the threads. I killed the IP with:


... and restarted apache (to kill old sockets). Load has returned to normal and things seem generally happier.
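The exact command used isn't recorded above. For the record, a typical way to cut off a single abusive address is an iptables drop rule; this sketch uses a placeholder IP and builds the command as a string, since actually running it needs root:

```shell
# Hypothetical block rule for one abusive address
# (192.0.2.1 is a documentation placeholder, not the real client IP)
block_cmd="iptables -I INPUT -s 192.0.2.1 -j DROP"
echo "$block_cmd"
# after blocking, restart apache to tear down the sockets it already holds:
#   systemctl restart apache2
```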
