Opened 9 months ago

Closed 4 months ago

#29672 closed defect (fixed)

trac gets overwhelmed

Reported by: anarcat
Owned by: qbi
Priority: High
Milestone:
Component: Internal Services/Service - trac
Version:
Severity: Critical
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

There seem to be recurring performance problems with Trac. Diagnose the problem and find a solution.

Child Tickets

Change History (4)

comment:1 Changed 9 months ago by anarcat

As part of the monitoring work, I think setting up Prometheus / Grafana could help in figuring out what the bottleneck is and resolving it. In the meantime it would be useful to hook the service up in Nagios so we at least know when problems occur.
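For the Nagios side, a plain HTTP availability/latency probe would probably be enough to start with. A minimal sketch of such a check, written as a Nagios-style plugin; the URL and thresholds are placeholders, and none of this is deployed yet:

#!/usr/bin/env python3
# Minimal Nagios-style availability/latency check for Trac (sketch only).
# The URL and warning/critical thresholds below are assumptions.
import sys
import time
import urllib.request

URL = "https://trac.torproject.org/"   # assumed endpoint
WARN_SECONDS = 5
CRIT_SECONDS = 15

def main():
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=CRIT_SECONDS) as resp:
            status = resp.status
    except Exception as exc:
        # Timeouts, connection refusals, TLS errors etc. all count as critical.
        print("CRITICAL: request failed: %s" % exc)
        return 2
    elapsed = time.time() - start
    if status != 200:
        print("CRITICAL: HTTP %d after %.1fs" % (status, elapsed))
        return 2
    if elapsed > WARN_SECONDS:
        print("WARNING: HTTP 200 but slow (%.1fs)" % elapsed)
        return 1
    print("OK: HTTP 200 in %.1fs" % elapsed)
    return 0

if __name__ == "__main__":
    sys.exit(main())

The exit codes (0/1/2) follow the usual Nagios plugin convention, so the script could be dropped into a check_command as-is if we go that route.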

comment:2 Changed 8 months ago by anarcat

Today Trac hung badly: all requests were returning 503 errors to clients and the machine was maxing out its CPU and memory. I found this in the error log:

[Thu Apr 11 16:30:23.749569 2019] [wsgi:error] [pid 22934:tid 140416296871680] (11)Resource temporarily unavailable: [client [REDACTED]:40900] mod_wsgi (pid=22934): Unable to connect to WSGI daemon process 'trac.torproject.org' on '/var/run/apache2/wsgi.2106.9.1.sock' after multiple attempts as listener backlog limit was exceeded.

The trac.log was full of:

IOError: Apache/mod_wsgi failed to write response data: Broken pipe

CPU and memory had already been maxed out for more than two hours when the outage started:

https://paste.anarc.at/snaps/snap-2019.04.11-12.53.48.png

Apache was also seeing more hits than usual:

https://paste.anarc.at/snaps/snap-2019.04.11-12.57.04.png

But I don't believe it was starving for resources:

https://paste.anarc.at/snaps/snap-2019.04.11-12.58.56.png

It's possible the PostgreSQL database got overwhelmed. We don't have metrics for that in Prometheus because, ironically enough, I decided just yesterday that it might be overkill. Maybe we should revisit that decision now.
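Until we have proper database metrics, a quick way to check for overload next time would be a small script that looks at pg_stat_activity. A rough sketch, assuming local access to the trac database via psycopg2; the DSN and the idea of counting active backends are guesses, not something we have in place:

#!/usr/bin/env python3
# Quick check for PostgreSQL overload: count active backends and report the
# longest-running query. Sketch only; the DSN is an assumption and the script
# must run as a role allowed to read pg_stat_activity.
import psycopg2

DSN = "dbname=trac"  # assumed local connection

def main():
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT count(*) FILTER (WHERE state = 'active'),
                       coalesce(max(now() - query_start), interval '0')
                FROM pg_stat_activity
                WHERE pid <> pg_backend_pid()
            """)
            active, longest = cur.fetchone()
        print("active backends: %d, longest running query: %s" % (active, longest))
    finally:
        conn.close()

if __name__ == "__main__":
    main()

A pile-up of active backends or a query running for minutes during an outage would point at the database rather than the WSGI layer.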

I wonder if our WSGI config could be tweaked. This is what we have right now:

WSGIDaemonProcess trac.torproject.org user=tracweb group=tracweb home=/ processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 umask=0007 display-name=wsgi-trac.torproject.org

I've decided to make more of those settings explicit to see if some tweaks might be useful:

WSGIDaemonProcess trac.torproject.org user=tracweb group=tracweb home=/ processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 umask=0007 graceful-timeout=30 restart-interval=30 response-socket-timeout=10 display-name=wsgi-trac.torproject.org

The server was rebooted, which fixed the immediate problem; we'll see whether the above tweaks keep it from recurring.

Failing that, a good next step is to check whether the database is overloaded the next time this happens: that would explain why the frontend falls over without an obvious cause, although it must be said that most of the CPU was being used by the WSGI processes, not PostgreSQL.
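One way to narrow that down from the frontend side would be to wrap the Trac WSGI application in a small middleware that logs slow requests, so we can see which URLs are stuck and for how long. A sketch only; the log path, threshold and the way Trac is wired into the wsgi script are assumptions, and this is not deployed:

# Sketch of a WSGI middleware that logs requests slower than a threshold.
# Illustrative only: log path and threshold are made up.
import logging
import time

logging.basicConfig(filename="/var/log/trac/slow-requests.log",  # assumed path
                    level=logging.INFO)

SLOW_SECONDS = 5  # arbitrary threshold

class SlowRequestLogger:
    """Wrap a WSGI application and log requests that take too long."""

    def __init__(self, app, threshold=SLOW_SECONDS):
        self.app = app
        self.threshold = threshold

    def __call__(self, environ, start_response):
        start = time.time()
        try:
            return self.app(environ, start_response)
        finally:
            # Note: this measures until the response iterable is returned,
            # not until it is fully consumed by the server.
            elapsed = time.time() - start
            if elapsed > self.threshold:
                logging.info("slow request: %s %s took %.1fs",
                             environ.get("REQUEST_METHOD"),
                             environ.get("PATH_INFO"),
                             elapsed)

# In the mod_wsgi script, the Trac application would then be wrapped with
# something like: application = SlowRequestLogger(trac.web.main.dispatch_request)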

comment:3 Changed 8 months ago by anarcat

Clarification: the outage actually started right about when the graphs picked it up, so there's a clear correlation between the higher traffic and the downtime.

comment:4 Changed 4 months ago by anarcat

Resolution: fixed
Status: assigned → closed

It looks like the little tweaks I made to the WSGI handlers helped, because I haven't heard of these problems in the four months since I last commented on this ticket.

When I asked in #tor-project, the response was that listing users in the admin interface is slow, but that otherwise "it's been working fine".

So I think we can close this. If we run into more specific problems, we can reopen.
