Thinking about #33364 (moved), I found that snowflake-server is chewing up a lot of memory. It looks like it may be a memory leak.
$ top -o%MEM
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26910 debian-+  20   0 1916628 1.522g      0 S   0.0 77.8  58:51.37 snowflake-serve
The memory use seems to be inhibiting other processes. runsvdir puts status messages in its own argv, so you can inspect them with ps; currently they show xz being unable to allocate memory to compress the logs.
Initially I suspected the recent websocketconn changes from #33144 (moved), but those can only have had an effect since 2020-02-10 18:57, when the server was restarted, and the earliest reports of "Could not connect to the bridge" predate that (assuming that the memory usage and the issue in #33364 (moved) are the same).
Trac: Description updated. The earlier version of the description also said:

In the short term, looks like we need to restart the server. Then we need to figure out what's causing it to use so much memory.

The server was last restarted 2020-02-10 18:57 (one week ago) for #32964 (moved).
Well, there is a leak in the new websocketconn code anyway. You can see it with this patch. The "Close pw1" line runs but the "Close pr2" line does not, leaking a goroutine.
The first goroutine reads messages from the WebSocket ws and writes them to pw1, which causes them to be returned from the Conn's Read method (pr1 is the Reader and is the other end of the pw1 pipe). This part is fine.
The second goroutine reads from pr2 and writes to the WebSocket ws. pr2 gets its input from things written using the Conn's Write method, which feeds directly into pw2.
Problem is, there's nothing that ever closes pw2. The Close method closes the WebSocket, so if the second goroutine ever were to try to write to it, it would detect an error and exit. But as long as nothing further ever calls Write, nothing is written to pw2 and so the goroutine waits forever.
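To make the structure concrete, here is a minimal sketch of the shape of the code as described above. The pr1/pw1/pr2/pw2 names follow the description, and the New constructor and gorilla/websocket calls are only for illustration; this is not the actual websocketconn source.

```go
// Illustrative sketch of the leaky structure described above.
package websocketconn

import (
	"io"

	"github.com/gorilla/websocket"
)

// Conn adapts a WebSocket connection to an io.ReadWriteCloser using two pipes.
type Conn struct {
	ws  *websocket.Conn
	pr1 *io.PipeReader // Read reads from here
	pw2 *io.PipeWriter // Write writes to here
}

func New(ws *websocket.Conn) *Conn {
	pr1, pw1 := io.Pipe()
	pr2, pw2 := io.Pipe()

	// Goroutine 1: WebSocket -> pw1 (drained by Conn.Read via pr1).
	go func() {
		defer pw1.Close() // "Close pw1" runs once the WebSocket read fails
		for {
			_, p, err := ws.ReadMessage()
			if err != nil {
				return
			}
			if _, err := pw1.Write(p); err != nil {
				return
			}
		}
	}()

	// Goroutine 2: pr2 (fed by Conn.Write via pw2) -> WebSocket.
	go func() {
		defer pr2.Close() // "Close pr2" is never reached after Close: the Read below stays blocked
		var buf [2048]byte
		for {
			n, err := pr2.Read(buf[:]) // blocks until something writes to pw2 or closes a pipe end
			if err != nil {
				return
			}
			if err := ws.WriteMessage(websocket.BinaryMessage, buf[:n]); err != nil {
				return
			}
		}
	}()

	return &Conn{ws: ws, pr1: pr1, pw2: pw2}
}

func (c *Conn) Read(p []byte) (int, error)  { return c.pr1.Read(p) }
func (c *Conn) Write(p []byte) (int, error) { return c.pw2.Write(p) }

// Close closes only the WebSocket. Goroutine 1 sees the resulting read error
// and exits, but goroutine 2 stays blocked in pr2.Read because nothing ever
// closes pw2, so it leaks.
func (c *Conn) Close() error {
	return c.ws.Close()
}
```

With this shape, every connection that is closed while its write side is idle leaves one goroutine, plus the pipe and buffer it references, pinned forever, which fits the steady memory growth.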
Now let's watch it to see if memory use increases from some other cause, or remains stable. Here's a graph of memory use today. It was very high until being restarted at 19:15 in comment:4, then started creeping up again at a rate of about 11 MB / hour (which works out to roughly 1.75 GB in 6.4 days, about the one week noted in the ticket description). I'll post an updated graph once the patched server has been able to run for a while.
Since restarting with the patch from comment:5, memory usage has been fairly flat, but still increasing slightly. I don't know if it's another memory leak or something else. It could be that, even before the websocketconn changes in #33144 (moved) introduced the serious leak, we had a slower leak that resulted in the earlier reports #33112 (moved), #33126 (moved), and #33127 (moved).
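For context, the actual change is the patch in comment:5; the general shape of a fix for the leak sketched earlier, though, is for Close to also close the pipe ends so that the blocked Read in the second goroutine returns and the goroutine exits. A rough illustration against the earlier sketch, not the real diff:

```go
// Close tears down both pipes in addition to the WebSocket. pw2.Close()
// makes a pending pr2.Read in goroutine 2 return io.EOF, so that goroutine
// exits instead of leaking. (Sketch only, reusing the names from the
// earlier example; not the actual patch.)
func (c *Conn) Close() error {
	c.pr1.Close() // unblock any pending Conn.Read
	c.pw2.Close() // unblock goroutine 2's pr2.Read with io.EOF
	return c.ws.Close()
}
```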
I'll close this because we've solved the acute issue, and leave #33112 (moved) open in case there is something else that sometimes causes proxies not to be able to reach the bridge.
Trac: Resolution: N/A to fixed
Status: needs_information to closed
I would not be surprised if there were goroutine leaks in other parts of snowflake as well. We should take this opportunity to check the client, proxy-go, broker, and server for leaks.
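As a starting point for that kind of audit, here is one way to look for goroutine leaks. This is only a sketch: the SIGUSR1 handler is a hypothetical debugging addition, not something any of the snowflake components currently do. Triggering it twice, some time apart, and diffing the output makes long-lived stuck goroutines (for example ones blocked forever in an io.Pipe Read) stand out.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime"
	"runtime/pprof"
	"syscall"
)

// dumpGoroutinesOnSignal logs the goroutine count and writes a full
// goroutine dump (panic-style stacks) to stderr whenever the process
// receives SIGUSR1.
func dumpGoroutinesOnSignal() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		for range ch {
			log.Printf("NumGoroutine = %d", runtime.NumGoroutine())
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
}

func main() {
	dumpGoroutinesOnSignal()
	// ... the real program's work would go here ...
	select {} // block forever in place of the real main loop
}
```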
The pprof package has a WriteHeapProfile analogous to the StartCPUProfile/StopCPUProfile from #33211 (moved). I haven't tried it but I found a blog post about using it to find memory leaks.
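A minimal sketch of how WriteHeapProfile might be wired in, assuming a hypothetical -memprofile flag (named by analogy with the usual -cpuprofile convention; it is not an existing snowflake option):

```go
package main

import (
	"flag"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// memProfile is a hypothetical flag for this sketch, not an existing option.
var memProfile = flag.String("memprofile", "", "write a heap profile to this file on exit")

func main() {
	flag.Parse()

	// ... run the program ...

	if *memProfile != "" {
		f, err := os.Create(*memProfile)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		runtime.GC() // flush up-to-date allocation statistics before dumping
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Fatal(err)
		}
	}
}
```

The resulting file can then be loaded with go tool pprof to see which allocation sites are holding on to memory.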