Goroutine leak in websocketconn

added component::circumvention/snowflake priority::medium resolution::fixed reviewer::cohosh severity::normal status::closed type::defect labels

Got it. Are you handling the restart?

Replying to dcf:

The server was last restarted 2020-02-10 18:57 (one week ago) at ca9ae12c383405bc9a755e1bc902e9755495c1f1 for #32964 (moved).

Initially I suspected the recent websocketconn changes from #33144 (moved), but those can only have had effect since 2020-02-10 18:57 when the server was restarted, and the earliest reports of "Could not connect to the bridge" predate that (assuming that the memory usage and the issue in #33364 (moved) are the same).

2020-01-30 #33112 (moved)
2020-02-02 #33126 (moved)
2020-02-02 #33127 (moved)
2020-02-18 #33364 (moved)

Trac:
Description: Thinking about #33364 (moved), I found that snowflake-server is chewing a lot of memory. It may be some memory leak or something.

$ top -o%MEM
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26910 debian-+  20   0 1916628 1.522g      0 S   0.0 77.8  58:51.37 snowflake-serve

The memory use seems to be inhibiting other processes. runsvdir puts status messages in its own argv so you can inspect them with ps. Currently it's reflecting xz not being able to allocate memory to compress logs:

$ ps ax | grep runsvdir
 1358 ?        Ss    94:01 runsvdir -P /etc/service log: locate memory \
svlogd: warning: processor failed, restart: /home/snowflake-proxy/snowflake-proxy-standalone-17h.log.d xz: (stdin): Cannot allocate memory \
svlogd: warning: processor failed, restart: /home/snowflake-proxy/snowflake-proxy-standalone-17h.log.d xz: (stdin): Cannot allocate memory \
svlogd: warning: processor failed, restart: /home/snowflake-proxy/snowflake-proxy-standalone-17h.log.d

I even got it just now trying to run a diagnostic command (it doesn't always happen):

$ ps ax | grep standal
-bash: fork: Cannot allocate memory

In the short term, looks like we need to restart the server. Then we need to figure out what's causing it to use so much memory.

The server was last restarted 2020-02-10 18:57 (one week ago) for #32964 (moved).

to

Thinking about #33364 (moved), I found that snowflake-server is chewing a lot of memory. It may be some memory leak or something.

$ top -o%MEM
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26910 debian-+  20   0 1916628 1.522g      0 S   0.0 77.8  58:51.37 snowflake-serve

The memory use seems to be inhibiting other processes. runsvdir puts status messages in its own argv so you can inspect them with ps. Currently it's reflecting xz not being able to allocate memory to compress logs:

$ ps ax | grep runsvdir
 1358 ?        Ss    94:01 runsvdir -P /etc/service log: locate memory \
svlogd: warning: processor failed, restart: /home/snowflake-proxy/snowflake-proxy-standalone-17h.log.d xz: (stdin): Cannot allocate memory \
svlogd: warning: processor failed, restart: /home/snowflake-proxy/snowflake-proxy-standalone-17h.log.d xz: (stdin): Cannot allocate memory \
svlogd: warning: processor failed, restart: /home/snowflake-proxy/snowflake-proxy-standalone-17h.log.d

I even got it just now trying to run a diagnostic command (it doesn't always happen):

$ ps ax | grep standal
-bash: fork: Cannot allocate memory

In the short term, looks like we need to restart the server. Then we need to figure out what's causing it to use so much memory.

The server was last restarted 2020-02-10 18:57 (one week ago) at ca9ae12c383405bc9a755e1bc902e9755495c1f1 for #32964 (moved).

Replying to cohosh:

Got it. Are you handling the restart?

I want to set up some simple memory monitor first and then I'll update when I restart it.

I did service tor restart at 2020-02-18 19:15:30, which immediately relieved the memory pressure.

I'm running a simple loop to make a log of memory usage:

while true; do \
  stdbuf -o 0 printf '%s,%s\n' \
    "$(free | tail -n +2 | head -n 1 | gawk 'BEGIN {OFS=","} {$1=strftime("%Y-%m-%d %H:%M:%S", systime(), 1); print}')" \
    "$(pidof snowflake-server)" \
  sleep 30 \
done >> mem.log

=date=	=total=	=used=	=free=	=shared=	=buff/cache=	=available=	=pid=
2020-02-18 19:14:17	2050352	1872184	63904	97116	114264	16074	26910
2020-02-18 19:14:47	2050352	1857036	80360	97116	112956	12912	26910
2020-02-18 19:15:17	2050352	1856508	79384	97116	114460	12736	26910
2020-02-18 19:15:47	2050352	262348	1655960	97116	132044	1598992
2020-02-18 19:16:18	2050352	235200	1604900	97116	210252	1587252	6710
2020-02-18 19:16:48	2050352	265172	1558428	97116	226752	1554828	6710
2020-02-18 19:17:18	2050352	264920	1558276	97116	227156	1555088	6710
2020-02-18 19:17:48	2050352	263456	1559728	97116	227168	1557708	6710

Replying to dcf:

Initially I suspected the recent websocketconn changes from #33144 (moved), but those can only have had effect since 2020-02-10 18:57 when the server was restarted, and the earliest reports of "Could not connect to the bridge" predate that

Well, there is a leak in the new websocketconn code anyway. You can see it with this patch. The "Close pw1" line runs but the "Close pr2" line does not, leaking a goroutine.

diff --git a/common/websocketconn/websocketconn.go b/common/websocketconn/websocketconn.go
index 46bb977..0fb7fa9 100644
--- a/common/websocketconn/websocketconn.go
+++ b/common/websocketconn/websocketconn.go
@@ -3,4 +3,5 @@ package websocketconn
 import (
 	"io"
+	"log"
 	"time"
 
@@ -106,8 +107,10 @@ func New(ws *websocket.Conn) *Conn {
 	go func() {
 		pw1.CloseWithError(closeErrorToEOF(readLoop(pw1, ws)))
+		log.Printf("%p Close pw1", ws)
 	}()
 	pr2, pw2 := io.Pipe()
 	go func() {
 		pr2.CloseWithError(closeErrorToEOF(writeLoop(ws, pr2)))
+		log.Printf("%p Close pr2", ws)
 	}()
 	return &Conn{

https://gitweb.torproject.org/pluggable-transports/snowflake.git/tree/common/websocketconn/websocketconn.go?id=ca9ae12c383405bc9a755e1bc902e9755495c1f1#n106

	pr1, pw1 := io.Pipe()
	go func() {
		pw1.CloseWithError(closeErrorToEOF(readLoop(pw1, ws)))
	}()
	pr2, pw2 := io.Pipe()
	go func() {
		pr2.CloseWithError(closeErrorToEOF(writeLoop(ws, pr2)))
	}()
	return &Conn{
		Conn:   ws,
		Reader: pr1,
		Writer: pw2,
	}

The first goroutine reads messages from the WebSocket ws and writes them to pw1, which causes them to be returned from the Conn's Read method (pr1 is the Reader and is the other end of the pw1 pipe). This part is fine.

The second goroutine reads from pr2 and writes to the WebSocket ws. pr2 gets its input from things written using the Conn's Write method, which feeds directly into pw2.

Problem is, there's nothing that ever closes pw2. The Close method closes the WebSocket, so if the second goroutine ever were to try to write to it, it would detect an error and exit. But as long as nothing further ever calls Write, nothing is written to pw2 and so the goroutine waits forever.

Here's a patch that closes the internal Pipes when the Close method is called, and a test case that fails before the patch and does not fail after the patch. https://gitweb.torproject.org/user/dcf/snowflake.git/commit/?h=bug33367-websocketconn&id=380b133155ad725126bc418d0e66b3c550b4c555

Trac:
Status: new to needs_review

Trac:
Reviewer: N/A to cohosh

Nice catch, looks good to me!

Trac:
Status: needs_review to merge_ready

Merged the websocketconn fix in 380b133155ad725126bc418d0e66b3c550b4c555.

Deployed that version at 2020-02-18 23:18:10.

Now let's watch it to see if memory use increases from some other cause, or remains stable. Here's a graph of memory use today. It was very high until being restarted at 19:15 in comment:4, then started creeping up again at a rate of about 24 MB / hour (which works out to 1.75 GB in 6.4 days, about the one week noted in the ticket description). I'll update the ticket later with an updated graph when the patched server has been able to run for a while.

Trac:
Status: merge_ready to needs_information

Trac:

Graph of memory usage between 19:00 and 23:30 on 2020-02-18.

Trac:
snowflake-server-mem-20200218.zip

Source code and data for snowflake-server-mem-20200218.png

Trac:

Graph of memory usage between 2020-02-18 19:00 and 2020-02-19 16:40

Trac:
snowflake-server-mem-20200219.zip

Source code and data for snowflake-server-mem-20200219.zip

Since restarting with the patch from comment:5, memory usage has been fairly flat but still increasing slightly. I don't know if it's another memory leak or what. It could be that, even before the websocketconn changes in #33144 (moved) that introduced a serious leak, we had a slower leak that resulted in the reports #33112 (moved), #33126 (moved), and #33127 (moved), which were earlier.

I'll close this because we've solve the acute issue, and leave #33112 (moved) open in case there is something else that causes proxies not to be able to reach the bridge sometimes.

Trac:
Resolution: N/A to fixed
Status: needs_information to closed

Trac:
Summary: Snowflake server using 1.5 GB memory, preventing other allocations to Goroutine leak in websocketconn

I would not be surprised if there were goroutine leaks in other parts of snowflake as well. We should take this opportunity to check the client, proxy-go, broker, and server for leaks.

Replying to cohosh:

I would not be surprised if there were goroutine leaks in other parts of snowflake as well. We should take this opportunity to check the client, proxy-go, broker, and server for leaks.

The pprof package has a WriteHeapProfile analogous to the StartCPUProfile/StopCPUProfile from #33211 (moved). I haven't tried it but I found a blog post about using it to find memory leaks.

closed

moved to tpo/anti-censorship/pluggable-transports/snowflake#33367 (closed)

Goroutine leak in websocketconn

Child items 0

Activity