Opened 20 months ago

Closed 6 months ago

Last modified 6 months ago

#21312 closed defect (fixed)

Memory and file descriptor leaks in programs that use go-webrtc

Reported by: arlolra
Owned by: arlolra
Priority: High
Milestone:
Component: Obfuscation/Snowflake
Version:
Severity: Major
Keywords:
Cc: arma, dcf
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

1:30:49 AM - arma5: 2017/01/24 18:57:44 candidate:3131354255 1 tcp 1518280447 192.168.1.154 36729 typ host tcptype passive generation 0 ufrag YXKXffPBBnA3SJ3K network-id 1
1:30:52 AM - arma5: this is the last line in its log
1:31:20 AM - arma5: and snowflake-client is pegged at 100% cpu
1:32:39 AM - arma5: %  strace -p3074
1:32:39 AM - arma5: Process 3074 attached
1:32:39 AM - arma5: futex(0x1059710, FUTEX_WAIT, 0, NULL

Child Tickets

Attachments (28)

20170809-info-threads.txt (71.0 KB) - added by dcf 14 months ago.
20170809-thread-apply-all-bt.txt (240.1 KB) - added by dcf 14 months ago.
snowflake-client-fds-20170814.png (10.8 KB) - added by dcf 13 months ago.
Graph of number of file descriptors used by snowflake-client.
20170816-thread-apply-all-bt.txt (55.8 KB) - added by dcf 13 months ago.
proxy-go-mem.20180308.png (35.6 KB) - added by dcf 7 months ago.
proxy-go-fd.20180308.png (24.6 KB) - added by dcf 7 months ago.
proxy-go-mem.20180308.log (74.9 KB) - added by dcf 7 months ago.
while true; do free | tail -n +2 | head -n 1 | gawk 'BEGIN {OFS=","} {$1=strftime("%Y-%m-%d %H:%M:%S", systime(), 1); print}'; sleep 180; done >> mem.log
proxy-go-fd.20180308.log (28.2 KB) - added by dcf 7 months ago.
while true; do echo "$(date -u +"%Y-%m-%d %H:%M:%S"),$(lsof | grep -c ^proxy-go)"; sleep 180; done >> fd.log
proxy-go-log.R (1.6 KB) - added by dcf 7 months ago.
0001-Call-explicit-frees-in-server-webrtc.patch (1.8 KB) - added by dcf 6 months ago.
0001-Call-explicit-frees-in-server-webrtc.2.patch (2.3 KB) - added by dcf 6 months ago.
0001-Add-synchronization-around-destroying-dataChannels.patch (8.0 KB) - added by arlolra 6 months ago.
proxy-go-log.20180314.R (2.8 KB) - added by dcf 6 months ago.
proxy-go-starting.20180314.log (41.8 KB) - added by dcf 6 months ago.
grep starting *.log | sed -e 's/\(.*\)-\([0-9]\+h\)\.log:\(.......... ..:..:..\) .*/\1,\2,\3/' > starting.log
proxy-go-starting.20180314.png (57.3 KB) - added by dcf 6 months ago.
proxy-go-abort.20180314.log (20.5 KB) - added by dcf 6 months ago.
gdb and stdout/stderr of a "pure virtual method called" in proxy-go
0001-Add-locks-around-erases.patch (1.5 KB) - added by arlolra 6 months ago.
0001-Add-synchronization-to-manipulating-vectors.patch (3.0 KB) - added by arlolra 6 months ago.
0001-Add-synchronization-around-destroying-dataChannels-a.patch (7.2 KB) - added by arlolra 6 months ago.
proxy-go-log.20180316.zip (164.1 KB) - added by dcf 6 months ago.
proxy-go-fd.20180316.png (26.9 KB) - added by dcf 6 months ago.
proxy-go-mem.20180316.png (22.6 KB) - added by dcf 6 months ago.
proxy-go-starting.20180316.png (60.5 KB) - added by dcf 6 months ago.
0001-Add-synchronization-around-destroying-DataChannels-a.patch (7.1 KB) - added by arlolra 6 months ago.
proxy-go-log.20180320.zip (197.2 KB) - added by dcf 6 months ago.
proxy-go-fd.20180320.png (25.3 KB) - added by dcf 6 months ago.
proxy-go-mem.20180320.png (22.6 KB) - added by dcf 6 months ago.
proxy-go-starting.20180320.png (57.4 KB) - added by dcf 6 months ago.

Change History (93)

comment:1 Changed 20 months ago by arlolra

1:39:57 AM - arma5: or does strace try to watch all threads
1:40:43 AM - arma5: ok with -f i got a bunch of output
1:42:49 AM - arma5: a bunch of futex, accept, getsockopt, select
1:42:54 AM - arma5: with a bunch of EAGAINs
1:43:03 AM - arma5: could be normal, could be abnormal, hard to say from that
1:45:06 AM - arma5: man that is a lot a lot a lot of threads.
1:45:27 AM - arma5: 1024 threads
1:45:32 AM - arma5: what an intriguing number of threads.

comment:2 Changed 20 months ago by arma

My snowflake-client has 1024 threads open, according to gdb's 'thread apply all bt'.

About half of them are sitting in select:

Thread 559 (Thread 0x7f1178072700 (LWP 5595)):
#0  0x00007f127a105893 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000007b1fd5 in ?? ()
#2  0x0000000000681185 in ?? ()
#3  0x00000000006938e4 in ?? ()
#4  0x000000000069428c in ?? ()    
#5  0x00007f127ac7d064 in start_thread (arg=0x7f1178072700)
    at pthread_create.c:309
#6  0x00007f127a10c62d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

and about half are in pthread_cond_timedwait:

Thread 561 (Thread 0x7f1178a75700 (LWP 5586)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x0000000000696b6a in ?? ()
#2  0x000000000096c9b3 in ?? ()
#3  0x000000000096cb29 in ?? ()
#4  0x00000000006999e5 in ?? ()
#5  0x0000000000699a19 in ?? ()
#6  0x00007f127ac7d064 in start_thread (arg=0x7f1178a75700)
    at pthread_create.c:309
#7  0x00007f127a10c62d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

but anyway I don't have many symbols so hey.

What is making it keep creating new threads? It's probably good that something in my system stopped it at 1024. :)

comment:3 Changed 20 months ago by arma

It's also using a good chunk of virtual memory:

arma      3074 42.5  6.9 6854568 267736 pts/5  Sl   15:54 129:25 ./TorBrowser/Tor/PluggableTransports/snowflake-client -url https://snowflake-reg.appspot.com/ -front www.google.com -ice stun:stun.l.google.com:19302

(I assume that all the threads share most of their memory so it's not so bad in reality.)

Last edited 17 months ago by arma

comment:4 Changed 20 months ago by arma

arlo's first thought is "libwebrtc"

yawning points out that cgo has pathological cases with blocking cgo calls, which "lead to the go runtime creating a shit load of OS threads"

comment:5 Changed 20 months ago by arma

Oh, and to be clear, it's not *just* pegged at 100% cpu. It's also not doing anything useful. :)

comment:6 Changed 20 months ago by yawning

Cc: yawning added

To clarify, cgo calls are dispatched in their own OS thread, and the massive VM use is more than likely multi-megabyte stacks being allocated per thread. Go's lightweight concurrency does not apply with cgo in the picture, especially if "blocking in cgo calls" is part of the equation. GOMAXPROCS will not save you because the parameter does not apply to cgo calls.

The easy way to confirm that this is what is happening is probably to set GODEBUG to something appropriate and look at debug spew on stderr (https://golang.org/pkg/runtime/ probably want scheddetail and schedtrace).

Instrumenting the code to use runtime/pprof's threadcreate profile will give stack traces at thread creation time, if that's useful.
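
As an illustration of that runtime/pprof suggestion, here is a minimal sketch of dumping the threadcreate profile on demand; the SIGUSR1 trigger and the helper name are assumptions for the example, not part of snowflake. (The GODEBUG route needs no code at all, just something like GODEBUG=schedtrace=1000,scheddetail=1 in the environment.)

package main

import (
    "log"
    "os"
    "os/signal"
    "runtime/pprof"
    "syscall"
)

// dumpThreadCreateProfileOnSignal prints the stacks that created OS threads
// whenever the process receives SIGUSR1.
func dumpThreadCreateProfileOnSignal() {
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGUSR1)
    go func() {
        for range c {
            // "threadcreate" records a stack trace for every OS thread the
            // runtime has created -- exactly what we want to know here.
            if err := pprof.Lookup("threadcreate").WriteTo(os.Stderr, 1); err != nil {
                log.Printf("threadcreate profile: %v", err)
            }
        }
    }()
}

func main() {
    dumpThreadCreateProfileOnSignal()
    select {} // stand-in for the program's real work
}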

Changed 14 months ago by dcf

Attachment: 20170809-info-threads.txt added

Changed 14 months ago by dcf

Attachment: 20170809-thread-apply-all-bt.txt added

comment:9 Changed 14 months ago by dcf

I encountered this lockup for the first time today. snowflake-client was using 200% CPU. I had been browsing and my requests started timing out. Here is the end of snowflake.log; it must have been around 17:00:00 when I noticed the hang.

2017/08/09 16:58:43 ---- Handler: snowflake assigned ----
2017/08/09 16:58:43 Buffered 201 bytes --> WebRTC
2017/08/09 16:58:43 WebRTC: DataChannel.OnOpen
2017/08/09 16:58:43 Flushed 201 bytes.
2017/08/09 16:58:45 Traffic Bytes (in|out): 743 | 201 -- (1 OnMessages, 1 Sends)
2017/08/09 16:58:51 Traffic Bytes (in|out): 51 | 166 -- (1 OnMessages, 2 Sends)
2017/08/09 16:58:53 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
2017/08/09 16:59:03 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
2017/08/09 16:59:13 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
2017/08/09 16:59:17 WebRTC: No messages received for 30 seconds -- closing stale connection.
2017/08/09 16:59:17 WebRTC: closing DataChannel
2017/08/09 16:59:23 WebRTC: Collecting a new Snowflake. Currently at [0/1]
2017/08/09 16:59:23 snowflake-SeQZtvutAUlE6pFr  connecting...
2017/08/09 17:03:45 copy loop ended
2017/08/09 17:03:45 ---- Handler: closed ---
2017/08/09 17:03:45 SOCKS listening...
2017/08/09 17:03:46 SOCKS accepted:  {0.0.3.0:1  map[]}

The process had 521 threads, not 1024.

comment:10 Changed 14 months ago by yawning

I mean, the bindings still appear to leak OS threads like a sieve, and still use cgo, so it's not like anything has changed since arma encountered this.

Changed 13 months ago by dcf

Attachment: snowflake-client-fds-20170814.png added

Graph of number of file descriptors used by snowflake-client.

comment:11 Changed 13 months ago by dcf

Here's a graph of the number of file descriptors held open by the snowflake-client process while I was using it today. The first plateau at 21:30 is when it stopped working and I turned off snowflake in the browser. snowflake-client continued running, got out of whatever rut it was in, and consumed a few more descriptors before I finally closed the browser.

Graph of number of file descriptors used by snowflake-client.

The descriptors are six-fd repetitions of this nature. I suspect there's one for each attempt to contact the broker (i.e. ---- Handler: snowflake assigned ----).

snowflake 28366 david  217w     FIFO       0,10      0t0 1150258477 pipe
snowflake 28366 david  218r     FIFO       0,10      0t0 1150258478 pipe
snowflake 28366 david  219w     FIFO       0,10      0t0 1150258478 pipe
snowflake 28366 david  220u     IPv4 1150257563      0t0        UDP xx.Berkeley.EDU:48700 
snowflake 28366 david  221u     IPv4 1150258488      0t0        TCP xx.Berkeley.EDU:46037 (LISTEN)
snowflake 28366 david  222r     FIFO       0,10      0t0 1150262588 pipe
snowflake 28366 david  223w     FIFO       0,10      0t0 1150262588 pipe
snowflake 28366 david  224r     FIFO       0,10      0t0 1150262589 pipe
snowflake 28366 david  225w     FIFO       0,10      0t0 1150262589 pipe
snowflake 28366 david  226u     IPv4 1150260856      0t0        UDP xx.Berkeley.EDU:49943 
snowflake 28366 david  227u     IPv4 1150262068      0t0        TCP xx.Berkeley.EDU:52465 (LISTEN)

During the plateau at 21:30, there were a lot of Buffered messages in the log:

2017/08/14 14:24:34 WebRTC: DataChannel.OnOpen
2017/08/14 14:24:34 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:36 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:38 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:39 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:40 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:42 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:44 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:46 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:48 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:50 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:52 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:54 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:56 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:58 Buffered 1057 bytes --> WebRTC
2017/08/14 14:25:00 Buffered 1057 bytes --> WebRTC
2017/08/14 14:25:01 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:02 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:03 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:03 WebRTC: No messages received for 30 seconds -- closing stale connection.
2017/08/14 14:25:03 WebRTC: closing DataChannel
2017/08/14 14:25:03 WebRTC: DataChannel.OnClose [locally]
2017/08/14 14:25:03 WebRTC: closing PeerConnection
2017/08/14 14:25:03 WebRTC: Closing
2017/08/14 14:25:04 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:05 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:06 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:07 Buffered 543 bytes --> WebRTC

Changed 13 months ago by dcf

Attachment: 20170816-thread-apply-all-bt.txt added

comment:13 Changed 13 months ago by dcf

Thanks yawning.

Here are backtraces from another case from today, this time with debugging symbols. In this case there were 89 threads and 168 fds.

attachment:20170816-thread-apply-all-bt.txt

Most of the backtraces are in one of two patterns.

  • 27 threads
    #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
    #1  0x000000000063c9de in rtc::Event::Wait(int) ()
    #2  0x000000000088833a in webrtc::EventWrapperImpl::Wait(unsigned long) ()
    #3  0x000000000087ac0d in webrtc::ProcessThreadImpl::Process() ()
    #4  0x000000000063e837 in rtc::PlatformThread::Run() ()
    #5  0x000000000063e786 in rtc::PlatformThread::StartThread(void*) ()
    #6  0x00007fcf4b70a494 in start_thread (arg=0x7fcebc5f6700) at pthread_create.c:333
    #7  0x00007fcf4aba2a8f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
    
  • 50 threads
    #0  0x00007fcf4ab9b383 in select () at ../sysdeps/unix/syscall-template.S:84
    #1  0x0000000000688e48 in rtc::PhysicalSocketServer::Wait(int, bool) ()
    #2  0x000000000063b3b1 in rtc::MessageQueue::Get(rtc::Message*, int, bool) ()
    #3  0x0000000000639a72 in rtc::Thread::Run() ()
    #4  0x0000000000639783 in rtc::Thread::PreRun(void*) ()
    #5  0x00007fcf4b70a494 in start_thread (arg=0x7fcead7ea700) at pthread_create.c:333
    #6  0x00007fcf4aba2a8f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
    

comment:14 in reply to:  13 Changed 13 months ago by yawning

Replying to dcf:

Here are backtraces from another case from today, this time with debugging symbols. In this case there were 89 threads and 168 fds.

Not immediately attributable to the leak I pointed out 7 MONTHS AGO: 55, 12, 11, 10, 6, 5, 4, 3, 2, 1

The Peer object leaks webrtc Thread objects, and said Thread objects also happen to call pipe() as part of the creation process. This is no longer news. While I'm not going to rule out there being other things that are wrong, fixing the bindings is probably a good idea.

Last edited 13 months ago by yawning

comment:15 Changed 13 months ago by arlolra

For what it's worth (and probably not much), rtc::scoped_refptr<PeerConnectionFactoryInterface> pc_factory; should be automatically deallocated when the Peer goes out of scope, so isn't exactly leaking. The culprit is vector<rtc::scoped_refptr<Peer>> localPeers; which never releases the Peers. See, https://github.com/keroserene/go-webrtc/issues/35#issuecomment-233465318

comment:16 in reply to:  15 Changed 13 months ago by yawning

Replying to arlolra:

For what it's worth (and probably not much), rtc::scoped_refptr<PeerConnectionFactoryInterface> pc_factory; should be automatically deallocated when the Peer goes out of scope, so isn't exactly leaking. The culprit is vector<rtc::scoped_refptr<Peer>> localPeers; which never releases the Peers. See, https://github.com/keroserene/go-webrtc/issues/35#issuecomment-233465318

I guess?

The Peer dtor needs to call delete on the created Threads, because the PeerConnectionFactory's dtor will not if you explicitly provide one. It's not immediately obvious to me what handles tearing down the SocketServer either, but there's a limit to how much I want to look at the bindings or the webrtc code.

I can't tell where worker_thread_ in the PeerConnectionFactory is supposed to be terminated either, though the PeerConnectionFactory dtor appears to shut down the signalling thread. At least if the worker_thread_ magically gets terminated somehow (which it may) the code will only be leaking memory and maybe some file descriptors.

But at this point I'm done wasting my time even thinking about this. Good luck.

comment:17 Changed 13 months ago by yawning

Cc: yawning removed

comment:18 Changed 12 months ago by cypherpunks

Priority: Medium → High
Severity: Normal → Major

comment:19 Changed 12 months ago by dcf

Cc: dcf added

comment:20 Changed 12 months ago by arlolra

Owner: set to arlolra
Status: new → assigned

I'll probably be working on this next week at the dev meeting.

comment:22 Changed 7 months ago by arlolra

Pushed a patch which makes use of the above fixes in the client:
https://gitweb.torproject.org/pluggable-transports/snowflake.git/commit/?id=f6301197cc154781cfa96a5255c47c9977887f02

Similar patches for the go proxy (which I have) and the webrtc server (which I don't) are needed.
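
For reference, the shape of those explicit frees (run when a connection's copy loop ends) is roughly the sketch below. It only uses calls that appear elsewhere in this ticket (pc.DeleteDataChannel, pc.Destroy); the package, type, and method names are illustrative, not the actual snowflake code.

package proxy // hypothetical package name, for illustration only

import "github.com/keroserene/go-webrtc"

type conn struct {
    pc *webrtc.PeerConnection
    dc *webrtc.DataChannel
}

// cleanup explicitly frees the native WebRTC objects once a connection is
// done, instead of waiting for Go garbage collection (which never releases
// the underlying C++ Peer).
func (c *conn) cleanup() {
    if c.dc != nil {
        c.pc.DeleteDataChannel(c.dc)
        c.dc = nil
    }
    if c.pc != nil {
        c.pc.Destroy()
        c.pc = nil
    }
}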

comment:23 Changed 7 months ago by arlolra

Opened #25429 for another problem surfaced here.

Changed 7 months ago by dcf

Attachment: proxy-go-mem.20180308.png added

Changed 7 months ago by dcf

Attachment: proxy-go-fd.20180308.png added

Changed 7 months ago by dcf

Attachment: proxy-go-mem.20180308.log added

while true; do free | tail -n +2 | head -n 1 | gawk 'BEGIN {OFS=","} {$1=strftime("%Y-%m-%d %H:%M:%S", systime(), 1); print}'; sleep 180; done >> mem.log

Changed 7 months ago by dcf

Attachment: proxy-go-fd.20180308.log added

while true; do echo "$(date -u +"%Y-%m-%d %H:%M:%S"),$(lsof | grep -c ^proxy-go)"; sleep 180; done >> fd.log

Changed 7 months ago by dcf

Attachment: proxy-go-log.R added

comment:25 in reply to:  24 Changed 7 months ago by dcf

Replying to arlolra:

A similar patch for the go proxy (which I have)

https://gitweb.torproject.org/pluggable-transports/snowflake.git/commit/?id=ff8f3851082e8f7f8b4c8b99b161be35020aeb67

Here are graphs of memory and file descriptor usage on the server hosting 6 proxy-go instances, before and after deploying the patch. (Memory is systemwide, file descriptors are only for proxy-go processes.)

It's pretty obvious the patch fixed a memory leak. It looks like, unlike snowflake-client, file descriptors were not a problem for proxy-go.

I put a ruler for 10 hours because the proxy-go instances are restarted periodically, at intervals of 1, 2, or 10 hours (two instances for each interval). The periodic restarting was a workaround for them not working if you let them run too long.



comment:26 Changed 6 months ago by cypherpunks

@dcf I'm seeing 80-120sec interruptions at times following your change (it's definitely noticeable). Looking at snowflake.log around the intervals in which it happens, I found this:

10:54 WebRTC: Collecting a new Snowflake. Currently at [0/1]
10:54 snowflake-<redacted-ID1>  connecting...
10:54 WebRTC: PeerConnection created.
10:54 WebRTC: OnNegotiationNeeded
10:54 WebRTC: DataChannel created.
.
.
.
11:10 ---- Handler: snowflake assigned ----
11:10 Buffered xxx bytes --> WebRTC
11:10 WebRTC: DataChannel.OnOpen
11:10 Flushed xxx bytes.
11:10 Traffic Bytes (in|out): xxx | xx -- (1 OnMessages, 1 Sends)
11:17 Traffic Bytes (in|out): xxxxx | xxxxx -- (35 OnMessages, 28 Sends)
11:20 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
11:30 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
11:40 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
11:42 Traffic Bytes (in|out): xxxx | xxxx -- (1 OnMessages, 1 Sends)
11:50 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
12:00 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
12:10 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
12:13 WebRTC: No messages received for 30 seconds -- closing stale connection.
12:13 WebRTC: closing DataChannel
12:13 WebRTC: DataChannel.OnClose [locally]
12:13 WebRTC: closing PeerConnection
12:13 WebRTC: Closing
12:13 copy loop ended
12:13 ---- Handler: closed ---
12:13 SOCKS listening...
12:14 SOCKS accepted:  {0.0.3.0:1  map[]}
12:18 Traffic Bytes (in|out): 0 | xxxxx -- (0 OnMessages, 23 Sends)
12:20 WebRTC: Collecting a new Snowflake. Currently at [0/1]
12:20 snowflake-<redacted-ID2>  connecting...
12:20 WebRTC: PeerConnection created.
12:20 WebRTC: OnNegotiationNeeded
12:20 WebRTC: DataChannel created.

Not sure though whether that's due to that change.

Last edited 6 months ago by cypherpunks

comment:27 in reply to:  26 Changed 6 months ago by dcf

Replying to cypherpunks:

@dcf I'm seeing 80-120sec interruptions at times following your change (it's definitely noticeable). Looking at snowflake.log around the intervals in which it happens, I found this:

@cypherpunks which change do you mean? The patches in comment:21 (go-webrtc), comment:22 (snowflake-client), and comment:24 (proxy-go) are by arlolra. If you mean comment:25, all I did was recompile with the patch from comment:24 and deploy the public proxy-go instances.

I too have noticed some occasional short delays, but I assumed they were because of the client closing connections after 30 seconds (#25429). Maybe I'm wrong, but your log line No messages received for 30 seconds -- closing stale connection. is consistent with that assumption.

If you know how to compile Go code, you could try changing SnowflakeTimeout in snowflake.go from 30 to 3600, replacing Browser/TorBrowser/Tor/PluggableTransports/snowflake-client in your Tor Browser, and leaving a note at #25429 to say if it improved the situation.

comment:28 Changed 6 months ago by dcf

Status: assigned → needs_review

So, is the last thing needed for this ticket a patch for server-webrtc?

Here's a stab at it. It seems mostly isomorphic to the proxy-go changes. attachment:0001-Call-explicit-frees-in-server-webrtc.patch

I tested it locally, youtube-dl'ed a video, and let it idle for about an hour; CPU is now 0% and memory is low and stable.

comment:29 in reply to:  28 ; Changed 6 months ago by dcf

Status: needs_review → needs_revision

Replying to dcf:

So, is the last thing needed for this ticket a patch for server-webrtc? Here's a stab at it.

Stand by. The server-webrtc process stopped running for some reason. I'm not sure if it's related to the patch or not. I'll try running it again with a debugger attached.

The last messages in the server-webrtc log:

2018/03/13 03:27:19 sdp offer successfully received.
2018/03/13 03:27:19 Generating answer...
2018/03/13 03:27:19 answering HTTP POST
2018/03/13 03:27:19 OnDataChannel
2018/03/13 03:27:19 OnOpen channel
2018/03/13 03:27:25 Write 543 bytes --> WebRTC

The last messages in the client log:

2018/03/13 03:27:19 WebRTC: DataChannel.OnOpen
2018/03/13 03:27:20 Traffic Bytes (in|out): 1086 | 1086 -- (2 OnMessages, 2 Sends)
2018/03/13 03:27:29 WebRTC: At capacity [3/3]  Retrying in 10 seconds...
2018/03/13 03:27:30 Traffic Bytes (in|out): 0 | 543 -- (0 OnMessages, 1 Sends)
2018/03/13 03:27:39 WebRTC: No messages received for 30 seconds -- closing stale connection.
2018/03/13 03:27:39 WebRTC: closing DataChannel
2018/03/13 03:27:39 WebRTC: DataChannel.OnClose [locally]
2018/03/13 03:27:39 WebRTC: closing PeerConnection
2018/03/13 03:27:39 WebRTC: Closing
2018/03/13 03:27:39 WebRTC: Collecting a new Snowflake. Currently at [2/3]
2018/03/13 03:27:39 snowflake-6UnoOk512PFjOuaU  connecting...
2018/03/13 03:27:39 WebRTC: PeerConnection created.
2018/03/13 03:27:39 WebRTC: OnNegotiationNeeded
2018/03/13 03:27:39 WebRTC: DataChannel created.
2018/03/13 03:27:39 candidate:1749024283 1 udp 2122260223 X.X.X.X 57098 typ host generation 0 ufrag GUHO network-id 1 network-cost 50
2018/03/13 03:27:39 candidate:649855723 1 tcp 1518280447 X.X.X.X 33197 typ host tcptype passive generation 0 ufrag GUHO network-id 1 network-cost 50
2018/03/13 03:27:39 WebRTC: OnIceComplete
2018/03/13 03:27:39 Negotiating via BrokerChannel...
Target URL:   
Front URL:   127.0.0.1:8080
2018/03/13 03:27:39 http://127.0.0.1:8080/client
2018/03/13 03:27:39 "http://127.0.0.1:8080/client"
2018/03/13 03:27:39 BrokerChannel Error: dial tcp 127.0.0.1:8080: getsockopt: connection refused
2018/03/13 03:27:39 Failed to retrieve answer. Retrying in 10 seconds
2018/03/13 03:27:41 Traffic Bytes (in|out): 0 | 543 -- (0 OnMessages, 1 Sends)
2018/03/13 03:27:46 WebRTC: No messages received for 30 seconds -- closing stale connection.
2018/03/13 03:27:46 WebRTC: closing DataChannel
2018/03/13 03:27:46 WebRTC: DataChannel.OnClose [locally]
2018/03/13 03:27:46 WebRTC: closing PeerConnection
2018/03/13 03:27:46 WebRTC: Closing
2018/03/13 03:27:46 copy loop ended
2018/03/13 03:27:46 ---- Handler: closed ---
2018/03/13 03:27:46 SOCKS listening...
2018/03/13 03:27:46 SOCKS accepted:  {0.0.3.0:1   map[]}
2018/03/13 03:27:46 ---- Handler: snowflake assigned ----
2018/03/13 03:27:49 WebRTC: No messages received for 30 seconds -- closing stale connection.
2018/03/13 03:27:49 WebRTC: closing DataChannel
2018/03/13 03:27:49 WebRTC: DataChannel.OnClose [locally]
2018/03/13 03:27:49 WebRTC: closing PeerConnection
2018/03/13 03:27:49 WebRTC: Closing
2018/03/13 03:27:49 copy loop ended
2018/03/13 03:27:49 ---- Handler: closed ---
2018/03/13 03:27:49 SOCKS listening...
2018/03/13 03:27:49 Negotiating via BrokerChannel...
Target URL:   
Front URL:   127.0.0.1:8080
2018/03/13 03:27:49 http://127.0.0.1:8080/client
2018/03/13 03:27:49 "http://127.0.0.1:8080/client"
2018/03/13 03:27:49 BrokerChannel Error: dial tcp 127.0.0.1:8080: getsockopt: connection refused
2018/03/13 03:27:49 Failed to retrieve answer. Retrying in 10 seconds
2018/03/13 03:27:51 Traffic Bytes (in|out): 0 | 190 -- (0 OnMessages, 1 Sends)
2018/03/13 03:27:59 Negotiating via BrokerChannel...
Target URL:   
Front URL:   127.0.0.1:8080
2018/03/13 03:27:59 http://127.0.0.1:8080/client
2018/03/13 03:27:59 "http://127.0.0.1:8080/client"
2018/03/13 03:27:59 BrokerChannel Error: dial tcp 127.0.0.1:8080: getsockopt: connection refused
2018/03/13 03:27:59 Failed to retrieve answer. Retrying in 10 seconds

(Then it repeats with further getsockopt errors.)

comment:30 Changed 6 months ago by cypherpunks

Replying to dcf:

@cypherpunks which change do you mean? The patches in comment:21 (go-webrtc), comment:22 (snowflake-client), and comment:24 (proxy-go) are by arlolra. If you mean comment:25, all I did was recompile with the patch from comment:24 and deploy the public proxy-go instances.

I meant the changes on the server side, i.e. comment:24.

I too have noticed some occasional short delays, but I assumed they were because of the client closing connections after 30 seconds (#25429). Maybe I'm wrong, but your log line No messages received for 30 seconds -- closing stale connection. is consistent with that assumption.

Maybe that change didn't produce a bug by itself but only exacerbated some other bug. Also note that this sometimes happens even when I'm not "pausing" activity, for example with a long download (which ends up failing), and would even result in a "Secure Connection Failed" while an iframe was loading, or a tab endlessly loading.

I also tried to have a download in progress while browsing, and at some point the download failed, as did the sites that were loading. So I think this rules out #25429 being the only bug here. From the logs:

33:44 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
33:45 Traffic Bytes (in|out): 1888447 | 58143 -- (346 OnMessages, 73 Sends)
33:50 Traffic Bytes (in|out): 2297595 | 71457 -- (397 OnMessages, 89 Sends)
33:54 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
34:03 Traffic Bytes (in|out): 941820 | 33542 -- (152 OnMessages, 40 Sends)
34:04 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
34:14 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
34:16 Traffic Bytes (in|out): 0 | 3200 -- (0 OnMessages, 4 Sends)
34:22 WebRTC: No messages received for 30 seconds -- closing stale connection.
34:22 WebRTC: closing DataChannel
34:22 WebRTC: DataChannel.OnClose [locally]
34:22 WebRTC: closing PeerConnection
34:22 WebRTC: Closing
34:22 copy loop ended

If you know how to compile Go code, you could try changing SnowflakeTimeout in snowflake.go from 30 to 3600, replacing Browser/TorBrowser/Tor/PluggableTransports/snowflake-client in your Tor Browser, and leaving a note at #25429 to say if it improved the situation.

I had initially hoped that putting ConnectionPadding 1 in the torrc would turn out to be an easy solution, but given what I said above, #25429 is not the only problem. And I don't think upping the timeout would solve the issue.

comment:31 in reply to:  29 Changed 6 months ago by dcf

Replying to dcf:

Stand by. The server-webrtc process stopped running for some reason. I'm not sure if it's related to the patch or not. I'll try running it again with a debugger attached.

Here's a server-webrtc segfault from gdb.

Thread 6 "server-webrtc" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f3d20f11700 (LWP 14901)]
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000055cc34c659c8 in CGO_Channel_Send (channel=0x7f3cf401b2c0, data=0xc42019a000, size=<optimized out>, 
    binary=<optimized out>) at /home/david/go/src/github.com/keroserene/go-webrtc/datachannel.cc:19
#2  0x000055cc34aa7570 in runtime.asmcgocall () at /usr/lib/go-1.7/src/runtime/asm_amd64.s:590
#3  0x000000c420022cd0 in ?? ()
#4  0x000055cc34a5114d in runtime.cgocall (fn=0x0, arg=0xc4200c81a0, ~r2=552668856)
    at /usr/lib/go-1.7/src/runtime/cgocall.go:115
#5  0x000000c42001b500 in ?? ()
#6  0x0000000000000000 in ?? ()

It's happening in the call to c.dc.Send(b) inside webRTCConn.Write. (I added a logging statement right after the Send that didn't get called.)

comment:32 Changed 6 months ago by arlolra

Here's a server-webrtc segfault from gdb.

From the timeline in comment:29, maybe it's a race between the datachannel close on the server and reading from the OR (I'm assuming the logs never make it because they're async and the segfault prevents them from being written).

You can try something like,

--- a/server-webrtc/snowflake.go
+++ b/server-webrtc/snowflake.go
@@ -52,7 +52,9 @@ func (c *webRTCConn) Read(b []byte) (int, error) {
 func (c *webRTCConn) Write(b []byte) (int, error) {
        // log.Printf("webrtc Write %d %+q", len(b), string(b))
        log.Printf("Write %d bytes --> WebRTC", len(b))
-       c.dc.Send(b)
+       if c.dc != nil {
+               c.dc.Send(b)
+       }
        return len(b), nil
 }
 
@@ -127,6 +129,7 @@ func makePeerConnectionFromOffer(sdp *webrtc.SessionDescription, config *webrtc.
                dc.OnClose = func() {
                        log.Println("OnClose channel")
                        pw.Close()
+                       conn.dc = nil
                        pc.DeleteDataChannel(dc)
                }
                dc.OnMessage = func(msg []byte) {

The client has something similar: if the transport isn't available to be written to, it buffers the data.
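
Roughly, that client-side buffering looks like the sketch below (it is what produces the "Buffered ... bytes --> WebRTC" and "Flushed ... bytes." lines in the logs above); the struct and method names are inferred for illustration, not the actual client code.

package client // hypothetical package name, for illustration only

import (
    "bytes"
    "sync"

    "github.com/keroserene/go-webrtc"
)

type WebRTCPeer struct {
    lock      sync.Mutex
    transport *webrtc.DataChannel
    buf       bytes.Buffer // holds writes made while no DataChannel is open
}

func (c *WebRTCPeer) Write(b []byte) (int, error) {
    c.lock.Lock()
    defer c.lock.Unlock()
    if c.transport == nil {
        // No transport right now: buffer rather than dereference nil.
        c.buf.Write(b)
        return len(b), nil
    }
    c.transport.Send(b)
    return len(b), nil
}

// flush is called from the DataChannel's OnOpen handler to drain anything
// buffered while the channel was down.
func (c *WebRTCPeer) flush(dc *webrtc.DataChannel) {
    c.lock.Lock()
    defer c.lock.Unlock()
    c.transport = dc
    if c.buf.Len() > 0 {
        c.transport.Send(c.buf.Bytes())
        c.buf.Reset()
    }
}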

comment:33 in reply to:  32 ; Changed 6 months ago by dcf

Status: needs_revision → needs_review

Replying to arlolra:

Here's a server-webrtc segfault from gdb.

From the timeline in comment:29, maybe it's a race between the datachannel close on the server and reading from the OR (I'm assuming the logs never make it because they're async and the segfaults prevents them from being written).

You can try something like,

Good call. I tried your idea and added some debugging code to show the addresses of the data channel objects; it was definitely crashing while trying to dereference a NULL pointer before (notice 0x0 in the output).

2018/03/14 01:00:03 sdp offer successfully received.
2018/03/14 01:00:03 Generating answer...
2018/03/14 01:00:03 answering HTTP POST
2018/03/14 01:00:03 OnDataChannel
2018/03/14 01:00:03 OnOpen channel
2018/03/14 01:00:12 OnMessage <--- 543 bytes
2018/03/14 01:00:12 Write 543 bytes 0xc42000cd20 --> WebRTC
2018/03/14 01:00:12 OnMessage <--- 543 bytes
2018/03/14 01:00:13 Write 543 bytes 0xc42000cd20 --> WebRTC
2018/03/14 01:00:13 OnMessage <--- 543 bytes
2018/03/14 01:00:13 Write 543 bytes 0xc42000cd20 --> WebRTC
2018/03/14 01:00:14 Write 543 bytes 0x0 --> WebRTC
2018/03/14 01:00:22 OnClose channel 0xc42000cfa0
2018/03/14 01:00:23 OnMessage <--- 543 bytes
2018/03/14 01:00:23 Write 543 bytes 0xc42000cd20 --> WebRTC
2018/03/14 01:00:23 OnMessage <--- 543 bytes
2018/03/14 01:00:24 Write 543 bytes 0xc42000cd20 --> WebRTC
2018/03/14 01:00:24 OnMessage <--- 543 bytes
2018/03/14 01:00:24 Write 543 bytes 0xc42000cd20 --> WebRTC
2018/03/14 01:00:26 Write 543 bytes 0x0 --> WebRTC
2018/03/14 01:00:33 OnClose channel 0xc42000d0e0
2018/03/14 01:00:54 OnClose channel 0xc42000cd20

attachment:0001-Call-explicit-frees-in-server-webrtc.2.patch is a revised patch.

I'm guessing that this only narrows the race window rather than eliminating it completely, because setting a pointer to nil and testing it against nil are probably not atomic. Maybe we'll have to introduce some synchronization.

I suppose that the corresponding patches for client and proxy-go will also have the same flaw? Perhaps this is the cause of the short delays that cypherpunks saw in comment:26: the proxy-go instances are randomly crashing and being restarted. I'll add some logging to them and see if that's the case.

comment:34 in reply to:  33 ; Changed 6 months ago by dcf

Replying to dcf:

I suppose that the corresponding patches for client and proxy-go will also have the same flaw? Perhaps this is the cause of the short delays that cypherpunks saw in comment:26: the proxy-go instances are randomly crashing and being restarted. I'll add some logging to them and see if that's the case.

It does appear to be the case that proxy-go is crashing and being restarted. One of the instances was automatically restarted 6 times since I started it manually 30 minutes ago.

2018/03/14 02:34:27 starting
...
2018/03/14 02:34:53 starting
...
2018/03/14 02:37:54 starting
...
2018/03/14 02:39:54 starting
...
2018/03/14 02:40:55 starting
...
2018/03/14 02:41:55 starting
...
2018/03/14 03:02:09 starting

I'll let it run overnight, and tomorrow I plan to try an analogous patch to the one in comment:32. I've also re-started the memory and file descriptor monitors, in case the memory leak was only being masked by frequent restarts.

comment:35 in reply to:  33 ; Changed 6 months ago by arlolra

I'm guessing that this only narrows the race window, doesn't eliminate it completely, because setting a pointer to nil and testing it against nil are probably not atomic. Maybe we'll have to introduce some synchronization.

Yup, I can bang that out if you'd like.

I suppose that the corresponding patches for client and proxy-go will also have the same flaw?

The client already nils the transport.

Perhaps this is the cause of the short delays that cypherpunks saw in comment:26: the proxy-go instances are randomly crashing and being restarted. I'll add some logging to them and see if that's the case.

Probably, looks like the proxy code has the same pattern.

comment:36 in reply to:  35 ; Changed 6 months ago by dcf

Replying to arlolra:

I'm guessing that this only narrows the race window, doesn't eliminate it completely, because setting a pointer to nil and testing it against nil are probably not atomic. Maybe we'll have to introduce some synchronization.

Yup, I can bang that out if you'd like.

Yes, if you can, I'd appreciate it.

comment:38 Changed 6 months ago by arlolra

Will do.

Attached is a patch to try. I've been running the client with some success.

Changed 6 months ago by dcf

Attachment: proxy-go-log.20180314.R added

Changed 6 months ago by dcf

Attachment: proxy-go-starting.20180314.log added

grep starting *.log | sed -e 's/\(.*\)-\([0-9]\+h\)\.log:\(.......... ..:..:..\) .*/\1,\2,\3/' > starting.log

Changed 6 months ago by dcf

Attachment: proxy-go-starting.20180314.png added

comment:39 in reply to:  34 ; Changed 6 months ago by dcf

Replying to dcf:

It does appear to be the case that proxy-go is crashing and being restarted. One of the instances was automatically restarted 6 times since I started it manually 30 minutes ago.
I'll let it run overnight, and tomorrow I plan to try an analogous patch to the one in comment:32. I've also re-started the memory and file descriptor monitors, in case the memory leak was only being masked by frequent restarts.

Well, I tried the nil check patch for proxy-go, and it seems that it is still restarting frequently :( So either there's a crash that remains unfixed even after the nil check, or there's something else going on that we don't understand.

  • The first blue line is 44ab82bc61, which just adds a "starting" marker to the log.
  • The second blue line is f2abf5b60c, which tries to protect against the crash with a nil check.
  • The "appengine" section represents the proxies connected to snowflake.appspot.com broker; i.e. the one that users are actually using.
  • The "standalone" section represents the proxies connected to the standalone broker snowflake-broker.bamsoftware.com; i.e. the one nobody is using unless they went out of their way to configure it. You'll notice that these ones are mostly being restarted externally every 1h, 2h, 10h, which is how I configured them and is expected. I don't know what accounts for the restarts between about 2018-03-14 07:30:00 and 2018-03-14 10:00:00.


I'll try connecting a debugger to one of the frequently restarting instances and see if there's anything obvious.

comment:40 in reply to:  39 Changed 6 months ago by dcf

Replying to dcf:

I'll try connecting a debugger to one of the frequently restarting instances and see if there's anything obvious.

It didn't take long to get a crash in the debugger. It's not a segfault; it's a runtime abort inside CGO_DestroyPeer. (Perhaps "Pure virtual method called"? I'm not sure.)

(gdb) c
Continuing.
[New Thread 0x7f4d6cfc9700 (LWP 16471)]
[Thread 0x7f4d6d0ca700 (LWP 16457) exited]

Thread 3 "proxy-go" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f4d76337700 (LWP 16396)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f4d773b53fa in __GI_abort () at abort.c:89
#2  0x00007f4d77ccc0ad in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f4d77cca066 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f4d77cca0b1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f4d77ccab8f in __cxa_pure_virtual () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x0000561ddf051957 in rtc::scoped_refptr<Peer>::scoped_refptr (r=..., this=<optimized out>)
    at ./include/webrtc/base/scoped_ref_ptr.h:82
#7  __gnu_cxx::__ops::_Iter_pred<CGO_DestroyPeer(CGO_Peer)::<lambda(rtc::scoped_refptr<Peer>)> >::operator()<__gnu_cxx::__normal_iterator<rtc::scoped_refptr<Peer>*, std::vector<rtc::scoped_refptr<Peer> > > > (__it=...,
    this=<synthetic pointer>) at /usr/include/c++/6/bits/predefined_ops.h:241
#8  std::__find_if<__gnu_cxx::__normal_iterator<rtc::scoped_refptr<Peer>*, std::vector<rtc::scoped_refptr<Peer> > >, __gnu_cxx::__ops::_Iter_pred<CGO_DestroyPeer(CGO_Peer)::<lambda(rtc::scoped_refptr<Peer>)> > > (__pred=..., __last=
  {ptr_ = 0xffffffff}, __first={ptr_ = 0x7f4d580008c0}) at /usr/include/c++/6/bits/stl_algo.h:148
#9  std::__find_if<__gnu_cxx::__normal_iterator<rtc::scoped_refptr<Peer>*, std::vector<rtc::scoped_refptr<Peer> > >, __gnu_cxx::__ops::_Iter_pred<CGO_DestroyPeer(CGO_Peer)::<lambda(rtc::scoped_refptr<Peer>)> > > (__pred=..., __last=...,
    __first=...) at /usr/include/c++/6/bits/stl_algo.h:162
#10 std::__remove_if<__gnu_cxx::__normal_iterator<rtc::scoped_refptr<Peer>*, std::vector<rtc::scoped_refptr<Peer> > >, __gnu_cxx::__ops::_Iter_pred<CGO_DestroyPeer(CGO_Peer)::<lambda(rtc::scoped_refptr<Peer>)> > > (__pred=..., __last=
  {ptr_ = 0xffffffff}, __first=...) at /usr/include/c++/6/bits/stl_algo.h:860
#11 std::remove_if<__gnu_cxx::__normal_iterator<rtc::scoped_refptr<Peer>*, std::vector<rtc::scoped_refptr<Peer> > >, CGO_DestroyPeer(CGO_Peer)::<lambda(rtc::scoped_refptr<Peer>)> > (__pred=..., __last=..., __first=...)
    at /usr/include/c++/6/bits/stl_algo.h:937
#12 CGO_DestroyPeer (cgoPeer=0x7f4d580008c0)
    at /home/david/go/src/github.com/keroserene/go-webrtc/peerconnection.cc:249
#13 0x0000561ddee9a350 in runtime.asmcgocall () at /usr/lib/go-1.7/src/runtime/asm_amd64.s:590
#14 0x000000c420117de0 in ?? ()
#15 0x0000561ddee44dcd in runtime.cgocall (fn=0x0, arg=0xc420000ea0, ~r2=1983082168)
    at /usr/lib/go-1.7/src/runtime/cgocall.go:115
#16 0x000000c420013500 in ?? ()
#17 0x0000000000000000 in ?? ()

comment:41 Changed 6 months ago by arlolra

Perhaps "Pure virtual method called"? I'm not sure.

Well, CGO_DestroyPeer doesn't appear to be thread-safe; so let's fix that. The client is currently defaulting to 1 concurrent peer, which would explain why it doesn't hit this issue, if that's the cause. The proxies, which are handling multiple peers, each of which voluntarily exits after 30s of inactivity, seem like they could be hitting some contention there, but possibly not at the rate things are failing?

Changed 6 months ago by dcf

Attachment: proxy-go-abort.20180314.log added

gdb and stdout/stderr of a "pure virtual method called" in proxy-go

Changed 6 months ago by arlolra

comment:42 Changed 6 months ago by arlolra

Well, CGO_DestroyPeer doesn't appear to be thread-safe; so let's fix that.

Attached a go-webrtc patch. Can you try running that and seeing if you still get these aborts?

comment:43 in reply to:  41 Changed 6 months ago by dcf

Replying to arlolra:

Perhaps "Pure virtual method called"? I'm not sure.

Well, CGO_DestroyPeer doesn't appear to be thread-safe; so let's fix that. The client is currently defaulting to 1 concurrent peer, which would explain why it doesn't hit this issue, if that's the cause. The proxies, which are handling multiple peers, each of which voluntarily exits after 30s of inactivity, seem like they could be hitting some contention there, but possibly not at the rate things are failing?

I just reproduced the crash on my desktop. I ran gdb --args ./proxy-go -log snowflake.log, then r, and just waited for someone to use it (it took 20 minutes). It is indeed "pure virtual method called". It looks to me like there was only one client connected at the time; so it seems that contention among multiple clients isn't necessary.

The full log is in attachment:proxy-go-abort.20180314.log. It has not only gdb output, but also stdout/stderr and the contents of the log file, interleaved. Potentially of note: the abort happens immediately after the first OnClose channel. The client had been connected for 78 seconds.

Here is the last part of attachment:proxy-go-abort.20180314.log:

2018/03/15 03:53:11 Write 3113 bytes --> WebRTC
2018/03/15 03:53:11 Write 1571 bytes --> WebRTC
2018/03/15 03:53:11 Write 3627 bytes --> WebRTC
2018/03/15 03:53:11 OnMessage <--- 543 bytes
2018/03/15 03:53:14 broker returns: 504
2018/03/15 03:53:24 broker returns: 504
2018/03/15 03:53:39 broker returns: 504
2018/03/15 03:53:42 OnClose channel
INFO: peerconnection.go:503: fired OnIceConnectionStateChange:  1
INFO: peerconnection.go:435: fired OnSignalingStateChange:  1 Closed
[Thread 0x7fffd77fe700 (LWP 10782) exited]
[Thread 0x7fffd7fff700 (LWP 10781) exited]
[Thread 0x7ffff4587700 (LWP 10777) exited]
pure virtual method called
terminate called without an active exception
[Thread 0x7ffff4688700 (LWP 10776) exited]
[Thread 0x7fffedffb700 (LWP 10774) exited]
[Thread 0x7fffed7fa700 (LWP 10775) exited]

Thread 6 "proxy-go" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffef7fe700 (LWP 10491)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff686842a in __GI_abort () at abort.c:89
#2  0x00007ffff717f0ad in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff717d066 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff717d0b1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff717db8f in __cxa_pure_virtual () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00005555558dd957 in rtc::scoped_refptr<Peer>::scoped_refptr (r=..., this=<optimized out>)
    at ./include/webrtc/base/scoped_ref_ptr.h:82
#7  __gnu_cxx::__ops::_Iter_pred<CGO_DestroyPeer(CGO_Peer)::<lambda(rtc::scoped_refptr<Peer>)> >::operator()<__gnu_cxx::__normal_iterator<rtc::scoped_refptr<Peer>*, std::vector<rtc::scoped_refptr<Peer> > > > (__it=...,
    this=<synthetic pointer>) at /usr/include/c++/6/bits/predefined_ops.h:241
#8  std::__find_if<__gnu_cxx::__normal_iterator<rtc::scoped_refptr<Peer>*, std::vector<rtc::scoped_refptr<Peer> > >, __gnu_cxx::__ops::_Iter_pred<CGO_DestroyPeer(CGO_Peer)::<lambda(rtc::scoped_refptr<Peer>)> > > (__pred=..., __last=
  {ptr_ = 0x7fffe0000298}, __first={ptr_ = 0x7fffe0000a20}) at /usr/include/c++/6/bits/stl_algo.h:148
#9  std::__find_if<__gnu_cxx::__normal_iterator<rtc::scoped_refptr<Peer>*, std::vector<rtc::scoped_refptr<Peer> > >, __gnu_cxx::__ops::_Iter_pred<CGO_DestroyPeer(CGO_Peer)::<lambda(rtc::scoped_refptr<Peer>)> > > (__pred=..., __last=...,
    __first=...) at /usr/include/c++/6/bits/stl_algo.h:162
#10 std::__remove_if<__gnu_cxx::__normal_iterator<rtc::scoped_refptr<Peer>*, std::vector<rtc::scoped_refptr<Peer> > >, __gnu_cxx::__ops::_Iter_pred<CGO_DestroyPeer(CGO_Peer)::<lambda(rtc::scoped_refptr<Peer>)> > > (__pred=..., __last=
  {ptr_ = 0x7fffe0000298}, __first=...) at /usr/include/c++/6/bits/stl_algo.h:860
#11 std::remove_if<__gnu_cxx::__normal_iterator<rtc::scoped_refptr<Peer>*, std::vector<rtc::scoped_refptr<Peer> > >, CGO_DestroyPeer(CGO_Peer)::<lambda(rtc::scoped_refptr<Peer>)> > (__pred=..., __last=..., __first=...)
    at /usr/include/c++/6/bits/stl_algo.h:937
#12 CGO_DestroyPeer (cgoPeer=0x7fffe0000a20)
    at /home/david/go/src/github.com/keroserene/go-webrtc/peerconnection.cc:249
#13 0x0000555555726350 in runtime.asmcgocall () at /usr/lib/go-1.7/src/runtime/asm_amd64.s:590
#14 0x000000c420035de0 in ?? ()
#15 0x00005555556d0dcd in runtime.cgocall (fn=0x0, arg=0xc42006c1a0, ~r2=-276832584)
    at /usr/lib/go-1.7/src/runtime/cgocall.go:115
#16 0x000000c42001c000 in ?? ()
#17 0x0000000000000000 in ?? ()

P.S. I think I'll be able to look at attachment:0001-Add-synchronization-around-destroying-dataChannels.patch tomorrow.

comment:44 in reply to:  42 ; Changed 6 months ago by dcf

Replying to arlolra:

Well, CGO_DestroyPeer doesn't appear to be thread-safe; so let's fix that.

Attached a go-webrtc patch. Can you try running that and seeing if you still get these aborts?

Yeah I'll try it.

comment:45 Changed 6 months ago by arlolra

Yeah I'll try it.

Nevermind, don't bother. Was able to reproduce with your steps as well. I'll let you know when I have something concrete. Thanks.

comment:46 in reply to:  44 Changed 6 months ago by dcf

Replying to dcf:

Replying to arlolra:

Well, CGO_DestroyPeer doesn't appear to be thread-safe; so let's fix that.

Attached a go-webrtc patch. Can you try running that and seeing if you still get these aborts?

Yeah I'll try it.

With this patch I got another quick crash on my desktop. Also just after the first OnClose, this time a segmentation fault in CGO_Close. Here is the end of the log:

2018/03/15 04:18:21 OnClose channel
INFO: peerconnection.go:503: fired OnIceConnectionStateChange:  1
INFO: peerconnection.go:435: fired OnSignalingStateChange:  1 Closed
[Thread 0x7fffd7fff700 (LWP 11244) exited]
[Thread 0x7fffecff9700 (LWP 11243) exited]
[Thread 0x7ffff4587700 (LWP 11239) exited]
[Thread 0x7ffff4688700 (LWP 11238) exited]
[Thread 0x7fffed7fa700 (LWP 11237) exited]
[Thread 0x7fffedffb700 (LWP 11236) exited]
2018/03/15 04:18:21 datachannelHandler ends

Thread 7 "proxy-go" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffeeffd700 (LWP 11229)]
CGO_Close (peer=<optimized out>) at /home/david/go/src/github.com/keroserene/go-webrtc/peerconnection.cc:481
481       cPeer->pc_->Close();
(gdb) bt
#0  CGO_Close (peer=<optimized out>) at /home/david/go/src/github.com/keroserene/go-webrtc/peerconnection.cc:481
#1  0x0000555555726390 in runtime.asmcgocall () at /usr/lib/go-1.7/src/runtime/asm_amd64.s:590
#2  0x000000c42016dd70 in ?? ()
#3  0x00005555556d0e0d in runtime.cgocall (fn=0x0, arg=0xc420001860, ~r2=-285225288)
    at /usr/lib/go-1.7/src/runtime/cgocall.go:115
#4  0x000000c4200018a0 in ?? ()
#5  0x0000000000000000 in ?? ()

I verified with go build -x that peerconnection.cc was being recompiled when I built proxy-go.

comment:47 in reply to:  33 Changed 6 months ago by dcf

Replying to dcf:

Replying to arlolra:

Here's a server-webrtc segfault from gdb.

From the timeline in comment:29, maybe it's a race between the datachannel close on the server and reading from the OR (I'm assuming the logs never make it because they're async and the segfaults prevents them from being written).

You can try something like,

Good call. I tried your idea and added some debugging code to show the addresses of the data channel objects; it was definitely crashing while trying to dereference a NULL pointer before (notice 0x0 in the output).

Touching base on server-webrtc, I didn't experience any crashes with it after c834c76fc5 (explicit frees). I downloaded a few hundred MB through it, then let it idle overnight. I was eyeballing the memory usage in top from time to time, checking for a slow leak. Here are the samples I happened to write down:

time mem
2018-03-14 03:15 1.2%
03:38 2.0%
03:48 2.3%
04:01 2.6%
04:17 2.5%
04:34 2.5%
16:05 3.1%
16:56 3.1%

After the server had been allowed to idle for a while, I saw repeated write attempts (roughly one every 2 seconds) to a nil data connection. These would have crashed the server before c834c76fc5. But their repeated nature shows it's not just a transient race condition, but something persistent.

2018/03/14 17:20:43 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:20:45 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:20:47 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:20:50 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:20:52 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:20:54 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:20:58 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:21:00 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:21:00 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:21:03 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:21:05 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:21:06 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:21:11 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:21:12 Write 543 bytes 0x0 --> WebRTC
2018/03/14 17:21:12 first signal interrupt
2018/03/14 17:21:13 second signal interrupt

To get this output I just changed the log line:

log.Printf("Write %d bytes %p --> WebRTC", len(b), c.dc)

comment:48 Changed 6 months ago by arlolra

With this patch I got another quick crash on my desktop. Also just after the first OnClose, this time a segmentation fault in CGO_Close.

So now what's happening is that pc.Destroy() is being called multiple times (3) in the proxy, when the datachannel closes, two of which I understand. I put some guards around that and now my proxy seems to be happily serving connections w/o crash.

comment:49 Changed 6 months ago by arlolra

I put some guards around that and now my proxy seems to be happily serving connections w/o crash.

I mean, in go-webrtc,

--- a/peerconnection.go
+++ b/peerconnection.go
@@ -44,6 +44,7 @@ import "C"
 import (
        "errors"
        "fmt"
+       "sync"
        "unsafe"
 )
 
@@ -124,6 +125,8 @@ type PeerConnection struct {
        localDescription        *SessionDescription
        remoteDescription       *SessionDescription
        canTrickleIceCandidates bool
+       destroyed               bool
+       destroyLock             *sync.Mutex
 
        // Event handlers
        OnNegotiationNeeded        func()
@@ -152,6 +155,7 @@ func NewPeerConnection(config *Configuration) (*PeerConnection, error) {
                return nil, errors.New("PeerConnection requires a Configuration.")
        }
        pc := new(PeerConnection)
+       pc.destroyLock = &sync.Mutex{}
        pc.index = PCMap.Set(pc)
        // Internal CGO Peer wraps the native webrtc::PeerConnectionInterface.
        pc.cgoPeer = C.CGO_InitializePeer(C.int(pc.index))
@@ -169,6 +173,12 @@ func NewPeerConnection(config *Configuration) (*PeerConnection, error) {
 }
 
 func (pc *PeerConnection) Destroy() error {
+       pc.destroyLock.Lock()
+       if pc.destroyed {
+               return nil
+       }
+       pc.destroyed = true
+       pc.destroyLock.Unlock()
        err := pc.Close()
        PCMap.Delete(pc.index)
        C.CGO_DestroyPeer(pc.cgoPeer)

comment:50 in reply to:  49 Changed 6 months ago by yawning

Replying to arlolra:

I put some guards around that and now my proxy seems to be happily serving connections w/o crash.

I mean, in go-webrtc,

--- a/peerconnection.go
+++ b/peerconnection.go
@@ -44,6 +44,7 @@ import "C"
 import (
        "errors"
        "fmt"
+       "sync"
        "unsafe"
 )
 
@@ -124,6 +125,8 @@ type PeerConnection struct {
        localDescription        *SessionDescription
        remoteDescription       *SessionDescription
        canTrickleIceCandidates bool
+       destroyed               bool
+       destroyLock             *sync.Mutex

You don't need to use a pointer; the zero value of a sync.Mutex is initialized and available for use.
If this is the only mutex in the struct, it would be more idiomatic to do something like:

type PeerConn struct {
  sync.Mutex

  // Other members here.
}

foo := new(PeerConn)
foo.Lock()
// Critical section.
foo.Unlock()
        // Event handlers
        OnNegotiationNeeded        func()
@@ -152,6 +155,7 @@ func NewPeerConnection(config *Configuration) (*PeerConnection, error) {
                return nil, errors.New("PeerConnection requires a Configuration.")
        }
        pc := new(PeerConnection)
+       pc.destroyLock = &sync.Mutex{}

Remove, assuming you change the pointer silliness.

        pc.index = PCMap.Set(pc)
        // Internal CGO Peer wraps the native webrtc::PeerConnectionInterface.
        pc.cgoPeer = C.CGO_InitializePeer(C.int(pc.index))
@@ -169,6 +173,12 @@ func NewPeerConnection(config *Configuration) (*PeerConnection, error) {
 }
 
 func (pc *PeerConnection) Destroy() error {
+       pc.destroyLock.Lock()
+       if pc.destroyed {
+               return nil
+       }
+       pc.destroyed = true
+       pc.destroyLock.Unlock()

If you were having problems with multiple calls to Destroy(), changing the behavior from a nil pointer deref to a deadlock is not what I would define as "happy".

  pc.destroyLock.Lock()
  defer pc.destroyLock.Unlock()
  if pc.destroyed {
    return nil
  }
  pc.destroyed = true

  // Blah blah blah, unlock handled by the defered call.

Alternatively, if you don't want to use defer, then you need to unlock before returning in the branch.

comment:51 Changed 6 months ago by yawning

There is also sync.Once.

type Foo struct {
   destroyOnce sync.Once
}

func (f *Foo) Destroy() {
  f.destroyOnce.Do(func() {
    // Things that MUST only happen once, here.
  })
}

comment:52 Changed 6 months ago by arlolra

Thanks Yawning, that was instructive.

If you were having problems with multiple calls to Destroy(), changing the behavior from a nil pointer deref to a deadlock is not what I would define as "happy".

I only pasted the code to show what I meant. Unfortunately, I had to retype it since I'd already reset the changes and, in my haste, made that omission. In other words, it wasn't what I was running.

Anyways, I need to attach a debugger and figure out why Destroy() is being called that third time.
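
(A lower-tech alternative to a debugger, as a sketch: log a stack trace from inside Destroy(), or around its call sites in proxy-go, and the callers show up directly in the log. The helper name below is made up; dropping the log line straight into Destroy() works just as well.)

package webrtc // i.e. temporarily added inside go-webrtc for debugging

import (
    "log"
    "runtime/debug"
)

// logDestroyCaller records which code path invoked Destroy(); call it at the
// top of (*PeerConnection).Destroy(), or wrap it around each call site in
// proxy-go.
func logDestroyCaller() {
    log.Printf("Destroy() called from:\n%s", debug.Stack())
}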

comment:53 in reply to:  52 Changed 6 months ago by yawning

Replying to arlolra:

Anyways, I need to attach a debugger and figure out why Destroy() is being called that third time.

Manual inspection of proxy-go suggests that Destroy() will be called 3 times.

  • dataChannelHandler(), line 205
  • CopyLoopTimeout()
    • As dst, line 191.
    • As src, line 192.

comment:54 Changed 6 months ago by arlolra

Manual inspection of proxy-go suggests that Destroy() will be called 3 times.

Indeed, thanks again Yawning!

I've added two updated patches which I guess are now the things to review and test here. They supersede anything I posted above that isn't already merged.

comment:55 in reply to:  54 ; Changed 6 months ago by dcf

Replying to arlolra:

I've added two updated patches which I guess are now the things to review and test here. They supersede anything I posted above that isn't already merged.

Thank you so much.

I've started some proxy-go instances with the patches. The one on my desktop has seen a few opens and closes so far, and is still running. I'll make more graphs of restarts tomorrow, but so far it's looking good, one hour without any abnormal restarts.

attachment:0001-Add-synchronization-to-manipulating-vectors.patch looks ok.

About attachment:0001-Add-synchronization-around-destroying-dataChannels-a.patch:

  • WebRTCPeer uses the names lock and once, but webRTCConn uses the names lk and do. Is there a reason for the difference? I think I prefer the lock and once names.
  • I tried to work out what exact members the synchronization covers. I think it is:
    • WebRTCPeer.lock covers transport (a DataChannel).
    • WebRTCPeer.once covers pc (only destroying / setting to nil).
    • webRTCConn.lk covers dc (a DataChannel).
    • webRTCConn.do covers pc (only destroying).
    Do I have it right? I would like there to be comments by each of the declarations, stating what their responsibilities are; something like the sketch after this list.
  • Small typo: "since we'll otherwise we'll end up".
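
For example, something along these lines (a sketch only; it uses my preferred lock/once naming for both structs, and the comments are my reading of the patch, to be corrected if wrong):

  type WebRTCPeer struct {
      // ...
      lock sync.Mutex // protects transport (the DataChannel)
      once sync.Once  // ensures pc is destroyed / set to nil only once
  }

  type webRTCConn struct {
      // ...
      lock sync.Mutex // protects dc (the DataChannel)
      once sync.Once  // ensures pc is destroyed only once
  }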

comment:56 in reply to:  55 Changed 6 months ago by yawning

Replying to dcf:

About attachment:0001-Add-synchronization-around-destroying-dataChannels-a.patch:

To add to dcf's comments: there are a few places where you panic() with the lock held, without releasing it (e.g. the first change in establishDataChannel). If a panic brings down the whole program (a la an assert), this might be ok, but it feels sloppy to me.
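
If the panics stay, one option is to take the lock with a deferred unlock, which also runs during panic unwinding (a sketch with a hypothetical condition, not the actual code):

  c.lock.Lock()
  defer c.lock.Unlock() // released even if a later statement panics
  if somethingImpossible {
      panic("establishDataChannel: broken invariant")
  }
  // ... rest of the critical section ...

Otherwise, release the lock explicitly right before each panic().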

Changed 6 months ago by dcf

Attachment: proxy-go-log.20180316.zip added

Changed 6 months ago by dcf

Attachment: proxy-go-fd.20180316.png added

Changed 6 months ago by dcf

Attachment: proxy-go-mem.20180316.png added

Changed 6 months ago by dcf

comment:57 in reply to:  55 ; Changed 6 months ago by dcf

Replying to dcf:

I've started some proxy-go instances with the patches. The one on my desktop has seen a few opens and closes so far, and is still running. I'll make more graphs of restarts tomorrow, but so far it's looking good, one hour without any abnormal restarts.

  • The first blue line is 44ab82bc61, which just adds a "starting" marker to the log.
  • The second blue line is f2abf5b60c, which tries to protect against the crash with a nil check.
  • The third blue line is comment:54, which adds synchronization around DataChannels and PeerConnections.

The patches from comment:54 have stopped the frequent crashes. Now the process is restarting according to schedule.


Here are the fd and memory graphs. Now we see memory slowly increasing over time--hard to say at this point whether it's just normal elevated usage or a leak. I'll leave the monitor running.



comment:58 in reply to:  55 ; Changed 6 months ago by arlolra

Attached is a patch that hopefully addresses the review in 55 and 56.

Now we see memory slowly increasing over time--hard to say at this point whether it's just normal elevated usage or a leak. I'll leave the monitor running.

Any update on that?

Changed 6 months ago by dcf

Attachment: proxy-go-log.20180320.zip added

Changed 6 months ago by dcf

Attachment: proxy-go-fd.20180320.png added

Changed 6 months ago by dcf

Attachment: proxy-go-mem.20180320.png added

Changed 6 months ago by dcf

comment:59 in reply to:  57 ; Changed 6 months ago by dcf

Replying to dcf:

The patches from comment:54 have stopped the frequent crashes. Now the process is restarting according to schedule.
Here are the fd and memory graphs. Now we see memory slowly increasing over time--hard to say at this point whether it's just normal elevated usage or a leak. I'll leave the monitor running.

Updates. Restarts are stable; memory and file descriptors seem fine.

I'm thinking about losing the periodic restarts, and just putting a memory ulimit on the processes (just in case).




comment:60 in reply to:  58 Changed 6 months ago by dcf

Status: needs_review → merge_ready

Replying to arlolra:

Attached is a patch that hopefully addresses the review in 55 and 56.

Looks good to me. Thanks.

comment:62 Changed 6 months ago by dcf

Summary: snowflake-client is pegged at 100% cpu → Memory and file descriptor leaks in programs that use go-webrtc

comment:63 in reply to:  59 ; Changed 6 months ago by dcf

Replying to dcf:

I'm thinking about losing the periodic restarts, and just putting a memory ulimit on the processes (just in case).

I've done this now, removing the periodic restarts. There are still the same number of proxies, 3 for the App Engine broker and 3 for the standalone broker.

I first tried limiting the memory using the chpst program that we're already using to change the uid:

exec chpst -u snowflake-proxy -m 419430400 proxy-go

However, chpst does not work, seemingly due to some bad interaction with Cgo. No matter how much memory I give it, it aborts immediately with this error:

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7f0fe207afcf m=0

goroutine 0 [idle]:

goroutine 1 [running]:
runtime.systemstack_switch()
        /usr/lib/go-1.7/src/runtime/asm_amd64.s:252 fp=0xc42001e768 sp=0xc42001e760
runtime.main()
        /usr/lib/go-1.7/src/runtime/proc.go:127 +0x6c fp=0xc42001e7c0 sp=0xc42001e768
runtime.goexit()
        /usr/lib/go-1.7/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42001e7c8 sp=0xc42001e7c0

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
        /usr/lib/go-1.7/src/runtime/asm_amd64.s:2086 +0x1

So instead, I'm just using ulimit like this:

ulimit -v 409600
exec chpst -u snowflake-proxy proxy-go

I originally gave each process 200 MB. But they were restarting (on average once during the 15 minutes I was watching). So I then bumped it up to 400 MB. Here are the timestamps:

2018-03-22 03:03:45 ulimit to 200 MB
2018-03-22 03:19:19 ulimit to 400 MB

comment:64 in reply to:  63 ; Changed 6 months ago by dcf

Replying to dcf:

So instead, I'm just using ulimit like this:

ulimit -v 409600
exec chpst -u snowflake-proxy proxy-go

I originally gave each process 200 MB. But they were restarting (on average once during the 15 minutes I was watching). So I then bumped it up to 400 MB. Here are the timestamps:

An update on this. ulimit -v (limit virtual memory) didn't work: it would still cause a hang after a while even with 800 MB. (5/6 processes had hung within about 10 hours.) I changed it to ulimit -m (limit resident memory) with 200 MB and now it seems fine.

top showed that with ulimit -v, the processes had all hung, rather than crashed, when their virtual memory (VIRT column) was close to its limit.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12864 snowfla+  20   0 1241148  56840  12464 S  11.0  2.8  14:24.84 proxy-go  # no limit
12847 snowfla+  20   0  810328  24952  13408 S   0.7  1.2   1:14.24 proxy-go  # limit 819200
12848 snowfla+  20   0  406892  21516  13376 S   0.0  1.0   0:11.57 proxy-go  # limit 409600
12849 snowfla+  20   0  398868  21208  13244 S   0.0  1.0   0:09.94 proxy-go  # limit 409600
13340 snowfla+  20   0  398864  21036  13228 S   0.0  1.0   0:24.22 proxy-go  # limit 409600
13978 snowfla+  20   0  408144  20784  13088 S   0.0  1.0   0:42.28 proxy-go  # limit 409600

comment:65 in reply to:  64 Changed 6 months ago by yawning

Replying to dcf:


An update on this. ulimit -v (limit virtual memory) didn't work: it would still cause a hang after a while even with 800 MB. (5/6 processes had hung within about 10 hours.) I changed it to ulimit -m (limit resident memory) with 200 MB and now it seems fine.


Bad things may (probably will) happen regardless. See https://github.com/golang/go/issues/5049 (Superseded by https://github.com/golang/go/issues/16843).

The runtime and runtime/debug packages have methods for interacting with the garbage collector/memory allocator that may work better.
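
For example, something like this is what I mean (a sketch, untested with proxy-go; note it only affects the Go heap, not memory allocated by the native WebRTC library):

import (
    "runtime/debug"
    "time"
)

func init() {
    // Run the garbage collector more aggressively than the default GOGC=100,
    // trading some CPU for a smaller heap.
    debug.SetGCPercent(20)

    // Periodically return freed memory to the OS rather than waiting for the
    // runtime to do it on its own schedule.
    go func() {
        for range time.Tick(5 * time.Minute) {
            debug.FreeOSMemory()
        }
    }()
}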
