Opened 10 months ago

Last modified 8 weeks ago

#21312 assigned defect

snowflake-client is pegged at 100% cpu

Reported by: arlolra
Owned by: arlolra
Priority: High
Milestone:
Component: Obfuscation/Snowflake
Version:
Severity: Major
Keywords:
Cc: arma, dcf
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

1:30:49 AM - arma5: 2017/01/24 18:57:44 candidate:3131354255 1 tcp 1518280447 192.168.1.154 36729 typ host tcptype passive generation 0 ufrag YXKXffPBBnA3SJ3K network-id 1
1:30:52 AM - arma5: this is the last line in its log
1:31:20 AM - arma5: and snowflake-client is pegged at 100% cpu
1:32:39 AM - arma5: %  strace -p3074
1:32:39 AM - arma5: Process 3074 attached
1:32:39 AM - arma5: futex(0x1059710, FUTEX_WAIT, 0, NULL

Child Tickets

Attachments (4)

20170809-info-threads.txt (71.0 KB) - added by dcf 4 months ago.
20170809-thread-apply-all-bt.txt (240.1 KB) - added by dcf 4 months ago.
snowflake-client-fds-20170814.png (10.8 KB) - added by dcf 3 months ago.
Graph of number of file descriptors used by snowflake-client.
20170816-thread-apply-all-bt.txt (55.8 KB) - added by dcf 3 months ago.


Change History (24)

comment:1 Changed 10 months ago by arlolra

1:39:57 AM - arma5: or does strace try to watch all threads
1:40:43 AM - arma5: ok with -f i got a bunch of output
1:42:49 AM - arma5: a bunch of futex, accept, getsockopt, select
1:42:54 AM - arma5: with a bunch of EAGAINs
1:43:03 AM - arma5: could be normal, could be abnormal, hard to say from that
1:45:06 AM - arma5: man that is a lot a lot a lot of threads.
1:45:27 AM - arma5: 1024 threads
1:45:32 AM - arma5: what an intriguing number of threads.

comment:2 Changed 10 months ago by arma

My snowflake-client has 1024 threads open, according to gdb's 'thread apply all bt'.

About half of them are sitting in select:

Thread 559 (Thread 0x7f1178072700 (LWP 5595)):
#0  0x00007f127a105893 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000007b1fd5 in ?? ()
#2  0x0000000000681185 in ?? ()
#3  0x00000000006938e4 in ?? ()
#4  0x000000000069428c in ?? ()    
#5  0x00007f127ac7d064 in start_thread (arg=0x7f1178072700)
    at pthread_create.c:309
#6  0x00007f127a10c62d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

and about half are in pthread_cond_timedwait:

Thread 561 (Thread 0x7f1178a75700 (LWP 5586)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x0000000000696b6a in ?? ()
#2  0x000000000096c9b3 in ?? ()
#3  0x000000000096cb29 in ?? ()
#4  0x00000000006999e5 in ?? ()
#5  0x0000000000699a19 in ?? ()
#6  0x00007f127ac7d064 in start_thread (arg=0x7f1178a75700)
    at pthread_create.c:309
#7  0x00007f127a10c62d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

but anyway I don't have many symbols so hey.

What makes it keep creating new threads? It's probably good that something in my system stopped it at 1024. :)

comment:3 Changed 10 months ago by arma

It's also using a good chunk of virtual memory:

arma      3074 42.5  6.9 6854568 267736 pts/5  Sl   15:54 129:25 ./TorBrowser/Tor/PluggableTransports/snowflake-client -url https://snowflake-reg.appspot.com/ -front www.google.com -ice stun:stun.l.google.com:19302

(I assume that all the threads share most of their memory so it's not so bad in reality.)

Last edited 7 months ago by arma

comment:4 Changed 10 months ago by arma

arlo's first thought is "libwebrtc"

yawning points out that cgo has pathological cases with blocking cgo calls, which "lead to the go runtime creating a shit load of OS threads"

comment:5 Changed 10 months ago by arma

Oh, and to be clear, it's not *just* pegged at 100% cpu. It's also not doing anything useful. :)

comment:6 Changed 10 months ago by yawning

Cc: yawning added

To clarify, cgo calls are dispatched in their own OS thread, and the massive VM use is more than likely multi-megabyte stacks being allocated per thread. Go's lightweight concurrency does not apply with cgo in the picture, especially if "blocking in cgo calls" is part of the equation. GOMAXPROCS will not save you because the parameter does not apply to cgo calls.
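
As an illustration of that failure mode, here is a minimal, self-contained sketch (not from the ticket; the C function is a stand-in for a blocking webrtc call) showing that each concurrently blocked cgo call occupies its own OS thread, no matter what GOMAXPROCS is:

package main

/*
#include <unistd.h>
static void blockInC(void) { sleep(30); }
*/
import "C"

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // GOMAXPROCS does not cap threads blocked in cgo
	for i := 0; i < 100; i++ {
		go func() {
			C.blockInC() // each concurrently blocked cgo call pins one OS thread
		}()
	}
	time.Sleep(2 * time.Second)
	// The threadcreate profile counts OS threads the runtime has created.
	fmt.Println("threads created:", pprof.Lookup("threadcreate").Count())
}

With 100 goroutines blocked in C, the reported count lands around one thread per blocked call, not anywhere near the GOMAXPROCS value of 1.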

The easy way to confirm that this is what is happening is probably to set GODEBUG to something appropriate and look at the debug spew on stderr (see https://golang.org/pkg/runtime/; you probably want scheddetail and schedtrace).
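
For example, an illustrative invocation (the flag values here are my suggestion, not from the ticket):

GODEBUG=schedtrace=1000,scheddetail=1 ./snowflake-client ... 2>sched.log

schedtrace=1000 makes the runtime print a scheduler summary to stderr every 1000 ms; scheddetail=1 expands it to per-thread (M), per-processor (P), and per-goroutine (G) state, so a runaway OS thread count would show up directly in the M lines.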

Instrumenting the code to use runtime/pprof's threadcreate profile would give stack traces at thread creation time, if that's useful.
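
A minimal sketch of that instrumentation (hypothetical: the signal choice, function name, and output file are mine, not part of snowflake-client):

package main

import (
	"log"
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// dumpThreadProfileOnSignal writes the threadcreate profile (stack traces
// recorded at thread-creation time) to a file whenever the process gets
// SIGUSR1, so a wedged client can be inspected without restarting it.
func dumpThreadProfileOnSignal() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			f, err := os.Create("threadcreate.prof")
			if err != nil {
				log.Print(err)
				continue
			}
			// debug=1 produces a human-readable dump, one stack per
			// thread-creation site, with counts.
			if err := pprof.Lookup("threadcreate").WriteTo(f, 1); err != nil {
				log.Print(err)
			}
			f.Close()
		}
	}()
}

func main() {
	dumpThreadProfileOnSignal()
	select {} // stand-in for the client's real work
}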

Changed 4 months ago by dcf

Attachment: 20170809-info-threads.txt added

Changed 4 months ago by dcf

Attachment: 20170809-thread-apply-all-bt.txt added

comment:9 Changed 4 months ago by dcf

I encountered this lockup for the first time today. snowflake-client was using 200% CPU. I had been browsing and my requests started timing out. Here is the end of snowflake.log; it must have been around 17:00:00 when I noticed the hang.

2017/08/09 16:58:43 ---- Handler: snowflake assigned ----
2017/08/09 16:58:43 Buffered 201 bytes --> WebRTC
2017/08/09 16:58:43 WebRTC: DataChannel.OnOpen
2017/08/09 16:58:43 Flushed 201 bytes.
2017/08/09 16:58:45 Traffic Bytes (in|out): 743 | 201 -- (1 OnMessages, 1 Sends)
2017/08/09 16:58:51 Traffic Bytes (in|out): 51 | 166 -- (1 OnMessages, 2 Sends)
2017/08/09 16:58:53 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
2017/08/09 16:59:03 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
2017/08/09 16:59:13 WebRTC: At capacity [1/1]  Retrying in 10 seconds...
2017/08/09 16:59:17 WebRTC: No messages received for 30 seconds -- closing stale connection.
2017/08/09 16:59:17 WebRTC: closing DataChannel
2017/08/09 16:59:23 WebRTC: Collecting a new Snowflake. Currently at [0/1]
2017/08/09 16:59:23 snowflake-SeQZtvutAUlE6pFr  connecting...
2017/08/09 17:03:45 copy loop ended
2017/08/09 17:03:45 ---- Handler: closed ---
2017/08/09 17:03:45 SOCKS listening...
2017/08/09 17:03:46 SOCKS accepted:  {0.0.3.0:1  map[]}

The process had 521 threads, not 1024.

comment:10 Changed 4 months ago by yawning

I mean, the bindings still appear to leak OS threads like a sieve, and still use cgo, so it's not like anything has changed since arma encountered this.

Changed 3 months ago by dcf

Attachment: snowflake-client-fds-20170814.png added

Graph of number of file descriptors used by snowflake-client.

comment:11 Changed 3 months ago by dcf

Here's a graph of the number of file descriptors held open by the snowflake-client process while I was using it today. The first plateau at 21:30 is when it stopped working and I turned off snowflake in the browser. snowflake-client continued running, got out of whatever rut it was in, and consumed a few more descriptors before I finally closed the browser.

[Image attachment snowflake-client-fds-20170814.png: graph of number of file descriptors used by snowflake-client.]
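
For reference, a graph like this can be reproduced by polling /proc (a rough Linux-only monitoring sketch; the interval and output format are my choices):

package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

// countEntries counts the entries in a /proc subdirectory, e.g.
// /proc/<pid>/fd (open descriptors) or /proc/<pid>/task (threads).
func countEntries(dir string) int {
	entries, err := os.ReadDir(dir)
	if err != nil {
		log.Fatal(err)
	}
	return len(entries)
}

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <pid>", os.Args[0])
	}
	pid := os.Args[1]
	for {
		fmt.Printf("%s fds=%d threads=%d\n",
			time.Now().Format("15:04:05"),
			countEntries("/proc/"+pid+"/fd"),
			countEntries("/proc/"+pid+"/task"))
		time.Sleep(10 * time.Second)
	}
}

Counting /proc/<pid>/task the same way also reproduces the thread counts quoted earlier in the ticket.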

The descriptors come in repeating groups of six, like the following. I suspect there's one group for each attempt to contact the broker (i.e. each ---- Handler: snowflake assigned ----).

snowflake 28366 david  217w     FIFO       0,10      0t0 1150258477 pipe
snowflake 28366 david  218r     FIFO       0,10      0t0 1150258478 pipe
snowflake 28366 david  219w     FIFO       0,10      0t0 1150258478 pipe
snowflake 28366 david  220u     IPv4 1150257563      0t0        UDP xx.Berkeley.EDU:48700 
snowflake 28366 david  221u     IPv4 1150258488      0t0        TCP xx.Berkeley.EDU:46037 (LISTEN)
snowflake 28366 david  222r     FIFO       0,10      0t0 1150262588 pipe
snowflake 28366 david  223w     FIFO       0,10      0t0 1150262588 pipe
snowflake 28366 david  224r     FIFO       0,10      0t0 1150262589 pipe
snowflake 28366 david  225w     FIFO       0,10      0t0 1150262589 pipe
snowflake 28366 david  226u     IPv4 1150260856      0t0        UDP xx.Berkeley.EDU:49943 
snowflake 28366 david  227u     IPv4 1150262068      0t0        TCP xx.Berkeley.EDU:52465 (LISTEN)

During the plateau at 21:30, there were a lot of Buffered messages in the log:

2017/08/14 14:24:34 WebRTC: DataChannel.OnOpen
2017/08/14 14:24:34 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:36 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:38 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:39 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:40 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:42 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:44 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:46 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:48 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:50 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:52 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:54 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:56 Buffered 1057 bytes --> WebRTC
2017/08/14 14:24:58 Buffered 1057 bytes --> WebRTC
2017/08/14 14:25:00 Buffered 1057 bytes --> WebRTC
2017/08/14 14:25:01 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:02 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:03 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:03 WebRTC: No messages received for 30 seconds -- closing stale connection.
2017/08/14 14:25:03 WebRTC: closing DataChannel
2017/08/14 14:25:03 WebRTC: DataChannel.OnClose [locally]
2017/08/14 14:25:03 WebRTC: closing PeerConnection
2017/08/14 14:25:03 WebRTC: Closing
2017/08/14 14:25:04 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:05 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:06 Buffered 543 bytes --> WebRTC
2017/08/14 14:25:07 Buffered 543 bytes --> WebRTC

Changed 3 months ago by dcf

Attachment: 20170816-thread-apply-all-bt.txt added

comment:13 Changed 3 months ago by dcf

Thanks yawning.

Here are backtraces from another case from today, this time with debugging symbols. In this case there were 89 threads and 168 fds.

attachment:20170816-thread-apply-all-bt.txt

Most of the backtraces fall into one of two patterns.

  • 27 threads
    #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
    #1  0x000000000063c9de in rtc::Event::Wait(int) ()
    #2  0x000000000088833a in webrtc::EventWrapperImpl::Wait(unsigned long) ()
    #3  0x000000000087ac0d in webrtc::ProcessThreadImpl::Process() ()
    #4  0x000000000063e837 in rtc::PlatformThread::Run() ()
    #5  0x000000000063e786 in rtc::PlatformThread::StartThread(void*) ()
    #6  0x00007fcf4b70a494 in start_thread (arg=0x7fcebc5f6700) at pthread_create.c:333
    #7  0x00007fcf4aba2a8f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
    
  • 50 threads
    #0  0x00007fcf4ab9b383 in select () at ../sysdeps/unix/syscall-template.S:84
    #1  0x0000000000688e48 in rtc::PhysicalSocketServer::Wait(int, bool) ()
    #2  0x000000000063b3b1 in rtc::MessageQueue::Get(rtc::Message*, int, bool) ()
    #3  0x0000000000639a72 in rtc::Thread::Run() ()
    #4  0x0000000000639783 in rtc::Thread::PreRun(void*) ()
    #5  0x00007fcf4b70a494 in start_thread (arg=0x7fcead7ea700) at pthread_create.c:333
    #6  0x00007fcf4aba2a8f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
    

comment:14 in reply to:  13 Changed 3 months ago by yawning

Replying to dcf:

Here are backtraces from another case from today, this time with debugging symbols. In this case there were 89 threads and 168 fds.

Not immediately attributable to the leak I pointed out 7 MONTHS AGO: threads 55, 12, 11, 10, 6, 5, 4, 3, 2, 1 (thread numbers in the attached backtrace)

The Peer object leaks webrtc Thread objects, and said Thread objects also happen to call pipe() as part of the creation process. This is no longer news. While I'm not going to rule out there being other things that are wrong, fixing the bindings is probably a good idea.

Last edited 3 months ago by yawning

comment:15 Changed 3 months ago by arlolra

For what it's worth (and probably not much), rtc::scoped_refptr<PeerConnectionFactoryInterface> pc_factory; should be automatically deallocated when the Peer goes out of scope, so it isn't exactly leaking. The culprit is vector<rtc::scoped_refptr<Peer>> localPeers; which never releases the Peers. See https://github.com/keroserene/go-webrtc/issues/35#issuecomment-233465318
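
To make the shape of the bug concrete, here is the same pattern sketched in Go (an analogy only; the actual bindings are C++): a package-level container that only ever grows keeps every element reachable, and removal on close is the missing fix.

package main

// Peer stands in for the bindings' Peer, which owns threads, sockets,
// and pipes in the real code.
type Peer struct{}

// localPeers mirrors vector<rtc::scoped_refptr<Peer>> localPeers: it is
// appended to on every creation and never pruned, so no Peer's refcount
// ever drops to zero.
var localPeers []*Peer

func newPeer() *Peer {
	p := &Peer{}
	localPeers = append(localPeers, p)
	return p
}

// removePeer is the missing piece: dropping the container's reference on
// close is what lets the Peer (and the resources it owns) be released.
func removePeer(p *Peer) {
	for i, q := range localPeers {
		if q == p {
			last := len(localPeers) - 1
			localPeers[i] = localPeers[last]
			localPeers[last] = nil // don't leave a stale reference behind
			localPeers = localPeers[:last]
			return
		}
	}
}

func main() {
	p := newPeer()
	removePeer(p) // without this call, p stays reachable forever
}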

comment:16 in reply to:  15 Changed 3 months ago by yawning

Replying to arlolra:

For what it's worth (and probably not much), rtc::scoped_refptr<PeerConnectionFactoryInterface> pc_factory; should be automatically deallocated when the Peer goes out of scope, so it isn't exactly leaking. The culprit is vector<rtc::scoped_refptr<Peer>> localPeers; which never releases the Peers. See https://github.com/keroserene/go-webrtc/issues/35#issuecomment-233465318

I guess?

The Peer dtor needs to call delete on the created Threads, because the PeerConnectionFactory's dtor will not if you explicitly provide one. It's not immediately obvious to me what handles tearing down the SocketServer either, but there's a limit to how much I want to look at the bindings or the webrtc code.

I can't tell where worker_thread_ in the PeerConnectionFactory is supposed to be terminated either, though the PeerConnectionFactory dtor appears to shut down the signalling thread. At least if the worker_thread_ magically gets terminated somehow (which it may) the code will only be leaking memory and maybe some file descriptors.

But at this point I'm done wasting my time even thinking about this. Good luck.

comment:17 Changed 3 months ago by yawning

Cc: yawning removed

comment:18 Changed 8 weeks ago by cypherpunks

Priority: Medium → High
Severity: Normal → Major

comment:19 Changed 8 weeks ago by dcf

Cc: dcf added

comment:20 Changed 8 weeks ago by arlolra

Owner: set to arlolra
Status: new → assigned

I'll probably be working on this next week at the dev meeting.
