1:30:49 AM - arma5: 2017/01/24 18:57:44 candidate:3131354255 1 tcp 1518280447 192.168.1.154 36729 typ host tcptype passive generation 0 ufrag YXKXffPBBnA3SJ3K network-id 1
1:30:52 AM - arma5: this is the last line in its log
1:31:20 AM - arma5: and snowflake-client is pegged at 100% cpu
1:32:39 AM - arma5: % strace -p3074
1:32:39 AM - arma5: Process 3074 attached
1:32:39 AM - arma5: futex(0x1059710, FUTEX_WAIT, 0, NULL
1:39:57 AM - arma5: or does strace try to watch all threads
1:40:43 AM - arma5: ok with -f i got a bunch of output
1:42:49 AM - arma5: a bunch of futex, accept, getsockopt, select
1:42:54 AM - arma5: with a bunch of EAGAINs
1:43:03 AM - arma5: could be normal, could be abnormal, hard to say from that
1:45:06 AM - arma5: man that is a lot a lot a lot of threads.
1:45:27 AM - arma5: 1024 threads
1:45:32 AM - arma5: what an intriguing number of threads.
My snowflake-client has 1024 threads open, according to gdb's 'thread apply all bt'.
About half of them are sitting in select:
Thread 559 (Thread 0x7f1178072700 (LWP 5595)):
#0  0x00007f127a105893 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000007b1fd5 in ?? ()
#2  0x0000000000681185 in ?? ()
#3  0x00000000006938e4 in ?? ()
#4  0x000000000069428c in ?? ()
#5  0x00007f127ac7d064 in start_thread (arg=0x7f1178072700) at pthread_create.c:309
#6  0x00007f127a10c62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
and about half are in pthread_cond_timedwait:
Thread 561 (Thread 0x7f1178a75700 (LWP 5586)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x0000000000696b6a in ?? ()
#2  0x000000000096c9b3 in ?? ()
#3  0x000000000096cb29 in ?? ()
#4  0x00000000006999e5 in ?? ()
#5  0x0000000000699a19 in ?? ()
#6  0x00007f127ac7d064 in start_thread (arg=0x7f1178a75700) at pthread_create.c:309
#7  0x00007f127a10c62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
but anyway I don't have many symbols so hey.
What thing makes it keep making new threads? It's probably good that something in my system stopped it at 1024. :)
To clarify, cgo calls are dispatched in their own OS thread, and the massive VM use is more than likely multi-megabyte stacks being allocated per thread. Go's lightweight concurrency does not apply with cgo in the picture, especially if "blocking in cgo calls" is part of the equation. GOMAXPROCS will not save you because the parameter does not apply to cgo calls.
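As a quick illustration of the above (a standalone sketch, not snowflake-client code): even with GOMAXPROCS pinned to 1, every goroutine blocked inside a C call gets parked on its own OS thread, so the thread count tracks the number of concurrent blocking cgo calls.

```go
// Sketch: goroutines blocked in cgo each occupy a dedicated OS thread,
// regardless of GOMAXPROCS. Not snowflake code; just a demonstration.
package main

/*
#include <unistd.h>
static void block_in_c(void) { sleep(30); }
*/
import "C"

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // does not bound threads created for cgo calls
	for i := 0; i < 100; i++ {
		go func() { C.block_in_c() }() // each blocked call pins an OS thread
	}
	time.Sleep(2 * time.Second)
	// The threadcreate profile counts every OS thread the runtime has made;
	// expect roughly one per goroutine currently blocked in C.
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
}
```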
The easy way to confirm that this is what is happening is probably to set GODEBUG to something appropriate and look at debug spew on stderr (https://golang.org/pkg/runtime/ probably want scheddetail and schedtrace).
Instrumenting the code to utilize runtime/pprof's threadcreate profile will be able to give stack traces at thread creation time, if that's useful.
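Something like the following (a minimal sketch, assuming a debug hook can be added to the client somewhere) dumps that profile to stderr; debug=1 prints one symbolized stack per thread-creation site. For the GODEBUG route, setting GODEBUG=schedtrace=1000,scheddetail=1 in the environment makes the runtime print scheduler state to stderr every second.

```go
// Sketch: dump runtime/pprof's "threadcreate" profile on demand.
// Where this hook lives (signal handler, debug flag, HTTP endpoint) is
// left open; the profile itself is standard library functionality.
package main

import (
	"os"
	"runtime/pprof"
)

func dumpThreadCreateProfile() error {
	// debug=1 writes human-readable, symbolized stacks: one entry per
	// call site that caused the runtime to create an OS thread.
	return pprof.Lookup("threadcreate").WriteTo(os.Stderr, 1)
}

func main() {
	// ... normal client work would go here; then, when thread counts
	// look suspicious:
	if err := dumpThreadCreateProfile(); err != nil {
		panic(err)
	}
}
```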
I encountered this lockup for the first time today. snowflake-client was using 200% CPU. I had been browsing and my requests started timing out. Here is the end of snowflake.log; it must have been around 17:00:00 when I noticed the hang.
2017/08/09 16:58:43 ---- Handler: snowflake assigned ----
2017/08/09 16:58:43 Buffered 201 bytes --> WebRTC
2017/08/09 16:58:43 WebRTC: DataChannel.OnOpen
2017/08/09 16:58:43 Flushed 201 bytes.
2017/08/09 16:58:45 Traffic Bytes (in|out): 743 | 201 -- (1 OnMessages, 1 Sends)
2017/08/09 16:58:51 Traffic Bytes (in|out): 51 | 166 -- (1 OnMessages, 2 Sends)
2017/08/09 16:58:53 WebRTC: At capacity [1/1] Retrying in 10 seconds...
2017/08/09 16:59:03 WebRTC: At capacity [1/1] Retrying in 10 seconds...
2017/08/09 16:59:13 WebRTC: At capacity [1/1] Retrying in 10 seconds...
2017/08/09 16:59:17 WebRTC: No messages received for 30 seconds -- closing stale connection.
2017/08/09 16:59:17 WebRTC: closing DataChannel
2017/08/09 16:59:23 WebRTC: Collecting a new Snowflake. Currently at [0/1]
2017/08/09 16:59:23 snowflake-SeQZtvutAUlE6pFr connecting...
2017/08/09 17:03:45 copy loop ended
2017/08/09 17:03:45 ---- Handler: closed ---
2017/08/09 17:03:45 SOCKS listening...
2017/08/09 17:03:46 SOCKS accepted: {0.0.3.0:1 map[]}
I mean, the bindings still appear to leak OS threads like a sieve and still use cgo, so it's not like anything has changed since arma encountered this.
Here's a graph of the number of file descriptors held open by the snowflake-client process while I was using it today. The first plateau at 21:30 is when it stopped working and I turned off snowflake in the browser. snowflake-client continued running, got out of whatever rut it was in, and consumed a few more descriptors before I finally closed the browser.
The descriptors come in repeating groups of six like the following. I suspect there's one group for each attempt to contact the broker (i.e., each ---- Handler: snowflake assigned ----).
snowflake 28366 david 217w FIFO 0,10 0t0 1150258477 pipe
snowflake 28366 david 218r FIFO 0,10 0t0 1150258478 pipe
snowflake 28366 david 219w FIFO 0,10 0t0 1150258478 pipe
snowflake 28366 david 220u IPv4 1150257563 0t0 UDP xx.Berkeley.EDU:48700
snowflake 28366 david 221u IPv4 1150258488 0t0 TCP xx.Berkeley.EDU:46037 (LISTEN)
snowflake 28366 david 222r FIFO 0,10 0t0 1150262588 pipe
snowflake 28366 david 223w FIFO 0,10 0t0 1150262588 pipe
snowflake 28366 david 224r FIFO 0,10 0t0 1150262589 pipe
snowflake 28366 david 225w FIFO 0,10 0t0 1150262589 pipe
snowflake 28366 david 226u IPv4 1150260856 0t0 UDP xx.Berkeley.EDU:49943
snowflake 28366 david 227u IPv4 1150262068 0t0 TCP xx.Berkeley.EDU:52465 (LISTEN)
During the plateau at 21:30, there were a lot of Buffered messages in the log:
Here are backtraces from another case from today, this time with debugging symbols. In this case there were 89 threads and 168 fds.
20170816-thread-apply-all-bt.txt
Most of the backtraces are in one of two patterns.
* 27 threads
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1 0x000000000063c9de in rtc::Event::Wait(int) ()
#2 0x000000000088833a in webrtc::EventWrapperImpl::Wait(unsigned long) ()
#3 0x000000000087ac0d in webrtc::ProcessThreadImpl::Process() ()
#4 0x000000000063e837 in rtc::PlatformThread::Run() ()
#5 0x000000000063e786 in rtc::PlatformThread::StartThread(void*) ()
#6 0x00007fcf4b70a494 in start_thread (arg=0x7fcebc5f6700) at pthread_create.c:333
#7 0x00007fcf4aba2a8f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
* 50 threads
#0 0x00007fcf4ab9b383 in select () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000000000688e48 in rtc::PhysicalSocketServer::Wait(int, bool) ()
#2 0x000000000063b3b1 in rtc::MessageQueue::Get(rtc::Message*, int, bool) ()
#3 0x0000000000639a72 in rtc::Thread::Run() ()
#4 0x0000000000639783 in rtc::Thread::PreRun(void*) ()
#5 0x00007fcf4b70a494 in start_thread (arg=0x7fcead7ea700) at pthread_create.c:333
#6 0x00007fcf4aba2a8f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Not immediately attributable to the leak I pointed out 7 MONTHS AGO: 55, 12, 11, 10, 6, 5, 4, 3, 2, 1
The Peer object leaks webrtc Thread objects, and said Thread objects also happen to call pipe() as part of the creation process. This is no longer news. While I'm not going to rule out there being other things that are wrong, fixing the bindings is probably a good idea.
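For what it's worth, the shape of the fix on the Go side looks something like the sketch below. All of the names (webrtcPeer, Destroy, peerSlot) are placeholders rather than the actual go-webrtc or snowflake-client API; the point is only that dropping the Go reference to a peer never reclaims the native rtc::Thread objects or the pipe fds they opened, so teardown has to be explicit whenever a peer is discarded or replaced.

```go
// Sketch of explicit native teardown when swapping peers. Hypothetical
// names throughout; not the real bindings API.
package client

import "sync"

// webrtcPeer stands in for whatever the bindings expose for a peer
// connection. Destroy is assumed to release the C++ rtc::Thread objects
// and the pipes they created, which the Go GC can never see.
type webrtcPeer interface {
	Destroy() error
}

// peerSlot holds the single active peer (the [0/1] / [1/1] in the log).
type peerSlot struct {
	mu   sync.Mutex
	peer webrtcPeer
}

// Replace installs a fresh peer and explicitly destroys the old one.
// Just overwriting the field would leak the old peer's OS threads and
// file descriptors on the native side.
func (s *peerSlot) Replace(fresh webrtcPeer) {
	s.mu.Lock()
	old := s.peer
	s.peer = fresh
	s.mu.Unlock()
	if old != nil {
		old.Destroy() // log the error in real code
	}
}
```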