Opened 18 months ago

Closed 5 months ago

Last modified 5 months ago

#25688 closed defect (fixed)

proxy-go is still deadlocking occasionally

Reported by: dcf
Owned by: cohosh
Priority: Low
Milestone:
Component: Circumvention/Snowflake
Version:
Severity: Normal
Keywords: network-team-roadmap-2019-Q1Q2
Cc: dcf, arlolra
Actual Points:
Parent ID:
Points: 3
Reviewer:
Sponsor:

Description

The three fallback proxy-go instances are still hanging after variable delays of a few days. This happens even after removing all the memory restrictions I discussed in comment:64:ticket:21312.

The more heavily used instances seem to deadlock sooner: the ones serving the currently used broker stop more often than the ones serving the standalone broker, though the standalone-broker instances stop too.

In the meantime, I've put the fallback proxies back on periodic restarts. The intervals were previously 1h, 2h, and 10h; I've now increased them to 17h, 23h, and 29h (prime numbers, so the average time before the next restart is < 17h).

I'll update this ticket with a graph showing uptimes when I have time.

Child Tickets

Attachments (4)

proxy-go-log.20180403.zip (2.0 MB) - added by dcf 18 months ago.
proxy-go-fd.20180403.png (30.7 KB) - added by dcf 18 months ago.
proxy-go-mem.20180403.png (49.5 KB) - added by dcf 18 months ago.
proxy-go-starting.20180403.png (130.8 KB) - added by dcf 18 months ago.

Change History (32)

Changed 18 months ago by dcf

Attachment: proxy-go-log.20180403.zip added

Changed 18 months ago by dcf

Attachment: proxy-go-fd.20180403.png added

Changed 18 months ago by dcf

Attachment: proxy-go-mem.20180403.png added

Changed 18 months ago by dcf

Attachment: proxy-go-starting.20180403.png added

comment:1 in reply to:  description Changed 18 months ago by dcf

Replying to dcf:

In the meantime, I've put the fallback proxies back on periodic restarts. Before the intervals were 1h,2h,10h; now I increased them to 17h,23h,29h (prime numbers, so the average time before the next restart is < 17h).

I'll update this ticket with a graph showing uptimes when I have time.

This graph shows restarts ╳, as well as polls and data transfer, so you can see where they got stuck after 3 or 4 days.

The rows 1h, 2h, 10h, 17h, 23h, 29h are set to periodically restart with no memory limit. The rows a, b, c ran with different memory limits at different times:

2018-03-22 03:19:19 all ulimit -v 400 MB
2018-03-22 05:16:59 a appengine unlimited, c appengine ulimit -v 800 MB, all others ulimit -v 400 MB
2018-03-23 01:02:18 a appengine unlimited, all others ulimit -m 200 MB
2018-03-27 19:26:12 a, b, c unlimited

ulimit -v caused quick deadlocks; with ulimit -m the deadlocks took longer but still happened, and they happened even with no ulimit at all.

Scroll right →→→

Memory use plateaus around the time of the hangs, though that may be a result rather than a cause. The server has 2 GB and it doesn't get close to that. (Beware that the horizontal axis doesn't quite line up with the graph above.)

File descriptors seem stable.

comment:2 Changed 17 months ago by arma

Is there some strace or ptrace or gdb equivalent for go, that lets you figure out *where* it's deadlocking? :)

comment:3 Changed 17 months ago by arlolra

Is there some strace or ptrace or gdb equivalent for go, that lets you figure out *where* it's deadlocking? :)

I'm working on that, but since we don't have any more clients, it's becoming harder to reproduce.

comment:4 Changed 14 months ago by arlolra

Is there some strace or ptrace or gdb equivalent for go, that lets you figure out *where* it's deadlocking? :)

(gdb) info threads
 Id   Target Id         Frame 
 1    Thread 0x7fef23e90740 (LWP 10511) "proxy-go" runtime.epollwait () at /usr/lib/go-1.7/src/runtime/sys_linux_amd64.s:525
 2    Thread 0x7fef21e8e700 (LWP 10512) "proxy-go" runtime.futex () at /usr/lib/go-1.7/src/runtime/sys_linux_amd64.s:388
 3    Thread 0x7fef2168d700 (LWP 10513) "proxy-go" runtime.futex () at /usr/lib/go-1.7/src/runtime/sys_linux_amd64.s:388
 4    Thread 0x7fef20e8c700 (LWP 10514) "proxy-go" runtime.futex () at /usr/lib/go-1.7/src/runtime/sys_linux_amd64.s:388
 5    Thread 0x7fef1bfff700 (LWP 10515) "proxy-go" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
 6    Thread 0x7fef1b7fe700 (LWP 10516) "proxy-go" runtime.futex () at /usr/lib/go-1.7/src/runtime/sys_linux_amd64.s:388
 7    Thread 0x7fef1affd700 (LWP 13632) "CGO Signalling" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 8    Thread 0x7fef1a7fc700 (LWP 13633) "CGO Worker" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 9    Thread 0x7fef2052b700 (LWP 13634) "VoiceProcessThr" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
 10   Thread 0x7fef2042a700 (LWP 13635) "RtcEventLog thr" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
 11   Thread 0x7fef197fa700 (LWP 14054) "CGO Signalling" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 12   Thread 0x7fef19ffb700 (LWP 14055) "CGO Worker" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 13   Thread 0x7fef20117700 (LWP 14056) "VoiceProcessThr" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
 14   Thread 0x7fef18bdc700 (LWP 14057) "RtcEventLog thr" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
 15   Thread 0x7feefbfff700 (LWP 14061) "SCTP timer" 0x00007fff62becdb8 in gettimeofday ()
 16   Thread 0x7feefb7fe700 (LWP 14112) "CGO Signalling" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 17   Thread 0x7feefaffd700 (LWP 14113) "CGO Worker" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 18   Thread 0x7fef182da700 (LWP 14114) "VoiceProcessThr" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
 19   Thread 0x7fef181d9700 (LWP 14115) "RtcEventLog thr" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
 20   Thread 0x7feef9ffb700 (LWP 14156) "CGO Signalling" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 21   Thread 0x7feefa7fc700 (LWP 14157) "CGO Worker" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 22   Thread 0x7feef97fa700 (LWP 14158) "VoiceProcessThr" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
 23   Thread 0x7feef96f9700 (LWP 14159) "RtcEventLog thr" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
 24   Thread 0x7feef95f8700 (LWP 14167) "CGO Signalling" runtime.futex () at /usr/lib/go-1.7/src/runtime/sys_linux_amd64.s:388
 25   Thread 0x7feef8df7700 (LWP 14168) "CGO Worker" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 26   Thread 0x7feef85f6700 (LWP 14169) "VoiceProcessThr" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
 27   Thread 0x7feef84f5700 (LWP 14170) "RtcEventLog thr" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
 28   Thread 0x7feede7fc700 (LWP 14250) "proxy-go" runtime.futex () at /usr/lib/go-1.7/src/runtime/sys_linux_amd64.s:388
 29   Thread 0x7feedeffd700 (LWP 14251) "proxy-go" runtime.futex () at /usr/lib/go-1.7/src/runtime/sys_linux_amd64.s:388
 30   Thread 0x7feedffff700 (LWP 14287) "CGO Signalling" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 31   Thread 0x7feedf7fe700 (LWP 14288) "CGO Worker" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 32   Thread 0x7feef83f4700 (LWP 14289) "VoiceProcessThr" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
 33   Thread 0x7feef82f3700 (LWP 14290) "RtcEventLog thr" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
 34   Thread 0x7feedd7fa700 (LWP 14293) "CGO Signalling" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 35   Thread 0x7feeddffb700 (LWP 14294) "CGO Worker" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 36   Thread 0x7feef81f2700 (LWP 14295) "VoiceProcessThr" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
 37   Thread 0x7feedcff9700 (LWP 14296) "RtcEventLog thr" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
 38   Thread 0x7feed77fe700 (LWP 14506) "CGO Signalling" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 39   Thread 0x7feed6ffd700 (LWP 14507) "CGO Worker" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 40   Thread 0x7feedc4f5700 (LWP 14508) "VoiceProcessThr" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
 41   Thread 0x7feedc3f4700 (LWP 14509) "RtcEventLog thr" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
 42   Thread 0x7feed7fff700 (LWP 14923) "CGO Signalling" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 43   Thread 0x7feedcef8700 (LWP 14924) "CGO Worker" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 44   Thread 0x7feedc6f7700 (LWP 14925) "VoiceProcessThr" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
 45   Thread 0x7feedc5f6700 (LWP 14926) "RtcEventLog thr" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
 46   Thread 0x7feed5ffb700 (LWP 15038) "CGO Signalling" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 47   Thread 0x7feed67fc700 (LWP 15039) "CGO Worker" 0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
 48   Thread 0x7feedc2f3700 (LWP 15040) "VoiceProcessThr" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
 49   Thread 0x7feedc1f2700 (LWP 15041) "RtcEventLog thr" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185


(gdb) thread 5
[Switching to thread 5 (Thread 0x7fef1bfff700 (LWP 10515))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185	../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055a388bd59cb in rtc::Event::Wait(int) ()
#2  0x000055a388cea279 in webrtc::DataChannelProxyWithInternal<webrtc::DataChannelInterface>::Send(webrtc::DataBuffer const&) ()
#3  0x000055a388bc7e88 in CGO_Channel_Send (channel=0x7feee401a060, data=0xc420116000, size=<optimized out>, binary=<optimized out>) at /home/[scrub]/go/src/github.com/keroserene/go-webrtc/datachannel.cc:19
#4  0x000055a388a10c70 in runtime.asmcgocall () at /usr/lib/go-1.7/src/runtime/asm_amd64.s:590
#5  0x000000c4204d9c20 in ?? ()
#6  0x000055a3889bb54d in runtime.cgocall (fn=0x0, arg=0xc4200bb1e0, ~r2=469757624) at /usr/lib/go-1.7/src/runtime/cgocall.go:115
#7  0x000000c420013500 in ?? ()
#8  0x0000000000000000 in ?? ()


(gdb) thread 17
[Switching to thread 17 (Thread 0x7feefaffd700 (LWP 14113))]
#0  0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
84	in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007fef227b83c3 in select () at ../sysdeps/unix/syscall-template.S:84
#1  0x000055a388c21e08 in rtc::PhysicalSocketServer::Wait(int, bool) ()
#2  0x000055a388bd4371 in rtc::MessageQueue::Get(rtc::Message*, int, bool) ()
#3  0x000055a388bd2a32 in rtc::Thread::Run() ()
#4  0x000055a388bd2743 in rtc::Thread::PreRun(void*) ()
#5  0x00007fef2331a494 in start_thread (arg=0x7feefaffd700) at pthread_create.c:333
#6  0x00007fef227bfacf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97


(gdb) thread 45
[Switching to thread 45 (Thread 0x7feedc5f6700 (LWP 14926))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185	../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055a388bd59cb in rtc::Event::Wait(int) ()
#2  0x000055a38900630f in webrtc::RtcEventLogHelperThread::ProcessEvents() ()
#3  0x000055a388bd783f in rtc::PlatformThread::Run() ()
#4  0x000055a388bd7746 in rtc::PlatformThread::StartThread(void*) ()
#5  0x00007fef2331a494 in start_thread (arg=0x7feedc5f6700) at pthread_create.c:333
#6  0x00007fef227bfacf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

comment:5 Changed 7 months ago by gaba

Points: 3

comment:6 Changed 7 months ago by cohosh

Owner: set to cohosh
Status: new → assigned

comment:7 Changed 7 months ago by gaba

Keywords: network-team-roadmap-2019-Q1Q2 added

comment:8 Changed 6 months ago by cohosh

I used delve (gdb for Go) on a deadlocked process and this is what I found:

There is one proxy-go goroutine running (the rest are helper goroutines for things like channels):

(dlv) goroutines
* Goroutine 1 - User: /go/src/snowflake.git/proxy-go/snowflake.go:118 main.main (0x6632ae)
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:303 runtime.gopark (0x444d2b)
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:303 runtime.gopark (0x444d2b)
  Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:303 runtime.gopark (0x444d2b)
  Goroutine 6 - User: /usr/local/go/src/runtime/proc.go:303 runtime.gopark (0x444d2b)
  Goroutine 18 - User: /usr/local/go/src/runtime/proc.go:303 runtime.gopark (0x444d2b)
  Goroutine 20 - User: /usr/local/go/src/runtime/proc.go:303 runtime.gopark (0x444d2b)
  Goroutine 25 - User: /usr/local/go/src/runtime/proc.go:303 runtime.gopark (0x444d2b)
[8 goroutines]

This goroutine is waiting for more tokens to become available, and our tokens channel is empty despite the absence of other running goroutines:

(dlv) frame 4
> runtime.gopark() /usr/local/go/src/runtime/proc.go:303 (PC: 0x444d2b)
Warning: debugging optimized function
Frame 4: /go/src/snowflake.git/proxy-go/snowflake.go:118 (PC: 6632ae)
(dlv) p tokens
chan bool {
	qcount: 0,
	dataqsiz: 10,
	buf: *[10]bool [false,false,false,false,false,false,false,false,false,false],
	elemsize: 1,
	closed: 0,
	elemtype: *runtime._type {
		size: 1,
		ptrdata: 0,
		hash: 335480517,
		tflag: tflagUncommon|tflagExtraStar|tflagNamed (7),
		align: 1,
		fieldalign: 1,
		kind: 129,
		alg: *(*runtime.typeAlg)(0x14982a0),
		gcdata: *1,
		str: 8072,
		ptrToThis: 122880,},
	sendx: 0,
	recvx: 0,
	recvq: waitq<bool> {
		first: *(*sudog<bool>)(0xc000060ea0),
		last: *(*sudog<bool>)(0xc000060ea0),},
	sendq: waitq<bool> {
		first: *sudog<bool> nil,
		last: *sudog<bool> nil,},
	lock: runtime.mutex {key: 0},}

So it seems that somehow WebRTC connections are exiting without calling retToken() to repopulate the channel.
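
For readers less familiar with the pattern, the tokens channel works as a counting semaphore. Below is a minimal sketch of the idea; retToken is the name used in proxy-go, while getToken and the capacity constant are assumed names for illustration (the capacity of 10 matches the dataqsiz in the delve dump above):

// Sketch only, not the actual proxy-go source.
package main

import "fmt"

const numTokens = 10 // matches dataqsiz: 10 in the delve dump

var tokens = make(chan bool, numTokens)

func init() {
    for i := 0; i < numTokens; i++ {
        tokens <- true
    }
}

// getToken blocks until a session slot is free.
func getToken() {
    <-tokens
}

// retToken returns a slot. Any code path that takes a token but never
// calls this permanently loses a slot.
func retToken() {
    tokens <- true
}

func main() {
    getToken()
    defer retToken()
    fmt.Println("handling one session")
}

Once all ten slots have leaked this way, the receive in getToken blocks forever with qcount: 0, which is exactly the state the delve dump shows.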

comment:9 Changed 6 months ago by cohosh

Also noting that setting the channel capacity lower will likely produce deadlocks more quickly, since it will take less time to lose all of the tokens.

comment:10 Changed 6 months ago by cohosh

Looking at the log output:

A healthy sequence of events for a single token produces the following messages:

2019/03/29 21:52:53 broker returns: 504
2019/03/29 21:52:56 sdp offer successfully received.
2019/03/29 21:52:56 Generating answer...
2019/03/29 21:52:57 OnDataChannel
2019/03/29 21:52:57 OnOpen channel
2019/03/29 21:52:57 connected to relay
2019/03/29 21:53:31 OnClose channel
2019/03/29 21:53:31 datachannelHandler ends

If we grep the logs for these messages we find:

$ grep "Generating" snowflake-proxy.log | wc -l
1046
$ grep "OnData" snowflake-proxy.log | wc -l
1036

So my guess is that the token is being lost somewhere in generating the answer or sendAnswer.

Last edited 6 months ago by cohosh

comment:11 Changed 6 months ago by cohosh

This is likely the cause of the problem. OnICEComplete is fired when the ICEGatheringState is complete:

func cgoOnIceGatheringStateChange(p int, state IceGatheringState) {
	INFO.Println("fired OnIceGatheringStateChange:", p)
...
	if IceGatheringStateComplete == state && nil != pc.OnIceComplete {
		pc.OnIceComplete()

However, we see an ICE Candidate error in the logs right before the deadlock:

INFO: peerconnection.go:513: fired OnIceGatheringStateChange: 1046
INFO: peerconnection.go:468: fired OnIceCandidateError:  1046

This error appears to occur *after* OnIceComplete() is fired, meaning that in the proxy-go code:

        pc.OnIceComplete = func() {
                answerChan <- struct{}{}
        }

occurs, which results in the answer channel being read instead of the error channel:

        // Wait until answer is ready.
        select {
        case err = <-errChan:
                pc.Destroy()
                return nil, err
        case _, ok := <-answerChan:
                if !ok {
                        pc.Destroy()
                        return nil, fmt.Errorf("Failed gathering ICE candidates.")
                }
        }
        return pc, nil

This causes the function makePeerConnectionFromOffer to return a nil error, bypassing the retToken() call:

        pc, err := makePeerConnectionFromOffer(offer, config)
        if err != nil {
                log.Printf("error making WebRTC connection: %s", err)
                retToken()
                return
        }

Because of the ICE candidate error, OnDataChannel is never called.

comment:12 Changed 6 months ago by cohosh

The fix for this is quite small: https://github.com/cohosh/snowflake/compare/deadlock

It's safer to only write to answerChan after we're sure that dataChannelHandler (which calls retToken() on return) has been called.

go datachannelHandler(conn, conn.RemoteAddr())
answerChan <- struct{}{}

It's also worth pointing out that OnICEComplete is apparently being deprecated (peerconnection.go#L518), so we are better off not relying on it anyway.

I'm going to run this modified version as a proxy-go instance for a few days to see if it deadlocks.

Last edited 6 months ago by cohosh

comment:13 Changed 6 months ago by cohosh

Status: assigned → needs_review

comment:14 Changed 6 months ago by cohosh

Status: needs_review → needs_revision

Oop, looks like it hangs reading from the errorChannel now... we'll need to do something a bit different.

comment:15 Changed 6 months ago by arlolra

This error appears to occur *after* OnIceComplete() is fired, meaning that in the proxy-go code:

Ah, so it is a bug in go-webrtc to assume iceComplete, when only iceGathering has completed,
https://github.com/keroserene/go-webrtc/blob/master/peerconnection.go#L518-L522

It's also worth pointing out that OnICEComplete is apparently being deprecated (peerconnection.go#L518), so we are better off not relying on it anyway.

If that's the case, we should remove it there too.

comment:16 in reply to: 11 Changed 6 months ago by dcf

Replying to cohosh:

This is likely the cause of the problem. OnICEComplete is fired when the ICEGatheringState is complete:

Great analysis!

        case _, ok := <-answerChan:
                if !ok {
                        pc.Destroy()
                        return nil, fmt.Errorf("Failed gathering ICE candidates.")
                }
        }
        return pc, nil

This code is weird... At first I saw the answerChan <- struct{}{} and I thought, this channel is only used to send a single signal; maybe this would be better as close(answerChan) as in this example. But then I don't understand why it's checking ok from the channel read--that seems to indicate that it's trying to distinguish a send from a channel close... But the code only uses one of those two possibilities.
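
For reference, the two signaling idioms being contrasted behave like this (a generic sketch, not snowflake code):

package main

import "fmt"

func main() {
    // Idiom 1: send a single value; the receive yields ok == true.
    sent := make(chan struct{}, 1)
    sent <- struct{}{}
    _, ok := <-sent
    fmt.Println("after send:", ok) // true

    // Idiom 2: close the channel; the receive yields the zero value and ok == false.
    done := make(chan struct{})
    close(done)
    _, ok = <-done
    fmt.Println("after close:", ok) // false
}

Checking ok only distinguishes anything if both a send and a close can happen on the same channel; since the code only ever sends, the !ok branch is unreachable, which is the oddity being pointed out.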

Replying to arlolra:

This error appears to occur *after* OnIceComplete() is fired, meaning that in the proxy-go code:

Ah, so it is a bug in go-webrtc to assume iceComplete, when only iceGathering has completed,
https://github.com/keroserene/go-webrtc/blob/master/peerconnection.go#L518-L522

Yes, I think you're right.

It's also worth pointing out that OnICEComplete is apparently being deprecated (peerconnection.go#L518), so we are better off not relying on it anyway.

If that's the case, we should remove it there too.

Agreed.

comment:17 in reply to:  16 Changed 6 months ago by cohosh

Here's a partial solution for the ICECandidateError problem: https://github.com/cohosh/snowflake/compare/deadlock

Replying to dcf:

This code is weird... At first I saw the answerChan <- struct{}{} and I thought, this channel is only used to send a single signal; maybe this would be better as close(answerChan) as in this example. But then I don't understand why it's checking ok from the channel read--that seems to indicate that it's trying to distinguish a send from a channel close... But the code only uses one of those two possibilities.

The code was a bit unusual. I think CreateAnswer was put into a goroutine because it's blocking on the ICE negotiation, but we need it to complete before calling sendAnswer anyway, so I removed the goroutine and it appears to be functioning fine. This is not a complete solution (see below).

Replying to arlolra:

This error appears to occur *after* OnIceComplete() is fired, meaning that in the proxy-go code:

Ah, so it is a bug in go-webrtc to assume iceComplete, when only iceGathering has completed,
https://github.com/keroserene/go-webrtc/blob/master/peerconnection.go#L518-L522

Yes, I think you're right.

It's also worth pointing out that the proxy's completion of ICE to create the SDP answer doesn't necessarily imply that OnDataChannel will be called (which is what was causing the problem in the first place). In order for OnDataChannel to be called, the client has to receive the answer from the broker, call SetRemoteDescription, and trigger the data channel to open.

The token is being returned only once OnDataChannel has been called:

func datachannelHandler(conn *webRTCConn, remoteAddr net.Addr) {
        defer conn.Close()
        defer retToken()

and the rest of the code assumes that as long as pc.CreateAnswer succeeded, it doesn't need to worry about returning the token.

That means that a client can cause the proxy to lose tokens by sending an offer, and then doing nothing with the answer. I think it was happening because of an ICE error before, but we need to make sure we fix the faulty/malicious client case as well.

This is tricky to solve because there isn't a PeerConnection callback for "SDP answer sent, but data channel was never opened".

It's also worth pointing out that OnICEComplete is apparently being deprecated (peerconnection.go#L518), so we are better off not relying on it anyway.

If that's the case, we should remove it there too.

Agreed.

Just noting that the snowflake client also uses the OnICEComplete callback.

comment:18 Changed 6 months ago by cohosh

The code was a bit unusual. I think CreateAnswer was put into a goroutine because it's blocking on the ICE negotiation, but we need it to complete before calling sendAnswer anyway, so I removed the goroutine and it appears to be functioning fine. This is not a complete solution (see below).

I'll add that there's a 3-second timeout on CreateAnswer in go-webrtc.

comment:19 Changed 6 months ago by cohosh

That means that a client can cause the proxy to lose tokens by sending an offer, and then doing nothing with the answer.

Just confirmed this is the case, even with the above fix, in the test environment by commenting out

        //err := c.pc.SetRemoteDescription(answer)

in client/lib/webrtc.go. The client repeatedly tries to connect to a snowflake and after 10 tries, the proxy deadlocks because the tokens are gone.

comment:20 Changed 6 months ago by cohosh

Status: needs_revision → needs_review

Here we go: https://github.com/cohosh/snowflake/compare/deadlock

This adds a timeout that checks after a minute whether the peer connection has been established. If not, the peer connection is destroyed and the token is returned.
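
Sketched roughly in Go, the approach looks something like the following; dataChannelTimeout, the connected channel, and the helper function are illustrative assumptions rather than the actual patch, which lives in the branch linked above:

// In this sketch, the OnDataChannel callback would close(connected).
const dataChannelTimeout = time.Minute

func watchForStalledConnection(pc *webrtc.PeerConnection, connected <-chan struct{}) {
    select {
    case <-connected:
        // The client opened its data channel in time; datachannelHandler
        // now owns the token and calls retToken() when it returns.
    case <-time.After(dataChannelTimeout):
        log.Println("timed out waiting for the client to open a data channel")
        pc.Destroy() // tear down the half-open PeerConnection
        retToken()   // return the session slot so tokens are not lost
    }
}

It would be launched as go watchForStalledConnection(pc, connected) right after the answer is sent.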

comment:21 Changed 6 months ago by dcf

The token system is meant to limit the number of simultaneous sessions, right? What if we removed it entirely, and let it handle sessions up to OS limits?

comment:22 in reply to: 21 Changed 6 months ago by cohosh

Replying to dcf:

The token system is meant to limit the number of simultaneous sessions, right? What if we removed it entirely, and let it handle sessions up to OS limits?

I am ok with this as well, but we should probably be tearing down the peer connections properly after a timeout anyway (though maybe go handles this on its own eventually?)

comment:23 in reply to: 22 Changed 6 months ago by dcf

I was going to quibble about makePeerConnectionFromOffer becoming blocking, which, through runSession, will block the main polling loop in main (if I understand right). But I think this is an architectural problem unrelated to the deadlock fix, so let's ignore it. (It seems like makePeerConnectionFromOffer and sendAnswer should run in their own goroutine, responsible for a single PeerConnection; see the sketch below.)
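
For concreteness, here is a sketch of the shape such a refactoring might take. The loop structure, pollOffer, and the sendAnswer signature are assumptions; makePeerConnectionFromOffer, config, and the error handling mirror the snippets quoted earlier in this ticket:

// Sketch only: one goroutine per PeerConnection, so a slow or blocking
// session setup cannot stall the broker polling loop.
for {
    getToken()           // block until a session slot is free
    offer := pollOffer() // assumed helper: one broker poll returning an SDP offer
    go func() {
        pc, err := makePeerConnectionFromOffer(offer, config)
        if err != nil {
            log.Printf("error making WebRTC connection: %s", err)
            retToken()
            return
        }
        sendAnswer(pc) // assumed signature; the real sendAnswer may differ
        // datachannelHandler, started from the OnDataChannel callback,
        // calls retToken() when the session ends.
    }()
}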

Replying to cohosh:

Replying to dcf:
I am ok with this as well, but we should probably be tearing down the peer connections properly after a timeout anyway (though maybe go handles this on its own eventually?)

I'm looking at the code and I don't quite get how it's supposed to work. The error handlers (here, here, here) call pc.Destroy(), and retToken() in the caller. The timeout checker here calls both pc.Destroy() and also retToken(), which makes sense because it doesn't have a caller to call retToken(). So that looks good.

When a PeerConnection ends "naturally", I suppose what happens is that dc.OnClose() gets called, which calls pc.DeleteDataChannel(dc), but not pc.Destroy() nor retToken(). I can understand why pc.DeleteDataChannel(dc) gets called here and not in the other cases--because in the other cases we don't have a DataChannel yet--but then I'm wondering why it doesn't call the other two functions. Are we losing a token in this case too?

I was thinking, what we need is an OnError handler so we can get called back when a DataChannel fails to establish. I found this comment:

OnError - is not implemented because the underlying Send
always returns true as specified for SCTP, there is no reasonable
exposure of other specific errors from the native code, and OnClose
already covers the bases.

I was curious about what happens in browser WebRTC so I hacked this demo and code to comment out the onicecandidate callback in the remote. What happens is you get a browser-produced line in the console:

⚠️ ICE failed, add a STUN server and see about:webrtc for more details

but none of the error callbacks get called, so the application is never aware of the failure. (There is an onicecandidateerror callback but apparently nothing implements it.) So it looks like a browser is not doing anything fundamentally different, and checking after a timeout seems like a reasonable way to do it.

So I think your patch looks good, if you can

  1. answer whether the OnClose path needs to call retToken()
  2. move the time.Minute value into a commented constant

comment:24 in reply to:  23 Changed 6 months ago by cohosh

Replying to dcf:

I was going to quibble about makePeerConnectionFromOffer becoming blocking, which, through runSession, will block the main polling loop in main (if I understand right). But I think this is an architectural problem unrelated to the deadlock fix, so let's ignore it. (It seems like makePeerConnectionFromOffer and sendAnswer should run in their own goroutine, responsible for a single PeerConnection.)

I agree that these should ideally be their own goroutine, though this is unrelated to this bug. Note also that the blocking of makePeerConnectionFromOffer is bounded by a 3-second timeout here in the blocking library call. It's not great design to rely on this, but perhaps this is better fixed in a separate refactoring ticket.

Replying to cohosh:

Replying to dcf:
I am ok with this as well, but we should probably be tearing down the peer connections properly after a timeout anyway (though maybe go handles this on its own eventually?)

I'm looking at the code and I don't quite get how it's supposed to work. The error handlers (here, here, here) call pc.Destroy(), and retToken() in the caller. The timeout checker here calls both pc.Destroy() and also retToken(), which makes sense because it doesn't have a caller to call retToken(). So that looks good.

When a PeerConnection ends "naturally", I suppose what happens is that dc.OnClose() gets called, which calls pc.DeleteDataChannel(dc), but not pc.Destroy() nor retToken(). I can understand why pc.DeleteDataChannel(dc) gets called here and not in the other cases--because in the other cases we don't have a DataChannel yet--but then I'm wondering why it doesn't call the other two functions. Are we losing a token in this case too?

I was thinking, what we need is an OnError handler so we can get called back when a DataChannel fails to establish. I found this comment:

OnError - is not implemented because the underlying Send
always returns true as specified for SCTP, there is no reasonable
exposure of other specific errors from the native code, and OnClose
already covers the bases.

I was curious about what happens in browser WebRTC so I hacked this demo and code to comment out the onicecandidate callback in the remote. What happens is you get a browser-produced line in the console:

⚠️ ICE failed, add a STUN server and see about:webrtc for more details

but none of the error callbacks get called, so the application is never aware of the failure. (There is an onicecandidateerror callback but apparently nothing implements it.) So it looks like a browser is not doing anything fundamentally different, and checking after a timeout seems like a reasonable way to do it.

Yeah, I found the same unimplemented error callback. I'm still not sure that an onIceCandidateError is what we want, since even if ICE succeeds at the proxy side, it still doesn't guarantee that the client will use the SDP answer it receives through the signaling channel. What we'd need is for the library itself to also have a timeout between creating the SDP answer and firing the OnDataChannel callback (which is essentially what we're doing in this code).

So I think your patch looks good, if you can

  1. answer whether the OnClose path needs to call retToken()
  2. move the time.Minute value into a commented constant

To answer (1), the call to pc.Destroy() and retToken() is handled by the goroutine running datachannelHandler here. This function calls both of these when it returns, which it should automatically do when OnClose is called, because the deletion of the data channel in OnClose will cause CopyLoopTimeout to return.

It might actually be better design for us to move these to OnClose, but right now this works as well.

(2) has now been implemented in this commit.

comment:25 Changed 5 months ago by dcf

Status: needs_review → merge_ready

Thanks for answering my questions. It looks good to me.

comment:26 Changed 5 months ago by cohosh

Resolution: fixed
Status: merge_ready → closed

Merged to master.

I've set up snowflake-proxy-restartless to use the new version of proxy-go, but kept the other instances on the old version for now. After a few days I'll move the others over.

comment:27 Changed 5 months ago by cypherpunks

@cohosh I'm still not 100% sure, but every 7 minutes or so it looks like the snowflake I'm connected to is deadlocking (or something else?) and I get another one; this is very visible in my browsing experience. It would be great if you could investigate this.

comment:28 in reply to:  27 Changed 5 months ago by cohosh

Replying to cypherpunks:

@cohosh I'm still not 100% sure, but every 7 minutes or so it looks like the snowflake I'm connected to is deadlocking (or something else?) and I get another one; this is very visible in my browsing experience. It would be great if you could investigate this.

Thanks! I've made a new ticket for this here: #30258
