Opened 3 weeks ago

Closed 6 days ago

#32576 closed defect (fixed)

Fix race condition in snowflake broker

Reported by: cohosh Owned by: cohosh
Priority: Very High Milestone:
Component: Circumvention/Snowflake Version:
Severity: Normal Keywords: metrics
Cc: karsten, arlolra, cohosh, phw, dcf Actual Points: 1
Parent ID: Points:
Reviewer: phw Sponsor: Sponsor28

Description (last modified by cohosh)

There is a race condition involving the snowflake heap that has been causing the broker to crash several times a day. The race condition has existed in the broker for several years, but some recent updates, as well as the host migration, managed to shake it loose.


The broker has been crashing repeatedly and often since the migration. We noticed because CollecTor stopped collecting metrics after the restart on 14 November 2019.

Child Tickets

Attachments (2)

metrics-fetch.txt (4.8 KB) - added by cohosh 3 weeks ago.
metrics-broker.txt (78.4 KB) - added by cohosh 3 weeks ago.


Change History (19)

Changed 3 weeks ago by cohosh

Attachment: metrics-fetch.txt added

Changed 3 weeks ago by cohosh

Attachment: metrics-broker.txt added

comment:1 Changed 3 weeks ago by cohosh

Okay, it turns out I was still ssh-ing into the old broker. I copied the metrics log over from the old broker so that CollecTor at least has *some* data from the last week, but note that these are metrics from the old broker host, and most proxies will have migrated to the new host by now.

Interestingly, it looks like one proxy in Japan is still polling the old broker.

I'm still not sure why the new broker hasn't been saving metrics since the other domains were migrated on November 14th. Looking at the log files, it looks like the broker might have crashed.

comment:2 Changed 3 weeks ago by cohosh

Alright, I found what look like a few crashes:

>[scrubbed]: read: connection reset by peer
panic: runtime error: index out of range

goroutine 360108 [running]:
main.SnowflakeHeap.Swap(0xc001440200, 0x10, 0x40, 0xffffffffffffffff, 0xf)
	/home/david/branches/snowflake/broker/snowflake-heap.go:30 +0x8a
container/heap.Remove(0x7bc1c0, 0xc00007abe0, 0xffffffffffffffff, 0x1, 0xc00007c601)
	/usr/lib/go-1.11/src/container/heap/heap.go:75 +0x9c
main.(*BrokerContext).Broker.func1(0xc000ef4240, 0xc00007ac00, 0xc00111e5e0)
	/home/david/branches/snowflake/broker/broker.go:129 +0x128
created by main.(*BrokerContext).Broker
	/home/david/branches/snowflake/broker/broker.go:122 +0xa2
2019/11/15 12:25:29 http2: received GOAWAY [FrameHeader GOAWAY len=8], starting graceful shutdown
fatal error: concurrent map writes
2019/11/15 17:23:46 http: TLS handshake error from [scrubbed]: EOF
panic: runtime error: index out of range

goroutine 1949618 [running]:
main.SnowflakeHeap.Swap(0xc003e08400, 0x20, 0x40, 0xffffffffffffffff, 0x1f)
        /home/david/branches/snowflake/broker/snowflake-heap.go:30 +0x8a
container/heap.Remove(0x7bc1c0, 0xc00000cc00, 0xffffffffffffffff, 0x1, 0xc001de4f01)
        /usr/lib/go-1.11/src/container/heap/heap.go:75 +0x9c
main.(*BrokerContext).Broker.func1(0xc004d065a0, 0xc00000cc20, 0xc004b03300)
        /home/david/branches/snowflake/broker/broker.go:129 +0x128
created by main.(*BrokerContext).Broker
        /home/david/branches/snowflake/broker/broker.go:122 +0xa2
2019/11/15 18:03:08 http: TLS handshake error from [scrubbed]: read tcp [scrubbed]->[scrubbed]: read:
 connection reset by peer
fatal error: concurrent map writes

... and so on, pretty much continuously through to today:

2019/11/22 10:15:44 http: TLS handshake error from [scrubbed]: EOF
fatal error: concurrent map writes

The weird thing is, these errors don't occur on the old broker host:

/var/log/snowflake-broker# grep "fatal error" ./*
/var/log/snowflake-broker# grep "panic" ./*

Both commands return no results.

The frequency of the panics on the new host explains the lack of metrics: we only log metrics once the broker has been running for 24 hours straight, and that hasn't happened since 14 November 2019.

comment:3 Changed 3 weeks ago by cohosh

Okay, so the crashes have been happening ever since this restart: https://trac.torproject.org/projects/tor/ticket/29207#comment:35

comment:4 Changed 3 weeks ago by cohosh

The weird thing is that the #29207 update was deployed on the old broker at the same time but hasn't caused any problems. It's possible this was due to the DNS update to point to the new broker, but as seen from the metrics data, the old broker was still getting a lot of proxy polls for several days after.

comment:5 Changed 3 weeks ago by cohosh

In any case, this is almost certainly due to this documented race condition, so we should fix that.

comment:6 Changed 3 weeks ago by cohosh

Description: modified (diff)
Summary: Snowflake broker not serving metrics correctly → Fix race condition in snowflake broker

Just updating the summary of this ticket to reflect the actual problem.

comment:7 in reply to:  4 ; Changed 3 weeks ago by dcf

Replying to cohosh:

The weird thing is that the #29207 update was deployed on the old broker at the same time but hasn't caused any problems. It's possible this was due to the DNS update to point to the new broker, but as seen from the metrics data, the old broker was still getting a lot of proxy polls for several days after.

Perhaps because the old broker had 1 CPU core and the new one has 2 CPU cores? I can imagine that might affect goroutine scheduling which might affect the visibility of race conditions.

You might try setting GOMAXPROCS=1 in /etc/service/snowflake-broker/run as a temporary workaround, if indeed it helps.
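For reference, the suggested workaround would amount to a runit service script along these lines (a hypothetical sketch: the binary path and redirection are assumptions, only the file path and the GOMAXPROCS setting come from the suggestion above):

```shell
#!/bin/sh
# Hypothetical /etc/service/snowflake-broker/run (runit service script).
# GOMAXPROCS=1 limits the Go runtime to one OS thread executing
# goroutines, which can reduce the visibility of races (it does not
# fix them).
export GOMAXPROCS=1
exec /usr/local/bin/snowflake-broker 2>&1
```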

comment:8 Changed 3 weeks ago by cohosh

Status: assigned → needs_review

This race condition occurs because both the proxy process and the client process try to Pop/Remove the same snowflake from the heap.

When a proxy polls the broker, it starts a goroutine that waits either for an offer on the snowflake's offer channel or for a timeout:

        go func(request *ProxyPoll) {
            select {
            case offer := <-snowflake.offerChannel:
                request.offerChannel <- offer
            case <-time.After(time.Second * ProxyTimeout):
                // This snowflake is no longer available to serve clients.
                // TODO: Fix race using a delete channel
                heap.Remove(ctx.snowflakes, snowflake.index)
                delete(ctx.idToSnowflake, snowflake.id)
                request.offerChannel <- nil
            }
        }(request)

Snowflakes are removed from the heap when they time out or if they are popped off the heap by a client:

func clientOffers(ctx *BrokerContext, w http.ResponseWriter, r *http.Request) {

...

    // Delete must be deferred in order to correctly process answer request later.
    snowflake := heap.Pop(ctx.snowflakes).(*Snowflake)
    defer delete(ctx.idToSnowflake, snowflake.id)
    snowflake.offerChannel <- offer

A race can always occur where the timeout happens after the Pop removes the snowflake but before the offer is sent through offerChannel. This can be fixed through the use of locks and a check to see if snowflake.index has been set to -1 (which happens after it has been popped off the heap). Here's a patch that adds synchronization to the broker to prevent simultaneous access to the heap as well as the idToSnowflake map: https://github.com/cohosh/snowflake/pull/14

I'd like to do some tests before merging this to make sure that the synchronization doesn't slow things down too much, but a code review would be good at this point.

comment:9 in reply to:  7 Changed 3 weeks ago by cohosh

Replying to dcf:

Perhaps because the old broker had 1 CPU core and the new one has 2 CPU cores? I can imagine that might affect goroutine scheduling which might affect the visibility of race conditions.

Ah, that would do it :) The reason we didn't notice before is that clients and proxies weren't around to trigger the race until we switched over the DNS records for the bamsoftware.com domains.

You might try setting GOMAXPROCS=1 in /etc/service/snowflake-broker/run as a temporary workaround, if indeed it helps.

Okay, I did this and restarted the broker as of 2019/11/22 22:50:27. That should work until we get this patch reviewed. Thanks!

comment:10 Changed 3 weeks ago by cohosh

Setting export GOMAXPROCS=1 doesn't seem to have helped. The broker crashed again at 2019/11/23 08:24:18 this morning.

comment:11 Changed 2 weeks ago by cohosh

Reviewer: phw

Alright, I modified the patch and added a test that produces a data race before the patch but passes afterwards. This is definitely ready for a review.

For a broker update: the broker hasn't crashed for over 2 days so we have some recent metrics: https://metrics.torproject.org/collector/recent/snowflakes/

comment:12 Changed 2 weeks ago by cohosh

Actual Points: 1

comment:13 Changed 2 weeks ago by phw

This looks good. I left a review on your GitHub pull request. I had a minor suggestion and a question regarding the scope of what needs to be protected by a lock.

comment:14 Changed 9 days ago by cohosh

Added another commit and tested it locally. How does it look?

comment:15 Changed 8 days ago by phw

Status: needs_review → merge_ready

Looks good to me!

comment:17 Changed 6 days ago by cohosh

Resolution: fixed
Status: merge_ready → closed

Deployed to the broker at 2019/12/05 16:16:05.
