Opened 3 weeks ago

Closed 6 days ago

#32576 closed defect (fixed)

Fix race condition in snowflake broker

Reported by: cohosh Owned by: cohosh
Priority: Very High Milestone:
Component: Circumvention/Snowflake Version:
Severity: Normal Keywords: metrics
Cc: karsten, arlolra, cohosh, phw, dcf Actual Points: 1
Parent ID: Points:
Reviewer: phw Sponsor: Sponsor28

Description (last modified by cohosh)

There is a race condition involving the snowflake heap that has been causing the broker to crash several times a day. The race condition has existed in the broker for several years, but some recent updates, as well as the host migration, managed to shake it loose.


The broker has been crashing repeatedly and often since the migration. We noticed because CollecTor stopped collecting metrics after the restart on 14 November 2019.

Child Tickets

Attachments (2)

metrics-fetch.txt (4.8 KB) - added by cohosh 3 weeks ago.
metrics-broker.txt (78.4 KB) - added by cohosh 3 weeks ago.


Change History (19)

Changed 3 weeks ago by cohosh

Attachment: metrics-fetch.txt added

Changed 3 weeks ago by cohosh

Attachment: metrics-broker.txt added

comment:1 Changed 3 weeks ago by cohosh

Okay, it turns out I was still ssh-ing into the old broker. I copied the metrics log over from the old broker so that CollecTor at least has *some* data from the last week, but note that these are metrics from the old broker host, and most proxies will have migrated to the new host by now.

Interestingly, it looks like one proxy in Japan is still polling the old broker.

I'm still not sure why the new broker hasn't been saving metrics since the other domains were migrated on November 14th. Looking at the log files, it looks like the broker might have crashed.

comment:2 Changed 3 weeks ago by cohosh

Alright, I found what look like a few crashes:

>[scrubbed]: read: connection reset by peer
panic: runtime error: index out of range

goroutine 360108 [running]:
main.SnowflakeHeap.Swap(0xc001440200, 0x10, 0x40, 0xffffffffffffffff, 0xf)
	/home/david/branches/snowflake/broker/snowflake-heap.go:30 +0x8a
container/heap.Remove(0x7bc1c0, 0xc00007abe0, 0xffffffffffffffff, 0x1, 0xc00007c601)
	/usr/lib/go-1.11/src/container/heap/heap.go:75 +0x9c
main.(*BrokerContext).Broker.func1(0xc000ef4240, 0xc00007ac00, 0xc00111e5e0)
	/home/david/branches/snowflake/broker/broker.go:129 +0x128
created by main.(*BrokerContext).Broker
	/home/david/branches/snowflake/broker/broker.go:122 +0xa2
2019/11/15 12:25:29 http2: received GOAWAY [FrameHeader GOAWAY len=8], starting graceful shutdown
fatal error: concurrent map writes
2019/11/15 17:23:46 http: TLS handshake error from [scrubbed]: EOF
panic: runtime error: index out of range

goroutine 1949618 [running]:
main.SnowflakeHeap.Swap(0xc003e08400, 0x20, 0x40, 0xffffffffffffffff, 0x1f)
        /home/david/branches/snowflake/broker/snowflake-heap.go:30 +0x8a
container/heap.Remove(0x7bc1c0, 0xc00000cc00, 0xffffffffffffffff, 0x1, 0xc001de4f01)
        /usr/lib/go-1.11/src/container/heap/heap.go:75 +0x9c
main.(*BrokerContext).Broker.func1(0xc004d065a0, 0xc00000cc20, 0xc004b03300)
        /home/david/branches/snowflake/broker/broker.go:129 +0x128
created by main.(*BrokerContext).Broker
        /home/david/branches/snowflake/broker/broker.go:122 +0xa2
2019/11/15 18:03:08 http: TLS handshake error from [scrubbed]: read tcp [scrubbed]->[scrubbed]: read:
 connection reset by peer
fatal error: concurrent map writes

... and so on, pretty much continuously through to today:

2019/11/22 10:15:44 http: TLS handshake error from [scrubbed]: EOF
fatal error: concurrent map writes

The weird thing is, these errors don't occur on the old broker host:

/var/log/snowflake-broker# grep "fatal error" ./*
/var/log/snowflake-broker# grep "panic" ./*

Both commands return no results.

The frequency of the panics on the new host explains the lack of metrics: we only log metrics once the broker has been running for 24 hours straight, and that hasn't happened since 14 November 2019.

comment:3 Changed 3 weeks ago by cohosh

Okay, so the crashes have been happening ever since this restart: https://trac.torproject.org/projects/tor/ticket/29207#comment:35

comment:4 Changed 3 weeks ago by cohosh

The weird thing is that the #29207 update was deployed on the old broker at the same time but hasn't caused any problems. It's possible this was due to the DNS update to point to the new broker, but as seen from the metrics data, the old broker was still getting a lot of proxy polls for several days after.

comment:5 Changed 3 weeks ago by cohosh

In any case, this is almost certainly due to this documented race condition, so we should fix that.

comment:6 Changed 3 weeks ago by cohosh

Description: modified (diff)
Summary: Snowflake broker not serving metrics correctly → Fix race condition in snowflake broker

Just updating the summary of this ticket to reflect the actual problem.

comment:7 in reply to:  4 ; Changed 3 weeks ago by dcf

Replying to cohosh:

The weird thing is that the #29207 update was deployed on the old broker at the same time but hasn't caused any problems. It's possible this was due to the DNS update to point to the new broker, but as seen from the metrics data, the old broker was still getting a lot of proxy polls for several days after.

Perhaps because the old broker had 1 CPU core and the new one has 2 CPU cores? I can imagine that might affect goroutine scheduling which might affect the visibility of race conditions.

You might try setting GOMAXPROCS=1 in /etc/service/snowflake-broker/run as a temporary workaround, if indeed it helps.
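For reference, the suggested workaround would amount to a runit service script along these lines (a hypothetical sketch: the binary path and redirection are assumptions, only the file path and the GOMAXPROCS setting come from the suggestion above):

```shell
#!/bin/sh
# Hypothetical /etc/service/snowflake-broker/run (runit service script).
# GOMAXPROCS=1 limits the Go runtime to one OS thread executing
# goroutines, which can reduce the visibility of races (it does not
# fix them).
export GOMAXPROCS=1
exec /usr/local/bin/snowflake-broker 2>&1
```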

comment:8 Changed 3 weeks ago by cohosh

Status: assigned → needs_review

This race condition occurs because both the proxy process and the client process try to Pop/Remove the same snowflake from the heap.

When a proxy polls the broker, it starts a goroutine that waits either for an offer on the snowflake's offer channel or for a timeout:

        go func(request *ProxyPoll) {
            select {
            case offer := <-snowflake.offerChannel:
                request.offerChannel <- offer
            case <-time.After(time.Second * ProxyTimeout):
                // This snowflake is no longer available to serve clients.
                // TODO: Fix race using a delete channel
                heap.Remove(ctx.snowflakes, snowflake.index)
                delete(ctx.idToSnowflake, snowflake.id)
                request.offerChannel <- nil
            }
        }(request)

Snowflakes are removed from the heap when they time out or if they are popped off the heap by a client:

func clientOffers(ctx *BrokerContext, w http.ResponseWriter, r *http.Request) {

...

    // Delete must be deferred in order to correctly process answer request later.
    snowflake := heap.Pop(ctx.snowflakes).(*Snowflake)
    defer delete(ctx.idToSnowflake, snowflake.id)
    snowflake.offerChannel <- offer

A race can always occur where the timeout happens after the Pop removes the snowflake but before the offer is sent through offerChannel. This can be fixed through the use of locks and a check to see if snowflake.index has been set to -1 (which happens after it has been popped off the heap). Here's a patch that adds synchronization to the broker to prevent simultaneous access to the heap as well as the idToSnowflake map: https://github.com/cohosh/snowflake/pull/14

I'd like to do some tests before merging this to make sure that the synchronization doesn't slow things down too much, but a code review would be good at this point.

comment:9 in reply to:  7 Changed 3 weeks ago by cohosh

Replying to dcf:

Perhaps because the old broker had 1 CPU core and the new one has 2 CPU cores? I can imagine that might affect goroutine scheduling which might affect the visibility of race conditions.

Ah, that would do it :) The reason we didn't notice before is that clients and proxies weren't around to trigger the race until we switched over the DNS records for the bamsoftware.com domains.

You might try setting GOMAXPROCS=1 in /etc/service/snowflake-broker/run as a temporary workaround, if indeed it helps.

Okay, I did this and restarted the broker as of 2019/11/22 22:50:27. That should work until we get this patch reviewed. Thanks!

comment:10 Changed 3 weeks ago by cohosh

Setting export GOMAXPROCS=1 doesn't seem to have helped. The broker crashed again at 2019/11/23 08:24:18 this morning.

comment:11 Changed 2 weeks ago by cohosh

Reviewer: phw

Alright, I modified the patch and added a test that produces a data race before the patch but passes afterwards. This is definitely ready for a review.

For a broker update: the broker hasn't crashed for over 2 days so we have some recent metrics: https://metrics.torproject.org/collector/recent/snowflakes/

comment:12 Changed 2 weeks ago by cohosh

Actual Points: 1

comment:13 Changed 2 weeks ago by phw

This looks good. I left a review on your GitHub pull request. I had a minor suggestion and a question regarding the scope of what needs to be protected by a lock.

comment:14 Changed 9 days ago by cohosh

Added another commit and tested it locally. How does it look?

comment:15 Changed 8 days ago by phw

Status: needs_review → merge_ready

Looks good to me!

comment:17 Changed 6 days ago by cohosh

Resolution: fixed
Status: merge_ready → closed

Deployed to the broker at 2019/12/05 16:16:05.
