Opened 2 months ago

Last modified 3 weeks ago

#31425 needs_review defect

Snowflake broker is sluggish and sometimes fails

Reported by: cohosh Owned by: dcf
Priority: Low Milestone:
Component: Circumvention/Snowflake Version:
Severity: Normal Keywords: broker
Cc: arlolra, cohosh, phw, dcf Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

This morning while trying to deploy a change I noticed that my SSH connection is extremely sluggish.

I'm also getting a lot of timeouts at the client:

2019/08/16 15:37:54 Negotiating via BrokerChannel...
Target URL:  snowflake-broker.azureedge.net 
Front URL:   ajax.aspnetcdn.com
2019/08/16 15:38:05 BrokerChannel Response:
504 Gateway Timeout

2019/08/16 15:38:05 BrokerChannel Error: Unexpected error, no answer.
2019/08/16 15:38:05 Failed to retrieve answer. Retrying in 10 seconds

Some requests go through, but the timeouts seem to be occurring more frequently. At first glance, this doesn't seem to be a CPU or memory resources consumption problem. Maybe there's a data transfer limit we're hitting?

Child Tickets

Change History (20)

comment:1 Changed 7 weeks ago by cohosh

Priority: Very HighLow

This no longer seems to be an issue. I'll leave the ticket open for a bit in case it happens again.

comment:2 Changed 7 weeks ago by cohosh

Status: newneeds_information

comment:3 in reply to:  1 ; Changed 7 weeks ago by cypherpunks

Replying to cohosh:

This no longer seems to be an issue. I'll leave the ticket open for a bit in case it happens again.

That's probably because snowflake usage has dropped significantly due to the problems in #31385 (even our buddy from China recently reported problems in #30350#comment:35 ).

comment:4 in reply to:  3 ; Changed 7 weeks ago by cohosh

Replying to cypherpunks:

Replying to cohosh:

This no longer seems to be an issue. I'll leave the ticket open for a bit in case it happens again.

That's probably because snowflake usage has dropped significantly due to the problems in #31385 (even our buddy from China recently reported problems in #30350#comment:35 ).

Thanks cypherpunks, I'd be surprised if the broker problems were related to the changes in client usage. Note that by "drops significantly" we're talking about going from 30 users to 10 users. Our proxy usage has been steadily growing, however.

comment:5 in reply to:  4 ; Changed 7 weeks ago by cypherpunks

Replying to cohosh:

Thanks cypherpunks, I'd be surprised if the broker problems were related to the changes in client usage. Note that by "drops significantly" we're talking about going from 30 users to 10 users. Our proxy usage has been steadily growing, however.

I'm referring to actual usage, since it seems unclear to me how one can keep using snowflake when it stalls indefinitely whenever one hits a proxy operated with the WebExt in Firefox (please consider #31100#comment:26).

Last edited 7 weeks ago by cypherpunks (previous) (diff)

comment:6 in reply to:  5 ; Changed 7 weeks ago by cohosh

Replying to cypherpunks:

Replying to cohosh:

Thanks cypherpunks, I'd be surprised if the broker problems were related to the changes in client usage. Note that by "drops significantly" we're talking about going from 30 users to 10 users. Our proxy usage has been steadily growing, however.

I'm referring to actual usage, since it seems unclear to me how one can keep using snowflake when it stalls indefinitely whenever one hits a proxy operated with the WebExt in Firefox (please consider #31100#comment:26).

Please keep your comments on topic. This ticket is for problems with the broker being slow. Usage of the broker that could possibly result in the problems described here are related to client and proxy connections to the broker for the rendezvous protocol. The metrics you linked are about bridge usage and as mentioned I don't think these reflect information useful to this ticket.

We are investigating the concerns you have pointed out, please keep your comments about those concerns to ticket #31385.

comment:7 in reply to:  6 ; Changed 6 weeks ago by cypherpunks

Replying to cohosh:

Please keep your comments on topic. This ticket is for problems with the broker being slow.

I apologize for the misunderstanding but I was on-topic, what I wanted to convey was that since there were problems with connecting with snowflake there were less users of snowflake and hence less requests to the broker and hence this issue seemed to disappear (if we assume that it's caused by a rate liming of some sorts). And what may confirm this analysis further is that now it takes a ton of time for snowflake to bootstrap since I constantly get 504 Gateway Timeouts, and it takes a ton of time for tabs to load again when I need to pick a new snowflake while browsing. This again makes snowflake completely unusable for me.

comment:8 in reply to:  7 ; Changed 6 weeks ago by cohosh

Replying to cypherpunks:

now it takes a ton of time for snowflake to bootstrap since I constantly get 504 Gateway Timeouts, and it takes a ton of time for tabs to load again when I need to pick a new snowflake while browsing. This again makes snowflake completely unusable for me.

Thanks for the update cypherpunks,

To add to the information here, starting yesterday I've been getting notifications from sysmon that the snowflake broker becomes unreachable periodically (I received 3 notifications yesterday that unreachability lasted approximately 10-25 minutes.

I also SSH'd into the broker today and it is very slow. Nothing in the metrics suggests an unusually high load of regular clients and proxies compared to last week when it was working just fine.
Running top shows that there isn't a lot of CPU or memory usage. This leaves a data transfer or maybe I/O as possible causes for the slowdown. I'll set up some packet captures to see whether we're getting a large amount of traffic not represented in the metrics.

Snowflake proxy metrics: https://snowflake-broker.bamsoftware.com/metrics

Possibly related: #30498

comment:9 Changed 6 weeks ago by cohosh

Update:

I did a packet capture and nothing seemed unusual there.

The interface isn't reporting any dropped packets:

        RX packets 2023380128  bytes 115089245834 (107.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 426706137  bytes 46915271886 (43.6 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

and the disk has plenty of space. I installed atop to see if there is network or disk load but I'm not seeing anything there:

 DSK |        xvda1  | busy      0%  |
 NET | eth0    ----  | pcki    2689  | pcko    1525  | sp    0 Mbps  | si  163 Kbps  | so  121 Kbps 

I'm not familiar with atop, but a speed of 0 Mbps seems weird and the busy percentage isn't showing...
maybe this is a hardware problem? I asked before if we hit a transfer limit, which is worth checking into but I'll need dcf for that.

comment:10 in reply to:  8 ; Changed 6 weeks ago by dcf

Replying to cohosh:

To add to the information here, starting yesterday I've been getting notifications from sysmon that the snowflake broker becomes unreachable periodically (I received 3 notifications yesterday that unreachability lasted approximately 10-25 minutes.

What reachability test does sysmon do? An HTTPS request?

I also have been seeing lots of 504 timeouts on the client side.

I wondered if there were some pathology in the in-memory data structures used to store active snowflakes and clients. But the set of snowflakes uses a priority queue and I don't see any obvious bugs. If the problem were an inefficient algorithm, I'd expect to see higher CPU usage anyway.

netstat shows about 800 established connections. Compared to about 400 proxies, with 2 sockets per proxy, that seems reasonable. 800 open sockets doesn't seem like it should cause problems in kernel I/O.

$ netstat -n | grep -c ESTABLISHED
828

Replying to cohosh:

I asked before if we hit a transfer limit, which is worth checking into but I'll need dcf for that.

There is no such limit in place as far as I know. I tried downloading a Debian ISO and it went at 7.5 MB/s, so it doesn't look like bandwidth is being throttled.

$ wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-10.0.0-amd64-netinst.iso
...
2019-09-03 16:48:54 (7.49 MB/s) - ‘debian-10.0.0-amd64-netinst.iso’ saved [350224384/350224384]

comment:11 in reply to:  10 Changed 6 weeks ago by cohosh

Replying to dcf:

Replying to cohosh:

To add to the information here, starting yesterday I've been getting notifications from sysmon that the snowflake broker becomes unreachable periodically (I received 3 notifications yesterday that unreachability lasted approximately 10-25 minutes.

What reachability test does sysmon do? An HTTPS request?

Just a simple TCP reachability test to port 443.

I wondered if there were some pathology in the in-memory data structures used to store active snowflakes and clients. But the set of snowflakes uses a priority queue and I don't see any obvious bugs. If the problem were an inefficient algorithm, I'd expect to see higher CPU usage anyway.

Another main symptom of sluggishness I was experiencing was that when I logged in, it took forever for me to type commands and for the commands to run. So for example simply running top was a multi-minute affair. Once I ran it, I was similarly seeing very low CPU and memory usage.

So it's not just that the snowflake-broker is running slowly, it seems to be all other processes on that machine (or at least from what I could tell through an SSH connection). I even stopped the snowflake broker for a few minutes and that didn't help at all. The machine was still just as sluggish.

When I logged in again this morning, the sluggishness seems to have disappeared for the moment, and I haven't gotten any sysmon notifications recently.

comment:12 in reply to:  10 Changed 6 weeks ago by cohosh

Replying to cohosh:

I asked before if we hit a transfer limit, which is worth checking into but I'll need dcf for that.

There is no such limit in place as far as I know. I tried downloading a Debian ISO and it went at 7.5 MB/s, so it doesn't look like bandwidth is being throttled.

$ wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-10.0.0-amd64-netinst.iso
...
2019-09-03 16:48:54 (7.49 MB/s) - ‘debian-10.0.0-amd64-netinst.iso’ saved [350224384/350224384]

Is this machine by any chance running in a VM on a shared host? It could be due to the hosting provider de-prioritizing our VM.

I suppose a good next step is to try to narrow down whether this is a network issue, but we can't do much until the symptoms start up again. My current plan is to wait for sysmon notifications and then run something using time to see if it's taking a lot longer than usual.

It might be the case that the issues weren't showing when you did the download. I don't have any alerts since the 1st of September so we should try that again when we're sure the broker is acting up.

Last edited 6 weeks ago by cohosh (previous) (diff)

comment:13 Changed 6 weeks ago by cypherpunks

FWIW it's happening again.

comment:14 Changed 6 weeks ago by dcf

The plan here, as discussed at today's meeting, is to move the broker to another data center, as we were planning to do in comment:8:ticket:29258 for IPv6 support anyway. I'll get a bigger, more capable VM than we have been using, too.

comment:15 Changed 6 weeks ago by dcf

Owner: set to dcf
Status: needs_informationassigned

comment:16 Changed 5 weeks ago by cohosh

Ah ha! I see these messages now in the snowflake broker logs:

2019/09/12 13:50:21 http: Accept error: accept tcp [scrubbed]: accept4: too many op
en files; retrying in 20ms
2019/09/12 13:50:21 http: Accept error: accept tcp [scrubbed]: accept4: too many op
en files; retrying in 40ms
2019/09/12 13:50:21 http: Accept error: accept tcp [scrubbed]: accept4: too many op
en files; retrying in 80ms

comment:17 in reply to:  16 Changed 5 weeks ago by dcf

Replying to cohosh:

Ah ha! I see these messages now in the snowflake broker logs:

Good find. I'm looking now, and

# lsof -n -p $(pidof broker) | grep -c IPv
1019
# ulimit -n
1024

In /etc/runit/snowflake-broker/run, we're running the broker via chpst. I've just added -o 32768 to the command to raise the limit on open files and restarted the process with sv restart snowflake-broker at 2019-09-12 14:19:00.

comment:18 Changed 4 weeks ago by gaba

Keywords: BugSmashFund added

comment:19 Changed 4 weeks ago by gaba

Keywords: BugSmashFund removed

comment:20 Changed 3 weeks ago by cohosh

Status: assignedneeds_review

Looks like this has been fixed in comment:17. We're still migrating the broker in #29258, but it's probably okay to close this ticket now.

Note: See TracTickets for help on using tickets.