Opened 16 months ago

Last modified 2 weeks ago

#25429 assigned defect

Need something better than client's `checkForStaleness`

Reported by: arlolra Owned by: cohosh
Priority: Medium Milestone:
Component: Circumvention/Snowflake Version:
Severity: Normal Keywords: ex-sponsor-19, anti-censorship-roadmap
Cc: dcf, arlolra, cohosh Actual Points:
Parent ID: Points:
Reviewer: Sponsor: Sponsor28-can

Description

If the client has received no message on the datachannel for SnowflakeTimeout (30 seconds), checkForStaleness closes the connection. The code comment says this is to

Prevent long-lived broken remotes.

but there is no heartbeat at this level of abstraction, so the connection is constantly being reset whenever the user pauses their activity (for example, to read a webpage).

This greatly exacerbated #21312.
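For reference, the check in question has roughly this shape (a simplified sketch in Go; field and constant names follow the description above, not necessarily the exact client code):

{{{
package client

import (
	"log"
	"time"
)

const SnowflakeTimeout = 30 * time.Second

// WebRTCPeer is a stand-in for the client's per-proxy connection state;
// the real type in the snowflake client holds the datachannel and more.
type WebRTCPeer struct {
	lastReceive time.Time
	closed      bool
}

func (c *WebRTCPeer) Close() { c.closed = true }

// checkForStaleness closes the connection if nothing has been received on
// the datachannel for SnowflakeTimeout. Because there is no heartbeat, a
// user who simply stops generating traffic (e.g. to read a page) trips
// this timer even though the proxy is fine.
func (c *WebRTCPeer) checkForStaleness() {
	c.lastReceive = time.Now()
	for !c.closed {
		if time.Since(c.lastReceive) > SnowflakeTimeout {
			log.Println("WebRTC: closing stale connection; no messages received for",
				SnowflakeTimeout)
			c.Close()
			return
		}
		<-time.After(time.Second)
	}
}
}}}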

Child Tickets

Change History (12)

comment:1 Changed 16 months ago by dcf

I wonder if the repeated disconnections after 30 seconds are also the cause of "Your Guard is failing an extremely large amount of circuits" in #23780.

comment:2 Changed 16 months ago by arlolra

I wonder if the repeated disconnections after 30 seconds are also the cause of ...

I doubt it. Commenting out // go c.checkForStaleness() doesn't have any effect on that log line. However, changing the value of -max from 1 to 3 reduces it to a single instance. And I've only ever seen it at startup, which leads me to believe it has something to do with buffering when the initial connections are made.

comment:3 in reply to:  2 Changed 16 months ago by dcf

Replying to arlolra:

I wonder if the repeated disconnections after 30 seconds are also the cause of ...

I doubt it. Commenting out // go c.checkForStaleness() doesn't have any effect on that log line.

That test may not work without deleting the state file in between--I believe this message comes from parsing the pb_ parameters in a Guard line in the state file:

Guard in=bridges rsa_id=2B280B23E1107BB62ABFC40DDCC8824814F80A72 bridge_addr=0.0.3.0:1 sampled_on=2018-03-01T19:18:39 sampled_by=0.3.3.2-alpha listed=1 confirmed_on=2018-02-26T22:51:38 confirmed_idx=0 pb_use_attempts=70.011719 pb_use_successes=46.431641 pb_circ_attempts=207.411194 pb_circ_successes=199.674622 pb_successful_circuits_closed=76.945984 pb_collapsed_circuits=94.614807 pb_unusable_circuits=28.113830 pb_timeouts=1.649048

Since the state file persists between runs, you wouldn't see the message go away until you had had enough successful connections to push the average down below some threshold, or something like that. And it seems that tor will only emit the message once, keeping track of whether it has done so in a path_bias_use_extreme variable, so that could explain why it is only seen at startup:

https://gitweb.torproject.org/tor.git/tree/src/or/entrynodes.h?h=tor-0.3.2.10#n46
https://gitweb.torproject.org/tor.git/tree/src/or/circpathbias.c?h=tor-0.3.2.10#n1284

As for what to do with checkForStaleness: I don't understand its purpose either, but if we can't figure it out, we could just bump the timeout up to a high value, like 5 hours or so.

comment:4 Changed 16 months ago by arlolra

I don't understand its purpose either

It was added in https://gitweb.torproject.org/pluggable-transports/snowflake.git/commit/?id=ac9d49b8727b953c12a76e3645fe71a9ec3aab75, which doesn't provide much info.

It might be related to https://gitweb.torproject.org/pluggable-transports/snowflake.git/commit/?id=cf1b0a49f13f2550cad1b32ef4e4820b4c26bcf1, where the client could be sitting around for several minutes waiting for the datachannel to close because the proxy disappeared.

Or it might guard against a denial of service where the proxy keeps the connection open but just doesn't send any data down the channel.

comment:5 Changed 3 months ago by cohosh

Cc: cohosh added
Sponsor: Sponsor19

comment:6 Changed 3 months ago by cohosh

Owner: set to cohosh
Status: new → assigned

comment:7 Changed 7 weeks ago by cohosh

The 30 second timeout is what causes the snowflake client to request another proxy if STUN connections to the proxy are being blocked, as suspected in #30350. If we want to remove or lengthen this timeout, we need a way of verifying that the connection to the proxy was successful in the first place (we probably want this anyway). 30 seconds is already a long time to wait.
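A rough illustration of what such a check could look like, separating "the datachannel never opened" from "the datachannel went idle". This sketch uses the pion/webrtc API for convenience; the client's actual WebRTC bindings and plumbing differ, and waitForOpen is a hypothetical helper:

{{{
package client

import (
	"errors"
	"time"

	"github.com/pion/webrtc/v3"
)

// waitForOpen gives up on a proxy whose DataChannel never reaches the open
// state within `deadline`, independently of any idle/staleness timer that
// applies after the channel is up.
func waitForOpen(dc *webrtc.DataChannel, deadline time.Duration) error {
	opened := make(chan struct{})
	dc.OnOpen(func() { close(opened) })

	select {
	case <-opened:
		// The proxy answered and the channel opened; hand it off to the
		// (much longer) idle timer from here on.
		return nil
	case <-time.After(deadline):
		// No open event: the proxy looks unreachable (e.g. STUN blocked as
		// in #30350), so ask the broker for a different snowflake.
		return errors.New("datachannel never opened; requesting a new proxy")
	}
}
}}}

With something along these lines, the establishment deadline could stay short (or even shrink) while the idle timeout grows much longer.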

comment:8 Changed 7 weeks ago by cohosh

It looks like there are 3 possible cases of things that can happen:

  1. The client isn't requesting any traffic and isn't using the network.

Users may spend a long time reading a page, or leave Tor Browser running in the background while performing other tasks, so it may be a long time before they have more network requests to send. Snowflake's current behaviour is to time out the connection and request new snowflake(s) every 30 seconds. This uses up network traffic and snowflake resources.

Desired behaviour: stop collecting snowflakes until the client has traffic to send (as in #21314). This of course means that we have to wait for an entirely new Tor circuit to be constructed before the client can actually send their request. I'd suggest putting a fairly long timeout on this (tor-spec sets KeepaliveTimeout to 5 minutes by default).

  2. The proxy is subject to network interference and is completely unreachable.

As shown in #30350, the snowflake client assumes that the proxy is reachable as soon as it receives the offer from the broker and attempts to open the DataChannel. We were seeing STUN messages being sent from the client with no response (https://trac.torproject.org/projects/tor/ticket/30350#comment:5).

Desired behaviour: We want to detect this and retry with a different snowflake as quickly as possible. An alternative would be to attempt to connect to multiple snowflakes (set -max > 1) to increase the chance of getting a good snowflake right away. Either way, I think we should have a different way of detecting this than the 30 second timeout that catches it now.

  3. The proxy is malicious, unreliable, or reachable but subject to network interference.

A snowflake proxy could stop processing traffic partway through a browsing session while the user is actively making requests. This could be a DoS performed by the proxy or by the censor.

Desired behaviour: We want to be able to detect this as quickly as possible and start sending traffic through another proxy. Again, we want something different from the current 30s timeout here. This is something that #25723 could help out with as well. If we had a sequencing and reliability layer, we could notice that our requests have not been acknowledged within some short (< 30s) timeout and retransmit them through a different proxy (see the rough sketch at the end of this comment).

The current checkForStaleness works really well for case (1). I'd propose to keep it around for that purpose and lengthen the timeout to something closer to the 5 minute default of Tor relays.

To handle (2) and (3), I think we want something like a sequencing/reliability layer that can also be extended for multiplexing across multiple proxies (#25723). A good way forward is to get this layer working for a single proxy first, to detect unreliability and blocking, and then move on to multiplexing.
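To make that concrete, here is a purely illustrative sketch of what a minimal sequencing/reliability layer could track. The real design and wire format belong to #25723 / #29206; all names here (chunkHeader, ackTimeout, chunksToRetransmit) are hypothetical:

{{{
package client

import "time"

// chunkHeader is a hypothetical per-chunk header for a sequencing and
// reliability layer.
type chunkHeader struct {
	Seq uint32 // sequence number of this chunk
	Ack uint32 // highest contiguous sequence number received from the peer
	Len uint16 // payload length in bytes
}

// pending tracks a chunk that has been sent but not yet acknowledged.
type pending struct {
	hdr     chunkHeader
	payload []byte
	sentAt  time.Time
}

// ackTimeout is deliberately much shorter than the 30 s staleness check.
const ackTimeout = 10 * time.Second

// chunksToRetransmit returns the chunks whose acknowledgement is overdue.
// The client could resend these through a different snowflake proxy, which
// covers cases (2) and (3): an unreachable, unreliable, or malicious proxy
// is noticed as soon as ACKs stop arriving, not after a 30 second silence.
func chunksToRetransmit(outstanding []pending, now time.Time) []pending {
	var overdue []pending
	for _, p := range outstanding {
		if now.Sub(p.sentAt) > ackTimeout {
			overdue = append(overdue, p)
		}
	}
	return overdue
}
}}}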

comment:9 Changed 4 weeks ago by cohosh

I'm going to move the discussion of the sequencing and reliability layer over to #29206.

comment:10 Changed 3 weeks ago by gaba

Keywords: ex-sponsor-19 added

Adding the keyword to mark everything that didn't fit into the time for sponsor 19.

comment:11 Changed 3 weeks ago by phw

Sponsor: Sponsor19 → Sponsor28-can

Moving from Sponsor 19 to Sponsor 28.

comment:12 Changed 2 weeks ago by gaba

Keywords: anti-censorship-roadmap added