Opened 11 months ago

Last modified 3 weeks ago

#29206 needs_revision task

New design for client -- server protocol for Snowflake

Reported by: cohosh Owned by: cohosh
Priority: Medium Milestone:
Component: Circumvention/Snowflake Version:
Severity: Normal Keywords: anti-censorship-roadmap-september
Cc: cohosh, phw, dcf, arlolra Actual Points:
Parent ID: Points: 6
Reviewer: dcf Sponsor: Sponsor28-must

Description (last modified by cohosh)

Right now we just send traffic. We could add some control messages or another layer for reliability/sequencing.

Child Tickets

Attachments (1)

reconnecting-snowflakeconn.zip (4.0 KB) - added by dcf 8 weeks ago.

Download all attachments as: .zip

Change History (41)

comment:1 Changed 11 months ago by cohosh

Cc: cohosh added
Component: - Select a component → Obfuscation/Obfsproxy
Owner: set to asn

comment:2 Changed 11 months ago by cohosh

Component: Obfuscation/Obfsproxy → Obfuscation/Snowflake

comment:3 Changed 10 months ago by cohosh

Priority: Very High → Medium
Sponsor: Sponsor19
Summary: New design for client -- proxy protocol → New design for client -- proxy protocol for Snowflake

comment:4 Changed 10 months ago by teor

Owner: asn deleted
Status: new → assigned

asn does not need to own any obfuscation tickets any more. Default owners are trouble.

comment:5 Changed 10 months ago by gaba

Points: 6

comment:6 Changed 10 months ago by gaba

Keywords: network-team-roadmap-2019-Q1Q2 added

comment:7 Changed 9 months ago by gaba

cohosh, are you working on this ticket?

comment:8 Changed 9 months ago by cohosh

Owner: set to cohosh

At the moment, we're focusing on the proxy--broker protocol, but it is somewhat related. I'll assign it to myself since we'll cover it together with #29207.

comment:9 Changed 8 months ago by gaba

Keywords: network-team-roadmap-2019-Q1Q2 removed

comment:10 Changed 6 months ago by cohosh

Cc: phw dcf arlolra added

Here's a first stab at a very simple sequencing layer for snowflake: https://github.com/cohosh/snowflake/compare/sequencing

There is still a lot to do such as:

  • send acknowledgement packets when data has been received
  • implement a timeout feature (which I will probably do by making SnowflakeReadWriter an actual net.Conn)
  • resend data chunks that have not been acknowledged
  • implement a fixed sized window to avoid sending too much data

But I thought I'd put this out there sooner to get feedback, since this is a big change.

comment:11 Changed 6 months ago by gaba

Keywords: ex-sponsor-19 added

Adding the keyword to mark everything that didn't fit into the time for sponsor 19.

comment:12 Changed 6 months ago by phw

Sponsor: Sponsor19 → Sponsor28-must

Moving from Sponsor 19 to Sponsor 28.

comment:13 Changed 6 months ago by cohosh

Status: assigned → needs_review

I'm going to ask for an incremental review on this one, mostly just to get another pair of eyes on what I've done and my next steps before I sink more time into going possibly the wrong direction: https://github.com/cohosh/snowflake/compare/sequencing

What I've done:

  • Added a ReadWriteCloser that takes data to be sent and adds a header before passing it on to the Write function of the webrtc connection and removes the header in a Read on the other side. This is implemented as a common package because it's used by both the client and server
  • Wrote some tests for the package
  • Send an acknowledgement packet at every read: https://github.com/cohosh/snowflake/commit/a7191b6b5ea4c9e58709c03fefc2dcd07571dc0f

Next steps:

  • Implement a timeout (maybe making the ReadWriteCloser a net.Conn and implementing SetReadDeadline and a fixed size window to stop sending packets before others have been acknowledged)
  • Make it backwards compatible. I'm thinking of doing this by having an extra field in the header that should be set to all zeros. If it's not zeros, we just forward the data as is. If it is zeros, we extract the header.
  • Make the tests better. Right now I'm just reading and writing from a buffer; we should test with network connections

comment:14 Changed 6 months ago by gaba

Reviewer: dcf

comment:15 Changed 6 months ago by gaba

Keywords: anti-censorship-roadmap added

comment:16 in reply to:  13 Changed 6 months ago by cohosh

Replying to cohosh:

  • Make it backwards compatible. I'm thinking of doing this by having an extra field in the header that should be set to all zeros. If it's not zeros, we just forward the data as is. If it is zeros, we extract the header.

Actually, another way of doing backwards compatibility is to make a new handler for the new version, or add an HTTP header with some kind of versioning information.

Last edited 6 months ago by cohosh (previous) (diff)

comment:17 in reply to:  13 ; Changed 6 months ago by dcf

Replying to cohosh:

I'm going to ask for an incremental review on this one, mostly just to get another pair of eyes on what I've done and my next steps before I sink more time into going possibly the wrong direction: https://github.com/cohosh/snowflake/compare/sequencing

High-level summary: splits the data stream into frames. Each frame is preceded by a 12-byte header consisting of seq, ack, and length fields. The receiver buffers an entire frame in memory, then accepts it if its sequence number exactly matches the expected sequence number, or drops it otherwise.

Overall the design looks good.

type snowflakeHeader struct {
	seq    int
	ack    int
	length int //length of the accompanying data (including header length)
}

The fields would be better as uint32, because the size of int may vary (though I think you handled it correctly by always specifying 32 bits in serialization functions). length may be better as uint16; that's plenty for one call to Write and it avoids the risk that an implementation will try to buffer 4 GB in memory.

IMO length should be exclusive of the header size. Otherwise you have to define what it means if length < 12. (I believe the current code doesn't work in this case because header.length - snowflakeHeaderLen underflows.) The packets sent by sendAck should have length = 0.

	h.seq = int(binary.LittleEndian.Uint32(b[0:4]))
	h.ack = int(binary.LittleEndian.Uint32(b[4:8]))
	h.length = int(binary.LittleEndian.Uint32(b[8:12]))

Big-endian is more usual for network protocols.

I'm not sure about the implementation of Read:

  • Should return 0 here, not length. b hasn't really been filled in with any de-serialized data yet.
    	length, err := s.Conn.Read(b)
    	if err != nil {
    		return length, err
    	}
    
  • This should have b[:length] in place of b:
    	s.buffer = append(s.buffer, b...)
    
  • This order of operations (Read then copy) will lose any data remaining in s.out at EOF. It would be better to do the copy first, and only when s.out is empty do a Read and possibly return an error.
    	length, err := s.Conn.Read(b)
    	if err != nil {
    		return length, err
    	}
    	s.buffer = append(s.buffer, b...)
    
    	n := copy(b, s.out)
    	s.out = s.out[n:]
    	if n == len(b) {
    		return n, err
    	}
    

I would prefer to start with an implementation of Read that is simpler, if less efficient. I am picturing something like this (untested):

func NewSnowflakeReadWriter(sConn io.ReadWriteCloser) *SnowflakeReadWriter {
	pr, pw := io.Pipe()
	s := &SnowflakeReadWriter{
		Conn: sConn,
		pr:   pr,
	}
	go s.readLoop(pw)
	return s
}
func (s *SnowflakeReadWriter) readLoop(pw *io.PipeWriter) {
	var err error
	for err == nil {
		// strip headers and write data into the pipe
		var header snowflakeHeader
		err = readHeader(s.Conn, &header)
		if err != nil {
			break
		}
		if header.seq == s.ack {
			_, err = io.CopyN(pw, s.Conn, int64(header.length))
		} else {
			_, err = io.CopyN(ioutil.Discard, s.Conn, int64(header.length))
		}
	}
	pw.CloseWithError(err)
}
func (s *SnowflakeReadWriter) Read(b []byte) (int, error) {
	// read de-headered data from the pipe
	return s.pr.Read(b)
}
  • Implement a timeout (maybe making the ReadWriteCloser a net.Conn and implementing SetReadDeadline and a fixed size window to stop sending packets before others have been acknowledged)

I agree that ultimately SnowflakeReadWriter should be a net.Conn, though I don't see why the read deadline should affect the sending of packets.

  • Make the tests better. Right now I'm just reading and writing from a buffer, we should test with network connections

net.Pipe can be good for this.

comment:18 Changed 6 months ago by cohosh

Description: modified (diff)
Status: needs_review → needs_revision
Summary: New design for client -- proxy protocol for Snowflake → New design for client -- server protocol for Snowflake

comment:19 in reply to:  17 Changed 6 months ago by cohosh

Replying to dcf:

Replying to cohosh:

I'm going to ask for an incremental review on this one, mostly just to get another pair of eyes on what I've done and my next steps before I sink more time into going possibly the wrong direction: https://github.com/cohosh/snowflake/compare/sequencing

Thanks for the feedback! I've addressed most of the feedback so far here: https://github.com/cohosh/snowflake/tree/sequencing

The fields would be better as uint32, because the size of int may vary (though I think you handled it correctly by always specifying 32 bits in serialization functions). length may be better as uint16; that's plenty for one call to Write and it avoids the risk that an implementation will try to buffer 4 GB in memory.

Fixed in c102894

IMO length should be exclusive of the header size. Otherwise you have to define what it means if length < 12. (I believe the current code doesn't work in this case because header.length - snowflakeHeaderLen underflows.) The packets sent by sendAck should have length = 0.

Fixed in 02bfa1b

Big-endian is more usual for network protocols.

Fixed in 7566cb

I'm not sure about the implementation of Read:
I would prefer to start with an implementation of Read that is simpler, if less efficient. I am picturing something like this [...]

I like this implementation a lot more. Fixed in b6fb987 and 9c1b12f

I'm going to start on implementing fixed size send windows and cleaning up the unit tests a bit.

comment:20 Changed 5 months ago by dcf

I'm thinking one other field that will be required is some kind of session ID, to enable the server to associate connections that come in via different proxies. meek has something like this, to link up separate HTTP requests.

QUIC uses a Connection ID that is just a random string generated by the initiator.

comment:21 Changed 5 months ago by gaba

Keywords: anti-censorship-roadmap-september added; ex-sponsor-19 anti-censorship-roadmap removed

comment:22 Changed 4 months ago by cohosh

Updating this for those following along at home: branch.

Recent changes:

  • added a sessionID
  • rewrote the snowflake tests to use net.Pipe
  • added a timeout feature to close the underlying connection if sent data was not acknowledged

Next steps:

  • send acknowledgements
  • correlate the sessionID on both ends

comment:23 Changed 4 months ago by cohosh

Status: needs_revision → needs_review

This is finally ready for another review. Here's a rebased branch: https://github.com/cohosh/snowflake/compare/sequencing The main improvements from the last version are:

  • Each end of a SnowflakeConn will send empty acknowledgement packets (consisting of just the Snowflake header) upon the receipt of new data
  • Each write method spins up a goroutine that waits for a specified time before checking whether the acknowledgement packet for that data has been received. If not, it closes the underlying connection, causing future reads and writes to fail.
  • On each call to SnowflakeConn's write method, data is saved in an internal buffer until it has been acknowledged
  • SnowflakeConn now has a new method to set and reset the underlying connection. If there is buffered data, SnowflakeConn will resend that data under the same session ID whenever a new underlying connection has been specified

My reasoning for implementing it this way is that the client and server shouldn't have to worry about buffering data after a call to Write(). This makes the buffer in `client/webrtc.go` redundant, but I'll remove it later when finishing up tying together the client and server interactions.

The next step of course is to allow for resetting the underlying connection in SnowflakeConn and using the sessionID to correlate new connections with old ones. There's going to have to be some tricky refactoring here. Right now when the webrtc connection times out (due to staleness), both the webrtc connection and the socks connection are closed and the client waits for a new socks connection to open. The SnowflakeConn should be persistent across snowflakes (and the way it is currently set up perhaps also across SOCKS connections (??)), so the question is where SnowflakeConn should "live".

I'm thinking of adding a new method to SnowflakeCollector that will set (and reset) the underlying connection, and then modifying the Handler function to redefine the snowflake rather than closing the SOCKS connection and waiting for a new one. This doesn't fit perfectly with what I'd assume a SnowflakeCollector does by name, but then again maybe it does. This would make the SnowflakeCollector responsible for both getting new snowflakes and also deciding which snowflake(s) to send traffic through.

Any feedback on decisions up to this point and on future plans is welcome :)

comment:24 Changed 3 months ago by dcf

I would like there to be a paragraph-long or so comment at the top of snowflake-proto/proto.go that briefly explains how things work.

windowSize = 1500 seems small; it would only allow for ≈1 full-sized packet in flight at a time. (I realize windowSize is not used yet.)

Does SnowflakeConn.Conn need to be exposed publicly? Maybe NewSnowflakeConn and SnowflakeConn.NewSnowflake should be the only ways to modify it externally.

LocalAddr, RemoteAddr should return non-nil. They could pass through to s.Conn.LocalAddr and s.Conn.RemoteAddr, but maybe better is to have them return an Addr type that reflects the sessionID, because that is our persistent addressing abstraction over ephemeral network connections. Like:

type sessionAddr []byte;
func (addr sessionAddr) Network() string {
	return "session"
}
func (addr sessionAddr) String() string {
	return string(addr)
}

The header.ack > s.acked comparison looks like it will have problems with overflow. It probably ought to be something like int32(header.ack - s.acked) > 0 instead. Same on the inside, it is probably important to use int32 rather than int. There should be a test for it, too.

		if header.ack > s.acked {
			// remove newly acknowledged bytes from buffer
			s.buf.Next(int(header.ack - s.acked))
			s.acked = header.ack
		}

You don't need an extra check here because io.Writer says "Write must return a non-nil error if it returns n < len(p)."

		//TODO: check to make sure we wrote out all of the bytes
  • SnowflakeConn now has a new method to set and reset the underlying connection. If there is buffered data, SnowflakeConn will resend that data under the same session ID whenever a new underlying connection has been specified

The NewSnowflake function writes the locally buffered bytes directly to the underlying Conn, without prepending new headers as SnowflakeConn.Write does. I expected it to chop up the buffer into new packets and send them with headers (and up-to-date ack numbers).

io.ReadFull(r, b) would be more clear than io.ReadAtLeast(r, b, len(b)).

	b := make([]byte, snowflakeHeaderLen, snowflakeHeaderLen)
	_, err := io.ReadAtLeast(r, b, snowflakeHeaderLen)

I don't think this is worth checking for in SnowflakeConn.Read. I would just let it crash. Or else, panic instead of returning an error. In any case, a check on s.Conn would be better in readLoop than Read, because Read does not even access s.Conn. A better overall design may be to have NewSnowflakeConn take a Conn as a parameter, so that it's impossible to create one that's uninitialized.

	if s.Conn == nil {
		return 0, fmt.Errorf("No network connection to read from ")
	}

Replying to cohosh:

The next step of course is to allow for resetting the underlying connection in SnowflakeConn and using the sessionID to correlate new connections with old ones. There's going to have to be some tricky refactoring here. Right now when the webrtc connection times out (due to staleness), both the webrtc connection and the socks connection are closed and the client waits for a new socks connection to open. The SnowflakeConn should be persistent across snowflakes (and the way it is currently set up perhaps also across SOCKS connections (??)), so the question is where SnowflakeConn should "live".

I'm thinking of adding a new method to SnowflakeCollector that will set (and reset) the underlying connection, and then modifying the Handler function to redefine the snowflake rather than closing the SOCKS connection and waiting for a new one. This doesn't fit perfectly with what I'd assume a SnowflakeCollector does by name, but then again maybe it does. This would make the SnowflakeCollector responsible for both getting new snowflakes and also deciding which snowflake(s) to send traffic through.

I think there should be a 1:1 correspondence between SnowflakeConns and SOCKS connections. SnowflakeConn is analogous to the long-lived TCP connection to a user's guard. If the SnowflakeConn somehow fails unrecoverably, we should close the SOCKS connection to signal to tor that it's dead.

comment:25 Changed 3 months ago by cohosh

Status: needs_review → needs_revision

comment:26 Changed 3 months ago by cohosh

Thanks for the feedback! I added a bunch of new commits onto the sequencing branch

comment:27 Changed 2 months ago by cohosh

Status: needs_revision → needs_review

Okay I made a lot of changes, and it's done in the sense that the sequencing and reliability layer is fully integrated to the client and server and all the features I think we want are there. I've squashed these changes into two commits on a new branch: https://github.com/cohosh/snowflake/compare/sequencing_squashed

The first commit adds the new SnowflakeConn network connection and the second integrates it into both the client and the server.

Honestly I think it'll need another round of revisions for the following reasons:

  • The locks are a bit messy and I think we'll need more
  • We don't currently have unit tests that check the client and server integration and we should have them
  • In my integration tests I'm seeing the server occasionally fail, but I haven't figured out how yet

Right now there's a plain HTTP test deployment of this server running at ws://68.183.200.16 if you want to try it out.

Some additional notes:

  • I see some motivation for another feature that allows us to set some kind of FIN or RST flag to notify the client that the OR port has been closed and the server that the client has closed the connection. Right now each endpoint can't tell the difference between a problem with the snowflake proxy and with an endpoint.
  • Even though we aren't closing the OR connection on the server side when a snowflake dies, from my tests it looks like the second snowflake connection isn't happening fast enough and I'm still getting a connection reset by peer error when I try to write to it again.

Perhaps 10s is too long a timeout?

Is there a client keep-alive functionality that should happen that we can simulate at the bridge?

comment:28 Changed 8 weeks ago by dcf

I think this design is looking pretty good. You and I converged on similar decisions here and in https://github.com/net4people/bbs/issues/14#issuecomment-542898991, such as maintaining a dynamic map of session ID to net.Conn on the server, and treating a session ID as a net.Addr.

I see that a map of the type Flurry maintains the set of mappings from client session IDs to (Snowflake proxy, OR conn). The map definitely needs protection against concurrent modification—I wonder if that's the cause of the occasional server failures you saw. If you haven't done it yet, try building with go build -race and see whether it reports anything. Take a look at the connMap type in reconnecting-kcp/server/conn_map.go in https://github.com/net4people/bbs/files/3736441/turbo-tunnel-reconnection-demo.zip. It's meant to be a synchronized version of the same data structure.

Could Flurry.conn be a plain net.Conn, and Flurry.or also? Or do they need the specialization to *proto.SnowflakeConn and *net.TCPConn?

Could the directory common/snowflake-proto be called common/proto instead, so that the directory name matches the package name?

I see some motivation for another feature that allows us to set some kind of FIN or RST flag to notify the client that the OR port has been closed and the server that the client has closed the connection.

Yes, but in the Tor model, it essentially never happens that a relay/bridge closes an ORPort connection, right? It's usually the client that decides when to disconnect.

Perhaps 10s is too long a timeout?

I don't understand this. Do you mean too short a timeout? As a retransmission timer, 10 s doesn't seem short, but as a timer that terminates the whole end-to-end connection, it does seem short. Since in this design, there's no retransmission except when kicked off by a NewSnowflake transition, it might be worth increasing the timeout.

I'm not sure about this design of setting a timer on every Write, to wait for the ack to come back. Does the timers slice grow without bound? In the name of simplicity, could there be one timer, that gets reset whenever any amount of data is acked?

Speaking of timeouts, I think it's worth reducing the timeout in tests, to make them run faster.

I don't understand the design where methods such as NewSnowflake and readLoop take an optional *snowflakeHeader. Can you explain that?

What's the reason for the nil check in snowflakeHeader.Marshal? It's not a typical pattern for a method like this. Is there a place where the error is anticipated, or would it always indicate a bug of some sort?

It looks like this block is meant to return 0, err2, not len(b), nil? Also err2.Error() in the log statement, not err.Error().

	if err2 != nil {
		log.Printf("Error writing to connection: %s", err.Error())
		return len(b), nil
	}

SnowflakeConn needs a test for duplicate and overlapping sequence numbers. That's the primary expected use case: the client sends 100 bytes with seq 1234, the server receives, the proxy dies before sending the server's ack, the client reconnects to a different proxy and retransmits the same 100 bytes with seq 1234, the server receives duplicate sequence numbers and should ignore the second packet. Here's a stab at writing a test:

		Convey("Overlapping sequence numbers", func(ctx C) {
			sessionID := []byte("session_")
			go func() {
				hb, _ := (&snowflakeHeader{
					seq:       0,
					ack:       0,
					length:    5,
					sessionID: sessionID,
				}).Marshal()
				// Write first 5 bytes.
				client.Write(hb)
				client.Write([]byte("HELLO"))
				// Exact duplicate packet.
				client.Write(hb)
				client.Write([]byte("HELLO"))

				hb, _ = (&snowflakeHeader{
					seq:       3,
					ack:       0,
					length:    5,
					sessionID: sessionID,
				}).Marshal()
				// Overlapping sequence number -- should be
				// rejected (not overwrite what was already
				// received) and bytes past the expected seq
				// ignored.
				client.Write(hb)
				client.Write([]byte("XXXXX"))

				// Now the expected sequence number.
				hb, _ = (&snowflakeHeader{
					seq:       5,
					ack:       0,
					length:    5,
					sessionID: sessionID,
				}).Marshal()
				client.Write(hb)
				client.Write([]byte("WORLD"))

				client.Close()
			}()

			received, err := ioutil.ReadAll(s)
			ctx.So(err, ShouldEqual, nil)
			ctx.So(received, ShouldResemble, []byte("HELLOWORLD"))
		})

I think the "Test reading multiple chunks" test is more specific than necessary, in that it tests that Write boundaries are preserved on Read, when that behavior isn't required by net.Conn nor is it something we need. The first s.Read call would be within its rights to return 2, or 1, or even 0 bytes instead of 3, according to the contract of io.Reader. This would be better expressed, in my opinion, as some io.ReadFull calls on a 3-byte buffer, to ensure that all 3 bytes are read if available. Then the second s.Read should also return 3 bytes, not 2, notwithstanding that it spans what was originally a break between two Write calls.

Should it be an EOF error to receive future sequence numbers, as is tested here? If there's a hole in the sequence stream, it seems like the future packet should just be dropped; it shouldn't cause an error. On the other hand, maybe we don't care about this case, because for us SnowflakeConn is always layered on top of a reliable transport, so while we could receive a duplicate past sequence number, we could never receive a future sequence number. But if we're going to specify a behavior, I think it shouldn't be (0, io.EOF), rather (0, nil) or block until at least 1 byte is available.

The "webRTC header" here looks like a typo for "snowflakeHeader".

// Parses a webRTC header from bytes received on the
// webRTC connection

NewSnowflakeConn should call GenSessionID on its own. Otherwise there is an easy risk of misuse by forgetting to call GenSessionID and getting a nil session ID.

Last edited 8 weeks ago by dcf (previous) (diff)

Changed 8 weeks ago by dcf

comment:29 in reply to:  27 Changed 8 weeks ago by dcf

Replying to cohosh:

  • We don't currently have unit tests that check the client and server integration and we should have them

I ported my recent Turbo Tunnel prototype program to SnowflakeConn. You'll have to check it; I may have gotten the API wrong. But perhaps it can help test in a controlled environment.

attachment:reconnecting-snowflakeconn.zip

I managed to get a few panics, trying different things.


server$ ./server 127.0.0.1:4000
2019/10/17 21:14:11 error in handleConn: EOF
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4f8d24]

goroutine 5 [running]:
github.com/cohosh/snowflake/common/snowflake-proto.(*SnowflakeConn).Write(0xc0000ba000, 0xc0000307a8, 0xc, 0x20, 0xc, 0x20, 0x0)
        $GOPATH/pkg/mod/github.com/cohosh/snowflake@v0.0.0-20191002223810-1707fcc12518/common/snowflake-proto/proto.go:337 +0x424
main.handleConn.func1(0xc0000ba000)
        reconnecting-snowflakeconn/server/main.go:62 +0x87
created by main.handleConn
        reconnecting-snowflakeconn/server/main.go:58 +0x207
client$ ./client 127.0.0.1:4000
2019/10/17 21:14:08 begin SnowflakeConn c0bd93336fcab1ef
2019/10/17 21:14:08 begin TCP connection 127.0.0.1:52386 -> 127.0.0.1:4000
abcd
ABCD
^C
client$ ./client 127.0.0.1:4000
2019/10/17 21:14:15 begin SnowflakeConn e446789c28f69bbc
2019/10/17 21:14:15 begin TCP connection 127.0.0.1:52388 -> 127.0.0.1:4000
xyz
XYZ
2019/10/17 21:14:19 stdout <- conn finished: EOF

This looks like it's the err/err2 confusion from comment:28.


lilbastard$ lilbastard -w 10 127.0.0.1:3000 127.0.0.1:4000
$ ./server 127.0.0.1:4000
2019/10/17 21:18:41 error in handleConn: EOF
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4f8d24]

goroutine 5 [running]:
github.com/cohosh/snowflake/common/snowflake-proto.(*SnowflakeConn).Write(0xc0000be000, 0xc00002ffa8, 0xc, 0x20, 0xc, 0x20, 0x0)
        $GOPATH/pkg/mod/github.com/cohosh/snowflake@v0.0.0-20191002223810-1707fcc12518/common/snowflake-proto/proto.go:337 +0x424
main.handleConn.func1(0xc0000be000)
        reconnecting-snowflakeconn/server/main.go:62 +0x87
created by main.handleConn
        reconnecting-snowflakeconn/server/main.go:58 +0x207
$ ./client 127.0.0.1:3000
2019/10/17 21:18:31 begin SnowflakeConn f2a5d31f6afa8727
2019/10/17 21:18:31 begin TCP connection 127.0.0.1:39108 -> 127.0.0.1:3000
abcd
ABCD
efgh
EFGH
ijkl2019/10/17 21:18:41 stdout <- conn finished: EOF

2019/10/17 21:18:51 Closing WebRTC connection, timed out waiting for ACK

This looks similar, except the server and client report an EOF first.


server$ ./server 127.0.0.1:4000
client$ yes | ./client 127.0.0.1:4000
2019/10/17 21:16:01 begin SnowflakeConn eb27f73167a77490
2019/10/17 21:16:01 begin TCP connection 127.0.0.1:52392 -> 127.0.0.1:4000
2019/10/17 21:16:11 Closing WebRTC connection, timed out waiting for ACK
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4f52c4]

goroutine 6 [running]:
github.com/cohosh/snowflake/common/snowflake-proto.(*SnowflakeConn).Write(0xc0000a4000, 0xc0000b8000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
        $GOPATH/pkg/mod/github.com/cohosh/snowflake@v0.0.0-20191002223810-1707fcc12518/common/snowflake-proto/proto.go:337 +0x424
github.com/cohosh/snowflake/common/snowflake-proto.Proxy(0x55edc0, 0xc0000a4000, 0x55ef80, 0xc00000e010, 0x0, 0x0, 0x0)
        $GOPATH/pkg/mod/github.com/cohosh/snowflake@v0.0.0-20191002223810-1707fcc12518/common/snowflake-proto/proto.go:406 +0xf0
main.run.func1(0xc000012380, 0xc0000a4000)
        reconnecting-snowflakeconn/client/main.go:37 +0x7d
created by main.run
        reconnecting-snowflakeconn/client/main.go:35 +0x3e8

This panic is in the client rather than the server.

comment:30 Changed 7 weeks ago by cohosh

Status: needs_review → needs_revision

comment:31 in reply to:  28 ; Changed 7 weeks ago by cohosh

Status: needs_revision → needs_review

Replying to dcf:

Thanks! This feedback was very helpful. I made several changes to the implementation. Here are a series of commits on top of the old branch, and here is a newly squashed version. The main changes are:

  • I restructured the protocol a bit to only send the session ID at the start. This works as long as we're using the WebRTC datachannel with full reliability, and I think it's worth doing for this very simple precursor to Turbo Tunnel. The reason for the change was to simplify the calls to NewSnowflake so we don't have to pass in the header we read in order to determine which SnowflakeConn the snowflake belongs to. The drawback is that the server has to remember to call ReadSessionID before calling Read (which should be straightforward, because Read should only be called on a SnowflakeConn, and at the start the WebRTC connection doesn't belong to a SnowflakeConn yet).

Basically, throughout writing this, I've tried to keep the client and server as symmetric as possible so that we share most of the code. This is a bit different from the approach taken in Turbo Tunnel.

  • I fixed the race conditions and found a bug that was causing a seg fault. The server now seems to be running consistently well without crashing.

To address specific comments:

Could Flurry.conn be a plain net.Conn, and Flurry.or also? Or do they need the specialization to *proto.Snowflake.Conn and *net.TCPConn?

Not the way I've written it right now, because of the calls to Flurry.conn.NewSnowflake here. I don't think generalizing this at the expense of complexity at the client is desirable, since Flurries are specific to Snowflake connections.

I see some motivation for another feature that allows us to set some kind of FIN or RST flag to notify the client that the OR port has been closed and the server that the client has closed the connection.

Yes, but in the Tor model, it essentially never happens that a relay/bridge closes an ORPort connection, right? It's usually the client that decides when to disconnect.

The exception is if the client is giving the bridge bad data. Of course that shouldn't happen unless there is a bug in the client, so maybe that's ok to just let time out. I think you're right here that this feature is unnecessary.

Perhaps 10s is too long a timeout?

I don't understand this. Do you mean too short a timeout? As a retransmission timer, 10 s doesn't seem short, but as a timer that terminates the whole end-to-end connection, it does seem short. Since in this design, there's no retransmission except when kicked off by a NewSnowflake transition, it might be worth increasing the timeout.

You're right that 10s is short for a network connection timeout. This brings us to what remains a tricky engineering challenge here. The goal of the sequencing layer is to allow the client to recover from a faulty snowflake. However, it takes 30s for the connection to go stale in the client's checkForStaleness function, so it takes 30s for a client to request a new snowflake and start sending data through it. In all of my tests, the SOCKS connection timed out well before the client connected to a new snowflake. Since a new SOCKS connection means a new snowflake session and a new OR connection anyway, this means the client never actually recovers and the browser reports a connection error.

So my thought was that if we lowered it to 10s, we'd have a chance to recover before the SOCKS connection goes stale. However, in practice this is still a bit too long and I'm still seeing SOCKS timeouts.

So, I'm wondering if it's ok if we make this even shorter. All it means is that the client will abandon the proxy connection for a better one with less latency and that the threshold for that can be low (maybe 2-5 seconds).

Should it be an EOF error to receive future sequence numbers, as is tested here?

The EOF was actually a result of the timed out connection, not because a future packet causes Read to return an EOF. I rewrote that test in a better way that should avoid confusion here.

comment:32 Changed 7 weeks ago by cohosh

I restructured the protocol a bit to only send the session ID at the start. This works as long as we're using the WebRTC datachannel with full reliability, and I think it's worth doing for this very simple precursor to Turbo Tunnel. The reason for the change was to simplify the calls to NewSnowflake so we don't have to pass in the header we read in order to determine which SnowflakeConn the snowflake belongs to. The drawback is that the server has to remember to call ReadSessionID before calling Read (which should be straightforward, because Read should only be called on a SnowflakeConn, and at the start the WebRTC connection doesn't belong to a SnowflakeConn yet).

Basically, throughout writing this, I've tried to keep the client and server as symmetric as possible so that we share most of the code. This is a bit different from the approach taken in Turbo Tunnel.

Also noting that doing things this way means we can't use this version for multiplexing, which is maybe more along the lines of what we want in order to solve the issues with SOCKS timeouts. The way this is done in Turbo Tunnel, including the session ID as the first bytes sent in each write, together with the connection migration technique from applications like mosh, is very nice. Packets can be sent by the client through multiple multiplexed channels, and the receiving end sends its data through whichever channel it most recently received data from.

comment:33 Changed 7 weeks ago by cohosh

Alright, I had a chance to take a look at the obfs4 integration with Turbo Tunnel: https://github.com/net4people/bbs/issues/14#issuecomment-544747519

Another option is to just scrap the work done here so far and work turbo tunnel into snowflake.

One of the main differences between Snowflake and obfs4, as it relates to the work on Turbo Tunnel so far, is how Dial would be called. There is no Dial in snowflake, but rather a concurrent routine that collects snowflakes in the background until we are ready to use them. A call to snowflakes.Pop() retrieves one for use by the Handler function.

I'm curious about whether Turbo Tunnel is going to be implemented as a separate library.

Snowflake could accommodate this with a dial function similar to the one provided by the obfs4 ClientFactory: https://dip.torproject.org/dcf/obfs4/blob/reconnecting-quic/transports/base/base.go#L56
Here, Dial would likely be provided by SnowflakeCollector and just wrap the call to snowflakes.Pop. Then dialAndExchange could take a Dialer interface as the first argument.

comment:34 in reply to:  31 ; Changed 7 weeks ago by dcf

Replying to cohosh:

I made several changes to the implementation. Here are a series of commits on top of the old branch, and here is a newly squashed version.

Branch sequencing2 is missing server/flurry.go. sequencing2_squashed has it.

Here, it still looks like it should be err2 inside the error handler, not err.

	n, err2 := s.conn.Write(bytes)
	s.writeLock.Unlock()
	if err2 != nil {
		log.Printf("Error writing to connection: %s", err.Error())
		return len(b), err
	}

Perhaps 10s is too long a timeout?

I don't understand this. Do you mean too short a timeout? As a retransmission timer, 10 s doesn't seem short, but as a timer that terminates the whole end-to-end connection, it does seem short. Since in this design, there's no retransmission except when kicked off by a NewSnowflake transition, it might be worth increasing the timeout.

You're right that 10s is short for a network connection timeout. This brings us to what remains a tricky engineering challenge here. The goal of the sequencing layer is to allow the client to recover from a faulty snowflake. However, it takes 30s for the connection to go stale in the client's checkForStaleness function, so it takes 30s for a client to request a new snowflake and start sending data through it. In all of my tests, the SOCKS connection timed out well before the client connected to a new snowflake. Since a new SOCKS connection means a new snowflake session and a new OR connection anyway, this means the client never actually recovers and the browser reports a connection error.

So my thought was that if we lowered it to 10s, we'd have a chance to recover before the SOCKS connection goes stale. However, in practice this is still a bit too long and I'm still seeing SOCKS timeouts.

So, I'm wondering if it's ok if we make this even shorter. All it means is that the client will abandon the proxy connection for a better one with less latency and that the threshold for that can be low (maybe 2-5 seconds).

I guess I really don't understand how this works. How is the SOCKS connection timing out? As far as I know, tor does terminate connections to client PTs because of idleness. Is there a specific log message I can look for to see when this happens? snowflake-client should be hiding those temporary losses of connectivity.

Is it the timeout when tor sends a SOCKS request and is waiting for the "Grant" or "Reject" response? With the Snowflake protocol in place, tor should not be requesting new SOCKS sessions anyway -- because snowflake-client won't be closing its existing SOCKS connection every time it loses proxy connectivity. Anyway, you can always send a fake "Grant" response without waiting for a proxy to be available; that's what meek does and the PT protocol kind of forces that behavior on non-connection-oriented protocols like this.

In any case, 5 seconds strikes me as way way too short for a liveness timer. We'll be throwing away good proxies all the time. The fact that tor is requesting new SOCKS connections indicates that we're signaling some failure condition upward that we shouldn't be, I think; I would check that out first.

comment:35 in reply to:  33 Changed 7 weeks ago by dcf

Replying to cohosh:

Alright, I had a chance to take a look at the obfs4 integration with Turbo Tunnel: https://github.com/net4people/bbs/issues/14#issuecomment-544747519

Another option is to just scrap the work done here so far and work turbo tunnel into snowflake.

We talked about this at the last meeting and decided to continue with snowflakeConn, because work on adapting Snowflake to a turbo tunnel scheme is at least a few months off.

I'm curious about whether Turbo Tunnel is going to be implemented as a separate library.

No, I'm not planning anything like that. For one thing, turbo tunnel isn't a singular thing; it's more of a high-level design principle: putting a session/reliability layer in the middle of a circumvention protocol enables a lot of nice features. Even the connection migration stuff shown off in the obfs4proxy demo I wouldn't call core to the turbo tunnel idea; it's just one of many things made possible. My second consideration is that I feel we're still in the phase of requirements gathering, which is accomplished by doing a series of non-reusable, concrete implementations. The likelihood of getting the API wrong is too high otherwise. With my software engineer hat on, I see a reusable library at this point as having both high risk and high cost.

I am, however, planning to try implementing the turbo tunnel idea directly in Snowflake, as part of this whole process.

comment:36 Changed 6 weeks ago by teor

Some users even struggle with Tor's 10-30 second timeouts, because they are on really slow links.

Even mobile from Australia to Europe can take a few seconds.

comment:37 in reply to:  34 ; Changed 6 weeks ago by cohosh

Replying to dcf:

In any case, 5 seconds strikes me as way way too short for a liveness timer. We'll be throwing away good proxies all the time. The fact that tor is requesting new SOCKS connections indicates that we're signaling some failure condition upward that we shouldn't be, I think; I would check that out first.

Replying to teor:

Some users even struggle with Tor's 10-30 second timeouts, because they are on really slow links.
Even mobile from Australia to Europe can take a few seconds.

So Tor's timeout is 10-30 seconds? That seems like our problem here then. If the SOCKS connection to the PT closes after not being able to bootstrap in 10-30 seconds, then we only get one shot to bootstrap fully with the first snowflake we get. Otherwise the user gets a warning that there's something wrong with the network and has to restart the bootstrap process manually (which may again fail). Any thoughts on how to solve this other than multiplexing and trying to send the data through multiple snowflakes at once?

comment:38 Changed 6 weeks ago by cohosh

Is it the timeout when tor sends a SOCKS request and is waiting for the "Grant" or "Reject" response? With the Snowflake protocol in place, tor should not be requesting new SOCKS sessions anyway -- because snowflake-client won't be closing its existing SOCKS connection every time it loses proxy connectivity. Anyway, you can always send a fake "Grant" response without waiting for a proxy to be available; that's what meek does and the PT protocol kind of forces that behavior on non-connection-oriented protocols like this.

Snowflake does the same here. It's timing out because, after granting the request, the SOCKS connection hasn't received any data in 30s if the user happens to get a bad snowflake. The connection isn't being closed on the PT side; it's being closed on the tor side, it seems. So unless there's some way for the PT to send some kind of keepalive signal, there's no way for the SOCKS connection to tell the difference between a temporary loss of proxy connectivity and a connection that's gone completely dead.

Perhaps multiplexing is the way to go. What I was asking for with the short timeouts was maybe a bit misleading. I was thinking of cycling really quickly through bad proxies before landing on a good one, but that still takes time and doesn't seem like the right approach here. We could try sending data through a few proxies (maybe 3?) and then just use whichever snowflake gets the data to the server fastest?

comment:39 in reply to:  37 Changed 6 weeks ago by dcf

Replying to cohosh:

Replying to teor:

Some users even struggle with Tor's 10-30 second timeouts, because they are on really slow links.
Even mobile from Australia to Europe can take a few seconds.

So Tor's timeout is 10-30 seconds?

Hum, TIL tor has a SocksTimeout option. But it's supposed to be 120 s, not 30 s.

SocksTimeout NUM
Let a socks connection wait NUM seconds handshaking, and NUM seconds unattached waiting for an appropriate circuit, before we fail it. (Default: 2 minutes)

I'm not sure where the 30 s is coming from. Could it be LearnCircuitBuildTimeout or something like that? Anyway, you could test using an artificially high SocksTimeout, to see if the problem still happens.
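For example, the test could be run with a line like this in torrc (the value here is arbitrary; tor's documented default is 2 minutes):

```
SocksTimeout 300
```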

Replying to cohosh:

Anyway, you can always send a fake "Grant" response without waiting for a proxy to be available

Snowflake does the same here.

No, what it's doing now is not quite what I mean. I mean putting the socks.Grant before the snowflakes.Pop() (or even inside socksAcceptLoop). If I'm not mistaken, snowflakes.Pop is a blocking operation that doesn't return until an entire broker exchange has happened and a WebRTC connection is established. I mean optimistically calling socks.Grant on the incoming SOCKS connection immediately, even if we have 0 snowflakes available. Especially now that this sequencing branch exists, tor's SOCKS connection is not tied to any single WebRTC connection.

We could try sending data through a few proxies (maybe 3?) and then just use whichever snowflake gets the data to the server the fastest?

Yes, I think this can work. (Something like IPv4/IPv6 "happy eyeballs.") My plan for a turbo tunnel adaptation is to basically do this, try sending through all available snowflakes at all times (using e.g. round robin), and if one doesn't work or drops packets, it's no problem. Even without multiplexing, you could race just the WebRTC rendezvous across 3 snowflakes, then take the one that finishes first, and throw the others away.

comment:40 Changed 3 weeks ago by cohosh

Status: needs_review → needs_revision