If there is sufficient clock skew between a Tor client and its bridge, the client starts carrying out the link handshake protocol incorrectly, leading to (info-level) errors like this on the server side:
Aug 19 19:21:13.000 [info] channel_register(): Channel 0x7fcbd2927d00 (global ID 14) in state opening (1) registered with no identity digestAug 19 19:21:13.000 [info] channel_tls_process_versions_cell(): Negotiated version 4 with [scrubbed]:49733; Sending cells: VERSIONS CERTS NETINFOAug 19 19:21:13.000 [info] channel_tls_process_certs_cell(): Got some good certificates from [scrubbed]:49733: Waiting for AUTHENTICATE.Aug 19 19:21:13.000 [info] channel_tls_handle_cell(): Received unexpected cell command 10 in chan state opening / conn state handshaking (Tor, v3 handshake); closing the connection.
From IRC (edited):
<nickm> Looks like the client is sending a CREATE2 cell before the handshake is done.<nickm> It appears that the client isn't sending a NETINFO cell.<nickm> If this is an unmodified Tor client and server, then that's a bug.<zwol> 0.2.4.15-rc (git-f41c20b344fb7359) on the server.<zwol> 0.2.4.15-rc (git-e7b435872cce464f) on the client.<zwol> f41c20b344fb7359 corresponds to the tor 0.2.4.15-rc-1 package from current Debian unstable, and e7b435872cce464f is what was shipped with TBB 3.0a3.
I will attach two client-side info-level logs: one demonstrating the problem, one successful connection. The only configuration difference between the two is that in the problem case, the client's clock was 26 hours fast:
Aug 21 18:57:01.000 [info] connection_dir_client_reached_eof(): Received directory with skewed time (server '[scrubbed]:9002'): It seems that our clock is ahead by 1 days, 2 hours, 32 minutes, or that theirs is behind. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
Interesting! This is exactly the same problem I saw in my #9166 (moved) experiment, and it happens if uTP communication is disabled. My client and private bridge don't have their clocks skewed significantly. In fact, the problem shows up even when the client is not connected, presumably during self-tests of the bridge. So I'd say it's unrelated to clock skew. I'm running a modified 0.2.4.4-alpha-dev. Will try a recent 0.2.4.x next.
Confirming that this happens on a private bridge that performs reachability tests:
Aug 20 17:46:02.829 [info] update_consensus_router_descriptor_downloads(): 0 router descriptors downloadable. 0 delayed; 4383 present (0 of those were in old_routers); 0 would_reject; 0 wouldnt_use; 0 in progress.Aug 20 17:46:02.831 [info] consider_testing_reachability(): Testing reachability of my ORPort: 174.129.86.221:9001.Aug 20 17:46:02.831 [info] onion_pick_cpath_exit(): Using requested exit node '$224FC9615C2899B3C982A947FF7912201F986B1C~utpbridge at 174.129.86.221'Aug 20 17:46:02.834 [info] circuit_send_next_onion_skin(): First hop: finished sending CREATE_FAST cell to '$64186650FFE4469EBBE52B644AE543864D32F43C=PsychoOnion3 at 89.187.142.208'Aug 20 17:46:02.834 [info] routerlist_remove_old_routers(): We have 4383 live routers and 10402 old router descriptors.Aug 20 17:46:02.937 [info] circuit_finish_handshake(): Finished building circuit hop:Aug 20 17:46:02.937 [info] internal circ (length 3, last hop utpbridge): PsychoOnion3(open) fmj(closed) $224FC9615C2899B3C982A947FF7912201F986B1C(closed)Aug 20 17:46:02.937 [info] circuit_send_next_onion_skin(): Sending extend relay cell.Aug 20 17:46:03.192 [info] circuit_finish_handshake(): Finished building circuit hop:Aug 20 17:46:03.192 [info] internal circ (length 3, last hop utpbridge): PsychoOnion3(open) fmj(open) $224FC9615C2899B3C982A947FF7912201F986B1C(closed)Aug 20 17:46:03.193 [info] circuit_send_next_onion_skin(): Sending extend relay cell.Aug 20 17:46:03.399 [info] channel_register(): Channel 0x7f466fadb510 (global ID 5) in state opening (1) registered with no identity digestAug 20 17:46:03.595 [info] channel_tls_process_versions_cell(): Negotiated version 3 with 91.213.195.244:44739; Sending cells: VERSIONS CERTS NETINFOAug 20 17:46:03.701 [info] channel_tls_process_certs_cell(): Got some good certificates from 91.213.195.244:44739: Waiting for AUTHENTICATE.Aug 20 17:46:03.825 [info] channel_tls_handle_cell(): Received unexpected cell command 1 in chan state opening / conn state handshaking (Tor, v3 handshake); closing the connection.Aug 20 17:46:04.010 [info] circuit_testing_failed(): Our testing circuit (to see if your ORPort is reachable) has failed. I'll try again later.
This is latest maint-0.2.4 on a machine with a non-skewed clock.
" Aug 20 17:46:03.825 [info] channel_tls_handle_cell(): Received unexpected cell command 1 in chan state opening / conn state handshaking (Tor, v3 handshake); closing the connection. "
Interesting that this is a CREATE cell too, not a CREATE_FAST cell.
I'd almost suspect that we're getting CREATE cells from another relay rather than from a client directly, since we're not seeing CREATE_FAST cells here.
I generated new logs from both client and server, at "debug" level, with "SafeLogging 0". The client is 172.19.151.215 and the server (bridge) is 128.2.142.99. It looks like the offending CREATE cell is indeed coming from another relay (185.21.101.40 = afo3.torproject.afo-tm.org). I'm not sure why this would happen for a relay with PublishServerDescriptor 0 set, though.
For the record, here is the complete server torrc:
# If non-zero, try to write to disk less frequently than we would otherwise.AvoidDiskWrites 1# Store working data, state, keys, and caches here.DataDirectory /home/user/tbb_VteJUc/Data/TorGeoIPFile /home/user/tbb_VteJUc/Data/Tor/geoip# Where to send logging messages. Format is minSeverity[-maxSeverity]# (stderr|stdout|syslog|file FILENAME).Log notice stdout# Bind to this address to listen to connections from SOCKS-speaking# applications.SocksListenAddress 127.0.0.1SocksPort 9150ControlPort 9151CookieAuthentication 1#ExcludeNodes $B0171148A7081858EE639B9451AF4D6CE0F68361,$6B6B1718DCF6BECB2A8D2FE09A80325E9060E6CD,$B4FDE846864BAD0AD1CA428FBEFABF2CD8CF13A6,$B6B7FBA5874DD4F4337CCFAA460C3B92C264C0ED,$E841B711D279EC27C1555281FA61568B6C45A919Bridge 128.2.142.99:9002UseBridges 1UseMicroDescriptors 0SafeLogging 0Log debug stderr
My working theory is that this happens when a relay is extending to a bridge. The bridge doesn't send an AUTH_CHALLENGE (since it doesn't want to be authenticated to). But that's the trigger for a public relay to send a NETINFO cell, so the public relay never sends the netinfo.
Looking through logs on my bridge. It looks like I am seeing some of these from connections via obfsproxy. That race is surprising, yes?
Aug 09 06:35:10.000 [info] channel_tls_handle_cell(): Received unexpected cell command 5 in chan state opening / conn state waiting for renegotiation or V3 handshake; closing the connection.Aug 09 06:35:10.000 [info] channel_tls_process_versions_cell(): Negotiated version 3 with 127.0.0.1:48158; Sending cells: VERSIONS CERTS NETINFOAug 09 06:35:10.000 [info] channel_tls_process_versions_cell(): Negotiated version 3 with 127.0.0.1:47601; Sending cells: VERSIONS CERTS NETINFOAug 09 06:35:10.000 [info] channel_tls_process_netinfo_cell(): Got good NETINFO cell from 127.0.0.1:48158; OR connection is now open, using protocol version 3. Its ID digest is 0000000000000000000000000000000000000000. Our address is apparently 128.31.0.34.
Trac: Summary: Link handshake fails with "Received unexpected cell command 10" when clocks are skewed to Link handshake fails with "Received unexpected cell command 10" on a bridge
skruffy points out that it's weird the bridge isn't sending an AUTH_CHALLENGE cell.
Our spec says:
When the in-protocol handshake is used, the initiator sends a VERSIONS cell to indicate that it will not be renegotiating. The responder sends a VERSIONS cell, a CERTS cell (4.2 below) to give the initiator the certificates it needs to learn the responder's identity, an AUTH_CHALLENGE cell (4.3) that the initiator must include as part of its answer if it chooses to authenticate, and a NETINFO cell (4.5).
Yet our code says
/* If we're a relay that got a connection, ask for authentication. */ const int send_chall = !started_here && public_server_mode(get_options());
The comment for command_process_auth_challenge_cell() says
/** Process an AUTH_CHALLENGE cell from an OR connection. * * If we weren't supposed to get one (for example, because we're not the * originator of the connection), or it's ill-formed, or we aren't doing a v3 * handshake, mark the connection. If the cell is well-formed but we don't * want to authenticate, just drop it. If the cell is well-formed *and* we * want to authenticate, send an AUTHENTICATE cell and then a NETINFO cell. */
Why do our bridges decide they're too cool to follow the spec? :)
I ran your branch9546 in a Shadow network with 1 directory authority, 8 exits, and 11 private bridges. I didn't get a single log line saying "Received unexpected cell command [...]".
I also ran the previous commit of your branch9546 (edaea77) in the same network and got plenty of those warnings, all complaining about unexpected cell command 1, so CREATE cells.
Why do our bridges decide they're too cool to follow the spec? :)
Looking at the commit logs doesn't shine much light on this.
As near as I can guess, the rationales might have been:
They don't need to have authenticated incoming connections.
It's a little weird to let a relay authenticate a connection to a bridge such that the bridge will use that connection for extending circuits to that relay.
But I'm not actually seeing a flaw there -- this happens already on relay<->relay connections to no ill effect. Further, any relay that wanted to create an authenticated connection to a bridge could do so by using the v1 or v2 handshake, by acting as a client and extending to itself, or something like that.
So I'm adding another commit to these branches to cause bridges to send AUTH_CHALLENGE cells. Please review?
I ran nickm's updated branch9546 (585e6b9) in the same Shadow network and got no "Received unexpected cell command" lines. In contrast to the previous commit (89feaa4), the fixed-up 585e6b9 doesn't break Shadow anymore. Looks good.
Patches look fine otherwise. I admit I am nervous putting them into 0.2.3, since the v3 link handshake is complex.
But then, I'm nervous about not putting them into 0.2.3 also.
I guess the downside of doing it is that we could screw up our stable. And the downside of not is that 0.2.3 bridges would need either an 0.2.2.x or an 0.2.4.17+ relay for their reachability test.
And there are other impacts to ponder, like whether this changes the "i can talk to it and distinguish whether it's a bridge" attacks. My first thought there is that such attacks are possible both before and after this patch, and having some patched bridges and some unpatched bridges won't change things much.
I guess that means "merge the 0.2.4 version, and keep this in mind for an 0.2.3 version if we ever do one"? Or "merge them both and expect we'll test it well enough before we make a new 0.2.3 version anyway"?
I think "merge into both, and expect that 0.2.4.17-rc will see enough testing before 0.2.3.next comes out." Plus also we should come up with a plan for trying to avoid 0.2.4 getting into the situation where 0.2.3 is now once it's stable.
Merging into 0.2.3 and later.
Trac: Resolution: N/Ato fixed Status: needs_review to closed