Opened 6 years ago

Closed 6 years ago

#9546 closed defect (fixed)

Link handshake fails with "Received unexpected cell command 10" on a bridge

Reported by: zwol Owned by:
Priority: High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version: Tor: 0.2.4.15-rc
Severity: Keywords: tor-bridge tor-client 023-backport
Cc: isis@… Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

If there is sufficient clock skew between a Tor client and its bridge, the client starts carrying out the link handshake protocol incorrectly, leading to (info-level) errors like this on the server side:

Aug 19 19:21:13.000 [info] channel_register(): Channel 0x7fcbd2927d00 (global ID 14) in state opening (1) registered with no identity digest
Aug 19 19:21:13.000 [info] channel_tls_process_versions_cell(): Negotiated version 4 with [scrubbed]:49733; Sending cells: VERSIONS CERTS NETINFO
Aug 19 19:21:13.000 [info] channel_tls_process_certs_cell(): Got some good certificates from [scrubbed]:49733: Waiting for AUTHENTICATE.
Aug 19 19:21:13.000 [info] channel_tls_handle_cell(): Received unexpected cell command 10 in chan state opening / conn state handshaking (Tor, v3 handshake); closing the connection.

From IRC (edited):

<nickm> Looks like the client is sending a CREATE2 cell before the handshake is done.
<nickm> It appears that the client isn't sending a NETINFO cell.
<nickm> If this is an unmodified Tor client and server, then that's a bug.
<zwol> 0.2.4.15-rc (git-f41c20b344fb7359) on the server.
<zwol> 0.2.4.15-rc (git-e7b435872cce464f) on the client.
<zwol> f41c20b344fb7359 corresponds to the tor 0.2.4.15-rc-1 package from current Debian unstable, and e7b435872cce464f is what was shipped with TBB 3.0a3.

I will attach two client-side info-level logs: one demonstrating the problem, one successful connection. The only configuration difference between the two is that in the problem case, the client's clock was 26 hours fast:

Aug 21 18:57:01.000 [info] connection_dir_client_reached_eof(): Received directory with skewed time (server '[scrubbed]:9002'): It seems that our clock is ahead by 1 days, 2 hours, 32 minutes, or that theirs is behind. Tor requires an accurate clock to work: please check your time, timezone, and date settings.

Child Tickets

Attachments (4)

client-failure.log (21.3 KB) - added by zwol 6 years ago.
client log (failure case)
client-success.log (133.9 KB) - added by zwol 6 years ago.
client log (success case)
client.debug.log.gz (26.5 KB) - added by zwol 6 years ago.
fresh client log, debug-level, SafeLogging 0
server.debug.log.gz (472.2 KB) - added by zwol 6 years ago.
fresh server log, debug-level, SafeLogging 0

Download all attachments as: .zip

Change History (33)

Changed 6 years ago by zwol

Attachment: client-failure.log added

client log (failure case)

Changed 6 years ago by zwol

Attachment: client-success.log added

client log (success case)

comment:1 Changed 6 years ago by karsten

Interesting! This is exactly the same problem I saw in my #9166 experiment, and it happens if uTP communication is disabled. My client and private bridge don't have their clocks skewed significantly. In fact, the problem shows up even when the client is not connected, presumably during self-tests of the bridge. So I'd say it's unrelated to clock skew. I'm running a modified 0.2.4.4-alpha-dev. Will try a recent 0.2.4.x next.

comment:2 Changed 6 years ago by karsten

Component: - Select a componentTor
Keywords: tor-bridge added

comment:3 Changed 6 years ago by nickm

Keywords: tor-client added
Milestone: Tor: 0.2.4.x-final

comment:4 Changed 6 years ago by nickm

How about the server logs that correspond to those client logs? Do you still have them?

I'm a little confused by the client's "sending CREATE_FAST" compared to the server's "unexpected cell command 10". 10 is CREATE2; CREATE_FAST is 5.

comment:5 Changed 6 years ago by karsten

Confirming that this happens on a private bridge that performs reachability tests:

Aug 20 17:46:02.829 [info] update_consensus_router_descriptor_downloads(): 0 router descriptors downloadable. 0 delayed; 4383 present (0 of those were in old_routers); 0 would_reject; 0 wouldnt_use; 0 in progress.
Aug 20 17:46:02.831 [info] consider_testing_reachability(): Testing reachability of my ORPort: 174.129.86.221:9001.
Aug 20 17:46:02.831 [info] onion_pick_cpath_exit(): Using requested exit node '$224FC9615C2899B3C982A947FF7912201F986B1C~utpbridge at 174.129.86.221'
Aug 20 17:46:02.834 [info] circuit_send_next_onion_skin(): First hop: finished sending CREATE_FAST cell to '$64186650FFE4469EBBE52B644AE543864D32F43C=PsychoOnion3 at 89.187.142.208'
Aug 20 17:46:02.834 [info] routerlist_remove_old_routers(): We have 4383 live routers and 10402 old router descriptors.
Aug 20 17:46:02.937 [info] circuit_finish_handshake(): Finished building circuit hop:
Aug 20 17:46:02.937 [info] internal circ (length 3, last hop utpbridge): PsychoOnion3(open) fmj(closed) $224FC9615C2899B3C982A947FF7912201F986B1C(closed)
Aug 20 17:46:02.937 [info] circuit_send_next_onion_skin(): Sending extend relay cell.
Aug 20 17:46:03.192 [info] circuit_finish_handshake(): Finished building circuit hop:
Aug 20 17:46:03.192 [info] internal circ (length 3, last hop utpbridge): PsychoOnion3(open) fmj(open) $224FC9615C2899B3C982A947FF7912201F986B1C(closed)
Aug 20 17:46:03.193 [info] circuit_send_next_onion_skin(): Sending extend relay cell.
Aug 20 17:46:03.399 [info] channel_register(): Channel 0x7f466fadb510 (global ID 5) in state opening (1) registered with no identity digest
Aug 20 17:46:03.595 [info] channel_tls_process_versions_cell(): Negotiated version 3 with 91.213.195.244:44739; Sending cells: VERSIONS CERTS NETINFO
Aug 20 17:46:03.701 [info] channel_tls_process_certs_cell(): Got some good certificates from 91.213.195.244:44739: Waiting for AUTHENTICATE.
Aug 20 17:46:03.825 [info] channel_tls_handle_cell(): Received unexpected cell command 1 in chan state opening / conn state handshaking (Tor, v3 handshake); closing the connection.
Aug 20 17:46:04.010 [info] circuit_testing_failed(): Our testing circuit (to see if your ORPort is reachable) has failed. I'll try again later.

This is latest maint-0.2.4 on a machine with a non-skewed clock.

comment:6 Changed 6 years ago by nickm

" Aug 20 17:46:03.825 [info] channel_tls_handle_cell(): Received unexpected cell command 1 in chan state opening / conn state handshaking (Tor, v3 handshake); closing the connection. "

Interesting that this is a CREATE cell too, not a CREATE_FAST cell.

I'd almost suspect that we're getting CREATE cells from another relay rather than from a client directly, since we're not seeing CREATE_FAST cells here.

Changed 6 years ago by zwol

Attachment: client.debug.log.gz added

fresh client log, debug-level, SafeLogging 0

Changed 6 years ago by zwol

Attachment: server.debug.log.gz added

fresh server log, debug-level, SafeLogging 0

comment:7 Changed 6 years ago by zwol

I generated new logs from both client and server, at "debug" level, with "SafeLogging 0". The client is 172.19.151.215 and the server (bridge) is 128.2.142.99. It looks like the offending CREATE cell is indeed coming from another relay (185.21.101.40 = afo3.torproject.afo-tm.org). I'm not sure why this would happen for a relay with PublishServerDescriptor 0 set, though.

For the record, here is the complete server torrc:

BandwidthRate  100000000 B
BandwidthBurst 100000000 B
DataDirectory  tbbscraper_tor.data
ContactInfo    if this appears somewhere public, something is horribly wrong
Log            debug stdout
SafeLogging    0
HardwareAccel  1
Address        128.2.142.99
ExitPolicy     reject *:*
#MyFamily       $B0171148A7081858EE639B9451AF4D6CE0F68361,$6B6B1718DCF6BECB2A8D2FE09A80325E9060E6CD,$B4FDE846864BAD0AD1CA428FBEFABF2CD8CF13A6,$B6B7FBA5874DD4F4337CCFAA460C3B92C264C0ED,$E841B711D279EC27C1555281FA61568B6C45A919
Nickname       tbbscraperentry
PublishServerDescriptor 0
AssumeReachable 1
ShutdownWaitLength 10
SOCKSPort 0
#DirPort 8999
BridgeRelay 1
BridgeRecordUsageByCountry 0
DirReqStatistics 0
ExtraInfoStatistics 0
ExtendAllowPrivateAddresses 1
ORPort 128.2.142.99:9002 IPv4Only

and client torrc:

# If non-zero, try to write to disk less frequently than we would otherwise.
AvoidDiskWrites 1
# Store working data, state, keys, and caches here.
DataDirectory /home/user/tbb_VteJUc/Data/Tor
GeoIPFile /home/user/tbb_VteJUc/Data/Tor/geoip
# Where to send logging messages.  Format is minSeverity[-maxSeverity]
# (stderr|stdout|syslog|file FILENAME).
Log notice stdout
# Bind to this address to listen to connections from SOCKS-speaking
# applications.
SocksListenAddress 127.0.0.1
SocksPort 9150
ControlPort 9151

CookieAuthentication 1
#ExcludeNodes $B0171148A7081858EE639B9451AF4D6CE0F68361,$6B6B1718DCF6BECB2A8D2FE09A80325E9060E6CD,$B4FDE846864BAD0AD1CA428FBEFABF2CD8CF13A6,$B6B7FBA5874DD4F4337CCFAA460C3B92C264C0ED,$E841B711D279EC27C1555281FA61568B6C45A919
Bridge 128.2.142.99:9002
UseBridges 1
UseMicroDescriptors 0
SafeLogging 0
Log debug stderr

comment:9 Changed 6 years ago by nickm

The reason that you're getting a create cell is almost certainly a self-testing circuit that your node is generating.

Hm. That relay says it's running 0.2.4.16-rc. So we have apparently a bug where it's sending CREATE cells before it should!

comment:10 Changed 6 years ago by nickm

Is it possible that we're sending out CREATE cells too early upon opening a connection to a bridge?

Or perhaps we could be failing to send a NETINFO cell as a relay if we don't first send an AUTHENTICATE cell?

comment:11 Changed 6 years ago by nickm

My working theory is that this happens when a relay is extending to a bridge. The bridge doesn't send an AUTH_CHALLENGE (since it doesn't want to be authenticated to). But that's the trigger for a public relay to send a NETINFO cell, so the public relay never sends the netinfo.

comment:12 Changed 6 years ago by nickm

Priority: normalmajor

comment:13 Changed 6 years ago by nickm

Status: newneeds_review

Please review branch "bug9546".

When was this behavior introduced?

comment:14 Changed 6 years ago by nickm

Keywords: 023-backport added

It appears 0.2.3.x relays have this problem too.

comment:15 Changed 6 years ago by nickm

I backported the patch to the 0.2.3 branch in "bug9546_023"

comment:16 Changed 6 years ago by arma

Looking through logs on my bridge. It looks like I am seeing some of these from connections via obfsproxy. That race is surprising, yes?

Aug 09 06:35:10.000 [info] channel_tls_handle_cell(): Received unexpected cell command 5 in chan state opening / conn state waiting for renegotiation or V3 handshake; closing the connection.
Aug 09 06:35:10.000 [info] channel_tls_process_versions_cell(): Negotiated version 3 with 127.0.0.1:48158; Sending cells: VERSIONS CERTS NETINFO
Aug 09 06:35:10.000 [info] channel_tls_process_versions_cell(): Negotiated version 3 with 127.0.0.1:47601; Sending cells: VERSIONS CERTS NETINFO
Aug 09 06:35:10.000 [info] channel_tls_process_netinfo_cell(): Got good NETINFO cell from 127.0.0.1:48158; OR connection is now open, using protocol version 3. Its ID digest is 0000000000000000000000000000000000000000. Our address is apparently 128.31.0.34.

comment:17 Changed 6 years ago by nickm

Why do you think those are on the same connection?

comment:18 Changed 6 years ago by isis

Cc: isis@… added

comment:19 Changed 6 years ago by arma

Summary: Link handshake fails with "Received unexpected cell command 10" when clocks are skewedLink handshake fails with "Received unexpected cell command 10" on a bridge

comment:20 Changed 6 years ago by arma

skruffy points out that it's weird the bridge isn't sending an AUTH_CHALLENGE cell.

Our spec says:

   When the in-protocol handshake is used, the initiator sends a
   VERSIONS cell to indicate that it will not be renegotiating.  The
   responder sends a VERSIONS cell, a CERTS cell (4.2 below) to give the
   initiator the certificates it needs to learn the responder's
   identity, an AUTH_CHALLENGE cell (4.3) that the initiator must include
   as part of its answer if it chooses to authenticate, and a NETINFO
   cell (4.5).

Yet our code says

    /* If we're a relay that got a connection, ask for authentication. */
    const int send_chall = !started_here && public_server_mode(get_options());

The comment for command_process_auth_challenge_cell() says

/** Process an AUTH_CHALLENGE cell from an OR connection.
 *
 * If we weren't supposed to get one (for example, because we're not the
 * originator of the connection), or it's ill-formed, or we aren't doing a v3
 * handshake, mark the connection.  If the cell is well-formed but we don't
 * want to authenticate, just drop it.  If the cell is well-formed *and* we
 * want to authenticate, send an AUTHENTICATE cell and then a NETINFO cell. */

Why do our bridges decide they're too cool to follow the spec? :)

comment:21 Changed 6 years ago by karsten

I ran your branch9546 in a Shadow network with 1 directory authority, 8 exits, and 11 private bridges. I didn't get a single log line saying "Received unexpected cell command [...]".

I also ran the previous commit of your branch9546 (edaea77) in the same network and got plenty of those warnings, all complaining about unexpected cell command 1, so CREATE cells.

So, looks good, I'd say.

comment:22 in reply to:  20 Changed 6 years ago by nickm

Replying to arma:

Why do our bridges decide they're too cool to follow the spec? :)

Looking at the commit logs doesn't shine much light on this.

As near as I can guess, the rationales might have been:

  • They don't need to have authenticated incoming connections.
  • It's a little weird to let a relay authenticate a connection to a bridge such that the bridge will use that connection for extending circuits to that relay.

But I'm not actually seeing a flaw there -- this happens already on relay<->relay connections to no ill effect. Further, any relay that wanted to create an authenticated connection to a bridge could do so by using the v1 or v2 handshake, by acting as a client and extending to itself, or something like that.

So I'm adding another commit to these branches to cause bridges to send AUTH_CHALLENGE cells. Please review?

comment:23 Changed 6 years ago by nickm

Also see the paragraph I added to the spec in branch "bug9546_spec" in my public gitspec repository.

comment:24 Changed 6 years ago by karsten

I ran nickm's updated branch9546 (585e6b9) in the same Shadow network and got no "Received unexpected cell command" lines. In contrast to the previous commit (89feaa4), the fixed-up 585e6b9 doesn't break Shadow anymore. Looks good.

comment:25 Changed 6 years ago by nickm

Updated, squashed versions are in "bug9546_v2" and "bug9546_023_v2"

comment:26 Changed 6 years ago by arma

s/send-test/self-test/
(or maybe s/send-test/reachability-test/)

comment:27 Changed 6 years ago by arma

In your nickm/bug9546_spec it should say 0.2.4.17-rc

comment:28 Changed 6 years ago by arma

Patches look fine otherwise. I admit I am nervous putting them into 0.2.3, since the v3 link handshake is complex.

But then, I'm nervous about not putting them into 0.2.3 also.

I guess the downside of doing it is that we could screw up our stable. And the downside of not is that 0.2.3 bridges would need either an 0.2.2.x or an 0.2.4.17+ relay for their reachability test.

And there are other impacts to ponder, like whether this changes the "i can talk to it and distinguish whether it's a bridge" attacks. My first thought there is that such attacks are possible both before and after this patch, and having some patched bridges and some unpatched bridges won't change things much.

I guess that means "merge the 0.2.4 version, and keep this in mind for an 0.2.3 version if we ever do one"? Or "merge them both and expect we'll test it well enough before we make a new 0.2.3 version anyway"?

comment:29 Changed 6 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

I think "merge into both, and expect that 0.2.4.17-rc will see enough testing before 0.2.3.next comes out." Plus also we should come up with a plan for trying to avoid 0.2.4 getting into the situation where 0.2.3 is now once it's stable.

Merging into 0.2.3 and later.

Note: See TracTickets for help on using tickets.