Link handshake fails with "Received unexpected cell command 10" on a bridge

changed milestone to %Tor: 0.2.4.x-final

added 023-backport, component::core tor/tor, milestone::Tor: 0.2.4.x-final, priority::high, resolution::fixed, status::closed, tor-bridge, tor-client, type::defect, version::tor 0.2.4.15-rc labels

Trac:
client-failure.log

client log (failure case)

Trac:
client-success.log

client log (success case)

Interesting! This is exactly the same problem I saw in my #9166 (moved) experiment, and it happens if uTP communication is disabled. My client and private bridge don't have their clocks skewed significantly. In fact, the problem shows up even when the client is not connected, presumably during self-tests of the bridge. So I'd say it's unrelated to clock skew. I'm running a modified 0.2.4.4-alpha-dev. Will try a recent 0.2.4.x next.

Trac:
Keywords: N/A deleted, tor-bridge added
Component: - Select a component to Tor

Trac:
Keywords: tor-bridge deleted, tor-bridge tor-client added
Milestone: N/A to Tor: 0.2.4.x-final

How about the server logs that correspond to those client logs? Do you still have them?

I'm a little confused by the client's "sending CREATE_FAST" compared to the server's "unexpected cell command 10". 10 is CREATE2; CREATE_FAST is 5.
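For anyone else reading these logs, here is a small lookup sketch for turning the number in "unexpected cell command N" into a name. The numbers are the link-protocol command values as I understand them from tor-spec; the function name `cell_command_name` is made up for this comment and is not from the Tor source.

```c
/* Illustrative only: maps a link-protocol cell command number to its
 * tor-spec name. cell_command_name() is a hypothetical helper, not a
 * function from the Tor source tree. */
const char *cell_command_name(int command)
{
  switch (command) {
    case 0:   return "PADDING";
    case 1:   return "CREATE";
    case 2:   return "CREATED";
    case 3:   return "RELAY";
    case 4:   return "DESTROY";
    case 5:   return "CREATE_FAST";
    case 6:   return "CREATED_FAST";
    case 7:   return "VERSIONS";
    case 8:   return "NETINFO";
    case 9:   return "RELAY_EARLY";
    case 10:  return "CREATE2";
    case 11:  return "CREATED2";
    case 128: return "VPADDING";
    case 129: return "CERTS";
    case 130: return "AUTH_CHALLENGE";
    case 131: return "AUTHENTICATE";
    default:  return "UNKNOWN";
  }
}
```

Calling `cell_command_name(10)` gives the CREATE2 name and `cell_command_name(5)` gives CREATE_FAST, which is the mismatch described above.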

Confirming that this happens on a private bridge that performs reachability tests:

Aug 20 17:46:02.829 [info] update_consensus_router_descriptor_downloads(): 0 router descriptors downloadable. 0 delayed; 4383 present (0 of those were in old_routers); 0 would_reject; 0 wouldnt_use; 0 in progress.
Aug 20 17:46:02.831 [info] consider_testing_reachability(): Testing reachability of my ORPort: 174.129.86.221:9001.
Aug 20 17:46:02.831 [info] onion_pick_cpath_exit(): Using requested exit node '$224FC9615C2899B3C982A947FF7912201F986B1C~utpbridge at 174.129.86.221'
Aug 20 17:46:02.834 [info] circuit_send_next_onion_skin(): First hop: finished sending CREATE_FAST cell to '$64186650FFE4469EBBE52B644AE543864D32F43C=PsychoOnion3 at 89.187.142.208'
Aug 20 17:46:02.834 [info] routerlist_remove_old_routers(): We have 4383 live routers and 10402 old router descriptors.
Aug 20 17:46:02.937 [info] circuit_finish_handshake(): Finished building circuit hop:
Aug 20 17:46:02.937 [info] internal circ (length 3, last hop utpbridge): PsychoOnion3(open) fmj(closed) $224FC9615C2899B3C982A947FF7912201F986B1C(closed)
Aug 20 17:46:02.937 [info] circuit_send_next_onion_skin(): Sending extend relay cell.
Aug 20 17:46:03.192 [info] circuit_finish_handshake(): Finished building circuit hop:
Aug 20 17:46:03.192 [info] internal circ (length 3, last hop utpbridge): PsychoOnion3(open) fmj(open) $224FC9615C2899B3C982A947FF7912201F986B1C(closed)
Aug 20 17:46:03.193 [info] circuit_send_next_onion_skin(): Sending extend relay cell.
Aug 20 17:46:03.399 [info] channel_register(): Channel 0x7f466fadb510 (global ID 5) in state opening (1) registered with no identity digest
Aug 20 17:46:03.595 [info] channel_tls_process_versions_cell(): Negotiated version 3 with 91.213.195.244:44739; Sending cells: VERSIONS CERTS NETINFO
Aug 20 17:46:03.701 [info] channel_tls_process_certs_cell(): Got some good certificates from 91.213.195.244:44739: Waiting for AUTHENTICATE.
Aug 20 17:46:03.825 [info] channel_tls_handle_cell(): Received unexpected cell command 1 in chan state opening / conn state handshaking (Tor, v3 handshake); closing the connection.
Aug 20 17:46:04.010 [info] circuit_testing_failed(): Our testing circuit (to see if your ORPort is reachable) has failed. I'll try again later.

This is latest maint-0.2.4 on a machine with a non-skewed clock.

" Aug 20 17:46:03.825 [info] channel_tls_handle_cell(): Received unexpected cell command 1 in chan state opening / conn state handshaking (Tor, v3 handshake); closing the connection. "

Interesting that this is a CREATE cell too, not a CREATE_FAST cell.

I'd almost suspect that we're getting CREATE cells from another relay rather than from a client directly, since we're not seeing CREATE_FAST cells here.

Trac:
client.debug.log.gz

fresh client log, debug-level, SafeLogging 0

Trac:
server.debug.log.gz

fresh server log, debug-level, SafeLogging 0

I generated new logs from both client and server, at "debug" level, with "SafeLogging 0". The client is 172.19.151.215 and the server (bridge) is 128.2.142.99. It looks like the offending CREATE cell is indeed coming from another relay (185.21.101.40 = afo3.torproject.afo-tm.org). I'm not sure why this would happen for a relay with PublishServerDescriptor 0 set, though.

For the record, here is the complete server torrc:

BandwidthRate  100000000 B
BandwidthBurst 100000000 B
DataDirectory  tbbscraper_tor.data
ContactInfo    if this appears somewhere public, something is horribly wrong
Log            debug stdout
SafeLogging    0
HardwareAccel  1
Address        128.2.142.99
ExitPolicy     reject *:*
#MyFamily       $B0171148A7081858EE639B9451AF4D6CE0F68361,$6B6B1718DCF6BECB2A8D2FE09A80325E9060E6CD,$B4FDE846864BAD0AD1CA428FBEFABF2CD8CF13A6,$B6B7FBA5874DD4F4337CCFAA460C3B92C264C0ED,$E841B711D279EC27C1555281FA61568B6C45A919
Nickname       tbbscraperentry
PublishServerDescriptor 0
AssumeReachable 1
ShutdownWaitLength 10
SOCKSPort 0
#DirPort 8999
BridgeRelay 1
BridgeRecordUsageByCountry 0
DirReqStatistics 0
ExtraInfoStatistics 0
ExtendAllowPrivateAddresses 1
ORPort 128.2.142.99:9002 IPv4Only

and client torrc:

# If non-zero, try to write to disk less frequently than we would otherwise.
AvoidDiskWrites 1
# Store working data, state, keys, and caches here.
DataDirectory /home/user/tbb_VteJUc/Data/Tor
GeoIPFile /home/user/tbb_VteJUc/Data/Tor/geoip
# Where to send logging messages.  Format is minSeverity[-maxSeverity]
# (stderr|stdout|syslog|file FILENAME).
Log notice stdout
# Bind to this address to listen to connections from SOCKS-speaking
# applications.
SocksListenAddress 127.0.0.1
SocksPort 9150
ControlPort 9151

CookieAuthentication 1
#ExcludeNodes $B0171148A7081858EE639B9451AF4D6CE0F68361,$6B6B1718DCF6BECB2A8D2FE09A80325E9060E6CD,$B4FDE846864BAD0AD1CA428FBEFABF2CD8CF13A6,$B6B7FBA5874DD4F4337CCFAA460C3B92C264C0ED,$E841B711D279EC27C1555281FA61568B6C45A919
Bridge 128.2.142.99:9002
UseBridges 1
UseMicroDescriptors 0
SafeLogging 0
Log debug stderr

185.21.101.40 = https://atlas.torproject.org/#details/1AF0199AD12A16A23239453A1C7CBDE683F821B1 .

The reason that you're getting a create cell is almost certainly a self-testing circuit that your node is generating.

Hm. That relay says it's running 0.2.4.16-rc. So we apparently have a bug where it's sending CREATE cells before it should!

Is it possible that we're sending out CREATE cells too early upon opening a connection to a bridge?

Or perhaps we could be failing to send a NETINFO cell as a relay if we don't first send an AUTHENTICATE cell?

My working theory is that this happens when a relay is extending to a bridge. The bridge doesn't send an AUTH_CHALLENGE cell (since it doesn't want to be authenticated to). But receiving AUTH_CHALLENGE is what triggers a public relay to send its NETINFO cell, so the public relay never sends NETINFO.
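To make the theory concrete, here is a toy model of the pre-fix behavior. It is a sketch of the reasoning in this ticket, not Tor's code: every function name is hypothetical, and the rule that a non-relay initiator sends NETINFO without waiting for AUTH_CHALLENGE is my assumption about why ordinary clients are unaffected.

```c
#include <stdbool.h>

/* Toy model of the v3 link handshake as discussed in this ticket.
 * All names here are hypothetical; none come from the Tor source. */

/* Pre-fix responder rule: only a public relay asks for authentication,
 * so a bridge never sends AUTH_CHALLENGE. */
bool responder_sends_auth_challenge(bool responder_is_public_relay)
{
  return responder_is_public_relay;
}

/* Pre-fix initiator rule. ASSUMPTION: a client sends NETINFO on its own,
 * while a public relay sends it only while handling AUTH_CHALLENGE. */
bool initiator_sends_netinfo(bool initiator_is_relay, bool got_auth_challenge)
{
  if (!initiator_is_relay)
    return true;
  return got_auth_challenge;
}

/* The responder's channel stays in the "opening" state until NETINFO
 * arrives. If this returns false, the initiator's next CREATE cell reaches
 * a channel that is still handshaking and is rejected as unexpected. */
bool handshake_completes(bool initiator_is_relay,
                         bool responder_is_public_relay)
{
  bool challenged = responder_sends_auth_challenge(responder_is_public_relay);
  return initiator_sends_netinfo(initiator_is_relay, challenged);
}
```

In this model the only failing combination is a relay initiator talking to a bridge responder, `handshake_completes(true, false)`, which matches the reachability-test failures in the logs above.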

Trac:
Priority: normal to major

Please review branch "bug9546".

When was this behavior introduced?

Trac:
Status: new to needs_review

It appears 0.2.3.x relays have this problem too.

Trac:
Keywords: tor-bridge tor-client deleted, tor-bridge tor-client 023-backport added

I backported the patch to the 0.2.3 branch in "bug9546_023"

Looking through logs on my bridge. It looks like I am seeing some of these from connections via obfsproxy. That race is surprising, yes?

Aug 09 06:35:10.000 [info] channel_tls_handle_cell(): Received unexpected cell command 5 in chan state opening / conn state waiting for renegotiation or V3 handshake; closing the connection.
Aug 09 06:35:10.000 [info] channel_tls_process_versions_cell(): Negotiated version 3 with 127.0.0.1:48158; Sending cells: VERSIONS CERTS NETINFO
Aug 09 06:35:10.000 [info] channel_tls_process_versions_cell(): Negotiated version 3 with 127.0.0.1:47601; Sending cells: VERSIONS CERTS NETINFO
Aug 09 06:35:10.000 [info] channel_tls_process_netinfo_cell(): Got good NETINFO cell from 127.0.0.1:48158; OR connection is now open, using protocol version 3. Its ID digest is 0000000000000000000000000000000000000000. Our address is apparently 128.31.0.34.

Why do you think those are on the same connection?

Trac:
Cc: N/A to isis@torproject.org

Trac:
Summary: Link handshake fails with "Received unexpected cell command 10" when clocks are skewed to Link handshake fails with "Received unexpected cell command 10" on a bridge

skruffy points out that it's weird the bridge isn't sending an AUTH_CHALLENGE cell.

Our spec says:

   When the in-protocol handshake is used, the initiator sends a
   VERSIONS cell to indicate that it will not be renegotiating.  The
   responder sends a VERSIONS cell, a CERTS cell (4.2 below) to give the
   initiator the certificates it needs to learn the responder's
   identity, an AUTH_CHALLENGE cell (4.3) that the initiator must include
   as part of its answer if it chooses to authenticate, and a NETINFO
   cell (4.5).

Yet our code says

    /* If we're a relay that got a connection, ask for authentication. */
    const int send_chall = !started_here && public_server_mode(get_options());

The comment for command_process_auth_challenge_cell() says

/** Process an AUTH_CHALLENGE cell from an OR connection.
 *
 * If we weren't supposed to get one (for example, because we're not the
 * originator of the connection), or it's ill-formed, or we aren't doing a v3
 * handshake, mark the connection.  If the cell is well-formed but we don't
 * want to authenticate, just drop it.  If the cell is well-formed *and* we
 * want to authenticate, send an AUTHENTICATE cell and then a NETINFO cell. */

Why do our bridges decide they're too cool to follow the spec? :)

I ran your branch9546 in a Shadow network with 1 directory authority, 8 exits, and 11 private bridges. I didn't get a single log line saying "Received unexpected cell command [...]".

I also ran the previous commit of your branch9546 (edaea77) in the same network and got plenty of those warnings, all complaining about unexpected cell command 1, i.e. CREATE cells.

So, looks good, I'd say.
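In case someone wants to repeat this kind of check on their own logs, here is a minimal sketch of how the matching lines could be found and counted. It only assumes the message text quoted in this ticket; both function names are hypothetical, and a line longer than the buffer is read in pieces, which is fine for a rough count.

```c
#include <stdio.h>
#include <string.h>

/* Return the cell command number from a log line that contains
 * "Received unexpected cell command N", or -1 if the line has no such
 * message or no number follows it. Hypothetical helper. */
int unexpected_cell_command(const char *line)
{
  static const char marker[] = "Received unexpected cell command ";
  const char *p = strstr(line, marker);
  int command;

  if (p == NULL)
    return -1;
  if (sscanf(p + strlen(marker), "%d", &command) != 1)
    return -1;
  return command;
}

/* Count the lines in an already-open log stream that carry the message. */
int count_unexpected_cells(FILE *log)
{
  char line[4096];
  int count = 0;

  while (fgets(line, sizeof line, log) != NULL) {
    if (unexpected_cell_command(line) >= 0)
      count++;
  }
  return count;
}
```

A count of zero on the patched branch and a nonzero count on the earlier commit would correspond to the result described above.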

Replying to arma:

Why do our bridges decide they're too cool to follow the spec? :)

Looking at the commit logs doesn't shed much light on this.

As near as I can guess, the rationales might have been:

- They don't need to have authenticated incoming connections.
- It's a little weird to let a relay authenticate a connection to a bridge such that the bridge will use that connection for extending circuits to that relay.

But I'm not actually seeing a flaw there -- this happens already on relay<->relay connections to no ill effect. Further, any relay that wanted to create an authenticated connection to a bridge could do so by using the v1 or v2 handshake, by acting as a client and extending to itself, or something like that.

So I'm adding another commit to these branches to cause bridges to send AUTH_CHALLENGE cells. Please review?
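As a rough sketch of what that change amounts to, compare the condition quoted earlier in this ticket with one that no longer looks at public-server mode. The "after" form is my assumption about the shape of the change, not a copy of the commit, and it assumes that public_server_mode() is false for a bridge, which is what the discussion here implies. Both function names are hypothetical.

```c
#include <stdbool.h>

/* Condition quoted earlier in this ticket:
 *   send_chall = !started_here && public_server_mode(get_options());
 * Modeled with plain booleans so it can be checked in isolation. */
bool send_chall_before(bool started_here, bool public_server)
{
  return !started_here && public_server;
}

/* ASSUMPTION: after the change every responder sends AUTH_CHALLENGE,
 * i.e. the public-server test is dropped. The real commit may differ. */
bool send_chall_after(bool started_here, bool public_server)
{
  (void)public_server;  /* no longer consulted in this sketch */
  return !started_here;
}
```

For a bridge accepting a connection (`started_here` false, `public_server` false) the first form yields false and the second yields true; for a public relay the two agree.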

Also see the paragraph I added to the spec in branch "bug9546_spec" in my public gitspec repository.

I ran nickm's updated branch9546 (585e6b9) in the same Shadow network and got no "Received unexpected cell command" lines. In contrast to the previous commit (89feaa4), the fixed-up 585e6b9 doesn't break Shadow anymore. Looks good.

Updated, squashed versions are in "bug9546_v2" and "bug9546_023_v2"

s/send-test/self-test/ (or maybe s/send-test/reachability-test/)

In your nickm/bug9546_spec it should say 0.2.4.17-rc

Patches look fine otherwise. I admit I am nervous putting them into 0.2.3, since the v3 link handshake is complex.

But then, I'm nervous about not putting them into 0.2.3 also.

I guess the downside of doing it is that we could screw up our stable. And the downside of not doing it is that 0.2.3 bridges would need either an 0.2.2.x or an 0.2.4.17+ relay for their reachability test.

And there are other impacts to ponder, like whether this changes the "i can talk to it and distinguish whether it's a bridge" attacks. My first thought there is that such attacks are possible both before and after this patch, and having some patched bridges and some unpatched bridges won't change things much.

I guess that means "merge the 0.2.4 version, and keep this in mind for an 0.2.3 version if we ever do one"? Or "merge them both and expect we'll test it well enough before we make a new 0.2.3 version anyway"?

I think "merge into both, and expect that 0.2.4.17-rc will see enough testing before 0.2.3.next comes out." We should also come up with a plan to keep 0.2.4, once it's stable, from ending up in the situation that 0.2.3 is in now.

Merging into 0.2.3 and later.

Trac:
Resolution: N/A to fixed
Status: needs_review to closed

closed

mentioned in issue #9633 (moved)

moved to tpo/core/tor#9546 (closed)
