Opened 5 years ago

Closed 3 years ago

#7733 closed defect (fixed)

Two channels required for bootstrap

Reported by: dcf Owned by: nickm
Priority: Medium Milestone: Tor: 0.2.6.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-client, flashproxy, 026-triaged-0, nickm-patch
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

I noticed a change in behavior in cb62a0b69a7d67b427224ca4c3075b49853a3a1f or thereabouts. tor opens a new SOCKS connection to a client transport plugin while bootstrapping at 50% (if descriptors are not cached) or at 85% (if descriptors are not cached). The upshot is that the flash proxy transport, for which new connections are not free, needs two connected browser proxies in order to bootstrap. Earlier revisions permit bootstrapping and circuit creation with just one connection to the proxy.

I start an external proxy thus:

$ ./flashproxy-client --external --unsafe-logging
2012-12-03 09:02:05 Listening remote on 0.0.0.0:9000.
2012-12-03 09:02:05 Listening remote on [::]:9000.
2012-12-03 09:02:05 Listening local on 127.0.0.1:9001.
2012-12-03 09:02:05 Listening local on [::1]:9001.

The "remote" listener is waiting for WebSocket connections from a
browser. The "local" listener is waiting for SOCKS connections from Tor.
Then I start Tor to use the proxy:

$ ./src/or/tor ClientTransportPlugin "websocket socks4 127.0.0.1:9001" UseBridges 1 Bridge "websocket 0.0.1.0:1" LearnCircuitBuildTimeout 0 CircuitBuildTimeout 60 Log "info stderr"
...
Dec 03 09:02:13.000 [notice] Bootstrapped 10%: Finishing handshake with directory server.

flashproxy-client notices Tor's pending SOCKS connection:

2012-12-03 09:02:13 Local connection from 127.0.0.1:55421.
2012-12-03 09:02:13 SOCKS request from 127.0.0.1:55421.
2012-12-03 09:02:13 Got SOCKS request for 0.0.1.0:1.
2012-12-03 09:02:13 locals  (1): [u'127.0.0.1:55421']
2012-12-03 09:02:13 remotes (0): []
2012-12-03 09:02:13 Data from unlinked local 127.0.0.1:55421 (217 bytes).
2012-12-03 09:02:13 locals  (1): [u'127.0.0.1:55421']
2012-12-03 09:02:13 remotes (0): []

Then I open a browser to make a single WebSocket connection which will appear as one of the pluggable transport's "remotes".

http://crypto.stanford.edu/flashproxy/embed.html?debug=1&client=127.0.0.1:9000&relay=173.255.221.44:9901

flashproxy-client sees the new remote and starts proxying data.

2012-12-03 09:02:17 Remote connection from 127.0.0.1:51321.
2012-12-03 09:02:17 Data from WebSocket-pending 127.0.0.1:51321.
2012-12-03 09:02:17 locals  (1): [u'127.0.0.1:55421']
2012-12-03 09:02:17 remotes (1): [u'127.0.0.1:51321']
2012-12-03 09:02:17 Linking 127.0.0.1:55421 and 127.0.0.1:51321.

Now bootstrapping continues (over the WebSocket channel) until reaching 85%, and then it says "connections all too old, or too non-canonical" and makes a new SOCKS request:

Dec 03 09:02:18.000 [notice] new bridge descriptor '3VXRyxz67OeRoqHn' (fresh): $7C03D29CA58BE8EED5F483F31A2DEDF6FD7CC444~3VXRyxz67OeRoqHn at 0.0.1.0
Dec 03 09:02:18.000 [notice] We now have enough directory information to build circuits.
Dec 03 09:02:18.000 [notice] Bootstrapped 80%: Connecting to the Tor network.
...
Dec 03 09:02:18.000 [info] circuit_handle_first_hop(): Next router is [scrubbed]: Connections all too old, or too non-canonical.  Launching a new one.
Dec 03 09:02:18.000 [notice] Bootstrapped 85%: Finishing handshake with first hop.
Dec 03 09:02:18.000 [info] connection_read_proxy_handshake(): Proxy Client: connection to 0.0.1.0:1 successful

flashproxy-client sees the SOCKS request, but because there are no more browser connections forthcoming, everything stalls at this point.

2012-12-03 09:02:18 Local connection from 127.0.0.1:55427.
2012-12-03 09:02:18 SOCKS request from 127.0.0.1:55427.
2012-12-03 09:02:18 Got SOCKS request for 0.0.1.0:1.
2012-12-03 09:02:18 locals  (2): [u'127.0.0.1:55421', u'127.0.0.1:55427']
2012-12-03 09:02:18 remotes (1): [u'127.0.0.1:51321']
2012-12-03 09:02:18 Data from unlinked local 127.0.0.1:55427 (231 bytes).

I've verified this failure to bootstrap with recent 190c1d4981e5751aabd3d894095851c830f1d570. After bisecting, I think the last commit with the old behavior (bootstrapped 100%) was 751b3aabb5ab88fca16834e559a8d9835831b05f. There were some compile errors during bisecting so I couldn't narrow it to a specific revision; git reports

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
35924435d22c2469ecbe06156c8069a928859d63
e136f7ccb4e671e33b6c92a48df819082291f5c1
4768c0efe3e9471cc367c3740d1a4ba0ab79626c
6cce6241dd4405f6cf21057f9913e07633cf18bb
519c971f6a3b89f1e81cda3c0290d4d943ec0d78
77dac97354974e8a819d8e35ad4e7a76199999b4
32337502f11e6c84e4db8591f5f81c4fc6d1da58
8b14db9628f0e8982e894034e86c8efdd78cff32
15303c32ec9d84aff8de5ed9df28e779c36c6e5c
28f108bcceab59fcf9f27e33065f64bfdb0f159a
7f952da55334d3a3693d1c6e8531fd96730265db
f0f87cb68a22feaf552a18b521d3313d843f8793
838743654c1bed2bfe22789ff53a1993c005f176
9ad7ba9f2267a9ee34fafda9356f1fa86633f00f
cb62a0b69a7d67b427224ca4c3075b49853a3a1f
We cannot bisect more!

Based on the log, cb62a0b69a7d67b427224ca4c3075b49853a3a1f seems a likely cause of the change: "Use channel_is_bad_for_new_circs(), connection_or_get_num_circs() in main.c".

I thought I would be able to reproduce this with another transport or with a simple SOCKS proxy, showing two connections where there used to be one, but I can't. I see two connections even with the old code. For example with an ssh SOCKS proxy (ssh -v -D 9001 -N localhost):

debug1: Connection to port 9001 forwarding to socks port 0 requested.
debug1: channel 2: new [dynamic-tcpip]
debug1: Connection to port 9001 forwarding to socks port 0 requested.
debug1: channel 3: new [dynamic-tcpip]

I guess that the difference is that previously, the second connection happens after bootstrapping is complete, while now it happens at 85%. (That is only a guess, I haven't verified it.)

Ticket description from https://lists.torproject.org/pipermail/tor-dev/2012-December/004221.html. nickm suggests log messages to test whether "is bad for new circs" is the reason the first channel isn't getting reused: https://lists.torproject.org/pipermail/tor-dev/2012-December/004246.html.

Child Tickets

Attachments (1)

log-conn.patch (570 bytes) - added by nickm 3 years ago.

Download all attachments as: .zip

Change History (19)

comment:1 Changed 5 years ago by nickm

Keywords: tor-client added
Milestone: Tor: 0.2.4.x-final

comment:2 Changed 5 years ago by nickm

Milestone: Tor: 0.2.4.x-finalTor: 0.2.5.x-final

dcf says there's a workaround for this; putting in 0.2.5.

comment:3 Changed 4 years ago by dcf

Keywords: flashproxy added

comment:4 Changed 4 years ago by dcf

arma suggested testing with

UseEntryGuardsAsDirGuards 0

but cautioned that it probably wouldn't work if the culprit was indeed the channels code. Indeed, the above configuration didn't help when I tried it today. If I hack the proxy code to make just one connection instead of two (undoing the workaround mentioned in comment:2), the tor client hangs after bootstrapping 50%. (I waited a couple of minutes to make sure it wasn't #9229.) Allowing the proxy to make two connections, bootstrapping finishes.

comment:5 Changed 3 years ago by nickm

Keywords: 025-triaged added

comment:6 Changed 3 years ago by nickm

I can almost reproduce this, modulo other annoying startup bugs (eg #9229).

(I don't have a working flashproxy setup handy, so I tried adding a log message to channel_tls_connect() to let me know when it was invoked. Then I started with a single bridge *without* a configured identity. When it started with full directory information, it called channel_tls_connect() twice: once with a null ID, and once with the real ID from the cached bridge descriptor. So, I can try to fix _that_, though I don't think that's necessarily the issue here.)

Can the behavior on this ticket be reproduced without a pluggable transport? In particular, I want an easy reproducible way to make Tor launch the two channels as described here.

I'm attaching the small patch that makes us log new TLS channels at NOTICE.

Changed 3 years ago by nickm

Attachment: log-conn.patch added

comment:7 Changed 3 years ago by nickm

I mentiond that in the case where I have everything cached and I have a bridge with no specified ID, we launch two channels right at startup, one with ID==0 and one with ID set. Here are the stack traces for those two launches.

========================================
This one happens with ID == 0.

2   tor                                 0x000000010610370d channel_tls_connect + 477
3   tor                                 0x000000010610be75 circuit_handle_first_hop + 373
4   tor                                 0x000000010610bb36 circuit_establish_circuit + 3254
5   tor                                 0x000000010611c7de circuit_launch_by_extend_info + 382
6   tor                                 0x000000010611dbe8 circuit_get_open_circ_or_launch + 1752
7   tor                                 0x000000010611d0e7 connection_ap_handshake_attach_circuit + 551
8   tor                                 0x000000010613eaff connection_ap_make_link + 463
9   tor                                 0x000000010615197f directory_initiate_command_rend + 751
10  tor                                 0x0000000106151317 directory_initiate_command + 87
11  tor                                 0x0000000106177436 launch_direct_bridge_descriptor_fetch + 246
12  tor                                 0x00000001061776ba fetch_bridge_descriptors + 618
13  tor                                 0x000000010617cf7b second_elapsed_callback + 923
14  libevent-2.0.5.dylib                0x00000001062da997 event_base_loop + 1745
15  tor                                 0x000000010617c935 do_main_loop + 709
16  tor                                 0x000000010617fb2e tor_main + 2558
17  libdyld.dylib                       0x00007fff9885e5fd start + 1
18  ???                                 0x0000000000000001 0x0 + 1


========================================
This one happens with ID set

2   tor                                 0x00000001020ad70d channel_tls_connect + 477
3   tor                                 0x00000001020b5e75 circuit_handle_first_hop + 373
4   tor                                 0x00000001020b5b36 circuit_establish_circuit + 3254
5   tor                                 0x00000001020c67de circuit_launch_by_extend_info + 382
6   tor                                 0x00000001020c7be8 circuit_get_open_circ_or_launch + 1752
7   tor                                 0x00000001020c70e7 connection_ap_handshake_attach_circuit + 551
8   tor                                 0x00000001020e8aff connection_ap_make_link + 463
9   tor                                 0x00000001020fb97f directory_initiate_command_rend + 751
10  tor                                 0x00000001020faf5f directory_get_from_dirserver + 671
11  tor                                 0x000000010212f7c7 update_consensus_networkstatus_downloads + 487
12  tor                                 0x000000010212fcf8 update_networkstatus_downloads + 104
13  tor                                 0x000000010212743b second_elapsed_callback + 2139
14  libevent-2.0.5.dylib                0x0000000102287997 event_base_loop + 1745
15  tor                                 0x0000000102126935 do_main_loop + 709
16  tor                                 0x0000000102129b2e tor_main + 2558
17  libdyld.dylib                       0x00007fff9885e5fd start + 1
18  ???                                 0x0000000000000001 0x0 + 1

(Also what's confusing me a bit too is that we're launching these downloads at all, since we should already have a cached networkstatus and a cached bridge descriptor.)

comment:8 Changed 3 years ago by nickm

Owner: set to nickm
Status: newassigned

marking as assigned since I've started coding on these.

comment:9 Changed 3 years ago by nickm

Current plan on these is to investigate the bug, try to finish writing the patch on each, and then evaluate whether the patch is 0.2.5 or 0.2.6 material in terms of simplicity and importance.

comment:10 Changed 3 years ago by nickm

Milestone: Tor: 0.2.5.x-finalTor: 0.2.6.x-final
Status: assignedneeds_review

Diagnosis:

The reason that we're calling launch_direct_bridge_descriptor_fetch() even when starting with a bridge descriptor is that we only reset the download schedule on learning a descriptor when we are learning it at startup. That's a distraction.

The reason we're doing it with the ID set to 0 -- and thereby causing another channel to be launched -- is that requests for the bridge's descriptor always use bridge->identity, and never a learned identity from the bridge descriptor.

So we could maybe fix this by altering learned_bridge_descriptor so that it sets bridge->identity from ri->identity if the bridge identity was not previously set. Something like branch 'bug7733a' in my public repository would solve it. (Not tested)

Alternatively, we could cache bridge-address => identity mappings in the state file.

I worry about two things, though:

  • having learned_bridge_descriptor race with fetch_bridge_descriptors. If the second happens first, we'll launch a channel with an unknown identity.
  • the change in handling for a bridge that changes its identity.

It's possible neither of these concerns is valid, but given the relatively small impact of this issue, and the possible hard-to-figure-out destabilization in bootstrapping, I think this should wait for 0.2.6.

(Also, this is to a certain extent not-a-bug: it is totally valid for Tor to open multiple connections through a single PT to a single bridge; this is expected behavior when rotating connections. PTs should IMO handle this.)

comment:11 Changed 3 years ago by nickm

Keywords: 026-triaged added; 025-triaged removed

comment:12 Changed 3 years ago by nickm

Keywords: 026-triaged-0 added; 026-triaged removed

comment:13 Changed 3 years ago by nickm

Keywords: nickm-patch added

Apply a nickm-patch keyword to tickets in needs_review in 0.2.6 where I wrote the patch, so I know which ones I can('t) review myself.

comment:14 Changed 3 years ago by andrea

Review: the patch looks like it does what it says it does, but I'd feel better about it if we had a way to repro the problem and test that this is actually the cause.

Re: worry no. 1 - "having learned_bridge_descriptor race with fetch_bridge_descriptors. If the second happens first, we'll launch a channel with an unknown identity."; the consequence of this is just launching an extra channel as in the original bug, so at least this patch couldn't make that any worse?

Re: worry no. 2 - "the change in handling for a bridge that changes its identity"; you mean, for example, if the bridge changes identity and the cached descriptor is obsolete?

comment:15 in reply to:  14 Changed 3 years ago by dcf

Replying to andrea:

Review: the patch looks like it does what it says it does, but I'd feel better about it if we had a way to repro the problem and test that this is actually the cause.

It's really easy to reproduce. Just use a SOCKS tunnel as a pseudo–pluggable transport.

ssh -v -N -D 9001 people.torproject.org

Without a fingerprint, you get two connections through the tunnel:

tor ClientTransportPlugin "dummy socks5 127.0.0.1:9001" UseBridges 1 Bridge "dummy 128.31.0.34:9101"
debug1: Connection to port 9001 forwarding to socks port 0 requested.
debug1: channel 2: new [dynamic-tcpip]
debug1: Connection to port 9001 forwarding to socks port 0 requested.
debug1: channel 3: new [dynamic-tcpip]

If you include the fingerprint, you only get one:

tor ClientTransportPlugin "dummy socks5 127.0.0.1:9001" UseBridges 1 Bridge "dummy 128.31.0.34:9101 9695DFC35FFEB861329B9F1AB04C46397020CE31"
debug1: Connection to port 9001 forwarding to socks port 0 requested.
debug1: channel 2: new [dynamic-tcpip]

comment:16 Changed 3 years ago by andrea

Okay, I've got a repro and confirmed that the patch fixes the bug. It seems potentially a bit racy; I couldn't repro on my first attempt with a fresh DataDirectory, but then did with a full cache.

comment:17 Changed 3 years ago by andrea

Verified that revision 747f368c6d72cc6a68bb0a12c4209c534517b656 immediately preceding nickm's patch sometimes opens two connections when no fingerprint is specified, but only one when it is, and revision 79c875ecdc546094dc761241a8f4cd7caccf9c0e opens only one in both cases.

comment:18 Changed 3 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Okay, merged it. Thanks!

Note: See TracTickets for help on using tickets.