outline of high-level bootstrap tracker abstractions

changed milestone to %Tor: unspecified

added 040-deferred-20190220 actualpoints::2 bootstrap-arch component::core tor/tor milestone::Tor: unspecified owner::catalyst parent::28018 points::0.5 priority::medium resolution::implemented s8-bootstrap severity::normal sponsor::19 status::closed type::task labels

Working from the list in #27103 (moved), comment 4, this is a hopefully useful breakdown of the high-level phases of bootstrapping:

1. making the initial OR_CONN to any relay or bridge (see #27103 (moved))
   - this should track the farthest progress that any individual attempt has made so far
   - "farthest progress" should probably be reset under some circumstances (see #27691 (moved))
2. directory info
   a. one-hop circuit, if needed?
   b. bridge descriptor, if bridges are used? (see #11966 (moved))
   c. consensus
   d. descriptors (usually microdescs for clients)
3. building a useful application circuit
   a. first OR_CONN to a guard if we're not using bridges
   b. intermediate progress such as noting when each hop gets built (see #27104 (moved))

In a pubsub framework, (1) will need to subscribe to events from connections, and keep track of the maximum progress any one connection has gotten.
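As a rough illustration of the idea in (1), here is a minimal Python sketch of a tracker that subscribes to connection events and remembers the farthest progress any one connection has reached. The `PubSub` class, the `orconn_state` topic name, and the numeric `progress` field are all hypothetical stand-ins, not names from the tor code base.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class PubSub:
    """Tiny in-process publish/subscribe bus (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(message)


class MaxConnProgressTracker:
    """Remembers the farthest progress any single connection has made."""

    def __init__(self, bus: PubSub) -> None:
        self.farthest = 0
        bus.subscribe("orconn_state", self._on_state)

    def _on_state(self, message: dict) -> None:
        # Only move forward: a slower or failed attempt must not lower the value.
        self.farthest = max(self.farthest, message["progress"])

    def reset(self) -> None:
        # "Farthest progress" may need resetting under some circumstances.
        self.farthest = 0


bus = PubSub()
tracker = MaxConnProgressTracker(bus)
bus.publish("orconn_state", {"conn_id": 1, "progress": 2})
bus.publish("orconn_state", {"conn_id": 2, "progress": 1})
```

After these two events `tracker.farthest` stays at 2, because the second connection has not overtaken the first.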

There should be an abstraction that tracks circuit-building progress. We can use it for (2)(a) and (3)(b).

We could make separate trackers for (2)(c) and (2)(d). As a bonus, those trackers could handle the scaling of incremental progress for downloads.

If we make a tracker that subscribes to both circuit and connection events, we could cleanly solve bugs such as #25061 (moved). It would also work for (3)(a), which needs to know both circuit type (application circuit) and connection state.
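To make the combined tracker a little more concrete, the following Python sketch joins the two event streams by connection id. The class, method names, and the string values for circuit type and connection state are assumptions made for illustration only.

```python
class GuardConnTracker:
    """Joins connection events with circuit events (hypothetical sketch)."""

    def __init__(self) -> None:
        self.conn_state = {}    # conn_id -> most recent connection state
        self.app_conns = set()  # conn_ids that carry an application circuit

    def on_conn_event(self, conn_id: int, state: str) -> None:
        self.conn_state[conn_id] = state

    def on_circ_event(self, conn_id: int, circ_type: str) -> None:
        if circ_type == "application":
            self.app_conns.add(conn_id)

    def guard_conn_states(self) -> dict:
        """States of only those connections that serve application circuits."""
        return {c: s for c, s in self.conn_state.items() if c in self.app_conns}


tracker = GuardConnTracker()
tracker.on_conn_event(1, "open")           # e.g. a connection to a directory cache
tracker.on_circ_event(1, "directory")
tracker.on_conn_event(2, "tls_handshake")  # e.g. a connection to a guard
tracker.on_circ_event(2, "application")
```

Here `tracker.guard_conn_states()` reports only connection 2, which is the kind of filtering (3)(a) needs.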

[edit: fix comment ref]

Tor 0.3.6.x has been renamed to 0.4.0.x.

Trac:
Milestone: Tor: 0.3.6.x-final to Tor: 0.4.0.x-final

After chatting some with ahf, I thought it might be a good idea to write down here a proposed new set of bootstrap phases. The numbering of the new phases is yet to be determined, but they're meant to be in order. (Some phases might get skipped, and that's OK.)

Some design considerations include the spacing between phases. Right now many of them seem separated by 5%, which seems to be a decent amount of progress as seen by the user's eye. Any increments smaller than this aren't necessarily meaningful to show to the user, but we could use the smaller increments to add phase names that could give a more accurate picture about where something is broken than the user currently gets.

There are two gaps in the existing phases, one of which corresponds to incremental progress downloading descriptors. (The other one doesn't seem to currently be used to display incremental progress downloading a consensus.)

undef:: shouldn't be visible to controllers or users
starting:: can stay the same

The following high-level grouping of phases should deal with the first outbound connection to a Tor relay. This might be to a directory cache, a proxy, or a guard/bridge. Here we use "first" to mean whichever one has made the most progress so far, in case we open multiple connections before any one is fully open.

connecting:: the initial outbound TCP connection toward the Tor network, for any purpose, which might include a firewall-bypassing proxy, or a pluggable transport; corresponds to OR_CONN_STATE_CONNECTING
proxy_handshake:: the initial handshake with a firewall-bypass proxy or PT; corresponds to OR_CONN_STATE_PROXY_HANDSHAKING; might be skipped if not using proxies or PTs

Maybe insert additional phases here for intermediate proxy handshaking steps?

tls_handshake:: the TLS handshake with the first relay; corresponds to OR_CONN_STATE_TLS_HANDSHAKING or related ORCONN states (some of these involve TLS protocol renegotiations to deal with older link protocol versions)
open:: the Tor link protocol is open to the first relay and can send and receive cells
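A small Python sketch of how this grouping could be derived from connection states: map each state to its phase tag, then report whichever tag is farthest along. The state names are plain strings here, and `OR_CONN_STATE_OPEN` is an assumed name for the state behind the open phase; the helper `farthest_phase` is hypothetical.

```python
# Phase tags for the first outbound connection, in the proposed order.
PHASE_ORDER = ["connecting", "proxy_handshake", "tls_handshake", "open"]

STATE_TO_PHASE = {
    "OR_CONN_STATE_CONNECTING": "connecting",
    "OR_CONN_STATE_PROXY_HANDSHAKING": "proxy_handshake",
    "OR_CONN_STATE_TLS_HANDSHAKING": "tls_handshake",
    "OR_CONN_STATE_OPEN": "open",  # assumed state name for the "open" phase
}


def farthest_phase(states):
    """Return the phase tag of whichever connection has progressed the most."""
    phases = [STATE_TO_PHASE[s] for s in states]
    return max(phases, key=PHASE_ORDER.index)


current = farthest_phase(
    ["OR_CONN_STATE_CONNECTING", "OR_CONN_STATE_TLS_HANDSHAKING"]
)
```

A connection that never uses a proxy simply never reports proxy_handshake, so that phase is skipped without special handling.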

The following high-level grouping of phases should deal with receiving and verifying directory information. Some of these might get skipped if we're starting from cached info.

dir_circ_create:: corresponds to the CREATE command opening the first circuit to a directory server; maybe reuse the existing onehop_create tag, because it already mostly means this? it might be better to have the more normalized naming though
dir_circ_created:: corresponds to the CREATED response that means the first directory circuit is created
dir_stream_begin:: corresponds to the BEGIN_DIR command
dir_stream_connected:: corresponds to the CONNECTED response to the BEGIN_DIR command; the existing requesting_status phase actually gets sent here instead of where the corresponding work actually begins
requesting_bridge_desc:: start downloading the bridge descriptor, if we're connected to a bridge; this is related to #11966 (moved)
requesting_status:: this can stay the same
loading_status:: this can stay the same

Right now there is a gap (from 20 to 40) between these two phases, but we don't currently fill it in with incremental progress in downloading the consensus. Maybe we should?

loading_keys:: this can stay the same
requesting_descriptors:: this can stay the same
loading_descriptors:: this can stay the same

Right now there is a gap between loading_descriptors and the next phase (from 50 to 80), which we fill in with incremental progress. Maybe we should retain this gap and the incremental progress display?
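For illustration, filling a gap with incremental progress amounts to a linear scaling of the download fraction into the percentage range. This Python sketch assumes a hypothetical helper `scaled_progress` and integer percentages; it is not the function tor uses.

```python
def scaled_progress(start: int, end: int, done: int, total: int) -> int:
    """Map done/total onto the percentage range [start, end]."""
    if total <= 0:
        return start  # nothing to download yet, so stay at the phase start
    fraction = min(max(done / total, 0.0), 1.0)  # clamp to [0, 1]
    return start + int(fraction * (end - start))


# Halfway through the descriptors, within the 50 to 80 gap.
halfway = scaled_progress(50, 80, 50, 100)
```

With the 50 to 80 gap, having half of the descriptors gives 65.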

The next high-level grouping of phases corresponds to connecting to a guard, if bridges aren't in use. Similarly to the connecting grouping, these represent the furthest progress that any one attempt has made so far.

guard_connecting:: same as connecting but for a guard
guard_proxy_handshake:: same as proxy_handshake but for a guard
guard_tls_handshake:: same as tls_handshake but for a guard
guard_open:: same as open but for a guard

circ_create:: same as dir_circ_create except for an application circuit
circ_created:: same as dir_circ_created except for an application circuit
circ_extend:: corresponds to EXTEND command for the second hop
circ_extended:: corresponds to EXTENDED response for the second hop
circ_exit_extend:: corresponds to EXTEND command for the exit
circ_exit_extended:: corresponds to EXTENDED response for the exit

done:: same as existing phase

Good stuff!

Replying to catalyst:

Right now there is a gap (from 20 to 40) between these two phases, but we don't currently fill it in with incremental progress in downloading the consensus. Maybe we should?

A thought for this part in particular: incremental progress at fetching descriptors can be confirmed as we go (we verify that we got the bytes we wanted). But if we try to show incremental progress at fetching a consensus, but then we get it and we don't like it, we'll find ourselves going backwards in bootstrap progress. Not the end of the world but maybe something to avoid getting ourselves into if we can.

But most importantly of all: this particular incremental-progress dilemma can be totally deferred until everything else is done and in place. :)

I thought about this a bit more, and I think we might want to disambiguate the connection progress messages a bit. We probably shouldn't always report the first TCP connection the same way, because it means something different to the user if the TCP connection to the first proxy fails, compared to if the TCP connection to the first relay fails. So we shouldn't use the raw connection progress indications from the ORCONN code without decoding them first a bit.

I think if we know we're connecting through a proxy, we should report the first TCP connection as something like proxy_connecting and proxy_connected. But then this gets confusingly named with the proxy handling code in connection.c that talks the proxy protocol and makes connection requests to the proxy. Maybe we should report the progress of asking the proxy to make the relay connection as connecting and connected?
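One way to picture this decoding step is a small lookup that depends on whether a proxy is in use. In the Python sketch below, the raw state names (`tcp_connecting`, `proxy_requesting`, and so on) and the function `decode_conn_phase` are invented for illustration; only the reported tags follow the naming floated above.

```python
def decode_conn_phase(raw_state: str, via_proxy: bool) -> str:
    """Translate a raw connection state into a user-facing phase tag."""
    if not via_proxy:
        # Direct connection: the first TCP connection is to the relay itself.
        table = {
            "tcp_connecting": "connecting",
            "tcp_connected": "connected",
        }
    else:
        # Through a proxy: the first TCP connection only reaches the proxy;
        # asking the proxy to reach the relay is what counts as "connecting".
        table = {
            "tcp_connecting": "proxy_connecting",
            "tcp_connected": "proxy_connected",
            "proxy_requesting": "connecting",
            "proxy_request_done": "connected",
        }
    return table[raw_state]


direct = decode_conn_phase("tcp_connecting", via_proxy=False)
proxied = decode_conn_phase("tcp_connecting", via_proxy=True)
```

The same raw state therefore produces different reports: `direct` is "connecting", while `proxied` is "proxy_connecting".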

We might need to further disambiguate between PT proxies and firewall bypass proxies.

I think we have a terminology quirk we need to be mindful of: Tor Browser refers to PT bridges as simply "bridges". It also uses "proxy" to refer to only firewall bypass proxies.

Replying to arma:

A thought for this part in particular: incremental progress at fetching descriptors can be confirmed as we go (we verify that we got the bytes we wanted). But if we try to show incremental progress at fetching a consensus, but then we get it and we don't like it, we'll find ourselves going backwards in bootstrap progress. Not the end of the world but maybe something to avoid getting ourselves into if we can.

We would like to avoid showing "backwards" progress in the Tor Launcher UI, but there are ways the situation you described could be addressed. For instance, if each bootstrap phase that reports incremental progress were clearly delineated with "start" and "end" messages, then one could create a "checkmark"-based UI that shows a second attempt rather than moving a simple progress bar backwards. For example:

  Connecting to the Tor network
  ...
  x Fetching relay information                [######           ]  FAILED; will retry
  ✓ Fetching relay information (attempt #2)   [#################]  
  ...

A general principle that I mentioned before is that the tor daemon itself should enable many different types of user interfaces. Making a rich set of info available that includes start and end milestones will make that possible.
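As a sketch of what a client could do with start and end milestones, the Python below turns a stream of such events into checklist lines like the ones in the mock-up. The event shape (a phase label plus "start", "end", or "fail") and the function `render_checklist` are assumptions for illustration, not an existing control-port format.

```python
DONE_MARK = "\u2713"  # check mark
FAIL_MARK = "x"


def render_checklist(events):
    """Build checklist lines from (label, kind) events.

    kind is "start", "end", or "fail"; a repeated start after a failure
    becomes a new line tagged with its attempt number.
    """
    lines = []      # each entry is [mark, label, suffix]
    attempts = {}   # label -> number of starts seen so far
    open_index = {}  # label -> index of the line still in progress
    for label, kind in events:
        if kind == "start":
            attempts[label] = attempts.get(label, 0) + 1
            n = attempts[label]
            shown = label if n == 1 else f"{label} (attempt #{n})"
            open_index[label] = len(lines)
            lines.append([" ", shown, ""])
        else:
            entry = lines[open_index.pop(label)]
            entry[0] = DONE_MARK if kind == "end" else FAIL_MARK
            entry[2] = "" if kind == "end" else "  FAILED; will retry"
    return [f"{mark} {shown}{suffix}" for mark, shown, suffix in lines]


rendered = render_checklist([
    ("Fetching relay information", "start"),
    ("Fetching relay information", "fail"),
    ("Fetching relay information", "start"),
    ("Fetching relay information", "end"),
])
```

The failed first attempt stays visible as its own line, so nothing has to move backwards when the retry begins.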

Replying to catalyst:

We might need to further disambiguate between PT proxies and firewall bypass proxies.

I think we have a terminology quirk we need to be mindful of: Tor Browser refers to PT bridges as simply "bridges". It also uses "proxy" to refer to only firewall bypass proxies.

Terminology used in the browser UI has evolved over time and will almost certainly do so again (especially if user testing shows that changing it will help users better understand what is going on). Also, Brave, Firefox, and other clients may choose to present completely different terminology to their end-users. To me this just means that tor log and control port messages should use whatever terminology will make the most sense to developers and experts.

+1 on making sure clients can disambiguate between messages related to PT proxies vs. firewall bypass proxies (and other similar components). It will be a big win if we can show users exactly where things went wrong. Today, in many cases, people need to guess at the next step to take if they are unable to connect to the network. Will a local proxy help? Will using a different PT bridge help? Maybe my device lacks a working Internet connection? Wait, what time is it?

Trac:
Keywords: N/A deleted, bootstrap-arch added

Trac:
Keywords: s8-boostrap deleted, s8-bootstrap added

Trac:
Sponsor: Sponsor8 to Sponsor19

Deferring 51 tickets from 0.4.0.x-final. Tagging them with 040-deferred-20190220 for visibility. These are the tickets that did not get 040-must, 040-can, or tor-ci.

Trac:
Keywords: N/A deleted, 040-deferred-20190220 added
Milestone: Tor: 0.4.0.x-final to Tor: unspecified

I'll probably close this soon, under the assumption that #28928 (moved) took care of recording this knowledge somewhere more stable. People should please let me know if they think there's still stuff to document.

I believe all the important information in this ticket is now captured in control-spec.txt.

Trac:
Status: assigned to closed
Resolution: N/A to implemented
Actualpoints: N/A to 2

closed

changed time estimate to 4h

added 16h of time spent

moved to tpo/core/tor#28281 (closed)
