Working from the list in [ticket:27103#comment:4 #27103 (moved)], this is a hopefully useful breakdown of the high-level phases of bootstrapping:
 1. making the initial OR_CONN to any relay or bridge (see #27103 (moved))
   * this should track the farthest progress that any individual attempt has made so far
   * "farthest progress" should probably be reset under some circumstances (see #27691 (moved))
 2. directory info
   a. one-hop circuit, if needed?
   b. bridge descriptor, if bridges are used? (see #11966 (moved))
   c. consensus
   d. descriptors (usually microdescs for clients)
 3. building a useful application circuit
   a. first OR_CONN to a guard if we're not using bridges
   b. intermediate progress such as noting when each hop gets built (see #27104 (moved))
In a pubsub framework, (1) will need to subscribe to events from connections, and keep track of the maximum progress any one connection has gotten.
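To make (1) a bit more concrete, here is a rough Python sketch of a tracker that subscribes to connection events and remembers the farthest state any one connection has reached. This is not Tor code: the `PubSub` hub, the `orconn_state` topic, and the state names and their ordering are all placeholders for illustration.

```python
from collections import defaultdict

# Hypothetical connection states, lowest progress first.
CONN_STATES = ["connecting", "proxy_handshake", "tls_handshake", "open"]
RANK = {name: i for i, name in enumerate(CONN_STATES)}


class PubSub:
    """Tiny in-process publish/subscribe hub (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, **event):
        for callback in self._subscribers[topic]:
            callback(**event)


class ConnProgressTracker:
    """Remembers the farthest state reached by any single connection."""

    def __init__(self, bus):
        self.farthest = None
        bus.subscribe("orconn_state", self.on_state)

    def on_state(self, conn_id, state):
        # Only move forward: a slower connection never lowers the maximum.
        if self.farthest is None or RANK[state] > RANK[self.farthest]:
            self.farthest = state

    def reset(self):
        # When a reset should happen is an open question (see #27691).
        self.farthest = None


bus = PubSub()
tracker = ConnProgressTracker(bus)
bus.publish("orconn_state", conn_id=1, state="connecting")
bus.publish("orconn_state", conn_id=2, state="connecting")
bus.publish("orconn_state", conn_id=2, state="tls_handshake")
bus.publish("orconn_state", conn_id=1, state="proxy_handshake")
print(tracker.farthest)  # tls_handshake
```

Connection 1 lagging behind at `proxy_handshake` doesn't pull the reported progress back, which is the "maximum over all attempts" behavior described above.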
There should be an abstraction that tracks circuit-building progress. We can use it for (2)(a) and (3)(b).
We could make separate trackers for (2)(c) and (2)(d). As a bonus, those trackers could handle the scaling of incremental progress for downloads.
If we make a tracker that subscribes to both circuit and connection events, we could cleanly solve bugs such as #25061 (moved). It would also work for (3)(a), which needs to know both circuit type (application circuit) and connection state.
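A minimal sketch of the combined tracker for (3)(a), assuming hypothetical event handlers `on_conn_state` and `on_circuit_launched` and a made-up `purpose` value of `"application"`. The point is only that guard progress gets reported once the tracker knows both the circuit type and the connection state; none of these names come from the Tor source.

```python
class GuardConnTracker:
    """Reports connection progress only for connections that carry
    an application circuit (hypothetical design, not Tor's)."""

    def __init__(self):
        self.conn_state = {}    # conn_id -> latest known state
        self.app_conns = set()  # conn_ids used by application circuits
        self.reported = []      # phase tags emitted so far, in order

    def on_conn_state(self, conn_id, state):
        self.conn_state[conn_id] = state
        self._maybe_report(conn_id)

    def on_circuit_launched(self, conn_id, purpose):
        if purpose == "application":
            self.app_conns.add(conn_id)
            self._maybe_report(conn_id)

    def _maybe_report(self, conn_id):
        # Needs both facts: the circuit type and the connection state.
        if conn_id in self.app_conns and conn_id in self.conn_state:
            self.reported.append("guard_" + self.conn_state[conn_id])


t = GuardConnTracker()
t.on_conn_state(conn_id=1, state="open")  # e.g. a directory connection
t.on_conn_state(conn_id=2, state="connecting")
t.on_circuit_launched(conn_id=2, purpose="application")
t.on_conn_state(conn_id=2, state="tls_handshake")
t.on_conn_state(conn_id=2, state="open")
print(t.reported)
# ['guard_connecting', 'guard_tls_handshake', 'guard_open']
```

Connection 1 reaches `open` but never shows up in the guard progress, because no application circuit was launched on it.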
After chatting some with ahf, I thought it might be a good idea to write down here a proposed new set of bootstrap phases. The numbering of the new phases is yet to be determined, but they're meant to be in order. (Some phases might get skipped, and that's OK.)
Some design considerations include the spacing between phases. Right now many of them seem separated by 5%, which seems to be a decent amount of progress as seen by the user's eye. Any increments smaller than this aren't necessarily meaningful to show to the user, but we could use the smaller increments to add phase names that could give a more accurate picture about where something is broken than the user currently gets.
There are two gaps in the existing phases, one of which corresponds to incremental progress downloading descriptors. (The other one doesn't seem to currently be used to display incremental progress downloading a consensus.)
undef::
shouldn't be visible to controllers or users
starting::
can stay the same
The following high-level grouping of phases should deal with the first outbound connection to a Tor relay. This might be to a directory cache, a proxy, or a guard/bridge. Here we use "first" to mean whichever one has made the most progress so far, in case we open multiple connections before any one is fully open.
connecting::
the initial outbound TCP connection toward the Tor network, for any purpose, which might include a firewall-bypassing proxy, or a pluggable transport; corresponds to OR_CONN_STATE_CONNECTING
proxy_handshake::
the initial handshake with a firewall-bypass proxy or PT; corresponds to OR_CONN_STATE_PROXY_HANDSHAKING; might be skipped if not using proxies or PTs
Maybe insert additional phases here for intermediate proxy handshaking steps?
tls_handshake::
the TLS handshake with the first relay; corresponds to OR_CONN_STATE_TLS_HANDSHAKING or related ORCONN states (some of these involve TLS protocol renegotiations to deal with older link protocol versions)
open::
the Tor link protocol is open to the first relay and can send and receive cells
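The mapping for this grouping could be written down as a small table. The sketch below is Python for illustration only. It assumes `OR_CONN_STATE_OPEN` is the state behind the `open` phase (the text above doesn't name it), and `phases_reported` is a hypothetical helper. The related TLS renegotiation states could be added to the table as extra keys mapping to `tls_handshake`.

```python
# ORCONN state -> proposed bootstrap phase tag.
# OR_CONN_STATE_OPEN is an assumption; the phase list doesn't name it.
STATE_TO_PHASE = {
    "OR_CONN_STATE_CONNECTING": "connecting",
    "OR_CONN_STATE_PROXY_HANDSHAKING": "proxy_handshake",
    "OR_CONN_STATE_TLS_HANDSHAKING": "tls_handshake",
    "OR_CONN_STATE_OPEN": "open",
}


def phases_reported(states):
    """Return the phases to report for one connection's state sequence,
    skipping repeats and states that have no phase of their own."""
    seen = []
    for state in states:
        phase = STATE_TO_PHASE.get(state)
        if phase is not None and phase not in seen:
            seen.append(phase)
    return seen


# Without a proxy or PT, proxy_handshake is simply never reported.
direct = [
    "OR_CONN_STATE_CONNECTING",
    "OR_CONN_STATE_TLS_HANDSHAKING",
    "OR_CONN_STATE_OPEN",
]
print(phases_reported(direct))  # ['connecting', 'tls_handshake', 'open']
```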
The following high-level grouping of phases should deal with receiving and verifying directory information. Some of these might get skipped if we're starting from cached info.
dir_circ_create::
corresponds to the CREATE command opening the first circuit to a directory server; maybe reuse the existing onehop_create tag, because it already mostly means this? it might be better to have the more normalized naming though
dir_circ_created::
corresponds to the CREATED response that means the first directory circuit is created
dir_stream_begin::
corresponds to the BEGIN_DIR command
dir_stream_connected::
corresponds to the CONNECTED response to the BEGIN_DIR command; the existing requesting_status phase actually gets sent here instead of where the corresponding work actually begins
requesting_bridge_desc::
start downloading the bridge descriptor, if we're connected to a bridge; this is related to #11966 (moved)
requesting_status::
this can stay the same
loading_status::
this can stay the same
Right now there is a gap (from 20 to 40) between these two phases, but we don't currently fill it in with incremental progress in downloading the consensus. Maybe we should?
loading_keys::
this can stay the same
requesting_descriptors::
this can stay the same
loading_descriptors::
this can stay the same
Right now there is a gap between loading_descriptors and the next phase (from 50 to 80), which we fill in with incremental progress. Maybe we should retain this gap and the incremental progress display?
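For reference, filling a gap with incremental progress is just linear scaling of "fraction done" onto the percentage range. A small Python sketch, where `scaled_progress` and its parameters are made up for illustration and are not taken from the Tor code:

```python
def scaled_progress(start, end, done, total):
    """Map done/total onto the integer percentage range [start, end]."""
    if total <= 0:
        return start
    # Clamp so a miscounted download can't leave the range.
    fraction = min(max(done / total, 0.0), 1.0)
    return start + int((end - start) * fraction)


# Descriptor gap, 50 to 80:
print(scaled_progress(50, 80, 0, 6000))     # 50
print(scaled_progress(50, 80, 3000, 6000))  # 65
print(scaled_progress(50, 80, 6000, 6000))  # 80

# The same helper would cover the 20 to 40 consensus gap:
print(scaled_progress(20, 40, 1, 4))        # 25
```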
The next high-level grouping of phases corresponds to connecting to a guard, if bridges aren't in use. Similarly to the connecting grouping, these represent the furthest progress that any one attempt has made so far.
guard_connecting::
same as connecting but for a guard
guard_proxy_handshake::
same as proxy_handshake but for a guard
guard_tls_handshake::
same as tls_handshake but for a guard
guard_open::
same as open but for a guard
circ_create::
same as dir_circ_create except for an application circuit
circ_created::
same as dir_circ_created except for an application circuit
circ_extend::
corresponds to EXTEND command for the second hop
circ_extended::
corresponds to EXTENDED response for the second hop
circ_exit_extend::
corresponds to EXTEND command for the exit
circ_exit_extended::
corresponds to EXTENDED response for the exit
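Since the directory circuit and application circuit phases differ only in their prefix, a single circuit-progress abstraction could produce both sets of tags. A rough Python sketch follows. The function name, the `(cell, hop)` event shape, and the rule that the last hop counts as the exit are assumptions for illustration, not anything from the Tor source.

```python
def circ_phase(prefix, cell, hop, path_len):
    """Return the phase tag for a circuit-building cell.

    prefix   -- "dir_circ" for a directory circuit, "circ" otherwise
    cell     -- "CREATE", "CREATED", "EXTEND" or "EXTENDED"
    hop      -- 1-based index of the hop the cell concerns
    path_len -- total number of hops planned for this circuit
    """
    name = cell.lower()
    if cell in ("EXTEND", "EXTENDED") and hop == path_len:
        # The final hop of an application circuit is the exit.
        return f"{prefix}_exit_{name}"
    return f"{prefix}_{name}"


dir_events = [("CREATE", 1), ("CREATED", 1)]
app_events = [("CREATE", 1), ("CREATED", 1),
              ("EXTEND", 2), ("EXTENDED", 2),
              ("EXTEND", 3), ("EXTENDED", 3)]

print([circ_phase("dir_circ", c, h, 1) for c, h in dir_events])
# ['dir_circ_create', 'dir_circ_created']
print([circ_phase("circ", c, h, 3) for c, h in app_events])
# ['circ_create', 'circ_created', 'circ_extend', 'circ_extended',
#  'circ_exit_extend', 'circ_exit_extended']
```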
> Right now there is a gap (from 20 to 40) between these two phases, but we don't currently fill it in with incremental progress in downloading the consensus. Maybe we should?
A thought for this part in particular: incremental progress at fetching descriptors can be confirmed as we go (we verify that we got the bytes we wanted). But if we try to show incremental progress at fetching a consensus, but then we get it and we don't like it, we'll find ourselves going backwards in bootstrap progress. Not the end of the world but maybe something to avoid getting ourselves into if we can.
But most importantly of all: this particular incremental-progress dilemma can be totally deferred until everything else is done and in place. :)
I thought about this a bit more, and I think we might want to disambiguate the connection progress messages a bit. We probably shouldn't always report the first TCP connection the same way, because it means something different to the user if the TCP connection to the first proxy fails, compared to if the TCP connection to the first relay fails. So we shouldn't use the raw connection progress indications from the ORCONN code without decoding them first a bit.
I think if we know we're connecting through a proxy, we should report the first TCP connection as something like proxy_connecting and proxy_connected. But then this gets confusingly named with the proxy handling code in connection.c that talks the proxy protocol and makes connection requests to the proxy. Maybe we should report the progress of asking the proxy to make the relay connection as connecting and connected?
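To illustrate the decoding step, here is a small Python sketch. The raw event names (`tcp_connecting`, `proxy_request_sent`, and so on) are invented placeholders for whatever the ORCONN code would actually emit; only the four reported tags come from the proposal above.

```python
# (uses_proxy, raw event) -> tag reported to the controller.
# Raw event names are hypothetical placeholders.
DECODE = {
    (False, "tcp_connecting"): "connecting",
    (False, "tcp_connected"): "connected",
    (True, "tcp_connecting"): "proxy_connecting",
    (True, "tcp_connected"): "proxy_connected",
    # Asking the proxy to reach the relay is what the user would
    # think of as "connecting" to the Tor network.
    (True, "proxy_request_sent"): "connecting",
    (True, "proxy_request_done"): "connected",
}


def decode_conn_event(raw_event, uses_proxy):
    """Translate a raw connection event into a user-meaningful tag,
    or None if the event makes no sense for this configuration."""
    return DECODE.get((uses_proxy, raw_event))


print(decode_conn_event("tcp_connecting", uses_proxy=False))     # connecting
print(decode_conn_event("tcp_connecting", uses_proxy=True))      # proxy_connecting
print(decode_conn_event("proxy_request_sent", uses_proxy=True))  # connecting
```

The same raw TCP event decodes to different tags depending on configuration, so a failure at that point can be blamed on the proxy or on the relay as appropriate.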
We might need to further disambiguate between PT proxies and firewall bypass proxies.
I think we have a terminology quirk we need to be mindful of: Tor Browser refers to PT bridges as simply "bridges". It also uses "proxy" to refer to only firewall bypass proxies.
> A thought for this part in particular: incremental progress at fetching descriptors can be confirmed as we go (we verify that we got the bytes we wanted). But if we try to show incremental progress at fetching a consensus, but then we get it and we don't like it, we'll find ourselves going backwards in bootstrap progress. Not the end of the world but maybe something to avoid getting ourselves into if we can.
We would like to avoid showing "backwards" progress in the Tor Launcher UI, but there are ways the situation you described could be addressed. For example, if each bootstrap phase that reports incremental progress were delineated clearly with "start" and "end" messages, then one could create a "checkmark" based UI that showed a second attempt rather than moving a simple progress bar backwards. For example:
Connecting to the Tor network ...
x Fetching relay information [######           ] FAILED; will retry
✓ Fetching relay information (attempt #2) [#################]
...
A general principle that I mentioned before is that the tor daemon itself should enable many different types of user interfaces. Making a rich set of info available that includes start and end milestones will make that possible.
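As a rough illustration of what a client could do with start and end milestones, here is a Python sketch that turns a stream of milestone events into checklist lines. The event kinds (`start`, `end`, `fail`) and the phase labels are hypothetical. Tor doesn't emit events in this shape today, and the progress bars are left out for brevity.

```python
def render_checklist(events):
    """Turn (kind, phase) milestone events into checklist lines.

    kind is "start", "end" (success) or "fail"; a phase that is
    started again after failing is shown as a new attempt.
    """
    entries = []   # one [status, label] pair per attempt, in start order
    current = {}   # phase -> index of its unfinished attempt in entries
    attempts = {}  # phase -> number of attempts started so far
    for kind, phase in events:
        if kind == "start":
            attempts[phase] = attempts.get(phase, 0) + 1
            label = phase
            if attempts[phase] > 1:
                label += f" (attempt #{attempts[phase]})"
            current[phase] = len(entries)
            entries.append(["running", label])
        elif kind in ("end", "fail"):
            entries[current.pop(phase)][0] = kind
    marks = {"running": " ", "end": "✓", "fail": "x"}
    suffix = {"running": " ...", "end": "", "fail": " FAILED; will retry"}
    return [f"{marks[s]} {label}{suffix[s]}" for s, label in entries]


events = [
    ("start", "Connecting to the Tor network"),
    ("end", "Connecting to the Tor network"),
    ("start", "Fetching relay information"),
    ("fail", "Fetching relay information"),
    ("start", "Fetching relay information"),
    ("end", "Fetching relay information"),
]
for line in render_checklist(events):
    print(line)
```

Nothing ever moves backwards here: the failed attempt stays on screen as its own line, and the retry gets a fresh one.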
> We might need to further disambiguate between PT proxies and firewall bypass proxies.
> I think we have a terminology quirk we need to be mindful of: Tor Browser refers to PT bridges as simply "bridges". It also uses "proxy" to refer to only firewall bypass proxies.
Terminology used in the browser UI has evolved over time and will almost certainly do so again (especially if user testing shows that changing it will help users better understand what is going on). Also, Brave, Firefox, and other clients may choose to present completely different terminology to their end-users. To me this just means that tor log and control port messages should use whatever terminology will make the most sense to developers and experts.
+1 on making sure clients can disambiguate between messages related to PT proxies vs. firewall bypass proxies (and other similar components). It will be a big win if we can show users exactly where things went wrong. Today, in many cases, people need to guess at the next step to take if they are unable to connect to the network. Will a local proxy help? Will using a different PT bridge help? Maybe my device lacks a working Internet connection? Wait, what time is it?
Deferring 51 tickets from 0.4.0.x-final. Tagging them with 040-deferred-20190220 for visibility. These are the tickets that did not get 040-must, 040-can, or tor-ci.
I'll probably close this soon, under the assumption that #28928 (moved) took care of recording this knowledge somewhere more stable. People should please let me know if they think there's still stuff to document.