report bootstrap phase when we actually start, not just unblock something

changed milestone to %Tor: unspecified

added 035-roadmap-subtask 035-triaged-in-20180711 040-deferred-20190220 bootstrap bootstrap-arch component::core tor/tor ex-28018-child ex-catalyst-20190619 ex-sponsor-19 ex-sponsor19 milestone::Tor: unspecified priority::medium s8-bootstrap severity::normal status::new type::enhancement usability ux ux-team labels

Trac:
Type: defect to enhancement

The following bootstrap milestones (maybe not an exhaustive list) get reported by the preceding step, not by code that actually starts the next milestone:

BOOTSTRAP_STATUS_REQUESTING_STATUS (20)
BOOTSTRAP_STATUS_HANDSHAKE (-2)
BOOTSTRAP_STATUS_REQUESTING_DESCRIPTORS (45)
BOOTSTRAP_STATUS_CONN_OR (80)

BOOTSTRAP_STATUS_REQUESTING_STATUS is reported by circuit_build_no_more_hops() when it observes that a single-hop circuit has completed.

BOOTSTRAP_STATUS_HANDSHAKE is reported by connection_or_finished_connecting() when it probably should be reported by connection_tls_start_handshake().

BOOTSTRAP_STATUS_REQUESTING_DESCRIPTORS is reported by update_router_have_minimum_dir_info(), which doesn't actually initiate the descriptor downloads.

BOOTSTRAP_STATUS_CONN_OR is reported by update_router_have_minimum_dir_info() when it probably should actually be reported by the connection launch code in circuit_handle_first_hop().

Move my post-freeze 0.3.5 items to 0.3.6.

Trac:
Milestone: Tor: 0.3.5.x-final to Tor: 0.3.6.x-final

Trac:
Parent: #22266 (moved) to #28018 (moved)

Trac:
Sponsor: Sponsor8 to Sponsor8-can

Tor 0.3.6.x has been renamed to 0.4.0.x.

Trac:
Milestone: Tor: 0.3.6.x-final to Tor: 0.4.0.x-final

rough notes -- phases marked with * are in some way implicit, probably in a way that gives a misleading idea of what's going on or what's gone wrong.

-1 undef

-2 handshake
   this gets decoded to handshake_dir or handshake_or depending on
   whether we have a consensus yet. we should fix this, because it
   causes usability problems.

00 starting

05 conn_dir *
   circuit_handle_first_hop in circuitbuild.c, if it's about to launch
   a one-hop circuit

10 handshake_dir *
   decoded in control_event_bootstrap from handshake -- implicit in
   that it doesn't correspond to the actual TCP connect

15 onehop_create
   circuit_send_first_onion_skin in circuitbuild.c -- assumes
   directory lookups are the only reason to build one-hop circuits

20 requesting_status *
   circuit_build_no_more_hops in circuitbuild.c
   -- this implicit and doesn't mean the request actually got sent; it
   just means the circuit is built. the begin_dir might not have been
   sent yet, etc.

25 loading_status *
   connection_edge_process_relay_cell_not_open -- probably actually
   the completion of a begin_dir command

40 loading_keys *
   connection_edge_process_relay_cell_not_open -- same as
   loading_status but infers from consensus_is_waiting_for_certs

45 requesting_descriptors *
   update_router_have_minimum_dir_info in nodelist.c -- not the actual
   request but actually that we noticed that we need more descriptors

50 loading_descriptors
   connection_edge_process_relay_cell_not_open in relay.c -- similar
   to loading_status and loading_keys

   load_downloaded_routers in dirclient.c -- if
   router_load_routers_from_string says it added new routers

   handle_response_fetch_certificate in dirclient.c --
   microdescs_add_to_cache says it added new ones

80 conn_or *
   update_router_have_minimum_dir_info in nodelist.c -- basicallly
   once we have the minimum amount of directory info to build
   circuits, but it doesn't mean we've necessarily launched any yet

85 handshake_or *
   decoded in control_event_bootstrap from handshake

90 circuit_create
   circuit_send_first_onion_skin in circuitbuild.c, when it's building
   a circuit that's not one-hop

100 done
   circuit_build_no_more_hops in circuitbuild.c

We probably should scope this a bit better. I think there are two categories of "previous phase reports progress on starting the following phase":

reporting our best guess as to what the next asynchronous progress will occur next that's been unblocked by what just happened
reporting progress of the next phase when we're about to call into code that almost certainly begins the phase we're reporting that we're starting

(1) is more important to fix, and #27167 (moved) fixes some of them. (2) is less important to fix, because the distinction is only important to someone who's familiar with the code. In these cases, it's very unlikely that progress will be interrupted or blocked before the actual work begins on the next phase.

I'm no longer convinced that adding completion milestones is the best way to fix this class of problems. I think we want to report starting something when we've actually started it, not when we think we've unblocked it from starting.

I've been functionally using this as a ticket for tracking where we report misleading implicit progress in bootstrap phases.

Edited the summary and description accordingly.

Trac:
Description: Right now many bootstrap events get reported when the preceding task has completed. This makes it somewhat harder to tell what has gone wrong if bootstrap progress stalls.

We should add completion milestones to bootstrap reporting. This makes bootstrap reporting more future-proof. If in the future we add a time-consuming task with (no bootstrap reporting) between two existing bootstrap tasks, it will be a little more obvious what's going on.

For example, say we have task X followed by task Z, but then we add a lengthy task Y without adding bootstrap reporting to it. In the old scheme without completion milestones, if Y stalls, the user sees:

starting X
starting Z
[hang] The user thinks Z has already started when no such thing has happened because Y is still in progress. If we add completion milestones, the user will see:
starting X
finished X
starting Z
finishing Z in a normal bootstrap. If something gets stuck in task Y, the user will see:
starting X
finished X
[hang] This will make it more clear that something got stuck in between tasks.

In a one-line display like Tor Launcher, the completion milestones will normally flash by quickly and not be very visible to users. Completion milestones might make the NOTICE logs a bit more verbose.

to

Right now many bootstrap events get reported when the preceding task has completed. This makes it somewhat harder to tell what has gone wrong if bootstrap progress stalls.

[edit: The following isn't necessarily the best way to fix this. It might be better to figure out how to report starting something when actually starting it.]

We should add completion milestones to bootstrap reporting. This makes bootstrap reporting more future-proof. If in the future we add a time-consuming task with (no bootstrap reporting) between two existing bootstrap tasks, it will be a little more obvious what's going on.

For example, say we have task X followed by task Z, but then we add a lengthy task Y without adding bootstrap reporting to it. In the old scheme without completion milestones, if Y stalls, the user sees:

starting X
starting Z
[hang] The user thinks Z has already started when no such thing has happened because Y is still in progress. If we add completion milestones, the user will see:
starting X
finished X
starting Z
finishing Z in a normal bootstrap. If something gets stuck in task Y, the user will see:
starting X
finished X
[hang] This will make it more clear that something got stuck in between tasks.

In a one-line display like Tor Launcher, the completion milestones will normally flash by quickly and not be very visible to users. Completion milestones might make the NOTICE logs a bit more verbose.
Summary: add completion milestones to bootstrap to report bootstrap phase when we actually start, not just unblock something

Trac:
Sponsor: Sponsor8-can to Sponsor19-can

Trac:
Keywords: N/A deleted, bootstrap-arch added

Deferring 51 tickets from 0.4.0.x-final. Tagging them with 040-deferred-20190220 for visibility. These are the tickets that did not get 040-must, 040-can, or tor-ci.

Trac:
Milestone: Tor: 0.4.0.x-final to Tor: unspecified
Keywords: N/A deleted, 040-deferred-20190220 added

Adding the keyword to mark everything that didn't fit into the time for sponsor 19.

Trac:
Keywords: N/A deleted, ex-sponsor-19 added

Remove sponsor 19 and add a keyword ex-sponsor19 to mark all the tickets that could have been in the scope of the sponsor.

Trac:
Sponsor: Sponsor19-can to N/A
Keywords: N/A deleted, ex-sponsor19 added

Unparent remaining open children of #28018 (moved).

Trac:
Parent: #28018 (moved) to N/A
Keywords: N/A deleted, ex-28018-child added

Remove myself from tickets I'm not actively working on.

Trac:
Keywords: N/A deleted, ex-catalyst-20190619 added
Owner: catalyst to N/A
Cc: brade, mcs, linda, isabela, iry, intrigeri to brade, mcs, linda, isabela, iry, intrigeri, catalyst

Trac:
Status: assigned to new

mentioned in issue #28926 (moved)

moved to tpo/core/tor#27308 (closed)

report bootstrap phase when we actually start, not just unblock something

Child items ...

Activity