Opened 3 years ago

Closed 8 months ago

#22132 closed defect (fixed)

Chutney should avoid waiting for set times: wait for conditions instead

Reported by: teor Owned by: nickm
Priority: Medium Milestone:
Component: Core Tor/Chutney Version:
Severity: Normal Keywords: network-team-roadmap-2019-Q1Q2
Cc: pastly, catalyst Actual Points: 1.2
Parent ID: #20647 Points: 2
Reviewer: teor Sponsor: Sponsor19

Description

Chutney has a few places where it waits for hard-coded amounts of time. It should wait for events to happen instead.

For example:

  • re-trying connections (#22131)
  • waiting for bootstrap
  • anywhere else we use sleep()

Child Tickets

TicketStatusOwnerSummaryComponent
#29760closedteorStop repeating chutney's defaults in commentsCore Tor/Chutney

Change History (32)

comment:1 Changed 2 years ago by teor

Owner: teor deleted
Status: newassigned

Disowning tickets I don't intend to work on in the next 6 months.

comment:2 Changed 2 years ago by teor

Status: assignednew

Mark all tickets that are assigned to nobody as "new".

comment:3 Changed 18 months ago by teor

Cc: pastly added
Status: newneeds_revision

Here's an example script that uses stem to check each node on the network to see if it has bootstrapped:

https://github.com/pastly/simple-bw-scanner/blob/master/tests/testnets/simple.common/03-network-in-ready-state.py

comment:4 Changed 16 months ago by catalyst

Cc: catalyst added

comment:5 Changed 13 months ago by teor

Parent ID: #20647

Might be useful for better chutney debugging

comment:6 Changed 9 months ago by nickm

I have a chutney branch here called wait_until_bootstrap. We'll need to update the test scripts accordingly.

comment:7 Changed 9 months ago by nickm

Actual Points: 1
Status: needs_revisionneeds_review

comment:8 Changed 9 months ago by teor

Status: needs_reviewneeds_revision

Looks good, but I have some changes I want to suggest.
Would you mind opening a pull request?

I would do it myself, but GitHub has some weird restrictions on reviews by the person who opens a pull request.

comment:9 Changed 9 months ago by teor

We might need to fix #20473 as part of this ticket. I guess we'll find out!

comment:10 Changed 9 months ago by nickm

Status: needs_revisionneeds_review

Would you mind opening a pull request?

Okay -- https://github.com/torproject/chutney/pull/5

comment:11 Changed 9 months ago by teor

Status: needs_reviewneeds_revision

I found a typo on the pull request.
And I think we should not hard-code the bootstrap limit.

Should we be calling this command from test_network.sh to get diagnostic bootstrap status?

Should we be using this command from test_network.sh instead of our 20 second wait?
If we do, we might need to wait after bootstrap for:

  • a consensus interval (10 seconds?) for relays to be in the consensus
  • another consensus interval (10 seconds?) for onion services to update their consensus and post descriptors

comment:12 Changed 9 months ago by nickm

Should we be calling this command from test_network.sh to get diagnostic bootstrap status?

Yes, but I think that should be a separate ticket.

comment:13 Changed 9 months ago by teor

Owner: set to nickm
Status: needs_revisionassigned

comment:14 Changed 9 months ago by teor

Status: assignedneeds_revision

comment:15 in reply to:  12 Changed 9 months ago by teor

Replying to nickm:

Should we be calling this command from test_network.sh to get diagnostic bootstrap status?

Yes, but I think that should be a separate ticket.

Ok, that's the existing ticket #28203.
Let's add a commit to call the script instead of the bootstrap delay.
And let's ignore the exit status for the moment.

Then let's run CI to see if we need to wait longer.
We should merge this diagnostic right after we merge the chutney.git CI in #27912.

Replying to teor:

I found a typo on the pull request.
And I think we should not hard-code the bootstrap limit.

Let's make these changes so we can merge #28203.

Should we be using this command from test_network.sh instead of our 20 second wait?
If we do, we might need to wait after bootstrap for:

  • a consensus interval (10 seconds?) for relays to be in the consensus
  • another consensus interval (10 seconds?) for onion services to update their consensus and post descriptors

Let's fix bootstrap issues (#20473), then start failing when bootstrap fails.
We don't have to make these changes right away.

comment:16 Changed 9 months ago by teor

Sponsor: Sponsor19

comment:17 Changed 8 months ago by nickm

Actual Points: 11.1
Status: needs_revisionneeds_review

I've made the requested changes on the branch.

comment:18 Changed 8 months ago by teor

Status: needs_reviewmerge_ready

Looks good to me. Let's merge after CI.

comment:19 Changed 8 months ago by teor

Resolution: fixed
Status: merge_readyclosed

The CI passed, merged to master as 800c11c.

The command is actually called wait_for_bootstrap, which I fixed by documenting it in the README in #29762.

comment:20 Changed 8 months ago by teor

Resolution: fixed
Status: closedreopened

I did a draft branch for #28203, and discovered that wait_for_bootstrap only works with tor master:

0.2.9 through 0.3.5 say:

Launching chutney using Python 2.7.15
Waiting for nodes to bootstrap...
Bootstrap failed. Node status:
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
(-100, 'no_message', 'No bootstrap messages yet.')
Tor bootstrap failed, ignoring for now.
Running 1 verify rounds...

https://travis-ci.org/torproject/chutney/jobs/505562952#L1368

0.4.0 says:

Launching chutney using Python 2.7.15
Waiting for nodes to bootstrap...
Bootstrap failed. Node status:
(100, '(done)', 'Done')
(100, '(done)', 'Done')
(100, '(done)', 'Done')
(0, '(starting)', 'Starting')
(0, '(starting)', 'Starting')
(100, '(done)', 'Done')
(100, '(done)', 'Done')
(100, '(done)', 'Done')
(100, '(done)', 'Done')
(100, '(done)', 'Done')
(100, '(done)', 'Done')
(100, '(done)', 'Done')
Tor bootstrap failed, ignoring for now.

https://travis-ci.org/torproject/chutney/jobs/505562955#L1367

If all the nodes are already bootstrapping, and we don't have the right regexes, then we should fail when any node fails to bootstrap (#20473). That way, we learn if we break tor bootstrap or chutney's wait_for_bootstrap feature.

If some nodes don't bootstrap on some versions, we should allow bootstrap failures on those versions. I guess that's another config option to test-network.sh.

comment:21 Changed 8 months ago by teor

Status: reopenedneeds_revision

comment:22 Changed 8 months ago by teor

Keywords: network-team-roadmap-2019-Q1Q2 added

These chutney tickets are on the network team roadmap, or they are required for tickets that are on the network team roadmap.

comment:23 Changed 8 months ago by nickm

For 0.4.0, the problem seems to be that we need a longer timeout. I waited a while longer and it bootstrapped fine.

For 0.3.5 and earlier, there's a problem with our regex; I'm doing a fix in a wait_for_bootstrap_better branch.

If all the nodes are already bootstrapping, and we don't have the right regexes, then we should fail when any node fails to bootstrap (#20473). That way, we learn if we break tor bootstrap or chutney's wait_for_bootstrap feature.

chutney wait_for_bootstrap currently exits with a failure if the nodes don't bootstrap; I guess this should be easy to use in the test-network script.

comment:24 Changed 8 months ago by nickm

Status: needs_revisionneeds_review

I've made the regex fix, plus a few other improvements to the output of wait_for_bootstrap, in wait_for_bootstrap_better. PR at https://github.com/torproject/chutney/pull/12

comment:25 Changed 8 months ago by nickm

Actual Points: 1.11.2

comment:26 Changed 8 months ago by teor

Status: needs_reviewneeds_revision

I did a rollup branch for #22132 and #28203:
https://travis-ci.org/teor2345/chutney/builds/506043823

There are two bugs in this branch:

  1. CHUTNEY_START_TIME is too low:

https://travis-ci.org/teor2345/chutney/jobs/506043826
https://travis-ci.org/teor2345/chutney/jobs/506043830
https://travis-ci.org/teor2345/chutney/jobs/506043835

Let's try setting it to 60 seconds by default?

  1. The logs say that all the nodes are done, but the script still fails:

https://travis-ci.org/teor2345/chutney/jobs/506043837

Maybe there's a race condition here, because we load the logs for c.isBootstrapped(), then we load them again for c.getLastBootstrapStatus().

comment:27 Changed 8 months ago by teor

Here's a pull request for that branch:
https://github.com/torproject/chutney/pull/13

Although we should probably re-do the merges when the fixes are in.

comment:28 Changed 8 months ago by nickm

I've made the requested changes in wait_for_bootstrap_better. PR still at ​https://github.com/torproject/chutney/pull/12

comment:29 Changed 8 months ago by nickm

Status: needs_revisionneeds_review

comment:30 Changed 8 months ago by teor

Reviewer: teor
Status: needs_reviewmerge_ready

We also need to change tools/test-network-impl.sh, I added a fixup here:
https://github.com/torproject/chutney/pull/14

I think we're at the point that I'm happy with this fix, and my extra changes are pretty small. I'll wait for CI, then merge.

comment:31 Changed 8 months ago by teor

I didn't get to merge this today. Anyone can merge, or I will merge it tomorrow.

comment:32 Changed 8 months ago by teor

Resolution: fixed
Status: merge_readyclosed

Merged #22132 and #28203 to master.

Note: See TracTickets for help on using tickets.