Opened 4 years ago

Closed 4 years ago

#16952 closed defect (not a bug)

Retry configure/verify on transient chutney failures

Reported by: teor
Owned by: nickm
Priority: Medium
Milestone:
Component: Core Tor/Tor
Version:
Severity:
Keywords: Post027Freeze TorCoreTeam201509 testing SponsorS
Cc:
Actual Points:
Parent ID: #16953
Points:
Reviewer:
Sponsor:

Description

chutney occasionally fails to bootstrap or verify. Sometimes, and on some platforms / machines, bootstrap is slower than we expect.

We should either: fix intermittent failures (#16951), or lengthen the default sleep time (this ticket), or retry up to N times (argument) (this ticket) with increasing sleep times (this ticket), or implement a chutney command that finds out whether the network has finished bootstrapping (#16950).

Easy fixes are to: make the default sleep time 30 seconds; retry verify at 45 and 60 seconds before giving up (args for interval and number of retries); and then retry once from the start with a new network in case of chutney configure / launch failure (arg for number of reconfigures). This will make it a maximum of 4.5 minutes before the test fails, after 3 x 2 retries.
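
The 4.5-minute figure only works out if 30, 45 and 60 are read as successive sleep durations rather than cumulative timestamps, repeated once more after a single reconfigure. A minimal sketch of that arithmetic in Python; the function name and parameters are illustrative and not part of test-network.sh, and the time chutney itself takes to run is ignored:

```python
def worst_case_seconds(sleep_times, reconfigures):
    """Upper bound on time spent sleeping before the test is declared failed.

    sleep_times: seconds slept before each verify attempt on one network.
    reconfigures: extra full attempts with a freshly configured network.
    """
    networks = 1 + reconfigures
    return networks * sum(sleep_times)


total = worst_case_seconds([30, 45, 60], reconfigures=1)
print(total, "seconds =", total / 60, "minutes")  # 270 seconds = 4.5 minutes
```

Here the "3 x 2 retries" are the three verify attempts on each of the two networks.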

Change History (5)

comment:1 Changed 4 years ago by teor

Observed successful bootstrap times range from around 19-23 seconds (AWS T2 Linux) to 21-30+ seconds (MacBook Air OS X) when measured via src/test/test-network.sh --sleep. However, these figures depend significantly on load on the box (and CPU credits in the case of AWS).

For transient failures, retrying configuration generally works on the first retry; if it does not, the failure is likely permanent.

comment:2 Changed 4 years ago by teor

Component: Chutney → Tor
Keywords: Post027Freeze TorCoreTeam201509 added
Parent ID: #16949 → #16953
Status: new → needs_review

Please see my branch retry-test-network-on-failure on https://github.com/teor2345/tor.git
It has the following changes:

  • Retry bootstrap (configure and launch) once on failure, configurable via CHUTNEY_RECONFIGURES or --reconfigures
  • Retry verify every 5 seconds from BOOTSTRAP_TIME (now 20s) to MAX_BOOTSTRAP_TIME (60s), configurable via --time and --max-time
  • A separate commit that fixes up the indenting and spacing in the script

This makes src/test/test-network.sh suitably robust to intermittent failures and variable machine speeds. This will ensure that #16953 reports actual failures, not transient conditions or slowdowns.

I have tested it on OS X and Linux.
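
For readers who want the control flow without reading the shell script, the retry logic described in the list above can be sketched roughly as follows. This is a Python illustration, not the code on the branch: `run_test_network`, `configure_and_launch`, `verify`, and the keyword names are hypothetical stand-ins, the defaults mirror the values quoted above, and time spent inside `verify()` is not counted.

```python
import time


def run_test_network(configure_and_launch, verify,
                     bootstrap_time=20, max_bootstrap_time=60,
                     retry_interval=5, reconfigures=1, sleep=time.sleep):
    """Return True as soon as verify() succeeds, False if every attempt fails.

    configure_and_launch and verify are caller-supplied callables returning
    True on success; they stand in for the chutney steps the script runs.
    """
    for _ in range(1 + reconfigures):
        if not configure_and_launch():
            continue  # configure/launch failed: try again with a new network
        sleep(bootstrap_time)
        waited = bootstrap_time
        while True:
            if verify():
                return True
            if waited + retry_interval > max_bootstrap_time:
                break  # out of time on this network; reconfigure if allowed
            sleep(retry_interval)
            waited += retry_interval
    return False


# Usage: a fake network whose verify step succeeds on the third attempt.
attempts = iter([False, False, True])
ok = run_test_network(lambda: True, lambda: next(attempts),
                      sleep=lambda seconds: None)
print(ok)  # True
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.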

comment:3 Changed 4 years ago by teor

Updated the retry-test-network-on-failure branch to retry configuration on chutney verify failure. (It previously only retried on chutney configure failure.)
Also added further comments and fixed up a trailing space.

The script now runs for a maximum of approximately 3 minutes per test network (2 x 60 seconds MAX_BOOTSTRAP_TIME + 2 x 8 x 3-4 seconds chutney verify run time / network timeout).
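
Spelled out, that estimate is the bootstrap ceiling plus the verify run time, for each of the two networks. A small Python check of the numbers; the constant names are illustrative, and the values are the ones quoted in this comment:

```python
MAX_BOOTSTRAP_TIME = 60   # seconds, per network
NETWORKS = 2              # original network plus one reconfigure
VERIFY_RUNS = 8           # verify attempts per network, as stated above
VERIFY_SECONDS = (3, 4)   # quoted range for one chutney verify run

low, high = (NETWORKS * (MAX_BOOTSTRAP_TIME + VERIFY_RUNS * v)
             for v in VERIFY_SECONDS)
print(low, high)  # 168 184
```

That is roughly 2.8 to 3.1 minutes, consistent with "approximately 3 minutes".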

comment:4 Changed 4 years ago by nickm

Hmm. Looks plausible to me, but I'm hoping that we can make it so any failure, even a "transient" one, can get treated as a possible bug. Possibly, we should exit with a nonzero exit status? Otherwise there's no way to actually get an alert about a bug that only happens (say) one time out of 10.
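
One way to read this suggestion is to keep the retries but report "passed only after a retry" with its own nonzero status, so that automation can still raise an alert. A minimal sketch under that assumption; the status values and the `exit_status` helper are hypothetical and are not defined by test-network.sh or chutney:

```python
EXIT_OK = 0
EXIT_FAILED = 1
EXIT_PASSED_AFTER_RETRY = 2  # hypothetical code, chosen only for this sketch


def exit_status(passed, retries_used):
    """Map a test outcome to an exit status that still exposes flakiness."""
    if not passed:
        return EXIT_FAILED
    return EXIT_PASSED_AFTER_RETRY if retries_used > 0 else EXIT_OK


# Example: the network verified, but only on the second attempt.
print(exit_status(passed=True, retries_used=1))  # 2
```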

comment:5 Changed 4 years ago by teor

Resolution: not a bug
Status: needs_review → closed

Let's do this the right way in #16950: "implement that 'chutney has-bootstrapped' test, and then maybe 'chutney wait-n-seconds-or-until-bootstrapped'".

In the meantime, I'll increase the default bootstrap to 30 seconds for the larger networks and slower test machines.
