Our stochastic tests are supposed to fail in around 1 of every 100 runs. But when I'm doing a backport to 0.2.9, there are up to 14 jobs times 9 branches, each of which runs its own instance of the test suite.
So let's decrease the probability to about 1 in (100 * 14 * 9).
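As a quick sanity check of that target, here is a back-of-the-envelope sketch in Python (it assumes all 14 * 9 = 126 test instances are independent and each runs the stochastic tests once):

# Proposed per-test failure probability: about 1 in (100 * 14 * 9).
n_instances = 14 * 9                 # jobs times branches during a backport
p_per_test = 1.0 / (100 * 14 * 9)    # roughly 1/12600, i.e. ~7.9e-5

# Probability that at least one of the instances trips a stochastic failure.
p_any_failure = 1 - (1 - p_per_test) ** n_instances
print(p_any_failure)  # ~0.01, i.e. back to "about 1 in 100" per backport round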
Here's what the output looks like:
slow/prob_distr/stochastic_uniform: [forking] fail uniform sampler
  FAIL src/test/test_prob_distr.c:1209: assert(ok)
  NOTE: This is a stochastic test, and we expect it to fail from
  time to time, with some low probability. If you see it fail more
  than one trial in 100, though, please tell us.
  Seed: 5DB9A3B32C29B76D7A0032700DD142BB
  [stochastic_uniform FAILED]
nickm: Hi! You are welcome to publish the IRC discussion we had earlier about stochastic tests.
(I don't remember which one, but you have my permission to publish all of the discussions we've had about stochastic tests and the distribution samplers since November or whenever this all started.)
I saw that there was an issue about changing the false positive rate. I'm low on energy right now, but here are the three things that I would suggest doing, some of which I might do if I had more energy:
Tweak NTRIALS and NPASSES_MIN so that the false positive rates of the usual tests and of the test-tests are acceptable. The first one you can compute analytically, as I described in past conversations (there's a sketch of that computation after these suggestions); the second will necessarily be based on the empirical measurements in (1).
Teach the CI to report the alarm rates -- not just number of alarms, but ratio of alarms to total tests run. And keep this state continuously across CI jobs so it can be aggregated over time.
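For the analytic computation in the first suggestion, here is a minimal sketch. It assumes (this is a guess at the semantics, not a reading of test_prob_distr.c) that each stochastic test runs NTRIALS trials, each trial of a correct sampler fails independently with probability about 0.01, and the test is reported as failing only when fewer than NPASSES_MIN trials pass; the helper names and the n_stoch_tests/n_ci_jobs defaults are illustrative, taken from the figures used later in this thread:

from math import factorial

def binom(n, k):
    # Binomial coefficient "n choose k".
    return factorial(n) // (factorial(k) * factorial(n - k))

def per_test_false_positive(ntrials, npasses_min, p_trial_fail=0.01):
    # Probability that a correct sampler is reported as failing:
    # fewer than npasses_min of ntrials independent trials pass.
    p_pass = 1 - p_trial_fail
    return sum(binom(ntrials, k) * p_pass**k * p_trial_fail**(ntrials - k)
               for k in range(npasses_min))

def build_false_positive(ntrials, npasses_min, n_stoch_tests=10, n_ci_jobs=14):
    # Probability that at least one stochastic test fails anywhere in a CI
    # build of n_ci_jobs jobs, each running n_stoch_tests stochastic tests.
    p = per_test_false_positive(ntrials, npasses_min)
    return 1 - (1 - p) ** (n_stoch_tests * n_ci_jobs)

# With NTRIALS=2 and NPASSES_MIN=1 this reduces to 0.01**2 per test,
# which is the per-build failure rate discussed further down.
print(build_false_positive(2, 1))
print(build_false_positive(3, 1))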
In 0.4.0, I think increasing NTRIALS is our best option. I don't have access to the previous conversation about NTRIALS. If we can't find it, let's ask Riastradh, or just double NTRIALS.
One tiny addendum: It would be entirely reasonable to separate stochastic tests of distributions altogether, as in ./test, ./test-slow, ./test-stochastic. They are, after all, qualitatively different from deterministic tests, or from tests like testing that a signature made by a random public key verifies.
This change would be appropriate for master (in another ticket) - we don't add new binaries late in an alpha series.
In #29527 (moved) and #29298 (moved), we modified some probability distribution tests to have a smaller range in 0.4.1. But we left the larger range in 0.4.0. I hope that doesn't affect the stochastic failure rate that much.
slow/prob_distr/stochastic_weibull: [forking] fail Weibull sampler
  FAIL src/test/test_prob_distr.c:1419: assert(ok)
  NOTE: This is a stochastic test, and we expect it to fail from
  time to time, with some low probability. If you see it fail more
  than one trial in 100, though, please tell us.
  Seed: D954C0E889C484D1BF3BB1D895E726F1
  [stochastic_weibull FAILED]
1/20 TESTS FAILED. (0 skipped)
OK, my suggestion is to increase N_TRIALS from 2 to 3 for now, and open a ticket for the future to do more advanced stuff like the suggestions from comment:4.
In particular, with N_TRIALS=2 as it is now and 14 Travis jobs per build, there is about a 1.3% probability that a Travis build fails because of the stochastic tests. That means that after about 50 Travis builds, there is a 50% chance of having seen at least one such failure.
If we bump N_TRIALS to 3, the probability of a Travis build failing because of the stochastic tests drops to about 0.013%, and it takes about 4952 builds before the chance of seeing such a failure reaches 50%. Not so annoying anymore.
N_TRIALS = 3
# Probability of a stochastic test failing
alpha = pow(0.01, N_TRIALS)
# number of stochastic tests
n_stoch_tests = 10
# Probability of at least one stochastic test failing
failure_rate_for_test_suite = 1 - pow(1 - alpha, n_stoch_tests)
print("With N_TRIALS={} and alpha={} we have failure_rate_for_test_suite {}".format(
    N_TRIALS, alpha, failure_rate_for_test_suite))

# Number of travis jobs per build
n_travis_jobs = 14
# Probability of at least one travis job failing
failure_rate_of_travis_build = 1 - pow(1 - failure_rate_for_test_suite, n_travis_jobs)
print("With N_TRIALS={} and alpha={} we have travis build fail {}".format(
    N_TRIALS, alpha, failure_rate_of_travis_build))

for n in xrange(5000):
    # Probability of travis build failing after n builds
    p = 1 - pow(1 - failure_rate_of_travis_build, n)
    if p > 0.5:
        print("With N_TRIALS={} and alpha={}, a travis build will fail with 50% chance after {} builds.".format(
            N_TRIALS, alpha, n))
        break
With N_TRIALS=4 and alpha=1e-08 we have failure_rate_for_test_suite 9.99999959506e-08
With N_TRIALS=4 and alpha=1e-08 we have travis build fail 1.39999903326e-06
With N_TRIALS=4 and alpha=1e-08, a travis build will fail with 50% chance after 495106 builds.
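Incidentally, the loop at the end of the script isn't strictly necessary: the 50% point can be computed in closed form from the per-build failure rate. A small sketch, using the same model and parameters as the script above:

from math import ceil, log

def build_failure_rate(n_trials, n_stoch_tests=10, n_travis_jobs=14):
    # Per-build failure probability, same model as the script above.
    alpha = 0.01 ** n_trials
    return 1 - (1 - alpha) ** (n_stoch_tests * n_travis_jobs)

def builds_until_50pct(p_build_fail):
    # Smallest n with 1 - (1 - p)**n > 0.5, i.e. n > log(0.5) / log(1 - p).
    return int(ceil(log(0.5) / log(1 - p_build_fail)))

print(builds_until_50pct(build_failure_rate(2)))  # -> 50
print(builds_until_50pct(build_failure_rate(3)))  # -> 4952
print(builds_until_50pct(build_failure_rate(4)))  # -> 495106, as in the output above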