Opened 3 months ago
Closed 3 months ago
#29693 closed defect (fixed)
Decrease probability of stochastic failures in test-slow
Reported by: | teor | Owned by: | asn |
---|---|---|---|
Priority: | High | Milestone: | Tor: 0.4.0.x-final |
Component: | Core Tor/Tor | Version: | Tor: 0.4.0.1-alpha |
Severity: | Normal | Keywords: | nickm-merge, tor-ci, tor-test, 040-must, 040-backport |
Cc: | riastradh | Actual Points: | 0.5 |
Parent ID: | Points: | 0.5 | |
Reviewer: | teor | Sponsor: | Sponsor2-must |
Description
Our stochastic tests are supposed to fail around 1 in 100 runs. But when I'm doing a backport to 0.2.9, there are up to 14 jobs times 9 branches, each of which runs a test instance.
So let's decrease the probability to about 1 in (100 * 14 * 9).
Here's what the output looks like:
slow/prob_distr/stochastic_uniform: [forking] fail uniform sampler FAIL src/test/test_prob_distr.c:1209: assert(ok) NOTE: This is a stochastic test, and we expect it to fail from time to time, with some low probability. If you see it fail more than one trial in 100, though, please tell us. Seed: 5DB9A3B32C29B76D7A0032700DD142BB [stochastic_uniform FAILED]
Child Tickets
Ticket | Status | Owner | Summary | Component |
---|---|---|---|---|
#29767 | closed | Prob distr stochastic tests fail CI (innocuous?) | Core Tor/Tor |
Change History (18)
comment:1 Changed 3 months ago by
Cc: | riastradh added |
---|
comment:2 Changed 3 months ago by
comment:3 Changed 3 months ago by
Owner: | set to asn |
---|---|
Status: | new → assigned |
I think asn is handling this ticket with riastradh.
(I'm sorry about all the 040-must tickets, asn. If I've got them wrong, or if you need some help, feel free to pass them back to me.)
comment:4 Changed 3 months ago by
Another failure at:
https://travis-ci.org/tlyu/tor/jobs/508233862#L4244
Here's what Riastradh said on IRC today:
nickm: Hi! You are welcome to publish the IRC discussion we had earlier about stochastic tests.
(I don't remember which one, but you have my permission to publish all of the discussions we've had about stochastic tests and the distribution samplers since November or whenever this all started.)
I saw that there was an issue about changing the false positive rate. I'm low on energy right now, but here's the three things that I would suggest doing, some of which I might do if I had more energy:
- Write some tests of the tests -- that is, write a _buggy_ sampler for a distribution, and apply a stochastic test to it, and confirm the stochastic test _fails_.
Examples: https://github.com/brave/crypto/blob/master/test/randomTest.js, https://github.com/probcomp/crosscat/blob/master/cpp_code/tests/test_random_number_generator.cpp
You'll want to estimate the false positive rate of these test-tests (i.e., the statistical power of the tests to detect the bugs) empirically, since for most bugs there will be no neat analytic expression for it.
- Tweak NTRIALS and NPASSES_MIN so that the false positive rates of the usual tests _and_ of the test-tests are acceptable. The first one you can compute analytically as I described in past conversations; the second will necessarily be based on the empirical measurements in (1).
- Teach the CI to report the alarm rates -- not just number of alarms, but ratio of alarms to total tests run. And keep this state continuously across CI jobs so it can be aggregated over time.
In 0.4.0, I think increasing NTRIALS is our best option. I don't have access to the previous conversation about NTRIALS. If we can't find it, let's ask Riastradh, or just double NTRIALS.
comment:5 follow-up: 6 Changed 3 months ago by
One tiny addendum: It would be entirely reasonable to separate stochastic tests of distributions altogether, as in ./test, ./test-slow, ./test-stochastic. They are, after all, qualitatively different from deterministic tests, or from tests like testing that a signature made by a random public key verifies.
comment:6 Changed 3 months ago by
Replying to riastradh:
One tiny addendum: It would be entirely reasonable to separate stochastic tests of distributions altogether, as in ./test, ./test-slow, ./test-stochastic. They are, after all, qualitatively different from deterministic tests, or from tests like testing that a signature made by a random public key verifies.
This change would be appropriate for master (in another ticket) - we don't add new binaries late in an alpha series.
comment:7 Changed 3 months ago by
comment:9 Changed 3 months ago by
This bug caused #28636 to fail:
slow/prob_distr/stochastic_weibull: [forking] fail Weibull sampler FAIL src/test/test_prob_distr.c:1419: assert(ok) NOTE: This is a stochastic test, and we expect it to fail from time to time, with some low probability. If you see it fail more than one trial in 100, though, please tell us. Seed: D954C0E889C484D1BF3BB1D895E726F1 [stochastic_weibull FAILED] 1/20 TESTS FAILED. (0 skipped)
comment:10 Changed 3 months ago by
OK, my suggestion is to increase N_TRIALS from 2 to 3 for now, and open a ticket for the future to do more advanced stuff like the suggestions from comment:4.
In particular with N_TRIALS=2 now and 14 travis jobs per build, we have 1.3% probability of a travis build failing because of the stoch tests. This means that one travis build will fail after 50 travis builds with 50% chance.
If we bump N_TRIALS to 3, then we have 0.013% probability of a travis build failing because of the stoch tests. This means that one travis build will fail after 4952 travis builds with 50% chance. Not so annoying anymore.
Here is a (potentially borken) python script I used to calculate the data above, along with the documentation here https://github.com/torproject/tor/blob/938d97cb0d4acfdd1ea57ec0a3094bcc2101f13d/src/test/test_prob_distr.c#L866 :
N_TRIALS = 3 # Probability of a stochastic test failing alpha = pow(0.01, N_TRIALS) # number of stochastic tests n_stoch_tests = 10 # Probability of at least one stochastic test failing failure_rate_for_test_suite = 1 - pow(1 - alpha, n_stoch_tests) print("With N_TRIALS={} and alpha={} we have failure_rate_for_test_suite {}".format(N_TRIALS, alpha, failure_rate_for_test_suite)) # Number of travis jobs per build n_travis_jobs = 14 # Probability of at least one travis job failing failure_rate_of_travis_build = 1 - pow(1 - failure_rate_for_test_suite, n_travis_jobs) print("With N_TRIALS={} and alpha={} we have travis build fail {}".format(N_TRIALS,alpha, failure_rate_of_travis_build)) for n in xrange(5000): # Probability of travis build failing after n builds p = 1 - pow(1 - failure_rate_of_travis_build, n) if p > 0.5: print("With N_TRIALS={} and alpha={}, a travis build will fail with 50% chance after {} builds.".format(N_TRIALS,alpha,n)) break
comment:11 follow-up: 13 Changed 3 months ago by
Here are our worst-case scenarios:
- a merge forward from 0.4.0 (or earlier) to master:
- 3 (maint-0.4.0, release-0.4.0, master) * 14 jobs (travis, appveyor)
- With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1651 builds.
- pull requests to 0.4.0 and master:
- 2 (branch, PR) * 2 (0.4.0, master) * 14 jobs (travis, appveyor)
- With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1238 builds.
These error rates are acceptable, but only until we split off maint-0.4.1. Then we will need to go to 4 trials, or find some other solution.
comment:12 Changed 3 months ago by
Actual Points: | → 0.5 |
---|---|
Status: | assigned → needs_review |
Patch: https://github.com/torproject/tor/pull/823
Let me know if you have any questions :)
comment:13 Changed 3 months ago by
Replying to teor:
Here are our worst-case scenarios:
- a merge forward from 0.4.0 (or earlier) to master:
- 3 (maint-0.4.0, release-0.4.0, master) * 14 jobs (travis, appveyor)
- With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1651 builds.
- pull requests to 0.4.0 and master:
- 2 (branch, PR) * 2 (0.4.0, master) * 14 jobs (travis, appveyor)
- With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1238 builds.
These error rates are acceptable, but only until we split off maint-0.4.1. Then we will need to go to 4 trials, or find some other solution.
ACK. I also pushed https://github.com/torproject/tor/pull/824 which bumps it to 4. Let's minimize false positive interruptions for now and think of smarter approaches in #29847.
For the record:
With N_TRIALS=4 and alpha=1e-08 we have failure_rate_for_test_suite 9.99999959506e-08 With N_TRIALS=4 and alpha=1e-08 we have travis build fail 1.39999903326e-06 With N_TRIALS=4 and alpha=1e-08, a travis build will fail with 50% chance after 495106 builds.
comment:14 Changed 3 months ago by
Keywords: | nickm-merge 040-backport added |
---|---|
Status: | needs_review → merge_ready |
They both look ok to me. I am glad we finally got this fixed, it has been a pain to do complex backports and merges.
Let's get them merged before 0.4.0.3-alpha?
You should probably tell nickm which one to merge :-)
comment:15 Changed 3 months ago by
Yes, let's! Nick please merge bug29693_040_radical AKA https://github.com/torproject/tor/pull/824.
Thanks!
comment:16 Changed 3 months ago by
Reviewer: | → teor |
---|
comment:18 Changed 3 months ago by
Resolution: | → fixed |
---|---|
Status: | merge_ready → closed |
#29767 has 2 more failures.