Decrease probability of stochastic failures in test-slow

changed milestone to %Tor: 0.4.0.x-final

Trac:
Child Ticket(s): #29767 (moved)

added 040-backport 040-must actualpoints::0.5 component::core tor/tor milestone::Tor: 0.4.0.x-final nickm-merge owner::asn points::0.5 priority::high resolution::fixed reviewer::teor severity::normal sponsor::2-must status::closed tor-ci tor-test type::defect version::tor 0.4.0.1-alpha labels

Trac:
Cc: N/A to riastradh

#29767 (moved) has 2 more failures.

I think asn is handling this ticket with riastradh.

(I'm sorry about all the 040-must tickets, asn. If I've got them wrong, or if you need some help, feel free to pass them back to me.)

Trac:
Status: new to assigned
Owner: N/A to asn

Another failure at: https://travis-ci.org/tlyu/tor/jobs/508233862#L4244

Here's what Riastradh said on IRC today:

nickm: Hi! You are welcome to publish the IRC discussion we had earlier about stochastic tests. (I don't remember which one, but you have my permission to publish all of the discussions we've had about stochastic tests and the distribution samplers since November or whenever this all started.) I saw that there was an issue about changing the false positive rate. I'm low on energy right now, but here's the three things that I would suggest doing, some of which I might do if I had more energy:

1. Write some tests of the tests -- that is, write a buggy sampler for a distribution, and apply a stochastic test to it, and confirm the stochastic test fails. Examples: https://github.com/brave/crypto/blob/master/test/randomTest.js, https://github.com/probcomp/crosscat/blob/master/cpp_code/tests/test_random_number_generator.cpp You'll want to estimate the false positive rate of these test-tests (i.e., the statistical power of the tests to detect the bugs) empirically, since for most bugs there will be no neat analytic expression for it.

2. Tweak NTRIALS and NPASSES_MIN so that the false positive rates of the usual tests and of the test-tests are acceptable. The first one you can compute analytically as I described in past conversations; the second will necessarily be based on the empirical measurements in (1).

3. Teach the CI to report the alarm rates -- not just number of alarms, but ratio of alarms to total tests run. And keep this state continuously across CI jobs so it can be aggregated over time.
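A minimal sketch of the "tests of the tests" idea, for illustration only. It is not Tor's test code: the samplers, the chi-squared goodness-of-fit check, the bin count, and the sample sizes are all assumptions chosen to keep the example short. It runs the same stochastic test against a correct sampler and a deliberately buggy one, and measures the alarm rate of each empirically.

```python
import random

N_BINS = 10
# Critical value of the chi-squared distribution with N_BINS - 1 = 9
# degrees of freedom at significance level 0.01 (standard table value).
CHI2_CRIT_DF9_P01 = 21.666


def good_sampler(rng):
    """Hypothetical correct sampler: uniform on [0, 1)."""
    return rng.random()


def buggy_sampler(rng):
    """Hypothetical broken sampler: squaring skews the mass toward 0."""
    u = rng.random()
    return u * u


def stochastic_test_passes(sampler, rng, n_samples=5000):
    """One trial: chi-squared goodness-of-fit against uniform on [0, 1)."""
    counts = [0] * N_BINS
    for _ in range(n_samples):
        index = min(int(sampler(rng) * N_BINS), N_BINS - 1)
        counts[index] += 1
    expected = n_samples / N_BINS
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return chi2 <= CHI2_CRIT_DF9_P01


def alarm_rate(sampler, n_runs=100, seed=0):
    """Fraction of runs in which the stochastic test raises an alarm."""
    rng = random.Random(seed)
    alarms = sum(
        not stochastic_test_passes(sampler, rng) for _ in range(n_runs)
    )
    return alarms / n_runs


if __name__ == "__main__":
    print("alarm rate, good sampler: ", alarm_rate(good_sampler))
    print("alarm rate, buggy sampler:", alarm_rate(buggy_sampler))
```

The alarm rate on the good sampler estimates the false positive rate of a single trial; the alarm rate on the buggy sampler estimates the power of the test against that particular bug. A subtler bug than squaring would be detected less reliably, which is why the rate has to be measured rather than assumed.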

In 0.4.0, I think increasing NTRIALS is our best option. I don't have access to the previous conversation about NTRIALS. If we can't find it, let's ask Riastradh, or just double NTRIALS.
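For reference, here is a hedged sketch of how the false positive rate depends on NTRIALS and NPASSES_MIN. It assumes that each trial of a correct sampler fails independently with probability alpha, and that a test is reported as failed when fewer than NPASSES_MIN of its NTRIALS trials pass. The function name and the value alpha = 0.01 are illustrative assumptions, not taken from the Tor source.

```python
from math import comb


def false_positive_rate(alpha, ntrials, npasses_min=1):
    """P(fewer than npasses_min of ntrials trials pass) for a correct sampler.

    Assumes independent trials, each failing with probability alpha.
    """
    p_pass = 1.0 - alpha
    return sum(
        comb(ntrials, k) * p_pass ** k * alpha ** (ntrials - k)
        for k in range(npasses_min)
    )


if __name__ == "__main__":
    for ntrials in (2, 3, 4):
        print(ntrials, false_positive_rate(0.01, ntrials))
```

With NPASSES_MIN = 1 the sum has a single term, alpha to the power NTRIALS, so under these assumptions each extra trial divides the false positive rate by 1/alpha. Doubling NTRIALS squares it.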

One tiny addendum: It would be entirely reasonable to separate stochastic tests of distributions altogether, as in ./test, ./test-slow, ./test-stochastic. They are, after all, qualitatively different from deterministic tests, or from tests like testing that a signature made by a random public key verifies.

Replying to riastradh:

One tiny addendum: It would be entirely reasonable to separate stochastic tests of distributions altogether, as in ./test, ./test-slow, ./test-stochastic. They are, after all, qualitatively different from deterministic tests, or from tests like testing that a signature made by a random public key verifies.

This change would be appropriate for master (in another ticket) - we don't add new binaries late in an alpha series.

In #29527 (moved) and #29298 (moved), we modified some probability distribution tests to have a smaller range in 0.4.1. But we left the larger range in 0.4.0. I hope that doesn't affect the stochastic failure rate that much.

This bug caused #29528 (moved) to fail on master.

This bug caused #28636 (moved) to fail:

slow/prob_distr/stochastic_weibull: [forking] fail Weibull sampler
  FAIL src/test/test_prob_distr.c:1419: assert(ok)
NOTE: This is a stochastic test, and we expect it to fail from
time to time, with some low probability. If you see it fail more
than one trial in 100, though, please tell us.
Seed: D954C0E889C484D1BF3BB1D895E726F1
  [stochastic_weibull FAILED]
1/20 TESTS FAILED. (0 skipped)

https://travis-ci.org/torproject/tor/jobs/509213576#L5738

OK, my suggestion is to increase N_TRIALS from 2 to 3 for now, and open a ticket for the future to do more advanced stuff like the suggestions from comment:4.

In particular, with N_TRIALS=2 now and 14 travis jobs per build, we have about a 1.4% probability of a travis build failing because of the stochastic tests. This means there is a 50% chance of at least one failed travis build within 50 builds.

If we bump N_TRIALS to 3, then we have about a 0.014% probability of a travis build failing because of the stochastic tests. This means there is a 50% chance of at least one failed travis build within 4952 builds. Not so annoying anymore.

Here is a (potentially borken) python script I used to calculate the data above, along with the documentation here https://github.com/torproject/tor/blob/938d97cb0d4acfdd1ea57ec0a3094bcc2101f13d/src/test/test_prob_distr.c#L866 :

N_TRIALS = 3
# Probability of a stochastic test failing
alpha = pow(0.01, N_TRIALS)

# number of stochastic tests
n_stoch_tests = 10
# Probability of at least one stochastic test failing
failure_rate_for_test_suite = 1 - pow(1 - alpha, n_stoch_tests)
print("With N_TRIALS={} and alpha={} we have failure_rate_for_test_suite {}".format(N_TRIALS, alpha, failure_rate_for_test_suite))

# Number of travis jobs per build
n_travis_jobs = 14
# Probability of at least one travis job failing
failure_rate_of_travis_build = 1 - pow(1 - failure_rate_for_test_suite, n_travis_jobs)
print("With N_TRIALS={} and alpha={} we have travis build fail {}".format(N_TRIALS,alpha, failure_rate_of_travis_build))

for n in range(1, 1000000):  # upper bound must exceed the answer (about 5e5 for N_TRIALS=4)
    # Probability of travis build failing after n builds
    p = 1 - pow(1 - failure_rate_of_travis_build, n)
    if p > 0.5:
        print("With N_TRIALS={} and alpha={}, a travis build will fail with 50% chance after {} builds.".format(N_TRIALS,alpha,n))
        break

Here are our worst-case scenarios:

- a merge forward from 0.4.0 (or earlier) to master:
  - 3 (maint-0.4.0, release-0.4.0, master) * 14 jobs (travis, appveyor)
  - With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1651 builds.
- pull requests to 0.4.0 and master:
  - 2 (branch, PR) * 2 (0.4.0, master) * 14 jobs (travis, appveyor)
  - With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1238 builds.

These error rates are acceptable, but only until we split off maint-0.4.1. Then we will need to go to 4 trials, or find some other solution.
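The build counts quoted above (50, 4952, 1651, 1238) can be cross-checked without a loop. The sketch below assumes the same model as the script in this thread: 10 stochastic tests per job, a per-trial false alarm probability of 0.01, independent failures, and a test failing only when all N_TRIALS trials fail. The function name and parameter names are made up for illustration.

```python
import math


def builds_until_half(n_trials, n_jobs=14, n_stoch_tests=10,
                      alpha_per_trial=0.01):
    """Smallest n with P(at least one stochastic failure in n builds) > 0.5."""
    # Probability that one stochastic test fails all of its trials.
    alpha = alpha_per_trial ** n_trials
    # Log of the probability that a whole build has no stochastic failure.
    log_build_ok = n_jobs * n_stoch_tests * math.log1p(-alpha)
    # We need 1 - exp(n * log_build_ok) > 0.5, i.e. n > ln(0.5) / log_build_ok.
    return math.floor(math.log(0.5) / log_build_ok) + 1


if __name__ == "__main__":
    print(builds_until_half(2))             # 14 jobs, N_TRIALS=2
    print(builds_until_half(3))             # 14 jobs, N_TRIALS=3
    print(builds_until_half(3, n_jobs=42))  # merge forward: 3 * 14 jobs
    print(builds_until_half(3, n_jobs=56))  # pull requests: 2 * 2 * 14 jobs
```

Using `math.log1p` keeps the result accurate when alpha is tiny, where computing `1 - alpha` directly would lose precision.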

Patch: https://github.com/torproject/tor/pull/823

Let me know if you have any questions :)

Trac:
Actualpoints: N/A to 0.5
Status: assigned to needs_review

Replying to teor:

Here are our worst-case scenarios:

a merge forward from 0.4.0 (or earlier) to master:

3 (maint-0.4.0, release-0.4.0, master) * 14 jobs (travis, appveyor)

With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1651 builds.

pull requests to 0.4.0 and master:

2 (branch, PR) * 2 (0.4.0, master) * 14 jobs (travis, appveyor)

With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1238 builds.

These error rates are acceptable, but only until we split off maint-0.4.1. Then we will need to go to 4 trials, or find some other solution.

ACK. I also pushed https://github.com/torproject/tor/pull/824 which bumps it to 4. Let's minimize false positive interruptions for now and think of smarter approaches in #29847 (moved).

For the record:

With N_TRIALS=4 and alpha=1e-08 we have failure_rate_for_test_suite 9.99999959506e-08
With N_TRIALS=4 and alpha=1e-08 we have travis build fail 1.39999903326e-06
With N_TRIALS=4 and alpha=1e-08, a travis build will fail with 50% chance after 495106 builds.

They both look ok to me. I am glad we finally got this fixed; it has been a pain to do complex backports and merges.

Let's get them merged before 0.4.0.3-alpha?

You should probably tell nickm which one to merge :-)

Trac:
Status: needs_review to merge_ready
Keywords: N/A deleted, nickm-merge, 040-backport added

Yes, let's! Nick please merge bug29693_040_radical AKA https://github.com/torproject/tor/pull/824.

Thanks!

Trac:
Reviewer: N/A to teor

Squashed and merged to 0.4.0 and forward.

Trac:
Status: merge_ready to closed
Resolution: N/A to fixed

closed

changed time estimate to 4h

added 4h of time spent

mentioned in issue #29756 (moved)

mentioned in issue #29767 (moved)

mentioned in issue #29847 (moved)

moved to tpo/core/tor#29693 (closed)

mentioned in issue tpo/core/tor#29847 (closed)
