Opened 7 weeks ago

Closed 5 weeks ago

#29693 closed defect (fixed)

Decrease probability of stochastic failures in test-slow

Reported by: teor Owned by: asn
Priority: High Milestone: Tor: 0.4.0.x-final
Component: Core Tor/Tor Version: Tor: 0.4.0.1-alpha
Severity: Normal Keywords: nickm-merge, tor-ci, tor-test, 040-must, 040-backport
Cc: riastradh Actual Points: 0.5
Parent ID: Points: 0.5
Reviewer: teor Sponsor: Sponsor2-must

Description

Our stochastic tests are supposed to fail about 1 run in 100. But when I'm doing a backport to 0.2.9, there are up to 14 CI jobs times 9 branches, each of which runs its own test instance.

So let's decrease the per-instance failure probability to about 1 in (100 * 14 * 9).
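
As a rough sanity check on that target (a sketch, assuming 14 jobs times 9 branches and the stated 1-in-100 per-instance failure rate; the numbers are illustrative, not taken from tor's test code):

instances = 14 * 9
p_now = 1.0 / 100
p_target = 1.0 / (100 * instances)

# Probability that at least one of the 126 test instances fails
print("now:    {:.0%} chance of some failure per backport".format(1 - (1 - p_now) ** instances))
print("target: {:.1%} chance of some failure per backport".format(1 - (1 - p_target) ** instances))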

Here's what the output looks like:

slow/prob_distr/stochastic_uniform: [forking] fail uniform sampler
  FAIL src/test/test_prob_distr.c:1209: assert(ok)
NOTE: This is a stochastic test, and we expect it to fail from
time to time, with some low probability. If you see it fail more
than one trial in 100, though, please tell us.
Seed: 5DB9A3B32C29B76D7A0032700DD142BB
  [stochastic_uniform FAILED]

https://travis-ci.org/torproject/tor/jobs/503432646#L5845

Child Tickets

Ticket    Status    Owner    Summary                                             Component
#29767    closed             Prob distr stochastic tests fail CI (innocuous?)    Core Tor/Tor

Change History (18)

comment:1 Changed 7 weeks ago by nickm

Cc: riastradh added

comment:2 Changed 6 weeks ago by teor

#29767 has 2 more failures.

comment:3 Changed 5 weeks ago by teor

Owner: set to asn
Status: new → assigned

I think asn is handling this ticket with riastradh.

(I'm sorry about all the 040-must tickets, asn. If I've got them wrong, or if you need some help, feel free to pass them back to me.)

comment:4 Changed 5 weeks ago by teor

Another failure at:
https://travis-ci.org/tlyu/tor/jobs/508233862#L4244

Here's what Riastradh said on IRC today:

nickm: Hi! You are welcome to publish the IRC discussion we had earlier about stochastic tests.
(I don't remember which one, but you have my permission to publish all of the discussions we've had about stochastic tests and the distribution samplers since November or whenever this all started.)
I saw that there was an issue about changing the false positive rate. I'm low on energy right now, but here's the three things that I would suggest doing, some of which I might do if I had more energy:

  1. Write some tests of the tests -- that is, write a _buggy_ sampler for a distribution, and apply a stochastic test to it, and confirm the stochastic test _fails_.

     Examples: https://github.com/brave/crypto/blob/master/test/randomTest.js, https://github.com/probcomp/crosscat/blob/master/cpp_code/tests/test_random_number_generator.cpp
     You'll want to estimate the false positive rate of these test-tests (i.e., the statistical power of the tests to detect the bugs) empirically, since for most bugs there will be no neat analytic expression for it.

  2. Tweak NTRIALS and NPASSES_MIN so that the false positive rates of the usual tests _and_ of the test-tests are acceptable. The first one you can compute analytically as I described in past conversations; the second will necessarily be based on the empirical measurements in (1).
  3. Teach the CI to report the alarm rates -- not just number of alarms, but ratio of alarms to total tests run. And keep this state continuously across CI jobs so it can be aggregated over time.

In 0.4.0, I think increasing NTRIALS is our best option. I don't have access to the previous conversation about NTRIALS. If we can't find it, let's ask Riastradh, or just double NTRIALS.
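
For Riastradh's item (1), here is a minimal standalone sketch of the idea in Python. It is not tor's C test harness in src/test/test_prob_distr.c; it just applies a plain chi-squared uniformity check to a correct sampler and to a deliberately biased one, and the biased one should be rejected:

import random

def chi_squared_uniform(sample_fn, n_samples=100000, n_bins=10):
    # Chi-squared goodness-of-fit statistic of sample_fn() against Uniform[0,1).
    counts = [0] * n_bins
    for _ in range(n_samples):
        x = sample_fn()
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    expected = n_samples / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)

# Critical value for chi-squared with n_bins - 1 = 9 degrees of freedom at p = 0.01,
# so a correct sampler should exceed it about 1 time in 100.
CRITICAL = 21.67

good_sampler = random.random                  # correct Uniform[0,1) sampler
buggy_sampler = lambda: random.random() ** 2  # deliberately biased sampler

print("good sampler rejected: ", chi_squared_uniform(good_sampler) > CRITICAL)
print("buggy sampler rejected:", chi_squared_uniform(buggy_sampler) > CRITICAL)

Running the buggy case many times and counting how often it slips past the test gives the empirical power estimate Riastradh describes.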

comment:5 Changed 5 weeks ago by riastradh

One tiny addendum: It would be entirely reasonable to separate stochastic tests of distributions altogether, as in ./test, ./test-slow, ./test-stochastic. They are, after all, qualitatively different from deterministic tests, or from tests like testing that a signature made by a random public key verifies.

comment:6 in reply to:  5 Changed 5 weeks ago by teor

Replying to riastradh:

One tiny addendum: It would be entirely reasonable to separate stochastic tests of distributions altogether, as in ./test, ./test-slow, ./test-stochastic. They are, after all, qualitatively different from deterministic tests, or from tests like testing that a signature made by a random public key verifies.

This change would be appropriate for master (in another ticket) - we don't add new binaries late in an alpha series.

comment:7 Changed 5 weeks ago by teor

In #29527 and #29298, we modified some probability distribution tests to have a smaller range in 0.4.1. But we left the larger range in 0.4.0. I hope that doesn't affect the stochastic failure rate that much.

comment:8 Changed 5 weeks ago by teor

This bug caused #29528 to fail on master.

comment:9 Changed 5 weeks ago by teor

This bug caused #28636 to fail:

slow/prob_distr/stochastic_weibull: [forking] fail Weibull sampler
  FAIL src/test/test_prob_distr.c:1419: assert(ok)
NOTE: This is a stochastic test, and we expect it to fail from
time to time, with some low probability. If you see it fail more
than one trial in 100, though, please tell us.
Seed: D954C0E889C484D1BF3BB1D895E726F1
  [stochastic_weibull FAILED]
1/20 TESTS FAILED. (0 skipped)

https://travis-ci.org/torproject/tor/jobs/509213576#L5738

comment:10 Changed 5 weeks ago by asn

OK, my suggestion is to increase N_TRIALS from 2 to 3 for now, and open a ticket for the future to do more advanced stuff like the suggestions from comment:4.

In particular, with N_TRIALS=2 now and 14 Travis jobs per build, we have a 1.3% probability of a Travis build failing because of the stochastic tests. This means that after 50 Travis builds there is a 50% chance that at least one build has failed.

If we bump N_TRIALS to 3, we have a 0.013% probability of a Travis build failing because of the stochastic tests, and it takes 4952 Travis builds before there is a 50% chance of a failure. Not so annoying anymore.

Here is a (potentially broken) Python script I used to calculate the data above, together with the documentation at https://github.com/torproject/tor/blob/938d97cb0d4acfdd1ea57ec0a3094bcc2101f13d/src/test/test_prob_distr.c#L866 :

N_TRIALS = 3
# Probability of a stochastic test failing
alpha = pow(0.01, N_TRIALS)

# number of stochastic tests
n_stoch_tests = 10
# Probability of at least one stochastic test failing
failure_rate_for_test_suite = 1 - pow(1 - alpha, n_stoch_tests)
print("With N_TRIALS={} and alpha={} we have failure_rate_for_test_suite {}".format(N_TRIALS, alpha, failure_rate_for_test_suite))

# Number of travis jobs per build
n_travis_jobs = 14
# Probability of at least one travis job failing
failure_rate_of_travis_build = 1 - pow(1 - failure_rate_for_test_suite, n_travis_jobs)
print("With N_TRIALS={} and alpha={} we have travis build fail {}".format(N_TRIALS,alpha, failure_rate_of_travis_build))

for n in range(5000):
    # Probability of travis build failing after n builds
    p = 1 - pow(1 - failure_rate_of_travis_build, n)
    if p > 0.5:
        print("With N_TRIALS={} and alpha={}, a travis build will fail with 50% chance after {} builds.".format(N_TRIALS,alpha,n))
        break

comment:11 Changed 5 weeks ago by teor

Here are our worst-case scenarios:

  • a merge forward from 0.4.0 (or earlier) to master:
    • 3 (maint-0.4.0, release-0.4.0, master) * 14 jobs (travis, appveyor)
    • With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1651 builds.
  • pull requests to 0.4.0 and master:
    • 2 (branch, PR) * 2 (0.4.0, master) * 14 jobs (travis, appveyor)
    • With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1238 builds.

These error rates are acceptable, but only until we split off maint-0.4.1. Then we will need to go to 4 trials, or find some other solution.
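
These figures follow from the same model as the script in comment:10 (a sketch, assuming 10 stochastic tests per job and N_TRIALS=3, i.e. alpha = 1e-06):

alpha = 0.01 ** 3              # per-test failure probability with N_TRIALS = 3
suite = 1 - (1 - alpha) ** 10  # per-job failure probability, 10 stochastic tests

for label, n_jobs in [("merge forward (3 branches * 14 jobs)", 3 * 14),
                      ("pull request (2 * 2 * 14 jobs)", 2 * 2 * 14)]:
    build = 1 - (1 - suite) ** n_jobs
    # Smallest number of builds after which a failure is more likely than not
    n = 1
    while 1 - (1 - build) ** n <= 0.5:
        n += 1
    print("{}: 50% chance of a failed build after {} builds".format(label, n))

This prints 1651 and 1238 builds, matching the figures above.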

comment:12 Changed 5 weeks ago by asn

Actual Points: 0.5
Status: assigned → needs_review

Patch: https://github.com/torproject/tor/pull/823

Let me know if you have any questions :)

comment:13 in reply to:  11 Changed 5 weeks ago by asn

Replying to teor:

Here are our worst-case scenarios:

  • a merge forward from 0.4.0 (or earlier) to master:
    • 3 (maint-0.4.0, release-0.4.0, master) * 14 jobs (travis, appveyor)
    • With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1651 builds.
  • pull requests to 0.4.0 and master:
    • 2 (branch, PR) * 2 (0.4.0, master) * 14 jobs (travis, appveyor)
    • With N_TRIALS=3 and alpha=1e-06, a travis build will fail with 50% chance after 1238 builds.

These error rates are acceptable, but only until we split off maint-0.4.1. Then we will need to go to 4 trials, or find some other solution.

ACK. I also pushed https://github.com/torproject/tor/pull/824 which bumps it to 4. Let's minimize false positive interruptions for now and think of smarter approaches in #29847.

For the record:

With N_TRIALS=4 and alpha=1e-08 we have failure_rate_for_test_suite 9.99999959506e-08
With N_TRIALS=4 and alpha=1e-08 we have travis build fail 1.39999903326e-06
With N_TRIALS=4 and alpha=1e-08, a travis build will fail with 50% chance after 495106 builds.

comment:14 Changed 5 weeks ago by teor

Keywords: nickm-merge 040-backport added
Status: needs_review → merge_ready

They both look OK to me. I am glad we finally got this fixed; it has been a pain to do complex backports and merges.

Let's get them merged before 0.4.0.3-alpha?

You should probably tell nickm which one to merge :-)

comment:15 Changed 5 weeks ago by asn

Yes, let's! Nick, please merge bug29693_040_radical, AKA https://github.com/torproject/tor/pull/824.

Thanks!

comment:16 Changed 5 weeks ago by teor

Reviewer: teor

comment:17 Changed 5 weeks ago by nickm

Squashed and merged to 0.4.0 and forward.

comment:18 Changed 5 weeks ago by nickm

Resolution: fixed
Status: merge_ready → closed