When running the tor unit tests on macOS with --enable-expensive-hardening, I get the following error:
circuitpadding/circuitpadding_sample_distribution: [forking]
../src/lib/math/prob_distr.c:1311:17: runtime error: division by zero
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../src/lib/math/prob_distr.c:1311:17 in
../src/lib/math/prob_distr.c:1219:49: runtime error: division by zero
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../src/lib/math/prob_distr.c:1219:49 in
OK
Floating-point division by zero isn't undefined behaviour -- it is precisely defined in IEEE 754 and has been consistently implemented in essentially all software and hardware for decades. These look mostly, if not entirely, like false positives from a buggy UB sanitizer that's confused about floating-point arithmetic.
All of the cases in test_prob_distr.c appear to be working as intended. For example, test_prob_distr.c line 496 invokes CHECK_RELERR(np, cdf_log_logistic(1/x, 1, 1)). The log-logistic distribution has the property that CDF(1/x) = 1 - CDF(x). When x is arbitrarily close to 0, 1 - CDF(x) should be extremely close to 1; when we round x to zero, CDF(x) should be 0, 1/x is rounded to infinity, and CDF(1/x) = 1 - CDF(x) should be 1, exactly as the test confirms.
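To make the rounding story concrete, here is a minimal self-contained sketch of the identity; this is a stand-in with alpha = beta = 1, not tor's actual cdf_log_logistic, which handles general parameters more carefully:

    #include <stdio.h>

    /* Log-logistic CDF with alpha = beta = 1: CDF(x) = 1/(1 + 1/x),
     * written in this form so x = 0 and x = +inf behave cleanly. */
    static double
    cdf_ll1(double x)
    {
      return 1/(1 + 1/x);
    }

    int
    main(void)
    {
      double x = 0;  /* x rounded all the way down to zero */
      printf("CDF(x)   = %g\n", cdf_ll1(x));    /* 0: 1/x = inf, 1/(1 + inf) = 0 */
      printf("CDF(1/x) = %g\n", cdf_ll1(1/x));  /* 1: 1/inf = 0, 1/(1 + 0) = 1 */
      /* CDF(1/x) = 1 - CDF(x) holds exactly at the boundary; no UB anywhere. */
      return 0;
    }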
In prob_distr.c:
line 344, column 17 is inside logit. logit(1) should be +inf (lim_{x → 1⁻} logit(x) = +inf), and the +inf yielded by division by zero is correctly propagated by the expression log(p/(1 - p)). The test suite specifically checks that logit(1) = +inf (test_prob_distr.c lines 323 and 395).
line 427, column 26 is inside logithalf. logithalf(1/2) = logit(1) should be +inf, and the +inf yielded by division by zero is correctly propagated by the expression log((0.5 + p0)/(0.5 - p0)). The test suite specifically checks that logithalf(+/-1/2) = +/-inf (test_prob_distr.c lines 294, 323, and 397).
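Both cases are easy to reproduce in isolation. These stand-ins use the same expressions as the flagged lines (tor's real logit and logithalf have extra branches for accuracy, but the edge cases are identical); compile with -lm:

    #include <math.h>
    #include <stdio.h>

    static double logit(double p)      { return log(p/(1 - p)); }
    static double logithalf(double p0) { return log((0.5 + p0)/(0.5 - p0)); }

    int
    main(void)
    {
      /* p = 1: p/(1 - p) = 1/0 = +inf, and log(+inf) = +inf.  Every
       * step is defined by IEEE 754; none of it is undefined behavior. */
      printf("logit(1)        = %g\n", logit(1));         /* inf */
      printf("logithalf(0.5)  = %g\n", logithalf(0.5));   /* inf */
      printf("logithalf(-0.5) = %g\n", logithalf(-0.5));  /* -inf */
      return 0;
    }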
line 685, column 27 is inside isf_log_logistic. The log-logistic distribution has the property that CDF(1/x) = 1 - CDF(x), which means CDF^(-1)(1 - p) = SF^(-1)(p) = 1/x. In test_prob_distr.c, line 450 specifies a test where p = 0, np = 1 - p = 1, and line 553 confirms that isf_log_logistic(p, 1, 1), which entails dividing by p, correctly gives 1/x.
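The p = 0 edge case in isolation (again alpha = beta = 1 as a stand-in: inverting SF(x) = 1/(1 + x) gives SF^(-1)(p) = (1 - p)/p):

    #include <stdio.h>

    /* Inverse survival function of the log-logistic, alpha = beta = 1. */
    static double
    isf_ll1(double p)
    {
      return (1 - p)/p;
    }

    int
    main(void)
    {
      /* p = 0: dividing by p yields +inf, which is exactly 1/x for the
       * x-rounded-to-zero case the test exercises. */
      printf("%g\n", isf_ll1(0));  /* inf */
      return 0;
    }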
line 814, column 23 is in cdf_genpareto. Here we are checking whether the approximation 1 - e^(-x_0) is safe for the specified value of xi, which we consider it to be if |xi| < 1e-17/x_0. When x_0 is zero, the quotient 1e-17/x_0 is +inf (this is the flagged division by zero), so we consider the approximation safe for any xi. And for any xi, as x_0 goes to zero, the CDF goes to zero too, which is exactly what the approximation computes, so this is correct. The test case on line 718 of test_prob_distr.c specifies x_0 = 0. (Side note: It appears I wrote no tests with a nonzero mean. Should maybe add some.)
line 837, column 23 is in sf_genpareto, which has the same case analysis and test cases as the above instance.
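For concreteness, a sketch of the shape of that check; the names and the surrounding arithmetic here are illustrative, not copied from prob_distr.c:

    #include <math.h>

    /* Sketch of the cdf_genpareto edge case.  x_0 is the standardized
     * input (x - mu)/sigma and xi is the shape parameter. */
    double
    cdf_genpareto_sketch(double x_0, double xi)
    {
      /* When x_0 = 0, the threshold 1e-17/x_0 is +inf -- this is the
       * flagged division by zero -- so any finite xi passes the test
       * and we take the xi ~ 0 approximation, which correctly gives
       * 1 - e^0 = 0. */
      if (fabs(xi) < 1e-17/x_0)
        return 1 - exp(-x_0);
      return 1 - pow(1 + xi*x_0, -1/xi);  /* general case */
    }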
line 1170, column 20 is in sample_log_logistic, evaluating the quotient in (1 - p0)/p0. This is division by zero only if p0 = 0. In that case the correct answer is +inf, which is what we return, since this function grows without bound as p0 → 0. In test_prob_distr.c, lines 565, 567, and 569 specify tests with p0 = 0. (However, in actual runs of the code with p0 drawn using random_uniform_01, p0 = 0 should happen only with probability 2^(-1075), i.e. never.)
line 1311, column 17 is inside sample_geometric, evaluating the quotient in -x/log1p(-p). The divisor can be zero only if p = 0, which means we're trying to sample from a geometric distribution with zero success probability. Of course, the only reasonable result is +inf. Is there anything upstream of this logic that tries to construct a CIRCPAD_DIST_GEOMETRIC with zero success probability?
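The arithmetic at that quotient does the right thing even for p = 0, thanks to the signed zero coming out of log1p. A sketch, assuming the quotient has the shape -x/log1p(-p) with x an Exp(1) sample; compile with -lm:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
      double p = 0;    /* zero success probability */
      double x = 1.5;  /* stand-in for an exponential sample, so x > 0 */
      /* -p is -0.0, log1p(-0.0) = -0.0, and a negative numerator over
       * -0.0 is +inf: an unbounded wait for a success that never comes. */
      printf("%g\n", -x/log1p(-p));  /* inf */
      return 0;
    }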
line 1219, column 49 is inside sample_weibull, computing 1/k. This can happen only if k = 0, which is an edge case that's probably not very useful, but the expression does return +inf, which is the correct limit as k → 0. Is there anything upstream of this logic that tries to construct a CIRCPAD_DIST_WEIBULL with zero shape?
By the way, pretty much all non-arm hardware supports a kind of ‘sanitizer’ for floating-point invalid operations (which are not undefined behaviour, but are usually undesirable) with zero overhead if they don't happen, namely trapping the invalid-operation exception, which Unix will translate to SIGFPE. (Most arm hardware supports the exception, but only via sticky bits you can explicitly test with fetestexcept, not via trapping.)
I tried running the tests with tinytest.c modified to do feenableexcept(FE_INVALID) first thing in tinytest_main (needs #define _GNU_SOURCE and #include <fenv.h>). This turned up only one invalid operation in the tests: the logsumexp in test_stochastic_geometric_impl slightly exceeds zero, so log1mexp performs an invalid operation (log of negative); it is safe to replace log1mexp(logsumexp(...)) here by log1mexp(fmin(0, logsumexp(...))) to avoid this.
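The same trick in self-contained form, for anyone who wants to reproduce it; feenableexcept is a glibc extension, hence _GNU_SOURCE, and this needs -lm:

    #define _GNU_SOURCE  /* for feenableexcept */
    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
      /* Trap invalid operations: from here on, any computation that
       * would quietly produce a NaN raises SIGFPE instead.  Zero
       * overhead unless an exception actually fires. */
      feenableexcept(FE_INVALID);
      printf("%g\n", sqrt(2.0));  /* fine */
      printf("%g\n", log(-1.0));  /* invalid operation: dies with SIGFPE */
      return 0;
    }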
(The issue in test_stochastic_geometric_impl does not invalidate any of the test results: the NaN it produced without exceptions trapped was never used again in the subsequent computation, because it should have been an effectively zero probability (less than e^(-100) = 2^(-144) or something) and the corresponding count is always zero in tests as it should be, so psi_test ignores the probability. If the count were ever nonzero, indicating a broken geometric sampler, psi would be computed as a NaN and the psi test would fail. So floating-point arithmetic once again does the right thing in the end, though it still would probably be better to use log1mexp(fmin(0, logsumexp(...))).)
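The suggested guard is a one-liner at the call site. The names here are illustrative of the shape of the fix, not the exact code in test_stochastic_geometric_impl:

    /* logsumexp can land a hair above 0 through rounding, making log1mexp
     * take the log of a negative number.  The true value is the log of a
     * probability, hence <= 0, so clamping first is always safe. */
    logP[n - 1] = log1mexp(fmin(0, logsumexp(logP, n - 1)));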
Ok, let's fix these issues in this ticket?
#29528 (moved) deals with the general case.
Thanks for the great analysis in this ticket. Based on the comments here it was quite easy to write a patch.
You can see it in https://github.com/torproject/tor/pull/722
Please note that only the last two commits are about this ticket, the others are there because I based the branch on #29298 (moved) to avoid some conflicts in the future.
The first commit (a64ccf151f) fixes the probability distribution parameters that Riastradh mentioned with bold in comment:7. The second commit (f708055e9f) silences float-divide-by-zero as suggested by #29528 (moved).
Cool. One suggestion for a change and one comment:
It might be helpful if you put a comment on each line describing what the name and purpose of the parameter is, like /* k, shape parameter */, because I can never remember which one goes first and which one goes last.
This is the kind of code where the struct weibull dist = { WEIBULL(dist), .k = ..., .lambda = ... } can be much more legible than struct circpad_distribution dist = { .type = CIRCPAD_DIST_WEIBULL, .param1 = ..., .param2 = ... }. (Obviously changing the code to work like that here is more involved than would be warranted, even if you wanted it, for this particular issue, of course.)
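A hedged sketch of the contrast, using the field and macro names from prob_distr.h; the circpad field layout, in particular which of param1/param2 is the shape, is assumed here for illustration:

    /* Named fields: the reader can see which number is which. */
    const struct weibull dist_w = {
      WEIBULL(dist_w),
      .lambda = 1.0,  /* lambda, scale parameter */
      .k = 1.5,       /* k, shape parameter */
    };

    /* Positional param1/param2: opaque without a comment per line,
     * which is exactly the cheap mitigation suggested above. */
    circpad_distribution_t dist_c = {
      .type = CIRCPAD_DIST_WEIBULL,
      .param1 = 1.5,  /* k, shape parameter (order assumed) */
      .param2 = 1.0,  /* lambda, scale parameter (order assumed) */
    };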
Little nervous about disabling divide-by-zero sanitization across all of Tor for this code, but it's not actually undefined behavior, so it's probably ok.
Thought I should still just flag that for Nick in case he wants to hack autoconf to apply -fno-sanitize=float-divide-by-zero to just this .c file somehow.
This isn't merge_ready yet: it needs #29298 (moved) to merge first, then it needs its CI re-run and checked, and then it can go back in to merge_ready.
(Does this need a (partial?) backport to 0.4.0? Please reopen if so.)
The issues are there on 040, but perhaps we don't need to fix them since they don't fail the build or the CI, except if we also merge #29528 (moved) to 040.
I think we should backport, otherwise users will notice and send us bug reports.
Also, there are no guarantees that future UndefinedBehaviourSanitizers will continue on error.
I guess that means we need a cherry-pick to 0.4.0?
Trac: Status: closed to reopened; Resolution: fixed to N/A
This branch merges cleanly to master, because the cherry-picked commit f708055 is already in master. (So it just adds a changes file, which contains useful information in addition to the 0.4.1 changes file.)
asn, this ticket had two commits in 0.4.1:
Fix test prob distr parameters that caused warnings. a64ccf1
Silence float-divide-by-zero sanitizer warnings. f708055
a64ccf1 didn't apply cleanly to maint-0.4.0, which means the tests will diverge slightly. I hope that doesn't affect the stochastic failure rate that much in #29693 (moved).
Trac: Status: needs_revision to needs_review; Keywords: merge-after-29298 deleted, was-merged-to-041-after-29298 added