Was this continually failing or was this the first failure?
Also do we have instructions for running the unit tests in Wine? I am not sure how to reproduce this, especially if it was intermittent, which I assume it is.
FYI, I ran these two tests in a loop for ~20 hours in wine64 on my laptop (using https://github.com/ahf/tor-win32/ to build) and was unable to reproduce. :/
I ran the tests on 32-bit x86 Windows 10 for ~4.5h yesterday and didn't manage to hit this test failure. I let the same machine run over night with a 64-bit x86 build and also didn't see the test failure there.
Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122 (moved)?
The 0.4.0 and master builds all failed due to this bug after #29599 (moved) was merged.
But only the "Image: Visual Studio 2017; Environment: target=x86_64-w64-mingw32..." jobs failed, so it might be timing-sensitive. Or the OS bug might happen reliably on Windows Server 2016.
More generally, there is no guarantee that the monotonic clock will have increased by at least 1000 nanoseconds between monotime_init() and circpad_estimate_circ_rtt_on_received()'s call to monotime_absolute_usec(). A non-increasing value is more likely when the monotonic clock's resolution is in milliseconds: two calls can easily return the same value.
But the circuitpadding code assumes that monotime_absolute_usec() is always greater than zero.
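To make the failure mode concrete, here is a minimal sketch (not the actual circuitpadding code; the field name is just for illustration) of how a zero result from monotime_absolute_usec() collides with code that uses 0 as an "unset" sentinel:

```c
/* Hedged sketch, not tor's actual code: a zero microsecond timestamp is
 * indistinguishable from "no timestamp recorded yet". */
#include <stdint.h>

static uint64_t last_received_time_usec = 0; /* 0 means "no cell seen yet" */

static void
record_cell_received(uint64_t now_usec)
{
  /* If now_usec is 0 because the monotonic clock has not advanced since
   * monotime_init(), the stored timestamp still looks "unset" below. */
  last_received_time_usec = now_usec;
}

static int
have_seen_a_cell(void)
{
  return last_received_time_usec != 0;
}
```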
It will be easier to set up if you use your own appveyor account, not Tor's.
Or you could create your best fix for these issues, get it merged, and then see if it fails again?
It might be easier than messing around with RDP: there's no guarantee you'd be able to reproduce the timing issues on Appveyor within the 1 hour job limit.
> More generally, there is no guarantee that the monotonic clock will have increased by at least 1000 nanoseconds between monotime_init() and circpad_estimate_circ_rtt_on_received()'s call to monotime_absolute_usec(). A non-increasing value is more likely when the monotonic clock's resolution is in milliseconds: two calls can easily return the same value.
> But the circuitpadding code assumes that monotime_absolute_usec() is always greater than zero.
But wait, how is this OS bug still happening even though I mocked monotime and monotime_coarse in test_circuitpad_rtt()? I set the mocked absolute time to be 1000 nsec, so monotime_absolute_usec() should be returning 1 in circpad_estimate_circ_rtt_on_received().
Is the problem maybe that sometimes the nsec_per_tick_numer value is very high on some appveyor instances? Would using a larger initial mocked time than 1000 nsec solve the issue for this test?
Ok, I changed the monotime mocking to start at 1000 usec rather than 1000 nsec, on the wild guess that this is a precision issue on some platforms due to our mocking. That fix, plus the tokens test fix are at https://github.com/torproject/tor/pull/757
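For reference, a sketch of the kind of mocking tweak described above, assuming tor's test-only helpers monotime_enable_test_mocking() and monotime_set_mock_time_nsec() (check compat_time.h for the exact names):

```c
/* Sketch of the mocking change: start the mocked clock at 1000 usec
 * instead of 1000 nsec. Helper names are assumed from compat_time.h. */
#include "lib/time/compat_time.h"

static void
setup_mocked_monotime(void)
{
  monotime_enable_test_mocking();
  /* Before: monotime_set_mock_time_nsec(1000); -- 1000 nsec == 1 usec,
   * which can truncate down to 0 on some code paths. */
  /* After: 1000 usec == 1,000,000 nsec, well clear of any precision issue. */
  monotime_set_mock_time_nsec(1000 * 1000);
}
```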
Might as well set needs_review. I suppose there could be an argument for handling the case where monotime_absolute_usec() might return 0 in the wild, but I don't think that is actually possible there (this code only runs on relays upon receiving valid onionskins for circuit extension... lots of time should have passed since monotime_init() at that point).
> ...
> Might as well set needs_review. I suppose there could be an argument for handling the case where monotime_absolute_usec() might return 0 in the wild, but I don't think that is actually possible there (this code only runs on relays upon receiving valid onionskins for circuit extension... lots of time should have passed since monotime_init() at that point).
It doesn't matter how much time has passed, the monotime_diff_() functions can still return zero. So all code that uses the monotime_diff_() functions needs to handle zero correctly.
Zero deltas happen more often:
- on Windows (due to an OS bug),
- when using monotime_coarse, or on systems with low-resolution timers,
- on platforms where we emulate monotonic time using wall-clock time, and
- when using time units that are larger than nanoseconds (due to truncation on division).
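As one hedged example of what "handle zero correctly" can mean for a caller (clamping is only one option; the right fix depends on the caller), assuming the monotime_get()/monotime_diff_usec() API:

```c
/* Hedged sketch: tolerate a zero delta from the monotime_diff_*() family. */
#include "lib/time/compat_time.h"

static int64_t
elapsed_usec_at_least_one(const monotime_t *start)
{
  monotime_t now;
  monotime_get(&now);
  int64_t diff = monotime_diff_usec(start, &now);
  /* Zero (or, pre-ratchet, negative) deltas are possible on Windows, with
   * coarse or low-resolution timers, or after unit truncation. */
  return diff > 0 ? diff : 1;
}
```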
I added a better explanation to #29640 (moved) in response to your comments, and nickm's comments:
Zero time differences can happen on some machines, even after a large amount of time has elapsed. Our ratchet corrects non-monotonic times and returns a zero time difference:
- on Windows (due to an OS bug that yields non-monotonic times from the monotonic time API),
- on platforms where we emulate monotonic time using wall-clock time (after wall clock time is set to an earlier time).
I noticed your question during the network team meeting:
mikeperry: the #29500 one is annoying. I think we need another flag or something. I also only barely understand how this could ever happen in a real scenario
mikeperry: it seems almost impossible
nickm: mikeperry: maybe add the appropriate assertions that should never be hit, so that we can figure out why it would be happening sometime anyway?
nickm: diagnosis is the next best thing to a fix
Here are three scenarios where the monotime_diff or monotime_absolute functions can return zero:
(I'll use monotime_absolute below, because I think that's what you're using. But monotime_absolute just calls monotime_diff with the previous monotime value.)
The Ratchet Returns Zero

1. Tor is running on a platform which has monotime API bugs (Windows), or Tor is compiled without support for monotime functions (old Linux, old macOS, other platforms where we haven't implemented monotime support)
2. Tor calls a monotime_absolute() function, and stores the last monotime value
3. Any amount of time elapses
4. Tor calls a monotime_absolute() function, and Windows experiences its monotime bug, or the user has changed the wall clock time backwards
5. Tor emulates monotonic time using a ratchet. The ratchet stores an internal correction factor for future monotonic times, and returns the same value as it previously returned
6. The monotime_absolute() function gets the same value for the previous and current times. It returns zero.
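For illustration only, here is a minimal sketch of the ratchet idea; it is not tor's implementation (tor stores an internal correction factor rather than clamping like this), but the observable effect is the same: a backwards clock step becomes a zero delta rather than a negative one.

```c
/* Illustrative ratchet, not tor's actual code: never return a time earlier
 * than the last one we handed out. */
#include <stdint.h>

static uint64_t last_returned_usec = 0;

static uint64_t
ratcheted_usec(uint64_t raw_usec)
{
  if (raw_usec < last_returned_usec)
    return last_returned_usec; /* clock went backwards: repeat the old value */
  last_returned_usec = raw_usec;
  return raw_usec;
}
```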
Low Resolution Timers in the OS

1. Tor is running on a platform with low-resolution timers
2. Tor calls a monotime_absolute() function, and stores the last monotime value
3. A small amount of time elapses, which is lower than the timer resolution
4. Tor calls a monotime_absolute() function, which gets the same monotime value as the previous call to the low-resolution timer
5. The monotime_absolute() function gets the same value for the previous and current times. It returns zero.
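A short sketch of that scenario, assuming tor's coarse monotime helpers (monotime_coarse_get() and monotime_coarse_diff_msec()): two back-to-back reads often land inside the same clock tick.

```c
/* Sketch: with a coarse or low-resolution clock, consecutive reads can be
 * identical, so their difference is zero. */
#include "lib/time/compat_time.h"

static int64_t
coarse_delta_of_back_to_back_reads(void)
{
  monotime_coarse_t a, b;
  monotime_coarse_get(&a);
  monotime_coarse_get(&b);
  /* With a tick of several milliseconds, this is frequently 0. */
  return monotime_coarse_diff_msec(&a, &b);
}
```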
Low Resolution Timer APIs in Tor

1. Tor is running on any platform
2. Tor calls a monotime_absolute() function with low-resolution units (seconds, milliseconds, or microseconds) and stores the last monotime value
3. A small amount of time elapses, which is less than one unit
4. Tor calls a monotime_absolute() function, which gets a monotime value which is greater than the previous call to the low-resolution timer, but less than one unit away from the previous timer.
5. The monotime_absolute() function gets less than one unit's difference between the previous and current times. It divides by the unit conversion factor, and returns zero. (These functions truncate, rather than rounding.)
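A tiny standalone illustration of the truncation case (plain C, nothing tor-specific): integer division toward zero turns any sub-unit delta into 0.

```c
/* Any elapsed time smaller than one output unit divides down to zero. */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
  int64_t delta_nsec = 900;                        /* 900 ns elapsed */
  int64_t delta_usec = delta_nsec / 1000;          /* truncates to 0 usec */
  int64_t delta_msec = delta_nsec / (1000 * 1000); /* truncates to 0 msec */
  printf("%lld usec, %lld msec\n",
         (long long)delta_usec, (long long)delta_msec);
  return 0;
}
```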
I hope that explains the scenarios that Tor code needs to deal with when calling Tor's monotime APIs.
If you have any more questions, please feel free to ask on this ticket. Sometimes I miss questions on IRC or in meetings.
I don't know which tests fail intermittently with Mike's PR 757, and which succeed.
So I'll assume that it is an improvement, and keep working on it until it passes.
After we merge, we should open tickets for further failures.
For 0.4.0, here's what I did:
- deleted a failing test in 0.4.0, because we merged some required fixes into master, but didn't backport them,
- fixed some comments.

For master, here's what I did:
- reverted the test deletion,
- merged 0.4.0 forward to master,
- weakened the test conditions in the failing test, but left the old conditions in the code. We can set ENABLE_BUG_29990_WORKAROUND to false once we've fixed #29990 (moved). (Or revert that commit.)
> Here are three scenarios where the monotime_diff or monotime_absolute functions can return zero:
> (I'll use monotime_absolute below, because I think that's what you're using. But monotime_absolute just calls monotime_diff with the previous monotime value.)

I am still not understanding the issue here. Or I am understanding it differently than you. Let me try to reply here and say what I think you mean/what I think is true.

> The Ratchet Returns Zero
> 1. Tor is running on a platform which has monotime API bugs (Windows), or Tor is compiled without support for monotime functions (old Linux, old macOS, other platforms where we haven't implemented monotime support)
> 2. Tor calls a monotime_absolute() function, and stores the last monotime value

Tor calls the monotime_init() function at startup, and stores the FIRST monotime value.

> 3. Any amount of time elapses
> 4. Tor calls a monotime_absolute() function, and Windows experiences its monotime bug, or the user has changed the wall clock time backwards
> 5. Tor emulates monotonic time using a ratchet. The ratchet stores an internal correction factor for future monotonic times, and returns the same value as it previously returned
> 6. The monotime_absolute() function gets the same value for the previous and current times. It returns zero.

If wall clock time moves back to a time earlier than monotime_init(), AND does so before Tor stores a ratchet value, then the monotime_absolute() function gets the same value for the previous and current times. It returns zero.

> Low Resolution Timers in the OS
> 1. Tor is running on a platform with low-resolution timers
> 2. Tor calls a monotime_absolute() function, and stores the last monotime value

Tor calls the monotime_init() function at startup, and stores the FIRST monotime value.

> 3. A small amount of time elapses, which is lower than the timer resolution
> 4. Tor calls a monotime_absolute() function, which gets the same monotime value as the previous call to the low-resolution timer
> 5. The monotime_absolute() function gets the same value for the previous and current times. It returns zero.

The monotime_absolute() function gets the same time value as monotime_init(), because less than the timer resolution amount of time has elapsed since monotime_init() was called. Then, and only then, does it return 0.

> Low Resolution Timer APIs in Tor
> 1. Tor is running on any platform
> 2. Tor calls a monotime_absolute() function with low-resolution units (seconds, milliseconds, or microseconds) and stores the last monotime value

Tor calls the monotime_init() function at startup, and stores an initialization time in low-resolution units.

> 3. A small amount of time elapses, which is less than one unit
> 4. Tor calls a monotime_absolute() function, which gets a monotime value which is greater than the previous call to the low-resolution timer, but less than one unit away from the previous timer.
> 5. The monotime_absolute() function gets less than one unit's difference between the previous and current times. It divides by the unit conversion factor, and returns zero. (These functions truncate, rather than rounding.)

The monotime_absolute() function gets the same time value as monotime_init(), because less than the timer resolution amount of time has elapsed since monotime_init() was called. Then, and only then, does it return 0.
In all but the first case (monotime Windows bugs), the monotime_absolute_usec() functions cannot return 0 after more than the time resolution period has passed since monotime_init(). Even in the first case, monotime_absolute_usec() cannot return 0 unless the clock moves backwards before the ratchet value is stored. In all other cases, the ratchet value should be stored as a non-zero value and all subsequent calls to monotime_absolute_usec() are non-zero...
Furthermore, I'm deeply confused about how in #29990 (moved), host and VM time has any effect on our mocked monotime values for unit tests. I think what we're actually looking at here is that the unittests themselves are incrementing monotime below the resolution of some platforms, which is why I tried to fix this with changes to just the unittests... but I'll take my comments about #29990 (moved) there.
I would request that we not merge hacky changes to the unittests for the #29990 (moved) workarounds. If we're going to hack this, let's hack it so monotime_init() (or the ratchet) has a non-zero value, and then we can go through circpad for the rare cases where the difference between monotime_init() and the first ratchet update is 0 (or negative), in production (which I still believe should be rare/impossible).
> I would request that we not merge hacky changes to the unittests for the #29990 (moved) workarounds.
Ok, I'll close my pull requests on this ticket, and mark #29990 (moved) as 040-backport.
> If we're going to hack this, let's hack it so monotime_init() (or the ratchet) has a non-zero value, and then we can go through circpad for the rare cases where the difference between monotime_init() and the first ratchet update is 0 (or negative), in production (which I still believe should be rare/impossible).
These situations are rare, but they can happen. And they can happen more than you think.
Any pre-ratchet monotime update can be zero or negative, because the Windows API and gettimeofday() don't provide monotonic source times. So any number of calls to the monotonic time functions may return the same value.
Would you like to open a new ticket to test and fix these zero delta issues in circuitpad?
#29990 (moved) is now about another bug, and this ticket is full of comments.
Fixing tags: we must fix this issue before we turn on circuitpadding. (Or we should document the issues in code comments and the spec, and triage it out.)
Trac:
Keywords: tor-ci-fail-sometimes, 041-proposed, 040-must deleted, 041-must added
Milestone: Tor: 0.4.0.x-final to Tor: 0.4.1.x-final
Owner: teor to mikeperry
Version: N/A to Tor: 0.4.0.1-alpha
Status: new to assigned
I made a list of all the circumstances where we've seen zero monotonic time deltas (or negative deltas, before we added the ratchet). I did some more analysis, and I think those cases are rare. If they become common, we should notice, because our monotonic time unit tests in test/util will fail.
So I don't think we need to block WTF-PAD or 041 on that issue.
I backported Mike's extra commit to 0.4.0, then applied the same changes as my previous 0.4.0 and master PRs. There was one extra data structure name change that I needed to undo then re-apply.