Broken circuitpadding unittests on appveyor

changed milestone to %Tor: 0.4.0.x-final

Trac:
Child Ticket(s): #29990 (moved)

added 040-backport 040-must actualpoints::3.5 asn-merge component::core tor/tor milestone::Tor: 0.4.0.x-final nickm-merge owner::mikeperry padding points::3 priority::high resolution::fixed reviewer::asn reviewer::nickm reviewer::teor severity::normal sponsor::2 status::closed tor-cell tor-relay type::defect version::tor 0.4.0.1-alpha wtf-pad labels

Trac:
Description: This appveyor built failed two unittests: https://ci.appveyor.com/project/torproject/tor/builds/22350196/job/457arqlgfgf6b0a3

circuitpadding/circuitpadding_tokens: [forking] 
  FAIL ../src/test/test_circuitpadding.c:1125: assert(mi->histogram[4] OP_EQ 2): 0 vs 2
  [circuitpadding_tokens FAILED]
circuitpadding/circuitpadding_rtt: [forking] 
  FAIL ../src/test/test_circuitpadding.c:324: assert(relay_side->padding_info[0]->last_received_time_usec OP_NE 0): 0 vs 0
  [circuitpadding_rtt FAILED]

to

This x86 appveyor build failed two unittests: https://ci.appveyor.com/project/torproject/tor/builds/22350196/job/457arqlgfgf6b0a3

circuitpadding/circuitpadding_tokens: [forking] 
  FAIL ../src/test/test_circuitpadding.c:1125: assert(mi->histogram[4] OP_EQ 2): 0 vs 2
  [circuitpadding_tokens FAILED]
circuitpadding/circuitpadding_rtt: [forking] 
  FAIL ../src/test/test_circuitpadding.c:324: assert(relay_side->padding_info[0]->last_received_time_usec OP_NE 0): 0 vs 0
  [circuitpadding_rtt FAILED]

Trac:
Priority: Medium to High
Milestone: Tor: unspecified to Tor: 0.4.0.x-final

Was this continually failing or was this the first failure?

Also do we have instructions for running the unit tests in Wine? I am not sure how to reproduce this, especially if it was intermittent, which I assume it is.

I think this was the first failure. Perhaps it would trigger in a loop.

I think the best way to run unittests in Wine is using Alex's https://github.com/ahf/tor-win32/

Marking tickets as 040-must based on triage with dgoulet.

Trac:
Keywords: N/A deleted, 040-must added

FYI, I ran these two tests in a loop for ~20 hours in wine64 on my laptop (using https://github.com/ahf/tor-win32/ to build) and was unable to reproduce. :/

Trac:
Cc: nickm to nickm, ahf

I ran the tests on 32-bit x86 Windows 10 for ~4.5h yesterday and didn't manage to hit this test failure. I let the same machine run over night with a 64-bit x86 build and also didn't see the test failure there.

MinGW version:

x86_64-w64-mingw32-gcc: 6.3.0 20170516 i686-w64-mingw32-gcc: 6.3.0 20170516

PowerShell code to run the tests in a loop:

while ($true) { .\test.exe circuitpadding/... ; if ($LastExitCode -ne 0) { break; } }

Maybe we should move this to needs_information?

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122 (moved)?

Replying to asn:

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122 (moved)?

Yes, that does seem likely for the tokens test. In fact, I think this should fix it, if we could reproduce: https://github.com/mikeperry-tor/tor/commit/b61cd3709be53dd0aee55111dc0c29b882c31cc6

However, I have no idea why the rtt test is failing. It almost seems like a compiler bug.

Replying to mikeperry:

Replying to asn:

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122 (moved)?

Yes, that does seem likely for the tokens test. In fact, I think this should fix it, if we could reproduce: https://github.com/mikeperry-tor/tor/commit/b61cd3709be53dd0aee55111dc0c29b882c31cc6

The 0.4.0 and master builds all failed due to this bug after #29599 (moved) was merged.

But only the "Image: Visual Studio 2017; Environment: target=x86_64-w64-mingw32..." jobs failed, so it might be timing-sensitive. Or the OS bug might happen reliably on Windows Server 2016.

You can remote desktop to the build machine if you want: https://www.appveyor.com/docs/how-to/rdp-to-build-worker/

It will be easier to set up if you use your own appveyor account, not Tor's.

However, I have no idea why the rtt test is failing. It almost seems like a compiler bug.

It's an OS bug: https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n543 https://www.python.org/dev/peps/pep-0418/#windows-queryperformancecounter

Which tor works around by pretending that no time has elapsed: https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n175

More generally, there is no guarantee that the monotonic clock will have increased by at least 1000 nanoseconds between monotime_init() and circpad_estimate_circ_rtt_on_received()'s call to monotime_absolute_usec(). A non-increasing value is more likely when the monotonic clock's resolution is in milliseconds: two calls can easily return the same value.

But the circuitpadding code assumes that monotime_absolute_usec() is always greater than zero.

Replying to teor:

Replying to mikeperry:

Replying to asn:

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122 (moved)?

Yes, that does seem likely for the tokens test. In fact, I think this should fix it, if we could reproduce: https://github.com/mikeperry-tor/tor/commit/b61cd3709be53dd0aee55111dc0c29b882c31cc6

The 0.4.0 and master builds all failed due to this bug after #29599 (moved) was merged.

But only the "Image: Visual Studio 2017; Environment: target=x86_64-w64-mingw32..." jobs failed, so it might be timing-sensitive. Or the OS bug might happen reliably on Windows Server 2016.

You can remote desktop to the build machine if you want: https://www.appveyor.com/docs/how-to/rdp-to-build-worker/

It will be easier to set up if you use your own appveyor account, not Tor's.

Or you could create your best fix for these issues, get it merged, and then see if it fails again?

It might be easier that messing around with RDP: there's no guarantee you'd be able to reproduce the timing issues on Appveyor within the 1 hour job limit.

I logged #29640 (moved) to improve the monotonic time documentation in compat_time.c

Replying to teor:

Replying to mikeperry:

However, I have no idea why the rtt test is failing. It almost seems like a compiler bug.

It's an OS bug: https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n543 https://www.python.org/dev/peps/pep-0418/#windows-queryperformancecounter

Which tor works around by pretending that no time has elapsed: https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n175

More generally, there is no guarantee that the monotonic clock will have increased by at least 1000 nanoseconds between monotime_init() and circpad_estimate_circ_rtt_on_received()'s call to monotime_absolute_usec(). A non-increasing value is more likely when the monotonic clock's resolution is in milliseconds: two calls can easily return the same value.

But the circuitpadding code assumes that monotime_absolute_usec() is always greater than zero.

But wait, how is this OS bug still happening even though I mocked monotime and monotime_coarse in test_circuitpad_rtt()? I set the mocked absolute time to be 1000 nsec, so monotime_absolute_usec() should be returning 1 in circpad_estimate_circ_rtt_on_received().

Is the problem maybe that sometimes the nsec_per_tick_numer value is very high on some appveyor instances? Would using a larger initial mocked time than 1000 nsec solve the issue for this test?

Ok, I changed the monotime mocking to start at 1000 usec rather than 1000 nsec, on the wild guess that this is a precision issue on some platforms due to our mocking. That fix, plus the tokens test fix are at https://github.com/torproject/tor/pull/757

Might as well set needs_review. I suppose there could be an argument for handling the case where monotime_absolute_usec() might return 0 in the wild, but I don't think that is actually possible there (this code only runs on relays upon receiving valid onionskins for circuit extension.. lots of time should have passed since monotime_init() at that point).

Trac:
Status: new to needs_review

Replying to mikeperry:

... Might as well set needs_review. I suppose there could be an argument for handling the case where monotime_absolute_usec() might return 0 in the wild, but I don't think that is actually possible there (this code only runs on relays upon receiving valid onionskins for circuit extension.. lots of time should have passed since monotime_init() at that point).

It doesn't matter how much time has passed, the monotime_diff_() functions can still return zero. So all code that uses the monotime_diff_() functions needs to handle zero correctly.

See #29640 (moved), in particular:

Zero deltas happen more often:
 *  - on Windows (due to an OS bug),
 *  - when using monotime_coarse, or on systems with low-resolution timers,
 *  - on platforms where we emulate monotonic time using wall-clock time, and
 *  - when using time units that are larger than nanoseconds (due to
 *    truncation on division).

Trac:
Status: needs_review to needs_revision

I added a better explanation to #29640 (moved) in response to your comments, and nickm's comments:

 * Zero time differences can happen on some machines, even after a large
 * amount of time has elapsed. Our ratchet corrects non-monotonic times and
 * returns a zero time difference:
 *  - on Windows (due to an OS bug that yields non-monotonic times from the
 *    monotonic time API),
 *  - on platforms where we emulate monotonic time using wall-clock time
 *    (after wall clock time is set to an earlier time).

Looks like this was triggered again in https://ci.appveyor.com/project/ahf/tor/build/1.0.78/job/t57vj493837qgvd4#L5547

Hi Mike,

I noticed your question during the network team meeting:

mikeperry: the #29500 one is annoying. I think we need another flag or something. I also only barely understand how this could ever happen in a real scenario
mikeperry: it seems almost impossible
nickm: mikeperry: maybe add the appropriate assertions that should never be hit, so that we can figure out why it would be happening sometime anyway?
nickm: diagnosis is the next best thing to a fix

Here are three scenarios where the monotime_diff or monotime_absolute functions can return zero: (I'll use monotime_absolute below, because I think that's what you're using. But monotime_absolute just calls monotime_diff with the previous monotime value.)

The Ratchet Returns Zero

Tor is running on a platform which has monotime API bugs (Windows), or Tor is compiled without support for monotime functions (old Linux, old macOS, other platforms where we haven't implemented monotime support)
Tor calls a monotime_absolute() function, and stores the last monotime value
Any amount of time elapses
Tor calls a monotime_absolute() function, and Windows experiences its monotime bug, or the user has changed the wall clock time backwards
Tor emulates monotonic time using a ratchet. The ratchet stores an internal correction factor for future monotonic times, and returns the same value as it previously returned
The monotime_absolute() function gets the same value for the previous and current times. It returns zero.

Low Resolution Timers in the OS

Tor is running on a platform with low-resolution timers
Tor calls a monotime_absolute() function, and stores the last monotime value
A small amount of time elapses, which is lower than the timer resolution
Tor calls a monotime_absolute() function, which gets the same monotime value as the previous call to the low-resolution timer
The monotime_absolute() function gets the same value for the previous and current times. It returns zero.

Low Resolution Timer APIs in Tor

Tor is running on any platform
Tor calls a monotime_absolute() function with low-resolution units (seconds, milliseconds, or microseconds) and stores the last monotime value
A small amount of time elapses, which is less than one unit
Tor calls a monotime_absolute() function, which gets a monotime value which is greater than the previous call to the low-resolution timer, but less than one unit away from the previous timer.
The monotime_absolute() function gets less than one unit's difference between the previous and current times. It divides by the unit conversion factor, and returns zero. (These functions truncate, rather than rounding.)

I hope that explains the scenarios that Tor code needs to deal with when calling Tor's monotime APIs.

If you have any more questions, please feel free to ask on this ticket. Sometimes I miss questions on IRC or in meetings.

Edit: apparently I can't count small integers

Mike has a pull request in this ticket.

Trac:
Owner: N/A to mikeperry
Status: needs_revision to assigned

This bug caused #29824 (moved) to fail:

circuitpadding/circuitpadding_tokens: [forking] 
  FAIL ../src/test/test_circuitpadding.c:1150: assert(mi->histogram[4] OP_EQ 2): 0 vs 2
  [circuitpadding_tokens FAILED]
circuitpadding/circuitpadding_negotiation: [forking] OK
circuitpadding/circuitpadding_wronghop: [forking] OK
circuitpadding/circuitpadding_conditions: [forking] OK
circuitpadding/circuitpadding_rtt: [forking] 
  FAIL ../src/test/test_circuitpadding.c:322: assert(relay_side->padding_info[0]->last_received_time_usec OP_NE 0): 0 vs 0
  [circuitpadding_rtt FAILED]
circuitpadding/circuitpadding_sample_distribution: [forking] OK

https://ci.appveyor.com/project/torproject/tor/builds/23230458/job/e8is73u02ymt6xfl?fullLog=true

Trac:
Status: assigned to needs_revision

This bug caused the 0.4.0 changelog merge to release-0.4.0 to fail:

circuitpadding/circuitpadding_tokens: [forking] 
  FAIL ../src/test/test_circuitpadding.c:1124: assert(mi->histogram[4] OP_EQ 2): 0 vs 2
  [circuitpadding_tokens FAILED]
circuitpadding/circuitpadding_negotiation: [forking] OK
circuitpadding/circuitpadding_wronghop: [forking] OK
circuitpadding/circuitpadding_conditions: [forking] OK
circuitpadding/circuitpadding_rtt: [forking] 
  FAIL ../src/test/test_circuitpadding.c:323: assert(relay_side->padding_info[0]->last_received_time_usec OP_NE 0): 0 vs 0
  [circuitpadding_rtt FAILED]

https://ci.appveyor.com/project/torproject/tor/builds/23252909/job/4ax058o810mck9xe?fullLog=true#L3043

Trac:
Keywords: N/A deleted, tor-ci-fail-sometimes added

Un-assigning 040-must bugs from Mike, because he's overloaded. We'll work out what to do with these tickets at the meeting.

Trac:
Owner: mikeperry to N/A
Status: needs_revision to assigned

Trac:
Owner: N/A to teor

Trac:
Reviewer: N/A to nickm

I don't know which tests fail intermittently with Mike's PR 757, and which succeed. So I'll assume that it is an improvement, and keep working on it until it passes.

After we merge, we should open tickets for further failures.

I have a draft branch here, which I am checking against CI: https://github.com/torproject/tor/pull/885

I guessed the points, because I don't think mikeperry has time to update them this week.

Trac:
Actualpoints: N/A to 3
Points: N/A to 3
Status: assigned to needs_revision

Move all WTF-PAD tickets that do not have a sponsor into sponsor2.

Trac:
Sponsor: N/A to Sponsor2

For 0.4.0, here's what I did:

rebased Mike's PR 757 to the current maint-0.4.0
deleted a failing test in 0.4.0, because we merged some required fixes into master, but didn't backport them,
fixed some comments.

For master, here's what I did:

reverted the test deletion,
merged 0.4.0 forward to master,
weakened the test conditions in the failing test, but left the old conditions in the code. We can set ENABLE_BUG_29990_WORKAROUND to false once we've fixed #29990 (moved). (Or revert that commit.)

Here are the pull requests:

0.4.0: https://github.com/torproject/tor/pull/893
master: https://github.com/torproject/tor/pull/885

The tests pass on my macOS VM with and without a skewed clock.

Trac:
Status: needs_revision to needs_review

Did a review: Just a small question on one of the commits, and also the commit msg says unrealiable.

I think we are ready to get this fixed after this.

Trac:
Status: needs_review to needs_revision

Replying to teor:

Here are three scenarios where the monotime_diff or monotime_absolute functions can return zero: (I'll use monotime_absolute below, because I think that's what you're using. But monotime_absolute just calls monotime_diff with the previous monotime value.)

I am still not understanding the issue here. Or I am understanding differently than you. Let me try to reply here and say what I think you mean/what I think is true.

The Ratchet Returns Zero

Tor is running on a platform which has monotime API bugs (Windows), or Tor is compiled without support for monotime functions (old Linux, old macOS, other platforms where we haven't implemented monotime support)

Tor calls a monotime_absolute() function, and stores the last monotime value

Tor calls a monotime_init() function at startup, and stores the FIRST monotime value.

Any amount of time elapses

Tor calls a monotime_absolute() function, and Windows experiences its monotime bug, or the user has changed the wall clock time backwards

Tor emulates monotonic time using a ratchet. The ratchet stores an internal correction factor for future monotonic times, and returns the same value as it previously returned

The monotime_absolute() function gets the same value for the previous and current times. It returns zero.

If wall clock time moved back to a time period earlier than monotime_init() AND does this before Tor stores a ratchet value, then monotime_absolute() function gets the same value for the previous and current times. It returns zero.

Low Resolution Timers in the OS

Tor is running on a platform with low-resolution timers

Tor calls a monotime_absolute() function, and stores the last monotime value

Tor calls monotime_init() function at startup, and stores the FIRST monotime value.

A small amount of time elapses, which is lower than the timer resolution

Tor calls a monotime_absolute() function, which gets the same monotime value as the previous call to the low-resolution timer

The monotime_absolute() function gets the same value for the previous and current times. It returns zero.

The monotime_absolute() function gets the same time value as monotime_init(), because less than the timer resolution amount of time has elapsed since monotime_init() was called. Then, and only then, it returns 0.

Low Resolution Timer APIs in Tor

Tor is running on any platform

Tor calls a monotime_absolute() function with low-resolution units (seconds, milliseconds, or microseconds) and stores the last monotime value

Tor calls monotime_init() function at startup, and stores an initialization time in low-resolution units.

A small amount of time elapses, which is less than one unit

Tor calls a monotime_absolute() function, which gets a monotime value which is greater than the previous call to the low-resolution timer, but less than one unit away from the previous timer.

The monotime_absolute() function gets less than one unit's difference between the previous and current times. It divides by the unit conversion factor, and returns zero. (These functions truncate, rather than rounding.)

The monotime_absolute() function gets the same time value as monotime_init(), because less than the timer resolution amount of time has elapsed since monotime_init() was called. Then, and only then, it returns 0.

In all but the first case (monotime Windows bugs), the monotime_absolute_usec() functions cannot return 0 after more than the time resolution period has passed since monotime_init(). Even in the first case, monotime_absolute_usec() cannot return 0 unless the clock moves backwards before the ratchet value is stored. In all other cases, the ratchet value should be stored as a non-zero value and all subsequent calls to monotime_absolute_usec() are non-zero...

Furthermore, I'm deeply confused about how in #29990 (moved), host and VM time has any effect on our mocked monotime values for unit tests. I think what we're actually looking at here is that the unittests themselves are incrementing monotime below the resolution of some platforms, which I why I tried to fix this with changes to just the unittests... but I'll take my comments about #29990 (moved) there.

I would request that we not merge hacky changes to the unittests for the #29990 (moved) workarounds. If we're going to hack this, let's hack it so monotime_init() (or the ratchet) has a non-zero value, and then we can go through circpad for the rare cases where the difference between monotime_init() and the first ratchet update is 0 (or negative), in production (which I still believe should be rare/impossible).

Replying to mikeperry:

I would request that we not merge hacky changes to the unittests for the #29990 (moved) workarounds.

Ok, I'll close my pull requests on this ticket, and mark #29990 (moved) as 040-backport.

If we're going to hack this, let's hack it so monotime_init() (or the ratchet) has a non-zero value, and then we can go through circpad for the rare cases where the difference between monotime_init() and the first ratchet update is 0 (or negative), in production (which I still believe should be rare/impossible).

These situations are rare, but they can happen. And they can happen more than you think.

Any pre-ratchet monotime update can be zero or negative, because the Windows API and gettimeofday() don't provide monotonic source times. So any number of calls to the monotonic time functions may return the same value.

Would you like to open a new ticket to test and fix these zero delta issues in circuitpad? #29990 (moved) is now about another bug, and this ticket is full of comments.

Trac:
Status: needs_revision to new

Fixing tags: we must fix this issue before we turn on circuitpadding. (Or we should document the issues in code comments and the spec, and triage it out.)

Trac:
Keywords: tor-ci-fail-sometimes, 041-proposed, 040-must deleted, 041-must added
Milestone: Tor: 0.4.0.x-final to Tor: 0.4.1.x-final
Owner: teor to mikeperry
Version: N/A to Tor: 0.4.0.1-alpha
Status: new to assigned

Ok, here's a PR to fix the test flapping (I hope -- again I have been unable to reproduce this, and not for lack of trying): https://github.com/torproject/tor/pull/905

Trac:
Status: assigned to needs_review

Thanks for the patch. I think I can backport it to 0.4.0 without too much trouble.

I closed #29990 (moved), and opened #30031 (moved) for the remaining zero delta issue.

I made a list of all the circumstances where we've seen zero monotonic time deltas (or negative deltas, before we added the ratchet). I did some more analysis, and I think thoses cases are rare. If they become common, we should notice, because our monotonic time unit tests in test/util will fail.

So I don't think we need to block WTF-PAD or 041 on that issue.

Trac:
Actualpoints: 3 to 3.5
Keywords: 041-must deleted, 040-backport, 040-must added

I backported Mike's extra commit to 0.4.0, then applied the same changes as my previous 0.4.0 and master PRs. There was one extra data structure name change that I needed to undo then re-apply.

Here are the final pull requests:

0.4.0: https://github.com/torproject/tor/pull/906
master: https://github.com/torproject/tor/pull/907

So I think everything has been reviewed here multiple times, and we should just merge. (But I shouldn't merge, because I did the final PRs.)

Trac:
Keywords: N/A deleted, nickm-merge, asn-merge added
Reviewer: nickm to nickm, asn, teor

LGTM. Pushed to 040 and onwards. Let's hope this fixes the issue. If not, let's reopen.

Trac:
Resolution: N/A to fixed
Status: needs_review to closed

Thanks everyone for your hard work on this tricky bug!

Trac:
Parent: #28631 (moved) to N/A
Milestone: Tor: 0.4.1.x-final to Tor: 0.4.0.x-final

closed

changed time estimate to 24h

added 28h of time spent

mentioned in issue #29599 (moved)

mentioned in issue #29640 (moved)

mentioned in issue #29824 (moved)

Broken circuitpadding unittests on appveyor

Child items ...

Activity