Opened 10 months ago

Closed 8 months ago

Last modified 6 months ago

#29500 closed defect (fixed)

Broken circuitpadding unittests on appveyor

Reported by: asn Owned by: mikeperry
Priority: High Milestone: Tor: 0.4.0.x-final
Component: Core Tor/Tor Version: Tor: 0.4.0.1-alpha
Severity: Normal Keywords: asn-merge, nickm-merge, wtf-pad, tor-relay, tor-cell, padding, 040-must, 040-backport
Cc: nickm, ahf Actual Points: 3.5
Parent ID: Points: 3
Reviewer: nickm, asn, teor Sponsor: Sponsor2

Description (last modified by asn)

This x86 appveyor build failed two unittests: https://ci.appveyor.com/project/torproject/tor/builds/22350196/job/457arqlgfgf6b0a3

circuitpadding/circuitpadding_tokens: [forking] 
  FAIL ../src/test/test_circuitpadding.c:1125: assert(mi->histogram[4] OP_EQ 2): 0 vs 2
  [circuitpadding_tokens FAILED]
circuitpadding/circuitpadding_rtt: [forking] 
  FAIL ../src/test/test_circuitpadding.c:324: assert(relay_side->padding_info[0]->last_received_time_usec OP_NE 0): 0 vs 0
  [circuitpadding_rtt FAILED]

Child Tickets

TicketStatusOwnerSummaryComponent
#29990closedmikeperryHandle zero monotonic time differences in the circuit padding codeCore Tor/Tor

Change History (41)

comment:1 Changed 10 months ago by asn

Description: modified (diff)

comment:2 Changed 10 months ago by nickm

Milestone: Tor: unspecifiedTor: 0.4.0.x-final
Priority: MediumHigh

comment:3 Changed 10 months ago by mikeperry

Was this continually failing or was this the first failure?

Also do we have instructions for running the unit tests in Wine? I am not sure how to reproduce this, especially if it was intermittent, which I assume it is.

comment:4 Changed 10 months ago by asn

I think this was the first failure. Perhaps it would trigger in a loop.

I think the best way to run unittests in Wine is using Alex's https://github.com/ahf/tor-win32/

comment:5 Changed 10 months ago by nickm

Keywords: 040-must added

Marking tickets as 040-must based on triage with dgoulet.

comment:6 Changed 10 months ago by mikeperry

FYI, I ran these two tests in a loop for ~20 hours in wine64 on my laptop (using https://github.com/ahf/tor-win32/ to build) and was unable to reproduce. :/

comment:7 Changed 10 months ago by ahf

Cc: ahf added

comment:8 Changed 10 months ago by ahf

I ran the tests on 32-bit x86 Windows 10 for ~4.5h yesterday and didn't manage to hit this test failure. I let the same machine run over night with a 64-bit x86 build and also didn't see the test failure there.

MinGW version:

x86_64-w64-mingw32-gcc: 6.3.0 20170516
i686-w64-mingw32-gcc: 6.3.0 20170516

PowerShell code to run the tests in a loop:

while ($true) { .\test.exe circuitpadding/... ; if ($LastExitCode -ne 0) { break; } }

Maybe we should move this to needs_information?

comment:9 Changed 10 months ago by asn

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122?

comment:10 in reply to:  9 ; Changed 10 months ago by mikeperry

Replying to asn:

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122?

Yes, that does seem likely for the tokens test. In fact, I think this should fix it, if we could reproduce:
https://github.com/mikeperry-tor/tor/commit/b61cd3709be53dd0aee55111dc0c29b882c31cc6

However, I have no idea why the rtt test is failing. It almost seems like a compiler bug.

comment:11 in reply to:  10 ; Changed 10 months ago by teor

Replying to mikeperry:

Replying to asn:

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122?

Yes, that does seem likely for the tokens test. In fact, I think this should fix it, if we could reproduce:
https://github.com/mikeperry-tor/tor/commit/b61cd3709be53dd0aee55111dc0c29b882c31cc6

The 0.4.0 and master builds all failed due to this bug after #29599 was merged.

But only the "Image: Visual Studio 2017; Environment: target=x86_64-w64-mingw32..." jobs failed, so it might be timing-sensitive. Or the OS bug might happen reliably on Windows Server 2016.

You can remote desktop to the build machine if you want:
https://www.appveyor.com/docs/how-to/rdp-to-build-worker/

It will be easier to set up if you use your own appveyor account, not Tor's.

However, I have no idea why the rtt test is failing. It almost seems like a compiler bug.

It's an OS bug:
https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n543
https://www.python.org/dev/peps/pep-0418/#windows-queryperformancecounter

Which tor works around by pretending that no time has elapsed:
https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n175

More generally, there is no guarantee that the monotonic clock will have increased by at least 1000 nanoseconds between monotime_init() and circpad_estimate_circ_rtt_on_received()'s call to monotime_absolute_usec(). A non-increasing value is more likely when the monotonic clock's resolution is in milliseconds: two calls can easily return the same value.

But the circuitpadding code assumes that monotime_absolute_usec() is always greater than zero.

comment:12 in reply to:  11 Changed 10 months ago by teor

Replying to teor:

Replying to mikeperry:

Replying to asn:

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122?

Yes, that does seem likely for the tokens test. In fact, I think this should fix it, if we could reproduce:
https://github.com/mikeperry-tor/tor/commit/b61cd3709be53dd0aee55111dc0c29b882c31cc6

The 0.4.0 and master builds all failed due to this bug after #29599 was merged.

But only the "Image: Visual Studio 2017; Environment: target=x86_64-w64-mingw32..." jobs failed, so it might be timing-sensitive. Or the OS bug might happen reliably on Windows Server 2016.

You can remote desktop to the build machine if you want:
https://www.appveyor.com/docs/how-to/rdp-to-build-worker/

It will be easier to set up if you use your own appveyor account, not Tor's.

Or you could create your best fix for these issues, get it merged, and then see if it fails again?

It might be easier that messing around with RDP: there's no guarantee you'd be able to reproduce the timing issues on Appveyor within the 1 hour job limit.

comment:13 Changed 10 months ago by teor

I logged #29640 to improve the monotonic time documentation in compat_time.c

comment:14 in reply to:  11 Changed 9 months ago by mikeperry

Replying to teor:

Replying to mikeperry:

However, I have no idea why the rtt test is failing. It almost seems like a compiler bug.

It's an OS bug:
https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n543
https://www.python.org/dev/peps/pep-0418/#windows-queryperformancecounter

Which tor works around by pretending that no time has elapsed:
https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n175

More generally, there is no guarantee that the monotonic clock will have increased by at least 1000 nanoseconds between monotime_init() and circpad_estimate_circ_rtt_on_received()'s call to monotime_absolute_usec(). A non-increasing value is more likely when the monotonic clock's resolution is in milliseconds: two calls can easily return the same value.

But the circuitpadding code assumes that monotime_absolute_usec() is always greater than zero.

But wait, how is this OS bug still happening even though I mocked monotime and monotime_coarse in test_circuitpad_rtt()? I set the mocked absolute time to be 1000 nsec, so
monotime_absolute_usec() should be returning 1 in circpad_estimate_circ_rtt_on_received().

Is the problem maybe that sometimes the nsec_per_tick_numer value is very high on some appveyor instances? Would using a larger initial mocked time than 1000 nsec solve the issue for this test?

comment:15 Changed 9 months ago by mikeperry

Ok, I changed the monotime mocking to start at 1000 usec rather than 1000 nsec, on the wild guess that this is a precision issue on some platforms due to our mocking. That fix, plus the tokens test fix are at https://github.com/torproject/tor/pull/757

Might as well set needs_review. I suppose there could be an argument for handling the case where monotime_absolute_usec() might return 0 in the wild, but I don't think that is actually possible there (this code only runs on relays upon receiving valid onionskins for circuit extension.. lots of time should have passed since monotime_init() at that point).

comment:16 Changed 9 months ago by mikeperry

Status: newneeds_review

comment:17 in reply to:  15 Changed 9 months ago by teor

Status: needs_reviewneeds_revision

Replying to mikeperry:

...
Might as well set needs_review. I suppose there could be an argument for handling the case where monotime_absolute_usec() might return 0 in the wild, but I don't think that is actually possible there (this code only runs on relays upon receiving valid onionskins for circuit extension.. lots of time should have passed since monotime_init() at that point).

It doesn't matter how much time has passed, the monotime_*diff_*() functions can still return zero. So all code that uses the monotime_*diff_*() functions needs to handle zero correctly.

See #29640, in particular:

Zero deltas happen more often:
 *  - on Windows (due to an OS bug),
 *  - when using monotime_coarse, or on systems with low-resolution timers,
 *  - on platforms where we emulate monotonic time using wall-clock time, and
 *  - when using time units that are larger than nanoseconds (due to
 *    truncation on division).

comment:18 Changed 9 months ago by teor

I added a better explanation to #29640 in response to your comments, and nickm's comments:

 * Zero time differences can happen on some machines, even after a large
 * amount of time has elapsed. Our ratchet corrects non-monotonic times and
 * returns a zero time difference:
 *  - on Windows (due to an OS bug that yields non-monotonic times from the
 *    monotonic time API),
 *  - on platforms where we emulate monotonic time using wall-clock time
 *    (after wall clock time is set to an earlier time).

comment:20 Changed 9 months ago by teor

Hi Mike,

I noticed your question during the network team meeting:

mikeperry: the #29500 one is annoying. I think we need another flag or something. I also only barely understand how this could ever happen in a real scenario
mikeperry: it seems almost impossible
nickm: mikeperry: maybe add the appropriate assertions that should never be hit, so that we can figure out why it would be happening sometime anyway?
nickm: diagnosis is the next best thing to a fix

Here are three scenarios where the monotime_diff or monotime_absolute functions can return zero:
(I'll use monotime_absolute below, because I think that's what you're using. But monotime_absolute just calls monotime_diff with the previous monotime value.)

The Ratchet Returns Zero

  1. Tor is running on a platform which has monotime API bugs (Windows), or Tor is compiled without support for monotime functions (old Linux, old macOS, other platforms where we haven't implemented monotime support)
  2. Tor calls a monotime_absolute() function, and stores the last monotime value
  3. Any amount of time elapses
  4. Tor calls a monotime_absolute() function, and Windows experiences its monotime bug, or the user has changed the wall clock time backwards
  5. Tor emulates monotonic time using a ratchet. The ratchet stores an internal correction factor for future monotonic times, and returns the same value as it previously returned
  6. The monotime_absolute() function gets the same value for the previous and current times. It returns zero.

Low Resolution Timers in the OS

  1. Tor is running on a platform with low-resolution timers
  2. Tor calls a monotime_absolute() function, and stores the last monotime value
  3. A small amount of time elapses, which is lower than the timer resolution
  4. Tor calls a monotime_absolute() function, which gets the same monotime value as the previous call to the low-resolution timer
  5. The monotime_absolute() function gets the same value for the previous and current times. It returns zero.

Low Resolution Timer APIs in Tor

  1. Tor is running on any platform
  2. Tor calls a monotime_absolute() function with low-resolution units (seconds, milliseconds, or microseconds) and stores the last monotime value
  3. A small amount of time elapses, which is less than one unit
  4. Tor calls a monotime_absolute() function, which gets a monotime value which is greater than the previous call to the low-resolution timer, but less than one unit away from the previous timer.
  5. The monotime_absolute() function gets less than one unit's difference between the previous and current times. It divides by the unit conversion factor, and returns zero. (These functions truncate, rather than rounding.)

I hope that explains the scenarios that Tor code needs to deal with when calling Tor's monotime APIs.

If you have any more questions, please feel free to ask on this ticket. Sometimes I miss questions on IRC or in meetings.

Edit: apparently I can't count small integers

Last edited 9 months ago by teor (previous) (diff)

comment:21 Changed 9 months ago by teor

Owner: set to mikeperry
Status: needs_revisionassigned

Mike has a pull request in this ticket.

comment:22 Changed 9 months ago by teor

Status: assignedneeds_revision

This bug caused #29824 to fail:

circuitpadding/circuitpadding_tokens: [forking] 
  FAIL ../src/test/test_circuitpadding.c:1150: assert(mi->histogram[4] OP_EQ 2): 0 vs 2
  [circuitpadding_tokens FAILED]
circuitpadding/circuitpadding_negotiation: [forking] OK
circuitpadding/circuitpadding_wronghop: [forking] OK
circuitpadding/circuitpadding_conditions: [forking] OK
circuitpadding/circuitpadding_rtt: [forking] 
  FAIL ../src/test/test_circuitpadding.c:322: assert(relay_side->padding_info[0]->last_received_time_usec OP_NE 0): 0 vs 0
  [circuitpadding_rtt FAILED]
circuitpadding/circuitpadding_sample_distribution: [forking] OK

https://ci.appveyor.com/project/torproject/tor/builds/23230458/job/e8is73u02ymt6xfl?fullLog=true

comment:23 Changed 9 months ago by teor

This bug caused the 0.4.0 changelog merge to release-0.4.0 to fail:

circuitpadding/circuitpadding_tokens: [forking] 
  FAIL ../src/test/test_circuitpadding.c:1124: assert(mi->histogram[4] OP_EQ 2): 0 vs 2
  [circuitpadding_tokens FAILED]
circuitpadding/circuitpadding_negotiation: [forking] OK
circuitpadding/circuitpadding_wronghop: [forking] OK
circuitpadding/circuitpadding_conditions: [forking] OK
circuitpadding/circuitpadding_rtt: [forking] 
  FAIL ../src/test/test_circuitpadding.c:323: assert(relay_side->padding_info[0]->last_received_time_usec OP_NE 0): 0 vs 0
  [circuitpadding_rtt FAILED]

https://ci.appveyor.com/project/torproject/tor/builds/23252909/job/4ax058o810mck9xe?fullLog=true#L3043

comment:24 Changed 9 months ago by teor

Keywords: tor-ci-fail-sometimes added

comment:25 Changed 9 months ago by teor

Owner: mikeperry deleted
Status: needs_revisionassigned

Un-assigning 040-must bugs from Mike, because he's overloaded.
We'll work out what to do with these tickets at the meeting.

comment:26 Changed 9 months ago by teor

Owner: set to teor

comment:27 Changed 9 months ago by teor

Reviewer: nickm

comment:28 Changed 9 months ago by teor

I don't know which tests fail intermittently with Mike's PR 757, and which succeed.
So I'll assume that it is an improvement, and keep working on it until it passes.

After we merge, we should open tickets for further failures.

comment:29 Changed 9 months ago by teor

Actual Points: 3
Points: 3
Status: assignedneeds_revision

I have a draft branch here, which I am checking against CI:
https://github.com/torproject/tor/pull/885

I guessed the points, because I don't think mikeperry has time to update them this week.

comment:30 Changed 9 months ago by teor

Sponsor: Sponsor2

Move all WTF-PAD tickets that do not have a sponsor into sponsor2.

comment:31 Changed 8 months ago by teor

Status: needs_revisionneeds_review

For 0.4.0, here's what I did:

  • rebased Mike's PR 757 to the current maint-0.4.0
  • deleted a failing test in 0.4.0, because we merged some required fixes into master, but didn't backport them,
  • fixed some comments.

For master, here's what I did:

  • reverted the test deletion,
  • merged 0.4.0 forward to master,
  • weakened the test conditions in the failing test, but left the old conditions in the code. We can set ENABLE_BUG_29990_WORKAROUND to false once we've fixed #29990. (Or revert that commit.)

Here are the pull requests:

The tests pass on my macOS VM with and without a skewed clock.

comment:32 Changed 8 months ago by asn

Status: needs_reviewneeds_revision

Did a review: Just a small question on one of the commits, and also the commit msg says unrealiable.

I think we are ready to get this fixed after this.

comment:33 in reply to:  20 ; Changed 8 months ago by mikeperry

Replying to teor:

Here are three scenarios where the monotime_diff or monotime_absolute functions can return zero:
(I'll use monotime_absolute below, because I think that's what you're using. But monotime_absolute just calls monotime_diff with the previous monotime value.)

I am still not understanding the issue here. Or I am understanding differently than you. Let me try to reply here and say what I think you mean/what I think is true.

The Ratchet Returns Zero

  1. Tor is running on a platform which has monotime API bugs (Windows), or Tor is compiled without support for monotime functions (old Linux, old macOS, other platforms where we haven't implemented monotime support)
  2. Tor calls a monotime_absolute() function, and stores the last monotime value
  1. Tor calls a monotime_init() function at startup, and stores the FIRST monotime value.
  1. Any amount of time elapses
  2. Tor calls a monotime_absolute() function, and Windows experiences its monotime bug, or the user has changed the wall clock time backwards
  3. Tor emulates monotonic time using a ratchet. The ratchet stores an internal correction factor for future monotonic times, and returns the same value as it previously returned
  4. The monotime_absolute() function gets the same value for the previous and current times. It returns zero.
  1. If wall clock time moved back to a time period earlier than monotime_init() AND does this before Tor stores a ratchet value, then monotime_absolute() function gets the same value for the previous and current times. It returns zero.

Low Resolution Timers in the OS

  1. Tor is running on a platform with low-resolution timers
  2. Tor calls a monotime_absolute() function, and stores the last monotime value
  1. Tor calls monotime_init() function at startup, and stores the FIRST monotime value.
  1. A small amount of time elapses, which is lower than the timer resolution
  2. Tor calls a monotime_absolute() function, which gets the same monotime value as the previous call to the low-resolution timer
  3. The monotime_absolute() function gets the same value for the previous and current times. It returns zero.
  1. The monotime_absolute() function gets the same time value as monotime_init(), because less than the timer resolution amount of time has elapsed since monotime_init() was called. Then, and only then, it returns 0.

Low Resolution Timer APIs in Tor

  1. Tor is running on any platform
  2. Tor calls a monotime_absolute() function with low-resolution units (seconds, milliseconds, or microseconds) and stores the last monotime value
  1. Tor calls monotime_init() function at startup, and stores an initialization time in low-resolution units.
  1. A small amount of time elapses, which is less than one unit
  2. Tor calls a monotime_absolute() function, which gets a monotime value which is greater than the previous call to the low-resolution timer, but less than one unit away from the previous timer.
  3. The monotime_absolute() function gets less than one unit's difference between the previous and current times. It divides by the unit conversion factor, and returns zero. (These functions truncate, rather than rounding.)
  1. The monotime_absolute() function gets the same time value as monotime_init(), because less than the timer resolution amount of time has elapsed since monotime_init() was called. Then, and only then, it returns 0.

In all but the first case (monotime Windows bugs), the monotime_absolute_usec() functions cannot return 0 after more than the time resolution period has passed since monotime_init(). Even in the first case, monotime_absolute_usec() cannot return 0 unless the clock moves backwards before the ratchet value is stored. In all other cases, the ratchet value should be stored as a non-zero value and all subsequent calls to monotime_absolute_usec() are non-zero...

Furthermore, I'm deeply confused about how in #29990, host and VM time has *any* effect on our mocked monotime values for unit tests. I think what we're actually looking at here is that the unittests themselves are incrementing monotime below the resolution of some platforms, which I why I tried to fix this with changes to just the unittests... but I'll take my comments about #29990 there.

I would request that we not merge hacky changes to the unittests for the #29990 workarounds. If we're going to hack this, let's hack it so monotime_init() (or the ratchet) has a non-zero value, and then we can go through circpad for the rare cases where the difference between monotime_init() and the first ratchet update is 0 (or negative), in production (which I still believe should be rare/impossible).

comment:34 in reply to:  33 Changed 8 months ago by teor

Status: needs_revisionnew

Replying to mikeperry:

I would request that we not merge hacky changes to the unittests for the #29990 workarounds.

Ok, I'll close my pull requests on this ticket, and mark #29990 as 040-backport.

If we're going to hack this, let's hack it so monotime_init() (or the ratchet) has a non-zero value, and then we can go through circpad for the rare cases where the difference between monotime_init() and the first ratchet update is 0 (or negative), in production (which I still believe should be rare/impossible).

These situations are rare, but they can happen. And they can happen more than you think.

*Any* pre-ratchet monotime update can be zero or negative, because the Windows API and gettimeofday() don't provide monotonic source times. So any number of calls to the monotonic time functions may return the same value.

Would you like to open a new ticket to test and fix these zero delta issues in circuitpad?
#29990 is now about another bug, and this ticket is full of comments.

comment:35 Changed 8 months ago by teor

Keywords: 041-must added; 041-proposed 040-must tor-ci-fail-sometimes removed
Milestone: Tor: 0.4.0.x-finalTor: 0.4.1.x-final
Owner: changed from teor to mikeperry
Status: newassigned
Version: Tor: 0.4.0.1-alpha

Fixing tags: we must fix this issue before we turn on circuitpadding. (Or we should document the issues in code comments and the spec, and triage it out.)

comment:36 Changed 8 months ago by mikeperry

Status: assignedneeds_review

Ok, here's a PR to fix the test flapping (I hope -- again I have been unable to reproduce this, and not for lack of trying): https://github.com/torproject/tor/pull/905

comment:37 Changed 8 months ago by teor

Actual Points: 33.5
Keywords: 040-must 040-backport added; 041-must removed

Thanks for the patch. I think I can backport it to 0.4.0 without too much trouble.

I closed #29990, and opened #30031 for the remaining zero delta issue.

I made a list of all the circumstances where we've seen zero monotonic time deltas (or negative deltas, before we added the ratchet). I did some more analysis, and I think thoses cases are rare. If they become common, we should notice, because our monotonic time unit tests in test/util will fail.

So I don't think we need to block WTF-PAD or 041 on that issue.

comment:38 Changed 8 months ago by teor

Keywords: asn-merge nickm-merge added
Reviewer: nickmnickm, asn, teor

I backported Mike's extra commit to 0.4.0, then applied the same changes as my previous 0.4.0 and master PRs. There was one extra data structure name change that I needed to undo then re-apply.

Here are the final pull requests:

So I think everything has been reviewed here multiple times, and we should just merge. (But I shouldn't merge, because I did the final PRs.)

comment:39 Changed 8 months ago by asn

Resolution: fixed
Status: needs_reviewclosed

LGTM. Pushed to 040 and onwards.
Let's hope this fixes the issue. If not, let's reopen.

comment:40 Changed 8 months ago by teor

Thanks everyone for your hard work on this tricky bug!

comment:41 Changed 6 months ago by teor

Milestone: Tor: 0.4.1.x-finalTor: 0.4.0.x-final
Parent ID: #28631
Note: See TracTickets for help on using tickets.