Opened 5 weeks ago

Last modified 13 days ago

#29500 needs_revision defect

Broken circuitpadding unittests on appveyor

Reported by: asn Owned by:
Priority: High Milestone: Tor: 0.4.0.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: wtf-pad, tor-relay, tor-cell, padding, 041-proposed, 040-must
Cc: nickm, ahf Actual Points:
Parent ID: #28631 Points:
Reviewer: Sponsor:

Description (last modified by asn)

This x86 appveyor build failed two unittests: https://ci.appveyor.com/project/torproject/tor/builds/22350196/job/457arqlgfgf6b0a3

circuitpadding/circuitpadding_tokens: [forking] 
  FAIL ../src/test/test_circuitpadding.c:1125: assert(mi->histogram[4] OP_EQ 2): 0 vs 2
  [circuitpadding_tokens FAILED]
circuitpadding/circuitpadding_rtt: [forking] 
  FAIL ../src/test/test_circuitpadding.c:324: assert(relay_side->padding_info[0]->last_received_time_usec OP_NE 0): 0 vs 0
  [circuitpadding_rtt FAILED]

Child Tickets

Change History (19)

comment:1 Changed 5 weeks ago by asn

Description: modified (diff)

comment:2 Changed 5 weeks ago by nickm

Milestone: Tor: unspecifiedTor: 0.4.0.x-final
Priority: MediumHigh

comment:3 Changed 5 weeks ago by mikeperry

Was this continually failing or was this the first failure?

Also do we have instructions for running the unit tests in Wine? I am not sure how to reproduce this, especially if it was intermittent, which I assume it is.

comment:4 Changed 4 weeks ago by asn

I think this was the first failure. Perhaps it would trigger in a loop.

I think the best way to run unittests in Wine is using Alex's https://github.com/ahf/tor-win32/

comment:5 Changed 4 weeks ago by nickm

Keywords: 040-must added

Marking tickets as 040-must based on triage with dgoulet.

comment:6 Changed 4 weeks ago by mikeperry

FYI, I ran these two tests in a loop for ~20 hours in wine64 on my laptop (using https://github.com/ahf/tor-win32/ to build) and was unable to reproduce. :/

comment:7 Changed 3 weeks ago by ahf

Cc: ahf added

comment:8 Changed 3 weeks ago by ahf

I ran the tests on 32-bit x86 Windows 10 for ~4.5h yesterday and didn't manage to hit this test failure. I let the same machine run over night with a 64-bit x86 build and also didn't see the test failure there.

MinGW version:

x86_64-w64-mingw32-gcc: 6.3.0 20170516
i686-w64-mingw32-gcc: 6.3.0 20170516

PowerShell code to run the tests in a loop:

while ($true) { .\test.exe circuitpadding/... ; if ($LastExitCode -ne 0) { break; } }

Maybe we should move this to needs_information?

comment:9 Changed 3 weeks ago by asn

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122?

comment:10 in reply to:  9 ; Changed 3 weeks ago by mikeperry

Replying to asn:

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122?

Yes, that does seem likely for the tokens test. In fact, I think this should fix it, if we could reproduce:
https://github.com/mikeperry-tor/tor/commit/b61cd3709be53dd0aee55111dc0c29b882c31cc6

However, I have no idea why the rtt test is failing. It almost seems like a compiler bug.

comment:11 in reply to:  10 ; Changed 3 weeks ago by teor

Replying to mikeperry:

Replying to asn:

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122?

Yes, that does seem likely for the tokens test. In fact, I think this should fix it, if we could reproduce:
https://github.com/mikeperry-tor/tor/commit/b61cd3709be53dd0aee55111dc0c29b882c31cc6

The 0.4.0 and master builds all failed due to this bug after #29599 was merged.

But only the "Image: Visual Studio 2017; Environment: target=x86_64-w64-mingw32..." jobs failed, so it might be timing-sensitive. Or the OS bug might happen reliably on Windows Server 2016.

You can remote desktop to the build machine if you want:
https://www.appveyor.com/docs/how-to/rdp-to-build-worker/

It will be easier to set up if you use your own appveyor account, not Tor's.

However, I have no idea why the rtt test is failing. It almost seems like a compiler bug.

It's an OS bug:
https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n543
https://www.python.org/dev/peps/pep-0418/#windows-queryperformancecounter

Which tor works around by pretending that no time has elapsed:
https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n175

More generally, there is no guarantee that the monotonic clock will have increased by at least 1000 nanoseconds between monotime_init() and circpad_estimate_circ_rtt_on_received()'s call to monotime_absolute_usec(). A non-increasing value is more likely when the monotonic clock's resolution is in milliseconds: two calls can easily return the same value.

But the circuitpadding code assumes that monotime_absolute_usec() is always greater than zero.

comment:12 in reply to:  11 Changed 3 weeks ago by teor

Replying to teor:

Replying to mikeperry:

Replying to asn:

Mike, wrt the test_circuitpadding_tokens() test, could it be another case where the test actually schedules legit padding because of a state transition or something and it might trigger or might not depending on how the timing of the test goes? Like the one from #29122?

Yes, that does seem likely for the tokens test. In fact, I think this should fix it, if we could reproduce:
https://github.com/mikeperry-tor/tor/commit/b61cd3709be53dd0aee55111dc0c29b882c31cc6

The 0.4.0 and master builds all failed due to this bug after #29599 was merged.

But only the "Image: Visual Studio 2017; Environment: target=x86_64-w64-mingw32..." jobs failed, so it might be timing-sensitive. Or the OS bug might happen reliably on Windows Server 2016.

You can remote desktop to the build machine if you want:
https://www.appveyor.com/docs/how-to/rdp-to-build-worker/

It will be easier to set up if you use your own appveyor account, not Tor's.

Or you could create your best fix for these issues, get it merged, and then see if it fails again?

It might be easier that messing around with RDP: there's no guarantee you'd be able to reproduce the timing issues on Appveyor within the 1 hour job limit.

comment:13 Changed 2 weeks ago by teor

I logged #29640 to improve the monotonic time documentation in compat_time.c

comment:14 in reply to:  11 Changed 2 weeks ago by mikeperry

Replying to teor:

Replying to mikeperry:

However, I have no idea why the rtt test is failing. It almost seems like a compiler bug.

It's an OS bug:
https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n543
https://www.python.org/dev/peps/pep-0418/#windows-queryperformancecounter

Which tor works around by pretending that no time has elapsed:
https://gitweb.torproject.org/tor.git/tree/src/lib/time/compat_time.c#n175

More generally, there is no guarantee that the monotonic clock will have increased by at least 1000 nanoseconds between monotime_init() and circpad_estimate_circ_rtt_on_received()'s call to monotime_absolute_usec(). A non-increasing value is more likely when the monotonic clock's resolution is in milliseconds: two calls can easily return the same value.

But the circuitpadding code assumes that monotime_absolute_usec() is always greater than zero.

But wait, how is this OS bug still happening even though I mocked monotime and monotime_coarse in test_circuitpad_rtt()? I set the mocked absolute time to be 1000 nsec, so
monotime_absolute_usec() should be returning 1 in circpad_estimate_circ_rtt_on_received().

Is the problem maybe that sometimes the nsec_per_tick_numer value is very high on some appveyor instances? Would using a larger initial mocked time than 1000 nsec solve the issue for this test?

comment:15 Changed 2 weeks ago by mikeperry

Ok, I changed the monotime mocking to start at 1000 usec rather than 1000 nsec, on the wild guess that this is a precision issue on some platforms due to our mocking. That fix, plus the tokens test fix are at https://github.com/torproject/tor/pull/757

Might as well set needs_review. I suppose there could be an argument for handling the case where monotime_absolute_usec() might return 0 in the wild, but I don't think that is actually possible there (this code only runs on relays upon receiving valid onionskins for circuit extension.. lots of time should have passed since monotime_init() at that point).

comment:16 Changed 2 weeks ago by mikeperry

Status: newneeds_review

comment:17 in reply to:  15 Changed 2 weeks ago by teor

Status: needs_reviewneeds_revision

Replying to mikeperry:

...
Might as well set needs_review. I suppose there could be an argument for handling the case where monotime_absolute_usec() might return 0 in the wild, but I don't think that is actually possible there (this code only runs on relays upon receiving valid onionskins for circuit extension.. lots of time should have passed since monotime_init() at that point).

It doesn't matter how much time has passed, the monotime_*diff_*() functions can still return zero. So all code that uses the monotime_*diff_*() functions needs to handle zero correctly.

See #29640, in particular:

Zero deltas happen more often:
 *  - on Windows (due to an OS bug),
 *  - when using monotime_coarse, or on systems with low-resolution timers,
 *  - on platforms where we emulate monotonic time using wall-clock time, and
 *  - when using time units that are larger than nanoseconds (due to
 *    truncation on division).

comment:18 Changed 2 weeks ago by teor

I added a better explanation to #29640 in response to your comments, and nickm's comments:

 * Zero time differences can happen on some machines, even after a large
 * amount of time has elapsed. Our ratchet corrects non-monotonic times and
 * returns a zero time difference:
 *  - on Windows (due to an OS bug that yields non-monotonic times from the
 *    monotonic time API),
 *  - on platforms where we emulate monotonic time using wall-clock time
 *    (after wall clock time is set to an earlier time).
Note: See TracTickets for help on using tickets.