Opened 4 months ago

Closed 7 weeks ago

Last modified 7 weeks ago

#29036 closed defect (fixed)

Coverage merge failures cause test_process_slow stderr check to fail

Reported by: teor Owned by: teor
Priority: High Milestone: Tor: 0.2.9.x-final
Component: Core Tor/Tor Version: Tor: 0.2.9.15
Severity: Major Keywords: 041-accepted-20190115, regression, tor-ci, 029-backport, 034-backport, 035-backport, 040-backport, tor-ci-fail-sometimes
Cc: ahf Actual Points: 0.6
Parent ID: Points: 0.5
Reviewer: catalyst Sponsor:

Description

When tor's coverage CI fails to merge coverage records, it logs to stderr. This causes the process stderr checks to fail.

slow/process/callbacks: profiling:/home/travis/build/teor2345/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-netinfo.gcda:Merge mismatch for function 47
  FAIL src/test/test_process_slow.c:241: assert(smartlist_len(process_data->stderr_data) OP_EQ 3): 4 vs 3
  [callbacks FAILED]

https://travis-ci.org/teor2345/tor/jobs/477561625#L5661

We could skip this test when coverage is active?
Or even better, we could allow 3 or 4 stderr lines when coverage is active?

This is a critical bug, because it's a CI failure. But it doesn't fail all the time. Maybe we should turn off CI caching on coverage builds, or exclude coverage records from the CI cache?

Child Tickets

TicketStatusOwnerSummaryComponent
#29962closedteorTurn off Travis Rust caching, and see if it speeds up the buildsCore Tor/Tor

Change History (38)

comment:1 Changed 4 months ago by ahf

Owner: set to ahf
Status: newassigned

Interesting catch! Assigning to myself.

comment:2 Changed 4 months ago by teor

Keywords: 041-accepted-20190115 added; 041-proposed removed

We accepted this in the meeting.

comment:3 Changed 2 months ago by teor

Priority: MediumHigh
Severity: CriticalMajor

This issue caused a master build to fail:
https://travis-ci.org/torproject/tor/jobs/507372106

It doesn't seem to happen very often, so I'm dropping the severity.

comment:4 Changed 2 months ago by teor

Here are the full logs, I'm going to restart the job to clear the error:

slow/crypto/fuzz_donna/ed25519_donna: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/crypto/fuzz_donna/ed25519_ref10: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/process/callbacks: profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
  FAIL src/test/test_process_slow.c:241: assert(smartlist_len(process_data->stderr_data) OP_EQ 3): 4 vs 3
  [callbacks FAILED]
slow/process/callbacks_terminate: profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_genpareto: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_geometric: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_uniform: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_logistic: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_log_logistic: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_weibull: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
1/20 TESTS FAILED. (0 skipped)
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108

comment:5 Changed 2 months ago by ahf

I'm not sure if I read the output correctly, but does this error condition mean that the test binaries (and the child-process binary) output "Merge mismatch for function X" on standard error?

We can detect that line in stderr's first line if we want to in the process tests, but I don't think that would be the right solution. Can we either fail in an earlier way when this condition appears OR figure out a way to reset the state so these "Merge mismatch" disappear?

comment:6 in reply to:  5 Changed 2 months ago by teor

Replying to ahf:

I'm not sure if I read the output correctly, but does this error condition mean that the test binaries (and the child-process binary) output "Merge mismatch for function X" on standard error?

We can detect that line in stderr's first line if we want to in the process tests, but I don't think that would be the right solution. Can we either fail in an earlier way when this condition appears OR figure out a way to reset the state so these "Merge mismatch" disappear?

Let's try deleting the cached coverage files before building?
Or let's delete the coverage files after a coverage run, so they don't take up space in the cache?

comment:7 Changed 2 months ago by ahf

I think it would make the most sense to delete the coverage files so they don't take up space in the cache rather than deleting them from the cache before we (maybe?) update them. It seems like these files should never be cached, right?

comment:8 in reply to:  7 Changed 2 months ago by teor

Keywords: 029-backport 034-backport 035-backport 040-backport added

Replying to ahf:

I think it would make the most sense to delete the coverage files so they don't take up space in the cache rather than deleting them from the cache before we (maybe?) update them. It seems like these files should never be cached, right?

It doesn't make sense to cache the files, and it's probably a bad thing: coverage can change when unrelated modules change.

If we delete the files before caching, they'll disappear from the cache after the first build on each branch. So the initial build, and old branches, might still have this bug. I think that's ok?

Let's create a "clean-coverage" make target, and add it in the "before_cache:" phase?
https://docs.travis-ci.com/user/caching/#before_cache-phase

comment:9 Changed 2 months ago by nickm

There's already a target called reset-gcov that might do what you want.

comment:10 Changed 2 months ago by ahf

Thanks for the hint with reset-gcov. Based on that I opened: https://github.com/torproject/tor/pull/803

Let's see what CI thinks of this.

comment:11 Changed 2 months ago by ahf

Status: assignedneeds_review

Looks like CI was happy. Let's get it reviewed.

comment:12 Changed 2 months ago by asn

Reviewer: catalyst

comment:13 Changed 2 months ago by ahf

Probably worth noting here: I use after_script for this after having read: https://blog.travis-ci.com/after_script_behavior_changes

comment:14 in reply to:  13 Changed 2 months ago by teor

Status: needs_reviewneeds_revision

Please rebase on maint-0.2.9: we need to fix the process test failure on master, but we also need reliable coverage stats on 0.2.9 and later. And we want smaller caches on all branches to speed up the build.

Replying to ahf:

Probably worth noting here: I use after_script for this after having read: https://blog.travis-ci.com/after_script_behavior_changes

I think you want before_cache, it's documented to run before the cache phase:
https://docs.travis-ci.com/user/caching/#before_cache-phase

comment:15 Changed 2 months ago by teor

ahf and I spoke about this fix on irc, here's the full list of changes:

  • rebase on maint-0.2.9
  • run coverage as the last script step
  • change make reset-gcov so it deletes ".gcov" files
    • so we don't accidentally use stale files if the coverage command fails
    • it also makes the cache smaller
  • put make reset-gcov in before_cache

comment:16 Changed 2 months ago by ahf

comment:17 Changed 2 months ago by ahf

Status: needs_revisionneeds_review

https://github.com/torproject/tor/pull/806 - CI looks happy. Let's get it reviewed.

comment:18 in reply to:  17 Changed 2 months ago by catalyst

Status: needs_reviewneeds_information

Replying to ahf:

https://github.com/torproject/tor/pull/806 - CI looks happy. Let's get it reviewed.

Thanks! That one branch looks good to me. Does this merge forward cleanly?

comment:19 Changed 2 months ago by teor

Status: needs_informationmerge_ready

I'll check that it merges forward cleanly before I push to torproject.org.

comment:20 Changed 2 months ago by teor

Keywords: teor-merge added

ahf, please remember to fill in actual points?

comment:21 Changed 2 months ago by ahf

Actual Points: 0.5

comment:22 in reply to:  15 Changed 2 months ago by teor

Status: merge_readyneeds_revision

Hi ahf, catalyst,

Replying to teor:

  • change make reset-gcov so it deletes ".gcov" files
    • so we don't accidentally use stale files if the coverage command fails
    • it also makes the cache smaller

I still think we should change reset-gcov so it deletes the .gcov files. They aren't any good in the cache, and they will hide some kinds of coverage build/execution failures. What do you think?

Edit: typos

Last edited 7 weeks ago by teor (previous) (diff)

comment:23 Changed 7 weeks ago by teor

Keywords: tor-ci-fail-sometimes added

comment:24 Changed 7 weeks ago by teor

Owner: changed from ahf to teor
Status: needs_revisionassigned

#29036 and #29962 both add before_cache, so I'd like to backport them and merge them forward together.

comment:25 Changed 7 weeks ago by teor

Status: assignedneeds_revision

comment:26 Changed 7 weeks ago by teor

Status: needs_revisionneeds_review

I added these commits:

  • 57e9fe2: deletes the .gcno and .gcov files in make reset-gcov
  • eb0bd18: tweaks the changes file so it passes make check-changes on 0.3.5
  • clean merge of #29962 into 0.3.4
  • 33be8d8: combine before_cache from #29036 and #29962
  • 7014e57 merge: combine the stem lines from maint-0.3.5 with the moved coverage line from #29036

catalyst, can you review my extra commits on #29036, and my extra commits on #29962?
(You've already reviewed ahf's commits on #29036, and I reviewed rl1987's commits on #29962.)

Here are the pull requests:

Edit: typo

Last edited 7 weeks ago by teor (previous) (diff)

comment:27 in reply to:  26 Changed 7 weeks ago by catalyst

Status: needs_reviewmerge_ready

Replying to teor:

catalyst, can you review my extra commits on #29036, and my extra commits on #29962?
(You've already reviewed ahf's commits on #29036, and I reviewed rl1987's commits on #29962.)

Thanks! These look good.

comment:28 Changed 7 weeks ago by teor

Actual Points: 0.50.6
Keywords: asn-merge nickm-merge added; teor-merge removed

Please merge https://github.com/torproject/tor/pull/879 to 0.4.0 and later.

Once the CI finishes successfully, I'll backport the 0.2.9, 0.3.4, and 0.3.5 branches.

comment:29 Changed 7 weeks ago by nickm

Status: merge_readyneeds_revision

Sorry, I think we need a different target name for this new "reset-gcov" semantics, if that's what we're going for. In the current semantics, running 'reset-gcov' after a test run resets the coverage, but doesn't stop you from running "make check" again. But if we change it to remove the gcno files too, that will means that you need to do a full "make clean" and "make" before you can run the tests for coverage again.

comment:30 in reply to:  29 Changed 7 weeks ago by teor

Replying to nickm:

Sorry, I think we need a different target name for this new "reset-gcov" semantics, if that's what we're going for. In the current semantics, running 'reset-gcov' after a test run resets the coverage, but doesn't stop you from running "make check" again. But if we change it to remove the gcno files too, that will means that you need to do a full "make clean" and "make" before you can run the tests for coverage again.

I think I understand better now:

  • we MUST NOT delete gcno files, because they are created when the code is compiled
  • we MUST delete gcda files to remove the warning (and make our coverage accurate)
  • we MAY delete gcov files, because they take up space in the cache (and leaving old files around might make our coverage inaccurate)

I'll amend that commit to delete gcda and gcov files, but leave gcno files.

comment:31 Changed 7 weeks ago by teor

Status: needs_revisionmerge_ready

Ok, I removed the Makefile line that deletes the gcno files.

Please merge ​https://github.com/torproject/tor/pull/879 to 0.4.0 and later.

comment:32 Changed 7 weeks ago by asn

merged to 040 and forward.

comment:33 Changed 7 weeks ago by asn

Resolution: fixed
Status: merge_readyclosed

comment:34 Changed 7 weeks ago by teor

Milestone: Tor: 0.4.1.x-finalTor: 0.3.5.x-final
Resolution: fixed
Status: closedreopened
Version: Tor: unspecifiedTor: 0.2.9.15

Please don't close tickets that need to be backported: put them in Tor: 0.3.5.x-final instead.

comment:35 Changed 7 weeks ago by teor

Keywords: asn-merge nickm-merge removed

comment:36 Changed 7 weeks ago by teor

Status: reopenedmerge_ready

comment:37 Changed 7 weeks ago by teor

Resolution: fixed
Status: merge_readyclosed

comment:38 Changed 7 weeks ago by teor

Milestone: Tor: 0.3.5.x-finalTor: 0.2.9.x-final
Note: See TracTickets for help on using tickets.