Opened 2 months ago

Last modified 2 days ago

#29036 needs_review defect

Coverage merge failures cause test_process_slow stderr check to fail

Reported by: teor Owned by: ahf
Priority: High Milestone: Tor: 0.4.1.x-final
Component: Core Tor/Tor Version: Tor: unspecified
Severity: Major Keywords: 041-accepted-20190115, regression, tor-ci, 029-backport, 034-backport, 035-backport, 040-backport
Cc: ahf Actual Points:
Parent ID: Points: 0.5
Reviewer: catalyst Sponsor:

Description

When tor's coverage CI fails to merge coverage records, it logs to stderr. This causes the process stderr checks to fail.

slow/process/callbacks: profiling:/home/travis/build/teor2345/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-netinfo.gcda:Merge mismatch for function 47
  FAIL src/test/test_process_slow.c:241: assert(smartlist_len(process_data->stderr_data) OP_EQ 3): 4 vs 3
  [callbacks FAILED]

https://travis-ci.org/teor2345/tor/jobs/477561625#L5661

We could skip this test when coverage is active?
Or even better, we could allow 3 or 4 stderr lines when coverage is active?

This is a critical bug, because it's a CI failure. But it doesn't fail all the time. Maybe we should turn off CI caching on coverage builds, or exclude coverage records from the CI cache?

Child Tickets

Change History (17)

comment:1 Changed 2 months ago by ahf

Owner: set to ahf
Status: newassigned

Interesting catch! Assigning to myself.

comment:2 Changed 2 months ago by teor

Keywords: 041-accepted-20190115 added; 041-proposed removed

We accepted this in the meeting.

comment:3 Changed 5 days ago by teor

Priority: MediumHigh
Severity: CriticalMajor

This issue caused a master build to fail:
https://travis-ci.org/torproject/tor/jobs/507372106

It doesn't seem to happen very often, so I'm dropping the severity.

comment:4 Changed 5 days ago by teor

Here are the full logs, I'm going to restart the job to clear the error:

slow/crypto/fuzz_donna/ed25519_donna: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/crypto/fuzz_donna/ed25519_ref10: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/process/callbacks: profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
  FAIL src/test/test_process_slow.c:241: assert(smartlist_len(process_data->stderr_data) OP_EQ 3): 4 vs 3
  [callbacks FAILED]
slow/process/callbacks_terminate: profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_genpareto: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_geometric: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_uniform: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_logistic: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_log_logistic: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
slow/prob_distr/stochastic_weibull: [forking] profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108
OK
1/20 TESTS FAILED. (0 skipped)
profiling:/home/travis/build/torproject/tor/src/trunnel/src_trunnel_libor_trunnel_testing_a-socks5.gcda:Merge mismatch for function 108

comment:5 Changed 4 days ago by ahf

I'm not sure if I read the output correctly, but does this error condition mean that the test binaries (and the child-process binary) output "Merge mismatch for function X" on standard error?

We can detect that line in stderr's first line if we want to in the process tests, but I don't think that would be the right solution. Can we either fail in an earlier way when this condition appears OR figure out a way to reset the state so these "Merge mismatch" disappear?

comment:6 in reply to:  5 Changed 4 days ago by teor

Replying to ahf:

I'm not sure if I read the output correctly, but does this error condition mean that the test binaries (and the child-process binary) output "Merge mismatch for function X" on standard error?

We can detect that line in stderr's first line if we want to in the process tests, but I don't think that would be the right solution. Can we either fail in an earlier way when this condition appears OR figure out a way to reset the state so these "Merge mismatch" disappear?

Let's try deleting the cached coverage files before building?
Or let's delete the coverage files after a coverage run, so they don't take up space in the cache?

comment:7 Changed 4 days ago by ahf

I think it would make the most sense to delete the coverage files so they don't take up space in the cache rather than deleting them from the cache before we (maybe?) update them. It seems like these files should never be cached, right?

comment:8 in reply to:  7 Changed 4 days ago by teor

Keywords: 029-backport 034-backport 035-backport 040-backport added

Replying to ahf:

I think it would make the most sense to delete the coverage files so they don't take up space in the cache rather than deleting them from the cache before we (maybe?) update them. It seems like these files should never be cached, right?

It doesn't make sense to cache the files, and it's probably a bad thing: coverage can change when unrelated modules change.

If we delete the files before caching, they'll disappear from the cache after the first build on each branch. So the initial build, and old branches, might still have this bug. I think that's ok?

Let's create a "clean-coverage" make target, and add it in the "before_cache:" phase?
https://docs.travis-ci.com/user/caching/#before_cache-phase

comment:9 Changed 4 days ago by nickm

There's already a target called reset-gcov that might do what you want.

comment:10 Changed 4 days ago by ahf

Thanks for the hint with reset-gcov. Based on that I opened: https://github.com/torproject/tor/pull/803

Let's see what CI thinks of this.

comment:11 Changed 4 days ago by ahf

Status: assignedneeds_review

Looks like CI was happy. Let's get it reviewed.

comment:12 Changed 4 days ago by asn

Reviewer: catalyst

comment:13 Changed 4 days ago by ahf

Probably worth noting here: I use after_script for this after having read: https://blog.travis-ci.com/after_script_behavior_changes

comment:14 in reply to:  13 Changed 4 days ago by teor

Status: needs_reviewneeds_revision

Please rebase on maint-0.2.9: we need to fix the process test failure on master, but we also need reliable coverage stats on 0.2.9 and later. And we want smaller caches on all branches to speed up the build.

Replying to ahf:

Probably worth noting here: I use after_script for this after having read: https://blog.travis-ci.com/after_script_behavior_changes

I think you want before_cache, it's documented to run before the cache phase:
https://docs.travis-ci.com/user/caching/#before_cache-phase

comment:15 Changed 3 days ago by teor

ahf and I spoke about this fix on irc, here's the full list of changes:

  • rebase on maint-0.2.9
  • run coverage as the last script step
  • change make reset-gcov so it deletes ".gcov" files
    • so we don't accidentally use stale files if the coverage command fails
    • it also makes the cache smaller
  • put make reset-gcov in before_cache

comment:16 Changed 2 days ago by ahf

comment:17 Changed 2 days ago by ahf

Status: needs_revisionneeds_review

https://github.com/torproject/tor/pull/806 - CI looks happy. Let's get it reviewed.

Note: See TracTickets for help on using tickets.