Opened 4 months ago

Last modified 11 hours ago

#26787 needs_information defect

Core file left on travis hardened rust build

Reported by: mikeperry
Owned by:
Priority: Very High
Milestone: Tor: 0.3.5.x-final
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: travis, regression, tor-build, 029-backport, 033-backport, 034-backport, ci, 035-can, 032-unreached-backport
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

https://travis-ci.org/torproject/tor/jobs/403730172

It looks like all the tests pass; the only problem I can see is the mysterious core file left over at the end. I'm unable to get this to happen on my system.

Child Tickets

Ticket | Status | Owner | Summary | Component
#26788 | closed | | make distcheck sometimes leaves the core directory behind | Core Tor/Tor
#28024 | closed | catalyst | tell us where that mystery core file came from | Core Tor/Tor

Change History (27)

comment:1 Changed 4 months ago by nickm

See #26788 as well. Duplicate?

comment:2 Changed 4 months ago by teor

Keywords: 035-must regression tor-build added

Transferring keywords from duplicate ticket #26788.

I tried for a few minutes, but I don't know how to find the crash that creates the core file in either build:
https://travis-ci.org/torproject/tor/jobs/403115782
https://travis-ci.org/torproject/tor/jobs/403730172

comment:3 Changed 4 months ago by nickm

What if this is coming from the intentional crash in test-bt-cl?

If so we could fix it with "ulimit -c 0" in test_bt.sh.
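
A minimal sketch of that idea, assuming a shell whose ulimit builtin accepts -c (bash and dash both do). The helper name run_without_core is hypothetical and is not taken from test_bt.sh; it only shows how the lowered limit can be confined to the one command that is expected to crash.

```shell
#!/bin/sh
# Sketch: run one command with core dumps disabled, without changing
# the limit of the calling shell. run_without_core is a made-up name.
run_without_core() {
    # The subshell keeps the lowered limit local to this command.
    ( ulimit -c 0 && "$@" )
}

# Print the core-size limit that a child process sees inside the wrapper.
run_without_core sh -c 'ulimit -c'

# Hypothetical use with the crashing test named later in this ticket:
#   run_without_core ./src/test/test-bt-cl crash
```

Because the limit is inherited by child processes, the intentionally crashing program should not be able to write a core file, while unrelated crashes elsewhere in the build would still leave one.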

comment:4 Changed 4 months ago by nickm

I've made a PR at https://github.com/torproject/tor/pull/233 to test that theory. I have no idea if it will work.

comment:5 Changed 4 months ago by teor

Keywords: 029-backport 032-backport 033-backport 034-backport added
Status: changed from new to merge_ready

The patch is simple enough, and it passes all our CI (except for a Windows failure due to #26076).

Let's merge it as a precaution, and watch travis for a few days?

I have marked it for backport back to 0.2.9, because we backported distcheck CI back to 0.2.9.

comment:6 Changed 4 months ago by nickm

Status: changed from merge_ready to needs_information

I'm not sure whether this will actually work -- I've merged it to master, but let's wait to see whether it fixes the bug before we backport.

comment:7 Changed 4 months ago by nickm

For what it's worth, I don't expect that the patch above will actually fix the bug for us: I think that if the backtrace test were the part dumping core, it would do it consistently, not just sometimes.

Another possibility here would be to add "core" to CLEANFILES, though I'm not sure whether it's a good idea.

comment:8 in reply to: 7 Changed 4 months ago by catalyst

Replying to nickm:

For what it's worth, I don't expect that the patch above will actually fix the bug for us: I think that if the backtrace test were the part dumping core, it would do it consistently, not just sometimes.

Another possibility here would be to add "core" to CLEANFILES, though I'm not sure whether it's a good idea.

It might be intermittently losing a race condition for cleaning up the core file.

It also seems that this failure turns up on maint-0.3.4 on Travis with DISTCHECK=yes and RUST_OPTIONS="--enable-rust --enable-cargo-online-mode". Maybe it's something having to do with the build.rs stuff.

comment:9 in reply to:  8 Changed 4 months ago by catalyst

Replying to catalyst:

It also seems that this failure turns up on maint-0.3.4 on Travis with DISTCHECK=yes and RUST_OPTIONS="--enable-rust --enable-cargo-online-mode". Maybe it's something having to do with the build.rs stuff.

Doing an experiment on Travis to run the equivalent of "file ./core", to see if that tells us anything useful.

comment:10 Changed 4 months ago by catalyst

The core file (at least one of them) indeed seems to be from test-bt-cl.

tor-0.3.4.4-rc-dev/_build/core: ELF 64-bit LSB  core file x86-64, version 1 (SYSV), SVR4-style, from './src/test/test-bt-cl crash'

I don't know why it happens mostly on the Rust-enabled distcheck builds and mostly (always?) not on the non-Rust distcheck builds.

We should backport to any release where we run distcheck in Travis, even though I've only seen 0.3.4 and master fail.

comment:11 Changed 4 months ago by nickm

I've cherry-picked the fix to 0.2.9, added a changes file, and merged it forward.

comment:12 Changed 4 months ago by teor

Resolution: fixed
Status: changed from needs_information to closed

Since this was merged to 0.2.9 and later, let's close the ticket, and re-open if we see the issue in CI again.

comment:13 Changed 7 weeks ago by teor

Milestone: changed from Tor: 0.3.4.x-final to Tor: 0.3.5.x-final
Resolution: fixed
Status: changed from closed to reopened

comment:14 Changed 7 weeks ago by nickm

Keywords: ci added
Priority: changed from Medium to Very High

comment:15 Changed 5 weeks ago by catalyst

I wonder if the Rust builds somehow reactivate core dumps? Maybe an explicit setrlimit() in its runtime libraries? Or maybe it's a coredump from a different program?
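
As an illustration of how a core limit could be reactivated in principle, the sketch below compares lowering only the soft limit with lowering both soft and hard limits. Nothing in this ticket confirms that the Rust toolchain does this; the function names are hypothetical, and the -S and -H options are assumed to be supported by the shell's ulimit builtin (they are in bash and dash).

```shell
#!/bin/sh
# Child script: try to raise the soft core limit back up to the hard
# limit, then print the soft limit that is actually in effect.
try_reenable='ulimit -S -c "$(ulimit -H -c)" 2>/dev/null; ulimit -S -c'

# Case 1: only the soft limit is lowered, so a child can undo it.
soft_only() { ( ulimit -S -c 0; sh -c "$try_reenable" ); }

# Case 2: "ulimit -c 0" without -S or -H lowers both limits, so an
# unprivileged child has no headroom to raise the soft limit again.
soft_and_hard() { ( ulimit -c 0; sh -c "$try_reenable" ); }

echo "soft only:     $(soft_only)"
echo "soft and hard: $(soft_and_hard)"
```

In the first case the child ends up with whatever the hard limit of the environment is; in the second it stays at 0. A process with enough privilege to raise its hard limit would be a separate case that this sketch does not cover.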

comment:16 Changed 5 weeks ago by nickm

I wonder if we should just do a top-level "ulimit -c 0" before we run "make distcheck". Not elegant, but it would make the issue go away.

comment:17 in reply to:  16 Changed 5 weeks ago by catalyst

Replying to nickm:

I wonder if we should just do a top-level "ulimit -c 0" before we run "make distcheck". Not elegant, but it would make the issue go away.

I'm nervous about blanket ignoring of unexplained coredumps. That might hide more serious issues later on.

comment:18 Changed 5 weeks ago by catalyst

Owner: set to catalyst
Status: changed from reopened to assigned

comment:19 Changed 5 weeks ago by catalyst

Running make distcheck in a loop to try to figure out more details. So far in my test setup, MAKEFLAGS='-j4' gives some weird racy things involving cargo clean invocations colliding, but -j2 has been chugging for hours without failure.
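
A loop of this kind could look roughly like the sketch below. It is not the script actually used here; loop_until_core and its messages are hypothetical, and it assumes the core file is literally named "core" somewhere under the current directory.

```shell
#!/bin/sh
# Sketch: repeat a command up to N times, stopping at the first failure
# or as soon as a file named "core" shows up under the current directory.
loop_until_core() {
    max=$1; shift
    i=1
    while [ "$i" -le "$max" ]; do
        "$@" || { echo "run $i: command failed"; return 1; }
        if [ -n "$(find . -type f -name core -print)" ]; then
            echo "run $i: core file found"
            return 2
        fi
        i=$((i + 1))
    done
    echo "no core file after $max runs"
}

# Hypothetical use from the top of a build tree:
#   ( export MAKEFLAGS=-j2; loop_until_core 100 make distcheck )
```

Stopping at the first hit leaves the build tree in place, so the core file can be inspected before the next run removes it.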

Also repeatedly hitting "restart" on jobs in my travis-mystery-core branch in Travis, where I added some instrumentation.

comment:20 Changed 5 weeks ago by catalyst

I can't replicate it without commenting out the "ulimit -c 0" in test_bt.sh.

Maybe it's a different core file? Maybe we should merge a patch to run file on the core file?

comment:21 Changed 5 weeks ago by nickm

Maybe we should merge a patch to run file on the core file?

This sounds great. We could even make it travis-only if we want.
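
A rough sketch of what such a step might do, under stated assumptions: describe_cores is a hypothetical helper, the core files are assumed to be named "core" or "core.<digits>", and the fallback to "ls -l" is only there for systems where file(1) is not installed. The real change is tracked in the child ticket.

```shell
#!/bin/sh
# Sketch: describe every leftover core file under a directory.
# describe_cores is a made-up name, not part of the Tor tree.
describe_cores() {
    dir=${1:-.}
    find "$dir" -type f \( -name core -o -name 'core.[0-9]*' \) -print |
    while IFS= read -r path; do
        if command -v file >/dev/null 2>&1; then
            # For an ELF core, "file" reports the command that produced it.
            file "$path"
        else
            # Fallback: at least show that the file exists and its size.
            ls -l "$path"
        fi
    done
}

# Hypothetical use after a failed distcheck:
#   describe_cores .
```

To make it Travis-only, the call could be guarded on the TRAVIS environment variable, which Travis CI sets to "true" in its build environment.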

comment:22 in reply to:  21 Changed 5 weeks ago by catalyst

Replying to nickm:

Maybe we should merge a patch to run file on the core file?

This sounds great. We could even make it travis-only if we want.

Doing this in child ticket #28024.

comment:23 Changed 5 weeks ago by catalyst

Another mystery is why this happens more often in the rust build than the non-rust build. (I can't seem to get it to happen in the non-rust build.)

comment:24 Changed 5 weeks ago by catalyst

Owner: catalyst deleted

comment:25 Changed 5 weeks ago by catalyst

Status: changed from assigned to needs_information

We need to know where the mystery core file comes from. #28024 should help with that.

comment:26 Changed 6 days ago by nickm

Keywords: 035-can added; 035-must removed

comment:27 Changed 11 hours ago by teor

Keywords: 032-unreached-backport added; 032-backport removed

0.3.2 is end of life, so 032-backport is now 032-unreached-backport.
