Opened 17 months ago

Closed 8 weeks ago

#29645 closed defect (worksforme)

test.exe hangs on Appveyor CI

Reported by: teor
Owned by: ahf
Priority: High
Milestone: Tor: 0.4.4.x-final
Component: Core Tor/Tor
Version: Tor: 0.3.4.1-alpha
Severity: Normal
Keywords: asn-merge, tor-ci, tor-windows, tor-test, hang, 044-should
Cc:
Actual Points: 0.4
Parent ID:
Points: 1
Reviewer: nickm
Sponsor:

Description

Tor's test.exe sometimes hangs on our Appveyor Windows CI.

I've seen this happen twice over the past few weeks.
Here is one example:
https://ci.appveyor.com/project/torproject/tor/builds/22791909/job/u0jd5tpr07mt2nv3

We've reduced the job time limit to 30 minutes to mitigate this issue.
But I am not sure how to debug it further.

Child Tickets

Ticket | Status | Owner | Summary | Component
#32804 | closed | | Travis CI hangs during compile or test | Core Tor/Tor

Change History (24)

comment:4 Changed 16 months ago by teor

Keywords: tor-ci-fail-sometimes added

comment:5 Changed 16 months ago by teor

I don't think we can send an abort signal to a Windows process to get a backtrace: Windows doesn't do signals.

So we should probably run "make test" before we run "make check". Then we will know which test is hanging, because "make test" logs to stdout.

comment:6 Changed 16 months ago by teor

Actual Points: 0.3
Keywords: 034-backport 035-backport 040-backport added
Milestone: Tor: unspecified → Tor: 0.4.1.x-final
Priority: Medium → High
Status: new → needs_review
Version: Tor: 0.3.4.1-alpha

We've seen this issue on maint-0.3.5 and maint-0.4.0, so I'm going to assume that it affects all appveyor tests (0.3.4 and later).

I've added a diagnostic that runs "make test", so we can see which test hangs:
0.3.4: https://github.com/torproject/tor/pull/894
0.3.5: https://github.com/torproject/tor/pull/895

  • testing only, clean merge

0.4.0: https://github.com/torproject/tor/pull/896

  • line-based merge

master: https://github.com/torproject/tor/pull/897

  • testing only, clean merge

Please don't close after merging, we still need to fix the bug!

comment:7 Changed 16 months ago by teor

Owner: set to teor
Status: needs_review → assigned

comment:8 Changed 16 months ago by teor

Status: assigned → needs_review

comment:9 Changed 16 months ago by nickm

Reviewer: nickm
Status: needs_review → needs_revision

This LGTM except that the changes file probably shouldn't say "fixes bug 29645" but rather "diagnostic for ticket 29645" or something, since it isn't supposed to actually be a fix.

comment:10 Changed 16 months ago by teor

Actual Points: 0.3 → 0.4
Keywords: asn-merge added
Status: needs_revision → merge_ready

I fixed the changes file, and merged forward to the other PRs. Let's merge!

comment:11 Changed 16 months ago by teor

(I had to force-push to change the commit message.)

comment:12 Changed 16 months ago by teor

Status: merge_ready → needs_information

This patch doesn't help us diagnose the issue.

It happened on the master PR at:
https://ci.appveyor.com/project/teor2345/tor/builds/23616223/job/jpvxrbivq2wb3ueg

"make test" passes, but "make check" hangs. So maybe we are looking for another culprit?

Can we make tests print a message when they start, as well as when they end?
Can we terminate all the tests if they hang, and see what output we get?
(Does timelimit work on Windows?)
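
One possible shape for the two ideas above (start/end markers plus a hard timeout), as a minimal Python sketch. It is not part of Tor's test harness: `run_with_markers`, the marker strings, and the demo commands are hypothetical stand-ins. It relies on `subprocess.run(..., timeout=...)`, which kills only the direct child process, so a hang caused by a grandchild holding the output pipe could still stall it.

```python
import subprocess
import sys
import time


def run_with_markers(name, cmd, timeout_s):
    """Run one test command with start/end markers; kill it on timeout.

    Returns the exit code, or None if the command was killed for
    exceeding timeout_s seconds.
    """
    print(f"START {name}", flush=True)
    began = time.monotonic()
    try:
        # subprocess.run kills the child and re-raises TimeoutExpired
        # once the timeout elapses.
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        elapsed = time.monotonic() - began
        print(f"HANG {name} after {elapsed:.1f}s; captured output follows",
              flush=True)
        # exc.output holds whatever was captured, and may be None.
        sys.stdout.write((exc.output or b"").decode(errors="replace"))
        sys.stdout.flush()
        return None
    sys.stdout.write(proc.stdout.decode(errors="replace"))
    print(f"END {name} exit={proc.returncode}", flush=True)
    return proc.returncode


if __name__ == "__main__":
    # Hypothetical stand-ins for real test binaries such as test.exe.
    demo = {
        "quick": [sys.executable, "-c", "print('ok')"],
        "stuck": [sys.executable, "-c", "import time; time.sleep(60)"],
    }
    for test_name, test_cmd in demo.items():
        run_with_markers(test_name, test_cmd, timeout_s=2)
```

With markers like these, a CI log that ends in a START line without a matching END or HANG line would point at the test that was running when the job was cut off.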


comment:13 Changed 16 months ago by teor

Owner: changed from teor to nickm
Status: needs_information → assigned

I hope Nick can help answer these questions.

comment:14 Changed 16 months ago by teor

Status: assigned → needs_information

comment:15 Changed 16 months ago by nickm

Owner: changed from nickm to ahf
Status: needs_information → assigned

Hoping that ahf's windows skills can help us here.

comment:16 Changed 15 months ago by nickm

Keywords: 041-should added

comment:17 Changed 14 months ago by nickm

Keywords: 034-backport removed

Removing 034-backport from all open tickets: 034 has reached EOL.

comment:18 Changed 14 months ago by teor

Keywords: tor-ci-fail-sometimes removed

I haven't seen this issue for a while, should we assume it's not a problem any more?

comment:19 Changed 14 months ago by nickm

Resolution: worksforme
Status: assigned → closed

I say let's close it, and re-open if it happens again.

comment:20 Changed 14 months ago by teor

Keywords: 035-backport 040-backport removed

comment:21 Changed 8 months ago by teor

Resolution: worksforme removed
Status: closed → reopened

comment:22 Changed 2 months ago by nickm

Milestone: Tor: 0.4.1.x-final → Tor: 0.4.4.x-final

comment:23 Changed 2 months ago by nickm

Keywords: 044-should added; 041-should removed

comment:24 Changed 8 weeks ago by nickm

Resolution: worksforme
Status: reopened → closed

We haven't run into this in a long time, and it appears this has stalled.
