Opened 8 months ago

Closed 4 months ago

Last modified 4 months ago

#29645 closed defect (worksforme)

test.exe hangs on Appveyor CI

Reported by: teor
Owned by: ahf
Priority: High
Milestone: Tor: 0.4.1.x-final
Component: Core Tor/Tor
Version: Tor: 0.3.4.1-alpha
Severity: Normal
Keywords: asn-merge, tor-ci, tor-windows, tor-test, hang, 041-should
Cc:
Actual Points: 0.4
Parent ID:
Points: 1
Reviewer: nickm
Sponsor:

Description

Tor's test.exe sometimes hangs on our Appveyor Windows CI.

I've seen this happen twice over the past few weeks.
Here is one example:
https://ci.appveyor.com/project/torproject/tor/builds/22791909/job/u0jd5tpr07mt2nv3

We've reduced the job time limit to 30 minutes to mitigate this issue.
But I am not sure how to debug it further.

Change History (20)

comment:4 Changed 7 months ago by teor

Keywords: tor-ci-fail-sometimes added

comment:5 Changed 7 months ago by teor

I don't think we can send an abort signal to a Windows process to get a backtrace: Windows doesn't do signals.

So we should probably run "make test" before we run "make check". Then we will know which test is hanging, because "make test" logs to stdout.
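The idea above (use a command that logs to stdout, so the last line in the CI log points at the hanging test) can be sketched as a small wrapper. This is only an illustration, assuming Python 3.7+ is available on the CI image; `stream_command` is a hypothetical helper name, not something in the Tor tree.

```python
import subprocess
import sys
import time


def stream_command(command, out=sys.stdout):
    """Run `command`, echoing each output line as soon as it arrives.

    Every line is prefixed with the seconds elapsed since the start.
    If the command hangs and the CI job is killed, the last echoed
    line shows how far the command got.
    Returns (exit_code, last_line).
    """
    started = time.monotonic()
    last_line = ""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
        bufsize=1,                 # line-buffered reads in text mode
    )
    with process:  # closes the pipe and waits for the child on exit
        for line in process.stdout:
            last_line = line.rstrip("\n")
            elapsed = time.monotonic() - started
            out.write("[{:7.1f}s] {}\n".format(elapsed, last_line))
            out.flush()  # make sure the CI log sees the line right away
    return process.returncode, last_line
```

A CI script could call `stream_command(["make", "test"])` and then `stream_command(["make", "check"])`. The wrapper has no timeout of its own: it relies on the CI job time limit to stop a hung run.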

comment:6 Changed 7 months ago by teor

Actual Points: 0.3
Keywords: 034-backport 035-backport 040-backport added
Milestone: changed from Tor: unspecified to Tor: 0.4.1.x-final
Priority: changed from Medium to High
Status: changed from new to needs_review
Version: Tor: 0.3.4.1-alpha

We've seen this issue on maint-0.3.5 and maint-0.4.0, so I'm going to assume that it affects all Appveyor tests (0.3.4 and later).

I've added a diagnostic that runs "make test", so we can see which test hangs:
0.3.4: https://github.com/torproject/tor/pull/894
0.3.5: https://github.com/torproject/tor/pull/895

  • testing only, clean merge

0.4.0: https://github.com/torproject/tor/pull/896

  • line-based merge

master: https://github.com/torproject/tor/pull/897

  • testing only, clean merge

Please don't close after merging: we still need to fix the bug!

comment:7 Changed 7 months ago by teor

Owner: set to teor
Status: changed from needs_review to assigned

comment:8 Changed 7 months ago by teor

Status: changed from assigned to needs_review

comment:9 Changed 7 months ago by nickm

Reviewer: nickm
Status: changed from needs_review to needs_revision

This LGTM except that the changes file probably shouldn't say "fixes bug 29645" but rather "diagnostic for ticket 29645" or something, since it isn't supposed to actually be a fix.

comment:10 Changed 7 months ago by teor

Actual Points: changed from 0.3 to 0.4
Keywords: asn-merge added
Status: changed from needs_revision to merge_ready

I fixed the changes file, and merged forward to the other PRs. Let's merge!

comment:11 Changed 7 months ago by teor

(I had to force-push to change the commit message.)

comment:12 Changed 7 months ago by teor

Status: changed from merge_ready to needs_information

This patch doesn't help us diagnose the issue.

It happened on the master PR at:
https://ci.appveyor.com/project/teor2345/tor/builds/23616223/job/jpvxrbivq2wb3ueg

"make test" passes, but "make check" hangs. So maybe we are looking for another culprit?

Can we make tests print a message when they start, as well as when they end?
Can we terminate all the tests if they hang, and see what output we get?
(Does timelimit work on Windows?)
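One way to approach the second question, without depending on whether timelimit works on Windows, is a small wrapper that kills the command after a time limit and keeps whatever output was captured. The sketch below assumes Python 3 is available on the CI image; `run_with_time_limit` is a hypothetical helper name, and nothing here is taken from the Tor tree.

```python
import subprocess
import time


def run_with_time_limit(command, limit_seconds):
    """Run `command`; kill it if it runs longer than `limit_seconds`.

    Prints a START line before the command and an END line after it,
    so a log always shows which command was in progress.
    Returns (exit_code, output, timed_out); exit_code is None when the
    command was killed for exceeding the limit.
    """
    label = " ".join(command)
    print("START: {}".format(label), flush=True)
    started = time.monotonic()
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep stderr in the same stream
            timeout=limit_seconds,
        )
        exit_code = completed.returncode
        raw_output = completed.stdout
        timed_out = False
    except subprocess.TimeoutExpired as expired:
        # subprocess.run() kills the direct child before re-raising.
        # Output captured so far is attached to the exception; it may
        # be None if nothing was read.
        # Limitation: only the direct child is killed. If it started
        # other processes (as "make" does), they may keep running.
        exit_code = None
        raw_output = expired.output or b""
        timed_out = True
    elapsed = time.monotonic() - started
    status = "TIMED OUT" if timed_out else "exit code {}".format(exit_code)
    print("END: {} ({}, {:.1f}s)".format(label, status, elapsed), flush=True)
    return exit_code, raw_output.decode(errors="replace"), timed_out
```

For example, `run_with_time_limit(["make", "check"], 20 * 60)` would stop the step after 20 minutes and return the output collected up to that point. Because only the direct child is killed, a hung test.exe started by make might need separate cleanup.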

comment:13 Changed 7 months ago by teor

Owner: changed from teor to nickm
Status: changed from needs_information to assigned

I hope Nick can help answer these questions.

comment:14 Changed 7 months ago by teor

Status: changed from assigned to needs_information

comment:15 Changed 6 months ago by nickm

Owner: changed from nickm to ahf
Status: changed from needs_information to assigned

Hoping that ahf's Windows skills can help us here.

comment:16 Changed 5 months ago by nickm

Keywords: 041-should added

comment:17 Changed 4 months ago by nickm

Keywords: 034-backport removed

Removing 034-backport from all open tickets: 034 has reached EOL.

comment:18 Changed 4 months ago by teor

Keywords: tor-ci-fail-sometimes removed

I haven't seen this issue for a while. Should we assume it's not a problem any more?

comment:19 Changed 4 months ago by nickm

Resolution: worksforme
Status: changed from assigned to closed

I say let's close it, and re-open if it happens again.

comment:20 Changed 4 months ago by teor

Keywords: 035-backport 040-backport removed