Opened 8 weeks ago

Closed 5 weeks ago

#31841 closed defect (fixed)

test addr/parse takes a long time on master on some machines

Reported by: teor Owned by: nickm
Priority: High Milestone: Tor: 0.4.2.x-final
Component: Core Tor/Tor Version: Tor: unspecified
Severity: Normal Keywords: tor-unit-tests 042-should BugSmashFund
Cc: Actual Points: .2
Parent ID: Points:
Reviewer: teor Sponsor:

Description (last modified by teor)

A recent commit to master causes the addr/parse test to run for a few minutes on my Debian 10.1 install. The test works fine on maint-0.4.1, and on some of my older master branches.

This issue doesn't seem to affect our CI builders.

I haven't had time to bisect yet.

Child Tickets

Change History (23)

comment:1 Changed 8 weeks ago by teor

Description: modified (diff)
Summary: test addr/parse hangs on master on some machinestest addr/parse takes a long time on master on some machines

comment:2 Changed 8 weeks ago by nickm

Keywords: 042-must added
Priority: MediumHigh

comment:3 Changed 8 weeks ago by teor

This seems to be a network dependency:

  • when the network interface is up, sometimes the tests are fast
  • when the network interface is down, the tests are almost always slow

Because it's intermittent, I might be wrong about this bug being new in master.

comment:4 Changed 8 weeks ago by nickm

I wonder if this is a hostname-lookup issue.

comment:5 Changed 7 weeks ago by nickm

I think the solution here is, if the test does any hostname lookups, to mock out the lowest-level getaddrinfo or gethostbyname function, so that we aren't going out to the network from this test.

comment:6 Changed 7 weeks ago by nickm

Keywords: 042-should added; 042-must removed

comment:7 Changed 7 weeks ago by nickm

Owner: set to nickm
Status: newaccepted

comment:8 Changed 7 weeks ago by teor

I think it's not just one test, these other tests also seem to be affected:

  • test/options_*

Maybe we should mock those functions for all unit tests?

In 0.4.3, we should consider failing tests that take longer than N seconds. I'll open another ticket.

comment:9 Changed 7 weeks ago by nickm

I suspected systems with slow DNS. Fortunately, all our blocking DNS lookup happens in one place, in one block within tor_addr_lookup(), so it's easy to see which unit tests call it. They are:

  • addr/parse (over 180 times)
  • config/parse_port_config__ports__ports_given (twice)
  • hs_config/valid_service_v2 (once)
  • hs_config/valid_service_v3 (once)
  • options/validate (once)

So if a system has a DNS resolver that is susceptible to timeouts, we would expect addr/parse to block, and nothing else.

comment:10 Changed 7 weeks ago by nickm

(Of course, if the callback is really slow, the other tests might block too.)

comment:11 Changed 7 weeks ago by teor

I opened #31940 for follow-up.

Let's decide if we want to do it in 0.4.3, based on how well this fix works.

comment:12 Changed 7 weeks ago by nickm

Branch is ticket31841; PR at https://github.com/torproject/tor/pull/1393

I suspect that coverage may go down, since we are removing all the code that used to call into tor_addr_lookup_host_{getaddrinfo,gethostbyname}() no longer does so. I'm not sure of a meaningful way to test those, however, without potentially invoking a DNS resolver.

I'll put this in needs_review once CI has passed.

comment:13 Changed 7 weeks ago by nickm

Actual Points: .2
Keywords: BugSmashFund added

comment:14 Changed 7 weeks ago by nickm

Status: acceptedneeds_review

Needs_review. I think that most of the lost coverage outside of lib/net/resolve.c is spurious, but I can investigate more if needed.

comment:15 Changed 7 weeks ago by teor

Status: needs_reviewneeds_revision

I left a few minor comments on the pull request, feel free to rebase and edit existing commits to fix them.

I checked the coverage loss, it did seem spurious.

Do we need to backport this change to 0.2.9 and later?
We're going to be running tests on 0.2.9 for the next 3 months, and 0.3,5 for the next 2 years.

comment:16 Changed 7 weeks ago by nickm

I'd suggest that we not backport. To me, this seems more like a feature than a bug.

comment:17 Changed 7 weeks ago by nickm

Status: needs_revisionneeds_review

I've turned the printf() into a debug log, and rebased the branch to move the mocking into the proper commit.

comment:18 Changed 7 weeks ago by teor

"the tests won't hang if your DNS is slow" doesn't really seem like a feature.
Particularly when I'm the tester, and my builds are incredibly slow.

Let's check how far back the 180 addr/parse calls go?
I think it's 0.4.2, but it might be 0.4.1.

comment:19 Changed 6 weeks ago by dgoulet

Reviewer: teor

comment:20 in reply to:  18 Changed 5 weeks ago by teor

Status: needs_reviewmerge_ready

This PR seems fine to me.

There is something consistently broken about the coverage report: we did change more than 2 lines.

It's also worth noting that we are mocking an IPv6-capable resolver, so some coverage may change (particularly for IPv4-only cases). But overall, coverage will be more stable across environments and network conditions.

Replying to teor:

"the tests won't hang if your DNS is slow" doesn't really seem like a feature.
Particularly when I'm the tester, and my builds are incredibly slow.

Let's check how far back the 180 addr/parse calls go?
I think it's 0.4.2, but it might be 0.4.1.

I have checked, and I added a large number of address lookups in 0.4.2.1-alpha, to increase coverage.
So I don't think we need to backport.

comment:21 Changed 5 weeks ago by nickm

Keywords: dgoulet-merge added

Thanks for the review!

(When merging this, please remember that it goes into maint-0.4.2 and then forward.)

comment:22 Changed 5 weeks ago by nickm

Keywords: dgoulet-merge removed

Merged to 0.4.2 and forward. (Per org/teams/NetworkTeam/MergePolicy, we don't need a third person for this.)

comment:23 Changed 5 weeks ago by nickm

Resolution: fixed
Status: merge_readyclosed
Note: See TracTickets for help on using tickets.