Opened 10 days ago

Closed 18 hours ago

#29706 closed defect (fixed)

Test failure due to memory leaks in shared-random unit tests: long-term fix

Reported by: teor Owned by: teor
Priority: Medium Milestone: Tor: 0.4.0.x-final
Component: Core Tor/Tor Version: Tor: 0.2.9.1-alpha
Severity: Normal Keywords: tor-ci, tor-test, memory-management, 034-unreached-backport, 035-unreached-backport, 040-backport
Cc: Actual Points: 1.1
Parent ID: Points: 0.5
Reviewer: asn Sponsor:

Description

In #29599 we fixed some leaks in the shared-random unit tests. But there are still some test failures.

The shared random state claims to take over ownership of SRVs passed to it using PUT. But it doesn't free them automatically: instead, the caller has to remember to call state_del_current_srv() first. (Or one of its callers.)

The current app code is ok, but the test code doesn't always call the functions in the right order.

Child Tickets

Change History (16)

comment:1 Changed 10 days ago by teor

Keywords: 029-backport-partial added; 029-backport removed
Status: assignedneeds_review

Here's one example of a failure after #29599 was merged:
https://travis-ci.org/torproject/tor/jobs/503849917

I have two pull requests on 0.2.9:

A minimal version that fixes the unit tests on 0.2.9:
https://github.com/torproject/tor/pull/774

A refactor that fixes the underlying memory management bug on 0.2.9:
https://github.com/torproject/tor/pull/775

And a merge of the refactor to 0.3.4:
https://github.com/torproject/tor/pull/776
(There were some trivial conflicts where I needed to combine two sets of changes.)

I expect that we'll backport the minimal version to 0.2.9 after CI has succeeded on master and 0.4.0.

We can merge the refactor to 0.4.0 and master, and then decide whether to backport it after it has been tested in a few alphas. In particular, we should get arma to deploy an alpha or nightly on his authority.

comment:2 Changed 10 days ago by teor

Actual Points: 0.30.5
Points: 0.30.5
Reviewer: asn

Assigning to asn for review, because he's on CI rotation next week. (And it's Saturday for most people. We can wait until Monday.)

comment:3 Changed 7 days ago by asn

Status: needs_reviewmerge_ready

LGTM! I think this new design where the state takes responsibility for the SRVs is cleaner indeed.

I also think that merging the simplified fix to 029 makes sense.

comment:4 Changed 7 days ago by nickm

quick question here -- when we are about to call state_query_del_, do we need a check to make sure that the value we are about to assign is not the same pointer as the previous value?

comment:5 Changed 7 days ago by teor

Status: merge_readyneeds_revision

Yes, we need to do that check. (The current code never assigns the same pointer, but we should allow future code to do so without failing horribly.)

comment:6 in reply to:  1 Changed 7 days ago by teor

Replying to teor:

I have two pull requests on 0.2.9:

A minimal version that fixes the unit tests on 0.2.9:
https://github.com/torproject/tor/pull/774

We should merge pr/774 to 0.4.0 and master, then to 0.2.9 and later once the CI passes.

A refactor that fixes the underlying memory management bug on 0.2.9:
https://github.com/torproject/tor/pull/775

We should not merge pr/775.
This refactor includes the fix in pr/774.
It also independently fixes the bug by making the memory management work better.
I don't think we'll backport this refactor to 0.2.9.

And a merge of the refactor to 0.3.4:
https://github.com/torproject/tor/pull/776
(There were some trivial conflicts where I needed to combine two sets of changes.)

We should merge pr/776 to 0.4.0 and master.
Then we should test it in an alpha, and backport it as far back as we think we'll backport any other shared random changes.
(For authorities, that's 0.3.4 right now. For clients, that's probably 0.3.5, because it's LTS.)

I'll open a child ticket for pr/774.

comment:7 Changed 7 days ago by teor

Keywords: 034-backport-maybe added; 029-backport-partial 034-backport removed

Ok, the quick fix is in #29740.
I'll work on revising the refactor in this ticket.

comment:8 in reply to:  4 Changed 7 days ago by teor

Keywords: consider-backport-after-authority-test consider-backport-after-0404-alpha added
Status: needs_revisionneeds_review
Summary: Test failure due to memory leaks in shared-random unit testsTest failure due to memory leaks in shared-random unit tests: long-term fix

Replying to nickm:

quick question here -- when we are about to call state_query_del_, do we need a check to make sure that the value we are about to assign is not the same pointer as the previous value?

I think replacing a pointer with itself is almost always a bug, because:

  • we've confused current and previous, or
  • we forgot sr_srv_dup().

Unless both pointers are NULL: replacing NULL with NULL is ok.

I changed the code so it does nothing when replacing a non-NULL pointer with the same pointer, and logs a BUG() warning. That BUG() doesn't trigger on the unit tests or chutney.

The changes are in ​https://github.com/torproject/tor/pull/776
I re-did the merge, because I ended up with a lot of partial merges.

We should merge it to 0.4.0 and master, then consider a backport to 0.3.4 and 0.3.5 once arma has run it on moria for a while (and once it's been in at least one full alpha).

comment:9 Changed 7 days ago by teor

Actual Points: 0.51.0
Priority: Very HighMedium

comment:10 Changed 6 days ago by asn

Status: needs_reviewmerge_ready

Latest revisions and added check LGTM.
I also tested it on chutney for some time just to make sure that no BUGs are triggered, and it worked fine!

comment:11 Changed 6 days ago by nickm

Milestone: Tor: 0.4.1.x-finalTor: 0.3.5.x-final

Great; merged PR 776 to 0.4.0 and forward. Marking for possible backport.

comment:12 Changed 5 days ago by teor

Actual Points: 1.01.1
Milestone: Tor: 0.3.5.x-finalTor: 0.4.1.x-final
Status: merge_readyneeds_review

One of the tests failed if we happened to init during the commit-reveal state transition period:

shared-random/utils_auth: [forking] 
  FAIL ../src/test/test_shared_random.c:1173: assert(sr_state_get_current_srv() OP_EQ NULL): 0000000002DFC350 vs 0000000000000000
  [utils_auth FAILED]

https://ci.appveyor.com/project/torproject/tor/builds/23057542/job/0066lphdf9c9oc3l?fullLog=true#L6552

I added a commit to https://github.com/torproject/tor/pull/776 to always clear the SRVs after test init, and before setting either SRV in the tests.

Let's merge to 0.4.0 and master, and try again?

comment:13 Changed 5 days ago by teor

Status: needs_reviewmerge_ready

Actually, the commit seems simple enough for a single review:
https://github.com/torproject/tor/pull/776/commits/a9c3101e2140ceacc47498a8d15c8b54ad2616c5

comment:14 Changed 3 days ago by nickm

okay, merged again and trying again.

comment:15 Changed 3 days ago by nickm

Milestone: Tor: 0.4.1.x-finalTor: 0.3.5.x-final

comment:16 Changed 18 hours ago by teor

Keywords: 034-unreached-backport 035-unreached-backport added; consider-backport-after-authority-test consider-backport-after-0404-alpha 034-backport-maybe 035-backport removed
Milestone: Tor: 0.3.5.x-finalTor: 0.4.0.x-final
Resolution: fixed
Status: merge_readyclosed

This change is too large to backport: we already fixed the unit test failure in 0.2.9 and later with #29740.

Note: See TracTickets for help on using tickets.