Opened 3 months ago

Closed 3 weeks ago

#22752 closed defect (fixed)

Assertion failure in consensus_cache_entry_handle_get on Windows

Reported by: ahf Owned by: nickm
Priority: High Milestone: Tor: 0.3.1.x-final
Component: Core Tor/Tor Version: Tor: 0.3.1.3-alpha
Severity: Normal Keywords:
Cc: vvort@… Actual Points:
Parent ID: Points:
Reviewer: dgoulet Sponsor: Sponsor4

Description

We've received a bug report on execdir@ with a bug in Tor 0.3.1.3-alpha (dc47d936d47ffc25).

The following log snippet was attached:

24 02:42:42.645 [Warning] Unable to unlink "S:/PortableApps/TorBrowser/Browser/TorBrowser/Data/Tor\\diff-cache/1011"
24 02:42:42.645 [Warning] tor_bug_occurred_(): Bug: consdiffmgr.c:1285: store_multiple: Non-fatal assertion !(ent == NULL) failed. (on Tor 0.3.1.3-alpha dc47d936d47ffc25)
24 02:42:42.645 [Warning] Bug: Non-fatal assertion !(ent == NULL) failed in store_multiple at consdiffmgr.c:1285. (Stack trace not available) (on Tor 0.3.1.3-alpha dc47d936d47ffc25)
24 02:47:50.583 [Error] tor_assertion_failed_(): Bug: conscache.c:521: consensus_cache_entry_handle_get: Assertion ref failed; aborting. (on Tor 0.3.1.3-alpha dc47d936d47ffc25)
24 02:47:50.583 [Error] Bug: Assertion ref failed in consensus_cache_entry_handle_get at conscache.c:521. (Stack trace not available) (on Tor 0.3.1.3-alpha dc47d936d47ffc25)

Child Tickets

Attachments (1)

unmap_hack.patch (508 bytes) - added by Vort 7 weeks ago.

Download all attachments as: .zip

Change History (24)

comment:1 Changed 3 months ago by nickm

Priority: MediumHigh

comment:2 Changed 3 months ago by nickm

Status: newneeds_review

See bug22752_mitigation_031 for a mitigation and diagnostic branch. It doesn't fix this, but it makes it nonfatal and tries to get more useful info if it happens again.

comment:3 Changed 3 months ago by ahf

Status: needs_reviewmerge_ready

LGTM.

comment:4 Changed 3 months ago by nickm

Status: merge_readyneeds_information

Thanks; merged! Putting in needs_information until we get some more diagnostics here.

comment:6 Changed 2 months ago by nickm

Here's the report:

Unable to unlink ".\\Data\\Tor\\LocalState\\diff-cache/1001" while removing file: Permission denied
tor_bug_occurred_(): Bug: consdiffmgr.c:1289: store_multiple: Non-fatal assertion !(ent == NULL) failed. (on Tor 0.3.1.4-alpha fab91a290ded3e74)
Bug: Non-fatal assertion !(ent == NULL) failed in store_multiple at consdiffmgr.c:1289. (Stack trace not available) (on Tor 0.3.1.4-alpha fab91a290ded3e74)
tor_bug_occurred_(): Bug: consdiffmgr.c:328: cdm_diff_ht_purge: Non-fatal assertion !((*diff)->entry == NULL) failed. (on Tor 0.3.1.4-alpha fab91a290ded3e74)
Bug: Non-fatal assertion !((*diff)->entry == NULL) failed in cdm_diff_ht_purge at consdiffmgr.c:328. (Stack trace not available) (on Tor 0.3.1.4-alpha fab91a290ded3e74)
eventdns: All nameservers have failed

So this issues is still present in 0.3.1.4-alpha

comment:7 Changed 7 weeks ago by Vort

Still in 0.3.1.5-alpha:

Aug 05 08:24:12.000 [warn] Unable to unlink "d:\\Tor\\Data\\diff-cache/1085" while removing file: Permission denied
Aug 05 09:22:13.000 [warn] Unable to unlink "d:\\Tor\\Data\\diff-cache/1005" while removing file: Permission denied
Aug 05 09:22:13.000 [warn] tor_bug_occurred_(): Bug: consdiffmgr.c:1300: store_multiple: Non-fatal assertion !(ent == NULL) failed. (on Tor 0.3.1.5-alpha )
Aug 05 09:22:13.000 [warn] Bug: Non-fatal assertion !(ent == NULL) failed in store_multiple at consdiffmgr.c:1300. (Stack trace not available) (on Tor 0.3.1.5-alpha )
Aug 05 10:24:11.000 [warn] tor_bug_occurred_(): Bug: consdiffmgr.c:329: cdm_diff_ht_purge: Non-fatal assertion !((*diff)->entry == NULL) failed. (on Tor 0.3.1.5-alpha )
Aug 05 10:24:11.000 [warn] Bug: Non-fatal assertion !((*diff)->entry == NULL) failed in cdm_diff_ht_purge at consdiffmgr.c:329. (Stack trace not available) (on Tor 0.3.1.5-alpha )
Last edited 7 weeks ago by Vort (previous) (diff)

comment:8 Changed 7 weeks ago by Vort

Looks like this happens when file count for diff-cache reaches 256 (but I'm not sure).

comment:9 Changed 7 weeks ago by Vort

This bug can be reproduced with two virtual machines:

  1. Windows 7, Tor 0.3.1.5-alpha relay.
  2. Ubuntu 16.04, chutney "basic" network.

It takes about 10 minutes until cache fills up with 256 entries and assertions start appearing in log file.

But there is one problem with this test:
In about a minute, Windows 7-based relay start to use 100% of CPU resources.
(Looks like this overload is made by a lot of tor_cond_wait() calls)
So it needs to be limited for one core usage, or your system will be lagging.

comment:10 Changed 7 weeks ago by nickm

I think this bug might be caused by the fact that (I think!) on windows, you can't unlink a file that's in use. But our code tries to unlink these files while they are still mapped.

comment:11 Changed 7 weeks ago by Vort

Cc: vvort@… added

Changed 7 weeks ago by Vort

Attachment: unmap_hack.patch added

comment:12 Changed 7 weeks ago by Vort

I think this bug might be caused by the fact that (I think!) on windows, you can't unlink a file that's in use.

Yes, adding of consensus_cache_entry_unmap call hides "unlink" warnings:
attachment:unmap_hack.patch
But adds other ones, of course.

But our code tries to unlink these files while they are still mapped.

Using of deleted file is a strange thing for me.

comment:13 in reply to:  12 Changed 7 weeks ago by nickm

Replying to Vort:

I think this bug might be caused by the fact that (I think!) on windows, you can't unlink a file that's in use.

Yes, adding of consensus_cache_entry_unmap call hides "unlink" warnings:
attachment:unmap_hack.patch
But adds other ones, of course.

But our code tries to unlink these files while they are still mapped.

Using of deleted file is a strange thing for me.

It's pretty normal on Unix-derived systems. Files are reference-counted, and not actually deleted until nothing else is using them. The "unlink()" system call doesn't actually delete a file -- it just removes a name from from the filesystem. Only when all links and references are gone is the actual data deleted. That's why it's called "unlink()" instead of "delete()".

comment:14 Changed 7 weeks ago by Vort

Looks like in Windows, unlink actually deletes file:
https://msdn.microsoft.com/en-us/library/aa273396(v=vs.60).aspx

There are something similar to unlink from Unix in NTFS:
https://unix.stackexchange.com/a/49306
But I'm not sure about it.

Concerning architecture:
Tor unlinks file just to free name or its goal is to free disk space?
What is the reason to have exactly 256 names, but gigabytes of used space?

Sadly, I don't completely understand what happening here.

comment:15 Changed 4 weeks ago by nickm

Owner: set to nickm
Status: needs_informationaccepted

comment:16 Changed 4 weeks ago by nickm

File names are scarce under the linux seccomp2 sandbox code, where they need to be preallocated and reserved. Attempts to limit disk usage here are over the long term -- everything that's unlinked but kept on disk should be removed as soon as the mmap is closed, which should happen as soon as it's done spooling.

One option here is to just use a different strategy under windows, since we don't need the same filename limits. Another option is to kill off the maps when deleting on windows, since se can't keep them around.

comment:17 Changed 4 weeks ago by nickm

Another option is to kill off the maps when deleting on windows, since we can't keep them around.

I tried this option first, since I thought it would be simple, but I ran into multithreading issues, and it turned out to be fairly tricky. I've left my code in branch bug22752_031_complex in my public repository, but I hope we don't need to use it -- it could be destabilizing if I got _anything_ wrong, which I probably did.

I'm going to try a simpler option now, with fewer multithreading issues.

comment:18 Changed 4 weeks ago by nickm

Status: acceptedneeds_review

I tried the simpler approach in bug22752_031_simple -- I think that's the one we should be reviewing.

comment:19 Changed 4 weeks ago by dgoulet

Reviewer: dgoulet
Status: needs_reviewneeds_revision
  • I see this that is a very puzzling thing for which I think I figured it out:
+  cache->max_entries = max_entries;
+#ifdef MUST_UNMAP_TO_UNLINK
+  max_entries = 1000000;
+#endif
   cache->dir = storage_dir_new(directory, max_entries);

I've looked on how consensus_cache_open() is used and how big max_entries can be and seems it will always be (128 * 2) so the value in the case of Windows is drastically bigger (1000000). And that explains the changes file saying that we want a storage dir bigger than the actual limit of the cache.

That code above needs a comment explaining this because I got very confused on this arbitrary limit of 1000000 and the limit vs the allocation difference.

  • In consensus_cache_register_with_sandbox():
+#ifdef MUST_UNMAP_TO_UNLINK
+  /* Our sandbox doesn't support huge limits like we use here.
+   */
+  tor_assert_nonfatal_unreached();
+#endif

And also no sandbox on Windows but let us assume that we can enable that define on OperatingSystem9000 for instance, we should indicate here that the fix for the requirement to UNMAP is to have a huge file limit thus not compatible with our sandbox because the link between "sandbox" and "must unmap" is not easy to do...

  • consensus_cache_may_overallocate() imo should have documentation on *why* it does exists in the first place. If I was new to the code, I would look at the function which would lead me to #define MUST_UNMAP_TO_UNLINK which only has the comment "/* On Windows, unlink won't work if there's an active mmap. */" which would let me very confused.

I would like somewhere a clear comment on the why of the relationship between "MUST UNMAP" and the fact that we have a very large max entry value for that as a fix.

comment:20 Changed 3 weeks ago by Vort

I have tested 0.3.1.5 + bug22752_031_simple in VM - works good (except for hang - but that is another problem).
Now I will test it with real relay.

comment:21 Changed 3 weeks ago by nickm

Status: needs_revisionneeds_review

Vort: Thanks! Any results so far?

dgoulet: I've added another commit to bug22752_031_simple with more comments.

I think this is probably good to merge; please let me know some time soon if either of you disagree. ;)

comment:22 Changed 3 weeks ago by Vort

Any results so far?

4 days of online: no errors in the log file and exactly 256 files in diff-cache.

comment:23 Changed 3 weeks ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Vort: Okay. I'll take this as likely working.

Dgoulet: merging now to 0.3.1 and forward; but please let me know if you think I didn't get the documentation right in 948be49ce06f744ca456d20f48bfb6d2f5cfb7d2

peace!

Note: See TracTickets for help on using tickets.