Opened 7 weeks ago

Last modified 7 weeks ago

#23081 merge_ready defect

Tor relay crashes at consensus_diff_queue_diff_work() with assertion in_main_thread() failed

Reported by: Vort
Owned by:
Priority: Medium
Milestone: Tor: 0.3.0.x-final
Component: Core Tor/Tor
Version: Tor: 0.3.1.5-alpha
Severity: Minor
Keywords: tor-relay regression win32 nt-service 025-backport 028-backport 029-backport 030-backport
Cc: ahf
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor: Sponsor4-can

Description

When I start a relay with the latest Tor version, it crashes almost instantly:

Aug 02 11:04:45.000 [err] tor_assertion_failed_(): Bug: consdiffmgr.c:1601: consensus_diff_queue_diff_work: Assertion in_main_thread() failed; aborting. (on Tor 0.3.1.5-alpha )
Aug 02 11:04:45.000 [err] Bug: Assertion in_main_thread() failed in consensus_diff_queue_diff_work at consdiffmgr.c:1601. (Stack trace not available) (on Tor 0.3.1.5-alpha )

OS: Windows 7 SP1 x64
Tor: 0.3.1.5-alpha x64 (custom MSYS2 build)

Attachments (3)

tor_threads_bad.png (117.8 KB) - added by Vort 7 weeks ago.
23081.patch (363 bytes) - added by nickm 7 weeks ago.
23081_v2.patch (376 bytes) - added by Vort 7 weeks ago.


Change History (19)

comment:1 Changed 7 weeks ago by asn

Cc: ahf added
Milestone: Tor: unspecified → Tor: 0.3.2.x-final
Sponsor: Sponsor4-can

comment:2 Changed 7 weeks ago by Sebastian

First it would be good to learn whether set_main_thread has been called. So I guess a patch along the lines of

--- a/src/common/compat_threads.c
+++ b/src/common/compat_threads.c
@@ -90,6 +90,7 @@ set_main_thread(void)
 int
 in_main_thread(void)
 {
+  tor_assert(main_thread_id != -1);
   return main_thread_id == tor_get_thread_id();
 }
 

could help debug this

comment:3 Changed 7 weeks ago by Vort

Sebastian, I see the same errors with this change.

comment:4 Changed 7 weeks ago by nickm

exactly the same errors, or something different?

comment:5 Changed 7 weeks ago by nickm

So, here are the possibilities I can think of:

  1. The work is being queued from somewhere other than the main thread.
  2. The main_thread_id variable has not been set.
  3. The main thread has somehow changed its thread ID.
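
The three possibilities above can be told apart with a small diagnostic. What follows is only a sketch, not Tor's code: it uses POSIX threads as a portable stand-in, and every name in it (`recorded_main_thread`, `record_main_thread`, `check_main_thread`, `check_from_worker`) is hypothetical. It reports "not set" when no ID was ever recorded (possibility 2) and "mismatch" when the calling thread differs from the recorded one (possibilities 1 and 3, which look the same from inside the check).

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for Tor's main-thread bookkeeping. */
static pthread_t recorded_main_thread;
static bool main_thread_recorded = false;

typedef enum {
  MAIN_THREAD_OK,       /* recorded ID matches the calling thread */
  MAIN_THREAD_NOT_SET,  /* possibility 2: no ID was ever recorded */
  MAIN_THREAD_MISMATCH  /* possibility 1 or 3: caller is not the recorded thread */
} main_thread_status_t;

/* Record the calling thread as the main thread. */
static void
record_main_thread(void)
{
  recorded_main_thread = pthread_self();
  main_thread_recorded = true;
}

/* Classify the calling thread against the recorded main thread. */
static main_thread_status_t
check_main_thread(void)
{
  if (!main_thread_recorded)
    return MAIN_THREAD_NOT_SET;
  if (!pthread_equal(recorded_main_thread, pthread_self()))
    return MAIN_THREAD_MISMATCH;
  return MAIN_THREAD_OK;
}

/* Thread entry point: stores the status as seen from another thread. */
static void *
check_from_worker(void *out)
{
  *(main_thread_status_t *)out = check_main_thread();
  return NULL;
}
```

Calling `check_main_thread()` before `record_main_thread()` yields `MAIN_THREAD_NOT_SET`; calling it through `check_from_worker` on a second thread after recording yields `MAIN_THREAD_MISMATCH`.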

comment:6 Changed 7 weeks ago by nickm

Keywords: regression added
Priority: Medium → High

comment:7 Changed 7 weeks ago by Vort

> exactly the same errors, or something different?

Exactly the same.

> 1. The work is being queued from somewhere other than the main thread.

This looks like the most probable reason.

One more note:
The error happens only if the relay is under load.
I have two keys: the first has a weight of ~1000, the second is currently unmeasured.
With the second key, Tor has not hit this error yet.

Update: the second relay was running in non-service mode, which may be important.

Last edited 7 weeks ago by Vort

comment:8 Changed 7 weeks ago by Vort

This bug may be related to Windows services.
tor --service start -> error
tor.exe -f torrc -> no error
But I am not sure about this.

Changed 7 weeks ago by Vort

Attachment: tor_threads_bad.png added

comment:9 Changed 7 weeks ago by Vort

Caught!
Look at the screenshot:
attachment:tor_threads_bad.png

Changed 7 weeks ago by nickm

Attachment: 23081.patch added

comment:10 Changed 7 weeks ago by nickm

Status: new → needs_review

Good diagnosis; I think you're right. Does the attached patch file (23081.patch) fix the issue for you?

comment:11 Changed 7 weeks ago by nickm

Keywords: win32 nt-service added

Changed 7 weeks ago by Vort

Attachment: 23081_v2.patch added

comment:12 Changed 7 weeks ago by Vort

> Does the attached patch file (23081.patch) fix the issue for you?

No, that is the wrong do_main_loop() call.
This variant works:
attachment:23081_v2.patch

comment:13 Changed 7 weeks ago by nickm

Keywords: 025-backport 028-backport 029-backport 030-backport added

Thank you!

It looks as if this bug has been present for a long time, so I've made a branch against 0.2.5 (the oldest supported version): bug23081_025. I've merged it to 0.3.1 and later, and I'll mark this ticket for possible backport.

(I currently think we should backport to 0.2.9.)

comment:14 Changed 7 weeks ago by nickm

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.0.x-final
Priority: High → Medium
Severity: Major → Minor
Status: needs_review → merge_ready

comment:15 Changed 7 weeks ago by Vort

> It looks as if this bug has been present for a long time

So Tor was already doing something wrong, but it was not visible because previous versions had no such assert?
I did not see this error with 0.3.0.9.

comment:16 Changed 7 weeks ago by nickm

It was doing other wrong things. Here's the changes message I wrote up:

+    - When running as a Windows service, set the ID of the main thread
+      correctly. Failure to do so made us fail to send log messages
+      to the controller in 0.2.1.16-rc, slowed down controller
+      event delivery in 0.2.7.3-rc and later, and crash with an assertion
+      failure in 0.3.1.1-alpha. Fixes bug 23081; bugfix on 0.2.1.6-alpha.
+      Patch and diagnosis from "Vort".
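
The behaviour described in that changes message can be illustrated with a minimal sketch. It assumes that in service mode the main loop runs on a different thread from the one that executed startup. It is not the attached patch: POSIX threads stand in for the Windows service machinery, and all names ending in `_sketch`, along with `service_run_t`, are hypothetical. The point is that the thread ID has to be recorded from the thread that will actually run the loop.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the recorded main-thread ID. */
static pthread_t main_thread_id;

static void
set_main_thread_sketch(void)
{
  main_thread_id = pthread_self();
}

static bool
in_main_thread_sketch(void)
{
  return pthread_equal(main_thread_id, pthread_self());
}

/* Input and result for one simulated service run. */
typedef struct {
  bool reset_id_first; /* record the ID again on the service thread? */
  bool check_passed;   /* what the main-thread check saw there */
} service_run_t;

/* Simulated service body: runs on its own thread, as assumed above. */
static void *
service_body_sketch(void *arg)
{
  service_run_t *run = arg;
  if (run->reset_id_first)
    set_main_thread_sketch(); /* the thread running the loop claims the ID */
  run->check_passed = in_main_thread_sketch();
  return NULL;
}

/* Returns whether the check passed on the service thread
 * (false also if the thread could not be started). */
static bool
run_as_service_sketch(bool reset_id_first)
{
  service_run_t run = { reset_id_first, false };
  pthread_t service_thread;

  set_main_thread_sketch(); /* startup records the initial thread */
  if (pthread_create(&service_thread, NULL, service_body_sketch, &run) != 0)
    return false;
  pthread_join(service_thread, NULL);
  return run.check_passed;
}
```

In this sketch `run_as_service_sketch(false)` reports a failed check, matching the assertion failure in the report, while `run_as_service_sketch(true)` passes because the ID is recorded on the thread that runs the loop.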