Tor start via Windows service fails

changed milestone to %Tor: 0.3.5.x-final

added 035-backport 035-rc-blocker component::core tor/tor milestone::Tor: 0.3.5.x-final nt-service priority::high regression reporter::Vort resolution::fixed reviewer::ahf severity::normal status::closed type::defect version::tor 0.3.5.3-alpha windows labels

Does it say any more if you log at "debug"?

Is there any way you can get a stack trace of the crash, if it crashes?

Trac:
Milestone: N/A to Tor: 0.4.0.x-final

With latest version (9c2b114b), it says more even at notice level: Nov 27 22:05:10.000 [err] tor_assertion_failed_(): Bug: cpuworker.c:501: cpuworker_queue_work: Assertion threadpool failed; aborting. (on Tor 0.4.0.0-alpha-dev ) Nov 27 22:05:10.000 [err] Bug: Assertion threadpool failed in cpuworker_queue_work at cpuworker.c:501. (Stack trace not available) (on Tor 0.4.0.0-alpha-dev )

Trac:
Username: Vort

Tried to insert some code at cpuworker.c:501 to catch the bug -> no result. Reverted my changes -> no assertion anymore. But here is what says gdb now. Maybe it is more related than cpuworker.c problem:

Attaching to process 17568
[New Thread 17568.0x7884]
[New Thread 17568.0x7518]
[New Thread 17568.0x8b90]
[New Thread 17568.0xccb0]
[New Thread 17568.0x8b4c]
[New Thread 17568.0xd754]
Reading symbols from D:\Tor\tor.exe...done.
Continuing.
[Thread 17568.0xd754 exited with code 0]

Thread 4 received signal SIGFPE, Arithmetic exception.
[Switching to Thread 17568.0xccb0]
0x0000000000527af1 in token_bucket_raw_refill_steps (elapsed=54536462,
    cfg=0x658800 <global_bucket>, bucket=0x658808 <global_bucket+8>)
    at src/lib/evloop/token_bucket.c:90
90        if (elapsed > gap / cfg->rate) {
(gdb) bt
#0  0x0000000000527af1 in token_bucket_raw_refill_steps (elapsed=54536462,
    cfg=0x658800 <global_bucket>, bucket=0x658808 <global_bucket+8>)
    at src/lib/evloop/token_bucket.c:90
#1  token_bucket_rw_refill (bucket=0x658800 <global_bucket>,
    now_ts=now_ts@entry=872583402) at src/lib/evloop/token_bucket.c:206
#2  0x000000000040650f in connection_bucket_refill_single (conn=0x1552f60,
    now_ts=872583402) at src/core/mainloop/connection.c:3436
#3  0x000000000040df78 in connection_handle_read_impl (conn=0x1552f60)
    at src/core/mainloop/connection.c:3539
#4  connection_handle_read (conn=conn@entry=0x1552f60)
    at src/core/mainloop/connection.c:3647
#5  0x0000000000413b10 in conn_read_callback (fd=<optimized out>,
    event=<optimized out>, _conn=0x1552f60)
    at src/core/mainloop/mainloop.c:870
#6  0x00000000621d4999 in ?? () from D:\Tor\libevent-2-1-6.dll
#7  0x00000000621d5504 in ?? () from D:\Tor\libevent-2-1-6.dll
#8  0x000000000041496f in run_main_loop_once ()
    at src/core/mainloop/mainloop.c:2926
#9  run_main_loop_until_done () at src/core/mainloop/mainloop.c:2988
#10 do_main_loop () at src/core/mainloop/mainloop.c:2883
#11 0x00000000004f4886 in nt_service_body (argc=<optimized out>,
    argv=<optimized out>) at src/app/main/ntmain.c:301
#12 0x000007fefdb1a82d in SECHOSTRegisterServiceCtrlHandlerExA ()
   from C:\Windows\SYSTEM32\sechost.dll
#13 0x00000000778759cd in KERNEL32BaseThreadInitThunk ()
   from C:\Windows\system32\kernel32.dll
#14 0x00000000779d385d in ntdllRtlUserThreadStart ()
   from C:\Windows\SYSTEM32\ntdll.dll
#15 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

Trac:
Username: Vort

That's a really odd error:

all of those values are integers, so there shouldn't be a floating-point error
cfg->rate is checked for zero here: https://github.com/torproject/tor/blob/02ba0a2dbbe1032b56d0ba17468f2ee0d95e4f1b/src/lib/evloop/token_bucket.c#L40

I think we're going to need full logs or more information to diagnose this error.

Trac:
Status: new to needs_information

With fresh build cpuworker_queue_work bug popped out again. Let's switch to it. Here is the trace with __asm__("int3"); added:

Thread 4 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 64900.0x1181c]
cpuworker_queue_work (priority=priority@entry=WQ_PRI_LOW,
    fn=fn@entry=0x47b020 <consensus_diff_worker_threadfn>,
    reply_fn=reply_fn@entry=0x47cd30 <consensus_diff_worker_replyfn>,
    arg=arg@entry=0x285b530) at src/core/mainloop/cpuworker.c:502
502       tor_assert(threadpool);
(gdb) bt
#0  cpuworker_queue_work (priority=priority@entry=WQ_PRI_LOW,
    fn=fn@entry=0x47b020 <consensus_diff_worker_threadfn>,
    reply_fn=reply_fn@entry=0x47cd30 <consensus_diff_worker_replyfn>,
    arg=arg@entry=0x285b530) at src/core/mainloop/cpuworker.c:502
#1  0x000000000047d7ef in consensus_diff_queue_diff_work (diff_to=0x139ab10,
    diff_from=0x1405930) at src/feature/dircache/consdiffmgr.c:1718
#2  consdiffmgr_rescan_flavor_ (flavor=FLAV_MICRODESC)
    at src/feature/dircache/consdiffmgr.c:1013
#3  consdiffmgr_rescan () at src/feature/dircache/consdiffmgr.c:1127
#4  0x00000000621d4b30 in ?? () from D:\Tor\libevent-2-1-6.dll
#5  0x00000000621d5504 in ?? () from D:\Tor\libevent-2-1-6.dll
#6  0x00000000004149bf in run_main_loop_once ()
    at src/core/mainloop/mainloop.c:2926
#7  run_main_loop_until_done () at src/core/mainloop/mainloop.c:2988
#8  do_main_loop () at src/core/mainloop/mainloop.c:2883
#9  0x00000000004f4916 in nt_service_body (argc=<optimized out>,
    argv=<optimized out>) at src/app/main/ntmain.c:301
#10 0x000007fefdb1a82d in SECHOSTRegisterServiceCtrlHandlerExA ()
   from C:\Windows\SYSTEM32\sechost.dll
#11 0x00000000778759cd in KERNEL32BaseThreadInitThunk ()
   from C:\Windows\system32\kernel32.dll
#12 0x00000000779d385d in ntdllRtlUserThreadStart ()
   from C:\Windows\SYSTEM32\ntdll.dll
#13 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) p threadpool
$1 = (threadpool_t *) 0x0
(gdb)

What I see is that threadpool should be initialized in cpu_init function. Which is called from run_tor_main_loop function.

But it have no chance to be executed because do_main_loop is already fired by nt_service_body.

Trac:
Username: Vort
Status: needs_information to new

#23693 (moved)

Trac:
Username: omg

Ok, so we need to call run_tor_main_loop() in ntmain.c, rather than do_main_loop().

Thanks for finding this bug, it looks like it hasn't worked for a long time.

Edit: typo

It looks like this bug was introduced in 0.3.5.3-alpha in commit c7ce6b9821be22e734b79e07e318f2bfba32722d. This regression in NT services must be fixed before 0.3.5-rc.

Trac:
Keywords: windows nt-service deleted, windows nt-service regression 035-backport 035-rc-must added
Version: Tor: 0.3.5.5-alpha to Tor: 0.3.5.3-alpha

Trac:
Priority: Medium to High

Trac:
Keywords: windows nt-service regression 035-backport 035-rc-must deleted, windows nt-service regression 035-backport 035-rc-blocker added

If that's the case, then bug28612 should fix it. PR at https://github.com/torproject/tor/pull/565

Trac:
Status: new to needs_review

Change at src/app/main/ntmain.c:301 is also needed. 0.3.5.5 + patches relay is running now, I will say if something goes wrong.

Trac:
Username: Vort

I've added that change too (I think) -- better now?

After last change, ntmain.c became the same as used in my relay. Relay is running for 1.5 days without problems, so I think this patch is good now.

Trac:
Username: Vort

Trac:
Reviewer: N/A to ahf

Patch looks good and seems to work here.

Trac:
Status: needs_review to merge_ready

Squashed and merged to 0.3.5 and forward!

Trac:
Status: merge_ready to closed
Resolution: N/A to fixed

Trac:
Milestone: Tor: 0.4.0.x-final to Tor: 0.3.5.x-final

closed

moved to tpo/core/tor#28612 (closed)

Tor start via Windows service fails

Child items ...

Activity