Opened 7 weeks ago

Closed 4 weeks ago

#28612 closed defect (fixed)

Tor start via Windows service fails

Reported by: Vort Owned by:
Priority: High Milestone: Tor: 0.4.0.x-final
Component: Core Tor/Tor Version: Tor: 0.3.5.3-alpha
Severity: Normal Keywords: windows nt-service regression 035-backport 035-rc-blocker
Cc: Actual Points:
Parent ID: Points:
Reviewer: ahf Sponsor:

Description

When I run tor --service start with latest Tor version (0.3.5.5-alpha or 3741f9e5) on Windows 7 SP1 x64, Tor process starts, outputs some info into log file and exits (crashes?).
Latest line in log file: Nov 25 19:35:20.000 [notice] Configured to measure statistics. Look for the *-stats files that will first be written to the data directory in 24 hours from now.
Tor was configured with --disable-gcc-hardening and compiled by MSYS2 with following libraries: Libevent 2.1.8-stable, OpenSSL 1.1.1a, Zlib 1.2.11, Liblzma 5.2.4, and Libzstd 1.3.7.

Child Tickets

Change History (17)

comment:1 Changed 7 weeks ago by nickm

Milestone: Tor: 0.4.0.x-final

Does it say any more if you log at "debug"?

Is there any way you can get a stack trace of the crash, if it crashes?

comment:2 Changed 7 weeks ago by Vort

With latest version (9c2b114b), it says more even at notice level:
Nov 27 22:05:10.000 [err] tor_assertion_failed_(): Bug: cpuworker.c:501: cpuworker_queue_work: Assertion threadpool failed; aborting. (on Tor 0.4.0.0-alpha-dev )
Nov 27 22:05:10.000 [err] Bug: Assertion threadpool failed in cpuworker_queue_work at cpuworker.c:501. (Stack trace not available) (on Tor 0.4.0.0-alpha-dev )

comment:3 Changed 7 weeks ago by Vort

Tried to insert some code at cpuworker.c:501 to catch the bug -> no result. Reverted my changes -> no assertion anymore.
But here is what says gdb now. Maybe it is more related than cpuworker.c problem:

Attaching to process 17568
[New Thread 17568.0x7884]
[New Thread 17568.0x7518]
[New Thread 17568.0x8b90]
[New Thread 17568.0xccb0]
[New Thread 17568.0x8b4c]
[New Thread 17568.0xd754]
Reading symbols from D:\Tor\tor.exe...done.
Continuing.
[Thread 17568.0xd754 exited with code 0]

Thread 4 received signal SIGFPE, Arithmetic exception.
[Switching to Thread 17568.0xccb0]
0x0000000000527af1 in token_bucket_raw_refill_steps (elapsed=54536462,
    cfg=0x658800 <global_bucket>, bucket=0x658808 <global_bucket+8>)
    at src/lib/evloop/token_bucket.c:90
90        if (elapsed > gap / cfg->rate) {
(gdb) bt
#0  0x0000000000527af1 in token_bucket_raw_refill_steps (elapsed=54536462,
    cfg=0x658800 <global_bucket>, bucket=0x658808 <global_bucket+8>)
    at src/lib/evloop/token_bucket.c:90
#1  token_bucket_rw_refill (bucket=0x658800 <global_bucket>,
    now_ts=now_ts@entry=872583402) at src/lib/evloop/token_bucket.c:206
#2  0x000000000040650f in connection_bucket_refill_single (conn=0x1552f60,
    now_ts=872583402) at src/core/mainloop/connection.c:3436
#3  0x000000000040df78 in connection_handle_read_impl (conn=0x1552f60)
    at src/core/mainloop/connection.c:3539
#4  connection_handle_read (conn=conn@entry=0x1552f60)
    at src/core/mainloop/connection.c:3647
#5  0x0000000000413b10 in conn_read_callback (fd=<optimized out>,
    event=<optimized out>, _conn=0x1552f60)
    at src/core/mainloop/mainloop.c:870
#6  0x00000000621d4999 in ?? () from D:\Tor\libevent-2-1-6.dll
#7  0x00000000621d5504 in ?? () from D:\Tor\libevent-2-1-6.dll
#8  0x000000000041496f in run_main_loop_once ()
    at src/core/mainloop/mainloop.c:2926
#9  run_main_loop_until_done () at src/core/mainloop/mainloop.c:2988
#10 do_main_loop () at src/core/mainloop/mainloop.c:2883
#11 0x00000000004f4886 in nt_service_body (argc=<optimized out>,
    argv=<optimized out>) at src/app/main/ntmain.c:301
#12 0x000007fefdb1a82d in SECHOST!RegisterServiceCtrlHandlerExA ()
   from C:\Windows\SYSTEM32\sechost.dll
#13 0x00000000778759cd in KERNEL32!BaseThreadInitThunk ()
   from C:\Windows\system32\kernel32.dll
#14 0x00000000779d385d in ntdll!RtlUserThreadStart ()
   from C:\Windows\SYSTEM32\ntdll.dll
#15 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

comment:4 Changed 6 weeks ago by teor

Status: newneeds_information

That's a really odd error:

  • all of those values are integers, so there shouldn't be a floating-point error
  • cfg->rate is checked for zero here:

https://github.com/torproject/tor/blob/02ba0a2dbbe1032b56d0ba17468f2ee0d95e4f1b/src/lib/evloop/token_bucket.c#L40

I think we're going to need full logs or more information to diagnose this error.

comment:5 Changed 6 weeks ago by Vort

Status: needs_informationnew

With fresh build cpuworker_queue_work bug popped out again.
Let's switch to it. Here is the trace with __asm__("int3"); added:

Thread 4 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 64900.0x1181c]
cpuworker_queue_work (priority=priority@entry=WQ_PRI_LOW,
    fn=fn@entry=0x47b020 <consensus_diff_worker_threadfn>,
    reply_fn=reply_fn@entry=0x47cd30 <consensus_diff_worker_replyfn>,
    arg=arg@entry=0x285b530) at src/core/mainloop/cpuworker.c:502
502       tor_assert(threadpool);
(gdb) bt
#0  cpuworker_queue_work (priority=priority@entry=WQ_PRI_LOW,
    fn=fn@entry=0x47b020 <consensus_diff_worker_threadfn>,
    reply_fn=reply_fn@entry=0x47cd30 <consensus_diff_worker_replyfn>,
    arg=arg@entry=0x285b530) at src/core/mainloop/cpuworker.c:502
#1  0x000000000047d7ef in consensus_diff_queue_diff_work (diff_to=0x139ab10,
    diff_from=0x1405930) at src/feature/dircache/consdiffmgr.c:1718
#2  consdiffmgr_rescan_flavor_ (flavor=FLAV_MICRODESC)
    at src/feature/dircache/consdiffmgr.c:1013
#3  consdiffmgr_rescan () at src/feature/dircache/consdiffmgr.c:1127
#4  0x00000000621d4b30 in ?? () from D:\Tor\libevent-2-1-6.dll
#5  0x00000000621d5504 in ?? () from D:\Tor\libevent-2-1-6.dll
#6  0x00000000004149bf in run_main_loop_once ()
    at src/core/mainloop/mainloop.c:2926
#7  run_main_loop_until_done () at src/core/mainloop/mainloop.c:2988
#8  do_main_loop () at src/core/mainloop/mainloop.c:2883
#9  0x00000000004f4916 in nt_service_body (argc=<optimized out>,
    argv=<optimized out>) at src/app/main/ntmain.c:301
#10 0x000007fefdb1a82d in SECHOST!RegisterServiceCtrlHandlerExA ()
   from C:\Windows\SYSTEM32\sechost.dll
#11 0x00000000778759cd in KERNEL32!BaseThreadInitThunk ()
   from C:\Windows\system32\kernel32.dll
#12 0x00000000779d385d in ntdll!RtlUserThreadStart ()
   from C:\Windows\SYSTEM32\ntdll.dll
#13 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) p threadpool
$1 = (threadpool_t *) 0x0
(gdb)

What I see is that threadpool should be initialized in cpu_init function.
Which is called from run_tor_main_loop function.

But it have no chance to be executed because do_main_loop is already fired by nt_service_body.

comment:6 Changed 6 weeks ago by omg

comment:7 Changed 6 weeks ago by teor

Ok, so we need to call run_tor_main_loop() in ntmain.c, rather than do_main_loop().

Thanks for finding this bug, it looks like it hasn't worked for a long time.

Edit: typo

Last edited 6 weeks ago by teor (previous) (diff)

comment:8 Changed 6 weeks ago by teor

Keywords: regression 035-backport 035-rc-must added
Version: Tor: 0.3.5.5-alphaTor: 0.3.5.3-alpha

It looks like this bug was introduced in 0.3.5.3-alpha in commit c7ce6b9821be22e734b79e07e318f2bfba32722d.
This regression in NT services must be fixed before 0.3.5-rc.

comment:9 Changed 6 weeks ago by nickm

Priority: MediumHigh

comment:10 Changed 6 weeks ago by teor

Keywords: 035-rc-blocker added; 035-rc-must removed

comment:11 Changed 6 weeks ago by nickm

Status: newneeds_review

If that's the case, then bug28612 should fix it. PR at https://github.com/torproject/tor/pull/565

comment:12 Changed 6 weeks ago by Vort

Change at src/app/main/ntmain.c:301 is also needed.
0.3.5.5 + patches relay is running now, I will say if something goes wrong.

comment:13 Changed 6 weeks ago by nickm

I've added that change too (I think) -- better now?

comment:14 Changed 6 weeks ago by Vort

After last change, ntmain.c became the same as used in my relay.
Relay is running for 1.5 days without problems, so I think this patch is good now.

comment:15 Changed 5 weeks ago by dgoulet

Reviewer: ahf

comment:16 Changed 4 weeks ago by ahf

Status: needs_reviewmerge_ready

Patch looks good and seems to work here.

comment:17 Changed 4 weeks ago by nickm

Resolution: fixed
Status: merge_readyclosed

Squashed and merged to 0.3.5 and forward!

Note: See TracTickets for help on using tickets.