Opened 8 months ago

Last modified 6 months ago

#30187 assigned defect

100% cpu usage in winthreads tor_cond_wait

Reported by: bolvan
Owned by: ahf
Priority: High
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.3.5.8
Severity: Normal
Keywords: windows 035-backport 042-proposed
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

For years I have run a relay using a self-compiled win64 build of Tor.
The compiler is mingw64.
The relay runs well for some time, but then suddenly starts using 100% CPU on all cores.
I traced where this happens. The following loop never ends:

  do {
    DWORD res;
    res = WaitForSingleObject(cond->event, ms);
    EnterCriticalSection(&cond->lock);
    if (cond->n_to_wake &&
        cond->generation != generation_at_start) {
      --cond->n_to_wake;
      --cond->n_waiting;
      result = 0;
      waiting = 0;
      goto out;
    } else if (res != WAIT_OBJECT_0) {
      result = (res==WAIT_TIMEOUT) ? 1 : -1;
      --cond->n_waiting;
      waiting = 0;
      goto out;
    } else if (ms != INFINITE) {
      endTime = GetTickCount();
      if (startTime + ms_orig <= endTime) {
        result = 1; /* Timeout */
        --cond->n_waiting;
        waiting = 0;
        goto out;
      } else {
        ms = startTime + ms_orig - endTime;
      }
    }
    /* If we make it here, we are still waiting. */
    if (cond->n_to_wake == 0) {
      /* There is nobody else who should wake up; reset
       * the event. */
      ResetEvent(cond->event);
    }
  out:
    LeaveCriticalSection(&cond->lock);
  } while (waiting);

res = WAIT_OBJECT_0;
ms = INFINITE;
cond->n_to_wake=0x11
cond->generation=0x28
generation_at_start=0x28

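This means that no path with "goto out" is ever executed: the generation has not changed, the wait succeeded, and ms is INFINITE, so every exit test fails.
More than one thread runs this loop, and each one consumes a separate core.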

Some people I shared binaries with report the same problem.
Please check.

Change History (16)

comment:1 Changed 8 months ago by ahf

Keywords: windows added; winthreads tor_cond_wait removed
Owner: set to ahf
Status: changed from new to assigned

Interesting, I have not seen this yet myself, but I also never ran a Tor relay on Windows.

Have you been able to reproduce this with 0.4.x/master?

comment:2 Changed 8 months ago by bolvan

For me it has never happened on Linux. The problem seems to be winthreads-specific.
I'll build 0.4.x and check.

comment:3 Changed 8 months ago by bolvan

Yes, 0.4.0.4-rc has the same bug.

comment:4 Changed 8 months ago by ahf

Have you been able to debug which call to tor_cond_wait() is problematic?

And this is only when running as a relay, right? You have not seen this condition when running as a client?

comment:5 Changed 8 months ago by bolvan

Only worker_thread_main calls tor_cond_wait.
Personally I don't run Tor as a client, but another person who does reports that client mode does not cause the problem; only relay mode does.

comment:6 Changed 8 months ago by nickm

Keywords: 035-backport added
Milestone: Tor: 0.4.0.x-final
Priority: changed from Medium to High

comment:7 Changed 8 months ago by nickm

One possible fix here would be to use ConditionVariable instead; it's been in Windows since Vista.

If we don't go that route, here is a part that looks suspicious to me: the generation count getting stuck at 28 suggests to me that we are using generation wrong. In any case, we should really be either waking up or sleeping each time through the loop, I think.
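
As a rough illustration of the ConditionVariable route, a wait built on the Vista-era API could look something like the sketch below. It is not a patch against Tor: sketch_cond_t and the sketch_cond_* functions are hypothetical names, the return convention (0 woken, 1 timeout, -1 error) simply mirrors the loop quoted in the description, and the whole block is Windows-only.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
/* The condition variable API needs Vista-level headers. */
#if !defined(_WIN32_WINNT) || _WIN32_WINNT < 0x0600
#undef _WIN32_WINNT
#define _WIN32_WINNT 0x0600
#endif
#include <windows.h>

/* Hypothetical replacement type; no event, counters or generation needed. */
typedef struct sketch_cond_t {
  CONDITION_VARIABLE cv;
} sketch_cond_t;

static void
sketch_cond_init(sketch_cond_t *cond)
{
  assert(cond != NULL);
  InitializeConditionVariable(&cond->cv);
}

/* Wait for a signal. The caller must hold `lock`; it is released while
 * sleeping and re-acquired before returning. `ms` may be INFINITE.
 * Returns 0 if woken (possibly spuriously), 1 on timeout, -1 on error. */
static int
sketch_cond_wait(sketch_cond_t *cond, CRITICAL_SECTION *lock, DWORD ms)
{
  assert(cond != NULL && lock != NULL);
  if (SleepConditionVariableCS(&cond->cv, lock, ms))
    return 0;
  return (GetLastError() == ERROR_TIMEOUT) ? 1 : -1;
}

/* Wake one waiter. */
static void
sketch_cond_signal_one(sketch_cond_t *cond)
{
  WakeConditionVariable(&cond->cv);
}

/* Wake every waiter. */
static void
sketch_cond_signal_all(sketch_cond_t *cond)
{
  WakeAllConditionVariable(&cond->cv);
}
#endif /* _WIN32 */
```

Each call either sleeps in the kernel or returns, so there is no loop that can spin on a stale event. Spurious wakeups are still possible, so a caller would need to re-check its own predicate (for example, whether the work queue is non-empty) in a loop around sketch_cond_wait.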

comment:8 Changed 8 months ago by cypherpunks

Can you describe how to trace this down, or link to information about how to do it? I also have a reproducible "100% CPU on all cores" problem. Thanks, and thanks for running a relay.

comment:9 Changed 8 months ago by bolvan

You can use gdb. I'm not very good with gdb, so I used https://github.com/rainers/cv2pdb/releases to convert the DWARF debug info to PDB and then used Visual Studio.
It will first ask for the source file location. It can set breakpoints, although it could not watch variables, so I used the disassembly and register windows to read the values. Maybe this problem is caused by gcc optimizations; you could try removing -O2, -O, ... from the Makefile. I haven't checked.

comment:10 Changed 7 months ago by nickm

Keywords: 042-proposed added
Milestone: changed from Tor: 0.4.0.x-final to Tor: unspecified

comment:11 Changed 7 months ago by nickm

(This is worth doing, but it is not in scope for stable.)

comment:12 Changed 7 months ago by bolvan

This bug makes a Tor relay unusable under Windows.
All Windows relay operators should stop their nodes.
Is that serious enough?

comment:13 Changed 7 months ago by ahf

It is serious and we plan on fixing this bug. Right now (end of May 2019) we are finishing off a sponsor and trying to get 0.4.1 shipped. Once we are done with that work, we will get to this. The priority of this bug is still considered high.

comment:14 Changed 7 months ago by cypherpunks

Is this reproducible with MinGW-W64 trunk?

comment:15 Changed 7 months ago by bolvan

It doesn't seem to be related to the compiler. It's a bug in the Windows-specific code.
I compiled with a recent mingw-w64 on Windows and the bug was still there.

comment:16 Changed 6 months ago by cypherpunks

It's a bug in the Win32 build code (x64 too).