Opened 6 months ago

Last modified 10 days ago

#24857 new defect

tor 0.3.1.9 100% cpu load

Reported by: Eugene646 Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.3.1.9
Severity: Normal Keywords: cpu, windows, linux, performance, regression, 033-triage-20180326, 033-removed-20180326, 034-deferred-20180602, 035-removed-20180711
Cc: chelseakomlo Actual Points:
Parent ID: #25500 Points:
Reviewer: Sponsor:

Description

Hi,

I have tor 0.3.1.9 running as non-exit relay.
From time to time it starts consuming 100% of CPU. Even though network traffic is minimal.
Is it normal? If not, how can I help you to investigate it?

My system is Win 7 x64 SP1, cpu is Core i5-3570.

Child Tickets

Change History (28)

comment:1 Changed 6 months ago by dgoulet

Component: Core TorCore Tor/Tor
Keywords: windows added
Status: newneeds_information

Are you able to see some sort of time period pattern where tor would take all the CPU? Like is it every hour? or every 10 minutes?

Was it only at startup or it keeps doing it often?

comment:2 Changed 6 months ago by Eugene646

No, there is no pattern. It happens some days and may last from few minutes to few hours. Not necessarily at start. There is no reliable way of reproducing.

Maybe I could enable some debug logging or like that?

comment:3 Changed 6 months ago by mr_rokman

I'm also experiencing same issue. System is Win7 64-bit on Intel E6850, Tor version is 0.3.1.9, running as service, non-exit, non-bridge relay, although DirPort is enabled. From time to time Tor process start consuming 100% of both CPU cores, network activity during this spikes on both ORPort and DirPort is next to nothing (1-3 kbps). These spikes' pattern also does not depend on whenever I'm using relay myself. Moreover, task manager shows ~50% of kernel time of CPU utilization (some file/network operations maybe).

I have fragment of 'debug'-level logfile which is very big to post but I could do some parsing on it (like count repeating lines and comparing it to 'notice'-level log), I could also post torrc contents.

Or maybe I should create another ticket? Child ticket possibly?

comment:4 Changed 6 months ago by creideiki

I think I'm seeing this same bug, or at least something very similar.

I run Tor as a non-exit relay in an LXC container on a Gentoo Linux box on an Intel Core i7-6700K, with Tor constantly shuffling 10Mb/s of traffic. Some random time after starting the Tor service, it starts eating one entire CPU core.

I'm not sure if it's related, since I don't have CPU usage history for this container and thus don't know when the 100% CPU usage started, but the Tor log contains a lot of these messages:

Jan 24 05:17:24.000 [warn] Failing because we have 4063 connections already. Please read doc/TUNING for guidance. [over 16000001 similar message(s) suppressed in last 21600 seconds]

If that suppression count is correct, that's 740 messages per second, which seems a bit excessive.

I'm currently running this package:

net-vpn/tor-0.3.2.9::gentoo was built with the following:
USE="-libressl -lzma -scrypt seccomp (-selinux) -systemd -test tor-hardening -web -zstd" ABI_X86="(64)"

The logs suggest the following timeline:

Dec 9: First occurrence of the "Failing because we have X connections already" message, X=29967, on a Tor 0.3.2.2-alpha with 55 days of uptime.
Dec 10: Next message, X=29967.
Dec 15: Tor upgraded to 0.3.2.6-alpha.
Dec 22: Next message, X=29967, at 6 days 6 hours uptime.
Dec 30: Next message, X=29967, at 15 days 6 hours uptime.
Jan 1: 2 messages, X=29967, with 6 hours interval.
Jan 2: 1 message, X=29967, 21 hours after the last one.
Jan 7: Machine is rebooted to kernel 4.14.11 to get initial Meltdown patches. Failure messages start coming 4 minutes after starting the service, with X=4063.
Jan 7-21: Failure messages come regularly every 6 hours, X=4063.
Jan 21: Tor is upgraded to 0.3.2.9. Failure messages start coming 7 minutes after starting the service, with X=4063.
Jan 21-24 (now): Failure messages come regularly every 6 hours, X=4063.

The Tor process uses 100% CPU (i.e. one core), but not very much memory - currently 900M virtual, 300M resident.

"perf top" on the Tor process isn't very helpful; at the top is perf's own BPF stuff at 5-10% CPU time, followed by pthread functions in glibc at ~2-3%. Actual Tor code is way down the list; the only function visible without going ridiculously low is "assert_connection_ok" at 1% CPU time.

"strace" on the Tor process sees mostly calls to epoll_pwait():

[pid 23279] 1516784382.485502 epoll_pwait(3, [{EPOLLIN, {u32=9, u64=9}}], 512, 9, NULL, 8) = 1

Actually, a whole lot of them; strace counts around 69000 such calls per second, with other syscalls during one random second coming in at much lower counts:

      1 accept4
      1 close
      1 setsockopt
    174 getpid
    311 futex
    682 write
    967 getsockopt
    967 ioctl
   1116 epoll_ctl
   1561 read

(Edit: formatting)

Last edited 6 months ago by creideiki (previous) (diff)

comment:5 Changed 6 months ago by creideiki

I restarted the Tor service and watched it closely. The error message "Failing because we have 4063 connections already" appears at the same time as it starts using 100% CPU. This time it took 6 minutes.

At 3 minutes after starting the service, well after the log says the relay is up and running, "perf top" looks nice and sane like this:

    16.76%  tor                 [.] curve25519_donna
     3.18%  libcrypto.so.1.0.0  [.] __bn_sqrx8x_reduction
     1.93%  libcrypto.so.1.0.0  [.] mulx4x_internal
     1.79%  libcrypto.so.1.0.0  [.] bn_sqrx8x_internal
     1.62%  tor                 [.] connection_bucket_refill
     1.58%  tor                 [.] tor_memcmp
     1.40%  libc-2.26.so        [.] _int_malloc
     1.37%  libc-2.26.so        [.] malloc
     1.29%  libcrypto.so.1.0.0  [.] sha256_block_data_order_avx2
     0.99%  libc-2.26.so        [.] __memmove_avx_unaligned_erms
     0.96%  libc-2.26.so        [.] _int_free
     0.93%  tor                 [.] tor_memeq
     0.73%  tor                 [.] buf_datalen
     0.73%  [kernel]            [k] ___bpf_prog_run
     0.71%  [kernel]            [k] syscall_return_via_sysret
     0.61%  libcrypto.so.1.0.0  [.] __ecp_nistz256_mul_montx
     0.57%  [kernel]            [k] __fget
     0.55%  libcrypto.so.1.0.0  [.] __ecp_nistz256_sqr_montx
     0.53%  libcrypto.so.1.0.0  [.] sha1_block_data_order_avx2
     0.52%  tor                 [.] channel_update_bad_for_new_circs
     0.46%  libssl.so.1.0.0     [.] ssl_cipher_apply_rule.isra.1
     0.45%  libpthread-2.26.so  [.] __pthread_mutex_lock
     0.43%  libc-2.26.so        [.] cfree@GLIBC_2.2.5
     0.42%  libcrypto.so.1.0.0  [.] _aesni_ctr32_ghash_6x

After 6 minutes, when the error message has appeared in the log and the process uses 100% CPU, "perf top" looks like this:

     9.45%  [kernel]               [k] syscall_return_via_sysret
     6.33%  [unknown]              [k] 0xfffffe00000b601e
     5.16%  [kernel]               [k] ep_send_events_proc
     3.48%  [kernel]               [k] ___bpf_prog_run
     3.38%  [kernel]               [k] _raw_spin_lock_irqsave
     3.19%  [kernel]               [k] read_tsc
     3.06%  [kernel]               [k] __fget
     2.75%  libpthread-2.26.so     [.] __pthread_mutex_lock
     2.55%  [kernel]               [k] _raw_spin_unlock_irqrestore
     2.49%  [kernel]               [k] seccomp_run_filters
     2.33%  libc-2.26.so           [.] epoll_pwait
     2.27%  [kernel]               [k] select_estimate_accuracy
     2.25%  libpthread-2.26.so     [.] __pthread_mutex_unlock_usercnt
     1.84%  [kernel]               [k] syscall_trace_enter
     1.81%  [kernel]               [k] __seccomp_filter
     1.78%  libevent-2.1.so.6.0.2  [.] event_base_loop
     1.65%  [kernel]               [k] ep_scan_ready_list.constprop.15
     1.57%  [unknown]              [k] 0xfffffe000000601e
     1.52%  [kernel]               [k] ep_poll
     1.41%  [kernel]               [k] ktime_get_ts64
     1.36%  [kernel]               [k] entry_SYSCALL_64_after_hwframe
     1.35%  libevent-2.1.so.6.0.2  [.] event_callback_activate_nolock_
     1.29%  [unknown]              [k] 0xfffffe00000b6000
     1.10%  tor                    [.] assert_connection_ok

And says the Tor process is spending 65% of its time in the kernel. Probably with those 69k/s epoll_pwait() calls.

Network statistics says it's still shuffling its 10Mb/s, though, so it's doing something productive as well.

Final Tor log is not very interesting, but for completeness:

Jan 24 10:49:47.000 [notice] Tor 0.3.2.9 (git-9e8b762fcecfece6) opening log file.
Jan 24 10:49:47.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Jan 24 10:49:48.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Jan 24 10:49:48.000 [notice] Configured to measure statistics. Look for the *-stats files that will first be written to the data directory in 24 hours from now.
Jan 24 10:49:48.000 [notice] Your Tor server's identity key fingerprint is CENSORED
Jan 24 10:49:48.000 [notice] Bootstrapped 0%: Starting
Jan 24 10:49:54.000 [notice] Starting with guard context "default"
Jan 24 10:49:54.000 [notice] Bootstrapped 80%: Connecting to the Tor network
Jan 24 10:49:54.000 [notice] Self-testing indicates your ORPort is reachable from the outside. Excellent.
Jan 24 10:49:54.000 [notice] Bootstrapped 85%: Finishing handshake with first hop
Jan 24 10:49:55.000 [notice] Bootstrapped 90%: Establishing a Tor circuit
Jan 24 10:49:56.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Jan 24 10:49:56.000 [notice] Bootstrapped 100%: Done
Jan 24 10:49:58.000 [notice] Self-testing indicates your DirPort is reachable from the outside. Excellent. Publishing server descriptor.
Jan 24 10:50:00.000 [notice] Performing bandwidth self-test...done.
Jan 24 10:56:04.000 [warn] Failing because we have 4063 connections already. Please read doc/TUNING for guidance.

comment:6 Changed 6 months ago by creideiki

Keywords: linux added
Status: needs_informationnew
Version: Tor: 0.3.1.9Tor: 0.3.2.9

comment:7 in reply to:  4 Changed 6 months ago by arma

Replying to creideiki:

Jan 24 05:17:24.000 [warn] Failing because we have 4063 connections already. Please read doc/TUNING for guidance. [over 16000001 similar message(s) suppressed in last 21600 seconds]

You need to give Tor access to more file descriptors. Tor relays need to be able to reach the other Tor relays, and 4096-or-so descriptors is not enough.

comment:8 Changed 6 months ago by teor

Milestone: Tor: 0.3.4.x-final

comment:10 Changed 5 months ago by AlexRad

I see next log messages when tor service use 100% CPU

Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from BDCF9DAE9B06BE846F509A2CD4622D0296F4C524A5B97FD55B87A19082BFFD44 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from BDCF9DAE9B06BE846F509A2CD4622D0296F4C524A5B97FD55B87A19082BFFD44 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from A2C8550157121D20120939A0CE806E8C80F9EAD261EEF569840D433EC4AADA83 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from A2C8550157121D20120939A0CE806E8C80F9EAD261EEF569840D433EC4AADA83 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from 96CCFFE34D19FE0E9A271C171C9E29EE2D95A309CBCDA05071F65B9BDE0F05AC to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from 96CCFFE34D19FE0E9A271C171C9E29EE2D95A309CBCDA05071F65B9BDE0F05AC to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from 743550945E24E932FA2A4548160FB7B94A56E8DB36F0024BDBB0DA0B0C084CBB to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from 743550945E24E932FA2A4548160FB7B94A56E8DB36F0024BDBB0DA0B0C084CBB to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 6E83B5ED1443A63C1D2A227E40C14876CE9A444D31F2797EBEBD3131875F9804 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 6E83B5ED1443A63C1D2A227E40C14876CE9A444D31F2797EBEBD3131875F9804 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from C1733C9DB3D3FFA41648B4512D3ED9088706828C0B46B63910F9334CD18D71B3 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from C1733C9DB3D3FFA41648B4512D3ED9088706828C0B46B63910F9334CD18D71B3 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from D21909A5CCEAE66D820DCB25D5BBF6591B8E4D7199FBE52C6F2B4D7607F445D6 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from D21909A5CCEAE66D820DCB25D5BBF6591B8E4D7199FBE52C6F2B4D7607F445D6 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 244325664A3D3FEA6EEE708E3A7F5457B9313C108C744275A7A6BAB031E1783F to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 244325664A3D3FEA6EEE708E3A7F5457B9313C108C744275A7A6BAB031E1783F to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 8A68AF897614AB690785BF1C709249ADCDEA2E78657B41CE55285FAE3D3672CE to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 8A68AF897614AB690785BF1C709249ADCDEA2E78657B41CE55285FAE3D3672CE to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:56.000 [info] store_multiple(): Adding consensus diff from 1520D1CA72D0806C7F627C57ED9B1380904CFD24F504762A4B4649DF1B70CC4B to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:56.000 [info] store_multiple(): Adding consensus diff from 1520D1CA72D0806C7F627C57ED9B1380904CFD24F504762A4B4649DF1B70CC4B to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:56.000 [info] store_multiple(): Adding consensus diff from ACFCFDD29853CA1EB252867E1643F9B7A2C389DD2BC5B4F3FC44EBA8F2D772CC to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:56.000 [info] store_multiple(): Adding consensus diff from ACFCFDD29853CA1EB252867E1643F9B7A2C389DD2BC5B4F3FC44EBA8F2D772CC to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 09783C82602523E03D46483573F927E768EC2C70E1008956EEEFFA80185EC0AB to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 09783C82602523E03D46483573F927E768EC2C70E1008956EEEFFA80185EC0AB to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 39AE8F1436B0C78B5A39A90D46EB46A960E30253595DB9B6EC74E80B8532185D to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 39AE8F1436B0C78B5A39A90D46EB46A960E30253595DB9B6EC74E80B8532185D to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 4E9D2911D8985DB0810B80DA38A353D7DC89B686CB312B7D8A2CCE8B1CDE5370 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 4E9D2911D8985DB0810B80DA38A353D7DC89B686CB312B7D8A2CCE8B1CDE5370 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip

and so on...

comment:11 Changed 5 months ago by teor

Is normal for your relay to use 100% CPU while generating and compressing consensus diffs.
Try fixing the file descriptor limit, if you can.

Also, the next 0.3.1 release will handle extra load from clients better,

comment:12 in reply to:  10 Changed 5 months ago by cypherpunks

Replying to AlexRad:

I see next log messages when tor service use 100% CPU

and so on...

For a long time (10 and more minutes)?

comment:14 Changed 4 months ago by Eugene646

Still happens with tor 0.3.3.2 alpha.

comment:15 Changed 4 months ago by cypherpunks

Keywords: performance regression added
Milestone: Tor: 0.3.4.x-finalTor: 0.3.3.x-final
Version: Tor: 0.3.2.9Tor: 0.3.1.9

It's a regression, started from 0.3.1.x, and it looks like a dumb mingw implementation of some function, used in consdiff. #24826 didn't fix it.

comment:16 Changed 4 months ago by cypherpunks

it looks like a dumb mingw implementation of some function

#ifdef _WIN32
/* On Windows, unlink won't work on a file if the file is actively mmap()ed.
 * That forces us to be less aggressive about unlinking files, and causes other
 * changes throughout our logic.
 */
#define MUST_UNMAP_TO_UNLINK
#endif /* defined(_WIN32) */
#ifdef MUST_UNMAP_TO_UNLINK
  /* If we can't unlink the files that we're still using, then we need to
   * tell the storagedir backend to allow far more files than this consensus
   * cache actually wants, so that it can hold files which, from this cache's
   * perspective, have become useless.
   */
#define VERY_LARGE_STORAGEDIR_LIMIT (1000*1000)
  storagedir_max_entries = VERY_LARGE_STORAGEDIR_LIMIT;
#else /* !(defined(MUST_UNMAP_TO_UNLINK)) */
  /* Otherwise, we can just tell the storagedir to use the same limits
   * as this cache. */
  storagedir_max_entries = max_entries;
#endif /* defined(MUST_UNMAP_TO_UNLINK) */
get_max_age_to_cache(void)
{
  const int32_t DEFAULT_MAX_AGE_TO_CACHE = 8192;
  const int32_t MIN_MAX_AGE_TO_CACHE = 0;
  const int32_t MAX_MAX_AGE_TO_CACHE = 8192;
  const char MAX_AGE_TO_CACHE_NAME[] = "max-consensus-age-to-cache-for-diff";

  const or_options_t *options = get_options();

  if (options->MaxConsensusAgeForDiffs) {
    const int v = options->MaxConsensusAgeForDiffs;
    if (v >= MAX_MAX_AGE_TO_CACHE * 3600)
      return MAX_MAX_AGE_TO_CACHE;
    else
      return v;
  }

  /* The parameter is in hours, so we multiply */
  return 3600 * networkstatus_get_param(NULL,
                                        MAX_AGE_TO_CACHE_NAME,
                                        DEFAULT_MAX_AGE_TO_CACHE,
                                        MIN_MAX_AGE_TO_CACHE,
                                        MAX_MAX_AGE_TO_CACHE);
}

comment:17 Changed 4 months ago by nickm

Keywords: 033-triage-20180326 added

Second batch of triage for 0.3.3: tickets that we didn't cover the first time.

comment:18 Changed 4 months ago by nickm

Keywords: 033-removed-20180326 added

Mark this tickets as removed.

comment:19 Changed 4 months ago by nickm

Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final

considering for 034 as possibly related to our mainloop refactor.

comment:20 Changed 4 months ago by arma

Summary from the bottom comments seem to be that there's some unhealthy interaction between our unlink hack for Windows and the consensus diff file generation?

comment:21 Changed 2 months ago by nickm

I think they are implying (*) that we're storing way too many old entries in the storagedir, and so the number of diffs we compute every hour is growing without bound? In theory this shouldn't actually be happening, though, since we're supposed to be only computing diffs to entries within a certain time range. (We should check to make sure we're actually doing that.)

(*) Please, if when posting an explanation for a bug, try to actually explain what you think is happening, in as much detail as you can. Things are never as obvious to other people as they are to you.

comment:22 Changed 2 months ago by cypherpunks

Something weird, wrong, typos.

Last edited 2 months ago by cypherpunks (previous) (diff)

comment:23 Changed 2 months ago by cypherpunks

Something weird, wrong, typos.

Last edited 2 months ago by cypherpunks (previous) (diff)

comment:24 Changed 7 weeks ago by nickm

Keywords: 034-deferred-20180602 added
Milestone: Tor: 0.3.4.x-finalTor: 0.3.5.x-final

Deferring non-must tickets to 0.3.5

comment:25 Changed 7 weeks ago by arma

See also #25461 and #18614 for other profiling / cpu load tickets. Should we merge them, or pick our favorite and close the others?

comment:26 Changed 7 weeks ago by teor

Parent ID: #25500

comment:27 Changed 7 weeks ago by chelseakomlo

Cc: chelseakomlo added

comment:28 Changed 10 days ago by nickm

Keywords: 035-removed-20180711 added
Milestone: Tor: 0.3.5.x-finalTor: unspecified

These tickets are being triaged out of 0.3.5. The ones marked "035-roadmap-proposed" may return.

Note: See TracTickets for help on using tickets.