Opened 11 months ago

Last modified 4 weeks ago

#24857 needs_revision defect

Tor uses 100% CPU when accessing the cache directory on Windows

Reported by: Eugene646 Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.3.1.9
Severity: Normal Keywords: cpu, windows, linux, performance, regression, 033-triage-20180326, 033-removed-20180326, 034-deferred-20180602, 035-removed-20180711, 034-backport, 033-backport-maybe, 032-unreached-backport, 040-roadmap-proposed
Cc: chelseakomlo Actual Points:
Parent ID: #25500 Points:
Reviewer: Sponsor:

Description

Hi,

I have tor 0.3.1.9 running as non-exit relay.
From time to time it starts consuming 100% of CPU. Even though network traffic is minimal.
Is it normal? If not, how can I help you to investigate it?

My system is Win 7 x64 SP1, cpu is Core i5-3570.

Child Tickets

Change History (43)

comment:1 Changed 11 months ago by dgoulet

Component: Core TorCore Tor/Tor
Keywords: windows added
Status: newneeds_information

Are you able to see some sort of time period pattern where tor would take all the CPU? Like is it every hour? or every 10 minutes?

Was it only at startup or it keeps doing it often?

comment:2 Changed 11 months ago by Eugene646

No, there is no pattern. It happens some days and may last from few minutes to few hours. Not necessarily at start. There is no reliable way of reproducing.

Maybe I could enable some debug logging or like that?

comment:3 Changed 11 months ago by mr_rokman

I'm also experiencing same issue. System is Win7 64-bit on Intel E6850, Tor version is 0.3.1.9, running as service, non-exit, non-bridge relay, although DirPort is enabled. From time to time Tor process start consuming 100% of both CPU cores, network activity during this spikes on both ORPort and DirPort is next to nothing (1-3 kbps). These spikes' pattern also does not depend on whenever I'm using relay myself. Moreover, task manager shows ~50% of kernel time of CPU utilization (some file/network operations maybe).

I have fragment of 'debug'-level logfile which is very big to post but I could do some parsing on it (like count repeating lines and comparing it to 'notice'-level log), I could also post torrc contents.

Or maybe I should create another ticket? Child ticket possibly?

comment:4 Changed 11 months ago by creideiki

I think I'm seeing this same bug, or at least something very similar.

I run Tor as a non-exit relay in an LXC container on a Gentoo Linux box on an Intel Core i7-6700K, with Tor constantly shuffling 10Mb/s of traffic. Some random time after starting the Tor service, it starts eating one entire CPU core.

I'm not sure if it's related, since I don't have CPU usage history for this container and thus don't know when the 100% CPU usage started, but the Tor log contains a lot of these messages:

Jan 24 05:17:24.000 [warn] Failing because we have 4063 connections already. Please read doc/TUNING for guidance. [over 16000001 similar message(s) suppressed in last 21600 seconds]

If that suppression count is correct, that's 740 messages per second, which seems a bit excessive.

I'm currently running this package:

net-vpn/tor-0.3.2.9::gentoo was built with the following:
USE="-libressl -lzma -scrypt seccomp (-selinux) -systemd -test tor-hardening -web -zstd" ABI_X86="(64)"

The logs suggest the following timeline:

Dec 9: First occurrence of the "Failing because we have X connections already" message, X=29967, on a Tor 0.3.2.2-alpha with 55 days of uptime.
Dec 10: Next message, X=29967.
Dec 15: Tor upgraded to 0.3.2.6-alpha.
Dec 22: Next message, X=29967, at 6 days 6 hours uptime.
Dec 30: Next message, X=29967, at 15 days 6 hours uptime.
Jan 1: 2 messages, X=29967, with 6 hours interval.
Jan 2: 1 message, X=29967, 21 hours after the last one.
Jan 7: Machine is rebooted to kernel 4.14.11 to get initial Meltdown patches. Failure messages start coming 4 minutes after starting the service, with X=4063.
Jan 7-21: Failure messages come regularly every 6 hours, X=4063.
Jan 21: Tor is upgraded to 0.3.2.9. Failure messages start coming 7 minutes after starting the service, with X=4063.
Jan 21-24 (now): Failure messages come regularly every 6 hours, X=4063.

The Tor process uses 100% CPU (i.e. one core), but not very much memory - currently 900M virtual, 300M resident.

"perf top" on the Tor process isn't very helpful; at the top is perf's own BPF stuff at 5-10% CPU time, followed by pthread functions in glibc at ~2-3%. Actual Tor code is way down the list; the only function visible without going ridiculously low is "assert_connection_ok" at 1% CPU time.

"strace" on the Tor process sees mostly calls to epoll_pwait():

[pid 23279] 1516784382.485502 epoll_pwait(3, [{EPOLLIN, {u32=9, u64=9}}], 512, 9, NULL, 8) = 1

Actually, a whole lot of them; strace counts around 69000 such calls per second, with other syscalls during one random second coming in at much lower counts:

      1 accept4
      1 close
      1 setsockopt
    174 getpid
    311 futex
    682 write
    967 getsockopt
    967 ioctl
   1116 epoll_ctl
   1561 read

(Edit: formatting)

Last edited 11 months ago by creideiki (previous) (diff)

comment:5 Changed 11 months ago by creideiki

I restarted the Tor service and watched it closely. The error message "Failing because we have 4063 connections already" appears at the same time as it starts using 100% CPU. This time it took 6 minutes.

At 3 minutes after starting the service, well after the log says the relay is up and running, "perf top" looks nice and sane like this:

    16.76%  tor                 [.] curve25519_donna
     3.18%  libcrypto.so.1.0.0  [.] __bn_sqrx8x_reduction
     1.93%  libcrypto.so.1.0.0  [.] mulx4x_internal
     1.79%  libcrypto.so.1.0.0  [.] bn_sqrx8x_internal
     1.62%  tor                 [.] connection_bucket_refill
     1.58%  tor                 [.] tor_memcmp
     1.40%  libc-2.26.so        [.] _int_malloc
     1.37%  libc-2.26.so        [.] malloc
     1.29%  libcrypto.so.1.0.0  [.] sha256_block_data_order_avx2
     0.99%  libc-2.26.so        [.] __memmove_avx_unaligned_erms
     0.96%  libc-2.26.so        [.] _int_free
     0.93%  tor                 [.] tor_memeq
     0.73%  tor                 [.] buf_datalen
     0.73%  [kernel]            [k] ___bpf_prog_run
     0.71%  [kernel]            [k] syscall_return_via_sysret
     0.61%  libcrypto.so.1.0.0  [.] __ecp_nistz256_mul_montx
     0.57%  [kernel]            [k] __fget
     0.55%  libcrypto.so.1.0.0  [.] __ecp_nistz256_sqr_montx
     0.53%  libcrypto.so.1.0.0  [.] sha1_block_data_order_avx2
     0.52%  tor                 [.] channel_update_bad_for_new_circs
     0.46%  libssl.so.1.0.0     [.] ssl_cipher_apply_rule.isra.1
     0.45%  libpthread-2.26.so  [.] __pthread_mutex_lock
     0.43%  libc-2.26.so        [.] cfree@GLIBC_2.2.5
     0.42%  libcrypto.so.1.0.0  [.] _aesni_ctr32_ghash_6x

After 6 minutes, when the error message has appeared in the log and the process uses 100% CPU, "perf top" looks like this:

     9.45%  [kernel]               [k] syscall_return_via_sysret
     6.33%  [unknown]              [k] 0xfffffe00000b601e
     5.16%  [kernel]               [k] ep_send_events_proc
     3.48%  [kernel]               [k] ___bpf_prog_run
     3.38%  [kernel]               [k] _raw_spin_lock_irqsave
     3.19%  [kernel]               [k] read_tsc
     3.06%  [kernel]               [k] __fget
     2.75%  libpthread-2.26.so     [.] __pthread_mutex_lock
     2.55%  [kernel]               [k] _raw_spin_unlock_irqrestore
     2.49%  [kernel]               [k] seccomp_run_filters
     2.33%  libc-2.26.so           [.] epoll_pwait
     2.27%  [kernel]               [k] select_estimate_accuracy
     2.25%  libpthread-2.26.so     [.] __pthread_mutex_unlock_usercnt
     1.84%  [kernel]               [k] syscall_trace_enter
     1.81%  [kernel]               [k] __seccomp_filter
     1.78%  libevent-2.1.so.6.0.2  [.] event_base_loop
     1.65%  [kernel]               [k] ep_scan_ready_list.constprop.15
     1.57%  [unknown]              [k] 0xfffffe000000601e
     1.52%  [kernel]               [k] ep_poll
     1.41%  [kernel]               [k] ktime_get_ts64
     1.36%  [kernel]               [k] entry_SYSCALL_64_after_hwframe
     1.35%  libevent-2.1.so.6.0.2  [.] event_callback_activate_nolock_
     1.29%  [unknown]              [k] 0xfffffe00000b6000
     1.10%  tor                    [.] assert_connection_ok

And says the Tor process is spending 65% of its time in the kernel. Probably with those 69k/s epoll_pwait() calls.

Network statistics says it's still shuffling its 10Mb/s, though, so it's doing something productive as well.

Final Tor log is not very interesting, but for completeness:

Jan 24 10:49:47.000 [notice] Tor 0.3.2.9 (git-9e8b762fcecfece6) opening log file.
Jan 24 10:49:47.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Jan 24 10:49:48.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Jan 24 10:49:48.000 [notice] Configured to measure statistics. Look for the *-stats files that will first be written to the data directory in 24 hours from now.
Jan 24 10:49:48.000 [notice] Your Tor server's identity key fingerprint is CENSORED
Jan 24 10:49:48.000 [notice] Bootstrapped 0%: Starting
Jan 24 10:49:54.000 [notice] Starting with guard context "default"
Jan 24 10:49:54.000 [notice] Bootstrapped 80%: Connecting to the Tor network
Jan 24 10:49:54.000 [notice] Self-testing indicates your ORPort is reachable from the outside. Excellent.
Jan 24 10:49:54.000 [notice] Bootstrapped 85%: Finishing handshake with first hop
Jan 24 10:49:55.000 [notice] Bootstrapped 90%: Establishing a Tor circuit
Jan 24 10:49:56.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Jan 24 10:49:56.000 [notice] Bootstrapped 100%: Done
Jan 24 10:49:58.000 [notice] Self-testing indicates your DirPort is reachable from the outside. Excellent. Publishing server descriptor.
Jan 24 10:50:00.000 [notice] Performing bandwidth self-test...done.
Jan 24 10:56:04.000 [warn] Failing because we have 4063 connections already. Please read doc/TUNING for guidance.

comment:6 Changed 11 months ago by creideiki

Keywords: linux added
Status: needs_informationnew
Version: Tor: 0.3.1.9Tor: 0.3.2.9

comment:7 in reply to:  4 Changed 11 months ago by arma

Replying to creideiki:

Jan 24 05:17:24.000 [warn] Failing because we have 4063 connections already. Please read doc/TUNING for guidance. [over 16000001 similar message(s) suppressed in last 21600 seconds]

You need to give Tor access to more file descriptors. Tor relays need to be able to reach the other Tor relays, and 4096-or-so descriptors is not enough.

comment:8 Changed 10 months ago by teor

Milestone: Tor: 0.3.4.x-final

comment:10 Changed 10 months ago by AlexRad

I see next log messages when tor service use 100% CPU

Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from BDCF9DAE9B06BE846F509A2CD4622D0296F4C524A5B97FD55B87A19082BFFD44 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from BDCF9DAE9B06BE846F509A2CD4622D0296F4C524A5B97FD55B87A19082BFFD44 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from A2C8550157121D20120939A0CE806E8C80F9EAD261EEF569840D433EC4AADA83 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from A2C8550157121D20120939A0CE806E8C80F9EAD261EEF569840D433EC4AADA83 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from 96CCFFE34D19FE0E9A271C171C9E29EE2D95A309CBCDA05071F65B9BDE0F05AC to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from 96CCFFE34D19FE0E9A271C171C9E29EE2D95A309CBCDA05071F65B9BDE0F05AC to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from 743550945E24E932FA2A4548160FB7B94A56E8DB36F0024BDBB0DA0B0C084CBB to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:54.000 [info] store_multiple(): Adding consensus diff from 743550945E24E932FA2A4548160FB7B94A56E8DB36F0024BDBB0DA0B0C084CBB to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 6E83B5ED1443A63C1D2A227E40C14876CE9A444D31F2797EBEBD3131875F9804 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 6E83B5ED1443A63C1D2A227E40C14876CE9A444D31F2797EBEBD3131875F9804 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from C1733C9DB3D3FFA41648B4512D3ED9088706828C0B46B63910F9334CD18D71B3 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from C1733C9DB3D3FFA41648B4512D3ED9088706828C0B46B63910F9334CD18D71B3 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from D21909A5CCEAE66D820DCB25D5BBF6591B8E4D7199FBE52C6F2B4D7607F445D6 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from D21909A5CCEAE66D820DCB25D5BBF6591B8E4D7199FBE52C6F2B4D7607F445D6 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 244325664A3D3FEA6EEE708E3A7F5457B9313C108C744275A7A6BAB031E1783F to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 244325664A3D3FEA6EEE708E3A7F5457B9313C108C744275A7A6BAB031E1783F to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 8A68AF897614AB690785BF1C709249ADCDEA2E78657B41CE55285FAE3D3672CE to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:55.000 [info] store_multiple(): Adding consensus diff from 8A68AF897614AB690785BF1C709249ADCDEA2E78657B41CE55285FAE3D3672CE to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:56.000 [info] store_multiple(): Adding consensus diff from 1520D1CA72D0806C7F627C57ED9B1380904CFD24F504762A4B4649DF1B70CC4B to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:56.000 [info] store_multiple(): Adding consensus diff from 1520D1CA72D0806C7F627C57ED9B1380904CFD24F504762A4B4649DF1B70CC4B to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:56.000 [info] store_multiple(): Adding consensus diff from ACFCFDD29853CA1EB252867E1643F9B7A2C389DD2BC5B4F3FC44EBA8F2D772CC to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:56.000 [info] store_multiple(): Adding consensus diff from ACFCFDD29853CA1EB252867E1643F9B7A2C389DD2BC5B4F3FC44EBA8F2D772CC to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 09783C82602523E03D46483573F927E768EC2C70E1008956EEEFFA80185EC0AB to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 09783C82602523E03D46483573F927E768EC2C70E1008956EEEFFA80185EC0AB to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 39AE8F1436B0C78B5A39A90D46EB46A960E30253595DB9B6EC74E80B8532185D to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 39AE8F1436B0C78B5A39A90D46EB46A960E30253595DB9B6EC74E80B8532185D to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 4E9D2911D8985DB0810B80DA38A353D7DC89B686CB312B7D8A2CCE8B1CDE5370 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with identity
Feb 24 00:15:57.000 [info] store_multiple(): Adding consensus diff from 4E9D2911D8985DB0810B80DA38A353D7DC89B686CB312B7D8A2CCE8B1CDE5370 to C195F4A134922B69D8DCA02C1C9D92586102A99, compressed with gzip

and so on...

comment:11 Changed 10 months ago by teor

Is normal for your relay to use 100% CPU while generating and compressing consensus diffs.
Try fixing the file descriptor limit, if you can.

Also, the next 0.3.1 release will handle extra load from clients better,

comment:12 in reply to:  10 Changed 10 months ago by cypherpunks

Replying to AlexRad:

I see next log messages when tor service use 100% CPU

and so on...

For a long time (10 and more minutes)?

comment:14 Changed 9 months ago by Eugene646

Still happens with tor 0.3.3.2 alpha.

comment:15 Changed 9 months ago by cypherpunks

Keywords: performance regression added
Milestone: Tor: 0.3.4.x-finalTor: 0.3.3.x-final
Version: Tor: 0.3.2.9Tor: 0.3.1.9

It's a regression, started from 0.3.1.x, and it looks like a dumb mingw implementation of some function, used in consdiff. #24826 didn't fix it.

comment:16 Changed 9 months ago by cypherpunks

it looks like a dumb mingw implementation of some function

#ifdef _WIN32
/* On Windows, unlink won't work on a file if the file is actively mmap()ed.
 * That forces us to be less aggressive about unlinking files, and causes other
 * changes throughout our logic.
 */
#define MUST_UNMAP_TO_UNLINK
#endif /* defined(_WIN32) */
#ifdef MUST_UNMAP_TO_UNLINK
  /* If we can't unlink the files that we're still using, then we need to
   * tell the storagedir backend to allow far more files than this consensus
   * cache actually wants, so that it can hold files which, from this cache's
   * perspective, have become useless.
   */
#define VERY_LARGE_STORAGEDIR_LIMIT (1000*1000)
  storagedir_max_entries = VERY_LARGE_STORAGEDIR_LIMIT;
#else /* !(defined(MUST_UNMAP_TO_UNLINK)) */
  /* Otherwise, we can just tell the storagedir to use the same limits
   * as this cache. */
  storagedir_max_entries = max_entries;
#endif /* defined(MUST_UNMAP_TO_UNLINK) */
get_max_age_to_cache(void)
{
  const int32_t DEFAULT_MAX_AGE_TO_CACHE = 8192;
  const int32_t MIN_MAX_AGE_TO_CACHE = 0;
  const int32_t MAX_MAX_AGE_TO_CACHE = 8192;
  const char MAX_AGE_TO_CACHE_NAME[] = "max-consensus-age-to-cache-for-diff";

  const or_options_t *options = get_options();

  if (options->MaxConsensusAgeForDiffs) {
    const int v = options->MaxConsensusAgeForDiffs;
    if (v >= MAX_MAX_AGE_TO_CACHE * 3600)
      return MAX_MAX_AGE_TO_CACHE;
    else
      return v;
  }

  /* The parameter is in hours, so we multiply */
  return 3600 * networkstatus_get_param(NULL,
                                        MAX_AGE_TO_CACHE_NAME,
                                        DEFAULT_MAX_AGE_TO_CACHE,
                                        MIN_MAX_AGE_TO_CACHE,
                                        MAX_MAX_AGE_TO_CACHE);
}

comment:17 Changed 9 months ago by nickm

Keywords: 033-triage-20180326 added

Second batch of triage for 0.3.3: tickets that we didn't cover the first time.

comment:18 Changed 9 months ago by nickm

Keywords: 033-removed-20180326 added

Mark this tickets as removed.

comment:19 Changed 8 months ago by nickm

Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final

considering for 034 as possibly related to our mainloop refactor.

comment:20 Changed 8 months ago by arma

Summary from the bottom comments seem to be that there's some unhealthy interaction between our unlink hack for Windows and the consensus diff file generation?

comment:21 Changed 7 months ago by nickm

I think they are implying (*) that we're storing way too many old entries in the storagedir, and so the number of diffs we compute every hour is growing without bound? In theory this shouldn't actually be happening, though, since we're supposed to be only computing diffs to entries within a certain time range. (We should check to make sure we're actually doing that.)

(*) Please, if when posting an explanation for a bug, try to actually explain what you think is happening, in as much detail as you can. Things are never as obvious to other people as they are to you.

comment:22 Changed 7 months ago by cypherpunks

Something weird, wrong, typos.

Last edited 7 months ago by cypherpunks (previous) (diff)

comment:23 Changed 7 months ago by cypherpunks

Something weird, wrong, typos.

Last edited 7 months ago by cypherpunks (previous) (diff)

comment:24 Changed 6 months ago by nickm

Keywords: 034-deferred-20180602 added
Milestone: Tor: 0.3.4.x-finalTor: 0.3.5.x-final

Deferring non-must tickets to 0.3.5

comment:25 Changed 6 months ago by arma

See also #25461 and #18614 for other profiling / cpu load tickets. Should we merge them, or pick our favorite and close the others?

comment:26 Changed 6 months ago by teor

Parent ID: #25500

comment:27 Changed 6 months ago by chelseakomlo

Cc: chelseakomlo added

comment:28 Changed 5 months ago by nickm

Keywords: 035-removed-20180711 added
Milestone: Tor: 0.3.5.x-finalTor: unspecified

These tickets are being triaged out of 0.3.5. The ones marked "035-roadmap-proposed" may return.

comment:29 Changed 4 months ago by mestriga

I have the same problem in my new relay (Tor 0.3.3.7 from the Windows Expert Bundle). CPU goes to 100% on each diff generation cycle for several minutes (sometimes more than 30 minutes), making the machine barely responsive.

There is definitely a pattern to it. The problem starts once the number of files in the consensus cache directory ("./diff-cache") reaches 64 files (far less than needed to store 72 hours of diffs). CPU usage goes up when diff generation starts and when it goes down again, there are again 64 files in the diff-cache directory. Looking at the contents of the consensus diffs stored there, there are gaps (some diffs missing for which older diffs are present). There is clearly something wrong with the code that manages the limit of files in the diff-cache folder.

Setting MaxConsensusAgeForDiffs to 9 kept the number of files in the diff-cache directory under 64 and made the problem go away completely. Consensus diff generation now takes a few seconds.

comment:30 Changed 4 months ago by mestriga

This change could have caused it:

"Lower the file limit in consdiffmgr, to support seccomp2": https://gitweb.torproject.org/tor.git/commit/?h=maint-0.3.1&id=466e9140888f0a6a504dc642a0426bd2352f0490.

If this problem is happening only on windows, reverting cache_max_num to its previous value on Windows builds by means of a #DEFINE directive should fix it.

comment:31 Changed 4 months ago by teor

Keywords: 032-backport-maybe 033-backport-maybe 034-backport fast-fix added
Milestone: Tor: unspecifiedTor: 0.3.5.x-final
Owner: set to teor
Status: newassigned

Thanks for this suggestion.

But it seems like there is a bug in the code which we should fix.
I'll see if I can reproduce the bug on another OS (I don't have Windows).

comment:32 Changed 4 months ago by teor

Keywords: fast-fix removed
Milestone: Tor: 0.3.5.x-finalTor: unspecified
Owner: teor deleted

I don't think this is a fast fix, because we still need to work out why a low limit causes this issue.

comment:33 Changed 4 months ago by mestriga

On Windows the limit on the number of files is not even enforced (see "consensus_cache_may_overallocate(consensus_cache_t *cache)" in conscache.c).

The conditional code for Windows (#define MUST_UNMAP_TO_UNLINK) made the code harder to read, and a cleaner solution is possible (with the files still mapped, marking them for deletion on close, while possibly moving them to another directory).

Is there a purpose in limiting the number of files in the cache directory for all nodes (both Windows and Linux), preventing them from storing 72  hour of diffs, even when not running sandboxed?

comment:34 in reply to:  33 ; Changed 4 months ago by teor

Replying to mestriga:

On Windows the limit on the number of files is not even enforced (see "consensus_cache_may_overallocate(consensus_cache_t *cache)" in conscache.c).

The limit on the number of files in the cache directory isn't enforced, but the limit on the number of files used by the cache is.

The conditional code for Windows (#define MUST_UNMAP_TO_UNLINK) made the code harder to read, and a cleaner solution is possible (with the files still mapped, marking them for deletion on close, while possibly moving them to another directory).

Do you have a patch you would like to suggest?

Is there a purpose in limiting the number of files in the cache directory for all nodes (both Windows and Linux), preventing them from storing 72  hour of diffs, even when not running sandboxed?

I don't know if a higher limit would cause issues if users toggle the sandbox setting. But it would still be ok to have a higher limit on platforms without seccomp2 (BSD, macOS, and Windows).

Limiting the number of files in the cache directory also limits the disk space used by the relay, and reduces the amount of CPU used to create diffs every hour.

So if we increase the limit, we should consider whether having 72 hours of diffs is actually that helpful. Maybe we could get away with 24 hours of diffs.

comment:35 in reply to:  34 Changed 4 months ago by mestriga

Replying to teor:

Replying to mestriga:

The conditional code for Windows (#define MUST_UNMAP_TO_UNLINK) made the code harder to read, and a cleaner solution is possible (with the files still mapped, marking them for deletion on close, while possibly moving them to another directory).

Do you have a patch you would like to suggest?

I don't know how to build Tor, but I will leave here the Windows code that, unlike DeleteFile, is able to remove a file that it is memory mapped, making the "#define MUST_UNMAP_TO_UNLINK" workaround unnecessary. I tested this code against a file memory mapped with the exact same parameters that Tor uses and confirmed that it works.

  1. Open a handle for exclusive access to the file with the FILE_FLAG_DELETE_ON_CLOSE flag:
file_handle = fileCreateFile(
    tfilename,
    0,
    NULL,
    OPEN_ALWAYS,
    FILE_FLAG_DELETE_ON_CLOSE,
    0);
  1. Check that the handle is valid, fail if it is not (due to a sharing violation or if the file does not exist):
if (file_handle == INVALID_HANDLE_VALUE)
   goto win_err;
  1. Close the handle:
CloseHandle(file_handle);
  1. Whether memory mapped or not, the file is gone!

comment:36 Changed 4 months ago by teor

What happens if tor tries to access the memory mapped by the file after it's gone?

comment:37 Changed 4 months ago by teor

Status: assignedneeds_revision

comment:38 Changed 4 months ago by teor

Summary: tor 0.3.1.9 100% cpu loadTor uses 100% CPU when accessing the cache directory on Windows

comment:39 in reply to:  38 Changed 4 months ago by mestriga

What happens if tor tries to access the memory mapped by the file after it's gone?

I tested it and the memory access works. But I had missed a step in the test, and the unlinking is actually delayed if the file is mapped. This is handled in the code below by linking the file somewhere else.

  1. Open a handle to the file with the FILE_FLAG_DELETE_ON_CLOSE flag. Request DELETE access and allow FILE_SHARE_DELETE to ensure that step 3 will succeed.
file_handle = fileCreateFile(
    tfilename,
    DELETE,
    FILE_SHARE_DELETE,
    NULL,
    OPEN_ALWAYS,
    FILE_FLAG_DELETE_ON_CLOSE,
    0);

  1. Check that the handle is valid, fail if it is not (due to a sharing violation or if the file does not exist):
if (file_handle == INVALID_HANDLE_VALUE)
   goto win_err;

  1. Put the file somewhere else in the same volume (maybe inside the cache directory itself, with a name that the cache can be programmed to ignore).
if (!MoveFile(tfilename, newfilename)) 
   goto win_err;	

4.Close the handle:

CloseHandle(file_handle);

  1. If the file is not mapped (or has some other handle to it open), it is gone. Otherwise, it will be gone automatically when the last handle to it is closed.

comment:40 Changed 4 weeks ago by teor

Keywords: 032-unreached-backport added

0.3.2 is end of life, so 032-backport is now 032-unreached-backport.

comment:41 Changed 4 weeks ago by cypherpunks

Keywords: 032-backport-maybe 033-backport-maybe removed
Milestone: Tor: unspecifiedTor: 0.4.0.x-final

comment:42 Changed 4 weeks ago by teor

Keywords: 033-backport-maybe 040-roadmap-proposed added
Milestone: Tor: 0.4.0.x-final

We tag tickets 040-roadmap-proposed, then pick some of them for 0.4.0.
Unfortunately, we don't have time to fix everything.

comment:43 Changed 4 weeks ago by teor

Milestone: Tor: unspecified
Note: See TracTickets for help on using tickets.