Opened 4 years ago

Last modified 23 months ago

#14056 new defect

IPredator complains that we call time() too much when running over 500Mbps

Reported by: teor Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.2.5.10
Severity: Normal Keywords: performance tor-relay lorax syscall vdso
Cc: nickm, jens@… Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

IPredator writes:

We are constantly testing kernel 3.x vs. the latest 2.6, but noticed

that with kernel 3.x beyond 500MBit Tor would spend massive amounts of
CPU time on calling time(). On kernel 2.6 we use kernel.vsyscall64=2
which minimizes the impact of constantly doing that.

https://ipredator.se/guide/torserver#performance
(JavaScript may be required, unfortunately.)

nickm, is this still an issue?
This would be possible to simulate using chutney, right?

Child Tickets

Change History (15)

comment:1 Changed 4 years ago by nickm

It should be possible to check that; yeah.

Also, it would be great to get callgraph information here; *why* are we calling time so much?

comment:2 Changed 4 years ago by dgoulet

This surprises me. I've been running perf on my fast relay (https://globe.torproject.org/#/relay/A69221A7EC7498D2F88A0FB795261013FA36CAAE) and time call is just far from being the top CPU consumer...

On Linux, getting the time is a vdso so unless somehow their time call became a syscall, I don't see how this could all of the sudden become a huge impact. A very quick strace of the process could tell us if they see syscall for time (or perf/ftrace/lttng/<insert tracer>).

If not, I would first suspect a kernel issue here with a rate that large.

comment:3 Changed 4 years ago by teor

They say that there is a difference between kernel 2.6 and 3.x series, where 3.x performs much worse when tor calls time() often.

dgoulet, do you know if the vdso / syscall change could be due to the kernel version change?

comment:4 Changed 4 years ago by mmcc

Coincidentally, I just discovered this a couple hours ago. I was trying to figure out what causes Tor relays running on OpenBSD to underperform.

I ran ktrace/kdump on the Tor process to figure out what system calls were being made and found >400,000 gettimeofday() calls. The entire dump was swamped with them. If I recall correctly, most or all gave the same struct pointer as the first argument, too.

The server's down at the moment, so I can't verify some of that.

I came across very similar nebulous performance issues relating to many OpenBSD gettimeofday() calls on other projects' bug trackers. For example:

http://bugs.mysql.com/bug.php?id=46596

Last edited 4 years ago by mmcc (previous) (diff)

comment:5 Changed 4 years ago by teor

I'd like to get a sense of how many calls per second 400,000 represents.
(400,000 would seem to be 100 to 1000 per second, unless you were debugging for a long while.) IPredator doesn't provide any numbers, either.

I'd also like to know which function(s)/call stack(s) these calls are being made from.
(This is essential for us to fix the issue.)

I also wonder if TIME_IS_FAST is being defined as 1 in any of these Linux or BSD builds.
If TIME_IS_FAST is not defined (the default), we could change these calls to approx_time(), which is much faster.
And in that case there should only be around 1 call to time() each second, which is clearly not the behaviour that either mmcc or IPredator are seeing.

We can check if approx_time is available by searching the tor binary or tor debug symbols for the string "approx_time".
grep will do this, or strings, or even gdb if you feel like doing it the hard way.

comment:6 Changed 4 years ago by teor

When I run 200MB through tor 0.2.6.2-alpha via the latest chutney, it takes approximately 45 seconds. (I have modified chutney verify to send larger chunks of random data.)

I see the following call graph for time / gettimeofday (in the OS X C library, time calls gettimeofday):

Running Time	   Samples	Symbol Name
276.0ms  100.0%	   276  100.0%	time
114.0ms   41.3%	   114   41.3%	 channel_timestamp_active
 90.0ms   32.6%	    90   32.6%	 channel_timestamp_recv
 25.0ms    9.0%	    25    9.0%	 conn_read_callback
 14.0ms    5.0%	    14    5.0%	 conn_write_callback
 11.0ms    3.9%	    11    3.9%	 channel_timestamp_drained
 11.0ms    3.9%	    11    3.9%	 connection_or_process_inbuf
  6.0ms    2.1%	     6    2.1%	 channel_timestamp_xmit
  3.0ms    1.0%	     3    1.0%	 event_base_loop
  1.0ms    0.3%	     1    0.3%	 channel_write_cell_queue_entry
  1.0ms    0.3%	     1    0.3%	 connection_mark_for_close_internal_

This is approximately 4.5MB/s, and 3.6% of the samples are for time.
If I understand the profiling tools I'm using, there are around 1 million calls to time during the data transmission, or around 20,000 calls per second (not shown it the table above).

If we switched channel_timestamp_active and channel_timestamp_recv over to approx_time, the number of calls to time() could be reduced by around 75%, but this is still 5,000 calls per second.
If we changed all of the functions above, we could see a reduction of 99.4%, to 120 calls per second. Perhaps this would be enough to fix the performance issues - if not, we can re-do these tests and change another set of functions.

mmcc, did you want to confirm the callers of time / gettimeofday on BSD, and make these fixes?
The fixes should involve replacing commonly called instances (or simply every instance) of time in these functions with approx_time.

comment:7 Changed 4 years ago by mmcc

I was also meaning to figure out how many per second that represents. The ktrace finished immediately, so I'm not sure how far back in time that reaches. My server is for some unknown reason still down, so I'm setting up an OpenBSD VM to begin debugging on.

comment:8 Changed 4 years ago by mmcc

Unless I did something wrong, TIME_IS_FAST is false in Tor 0.2.5.10 on the current OpenBSD release (5.6).

I threw #error "TIME_IS_FAST = True" and its complement into the #if in src/common/util.h.

comment:9 Changed 4 years ago by mmcc

Also, approx_time was in the strings output of the Tor binary.

comment:10 Changed 4 years ago by mmcc

Conveniently, kdump gives the return value of system calls in the following form:

17991 tor      CALL  gettimeofday(0x34ee15caca0,0)
17991 tor      STRU  struct timeval { 1420077862.124873 }
17991 tor      RET   gettimeofday 0

As you can see, the value in the struct is the Unix timestamp, as you would expect.

The kdump I had included all system calls from the second 1420077863, so I grepped and wc -l'ed those. There were 6,974 instances, which implies that the Tor process made 6,974 calls to gettimeofday() in a second.

Version 0, edited 4 years ago by mmcc (next)

comment:11 Changed 4 years ago by mmcc

Also sent to tor-dev@:

On 01/01/2015 03:47 PM, Nick Mathewson wrote:

What part of tor is accounting for all these time/gettimeofday calls?

I set up an OpenBSD 5.6 VM and built Tor 0.2.5.10 on it. I couldn't get
chutney working because of a number of strange errors that seemed
platform-specific, so I went for a less thorough solution. I can't run a
Tor relay from home, so the only activity the Tor process had was some
Firefox traffic I created.

There are surprisingly few syscall analysis tools available on OpenBSD,
and I'm new to syscall/interrupt tuning, so I eventually ended up
selecting the advanced "printf()" technique.

As a quick refresher to those not familiar, there is a linear heirarchy
of time functions in the Tor code, each calling the one below it:

gettimeofday() < tor_gettimeofday() < tor_gettimeofday_cached() <
tor_gettimeofday_cached_monotonic()

The first two will always make the syscall, while the other two will
first try the cache. I applied print statements to instances of all four.

There was a steady stream of tor_gettimeofday() calls from
circuit_expire_building() - about two per second with a single idle TCP
connection.

When there was active traffic flow, there were many more gettimeofday()
calls that were almost exclusively the product of
tor_gettimeofday_cached_monotonic(). This function is called in three
places:

  • once in src/or/relay.c
  • once in src/or/buffers.c
  • once in src/or/circuitlist.c

The first two account for the bulk of the calls, as they are in the core
data relaying logic.

Ultimately, the problem seems to be that the caching is very weak. At
most, only half of the calls to tor_gettimeofday_cached_monotonic() use
the cache. It appears in the vomiting print statements that loading a
single simple HTML page (http://www.openbsd.org/faq/ports/guide.html to
be exact) will cause >30 gettimeofday() syscalls. You can imagine how
that would accumulate for an exit carrying 800 KB/s if the caching
doesn't improve much with additional circuits.

If I understand correctly, this implies either a bug in the caching
logic or the simple fact that libevent calls back and clears the cache
too much for the caching to be very useful.

The cache is only cleared in two places (both pretty self-explanatory,
and both triggered by libevent callbacks):

  • connection_handle_read()
  • connection_handle_write()

Apparently, the above functions are called >30 times when fetching a
JS-less non-HTTPS HTML page. I don't know enough about TCP to comment on
whether this makes sense.

comment:12 Changed 4 years ago by teor

I just reviewed my profiling of an exit relay running chutney verify with 200MB of random data (standard chutney verify uses 10KB of random data).
This is on OS X 10.9.5 with tor 0.2.6.2-alpha-dev running the chutney basic-min network.

In an optimised build of tor, I see around 4% of the CPU usage devoted to time().

Which means that the maximum performance impact of eliminating all calls to time() on OS X is 4%. (This is not much at all.)

But OS X doesn't seem to achieve a very high data transfer rate - perhaps the issue could become more apparent at higher data transfer rates?
(Or maybe I need to put more effort into performance tuning tor on OS X before I come to any firm conclusions.)

comment:13 Changed 4 years ago by nickm

Milestone: Tor: 0.2.6.x-finalTor: unspecified

I *think* there's not much left to do here for 0.2.6. Should we close this, or postpone it to a later milestone?

Being cowardly and doing the latter.

comment:14 Changed 4 years ago by qbi

Cc: jens@… added

comment:15 Changed 23 months ago by nickm

Keywords: syscall vdso added
Severity: Normal
Note: See TracTickets for help on using tickets.