We are constantly testing kernel 3.x against the latest 2.6, and noticed
that with kernel 3.x, beyond 500 Mbit/s, Tor would spend massive amounts
of CPU time calling time(). On kernel 2.6 we use kernel.vsyscall64=2,
which minimizes the impact of constantly doing that.
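(For reference, this is how that setting is applied; the exact semantics of the value vary by kernel build:)

    # /etc/sysctl.conf on the 2.6 hosts
    kernel.vsyscall64=2
    # or at runtime:
    # sysctl -w kernel.vsyscall64=2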
On Linux, getting the time goes through the vDSO, so unless their time() call has somehow become a real syscall, I don't see how this could suddenly have such a huge impact. A very quick strace of the process would tell us whether they are seeing syscalls for time() (perf/ftrace/lttng would work, too).
If not, I would first suspect a kernel issue here, given a rate that large.
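For example, something like this would show it (a sketch; <tor-pid> is a placeholder, and available syscall names vary by platform):

    # Count syscalls for a while; a flood of time/gettimeofday/
    # clock_gettime here means the vDSO fast path is being bypassed.
    strace -c -f -p <tor-pid>

    # Or watch only the time-related calls as they happen:
    strace -f -e trace=time,gettimeofday,clock_gettime -p <tor-pid>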
Coincidentally, I just discovered this a couple of hours ago. I was trying to figure out what causes Tor relays running on OpenBSD to underperform.
I ran ktrace/kdump on the Tor process to figure out what system calls were being made and found >400,000 gettimeofday() calls; the entire dump was swamped with them. If I recall correctly, most or all of them passed the same struct pointer as the first argument, too.
The server's down at the moment, so I can't verify some of that.
I came across very similar nebulous performance issues relating to many OpenBSD gettimeofday() calls on other projects' bug trackers. For example:
I'd like to get a sense of how many calls per second 400,000 represents.
(400,000 would seem to be 100 to 1000 per second, unless you were debugging for a long while.) IPredator doesn't provide any numbers, either.
I'd also like to know which function(s)/call stack(s) these calls are being made from.
(This is essential for us to fix the issue.)
I also wonder if TIME_IS_FAST is being defined as 1 in any of these Linux or BSD builds.
If TIME_IS_FAST is not defined (the default), we could change these calls to approx_time(), which is much faster.
And in that case there should only be around 1 call to time() each second, which is clearly not the behaviour that either mmcc or IPredator is seeing.
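For context, approx_time() is just a once-per-second cache; a simplified sketch of the pattern (paraphrasing src/common/util.c from this era, not the verbatim source):

    #include <time.h>

    /* Cached coarse timestamp (simplified; the real code also handles
     * TIME_IS_FAST, in which case approx_time() is simply time(NULL)). */
    static time_t cached_approx_time = 0;

    /* Cheap read: returns the cache, never makes a syscall. */
    time_t
    approx_time(void)
    {
      return cached_approx_time;
    }

    /* Refreshed from the once-per-second callback in the main loop. */
    void
    update_approx_time(time_t now)
    {
      cached_approx_time = now;
    }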
We can check if approx_time is available by searching the tor binary or tor debug symbols for the string "approx_time".
grep will do this, or strings, or even gdb if you feel like doing it the hard way.
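For example (a rough sketch; adjust the binary path, and an unstripped binary or debug symbols are assumed):

    strings /usr/local/bin/tor | grep approx_time

    # the hard way:
    gdb /usr/local/bin/tor -batch -ex 'info functions approx_time'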
When I run 200MB through tor 0.2.6.2-alpha via the latest chutney, it takes approximately 45 seconds. (I have modified chutney verify to send larger chunks of random data.)
I see the following call graph for time / gettimeofday (in the OS X C library, time calls gettimeofday):
This is approximately 4.5 MB/s, and 3.6% of the samples are for time.
If I understand the profiling tools I'm using, there are around 1 million calls to time during the data transmission, or around 20,000 calls per second (not shown in the table above).
If we switched channel_timestamp_active and channel_timestamp_recv over to approx_time, the number of calls to time() could be reduced by around 75%, but this is still 5,000 calls per second.
If we changed all of the functions above, we could see a reduction of 99.4%, to around 120 calls per second. Perhaps this would be enough to fix the performance issues; if not, we can redo these tests and change another set of functions.
mmcc, did you want to confirm the callers of time / gettimeofday on BSD, and make these fixes?
The fixes should involve replacing commonly called instances (or simply every instance) of time in these functions with approx_time.
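As an illustration of the shape of the change, using channel_timestamp_active() (a sketch, not the exact source):

    void
    channel_timestamp_active(channel_t *chan)
    {
      /* Before: time_t now = time(NULL);  -- one syscall per cell on
       * platforms without a time vDSO. After: */
      time_t now = approx_time();
      tor_assert(chan);
      chan->timestamp_active = now;
    }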
I was also meaning to figure out how many per second that represents. The ktrace finished immediately, so I'm not sure how far back in time that reaches. My server is for some unknown reason still down, so I'm setting up an OpenBSD VM to begin debugging on.
Conveniently, kdump gives the return value of system calls in the following form:
17991 tor      CALL  gettimeofday(0x34ee15caca0,0)
17991 tor      STRU  struct timeval { 1420077862.124873 }
17991 tor      RET   gettimeofday 0
As you can see, the value in the struct is the Unix timestamp, as you would expect.
The kdump I had included all system calls from the second 1420077863, so I grepped and wc -l'ed those. There were 6,974 instances, which implies that the Tor process made 6,974 calls to gettimeofday() in a second.
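Concretely, something along these lines (a sketch; kdump -T prefixes each record with a wall-clock timestamp, and the file name is hypothetical):

    kdump -T -f ktrace.out | grep 1420077863 | grep gettimeofday | grep CALL | wc -l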
For the sake of thoroughness, the subsequent seconds had the following gettimeofday() call counts:
What part of tor is accounting for all these time/gettimeofday calls?
I set up an OpenBSD 5.6 VM and built Tor 0.2.5.10 on it. I couldn't get
chutney working because of a number of strange errors that seemed
platform-specific, so I went for a less thorough solution. I can't run a
Tor relay from home, so the only activity the Tor process had was some
Firefox traffic I created.
There are surprisingly few syscall analysis tools available on OpenBSD,
and I'm new to syscall/interrupt tuning, so I eventually settled on the
advanced "printf()" technique.
As a quick refresher for those not familiar, there is a linear hierarchy
of time functions in the Tor code, each calling the one below it:

tor_gettimeofday_cached_monotonic()
tor_gettimeofday_cached()
tor_gettimeofday()
gettimeofday()

The bottom two will always make the syscall, while the two cached
variants will first try the cache. I applied print statements to
instances of all four.
There was a steady stream of tor_gettimeofday() calls from
circuit_expire_building() - about two per second with a single idle TCP
connection.
When there was active traffic flow, there were many more gettimeofday()
calls that were almost exclusively the product of
tor_gettimeofday_cached_monotonic(). This function is called in three
places:
once in src/or/relay.c
once in src/or/buffers.c
once in src/or/circuitlist.c
The first two account for the bulk of the calls, as they are in the core
data relaying logic.
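For reference, the caching layer looks roughly like this (a simplified sketch of src/common/util.c from this era, not the verbatim source):

    #include <sys/time.h>  /* struct timeval, timercmp() */

    /* Cached high-resolution time; tv_sec == 0 means "empty". */
    static struct timeval cached_time_hires = {0, 0};

    /* Fill *tv from the cache, making the syscall only when empty. */
    void
    tor_gettimeofday_cached(struct timeval *tv)
    {
      if (cached_time_hires.tv_sec == 0)
        tor_gettimeofday(&cached_time_hires);
      *tv = cached_time_hires;
    }

    /* Same, but clamp the result so it never moves backwards. */
    void
    tor_gettimeofday_cached_monotonic(struct timeval *tv)
    {
      static struct timeval last_tv = {0, 0};
      tor_gettimeofday_cached(tv);
      if (timercmp(tv, &last_tv, <))
        *tv = last_tv;   /* cache went backwards; reuse the last value */
      else
        last_tv = *tv;
    }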
Ultimately, the problem seems to be that the caching is very weak. At
most, only half of the calls to tor_gettimeofday_cached_monotonic() use
the cache. It appears from the vomiting print statements that loading a
single, simple HTML page (http://www.openbsd.org/faq/ports/guide.html, to
be exact) causes >30 gettimeofday() syscalls. You can imagine how
that would accumulate for an exit carrying 800 KB/s if the caching
doesn't improve much with additional circuits.
If I understand correctly, this implies either a bug in the caching
logic or the simple fact that libevent calls back and clears the cache
too much for the caching to be very useful.
The cache is only cleared in two places (both pretty self-explanatory,
and both triggered by libevent callbacks):
connection_handle_read()
connection_handle_write()
Apparently, the above functions are called >30 times when fetching a
JS-less non-HTTPS HTML page. I don't know enough about TCP to comment on
whether this makes sense.
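For reference, the clearing looks like this (a sketch, simplified from src/or/connection.c; connection_handle_write() follows the same pattern):

    /* Every libevent read event lands here first, so the timeval cache
     * is emptied before any of the cells are processed. */
    int
    connection_handle_read(connection_t *conn)
    {
      int res;
      tor_gettimeofday_cache_clear();  /* invalidate the cached timeval */
      res = connection_handle_read_impl(conn);
      return res;
    }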
I just reviewed my profiling of an exit relay running chutney verify with 200MB of random data (standard chutney verify uses 10KB of random data).
This is on OS X 10.9.5 with tor 0.2.6.2-alpha-dev running the chutney basic-min network.
In an optimised build of tor, I see around 4% of the CPU usage devoted to time().
That means the maximum performance benefit from eliminating all calls to time() on OS X is around 4%. (This is not much at all.)
But OS X doesn't seem to achieve a very high data transfer rate - perhaps the issue could become more apparent at higher data transfer rates?
(Or maybe I need to put more effort into performance tuning tor on OS X before I come to any firm conclusions.)