We are constantly testing kernel 3.x against the latest 2.6, and noticed
that with kernel 3.x, beyond 500 Mbit/s, Tor would spend massive amounts
of CPU time calling time(). On kernel 2.6 we use kernel.vsyscall64=2,
which minimizes the impact of constantly doing that.
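(For reference, this is how that setting is applied; the exact semantics of the value vary by kernel build:)

    # /etc/sysctl.conf on the 2.6 hosts
    kernel.vsyscall64=2
    # or at runtime:
    # sysctl -w kernel.vsyscall64=2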
On Linux, getting the time goes through the vDSO, so unless their time() call has somehow become a real syscall, I don't see how this could suddenly have such a huge impact. A very quick strace of the process would tell us whether they are seeing syscalls for time() (perf/ftrace/lttng would work, too).
If not, I would first suspect a kernel issue here, given a rate that large.
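For example, something like this would show it (a sketch; <tor-pid> is a placeholder, and available syscall names vary by platform):

    # Count syscalls for a while; a flood of time/gettimeofday/
    # clock_gettime here means the vDSO fast path is being bypassed.
    strace -c -f -p <tor-pid>

    # Or watch only the time-related calls as they happen:
    strace -f -e trace=time,gettimeofday,clock_gettime -p <tor-pid>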
Coincidentally, I just discovered this a couple of hours ago. I was trying to figure out what causes Tor relays running on OpenBSD to underperform.
I ran ktrace/kdump on the Tor process to figure out what system calls were being made and found >400,000 gettimeofday() calls; the entire dump was swamped with them. If I recall correctly, most or all of them passed the same struct pointer as the first argument, too.
The server's down at the moment, so I can't verify some of that.
I came across very similar nebulous performance issues relating to many OpenBSD gettimeofday() calls on other projects' bug trackers. For example:
I'd like to get a sense of how many calls per second 400,000 represents.
(400,000 would seem to be 100 to 1000 per second, unless you were debugging for a long while.) IPredator doesn't provide any numbers, either.
I'd also like to know which function(s)/call stack(s) these calls are being made from.
(This is essential for us to fix the issue.)
I also wonder if TIME_IS_FAST is being defined as 1 in any of these Linux or BSD builds.
If TIME_IS_FAST is not defined (the default), we could change these calls to approx_time(), which is much faster.
And in that case there should only be around 1 call to time() each second, which is clearly not the behaviour that either mmcc or IPredator is seeing.
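For context, approx_time() is just a once-per-second cache; a simplified sketch of the pattern (paraphrasing src/common/util.c from this era, not the verbatim source):

    #include <time.h>

    /* Cached coarse timestamp (simplified; the real code also handles
     * TIME_IS_FAST, in which case approx_time() is simply time(NULL)). */
    static time_t cached_approx_time = 0;

    /* Cheap read: returns the cache, never makes a syscall. */
    time_t
    approx_time(void)
    {
      return cached_approx_time;
    }

    /* Refreshed from the once-per-second callback in the main loop. */
    void
    update_approx_time(time_t now)
    {
      cached_approx_time = now;
    }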
We can check if approx_time is available by searching the tor binary or tor debug symbols for the string "approx_time".
grep will do this, or strings, or even gdb if you feel like doing it the hard way.
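For example (a rough sketch; adjust the binary path, and an unstripped binary or debug symbols are assumed):

    strings /usr/local/bin/tor | grep approx_time

    # the hard way:
    gdb /usr/local/bin/tor -batch -ex 'info functions approx_time'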
When I run 200MB through tor 0.2.6.2-alpha via the latest chutney, it takes approximately 45 seconds. (I have modified chutney verify to send larger chunks of random data.)
I see the following call graph for time / gettimeofday (in the OS X C library, time calls gettimeofday):
This is approximately 4.5 MB/s, and 3.6% of the samples are for time.
If I understand the profiling tools I'm using, there are around 1 million calls to time during the data transmission, or around 20,000 calls per second (not shown in the table above).
If we switched channel_timestamp_active and channel_timestamp_recv over to approx_time, the number of calls to time() could be reduced by around 75%, but this is still 5,000 calls per second.
If we changed all of the functions above, we could see a reduction of 99.4%, to around 120 calls per second. Perhaps this would be enough to fix the performance issues; if not, we can redo these tests and change another set of functions.
mmcc, did you want to confirm the callers of time / gettimeofday on BSD, and make these fixes?
The fixes should involve replacing commonly called instances (or simply every instance) of time in these functions with approx_time.
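As an illustration of the shape of the change, using channel_timestamp_active() (a sketch, not the exact source):

    void
    channel_timestamp_active(channel_t *chan)
    {
      /* Before: time_t now = time(NULL);  -- one syscall per cell on
       * platforms without a time vDSO. After: */
      time_t now = approx_time();
      tor_assert(chan);
      chan->timestamp_active = now;
    }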
I was also meaning to figure out how many per second that represents. The ktrace finished immediately, so I'm not sure how far back in time that reaches. My server is for some unknown reason still down, so I'm setting up an OpenBSD VM to begin debugging on.
Conveniently, kdump gives the return value of system calls in the following form:
17991 tor      CALL  gettimeofday(0x34ee15caca0,0)
17991 tor      STRU  struct timeval { 1420077862.124873 }
17991 tor      RET   gettimeofday 0
As you can see, the value in the struct is the Unix timestamp, as you would expect.
The kdump I had included all system calls from the second 1420077863, so I grepped and wc -l'ed those. There were 6,974 instances, which implies that the Tor process made 6,974 calls to gettimeofday() in a second.
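Concretely, something along these lines (a sketch; kdump -T prefixes each record with a wall-clock timestamp, and the file name is hypothetical):

    kdump -T -f ktrace.out | grep 1420077863 | grep gettimeofday | grep CALL | wc -l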
For the sake of thoroughness, the subsequent seconds had the following gettimeofday() call counts:
What part of tor is accounting for all these time/gettimeofday calls?
I set up an OpenBSD 5.6 VM and built Tor 0.2.5.10 on it. I couldn't get
chutney working because of a number of strange errors that seemed
platform-specific, so I went for a less thorough solution. I can't run a
Tor relay from home, so the only activity the Tor process had was some
Firefox traffic I created.
There are surprisingly few syscall analysis tools available on OpenBSD,
and I'm new to syscall/interrupt tuning, so I eventually settled on the
advanced "printf()" technique.
As a quick refresher for those not familiar, there is a linear hierarchy
of time functions in the Tor code, each calling the one below it:

tor_gettimeofday_cached_monotonic()
tor_gettimeofday_cached()
tor_gettimeofday()
gettimeofday()

The bottom two will always make the syscall, while the two cached
variants will first try the cache. I applied print statements to
instances of all four.
There was a steady stream of tor_gettimeofday() calls from
circuit_expire_building() - about two per second with a single idle TCP
connection.
When there was active traffic flow, there were many more gettimeofday()
calls that were almost exclusively the product of
tor_gettimeofday_cached_monotonic(). This function is called in three
places:
once in src/or/relay.c
once in src/or/buffers.c
once in src/or/circuitlist.c
The first two account for the bulk of the calls, as they are in the core
data relaying logic.
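For reference, the caching layer looks roughly like this (a simplified sketch of src/common/util.c from this era, not the verbatim source):

    #include <sys/time.h>  /* struct timeval, timercmp() */

    /* Cached high-resolution time; tv_sec == 0 means "empty". */
    static struct timeval cached_time_hires = {0, 0};

    /* Fill *tv from the cache, making the syscall only when empty. */
    void
    tor_gettimeofday_cached(struct timeval *tv)
    {
      if (cached_time_hires.tv_sec == 0)
        tor_gettimeofday(&cached_time_hires);
      *tv = cached_time_hires;
    }

    /* Same, but clamp the result so it never moves backwards. */
    void
    tor_gettimeofday_cached_monotonic(struct timeval *tv)
    {
      static struct timeval last_tv = {0, 0};
      tor_gettimeofday_cached(tv);
      if (timercmp(tv, &last_tv, <))
        *tv = last_tv;   /* cache went backwards; reuse the last value */
      else
        last_tv = *tv;
    }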
Ultimately, the problem seems to be that the caching is very weak. At
most, only half of the calls to tor_gettimeofday_cached_monotonic() use
the cache. It appears from the vomiting print statements that loading a
single, simple HTML page (http://www.openbsd.org/faq/ports/guide.html, to
be exact) causes >30 gettimeofday() syscalls. You can imagine how
that would accumulate for an exit carrying 800 KB/s if the caching
doesn't improve much with additional circuits.
If I understand correctly, this implies either a bug in the caching
logic or the simple fact that libevent calls back and clears the cache
too much for the caching to be very useful.
The cache is only cleared in two places (both pretty self-explanatory,
and both triggered by libevent callbacks):
connection_handle_read()
connection_handle_write()
Apparently, the above functions are called >30 times when fetching a
JS-less non-HTTPS HTML page. I don't know enough about TCP to comment on
whether this makes sense.
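For reference, the clearing looks like this (a sketch, simplified from src/or/connection.c; connection_handle_write() follows the same pattern):

    /* Every libevent read event lands here first, so the timeval cache
     * is emptied before any of the cells are processed. */
    int
    connection_handle_read(connection_t *conn)
    {
      int res;
      tor_gettimeofday_cache_clear();  /* invalidate the cached timeval */
      res = connection_handle_read_impl(conn);
      return res;
    }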
I just reviewed my profiling of an exit relay running chutney verify with 200MB of random data (standard chutney verify uses 10KB of random data).
This is on OS X 10.9.5 with tor 0.2.6.2-alpha-dev running the chutney basic-min network.
In an optimised build of tor, I see around 4% of the CPU usage devoted to time().
That means the maximum performance benefit from eliminating all calls to time() on OS X is around 4%. (This is not much at all.)
But OS X doesn't seem to achieve a very high data transfer rate - perhaps the issue could become more apparent at higher data transfer rates?
(Or maybe I need to put more effort into performance tuning tor on OS X before I come to any firm conclusions.)