Opened 6 years ago

Closed 5 years ago

#9682 closed enhancement (fixed)

Better work queue implementation for cpuworkers

Reported by: nickm
Owned by:
Priority: High
Milestone: Tor: 0.2.6.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: tor-relay, performance, cpuworker, 026-triaged-0, nickm-patch
Cc: iang
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Our current implementation for passing work to cpuworkers and getting answers from them is pretty bulletproof: we share an fd with each one, write our requests there as a serialized struct, and get our answers back as a serialized struct. The cpuworker learns about new requests by calling read(); we learn about new answers via libevent.

But probably this isn't as efficient as it could be:

  • We should have a work queue implementation that doesn't require a cpuworker to wait for the main process to give it more data after each request it answers.
  • We should have a work queue implementation that uses condition variables as appropriate to notify cpuworkers of new data.
  • We should use appropriate libevent mechanisms to notify the main thread of new answers. (There are a bunch of ways to implement a condition variable that wakes libevent; we should pick one.)
  • We should manage the communication in-memory rather than over sockets.
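The design the bullets above describe, a shared in-memory queue guarded by a mutex, with a condition variable to wake idle workers instead of a socket read, can be sketched roughly as follows. This is an illustrative pthreads sketch, not Tor's actual API; all names here (workqueue_t, wq_push, wq_pop) are invented for the example.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical minimal work queue; names are illustrative, not Tor's. */
typedef struct work_s {
    struct work_s *next;
    void (*run)(void *arg);
    void *arg;
} work_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;   /* signaled when work arrives */
    work_t *head, *tail;
} workqueue_t;

void wq_init(workqueue_t *q) {
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->head = q->tail = NULL;
}

/* Main thread: enqueue in memory and wake one worker (no socket write). */
void wq_push(workqueue_t *q, work_t *w) {
    w->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = w; else q->head = w;
    q->tail = w;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Worker: block on the condition variable until work is available,
 * so no round-trip to the main process is needed between requests. */
work_t *wq_pop(workqueue_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->nonempty, &q->lock);
    work_t *w = q->head;
    q->head = w->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return w;
}
```

This sketch deliberately omits the answer path back to the main thread, which is the part that needs a libevent-compatible wakeup mechanism.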

Child Tickets

Change History (24)

comment:1 Changed 6 years ago by nickm

This ticket came out of a comment of Ian's on #9574.

comment:2 Changed 6 years ago by iang

Cc: iang added

comment:3 Changed 6 years ago by nickm

Andrea's work-in-progress parallel_relay_crypt_rebase_2 has a partial implementation of a condition-based work queue.

Mark Ellzey's https://github.com/ellzey/libevhtp/tree/master/evthr has some libevent-related work queue stuff.

comment:5 Changed 6 years ago by nickm

Status: new → needs_review

There's an implementation in branch better_workqueues in my public repository. It works for me on a small chutney network. It's not totally obvious to me right now how to benchmark it, though. (In particular, what I'm not so sure about is how to measure our *current* timing stuff, and the benefits of pipelining in this branch.)

comment:6 Changed 6 years ago by iang

To test both the current and new versions, how about setting up a tiny test network, and having a bunch of clients pound one particular node with create cells (TAP, ntor, or a mix) and see how each version holds up?

comment:7 Changed 6 years ago by dgoulet

In terms of locking and synchronization, this looks good to me! I just have some basic questions.

  • In the workerthread_t data structure, I don't see any use of is_running or is_shut_down other than setting them to one. Are these planned for later, or leftovers?
  • I'm wondering whether "waiting" (in workerthread_t) is really useful. tor_cond_signal_one() will simply do nothing if no thread is waiting. One could argue that the if (waiting) check avoids a libc call, but I'm not sure that's a measurable performance gain.
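The point about the "waiting" flag can be shown concretely: signaling a condition variable with no waiters is defined to be a no-op, so both forms below behave identically and the guarded form only skips the function call itself. This is an illustrative pthreads sketch, not Tor's code; the struct and function names are made up for the example.

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative sketch (not Tor's code). Both notify variants are correct,
 * because pthread_cond_signal() on a condvar with no waiters does nothing. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool waiting;   /* worker would set this before blocking in cond_wait */
    bool has_work;
} worker_state_t;

void notify_unconditional(worker_state_t *w) {
    pthread_mutex_lock(&w->lock);
    w->has_work = true;
    pthread_cond_signal(&w->cond);     /* harmless no-op if nobody waits */
    pthread_mutex_unlock(&w->lock);
}

void notify_guarded(worker_state_t *w) {
    pthread_mutex_lock(&w->lock);
    w->has_work = true;
    if (w->waiting)                    /* saves only the libc call */
        pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
}
```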

Just a suggestion on benchmarking: one thing that can be very helpful here is user-space tracing. Basically, you add "Trace Event" points to the code base that record a payload of your choice and provide very accurate timings, along with possible extra information (perf counters, tid, ...), when triggered. This gathers a trace that can be analyzed later or live, and with LTTng, for instance, events don't go through the kernel, unlike DTrace or SystemTap, so it's high-performance and low-intrusiveness.

I'd be happy to cook up a branch with that support so you can see whether it's useful to you. The tools are packaged in Debian/Ubuntu/Arch.

comment:8 Changed 6 years ago by andrea

Review, part 1:

13a3383b5562ba608e55143a8717bbcb35604ae5:

  • We're mostly just moving code out of compat.{c,h} here; looks okay to me but it's a big diff, so consider that conditional on that code not actually being changed a lot in the process.

3b7755c5b9a9f29a07d932dcdd78d8b79613818e:

  • Is the timeout arg actually necessary? Not sure I'm seeing it used other than in the test case.
  • Is struct timeval the right way to pass it (vs., say, a uint64_t count of microseconds?); both the pthreads and windows versions seem to be doing a bit of dancing around to work with that.
  • Otherwise, this code looks okay to me.

5abaa4c2cd9f814616bf16b48b634f35b76b5d40:

  • This one looks fine to me.

cc88b47fa4441233fd6dce448a0c77860da55d8c:

  • This one looks fine to me.

e85f99c1b89e5e44786025b1b6f74cffd0f38ba3:

  • I think the thread pool/work queue logic is okay for the feature set it supports, but I don't see any way to shut down a thread pool, or any way for the worker thread to exit other than being killed from another thread. That could be a problem if it ever gets used for something other than pure computation, where there's cleanup to be done.
  • /* XXXX inside or outside of lock?? */ Looks like you got this right; it'd definitely be a race condition if it were outside.
  • From a queueing theory perspective, it's sub-optimal to distribute work up front by assigning it to a thread in threadpool_queue_work(); you'll get more efficient utilization and lower latency by having a single queue that threads pull from when they finish a job.
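The queueing-theory point above can be illustrated with a toy simulation (not Tor code): two workers, with jobs either assigned round-robin up front or pulled from a shared queue as each worker frees up. All names and numbers here are invented for the example.

```c
/* Toy model of the two dispatch strategies, for two workers.
 * jobs[i] is the duration of job i; the return value is the makespan
 * (time until all work is done). */

/* Up-front assignment: job i goes to worker i % 2 regardless of load. */
static int makespan_round_robin(const int *jobs, int n) {
    int t[2] = {0, 0};
    for (int i = 0; i < n; i++)
        t[i % 2] += jobs[i];
    return t[0] > t[1] ? t[0] : t[1];
}

/* Shared queue: the next job always goes to whichever worker is free
 * first, so no job waits behind a busy thread unnecessarily. */
static int makespan_shared_queue(const int *jobs, int n) {
    int t[2] = {0, 0};
    for (int i = 0; i < n; i++) {
        int w = (t[0] <= t[1]) ? 0 : 1;
        t[w] += jobs[i];
    }
    return t[0] > t[1] ? t[0] : t[1];
}
```

With job durations {10, 1, 10, 1}, round-robin sends both slow jobs to worker 0 (makespan 20) while the shared queue balances them (makespan 11), which is the latency win Andrea is describing.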

352d758d843801fb6bfe6b5ac4daeaa997c79c1c:

  • This looks okay to me.

1b9cbcb9ad2d23b6a1b12cc6866f0fd7e40dd810:

  • This looks good to me.

3c586e98c82b52e0803963f98bc36e17f6c2fb71:

  • Looks okay. Is this threadpool_queue_for_all() mechanism going to provide the thread shutdown capability?

8ef11d4fd328da71d2c95763bcbfd281268f6359:

  • This looks okay to me.

47a7888c8cf56c52a47c6f69c9644c48ef4ee76b:

  • Looks fine to me.

52c33d8eff931da43b5d4dd551cf75e934862bff:

  • Don't you mean s/reentrant/recursive/ ?

comment:9 Changed 6 years ago by andrea

Review, part 2:

066770588292696ea3d20b3beeaa82521746e192:

  • Looks fine to me.

461ace08feb4d40653532e7078981ead7f750058:

  • Looks fine to me.

2921b6fb7eb93955053d01e4b1e54de8f33447b3:

  • Looks fine to me.

764eb698a4f6dcaaa02f3343589fc9ae763133ff:

  • Where is SPIN_COUNT used?

05a184ce130856be553985dbdac1d4df8d36902c:

  • Wait, pthread_cond_timedwait() wants an absolute time? Really? How many ways can this horribly break if the system time changes at the wrong moment?

ffb48b392271b459ef81ab1f80084bc7f829c07b:

  • Ah, good, you caught that too. :)

456c0bd480f34b53e809db4282afc042b5fe6a77:

  • This looks okay to me.

6db4d6b6ec0ffffd8184cabc83ec3f8c2f9c7948:

  • This looks okay to me.

e343a5b6d4c5e609a0531959cf700ffeabdea95a:

  • So we're using threadpool_queue_for_all() to shut threads down.

a26ebecd3e6424d7a07b50c5e4c6e76acecd7c12

  • This looks okay to me.

1ffbef76792a3b40e4a05ca01e63d4e009574c1b:

  • Good catch.

7f746b3f7fe7a7f25d05ab47d49fc788ab7ff4df:

  • Looks okay to me.

comment:10 Changed 6 years ago by andrea

Review, part 3:

801e35942d01165e777c3f3ad49f99aca9ab1f6a:

  • I'm not seeing any obvious problems with the logic here, but this is kind of a big hairy patch touching important stuff. I presume it's gotten some testing time on a live relay?

d859513ec4c6d1f7cd91122c67d74c507ca29abd:

  • Looks okay to me.

b0f487174daa0b970c453a8ff29a0f73325aca0a:

  • Looks okay to me.

3780a9843fd951cbd6fccb2fd39f25a307c1af41:

  • Looks okay to me.

comment:11 in reply to:  8 Changed 6 years ago by nickm

Replying to andrea:

Review, part 1:

13a3383b5562ba608e55143a8717bbcb35604ae5:

  • We're mostly just moving code out of compat.{c,h} here; looks okay to me but it's a big diff, so consider that conditional on that code not actually being changed a lot in the process.

3b7755c5b9a9f29a07d932dcdd78d8b79613818e:

  • Is the timeout arg actually necessary? Not sure I'm seeing it used other than in the test case.

It's not necessary, but it may be eventually.

  • Is struct timeval the right way to pass it (vs., say, a uint64_t count of microseconds?); both the pthreads and windows versions seem to be doing a bit of dancing around to work with that.

I like a uint64_t of microseconds, though working with it is going to be a PITA due (as you note) to the nutty "abstime" character of the input.

  • Otherwise, this code looks okay to me.

[...]

e85f99c1b89e5e44786025b1b6f74cffd0f38ba3:

  • I think the thread pool/work queue logic is okay for the feature set it supports, but I don't see any way to shut down a thread pool, or any way for the worker thread to exit other than being killed from another thread. That could be a problem if it ever gets used for something other than pure computation, where there's cleanup to be done.

See your note later.

  • /* XXXX inside or outside of lock?? */ Looks like you got this right; it'd definitely be a race condition if it were outside.
  • From a queueing theory perspective, it's sub-optimal to distribute work up front by assigning it to a thread in threadpool_queue_work(); you'll get more efficient utilization and lower latency by having a single queue that threads pull from when they finish a job.

Wouldn't I get lots of lock contention when they tried to pull off that queue?

Also, I talked to a queue guy and he told me that I should have separate queues, but that they should be no more than 1 item deep. Can you point me to the relevant literature here? Can we come up with a benchmark to try to demonstrate the difference?

3c586e98c82b52e0803963f98bc36e17f6c2fb71:

  • Looks okay. Is this threadpool_queue_for_all() mechanism going to provide the thread shutdown capability?

Yup. Also the key update mechanism.

52c33d8eff931da43b5d4dd551cf75e934862bff:

  • Don't you mean s/reentrant/recursive/ ?

Yes; fixed that in ffb48b39.

comment:12 in reply to:  9 Changed 6 years ago by nickm

Replying to andrea:

Review, part 2:

764eb698a4f6dcaaa02f3343589fc9ae763133ff:

  • Where is SPIN_COUNT used?

InitializeCriticalSectionAndSpinCount, which is a Windows thing. It's like the hybrid spinlock/heavy-lock implementation that most good systems use these days, but unlike the good systems, Windows expects you to figure out how many times to try spinning before you hand off responsibility to the OS.

05a184ce130856be553985dbdac1d4df8d36902c:

  • Wait, pthread_cond_timedwait() wants an absolute time? Really? How many ways can this horribly break if the system time changes at the wrong moment?

Probably many ways; you'd need to read each implementation to find out. The glibc one does a clock_gettime or a gettimeofday to convert the input to a relative time right before it passes it to the kernel, so there's a small window for the clock to jump.

There's also pthread_condattr_setclock that we could use to make this use (e.g.) CLOCK_MONOTONIC_COARSE instead or something.

comment:13 in reply to:  10 Changed 6 years ago by nickm

Replying to andrea:

Review, part 3:

801e35942d01165e777c3f3ad49f99aca9ab1f6a:

  • I'm not seeing any obvious problems with the logic here, but this is kind of a big hairy patch touching important stuff. I presume it's gotten some testing time on a live relay?

Well kinda; I tried it for a while on a chutney network that I browsed through at home, but that won't get the load or all of the weird timings you'd see on the live internet.

comment:14 Changed 6 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.6.x-final
Priority: normal → major

comment:15 Changed 6 years ago by nickm

Keywords: 026-triaged added
Status: needs_review → needs_revision

I still like this, but towelenee has been working on this code for GSoC. I should review his code and consider that, not mine, for inclusion.

comment:16 Changed 5 years ago by nickm

Keywords: 026-triaged-0 added; 026-triaged removed

comment:17 Changed 5 years ago by nickm

Keywords: nckm-patch added

comment:18 Changed 5 years ago by nickm

Keywords: nickm-patch added; nckm-patch removed

comment:19 Changed 5 years ago by nickm

I've rebased this as "better_workqueue_v3"

comment:20 Changed 5 years ago by nickm

better_workqueue_v3 now makes the sensible change everyone asked for, and has all the threads share a single queue.

comment:21 Changed 5 years ago by nickm

Status: needs_revision → needs_review

comment:22 Changed 5 years ago by dgoulet

1) tor_cond_wait() in src/common/compat_pthreads.c

  • There are pitfalls here that I've experienced in my pre-Tor life. First, pthread_cond_timedwait() can return EINTR according to the glibc documentation (http://dev.man-online.org/man3/pthread_cond_timedwait/) but not according to the POSIX manual page... We can either assume that if EINTR is sent back, whatever, we just return an error, or handle EINTR just to be safe? (FYI, the same goes for condwait().)
  • Second thing, as discussed in comment:12, is the NTP correction issue. gettimeofday() is not monotonic, so if an NTP adjustment occurs after that call but before the timedwait, the timeout will be hit immediately (or will wait a LONG time); I've seen that happen with daemons launched at boot, before the time adjustment. Can we live with returning 1 even if it's not really a timeout, maybe? I'm more scared about a LONG timeout if the clock jumps back in time.
  • Last thing: gettimeofday() is not checked for errors. It's highly unlikely to fail, but it can, so blindly using the timeout values might leave the timedwait in a weird state, waiting a long time!

2) in src/common/compat_threads.c

  • read()/write()/send()/recv() never handle the EINTR errno value, which, when encountered, means the call should be retried. I'm picky about that errno value because I've seen it returned countless times on loaded Linux machines, breaking things. Again, if the error is acceptable here, that's fine, as long as the caller can handle it. For instance, in replyqueue_process() in workqueue.c, drain_fn() is called, but a return value of 0 or -1 does not stop the next loop from assuming data is ready when it is not.
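The retry-on-EINTR pattern being asked for is the classic loop below. This is a generic illustrative sketch, not Tor's code; read_retry_eintr() is a hypothetical name.

```c
#include <errno.h>
#include <unistd.h>

/* A signal arriving mid-call can make read() fail with errno == EINTR
 * even though the fd is perfectly fine; the fix is simply to reissue
 * the call. Hypothetical wrapper for illustration. */
ssize_t read_retry_eintr(int fd, void *buf, size_t count) {
    ssize_t r;
    do {
        r = read(fd, buf, count);
    } while (r < 0 && errno == EINTR);
    return r;   /* real errors (and EAGAIN on nonblocking fds) still
                 * reach the caller */
}
```

The same wrapper shape applies to write(), send(), and recv().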

3) In workerthread_s, the "waiting" variable seems not very useful: it's set to 1 before the cond wait and to 0 after, but not used elsewhere, unless I'm missing something? Same for "is_running" and "is_shut_down".

4) I don't see any way for "threads" in threadpool_s to be cleaned up if a worker dies. I don't see side effects, except that threadpool_queue_update() will allocate unused memory. Should the worker clean itself up from the pool once it dies?

Done for now. All in all, this looks really nice and is a good improvement :)

comment:23 Changed 5 years ago by dgoulet

Ok, so as discussed on IRC, I've pushed to branch better_workqueue_v3 in my repo the support for pthread_cond_timedwait() using a monotonic clock, so time adjustments will not impact the timeout value.

There are also two commits: the first one fixes copyright years in the tests, and the second is a comments/whitespace fix. Please squash or ignore them at will.

1) In cpuworker.c, workqueue_entry_cancel() calls tor_free() on the workqueue entry, but I think it should call workqueue_entry_free() so it's properly cleaned up.

The rest lgtm! Big feature so I'll make sure to test it quite a bit in chutney.

comment:24 Changed 5 years ago by nickm

Resolution: fixed
Status: needs_review → closed

At long last merged! Will look at that issue.
