Splitting relay crypto across multiple CPU cores is less essential than it once was, now that we can use aesni, but it can still be on the critical path. It's likely to become important again if we shift to something like Salsa20, or some large-block cipher based on it.
Heck yeah! Could you tell me as much as possible about which version of Tor they were made with, how it was built, and which versions of openssl and libevent it was built against?
I attempted to partially address this ticket, but could use additional insight from someone more experienced with Tor. I am still learning the Tor daemon code, but my understanding is that the cell crypto is done (primarily) in relay_crypt() for middle relays and circuit_package_relay_cell() for exit relays. My changes ONLY address the relay_crypt() case. I realize my code is not up to Tor project coding standards; so far I've been focused on learning the Tor code base and trying to get this to work.
In general, I refactored circuit_receive_relay_cell() in relay.c (which calls relay_crypt() and eventually the AES crypt routines) to use the workqueue.c infrastructure similar to cpuworker.c.
When the refactored code runs in single-threaded mode, all seems good in limited tests. Once I activate the thread pool and start sending it work with threadpool_queue_work(), it bootstraps to 100% okay and runs for several minutes before crashing on cells it doesn't handle properly. It seems to pass several cells successfully, but then crashes on the bandwidth test(?).
In my branch, commit 842edc9 shows my refactored, single-threaded version. Commit 940d1bd shows my attempt at pushing relay_crypt() into a thread pool of 1.
In a separate post, I'll write up some explanations of what I was trying to do.
Trac: Username: jsturgix; Sponsor: N/A to N/A; Severity: N/A to Normal
I looked for an approach that I could generalize and apply to both the relay_crypt() case and the circuit_package_relay_cell() case. At first glance, I didn't see anything easy, and since there were already a number of moving parts unfamiliar to me, I focused on the relay_crypt() case.
In general, this was my thought process and approach:
(1) I created new files src/or/cryptothreads.c and src/or/cryptothreads.h. These are modeled after src/or/cpuworker.c and create the thread pool. cpuworker.c is big and I thought cryptothreads.c might also become big. Now it is small and it might make sense to roll cryptothreads.c into another existing source file like src/or/relay.c.
(2) From src/or/main.c, I call crypto_threads_init() (in cryptothreads.c) to initialize the events and thread pool handling.
(3) In command_process_relay_cell() (src/or/command.c), I encapsulated and moved everything after the call to circuit_receive_relay_cell() into circuit_receive_relay_cell_post() (relay.c). The idea was that circuit_receive_relay_cell() would eventually queue the crypto task, but circuit_receive_relay_cell_post() would still be executed by the thread pool callback function in the context of the main thread. In other words, command_process_relay_cell() needs to unwind and eventually return to event loop monitoring, while circuit_receive_relay_cell_post() is still called, but asynchronously.
(4) I basically broke circuit_receive_relay_cell() (relay.c) into two parts: cryptothread_threadfn() and cryptothread_replyfn(). cryptothread_threadfn() is run by a thread in the thread pool and calls down through relay_crypt() -> relay_crypt_one_payload() -> crypto_cipher_crypt_inplace() and so forth into the AES routines. When cryptothread_threadfn() finishes, the main thread (through its event loop) is signaled that the task is complete, and the main thread then calls cryptothread_replyfn(). There is some glue to make this happen, such as queue_job_for_cryptothread() (relay.c) and replyqueue_process_cb() (cryptothreads.c), but it uses the existing src/common/workqueue.c implementation as modeled by cpuworker.c.
Initially, I did not think relay_crypt() accessed any resources shared with the main thread, so I have NOT added any synchronized access to shared data, and I suspect this is the problem. All (most?) access to shared data seemed to be done in the main thread's context after responding to an event (including the thread pool callback function cryptothread_replyfn()), but admittedly I don't have a good grasp of the cell structures and cell/circuit queues used in the main thread. Methinks I have reasoned incorrectly, since the differences between the refactored single-thread version and the multiple-thread version are relatively few.
From what I remember (or perhaps assumed), the functionality in src/common/workqueue.c is itself properly synchronized, since it is already in use by cpuworker.c (though perhaps less intensely?).
These tickets were tagged "6s194" as ideas for possible term projects for students in MIT subject 6.S194 spring 2016. I'm retagging with term-project-ideas, so that the students can use the 6s194 tag for tickets they're actually working on.