Right now, Tor does nearly all of its work in one main thread. We have a basic "CPUWorker" implementation that we use for doing server-side onionskin crypto in a separate thread, but thanks to improvements long ago, server-side onionskin crypto no longer dominates. If we could split the work of relay AES-CTR crypto and SSL crypto across multiple threads, that would be pretty helpful in letting high-performance servers saturate their connections. (Blutmagie has wanted this for some time.)
Trac: Actualpoints: N/A to N/A; Points: N/A to N/A; Type: task to project; Summary: "Project: Split relay and link crypto across multiple CPU cores" to "Split relay and link crypto across multiple CPU cores"
May I suggest raising this to critical priority?
21st century crypto software can't afford not to be fully threaded ;)
No CPU sold today is single-core anymore, and I'm sure few people would run a dedicated Tor relay around the clock only to see it used at 1/n of its capacity.
These tickets were tagged "6s194" as ideas for possible term projects for students in MIT subject 6.S194 spring 2016. I'm retagging with term-project-ideas, so that the students can use the 6s194 tag for tickets they're actually working on.
How likely is it that this functionality (or parts of it) can be implemented in Rust? Would it require a lot of refactoring or is it already fairly modularized?
Isis offered a glimpse of the answer: https://blog.torproject.org/comment/269723#comment-269723
It's not so well modularized right now. The big problem here is that the code is written with the assumption that relay crypto finishes immediately, but with this change, we'd sometimes have to wait on another thread before we had cells to send on a given circuit.
OK, understood. It seems like relay_crypt_one_payload is the place where this would happen: instead of blocking, it would emit an event once the relay crypto finishes.
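To make the "emit an event once the relay crypto finishes" idea concrete, here is a minimal sketch in Python, used only to show the control flow and not as a statement about how Tor does or should implement it. Everything in it is an assumption for illustration: the names keystream_xor, crypto_worker and run_demo are hypothetical, the SHA-256 keystream is a stand-in for AES-CTR, and each job carries its own counter so that workers can finish out of order while the main thread restores per-circuit order before sending.

```python
import hashlib
import queue
import threading

CELL_PAYLOAD_LEN = 509  # bytes of payload in a fixed-size cell


def keystream_xor(key: bytes, counter: int, payload: bytes) -> bytes:
    """Stand-in for AES-CTR: XOR the payload with a SHA-256-derived keystream."""
    stream = bytearray()
    block = 0
    while len(stream) < len(payload):
        seed = key + counter.to_bytes(8, "big") + block.to_bytes(4, "big")
        stream += hashlib.sha256(seed).digest()
        block += 1
    # zip() stops at the payload length, so extra keystream bytes are ignored.
    return bytes(p ^ s for p, s in zip(payload, stream))


def crypto_worker(jobs: queue.Queue, replies: queue.Queue) -> None:
    """Worker thread: take jobs, do the crypto, post a completion event."""
    while True:
        job = jobs.get()
        if job is None:  # shutdown marker
            break
        circ_id, seq, key, payload = job
        replies.put((circ_id, seq, keystream_xor(key, seq, payload)))


def run_demo(num_workers: int = 2, num_cells: int = 6):
    jobs, replies = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=crypto_worker, args=(jobs, replies))
               for _ in range(num_workers)]
    for w in workers:
        w.start()

    key = b"k" * 16
    for seq in range(num_cells):
        # The main thread hands the cell off and returns immediately.
        jobs.put((1, seq, key, bytes([seq]) * CELL_PAYLOAD_LEN))

    # Main loop: react to completion events, releasing cells in sequence order.
    pending, sent, next_seq = {}, [], 0
    while next_seq < num_cells:
        _circ_id, seq, out = replies.get()
        pending[seq] = out
        while next_seq in pending:
            sent.append((next_seq, pending.pop(next_seq)))
            next_seq += 1

    for _ in workers:
        jobs.put(None)
    for w in workers:
        w.join()
    return sent


if __name__ == "__main__":
    print(len(run_demo()), "cells sent in order")
```

The point of the sketch is the shape of the change: the caller no longer gets its result synchronously, so whatever sends cells on a circuit has to tolerate "not ready yet" and be woken by the reply queue.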
Ok, I have a start of a plan which I'm looking forward to discussing/further refining in Seattle. I took a large amount from https://trac.torproject.org/projects/tor/wiki/org/projects/Tor/MultithreadedCrypto but there are some things which are out of date (circuit priority logic, for example) so any further pointers on what is different between when that wiki was written and where we are today would be helpful.
Below is a pad with a high level plan/starting implementation ideas; I've also attached a high level (pretty rough, sorry!) proposed architectural diagram to this ticket. Looking forward to further discussion, particularly around the proposal to use Rust and any Rust/C integration issues that could be particularly painful, and also any better ideas about how to cleanly register/edge trigger events.
one small thing: as I said on IRC, my informal profiling seems to show that a significant amount of CPU time is spent in the kernel networking stack, including TCP/IP and waiting for the network device. it's possible that that last bit is partially because of an old virtio-net though. it could potentially be easier to integrate recv multithreading at this point; or maybe not! maybe it would be easier (even in total) to just do crypto first, and then the other stuff.
That is a good point, though it is a separate piece of work from this specific task. It would probably be good to open an issue for "Recent profiling outcomes" so that we can take a closer look and track or open other issues for discoveries like this.
There are early plans to distribute crypto operations across multiple cores, but there might be a better way.
(I emailed before, but I just found the tiny reply link-button)
The ticket states the goal is to saturate the bandwidth available (by using all the cores as efficiently as possible).
I don't understand why a relay needs to have a "main thread". Network traffic arrives as an async operation and can be sent back out asynchronously. So a final strategy shouldn't have a central thread. The main thread might still be needed for startup, runtime adjustment, and system upkeep, but not for the core network-crypto processing; that should never need to touch the main thread.
The current proposal speaks about multi-threading crypto operations, let's call that "A) Speed - Speeding up processing of a single cell". Instead, I propose "B) Concurrency - Restructuring so multiple cells can be processed concurrently".
A cell of data should arrive via IO-Completion thread on a random CPU core, have crypto transformation applied on the same one core, then be dispatched onward out via the network. This seems to be quite a simple approach where I would think crypto code can remain the same "single-threaded" implementation.
Approach [A] will have diminishing returns as the number of cores increases. You can only break up a cell unit of work so much until you're encrypting one byte per cpu core. However, with approach [B], if you have millions of CPU cores (as an extreme) you can be processing millions of cells concurrently. Therefore, I believe approach [B] would be more scalable.
There would be circuit state to maintain. Concurrent cells on the same circuit should be queued or thread-locked. I suspect thread-locking will be simple enough to be the best approach.
Given that it's only a problem for the biggest nodes, a design should be chosen that is very time-efficient to implement and focuses on achieving the goals of such users, not on squeezing out every drop of performance for performance's sake. I believe this is that efficient and focused design.
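As a rough illustration of approach [B] with per-circuit thread-locking, the following Python sketch processes each arriving cell on whichever pool thread picks it up and guards the circuit state with a lock. It is a sketch under stated assumptions, not Tor code: the Circuit class, process_cell and handle_arrivals are hypothetical names, and the XOR with a cell counter is only a placeholder for the real relay crypto.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Circuit:
    """Hypothetical per-circuit state, guarded by its own lock."""

    def __init__(self, circ_id: int) -> None:
        self.circ_id = circ_id
        self.lock = threading.Lock()
        self.cells_processed = 0  # stand-in for cipher and digest state
        self.outbox = []          # cells ready to be written to the network

    def process_cell(self, payload: bytes) -> None:
        # Only one thread at a time may touch this circuit's state;
        # cells on other circuits proceed in parallel.
        with self.lock:
            n = self.cells_processed
            # Placeholder transformation that depends on circuit state.
            transformed = bytes(b ^ (n & 0xFF) for b in payload)
            self.outbox.append(transformed)
            self.cells_processed = n + 1


def handle_arrivals(circuits, arrivals, num_threads: int = 4) -> None:
    """Dispatch each (circ_id, payload) arrival to any free thread."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(circuits[cid].process_cell, payload)
                   for cid, payload in arrivals]
        for f in futures:
            f.result()  # re-raise any exception from a worker


if __name__ == "__main__":
    circuits = {1: Circuit(1), 2: Circuit(2)}
    arrivals = [(1, b"\x00" * 4)] * 50 + [(2, b"\x00" * 4)] * 30
    handle_arrivals(circuits, arrivals)
    print(circuits[1].cells_processed, circuits[2].cells_processed)
```

One caveat the sketch makes visible: a lock serializes access to the circuit state, but it does not by itself guarantee that cells on the same circuit are processed in arrival order. A stream cipher needs that order, so a per-circuit queue (the other option mentioned above) or some sequencing step would still be required.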
I've been saying this whole time that my (admittedly very informal) benchmarks show that the time spent in relay crypto is not significant, as long as AES is hardware accelerated (e.g. AES-NI). I assume that this task will require a significant amount of effort to bang out the final design and implement it. Therefore, if someone wants to do this, it is my opinion that they should first make better benchmarks, or find better benchmarks (I heard dgoulet was doing something...) showing that with modern openssl and hardware accelerated AES, parallelization is required.
Additionally: as I understand, the current design is highly single-threaded. In particular, the scheduler is a key component of modern Tor, and if I follow correctly, is sort of a bottleneck to full parallelism.
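One way to frame why benchmarks should come first is Amdahl's law: if only a fraction p of the relay's CPU time can be spread over n cores, and the rest (for example a single-threaded scheduler) stays serial, the overall speedup is bounded by 1 / ((1 - p) + p / n). The Python sketch below just evaluates that formula. The function name amdahl_speedup is hypothetical, and the fractions in the loop are made-up inputs for illustration, not measurements of Tor.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the work is parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical shares of CPU time spent in relay crypto, NOT measured values.
    for fraction in (0.1, 0.5, 0.9):
        bound = amdahl_speedup(fraction, cores=8)
        print(f"crypto share {fraction:.0%}: at most {bound:.2f}x on 8 cores")
```

If profiling shows that relay crypto is a small share of the total, the bound stays close to 1 no matter how many cores are added, which is the concern raised above.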