Opened 9 years ago

Last modified 6 weeks ago

#1749 assigned project

Split relay and link crypto across multiple CPU cores

Reported by: nickm
Owned by: chelseakomlo
Priority: High
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-relay, term-project-ideas, threads, performance, 035-roadmap-master, 035-triaged-in-20180711
Cc: Samdney, arthuredelstein, neel
Actual Points:
Parent ID:
Points: 10
Reviewer:
Sponsor:

Description (last modified by nickm)

Right now, Tor does nearly all of its work in one main thread. We have a basic "CPUWorker" implementation that we use for doing server-side onionskin crypto in a separate thread, but thanks to improvements made long ago, server-side onionskin crypto no longer dominates. If we could split the work of relay AES-CTR crypto and SSL crypto across multiple threads, that would be pretty helpful in letting high-performance servers saturate their connections. (Blutmagie has wanted this for quite a while.)
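As a rough illustration of the idea in the description, the sketch below spreads per-circuit cell crypto over a few pthreads by static partitioning: worker w handles circuits w, w + n_workers, and so on. Everything here is an assumption made for illustration: `toy_circuit_t`, `toy_crypt` and `crypt_all_parallel` are hypothetical names, not Tor code, and the XOR "cipher" is only a stand-in for AES-CTR.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CELL_PAYLOAD_LEN 509  /* relay cell payload size in bytes */
#define MAX_WORKERS 16

/* Hypothetical per-circuit state; this is not Tor's circuit_t. */
typedef struct {
  uint8_t key;                        /* toy key, stand-in for an AES key */
  uint8_t payload[CELL_PAYLOAD_LEN];  /* one pending cell for this circuit */
} toy_circuit_t;

typedef struct {
  toy_circuit_t *circs;
  size_t n_circs;
  size_t first;   /* index of the first circuit this worker handles */
  size_t stride;  /* total number of workers */
} worker_arg_t;

/* Toy stream cipher: XOR with a key-dependent byte sequence. NOT AES-CTR. */
static void toy_crypt(toy_circuit_t *c)
{
  for (size_t i = 0; i < CELL_PAYLOAD_LEN; i++)
    c->payload[i] ^= (uint8_t)(c->key + i);
}

/* Each worker touches a disjoint set of circuits, so no locking is needed. */
static void *worker_main(void *arg_)
{
  worker_arg_t *arg = arg_;
  for (size_t i = arg->first; i < arg->n_circs; i += arg->stride)
    toy_crypt(&arg->circs[i]);
  return NULL;
}

/* Process one cell per circuit using n_workers threads.
 * Returns 0 on success, -1 on a bad argument or if a thread could not be
 * started (in which case only some circuits have been processed). */
static int crypt_all_parallel(toy_circuit_t *circs, size_t n_circs,
                              size_t n_workers)
{
  pthread_t threads[MAX_WORKERS];
  worker_arg_t args[MAX_WORKERS];
  size_t started = 0;
  int rv = 0;

  if (n_workers == 0 || n_workers > MAX_WORKERS)
    return -1;
  for (; started < n_workers; started++) {
    args[started] = (worker_arg_t){ circs, n_circs, started, n_workers };
    if (pthread_create(&threads[started], NULL, worker_main,
                       &args[started]) != 0) {
      rv = -1;
      break;
    }
  }
  /* Wait for every worker that was actually started. */
  for (size_t w = 0; w < started; w++)
    pthread_join(threads[w], NULL);
  return rv;
}
```

Starting and joining threads per batch keeps the sketch short; a long-lived pool fed by a queue would avoid the thread start-up cost.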

Child Tickets

Ticket | Status | Owner | Summary | Component
#1760 | closed | | Parallel Crypto: Design a good crypto parallelization plan and architecture | Core Tor/Tor
#26296 | assigned | chelseakomlo | Refactor cell crypto to pre/post crypto operations | Core Tor/Tor

Attachments (5)

relay.2.c (1.6 KB) - added by towelenee 5 years ago.
relay.c (1.6 KB) - added by towelenee 5 years ago.
Here are my changes for relay.c. It uses multiple cores via OpenMP, but it needs changes in Makefile.am.
relay.3.c (2.5 KB) - added by towelenee 5 years ago.
relay.4.c (2.2 KB) - added by towelenee 5 years ago.
The last patch doesn't use OpenMP, just pthreads.
MultiThreadedCrypto.png (100.1 KB) - added by chelseakomlo 14 months ago.


Change History (38)

comment:1 Changed 9 years ago by nickm

Owner: set to nickm
Status: changed from new to accepted
Type: changed from defect to task

comment:2 Changed 9 years ago by nickm

Description: modified (diff)

comment:3 Changed 9 years ago by nickm

Milestone: Tor: 0.2.3.x-final

At least the relay crypto part of this should happen in 0.2.3.x

comment:4 Changed 8 years ago by karsten

Summary: changed from "Project: Split relay and link crypto across multiple CPU cores" to "Split relay and link crypto across multiple CPU cores"
Type: changed from task to project

comment:5 Changed 8 years ago by nickm

Milestone: changed from Tor: 0.2.3.x-final to Tor: unspecified

comment:6 Changed 7 years ago by nickm

Milestone: changed from Tor: unspecified to Tor: 0.2.4.x-final
Priority: changed from normal to major

comment:7 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:8 Changed 7 years ago by nickm

Component: changed from Tor Relay to Tor

comment:9 Changed 7 years ago by nickm

Milestone: changed from Tor: 0.2.4.x-final to Tor: unspecified

Added a sub-ticket for the relay component.

comment:10 Changed 6 years ago by elgo

May I suggest raising this to critical priority?
21st-century crypto software can't afford not to be fully threaded ;)
No CPU sold today is single-core anymore, and I'm sure few people would run a dedicated Tor relay 24/7 only to see it used at 1/n of its capacity.

Changed 5 years ago by towelenee

Attachment: relay.2.c added

Changed 5 years ago by towelenee

Attachment: relay.c added

Here are my changes for relay.c. It uses multiple cores via OpenMP, but it needs changes in Makefile.am.

Changed 5 years ago by towelenee

Attachment: relay.3.c added

Changed 5 years ago by towelenee

Attachment: relay.4.c added

The last patch doesn't use OpenMP, just pthreads.

comment:11 Changed 3 years ago by nickm

Keywords: 6s194 added

comment:12 Changed 3 years ago by nickm

Keywords: term-project-ideas added; 6s194 removed

These tickets were tagged "6s194" as ideas for possible term projects for students in MIT subject 6.S194 spring 2016. I'm retagging with term-project-ideas, so that the students can use the 6s194 tag for tickets they're actually working on.

comment:13 Changed 2 years ago by nickm

Keywords: threads performance added
Points: 10
Severity: Normal

comment:14 Changed 15 months ago by chelseakomlo

How likely is it that this functionality (or parts of it) can be implemented in Rust? Would it require a lot of refactoring or is it already fairly modularized?

comment:15 in reply to:  14 Changed 15 months ago by cypherpunks

Replying to chelseakomlo:

How likely is it that this functionality (or parts of it) can be implemented in Rust? Would it require a lot of refactoring or is it already fairly modularized?

Isis offered a glimpse of the answer: https://blog.torproject.org/comment/269723#comment-269723

comment:16 in reply to:  14 ; Changed 15 months ago by nickm

Replying to chelseakomlo:

How likely is it that this functionality (or parts of it) can be implemented in Rust? Would it require a lot of refactoring or is it already fairly modularized?

It's not so well modularized right now. The big problem here is that the code is written with the assumption that relay crypto finishes immediately, but with this change, we'd sometimes have to wait on another thread before we had cells to send on a given circuit.

comment:17 in reply to:  16 Changed 14 months ago by chelseakomlo

Replying to nickm:

Replying to chelseakomlo:

How likely is it that this functionality (or parts of it) can be implemented in Rust? Would it require a lot of refactoring or is it already fairly modularized?

It's not so well modularized right now. The big problem here is that the code is written with the assumption that relay crypto finishes immediately, but with this change, we'd sometimes have to wait on another thread before we had cells to send on a given circuit.

OK, understood. It seems like relay_crypt_one_payload is the place where this would happen, and instead of blocking, it would emit an event once the relay crypto finishes.

I'll dig more into https://trac.torproject.org/projects/tor/wiki/org/projects/Tor/MultithreadedCrypto, and will come up with a mini implementation plan for review.
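To make the "emit an event instead of blocking" idea concrete, here is a minimal sketch under stated assumptions: a job is handed to another thread, the finished job is pushed onto a reply queue, and the main thread later drains that queue and runs a completion callback. All names (`crypt_job_t`, `reply_queue_t`, `crypt_job_submit`, `reply_queue_drain`, `count_done_cb`) are hypothetical and are not the real interface around relay_crypt_one_payload. The XOR loop is a stand-in for the relay crypto, and one thread per job stands in for a worker pool.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CELL_PAYLOAD_LEN 509  /* relay cell payload size in bytes */

struct crypt_job_t;

/* Reply queue: filled by worker threads, drained by the main thread. */
typedef struct {
  pthread_mutex_t lock;
  struct crypt_job_t *head;
} reply_queue_t;

/* Hypothetical job: one cell payload plus a completion callback. */
typedef struct crypt_job_t {
  uint8_t payload[CELL_PAYLOAD_LEN];
  uint8_t key;                       /* toy key, stand-in for circuit keys */
  void (*on_done)(struct crypt_job_t *job, void *ctx); /* runs on main thread */
  void *ctx;
  reply_queue_t *replyq;
  struct crypt_job_t *next;
} crypt_job_t;

static void reply_queue_init(reply_queue_t *q)
{
  pthread_mutex_init(&q->lock, NULL);
  q->head = NULL;
}

/* Worker side: do the (toy) crypto, then publish the finished job. */
static void *job_thread_main(void *arg)
{
  crypt_job_t *job = arg;
  for (size_t i = 0; i < CELL_PAYLOAD_LEN; i++)
    job->payload[i] ^= (uint8_t)(job->key + i);
  pthread_mutex_lock(&job->replyq->lock);
  job->next = job->replyq->head;
  job->replyq->head = job;
  pthread_mutex_unlock(&job->replyq->lock);
  return NULL;
}

/* Main-thread side: start the job and return immediately.
 * Returns 0 on success, -1 if the thread could not be started. */
static int crypt_job_submit(crypt_job_t *job, reply_queue_t *q,
                            pthread_t *thread_out)
{
  job->replyq = q;
  job->next = NULL;
  return pthread_create(thread_out, NULL, job_thread_main, job) == 0 ? 0 : -1;
}

/* Main-thread side: run the callback of every finished job.
 * Returns the number of jobs completed since the last call. */
static size_t reply_queue_drain(reply_queue_t *q)
{
  size_t n = 0;
  pthread_mutex_lock(&q->lock);
  crypt_job_t *done = q->head;
  q->head = NULL;
  pthread_mutex_unlock(&q->lock);
  while (done) {
    crypt_job_t *next = done->next;
    done->on_done(done, done->ctx);
    done = next;
    n++;
  }
  return n;
}

/* Example callback: count finished jobs through an int pointed to by ctx. */
static void count_done_cb(crypt_job_t *job, void *ctx)
{
  (void)job;
  ++*(int *)ctx;
}
```

In an event-driven program the worker would presumably also wake the main loop (for example by writing to a pipe it watches) so that the drain step runs promptly; that part is left out here.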

comment:18 Changed 14 months ago by chelseakomlo

Owner: changed from nickm to chelseakomlo
Status: changed from accepted to assigned

comment:19 Changed 14 months ago by Samdney

Cc: Samdney added

Please add me as an observer. I have already spent some time on this. Maybe I can help :)

comment:20 Changed 14 months ago by chelseakomlo

Ok, I have a start of a plan which I'm looking forward to discussing/further refining in Seattle. I took a large amount from https://trac.torproject.org/projects/tor/wiki/org/projects/Tor/MultithreadedCrypto but there are some things which are out of date (circuit priority logic, for example) so any further pointers on what is different between when that wiki was written and where we are today would be helpful.

Below is a pad with a high level plan/starting implementation ideas; I've also attached a high level (pretty rough, sorry!) proposed architectural diagram to this ticket. Looking forward to further discussion, particularly around the proposal to use Rust and any Rust/C integration issues that could be particularly painful, and also any better ideas about how to cleanly register/edge trigger events.

https://pad.riseup.net/p/MultiThreadedCrypto_ImplementationPlan-keep

Changed 14 months ago by chelseakomlo

Attachment: MultiThreadedCrypto.png added

comment:21 Changed 14 months ago by dgoulet

I have a few comments about the proposed design. This is something I thought about a while back but never got cycles to implement.

Where should I discuss the plan? I would avoid using the ticket for that. I think a tor-dev@ thread would be ideal here?

comment:22 Changed 14 months ago by Hello71

one small thing: as I said on IRC, my informal profiling seems to show that a significant amount of CPU time is spent in the kernel networking stack, including TCP/IP and waiting for the network device. it's possible that that last bit is partially because of an old virtio-net though. it could potentially be easier to integrate recv multithreading at this point; or maybe not! maybe it would be easier (even in total) to just do crypto first, and then the other stuff.

comment:23 in reply to:  22 Changed 14 months ago by chelseakomlo

Replying to Hello71:

one small thing: as I said on IRC, my informal profiling seems to show that a significant amount of CPU time is spent in the kernel networking stack, including TCP/IP and waiting for the network device. it's possible that that last bit is partially because of an old virtio-net though. it could potentially be easier to integrate recv multithreading at this point; or maybe not! maybe it would be easier (even in total) to just do crypto first, and then the other stuff.

That is a good point, though it is a separate piece of work from this specific task. However, it would probably be good to open an issue for "Recent profiling outcomes" so that we can take a closer look and track/make other issues for discoveries like this.

comment:24 Changed 13 months ago by nickm

Keywords: 035-roadmap-master added
Milestone: changed from Tor: unspecified to Tor: 0.3.5.x-final

comment:25 Changed 12 months ago by nickm

Keywords: 035-triaged-in-20180711 added

comment:26 Changed 12 months ago by Vort

Hello71, chelseakomlo: I opened such a ticket some time ago: #23433.

comment:27 Changed 10 months ago by nickm

Milestone: changed from Tor: 0.3.5.x-final to Tor: unspecified

comment:28 Changed 6 months ago by arthuredelstein

Cc: arthuredelstein added

comment:29 in reply to:  description ; Changed 6 months ago by schroeder

There are early plans to distribute crypto operations across multiple cores, but there might be a better way.

(I emailed before, but I just found the tiny reply link-button)

The ticket states the goal is to saturate the bandwidth available (by using all the cores as efficiently as possible).

I don't understand why a relay needs to have a "main thread". Network traffic arrives as an async operation and can be sent back out asynchronously. So a final strategy shouldn't have a central thread. The main thread might still be needed for startup, runtime adjustment, and system upkeep, but not for the core network-crypto processing; that should never need to touch the main thread.

The current proposal speaks about multi-threading crypto operations, let's call that "A) Speed - Speeding up processing of a single cell". Instead, I propose "B) Concurrency - Restructuring so multiple cells can be processed concurrently".

A cell of data should arrive via IO-Completion thread on a random CPU core, have crypto transformation applied on the same one core, then be dispatched onward out via the network. This seems to be quite a simple approach where I would think crypto code can remain the same "single-threaded" implementation.

Approach [A] will have diminishing returns as the number of cores increases. You can only break up a cell unit of work so much until you're encrypting one byte per cpu core. However, with approach [B], if you have millions of CPU cores (as an extreme) you can be processing millions of cells concurrently. Therefore, I believe approach [B] would be more scalable.

There would be circuit-state to maintain. Concurrent cells on the same circuit should be queued or thread-locked. I suspect thread-locking will be simple enough - the best approach.
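A minimal sketch of the thread-locking option, assuming a per-circuit mutex that guards the keystream position: any I/O thread may process a cell for any circuit, and the lock keeps two threads from using or advancing the same circuit's stream state at once. `locked_circuit_t`, `locked_circuit_crypt_cell` and `io_thread_main` are hypothetical names, and the XOR loop is a stand-in for AES-CTR.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define CELL_PAYLOAD_LEN 509   /* relay cell payload size in bytes */
#define CELLS_PER_THREAD 1000  /* cells each demo thread processes */

/* Hypothetical circuit state shared between I/O threads. */
typedef struct {
  pthread_mutex_t lock;
  uint8_t key;          /* toy key, stand-in for an AES key */
  uint64_t stream_pos;  /* bytes of keystream consumed so far */
} locked_circuit_t;

static void locked_circuit_init(locked_circuit_t *c, uint8_t key)
{
  pthread_mutex_init(&c->lock, NULL);
  c->key = key;
  c->stream_pos = 0;
}

/* Encrypt one cell in place while holding the circuit lock.
 * Returns the stream offset at which this cell was encrypted. */
static uint64_t locked_circuit_crypt_cell(locked_circuit_t *c,
                                          uint8_t *payload)
{
  pthread_mutex_lock(&c->lock);
  uint64_t start = c->stream_pos;
  for (size_t i = 0; i < CELL_PAYLOAD_LEN; i++)
    payload[i] ^= (uint8_t)(c->key + start + i);  /* toy keystream byte */
  c->stream_pos = start + CELL_PAYLOAD_LEN;
  pthread_mutex_unlock(&c->lock);
  return start;
}

/* Demo thread: repeatedly processes cells on one shared circuit. */
static void *io_thread_main(void *arg)
{
  locked_circuit_t *c = arg;
  uint8_t payload[CELL_PAYLOAD_LEN] = {0};
  for (int i = 0; i < CELLS_PER_THREAD; i++)
    locked_circuit_crypt_cell(c, payload);
  return NULL;
}
```

One caveat with this sketch: the mutex serializes access, but it does not make cells go through in arrival order. A counter-mode stream needs that ordering, which is presumably where the queueing alternative comes in.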

Given that it's only a problem for the biggest nodes, a design should be chosen that is very time-efficient to implement and focuses on achieving the goals of such users, not focusing on squeezing every drop of performance, for performance sake. I believe this is that efficient and focused design.

What do you think?

Last edited 6 months ago by schroeder

comment:30 in reply to:  29 Changed 6 months ago by Hello71

I've been saying this whole time that my (admittedly very informal) benchmarks show that the time spent in relay crypto is not significant, as long as AES is hardware accelerated (e.g. AES-NI). I assume that this task will require a significant amount of effort to bang out the final design and implement it. Therefore, if someone wants to do this, it is my opinion that they should first make better benchmarks, or find better benchmarks (I heard dgoulet was doing something...) showing that with modern openssl and hardware accelerated AES, parallelization is required.

Additionally: as I understand, the current design is highly single-threaded. In particular, the scheduler is a key component of modern Tor, and if I follow correctly, is sort of a bottleneck to full parallelism.

comment:31 Changed 6 weeks ago by neel

Cc: neel added

Is any work being done on this? Do we require Rust support before work can start?

I'd love to have multicore relays as well (for my home server hosted on residential FTTH).

comment:32 Changed 6 weeks ago by cypherpunks

It would also help virtualized servers, which have high bandwidth but low single-core CPU power.

comment:33 in reply to:  31 Changed 6 weeks ago by teor

Replying to neel:

Is any work being done on this?

No.

Do we require Rust support before work can start?

No, but reliable multithreaded code is hard to write in C.

Maybe start with some easier tasks first?
