Opened 6 years ago

Last modified 8 months ago

#8897 needs_revision enhancement

Faster curve25519 implementation for ntor

Reported by: nickm Owned by: nickm
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-relay, performance, ntor, curve25519, crypto, impl-shopping, 034-triage-20180328, 034-removed-20180328
Cc: isis Actual Points:
Parent ID: #9662 Points: large
Reviewer: Sponsor:

Description

Floodberry's curve25519 implementations at https://github.com/floodyberry/curve25519-donna are mostly C, and claim to be faster still than the ones we're using now, especially on intel cpus. We should evaluate them and consider switching.

Also, if we find an ed25519 implementation we like and wind up using it, we should evaluate using its component pieces to build an optimized curve25519 implementation for calculations on the base point as per http://www.imperialviolet.org/2013/05/10/fastercurve25519.html ; Adam has some example code based on one of the amd64 assembly implementations.

Child Tickets

Change History (42)

comment:1 Changed 5 years ago by arma

Nick, if this is something near ready to go, it would be great to make it go (in 0.2.5). That way when we get overwhelmed with ntor creates, some of the relays will be able to power through many of them.

comment:2 Changed 5 years ago by arma

Parent ID: #9657

comment:3 Changed 5 years ago by nickm

Owner: set to nickm
Priority: normalmajor
Status: newaccepted

I will bump up priority here. I don't, however, have a ready-to-merge patch. I'll put that on the do-queue, though.

comment:4 Changed 5 years ago by nickm

Parent ID: #9657#9662

comment:5 Changed 5 years ago by nickm

The "optimized basepoint multiply" part is now #9663. This ticket is only about "use a faster curve25519".

comment:6 Changed 5 years ago by nickm

Some initial (not terribly good) code in branch ticket8897_floodyberry_curve25519. Read commit msg; more work needed.

comment:7 Changed 5 years ago by nickm

More hacking done. I expect a modest improvement on 64-bit systems and a big win on 32-bit ones.

comment:8 Changed 5 years ago by nickm

Status: acceptedneeds_review

The branch is now "ticket8897_9663_v2". Will tweak a little more, but could probably stand some review.

comment:9 Changed 5 years ago by nickm

This and the #9663 changes together save 28% of the runtime for the server side of the ntor handshake when built with gcc 4.7 on my 64-bit laptop (136 vs 175 usec); 18% when built with clang (177 vs 213 usec); and 22% (155 vs 200 usec) when built with gcc 4.2-llvm. (yeah, yeah, I know, we should be using CPU counters. That's another issue.)

Note that using a recent gcc gets a pretty big performance boost in and of itself with this code.

comment:10 Changed 5 years ago by andrea

My review of the ticket8897_9663_v2 branch at https://trac.torproject.org/projects/tor/ticket/9663#comment:4 also covers this ticket.

comment:11 Changed 5 years ago by nickm

Milestone: Tor: 0.2.5.x-finalTor: 0.2.6.x-final

comment:12 Changed 5 years ago by nickm

Status: needs_reviewneeds_revision

I should take another pass over this branch.

comment:13 Changed 5 years ago by nickm

Keywords: 026-triaged added

comment:14 Changed 4 years ago by nickm

Keywords: 026-triaged-0 added; 026-triaged removed

comment:15 Changed 4 years ago by nickm

Based on profile information from #11332, maybe we should let this take a back-seat for a while.

Having a look at Andrea's results in "tor-58f4200789d0cc47ebd88f3091207cf4dd493573-profile-run.gprof.txt", I see 7956 calls to onion_skin_ntor_server_handshake vs 78824 to onion_skin_TAP_server_handshake(). But curve25519_donna accounts for less than .01% of total runtime. If we imagine that we replaced all of the TAP handhakes with ntor handshakes, that would still be only about 0.07% of runtime.

Possibly it's just not worth our time to optimize this stuff right now.

comment:16 Changed 4 years ago by nickm

Milestone: Tor: 0.2.6.x-finalTor: 0.2.???

Kicking this into 0.2.???. It's nice, it's fun, but unless curve25519 shows up higher in our profiles, it's useless.

comment:17 Changed 4 years ago by arma

Sounds plausible. We in part wanted a faster ntor implementation for the future where the botnet upgraded its clients -- but that future hasn't appeared yet, and it's been a while, so hey.

Seems like if we want to optimize handling of create cells, we'd get a lot more mileage out of making #9682 happen.

comment:18 Changed 4 years ago by arma

(To be clearer, the optimization here is for the cpuworker threads, and the optimization in #9682 is for the main thread. We could process many more ntor cells if we could get them to cpuworkers more efficiently.)

comment:19 Changed 4 years ago by nickm

Milestone: Tor: 0.2.???Tor: 0.2.7.x-final

These might also be worth looking at in 0.2.7

comment:20 Changed 4 years ago by nickm

Keywords: 027-triaged-1-in added

Marking some tickets as triaged-in for 0.2.7 based on early triage

comment:21 Changed 3 years ago by yawning

Has anyone (apart from me) benchmarked this stuff recently in isolation from #9663?

On a i5-4250U (TurboBoost etc), I get:

  • Andrew M. SSE2 (-O3 -m64 -DCURVE25519_SSE2): 203194 ticks (clang: 160530 ticks)
  • Andrew M. (-O3 -m64): 125878 ticks (clang: 136594 ticks)
  • agl c64 (-O3 -m64): 134442 ticks (clang: 150482 ticks)

This is compiling with gcc 5.1.0, clang 3.6.1. On x86_64 this doesn't feel worth it at all, especially considering that anyone who cares about Curve25519 performance is going to probably be hosting a high traffic HS, and should be on something modern with a 64 bit processor.

How much do we care about 32 bit performance for this sort of thing anyway?

comment:22 Changed 3 years ago by nickm

I think I benchmarked this stuff a tiny bit, long long ago. I think 32-bit performance is mostly nice-to-have, but not critical. A performance improvement that only helps with 64 bit is fine.

The numbers you give for the -m64 cases seem like they're a 7-10% improvement in curve25519 processing time. I agree that's not necessarily low-hanging fruit for 0.2.7, but it doesn't mean we should write it off permanently. 7% here and 7% there can add up to real savings in the long run.

That said, I'm not going to call this a critical optimization. :)

comment:23 Changed 3 years ago by nickm

Milestone: Tor: 0.2.7.x-finalTor: 0.2.8.x-final

comment:24 Changed 3 years ago by nickm

Points: large

comment:25 Changed 3 years ago by nickm

Priority: HighMedium

comment:26 Changed 3 years ago by nickm

Keywords: pre028-patch added

comment:27 Changed 3 years ago by nickm

Milestone: Tor: 0.2.8.x-finalTor: unspecified
Severity: Normal

18:39 < Yawning> 7% doesn't feel great for moving to something less tested

I'm sold. Not now at least.

comment:28 Changed 3 years ago by yawning

So, some researchers went and did an AVX2 Curve25519 implementation.

http://link.springer.com/chapter/10.1007/978-3-319-22174-8_18?no-access=true (Paywall)

Having read the paper, when they say "slight" they mean it (a few percent over djb's amd64 assembly), and their code isn't available. Their work should be applicable to AVX512 but since that's still Xeon only even with Skylake, I'm inclined to also leave this to a "if we get really desperate" sort of thing.

(Key generation gets more of a boost, but we already have faster code there.)

comment:29 Changed 3 years ago by yawning

comment:30 Changed 3 years ago by yawning

More random thoughts on this topic so I don't forget. We most certainly should pull in ARM optimized X25519, since using NEON is a big boost, especially on 32 bit platforms.

comment:31 Changed 19 months ago by nickm

Keywords: 027-triaged-in added

comment:32 Changed 19 months ago by nickm

Keywords: 027-triaged-in removed

comment:33 Changed 19 months ago by nickm

Keywords: 027-triaged-1-in removed

comment:34 Changed 19 months ago by nickm

Keywords: 026-triaged-0 removed

comment:35 Changed 19 months ago by nickm

Keywords: pre028-patch removed

comment:36 Changed 19 months ago by nickm

Keywords: curve25519 crypto impl-shopping added

I also heard rumors that curve25519-dalek was nice...

comment:37 Changed 19 months ago by isis

Cc: isis added

comment:38 Changed 12 months ago by Hello71

my profiling results indicate that approximately 18% of my x86_64 Fast relay's CPU time is spent in curve25519_donna. therefore, using the 7-10% number from previously, this optimization will result in overall savings of 1.3-1.8%. not huge, but certainly not negligible.

comment:39 Changed 12 months ago by teor

Milestone: Tor: unspecifiedTor: 0.3.4.x-final

Let's think about this in 0.3.4, because the 0.3.3 freeze is only a few weeks away.
We should see if dalek is ready by then.

comment:40 Changed 9 months ago by nickm

Keywords: 034-triage-20180328 added

comment:41 Changed 9 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:42 Changed 8 months ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: unspecified

These needs_revision, tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if somebody does the necessary revision.

Note: See TracTickets for help on using tickets.