Floodyberry's curve25519 implementations at https://github.com/floodyberry/curve25519-donna are mostly C, and claim to be faster still than the ones we're using now, especially on Intel CPUs. We should evaluate them and consider switching.
Also, if we find an ed25519 implementation we like and wind up using it, we should evaluate using its component pieces to build an optimized curve25519 implementation for calculations on the base point as per http://www.imperialviolet.org/2013/05/10/fastercurve25519.html ; Adam has some example code based on one of the amd64 assembly implementations.
Nick, if this is something near ready to go, it would be great to make it go (in 0.2.5). That way when we get overwhelmed with ntor creates, some of the relays will be able to power through many of them.
This and the #9663 (moved) changes together save 28% of the runtime for the server side of the ntor handshake when built with gcc 4.7 on my 64-bit laptop (136 vs 175 usec); 18% when built with clang (177 vs 213 usec); and 22% (155 vs 200 usec) when built with gcc 4.2-llvm. (yeah, yeah, I know, we should be using CPU counters. That's another issue.)
Note that using a recent gcc gets a pretty big performance boost in and of itself with this code.
Based on profile information from #11332 (moved), maybe we should let this take a back-seat for a while.
Having a look at Andrea's results in "tor-58f4200789d0cc47ebd88f3091207cf4dd493573-profile-run.gprof.txt", I see 7956 calls to onion_skin_ntor_server_handshake() vs 78824 to onion_skin_TAP_server_handshake(). But curve25519_donna accounts for less than 0.01% of total runtime. If we imagine that we replaced all of the TAP handshakes with ntor handshakes, that would still be only about 0.07% of runtime.
Possibly it's just not worth our time to optimize this stuff right now.
Sounds plausible. We in part wanted a faster ntor implementation for the future where the botnet upgraded its clients -- but that future hasn't appeared yet, and it's been a while, so hey.
Seems like if we want to optimize handling of create cells, we'd get a lot more mileage out of making #9682 (moved) happen.
(To be clearer, the optimization here is for the cpuworker threads, and the optimization in #9682 (moved) is for the main thread. We could process many more ntor cells if we could get them to cpuworkers more efficiently.)
This is compiling with gcc 5.1.0, clang 3.6.1. On x86_64 this doesn't feel worth it at all, especially considering that anyone who cares about Curve25519 performance is going to probably be hosting a high traffic HS, and should be on something modern with a 64 bit processor.
How much do we care about 32 bit performance for this sort of thing anyway?
I think I benchmarked this stuff a tiny bit, long long ago. I think 32-bit performance is mostly nice-to-have, but not critical. A performance improvement that only helps with 64 bit is fine.
The numbers you give for the -m64 cases seem like they're a 7-10% improvement in curve25519 processing time. I agree that's not necessarily low-hanging fruit for 0.2.7, but it doesn't mean we should write it off permanently. 7% here and 7% there can add up to real savings in the long run.
That said, I'm not going to call this a critical optimization. :)
Having read the paper, when they say "slight" they mean it (a few percent over djb's amd64 assembly), and their code isn't available. Their work should be applicable to AVX512, but since that's still Xeon-only even with Skylake, I'm inclined to also leave this as an "if we get really desperate" sort of thing.
(Key generation gets more of a boost, but we already have faster code there.)
More random thoughts on this topic so I don't forget. We most certainly should pull in ARM optimized X25519, since using NEON is a big boost, especially on 32 bit platforms.
my profiling results indicate that approximately 18% of my x86_64 Fast relay's CPU time is spent in curve25519_donna. therefore, using the 7-10% number from previously, this optimization will result in overall savings of 1.3-1.8%. not huge, but certainly not negligible.
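For reference, the estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes the 7-10% figure is a reduction in time spent inside curve25519_donna and that nothing else in the relay changes. The function name `overall_savings` is made up for illustration.

```python
def overall_savings(fraction_in_hotspot, hotspot_reduction):
    """Fraction of total CPU time saved when a hotspot that takes
    `fraction_in_hotspot` of the runtime gets `hotspot_reduction` faster.

    Both arguments are fractions between 0 and 1. Assumes the rest of
    the program is unaffected (hypothetical simplification).
    """
    return fraction_in_hotspot * hotspot_reduction


# 18% of relay CPU time is in curve25519_donna (from the profile above);
# 7-10% is the curve25519 improvement quoted earlier in the thread.
low = overall_savings(0.18, 0.07)
high = overall_savings(0.18, 0.10)
print(f"overall savings: {low:.2%} to {high:.2%}")
```

The same function applied to a profile where curve25519 is a tiny share of runtime gives a correspondingly tiny result, which is why the two profiles in this thread lead to different conclusions.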
These needs_revision tickets, tagged with 034-removed-*, are no longer in scope for 0.3.4. We can reconsider any of them if somebody does the necessary revision.
Trac: Milestone: Tor: 0.3.4.x-final to Tor: unspecified