Faster curve25519 implementation for ntor

changed milestone to %Tor: unspecified

added 034-removed-20180328 034-triage-20180328 component::core tor/tor crypto curve25519 impl-shopping milestone::Tor: unspecified ntor parent::9662 performance points::large priority::medium severity::normal status::needs-revision tor-relay type::enhancement labels

Nick, if this is something near ready to go, it would be great to make it go (in 0.2.5). That way when we get overwhelmed with ntor creates, some of the relays will be able to power through many of them.

Trac:
Parent: N/A to #9657 (moved)

I will bump up priority here. I don't, however, have a ready-to-merge patch. I'll put that on the do-queue, though.

Trac:
Status: new to accepted
Owner: N/A to nickm
Priority: normal to major

Trac:
Parent: #9657 (moved) to #9662 (moved)

The "optimized basepoint multiply" part is now #9663 (moved). This ticket is only about "use a faster curve25519".

Some initial (not terribly good) code in branch ticket8897_floodyberry_curve25519. Read commit msg; more work needed.

More hacking done. I expect a modest improvement on 64-bit systems and a big win on 32-bit ones.

The branch is now "ticket8897_9663_v2". Will tweak a little more, but could probably stand some review.

Trac:
Status: accepted to needs_review

This and the #9663 (moved) changes together save 28% of the runtime for the server side of the ntor handshake when built with gcc 4.7 on my 64-bit laptop (136 vs 175 usec); 18% when built with clang (177 vs 213 usec); and 22% (155 vs 200 usec) when built with gcc 4.2-llvm. (yeah, yeah, I know, we should be using CPU counters. That's another issue.)

Note that using a recent gcc gets a pretty big performance boost in and of itself with this code.

My review of the ticket8897_9663_v2 branch at https://trac.torproject.org/projects/tor/ticket/9663#comment:4 also covers this ticket.

Trac:
Milestone: Tor: 0.2.5.x-final to Tor: 0.2.6.x-final

I should take another pass over this branch.

Trac:
Status: needs_review to needs_revision

Trac:
Keywords: tor-relay performance ntor deleted, tor-relay performance ntor 026-triaged added

Trac:
Keywords: tor-relay performance ntor 026-triaged deleted, performance, ntor, 026-triaged-0, tor-relay added

Based on profile information from #11332 (moved), maybe we should let this take a back-seat for a while.

Having a look at Andrea's results in "tor-58f4200789d0cc47ebd88f3091207cf4dd493573-profile-run.gprof.txt", I see 7956 calls to onion_skin_ntor_server_handshake vs 78824 to onion_skin_TAP_server_handshake(). But curve25519_donna accounts for less than .01% of total runtime. If we imagine that we replaced all of the TAP handshakes with ntor handshakes, that would still be only about 0.07% of runtime.

Possibly it's just not worth our time to optimize this stuff right now.
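For anyone who wants to redo the scaling behind that estimate, here is a back-of-the-envelope sketch in Python. It assumes that all curve25519_donna time in the profile comes from ntor handshakes and that the per-handshake cost stays the same once TAP is replaced. The function and variable names are made up for illustration. Because the profile only says "less than .01%", the result is an upper bound, and the real figure sits somewhere below it.

```python
def projected_share_pct(current_share_pct, ntor_calls, tap_calls):
    """Scale the curve25519 share of runtime as if every TAP handshake were ntor.

    Assumes the share grows linearly with the number of ntor handshakes.
    """
    scale = (ntor_calls + tap_calls) / ntor_calls
    return current_share_pct * scale


# Call counts quoted from the gprof output in the comment above.
ntor_calls = 7956
tap_calls = 78824

# "less than .01%" in the profile, so this is treated as an upper bound.
upper_bound_pct = 0.01

scale = (ntor_calls + tap_calls) / ntor_calls
projected = projected_share_pct(upper_bound_pct, ntor_calls, tap_calls)

print(f"handshake count grows by a factor of {scale:.1f}")
print(f"projected curve25519 share: under {projected:.2f}% of runtime")
```

Either way the projected share stays far below one percent, which is the point of the comment.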

Kicking this into 0.2.???. It's nice, it's fun, but unless curve25519 shows up higher in our profiles, it's useless.

Trac:
Milestone: Tor: 0.2.6.x-final to Tor: 0.2.???

Sounds plausible. We in part wanted a faster ntor implementation for the future where the botnet upgraded its clients -- but that future hasn't appeared yet, and it's been a while, so hey.

Seems like if we want to optimize handling of create cells, we'd get a lot more mileage out of making #9682 (moved) happen.

(To be clearer, the optimization here is for the cpuworker threads, and the optimization in #9682 (moved) is for the main thread. We could process many more ntor cells if we could get them to cpuworkers more efficiently.)

These might also be worth looking at in 0.2.7

Trac:
Milestone: Tor: 0.2.??? to Tor: 0.2.7.x-final

Marking some tickets as triaged-in for 0.2.7 based on early triage

Trac:
Keywords: N/A deleted, 027-triaged-1-in added

Has anyone (apart from me) benchmarked this stuff recently in isolation from #9663 (moved)?

On an i5-4250U (TurboBoost etc), I get:

Andrew M. SSE2 (-O3 -m64 -DCURVE25519_SSE2): 203194 ticks (clang: 160530 ticks)
Andrew M. (-O3 -m64): 125878 ticks (clang: 136594 ticks)
agl c64 (-O3 -m64): 134442 ticks (clang: 150482 ticks)

This is compiled with gcc 5.1.0 and clang 3.6.1. On x86_64 this doesn't feel worth it at all, especially considering that anyone who cares about Curve25519 performance is probably going to be hosting a high-traffic HS, and should be on something modern with a 64-bit processor.

How much do we care about 32 bit performance for this sort of thing anyway?

I think I benchmarked this stuff a tiny bit, long long ago. I think 32-bit performance is mostly nice-to-have, but not critical. A performance improvement that only helps with 64 bit is fine.

The numbers you give for the -m64 cases seem like they're a 7-10% improvement in curve25519 processing time. I agree that's not necessarily low-hanging fruit for 0.2.7, but it doesn't mean we should write it off permanently. 7% here and 7% there can add up to real savings in the long run.

That said, I'm not going to call this a critical optimization. :)
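The percentages can be rechecked from the tick counts quoted earlier with a short Python sketch. It assumes that "agl c64" is the baseline and that the non-SSE2 "Andrew M." build is the candidate. The helper names are hypothetical. Two conventions are shown because the result depends on which number is the denominator.

```python
# Tick counts from the -m64 benchmark above (lower is better).
# Assumption: "agl c64" is the baseline, "Andrew M." is the candidate.
ticks = {
    "gcc": {"baseline": 134442, "candidate": 125878},
    "clang": {"baseline": 150482, "candidate": 136594},
}


def reduction_pct(baseline, candidate):
    """Percentage of the baseline's ticks that the candidate saves."""
    return 100.0 * (baseline - candidate) / baseline


def speedup_pct(baseline, candidate):
    """Saved ticks expressed relative to the candidate's tick count."""
    return 100.0 * (baseline - candidate) / candidate


results = {}
for compiler, t in ticks.items():
    results[compiler] = (
        reduction_pct(t["baseline"], t["candidate"]),
        speedup_pct(t["baseline"], t["candidate"]),
    )
    print(f"{compiler}: {results[compiler][0]:.1f}% fewer ticks, "
          f"{results[compiler][1]:.1f}% relative to the candidate")
```

Under these assumptions the improvement lands between roughly 6% and 10%, depending on the compiler and the convention used.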

Trac:
Milestone: Tor: 0.2.7.x-final to Tor: 0.2.8.x-final

Trac:
Points: N/A to large
Sponsor: N/A to N/A

Trac:
Priority: High to Medium

Trac:
Keywords: N/A deleted, pre028-patch added

18:39 < Yawning> 7% doesn't feel great for moving to something less tested

I'm sold. Not now at least.

Trac:
Severity: N/A to Normal
Milestone: Tor: 0.2.8.x-final to Tor: unspecified

So, some researchers went and did an AVX2 Curve25519 implementation.

http://link.springer.com/chapter/10.1007/978-3-319-22174-8_18?no-access=true (Paywall)

Having read the paper, when they say "slight" they mean it (a few percent over djb's amd64 assembly), and their code isn't available. Their work should be applicable to AVX512, but since that's still Xeon only even with Skylake, I'm inclined to also leave this as an "if we get really desperate" sort of thing.

(Key generation gets more of a boost, but we already have faster code there.)

This also looks promising, though it's optimized for modern Intel CPUs.

http://csrc.nist.gov/groups/ST/ecc-workshop-2015/presentations/session6-chou-tung.pdf https://www.win.tue.nl/~tchou/sandy2x/

Public domain.

Trac:
Reviewer: N/A to N/A

More random thoughts on this topic so I don't forget. We most certainly should pull in ARM optimized X25519, since using NEON is a big boost, especially on 32 bit platforms.

Trac:
Keywords: N/A deleted, 027-triaged-in added

Trac:
Keywords: 027-triaged-in deleted, N/A added

Trac:
Keywords: 027-triaged-1-in deleted, N/A added

Trac:
Keywords: 026-triaged-0 deleted, N/A added

Trac:
Keywords: pre028-patch deleted, N/A added

I also heard rumors that curve25519-dalek was nice...

Trac:
Keywords: N/A deleted, curve25519 crypto impl-shopping added

Trac:
Cc: N/A to isis

My profiling results indicate that approximately 18% of my x86_64 Fast relay's CPU time is spent in curve25519_donna. Using the 7-10% figure from earlier, this optimization would therefore save 1.3-1.8% of CPU time overall. Not huge, but certainly not negligible.
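The arithmetic behind that estimate, as a small Python sketch. It assumes the 7-10% figure is a reduction in the time spent inside curve25519_donna and that nothing else in the relay changes. The variable names are illustrative only.

```python
# Fraction of relay CPU time spent in curve25519_donna (from the profile above).
curve25519_share = 0.18

# Assumed reduction in curve25519 time from a faster implementation.
reduction_low, reduction_high = 0.07, 0.10

# Overall saving is the affected share multiplied by the reduction within it.
overall_low = curve25519_share * reduction_low
overall_high = curve25519_share * reduction_high

print(f"overall CPU saving: {overall_low:.1%} to {overall_high:.1%}")
```

The saving is capped by the 18% share: even an infinitely fast curve25519 could not save more than that on this relay.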

Let's think about this in 0.3.4, because the 0.3.3 freeze is only a few weeks away. We should see if dalek is ready by then.

Trac:
Milestone: Tor: unspecified to Tor: 0.3.4.x-final

Trac:
Keywords: curve25519 crypto impl-shopping deleted, curve25519, impl-shopping, crypto, 034-triage-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

Trac:
Keywords: N/A deleted, 034-removed-20180328 added

These needs_revision tickets, tagged with 034-removed-*, are no longer in scope for 0.3.4. We can reconsider any of them if somebody does the necessary revision.

Trac:
Milestone: Tor: 0.3.4.x-final to Tor: unspecified

Trac:
Status: needs_revision to assigned
Owner: nickm to N/A

None of these revisions are in my near-term plans.

Trac:
Status: assigned to needs_revision

mentioned in issue #9662 (moved)

mentioned in issue #9663 (moved)

mentioned in issue #12464 (moved)

mentioned in issue #15463 (moved)

moved to tpo/core/tor#8897 (closed)
