Opened 6 years ago

Closed 4 years ago

#9663 closed enhancement (implemented)

Table-based basepoint multiply optimizations for ntor handshake

Reported by: nickm Owned by:
Priority: Medium Milestone: Tor: 0.2.7.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-relay, performance, ntor, 026-triaged-0, 027-triaged-1-in
Cc: yawning Actual Points:
Parent ID: #9662 Points:
Reviewer: Sponsor:

Description

If you know in advance that you'll be performing a large number of scalar multiplications by a given point, you can have an optimized implementation of the scalar multiply optimization. (Adam Langley explains this better than I am likely to do in this ticket description.) We can take advantage of this in a more-obvious way and a less obvious way.

The obvious place to use this is when we're multiplying by a basepoint in order to generate ephemeral keys. There are a few implementations of this technique in libraries we're already talking about; see https://github.com/floodyberry/ed25519-donna for one example.

Child Tickets

TicketTypeStatusOwnerSummary
#16467enhancementclosedFaster Ed25519 implementation.

Change History (22)

comment:1 Changed 6 years ago by nickm

I've played with this somewhat, and have noticed an interesting phenomenon. In benchmarks, the curved25519_basepoint_scalarmult function from ed25519-donna shaves something like 20 usec off public key generation on my laptop. You might think that would shave 20 usec off the server-side ntor handshake, but you'd be wrong: the server-side ntor handshake becomes about 10 usec _slower_. My best theory so far is that this is because of the large set of tables that need to get loaded into L1 cache.

If that's the case, this in turn suggests that we should implement #9664, and calculate ephemeral public keys in bulk ahead of time.

comment:2 Changed 6 years ago by nickm

(My code here is at the moment #if-0'd out in ticket8897_floodyberry_curve25519)

comment:3 Changed 6 years ago by nickm

Status: newneeds_review

Whoops; my benchmarks were borked because of a bug in how we built the bench executable. Fixed now.

ticket8897_9663_v2 has this as well as the #8897 code; needs review.

comment:4 Changed 5 years ago by andrea

Code review follows for the ticket8897_9663_v2 branch:

ee5afe07090f1ae163e7d8bac5e62b233106f319:

  • Good lord that's a lot of hairy code; since it's all external I'll presume it okay for now.

c2eeeae1cf88085f24edd7f01364e8734924b14d:

  • This looks fine to me.

4961107f90cf7d6eb5258a48c47a4fd009dbe5a5:

  • This looks fine to me.

eec326b05d043da1d81e5ac8f7231e83e08a8d1e:

  • This looks fine to me presuming the constants were derived by using a known-good implementation.

dc8534dadfe87f762c072bc7d1bbf37326a676f0:

  • This looks fine to me.

34ff6bd40f29a7df6346b2c2bb49ce42141afb5a:

  • This looks fine to me.

82175b53276f2be4a63d940ba224c87890128548:

  • This looks fine to me.

77dd3fd5017b9331b1a265419145b62fd1a5298c:

  • This looks okay but perhaps in the interest of paranoia we should also spot-check the fallback?

b0c9c55d6dece98030aed99f076a7223bb0f5e64:

  • This looks fine to me.

c4ae4850efc46c04e7409e119aff16e327227a38:

  • It strikes me as grossly counterintuitive that it is possible for OPTIONAL_INLINE to expand to a thing containing attribute((always_inline)).
  • Otherwise, this seems fine.

f4e9ecae2b0239f2c16c18808114842092b5007d:

  • I repeat my remarks about ee5afe07090f1ae163e7d8bac5e62b233106f319 for this.

298cfa90555feb01aebac71419f67aa8467d9bcc:

  • This looks good to me.

79337975c2f9b0ea045b48666c523294624c2558:

  • This looks good to me.

4b463e20f0b1939764a09eaa74b380ec282daf81:

  • This looks good to me.

4147cb2d4dc33c51c91fd2d6275af0b4773b738d:

  • This looks good to me.

812674f4971e8f843322e81e20f2638f525e875e:

  • This looks good to me.

88b0f91a0107b0988b52d09d8ec7c9c8180ed669:

  • This looks good to me.

86a239f7f392fc4d6f52e3e671a9f2a8808359e5:

  • This looks good to me.

0284e22e7b37a0ce8fc0fc1f27fd2af768768af3:

  • This looks good to me.

comment:5 in reply to:  4 Changed 5 years ago by nickm

Replying to andrea:

Code review follows for the ticket8897_9663_v2 branch:

ee5afe07090f1ae163e7d8bac5e62b233106f319:

  • Good lord that's a lot of hairy code; since it's all external I'll presume it okay for now.

Actually I've gotten cold feet; I'm going to make it off-by-default for now.

77dd3fd5017b9331b1a265419145b62fd1a5298c:

  • This looks okay but perhaps in the interest of paranoia we should also spot-check the fallback?

Okay, will do.

f4e9ecae2b0239f2c16c18808114842092b5007d:

  • I repeat my remarks about ee5afe07090f1ae163e7d8bac5e62b233106f319 for this.

This one I'm going to leave on-by-default, since none of its potential inputs are hostile.

comment:6 Changed 5 years ago by nickm

Milestone: Tor: 0.2.5.x-finalTor: 0.2.6.x-final

Too big for 0.2.5.x at this point IMO. Andrea concurs.

comment:7 Changed 5 years ago by nickm

Keywords: 026-triaged added
Status: needs_reviewneeds_revision

I should take another pass over this before consideration for 0.2.6

comment:8 Changed 5 years ago by nickm

Keywords: 026-triaged-0 added; 026-triaged removed

comment:9 Changed 5 years ago by nickm

This actually might not be worthwhile: see my comment on #8897. I can probably scavange a lot of this branch for more general ed25519 stuff, though.

comment:10 Changed 5 years ago by nickm

Milestone: Tor: 0.2.6.x-finalTor: 0.2.???

Yeah, trying this for 0.2.???. We can pull some back into 0.2.6 if the ed25519_ref10 stuff turns out to be too slow for us.

comment:11 Changed 4 years ago by nickm

Milestone: Tor: 0.2.???Tor: 0.2.7.x-final

These might also be worth looking at in 0.2.7

comment:12 Changed 4 years ago by nickm

Keywords: 027-triaged-1-in added

Marking some tickets as triaged-in for 0.2.7 based on early triage

comment:13 Changed 4 years ago by yawning

Cc: yawning added

comment:14 Changed 4 years ago by yawning

I said I'll look at this, and I've been casually poking at it. I don't have a branch quite ready for merge yet since I need to make things fall back to curve25519-donna and ref10.

Numbers (NB: i5-4250U, Turboboost etc):

  • onion_ntor: Client-side, part 1: 77.590756 usec. -> Client-side, part 1: 24.009142 usec.
  • ed25519:
    • Generate public key: 60.96 usec -> Generate public key: 22.22 usec
    • Sign a short message: 62.16 usec -> Sign a short message: 23.84 usec
    • Verify signature: 161.53 usec -> Verify signature: 77.68 usec
    • Blind a public key: 139.10 usec -> Blind a public key: 70.78 usec
    • Convert public point from curve25519: 9.81 usec -> Convert public point from curve25519: 5.30 usec

I didn't set -DED25519_SSE2, so this is the portable 64 bit code. For reference the current ntor code takes ~240 usec in it's entirety, so this is still worth doing especially since it gains us a good Ed25519 perf. boost and support for batch verification.

TODO:

  • Cross verification with the slower implementation at startup time, and the corresponding fallback code (Benchmark numbers obtained with something that only calls the new code, tests pass...).
  • Wire in the Ed25519-donna batch verification code (nickm wrote something that claims to do this, but he integrated Ed25519-donna slightly differently from how I did, so it needs minor adjustments). I'll probably leave this up to him if/when my WIP branch gets merged.
  • Curve25519->Ed25519 conversion (Not worth doing unless we anticipate removing ref10 entirely). (Edit: I lied. It took all of 5 mins to write, so this is now done.)
  • Unbreak the Visual Studio build (I'm not touching this crap, someone who cares about windows can).
  • The inevitable "teor files a bunch of bugs because of clang != gcc" fixes.
Last edited 4 years ago by yawning (previous) (diff)

comment:15 Changed 4 years ago by yawning

I figure I'll checkpoint my branch to a repo: https://github.com/Yawning/tor/compare/feature9663

The changes required to src/common/crypto_curve25519.c and src/common/crypto_ed25519.c to actually use the new code are not committed yet (though I pulled in ed25519-donna such that the interface exposed to tor matches ref0, so a simple search/replace is all that is needed to verify basic functionality).

I will scavenge the cross checking from nickm's old feature branch for the rest of this, and seek review. I'm not sure what we want to do with SSE2 support. Enabling it on x86_64 is worse for performance, and we would need CPUID query capability to enable it on our i386 builds (which is easy, but if the portable code is faster than ref0 I am unsure if it is worth the effort).

comment:16 Changed 4 years ago by nickm

Hm. I thought that basically everything had sse2 these days. Is there a way to tell if we're targeting a 32-bit architecture with sse2, and if so, enable the sse2 code?

comment:17 Changed 4 years ago by nickm

eg (untested)

#if defined(__SSE2__) && !(defined(__x86_64__) || defined(__amd64__)) 
....
#endif

?

comment:18 in reply to:  17 ; Changed 4 years ago by yawning

Replying to nickm:

eg (untested)

#if defined(__SSE2__) && !(defined(__x86_64__) || defined(__amd64__)) 
....
#endif

?

This should work (might need x86intrin.h?). I'd like to see a comparison between ref10, 32 bit donna, and SSE2 donna before we go down this path, and for the purposes of the feature branch as is, I will ignore the issue for now since doing the integration work in a safe manner is more important.

Will we ever consider removing the ref10 code/fallback stuff? I don't think it particularly makes a difference to how I'm going to do this, but I think the portable versions of the donna code is well tested at this point.

comment:19 in reply to:  18 ; Changed 4 years ago by nickm

Replying to yawning:

Replying to nickm:

eg (untested)

#if defined(__SSE2__) && !(defined(__x86_64__) || defined(__amd64__)) 
....
#endif

?

This should work (might need x86intrin.h?).

(Probably won't need that; those are supposed to be predefined by the compiler.)

I'd like to see a comparison between ref10, 32 bit donna, and SSE2 donna before we go down this path, and for the purposes of the feature branch as is, I will ignore the issue for now since doing the integration work in a safe manner is more important.

Makes sense.

Will we ever consider removing the ref10 code/fallback stuff? I don't think it particularly makes a difference to how I'm going to do this, but I think the portable versions of the donna code is well tested at this point.

We can consider removing it, sure. I think I'd like to leave it in for a while as a fallback though.

comment:20 in reply to:  19 Changed 4 years ago by yawning

Replying to nickm:

Makes sense.

I don't think the ed25519 stuff is particularly critical path (yet), so the numbers that really matter are the curved25519 figures. I guess we can call for someone to test the various options on those sorts of targets when we integrate the code.

Will we ever consider removing the ref10 code/fallback stuff? I don't think it particularly makes a difference to how I'm going to do this, but I think the portable versions of the donna code is well tested at this point.

We can consider removing it, sure. I think I'd like to leave it in for a while as a fallback though.

Makes sense. Ironically, by pulling in all of ed25519-donna, our codebase grew a 2nd copy of ref10, since it's included as part of the "fuzz vs the ref implementation" stuff that's in the source tree.

I'm thinking something along the lines of:

  • 0.2.7.x-alpha: Cross check on startup, crosscheck every N operations.
  • 0.2.7.x-stable: Cross check on startup.
  • 0.2.8.x-alpha: Yank the ref10 code, assume that people run "make check" and that our unit tests catch pathological compiler failures.

With progression along each phase being dependent on nothing totally horrific cropping up.

comment:21 in reply to:  15 Changed 4 years ago by yawning

Ok, my branch from 15 (https://github.com/Yawning/tor/compare/feature9663) now fulfills this ticket, with fallback to donna (or NaCl if people actually build with that).

The self-test/fallback logic was scavenged from nickm's ticket8897_9663_v2 branch, though I made the spot check validate against known values from the NaCl paper as well.

I assume we actually want to use the rest of ed25519-donna since it massively outperforms ref10, so I'm not going to call my branch "ready". I have opened #16467 for the Ed25519 related components of this ticket.

comment:22 Changed 4 years ago by nickm

Resolution: implemented
Status: needs_revisionclosed

Merged with #16467. Wooooot!

Note: See TracTickets for help on using tickets.