Opened 4 years ago

Closed 4 years ago

#16467 closed enhancement (implemented)

Faster Ed25519 implementation.

Reported by: yawning Owned by:
Priority: Medium Milestone: Tor: 0.2.7.x-final
Component: Core Tor/Tor Version: Tor: 0.2.7
Severity: Keywords: performance, tor-relay
Cc: Actual Points:
Parent ID: #9663 Points:
Reviewer: Sponsor:

Description

The ref10 Ed25519 code currently being used is rather slow, and does not support batch verification. Andrew Moon's Ed25519-donna (https://github.com/floodyberry/ed25519-donna) is both faster and provides batch verification, in addition to the optimized Curve25519 scalar basepoint multiply that #9663 depends on.

We should use Ed25519-donna instead of ref10 for our Ed25519 implementation, at first with cross-checking/fallback, and eventually as our sole Ed25519 provider.

Child Tickets

Change History (20)

comment:1 Changed 4 years ago by yawning

Making #9663 the parent for now, since there's a cross-dependency there, and I've been doing the integration as: https://github.com/Yawning/tor/compare/feature9663

For those that are too lazy to look at #9663 to pull out things relevant here:

The "on my desktop with TurboBoost etc" x86_64 performance gains look like:

  • Generate public key: 60.96 usec -> 22.22 usec
  • Sign a short message: 62.16 usec -> 23.84 usec
  • Verify signature: 161.53 usec -> 77.68 usec
  • Blind a public key: 139.10 usec -> 70.78 usec
  • Convert public point from curve25519: 9.81 usec -> 5.30 usec

TODO/Open questions:

  • Write the self-test/fallback code.
  • Actually use the batch verification code.
  • Is it worth potential pain to enable SSE2 on ia32 targets? (SSE2 performance appears to be worse with x86_64). Benchmarks needed, it is probably an ok assumption that the "portable" 32 bit code outperforms ref10 though.
  • Someone who cares maybe should unbreak the Visual Studio build (Apparently the VS version of -donna doesn't have SSE2 assembly, but it should build).

comment:2 Changed 4 years ago by yawning

Status: newneeds_review

Ok, first pass at: https://github.com/Yawning/tor/compare/feature9663

It may be sort of annoying to review since the branch also does #9663, however only b61fa4c59c4d3c9226f4514521a5193b77a911dd really is the Curve25519 ntor optimization, the rest falling under this ticket (pulling in donna, adding support for our constructs, adding test cases etc).

I'm not sure if the struct full of function pointers approach I took for runtime Ed25519 implementation selection was the best way to do things, but this is both easy to rip back out, and easy to add another Ed25519 implementation if someone happens to write ed25519-super-turbo-2 or whatever.

For Ed25519, the unit tests fuzz all the things, the runtime sanity check calls the donna internal test code, and validates that public key derivation/signing/checking signatures appears to be working (Maybe this should fuzz as well, but I don't think it will really catch anything).

TODO: Batch verification, leaving it up to nickm (I left a comment in the #if 0ed out code where there's a bug...)

comment:3 Changed 4 years ago by teor

Code review using clang (Xcode 6.4 (6E35b) Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)) on OS X 10.10.4.

Build & static analysis:

clang would prefer iters to be a #define

test_crypto_ed25519_fuzz_donna(void *arg)
{
  const int iters = 1024;
  uint8_t msg[iters];

tor/src/test/test_crypto.c:1643:15: Variable length array folded to constant array as an extension

Other than that, there are no build or static analyzer complaints from clang.

comment:4 Changed 4 years ago by teor

(I spoke too soon, it turns out I hadn't updated my Xcode project to include the new source files.)

OS X defines an ALIGN macro to calculate pointer alignment based on a passed pointer value. It's not the same as the compiler alignment attribute. Can we rename the macro across the ed25519/donna codebase?
tor/src/ext/ed25519/donna/ed25519-donna-portable.h:23:10: 'ALIGN' macro redefined

expand256_modm doesn't appear to initialise the bytes of work above len, or, if len is less than 32, the bytes of out above 32.

The clang static analyzer to complain about garbage values being passed to the + operator in the call stack:

  • ed25519_donna_keygen
  • ed25519_donna_pubkey
  • ge25519_scalarmult_base_niels
  • curve25519_sub_reduce
  • tor/src/ext/ed25519/donna/ed25519.c:202:3: The left operand of '+' is a garbage value (within a call to 'ge25519_scalarmult_base_niels')
    • which is actually line 85 in curve25519-donna-64bit.h:
      out[0] = a[0] + fourP0    - b[0]    ; c = (out[0] >> 51); out[0] &= reduce_mask_51;
      

I've tried zeroing out most of the variables involved, but I might have missed some.
I can't to work out how to fix this analysis issue. I wonder if the assembly is confusing clang, but it's worked fine with other assembly in the past.

Should we be using the di_ops functions for memset, memcpy and similar?

comment:5 Changed 4 years ago by teor

clang's static analyzer also complains about the same issue when ed25519_donna_sign calls ge25519_scalarmult_base_niels.

comment:6 Changed 4 years ago by teor

There's also a typo in README.tor:
"next-gen hidden srevice descriptors"

comment:7 in reply to:  3 Changed 4 years ago by yawning

Replying to teor:

clang would prefer iters to be a #define

test_crypto_ed25519_fuzz_donna(void *arg)
{
  const int iters = 1024;
  uint8_t msg[iters];

tor/src/test/test_crypto.c:1643:15: Variable length array folded to constant array as an extension

Fixed. (116a56a234631978c46a1bf9ac1649eef8ae6bf4)

OS X defines an ALIGN macro to calculate pointer alignment based on a passed pointer value. It's not the same as the compiler alignment attribute. Can we rename the macro across the ed25519/donna codebase?
tor/src/ext/ed25519/donna/ed25519-donna-portable.h:23:10: 'ALIGN' macro redefined

I'd rather not (OSX defining ALIGN to me is "shitting up the global namespace), since it makes it even harder to fold in changes to upstream. There's a pull request open that undef's ALIGN (https://github.com/floodyberry/ed25519-donna/pull/21), which keeps the codebases closer.

Fixed. (d83a65041b648a30835be29a9a54cfe34ab5608b)

expand256_modm doesn't appear to initialise the bytes of work above len, or, if len is less than 32, the bytes of out above 32.

Huh? unsigned char work[64] = {0}; The entirety of work is initialized to 0. This is language standard behavior. All of out gets set as well (typedef bignum256modm_element_t bignum256modm[5];).

Should we be using the di_ops functions for memset, memcpy and similar?

None of the memsets are in danger of being optimized away (I use memwipe where this is a possibility).

The one place where a data dependent memcmp is used outside of tests is in the batch verify code, in a routine explicitly tagged as vartime I will assume the implementer knew what he was doing.

There's also a typo in README.tor: "next-gen hidden srevice descriptors"

Fixed. (d83a65041b648a30835be29a9a54cfe34ab5608b)

The clang static analyzer to complain about garbage values being passed to the + operator in the call stack:

Are these actual issues or just false positives? If they're false positives, I'm inclined not to change anything (squashing things just to appease tools isn't behavior I particularly like).

comment:8 Changed 4 years ago by teor

Apologies for misreading the code on expand256_modm. I was trying to work out why the static analyzer was complaining.

I can't tell whether they are actual issues, or false positives, just from reading the code.

comment:9 Changed 4 years ago by teor

I built using this incredibly convoluted set of compiler command-line flags with clang-3.7:
clang -g -fno-omit-frame-pointer -fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -DPARANOIA -fstack-protector-all -fsanitize=undefined -fno-sanitize-recover=all -fsanitize-blacklist=contrib/clang/sanitize_blacklist.txt -fno-omit-frame-pointer -fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -fsanitize=address -fsanitize-blacklist=contrib/clang/sanitize_blacklist.txt -fno-omit-frame-pointer -fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -O0 -arch x86_64

And I ran out of registers:

In file included from src/ext/ed25519/donna/ed25519.c:18:
In file included from src/ext/ed25519/donna/ed25519-donna.h:101:
src/ext/ed25519/donna/ed25519-donna-64bit-x86.h:14:3: error: inline assembly requires more registers than available
  "movq %0, %%rax                  ;\n"
  ^
src/ext/ed25519/donna/ed25519-donna-64bit-x86.h:14:3: error: inline assembly requires more registers than available
2 errors generated.
make[1]: *** [src/ext/ed25519/donna/src_ext_ed25519_donna_libed25519_donna_a-ed25519.o] Error 1

But a minimal configure with no extra clang options works fine:
./configure --with-openssl-dir=/opt/local --with-libevent-dir=/opt/local

This appears to be an issue with the options I'm using taking up registers. I bet it's either the sanitizers, or the debugging / optimization-disabling options.

I'll try to work out which. Any pointers in the right direction would be appreciated.

comment:10 Changed 4 years ago by teor

This set of flags also works fine:
-O2 -funroll-loops -ffp-contract=on -fno-omit-frame-pointer -mrelax-all -fstrict-aliasing -Wstrict-aliasing -arch x86_64

Leaving me with the following (deduplicated) potential problematic options:
-fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -fstack-protector-all -fsanitize=undefined -fno-sanitize-recover=all -fsanitize=address -O0

I'll do some builds with each option and see.

If the sanitizers are the issue, I'll add the problematic code to:
-fsanitize-blacklist=contrib/clang/sanitize_blacklist.txt

comment:11 Changed 4 years ago by teor

With the first set of flags in comment 10, I see the following performance gains on OS X 10.10.4, clang 3.7, x86_64, 2 GHz Intel Core i7:

===== ed25519 =====
Ed25519-donna = disabled.
Generate public key: 66.08 usec
Sign a short message: 85.41 usec
Verify signature: 251.77 usec
Convert public point from curve25519: 13.07 usec
Blind a public key: 230.85 usec
Ed25519-donna = enabled.
Generate public key: 25.10 usec
Sign a short message: 26.55 usec
Verify signature: 108.35 usec
Convert public point from curve25519: 5.65 usec
Blind a public key: 120.96 usec

comment:12 Changed 4 years ago by teor

-fsanitize=address causes clang to run out of registers when compiling ge25519_scalarmult_base_choose_niels in src/ext/ed25519/donna/ed25519-donna-64bit-x86.h.

Adding the function to the sanitizer blacklist resolves this issue, as long as people use it.

I can make a branch for it if you'd like.
It would add:

# The optimised x86_64 assembly version of ge25519_scalarmult_base_choose_niels
# runs out of registers when compiled with ASan                                 
fun:ge25519_scalarmult_base_choose_niels

to the end of contrib/clang/sanitize_blacklist.txt

comment:13 Changed 4 years ago by yawning

If that all that's needed I can add that and save you the trouble (Especially since I'll probably squash/rebase my branch anyway). Sorry for the extra work, and thanks.

I'm somewhat curious to see if the ntor benchmark also is faster with my additions, but since the donna code appears to be getting built correctly by clang, I'm 99% sure the table based precomputation stuff is faster (as intended).

comment:14 Changed 4 years ago by teor

No worries, better to catch incompatibilities at this stage.

By the way, .gitignore probably needs an entry for
src/ext/ed25519/donna/libed25519_donna.a
It seems to already have wildcard entries for other object files produced during the process.

comment:15 Changed 4 years ago by yawning

Take 2: https://github.com/Yawning/tor/compare/feature16467_9663

Major changes are, that I squash/rebased things, and restructured how to handle the tor specific code in donna slightly (ed25591_donna_tor.h, ed25519_tor.c instead of chainsawing up the latter).

This should have all the fixes for the issues found by teor except for the clang static analysis garbage value thing which I assume is a false positive. I opted to solve the asan issue by disabling inline assembly entirely in donna for asan enabled builds. Performance is going to be horrific anyway, so might as well sanitize something.

I'll wait for nickm to get back from vacation at this point, unless the various random people find more issues (unlikely?). I'm inclined to say "use my branch or something that approximates it for the entirely of 0.2.7's lifespan, then yank ref10 in the 0.2.8-alpha timeframe", but that would be more of a nickm call than anything else.

comment:16 Changed 4 years ago by teor

I believe that clang static analyzer complains about garbage values because it doesn't know the length of the char * buffers passed in to ed25519_donna_keygen and ed25519_donna_sign.

As far as I can tell, clang assumes that char *'s are NUL-terminated strings, and expects to see a strlen or similar call. So it gets confused when the byte buffers are walked in 4-byte chunks, up to 64 bytes. (And I'm not sure it's a simple matter to convince clang how that the buffers are 64 bytes, nor do I feel it's a particularly productive activity.)

Yawning, would you mind if we added a comment to those two functions about the spurious errors?
It would save the effort of someone tracking them down again.
But I really don't mind either way.

comment:17 in reply to:  16 Changed 4 years ago by yawning

Replying to teor:

Yawning, would you mind if we added a comment to those two functions about the spurious errors?
It would save the effort of someone tracking them down again.
But I really don't mind either way.

I don't mind at this point since I split off the tor version of the ed25519 code into it's own file when I did the squash/rebase, mostly for my sanity (In general I want to try to minimize changes to the code that's still common with upstream, but this stuff is mostly tor specific). There's a giant comment describing what our external interface does, is that where something like this should be, or should the comment(s) live inside each function?

comment:18 Changed 4 years ago by teor

They'd be most helpful in each function, where the errors are logged by clang sanitizer.

comment:19 in reply to:  18 Changed 4 years ago by yawning

Replying to teor:

They'd be most helpful in each function, where the errors are logged by clang sanitizer.

I'm now confused as to which functions exactly produce the errors, possibly because I moved code around etc. Additionally, I'm not exactly sure if I buy the explanation that Clang wants a strlen, since assuming unsigned char * is a NUL-terminated string seems like a rather poor assumption to be making in the first place.

I'm going to leave things as is, we can always add comments later as needed.

comment:20 Changed 4 years ago by nickm

Resolution: implemented
Status: needs_reviewclosed

I'm merging, but I'll be glad to take more comments in it. Can somebody open another ticket for batch verification?

Note: See TracTickets for help on using tickets.