Opened 8 years ago

Closed 8 years ago

Last modified 7 years ago

#4443 closed enhancement (duplicate)

Implement counter mode faster on x86 and x86_64

Reported by: nickm Owned by:
Priority: Medium Milestone: Tor: 0.2.3.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-relay
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Our current counter-mode implementation assumes that 32-bit math is fastest, that any endianness is possible, and that unaligned access is forbidden or slow.

This all slows it down. We can do much better than we do now on little-endian systems, and much better if we happen to me somewhere where unaligned integer access is just as fast as aligned integer access (this seems to be the case on x86 and x86_64.

The tricks are to change our current counter-increment logic to something like

   for (p=15;p>=0;p--) {
     if (counter_byte[p]++ == 0)
        break;
   }

and to change the main XOR loop logic to this on systems where unaligned access works and works quickly:

  {current loop until c==0}

  while (len>16) {
     XOR 16 bytes at once via 32-bit or 64-bit XORs.
     increment counter
     crypt another block
     len -= 16
  }

  {current loop until len == 0}

On my laptop with a relatively unoptimized openssl, preliminary testing suggests that this should shave over 5% off AES runtime for encrypting a bunch of 509-byte cell payloads. It should be even more impressive on aesni hardware.

[We can't just use OpenSSL's relatively-good-now counter mode, since I don't think it supports EVP.]

Child Tickets

Change History (5)

comment:1 Changed 8 years ago by tmpname0901

unaligned integer access is just as fast as aligned integer access (this seems to be the case on x86 and x86_64

This is definitely not true. There is a performance penalty across the x86(_64) line for memory accesses which are not aligned to data type.

From a performance perspective, unaligned memory access are Bad; accesses straddling cache lines are Really Bad; accesses straddling pages are Horrifically Bad. I think you get the idea :-)

comment:2 in reply to:  1 Changed 8 years ago by nickm

Replying to tmpname0901:

From a performance perspective, unaligned memory access are Bad; accesses straddling cache lines are Really Bad; accesses straddling pages are Horrifically Bad. I think you get the idea :-)

FWIW, doing byte-at-a-time xors ain't so hot either. I'll need to benchmark here: If you're right, then the openssl crypto/modes/ctr128.c implementation is incredibly stupid. (Look at its logic when the STRICT_ALIGNMENT macro is undefined.)

(Note to self: never ever ever design another protocol where things get encrypted in chunks of bytes not a multiple of 8.)

comment:3 Changed 8 years ago by nickm

Resolution: duplicate
Status: newclosed

Actually, this is so hard after all: openssl 1.0.0 can be made to support both AES_ and EVP_ backends with its splufty counter-mode implementation, for a 7% speedup over our previous thing. Closing this as a duplicate of #4526 , since that one actually has code pending review.

comment:4 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:5 Changed 7 years ago by nickm

Component: Tor RelayTor
Note: See TracTickets for help on using tickets.