[PATCH] D47239: [InstCombine] Combine XOR and AES instructions on ARM/ARM64

Michael Brase via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed May 23 14:11:22 PDT 2018


mbrase added a comment.

Hi, I can try to give a little more background on the motivation behind this patch, and why the key might be zero.

I’m trying to cross-compile some code that uses x86 AES intrinsics for AArch64 with LLVM. However, the x86 AES instructions have slightly different semantics from the ARM instructions. One important difference is that x86 performs an XOR at the end of its AESENC/AESENCLAST instructions, whereas ARM performs the XOR at the beginning of its AESE instruction.

To emulate the x86 AES intrinsics on AArch64, I defined some inline functions as a drop-in replacement, to avoid rewriting the algorithm by hand. Here is what they look like:

  #include <stdint.h>
  #include <arm_neon.h>
  
  typedef uint8x16_t __m128i;  // __m128i is an x86 type, map it to ARM neon type
  
  // https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_aesenc_si128&expand=221
  static inline __m128i _mm_aesenc_si128 (__m128i a, __m128i RoundKey)
  {
      return vaesmcq_u8(vaeseq_u8(a, (__m128i){})) ^ RoundKey;
  }
  
  // https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_aesenclast_si128&expand=222
  static inline __m128i _mm_aesenclast_si128 (__m128i a, __m128i RoundKey)
  {
      return vaeseq_u8(a, (__m128i){}) ^ RoundKey;
  }

To skip the XOR inside the ARM AESE instruction, I pass in a key value of zero and then manually XOR the real key at the end of the function. The only downside to this approach is that when the inline functions are substituted into the algorithm, an unnecessary EOR instruction is left behind wherever multiple rounds of AES encryption are done back to back. Consider the following test code:

  // Performs 3 rounds of AES encryption
  // Note: In proper 128-bit AES encryption, 10 rounds are used
  __m128i aes_block(__m128i data, __m128i k0, __m128i k1, __m128i k2, __m128i k3)
  {
      data = data ^ k0;
      data = _mm_aesenc_si128(data, k1);
      data = _mm_aesenc_si128(data, k2);
      data = _mm_aesenclast_si128(data, k3);
      return data;
  }

This compiles down to the following:

  0000000000000024 <aes_block>:
    24:   6e201c20        eor     v0.16b, v1.16b, v0.16b  ; EOR can be combined with AESE
    28:   6f00e401        movi    v1.2d, #0x0
    2c:   4e284820        aese    v0.16b, v1.16b
    30:   4e286800        aesmc   v0.16b, v0.16b
    34:   6e221c00        eor     v0.16b, v0.16b, v2.16b  ; EOR can be combined with AESE
    38:   4e284820        aese    v0.16b, v1.16b
    3c:   4e286800        aesmc   v0.16b, v0.16b
    40:   6e231c00        eor     v0.16b, v0.16b, v3.16b  ; EOR can be combined with AESE
    44:   4e284820        aese    v0.16b, v1.16b
    48:   6e241c00        eor     v0.16b, v0.16b, v4.16b
    4c:   d65f03c0        ret

My goal with this patch is to teach LLVM to eliminate these extra EOR instructions. With my patch applied, LLVM produces this:

  0000000000000024 <aes_block>:
    24:   4e284820        aese    v0.16b, v1.16b
    28:   4e286800        aesmc   v0.16b, v0.16b
    2c:   4e284840        aese    v0.16b, v2.16b
    30:   4e286800        aesmc   v0.16b, v0.16b
    34:   4e284860        aese    v0.16b, v3.16b
    38:   6e241c00        eor     v0.16b, v0.16b, v4.16b
    3c:   d65f03c0        ret




Repository:
  rL LLVM

https://reviews.llvm.org/D47239




