[PATCH] D56474: [ARM] [NEON] Add ROTR/ROTL lowering

easyaspi314 (Devin) via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jan 9 18:48:40 PST 2019


easyaspi314 added a comment.

Huh. Apparently, `vshr`/`vsli` actually is faster than the `vshl`/`vorr` sequence.

I used the benchmark tool in my NEON-optimized xxHash variant <https://github.com/easyaspi314/xxhash> to test this, as it is a real-life usage of the rotate pattern.

Compiled on my LG G3 (quad-core Snapdragon 801/Cortex-A15, underclocked to 1.8 GHz) with Clang 7.0.1 from the Termux repos, and benchmarked in the Termux app while tapping the screen to keep the CPU at a stable frequency.
`clang -march=native -O2`

The main loop basically looks like this:

      uint32x4_t v;                     /* running accumulator */
      const uint32x4_t prime1, prime2;  /* xxHash prime constants (literals) */
      const uint8_t *p, *limit;         /* unaligned input pointer and end of input */
      ...
      do {
          /* note: vld1q_u8 is to work around an alignment bug */
          const uint32x4_t inp = vreinterpretq_u32_u8(vld1q_u8(p));
          v = vmlaq_u32(v, prime2, inp);                           /* v += inp * prime2 */
  #ifdef VSLI
          v = vsliq_n_u32(vshrq_n_u32(v, 19), v, 13);              /* rotl(v, 13): vshr + vsli */
  #else
          v = vorrq_u32(vshrq_n_u32(v, 19), vshlq_n_u32(v, 13));   /* rotl(v, 13): vshr + vshl + vorr */
  #endif
          v = vmulq_u32(v, prime1);
          p += 16;
      } while (p < limit);

The benchmark gets 4.1 GB/s with `-DVSLI`, but only 3.7 GB/s with `vshl`/`vorr`. Similarly, the variant that processes two vectors in parallel gets 5.7 GB/s with `-DVSLI`, but only 5.3 GB/s without it.
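
That two-vector variant is essentially the same loop carried with two independent accumulators, roughly like this (a simplified sketch, not the exact code from the repo):

      uint32x4_t v1, v2;
      ...
      do {
          const uint32x4_t inp1 = vreinterpretq_u32_u8(vld1q_u8(p));
          const uint32x4_t inp2 = vreinterpretq_u32_u8(vld1q_u8(p + 16));
          v1 = vmlaq_u32(v1, prime2, inp1);
          v2 = vmlaq_u32(v2, prime2, inp2);
          v1 = vsliq_n_u32(vshrq_n_u32(v1, 19), v1, 13);   /* or the vshl/vorr form */
          v2 = vsliq_n_u32(vshrq_n_u32(v2, 19), v2, 13);
          v1 = vmulq_u32(v1, prime1);
          v2 = vmulq_u32(v2, prime1);
          p += 32;
      } while (p < limit);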

Since all the other variables are the same, my guess is that writeback latency is to blame.
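
For context, here are the two forms written as standalone helpers. My reading (not verified against the generated assembly) is that the `vorr` form takes three data-processing instructions, while the `vsli` form takes two, at the cost of `vsli` both reading and writing its destination register:

  #include <arm_neon.h>

  /* rotl(v, 13) via or: I expect vshr.u32 + vshl.i32 + vorr */
  static inline uint32x4_t rotl13_orr(uint32x4_t v)
  {
      return vorrq_u32(vshrq_n_u32(v, 19), vshlq_n_u32(v, 13));
  }

  /* rotl(v, 13) via shift-left-insert: I expect vshr.u32 + vsli.32 */
  static inline uint32x4_t rotl13_sli(uint32x4_t v)
  {
      return vsliq_n_u32(vshrq_n_u32(v, 19), v, 13);
  }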


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D56474/new/

https://reviews.llvm.org/D56474
