[PATCH] D56474: [ARM] [NEON] Add ROTR/ROTL lowering
easyaspi314 (Devin) via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Jan 9 18:48:40 PST 2019
easyaspi314 added a comment.
Huh. Apparently, `vshr`/`vsli` is actually faster.
I used the benchmark tool in my NEON-optimized xxHash variant <https://github.com/easyaspi314/xxhash> to test this, as it is a real-life use of the rotate pattern.
Compiled on my LG G3 (quad-core Snapdragon 801/Cortex-A15, underclocked to 1.8 GHz) with Clang 7.0.1 from the Termux repos, and benchmarked in the Termux app while tapping the screen to keep the CPU frequency stable.
`clang -march=native -O2`
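Both variants come from the same file; the `VSLI` macro in the loop below selects which rotate sequence is emitted, so the comparison is just a rebuild with and without the define (the filename here is a placeholder):

clang -march=native -O2 -DVSLI xxhash_neon.c   # vshr + vsli rotate
clang -march=native -O2 xxhash_neon.c          # vshl + vshr + vorr rotate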
The main loop basically looks like this:
uint32x4_t v;
const uint32x4_t prime1, prime2;   /* constant multipliers (loaded from literals) */
const uint8_t *p, *limit;          /* unaligned data pointers */
...
do {
    /* note: vld1q_u8 is used to work around an alignment bug */
    const uint32x4_t inp = vreinterpretq_u32_u8(vld1q_u8(p));
    v = vmlaq_u32(v, prime2, inp);                            /* v += inp * prime2 */
#ifdef VSLI
    v = vsliq_n_u32(vshrq_n_u32(v, 19), v, 13);               /* rotl 13: vshr + vsli */
#else
    v = vorrq_u32(vshrq_n_u32(v, 19), vshlq_n_u32(v, 13));    /* rotl 13: vshr + vshl + vorr */
#endif
    v = vmulq_u32(v, prime1);
    p += 16;
} while (p < limit);
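Both forms compute the same thing: each 32-bit lane is rotated left by 13. As a scalar equivalent (a hypothetical helper, just to show the pattern being lowered, not code from the patch):

static inline uint32_t rotl32(uint32_t x, unsigned r)
{
    /* standard rotate-left idiom; r is assumed to be in 1..31 here */
    return (x << r) | (x >> (32 - r));
}

so each lane of the loop body is effectively v = rotl32(v + inp * prime2, 13) * prime1.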
The benchmark gets 4.1 GB/s with `-DVSLI`, but only 3.7 GB/s with `vshl`/`vorr`. Similarly, the variant that processes two vectors in parallel (sketched below) gets 5.7 GB/s with `-DVSLI`, but only 5.3 GB/s without.
Since all the other variables are the same, my guess is that writeback latency is to blame.
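For context, the two-vector variant mentioned above is just the same loop with two independent accumulators so the two multiply/rotate chains can overlap. Roughly (a sketch, not the exact xxHash code; `v1`/`v2` are placeholders):

uint32x4_t v1, v2;
do {
    const uint32x4_t inp1 = vreinterpretq_u32_u8(vld1q_u8(p));
    const uint32x4_t inp2 = vreinterpretq_u32_u8(vld1q_u8(p + 16));
    v1 = vmlaq_u32(v1, prime2, inp1);
    v2 = vmlaq_u32(v2, prime2, inp2);
#ifdef VSLI
    v1 = vsliq_n_u32(vshrq_n_u32(v1, 19), v1, 13);
    v2 = vsliq_n_u32(vshrq_n_u32(v2, 19), v2, 13);
#else
    v1 = vorrq_u32(vshrq_n_u32(v1, 19), vshlq_n_u32(v1, 13));
    v2 = vorrq_u32(vshrq_n_u32(v2, 19), vshlq_n_u32(v2, 13));
#endif
    v1 = vmulq_u32(v1, prime1);
    v2 = vmulq_u32(v2, prime1);
    p += 32;
} while (p < limit);

Interleaving two accumulators hides some of the latency of the multiply/rotate chain, which is why both rotate sequences speed up, but the two-instruction `vshr`/`vsli` form still beats the three-instruction `vshr`/`vshl`/`vorr` form.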
Repository:
rL LLVM
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D56474/new/
https://reviews.llvm.org/D56474