[PATCH] D130618: [AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1

Tue Apr 25 02:07:17 PDT 2023

dmgreen added a comment.

OK. My understanding was that the Neoverse-V1 was one of the worst CPUs for tail folding, due to the limited throughput of its while instructions. Doesn't tail folding prevent interleaving too? Newer generations should do better than the V1.

This is one of the "simplest" loops I can think of. My understanding is that it will run at around half the speed with this patch: https://godbolt.org/z/K4e3PMdar

Are you sure this shouldn't be based on whether the loop is small enough to benefit from interleaving? That is, the difference between Simple and Reductions was never really about reductions; those were just the loops that hit the problem the hardest. (There may be some other issues with reduction codegen, but the limited interleaving/throughput still remains.) The real problem is that we need to make sure we don't limit the interleaving for small loops, and that the while instructions don't become a bottleneck for throughput.
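
To make the concern concrete, here is a rough scalar C sketch of the two loop shapes being compared. It is not the godbolt example linked above and not actual vectorizer output; the vector length `VL` and all function names are hypothetical, and the inner lane loops only stand in for single vector operations. It assumes the tail-folded form is not interleaved and recomputes its predicate once per vector iteration, which is the work a while instruction would do.

```c
#include <stddef.h>

/* Hypothetical vector length in elements (an assumption for illustration). */
static const size_t VL = 4;

/* The scalar source loop: one load, one add, one store per element. */
void add_one_scalar(int *a, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] += 1;
}

/* Unpredicated vector body interleaved by 2, followed by a scalar epilogue.
 * Each inner lane loop models one vector operation; the two are independent. */
void add_one_interleaved(int *a, size_t n) {
    size_t i = 0;
    for (; i + 2 * VL <= n; i += 2 * VL) {
        for (size_t l = 0; l < VL; ++l)
            a[i + l] += 1;          /* vector part 0 */
        for (size_t l = 0; l < VL; ++l)
            a[i + VL + l] += 1;     /* vector part 1 */
    }
    for (; i < n; ++i)              /* scalar epilogue for the remainder */
        a[i] += 1;
}

/* Tail-folded form: no epilogue, but every iteration must first compute
 * a per-lane predicate before doing any useful work. */
void add_one_tail_folded(int *a, size_t n) {
    for (size_t i = 0; i < n; i += VL) {
        for (size_t l = 0; l < VL; ++l) {
            int active = (i + l) < n;   /* lane predicate */
            if (active)
                a[i + l] += 1;
        }
    }
}
```

All three functions compute the same result. The difference is only in the shape of the hot loop: for a body this small, the predicate computation is a large share of the work per iteration, so its throughput matters far more than it would in a larger loop.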

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D130618/new/

https://reviews.llvm.org/D130618