[PATCH] D130618: [AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 25 02:07:17 PDT 2023


dmgreen added a comment.

OK. My understanding was that the Neoverse-V1 was one of the worst CPUs for tail folding, due to the limited throughput of the SVE while instructions (e.g. whilelo). Doesn't tail folding prevent interleaving too? Newer generations should do better than the V1.

This is one of the "simplest" loops I can think of. My understanding is that it will run at around half the speed with this patch: https://godbolt.org/z/K4e3PMdar

Are you sure this shouldn't be based on whether the loop is small enough to want interleaving? That is, the difference between Simple and Reductions was never really about reductions; those were just the loops that hit the problem the hardest. (There may be some other issues with reductions codegen, but the limited interleaving/throughput still remains.) The real problem is that we need to make sure we don't limit interleaving for small loops, and that the while instructions don't become a throughput bottleneck.
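
To make the trade-off concrete, here is a rough scalar model of the two strategies. It is a sketch only, not the code from the godbolt link and not what the vectorizer actually emits. The names (VL, Stats, add_tail_folded, add_interleaved) and the fixed vector length of 4 are made up for illustration; real SVE hardware picks the vector length at run time. The tail-folded version needs one predicate per vector iteration (standing in for a while-style instruction), while the unpredicated version does several vectors of work per loop iteration and leaves the remainder to a scalar epilogue.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vector length in elements (an assumption for this model).
constexpr std::size_t VL = 4;

// Counters for comparing the two loop shapes.
struct Stats {
    std::size_t vector_iters = 0;    // iterations of the vector loop
    std::size_t predicate_gens = 0;  // models while-style predicate generation
    std::size_t scalar_iters = 0;    // iterations of the scalar epilogue
};

// Tail-folded shape: every vector iteration is predicated, so the
// remainder is handled in the vector loop and no epilogue is needed.
Stats add_tail_folded(const int* a, const int* b, int* out, std::size_t n) {
    Stats s;
    for (std::size_t i = 0; i < n; i += VL) {
        ++s.predicate_gens;  // one fresh predicate per vector of work
        ++s.vector_iters;
        for (std::size_t lane = 0; lane < VL; ++lane) {
            const bool active = i + lane < n;  // lane mask
            if (active) out[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    return s;
}

// Unpredicated shape, interleaved by `ic`: each loop iteration handles
// ic vectors of work; leftover elements go to a scalar epilogue.
Stats add_interleaved(const int* a, const int* b, int* out, std::size_t n,
                      std::size_t ic) {
    Stats s;
    const std::size_t step = VL * ic;
    std::size_t i = 0;
    for (; i + step <= n; i += step) {
        ++s.vector_iters;
        for (std::size_t k = 0; k < step; ++k) out[i + k] = a[i + k] + b[i + k];
    }
    for (; i < n; ++i) {
        ++s.scalar_iters;
        out[i] = a[i] + b[i];
    }
    return s;
}
```

In this model, with n = 18 and ic = 2, the tail-folded loop runs 5 vector iterations and generates 5 predicates, whereas the interleaved loop runs 2 vector iterations plus 2 scalar ones. The counts say nothing about real cycle costs; they only show why a small loop body can end up bound by predicate generation when it cannot be interleaved.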


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D130618/new/

https://reviews.llvm.org/D130618


