[PATCH] D130618: [AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1

Wed May 3 08:51:44 PDT 2023

david-arm marked an inline comment as done.
david-arm added a comment.

In D130618#4315481 <https://reviews.llvm.org/D130618#4315481>, @dmgreen wrote:

> Thanks. This still makes me a bit nervous, considering what we know about predication and the performance results I've seen.
>
> Can you explain more about why the limit is 10 instructions? As far as I could see the limit on interleaving in the vectorizer is a cost of 20, and with many instructions like geps, phis and branches being free that will be quite a bit more than 10 instructions. We could have the limit lower than the default for interleaving if that makes more sense, but 10 seems quite low.

I guess we have to choose a limit somewhere, and whatever number we pick there will always be an example that shows it's a bad choice. The goal here is not to be the fastest in every single case, which isn't really possible given the holes we sometimes find in our cost model. We want it to be good overall for the majority of cases. Neither this patch nor the number chosen here is fixed: these are things that we can evolve over time based on real evidence.

I specifically chose 10 as a starting point because of the example you gave me:

  void test(float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
      x[i] = -y[i];
  }

which has 9 LLVM IR instructions. Ignoring tail-folding completely, if you build this with the current HEAD of LLVM you'll notice that interleaving is slightly faster than not interleaving. However, when you change this to:

  void test(float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
      x[i] -= y[i];
  }

suddenly the non-interleaved version (-mllvm -force-vector-interleave=1) becomes 18% faster than the interleaved version on neoverse-v1. There is only one extra instruction in the loop, which brings the count to 10. This means that today we already have an upstream performance bug caused by a poor interleaving choice. More by chance than anything else, this loop becomes faster with this patch applied.
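
For anyone who wants to compare the two loops side by side, here is a minimal sketch that puts both in one translation unit. The file name (loops.c), the function names test_neg and test_sub, and the check_loops helper are made up for illustration and are not part of the patch. The helper only confirms the loops compute what we expect; it does not time anything.

```c
#include <assert.h>

/* Negation loop from the first example: load y[i], negate, store x[i]. */
void test_neg(float *x, float *y, int n) {
  for (int i = 0; i < n; i++)
    x[i] = -y[i];
}

/* Second example: x[i] is also loaded, so the loop body has one more
   instruction than test_neg. */
void test_sub(float *x, float *y, int n) {
  for (int i = 0; i < n; i++)
    x[i] -= y[i];
}

/* Hypothetical sanity check (not a benchmark): returns 1 if both loops
   produce the expected values on a tiny input. */
int check_loops(void) {
  float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  float y[4] = {1.0f, 1.0f, 1.0f, 1.0f};

  test_sub(x, y, 4); /* x is now {0, 1, 2, 3} */
  assert(x[0] == 0.0f && x[3] == 3.0f);

  test_neg(x, y, 4); /* x is now {-1, -1, -1, -1} */
  assert(x[0] == -1.0f && x[3] == -1.0f);
  return 1;
}
```

Building this once with something like "clang -O3 -mcpu=neoverse-v1 -S loops.c" and once more with "-mllvm -force-vector-interleave=1" added should let you diff the generated loops. The timing harness is left out on purpose, since the numbers above depend on the array sizes and the machine used.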

Using your suggestion of 20 proved detrimental for x264 on neoverse-v1, causing a ~5% performance regression.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D130618/new/

https://reviews.llvm.org/D130618