[llvm] [LoopVectorize] Don't discount instructions scalarized due to tail folding (PR #109289)

Thu Oct 17 05:55:20 PDT 2024

================
@@ -0,0 +1,281 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --filter "fold tail" --filter "estimated cost" --filter "costs" --filter "Selecting VF" --filter "loop costs" --version 5
+; RUN: opt -passes=loop-vectorize -debug-only=loop-vectorize -disable-output -S < %s 2>&1 | FileCheck %s
+
+; REQUIRE: asserts
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; These tests check that if the only way to vectorize is to tail fold a store by
+; masking then we properly account for the cost of creating a predicated block
+; for each vector element.
+
+define void @store_const_fixed_trip_count(ptr %dst) {
+; CHECK-LABEL: 'store_const_fixed_trip_count'
+; CHECK:  LV: can fold tail by masking.
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK:  LV: Found an estimated cost of 2 for VF 1 For instruction: store i8 1, ptr %gep, align 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i64 %iv, 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 1 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK:  LV: Scalar loop costs: 4.
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK:  LV: Found an estimated cost of 2 for VF 1 For instruction: store i8 1, ptr %gep, align 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i64 %iv, 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 1 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK:  LV: Scalar loop costs: 4.
+; CHECK:  LV: Found an estimated cost of 0 for VF 2 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK:  LV: Found an estimated cost of 0 for VF 2 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK:  LV: Found an estimated cost of 8 for VF 2 For instruction: store i8 1, ptr %gep, align 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 2 For instruction: %iv.next = add i64 %iv, 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 2 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK:  LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK:  LV: Vector loop of width 2 costs: 5.
+; CHECK:  LV: Found an estimated cost of 0 for VF 4 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK:  LV: Found an estimated cost of 0 for VF 4 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK:  LV: Found an estimated cost of 16 for VF 4 For instruction: store i8 1, ptr %gep, align 1
+; CHECK:  LV: Found an estimated cost of 2 for VF 4 For instruction: %iv.next = add i64 %iv, 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 4 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK:  LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK:  LV: Vector loop of width 4 costs: 4.
+; CHECK:  LV: Found an estimated cost of 0 for VF 8 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK:  LV: Found an estimated cost of 0 for VF 8 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK:  LV: Found an estimated cost of 32 for VF 8 For instruction: store i8 1, ptr %gep, align 1
----------------
john-brawn-arm wrote:

> The fix you have for computePredInstDiscount could be used as a workaround for now, but ultimately I think the best solution would be to avoid tail-folding altogether.

There are situations where tail-folding is faster than not vectorizing, e.g. the following example
```
__attribute__((noinline))
void fn(char *p, char a, char b, char c) {
  for (char i = 0; i < 7; ++i) {
    p[i] = i*a + (i>>1)*b + (i>>2)*c;
  }
}

int main() {
  char arr[256];
  fn(arr, 3, 5, 7);
  return arr[0];
}
```
On cortex-a53 compiling with -O3 -fno-loop-unroll, fn takes 65 cycles if we vectorize using tail folding and 74 cycles if we don't vectorize. I don't know how realistic this is as an example though, as without loop unrolling disabled it gets unrolled and that takes 43 cycles, but it does show that there are situations where just disabling tail folding outright leads to a worse result.

Also in general I think it's better to make decisions based on calculating the cost of things if we can, unless we can guarantee that a certain decision is definitely going to be worse, which we can't do here. I also don't think the change I'm making to computePredInstDiscount is a workaround, as what it's doing is making the cost we calculate more accurate to the actual cost of doing this, and more accurate costs should lead to better decisions.

https://github.com/llvm/llvm-project/pull/109289