[llvm] [LoopVectorize] Don't discount instructions scalarized due to tail folding (PR #109289)

Thu Oct 3 04:14:08 PDT 2024

================
@@ -0,0 +1,281 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --filter "fold tail" --filter "estimated cost" --filter "costs" --filter "Selecting VF" --filter "loop costs" --version 5
+; RUN: opt -passes=loop-vectorize -debug-only=loop-vectorize -disable-output -S < %s 2>&1 | FileCheck %s
+
+; REQUIRES: asserts
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; These tests check that if the only way to vectorize is to tail fold a store by
+; masking then we properly account for the cost of creating a predicated block
+; for each vector element.
+
+define void @store_const_fixed_trip_count(ptr %dst) {
+; CHECK-LABEL: 'store_const_fixed_trip_count'
+; CHECK:  LV: can fold tail by masking.
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK:  LV: Found an estimated cost of 2 for VF 1 For instruction: store i8 1, ptr %gep, align 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i64 %iv, 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 1 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK:  LV: Scalar loop costs: 4.
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK:  LV: Found an estimated cost of 2 for VF 1 For instruction: store i8 1, ptr %gep, align 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i64 %iv, 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 1 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK:  LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK:  LV: Scalar loop costs: 4.
+; CHECK:  LV: Found an estimated cost of 0 for VF 2 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK:  LV: Found an estimated cost of 0 for VF 2 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK:  LV: Found an estimated cost of 8 for VF 2 For instruction: store i8 1, ptr %gep, align 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 2 For instruction: %iv.next = add i64 %iv, 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 2 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK:  LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK:  LV: Vector loop of width 2 costs: 5.
+; CHECK:  LV: Found an estimated cost of 0 for VF 4 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK:  LV: Found an estimated cost of 0 for VF 4 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK:  LV: Found an estimated cost of 16 for VF 4 For instruction: store i8 1, ptr %gep, align 1
+; CHECK:  LV: Found an estimated cost of 2 for VF 4 For instruction: %iv.next = add i64 %iv, 1
+; CHECK:  LV: Found an estimated cost of 1 for VF 4 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK:  LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK:  LV: Vector loop of width 4 costs: 4.
+; CHECK:  LV: Found an estimated cost of 0 for VF 8 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK:  LV: Found an estimated cost of 0 for VF 8 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK:  LV: Found an estimated cost of 32 for VF 8 For instruction: store i8 1, ptr %gep, align 1
----------------
john-brawn-arm wrote:

> Thanks for adding the test! However, I still don't know how your patch affects the cost model. For example, what were the costs before your change? That would help to understand why this patch now chooses VF=1.

Current costs, without this patch:
```
LV: Found an estimated cost of 0 for VF 1 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 2 for VF 1 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %ec, label %exit, label %loop
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 2 for VF 2 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 2 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 2 for VF 2 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 1 for VF 2 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 2 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 2 costs: 3.
LV: Found an estimated cost of 0 for VF 4 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 4 for VF 4 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 4 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 4 for VF 4 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 2 for VF 4 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 4 costs: 2.
LV: Found an estimated cost of 0 for VF 8 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 8 for VF 8 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 8 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 8 for VF 8 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 4 for VF 8 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 8 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 8 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 8 costs: 2.
LV: Found an estimated cost of 0 for VF 16 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 16 for VF 16 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 16 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 16 for VF 16 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 8 for VF 16 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 16 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 16 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 16 costs: 2.
LV: Selecting VF: 8.
```
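As a rough sanity check on the numbers above, here is a small Python sketch that recomputes the `costs: N` lines from the per-instruction costs. It assumes the reported loop cost is the sum of the instruction costs divided by VF with integer division, which matches every line in the log. The `whole_loop_cost` helper is a further assumption rather than something shown in the log: it supposes that with tail folding and a trip count of 7 each vector iteration is paid for in full, so the total is the summed cost times ceil(7 / VF). Under that assumption VF 8 comes out cheapest, which agrees with `Selecting VF: 8`. The names `COSTS`, `per_lane_cost` and `whole_loop_cost` are made up for this illustration.

```python
import math

# Per-instruction costs copied from the debug output above (without the patch),
# in the order: phi, trunc, gep, store, add, icmp, br.
COSTS = {
    1: [0, 0, 0, 2, 1, 1, 0],
    2: [0, 2, 0, 2, 1, 1, 0],
    4: [0, 4, 0, 4, 2, 1, 0],
    8: [0, 8, 0, 8, 4, 1, 0],
    16: [0, 16, 0, 16, 8, 1, 0],
}

TRIP_COUNT = 7  # from `%ec = icmp eq i64 %iv.next, 7`


def per_lane_cost(vf):
    """Summed instruction cost divided by VF (integer division)."""
    return sum(COSTS[vf]) // vf


def whole_loop_cost(vf):
    """ASSUMPTION: with tail folding, every vector iteration costs the full sum."""
    return sum(COSTS[vf]) * math.ceil(TRIP_COUNT / vf)


if __name__ == "__main__":
    for vf in COSTS:
        print(f"VF {vf}: per-lane {per_lane_cost(vf)}, whole loop {whole_loop_cost(vf)}")
    print("cheapest VF under this model:", min(COSTS, key=whole_loop_cost))
```

The per-lane figures for VF 4, 8 and 16 all round down to 2, so the printed `costs:` lines alone do not explain the choice of VF 8. The trip-count-weighted totals in this model are what separate them.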

> However, my concern here is that we might be fixing this the wrong way. For example, it sounds like we could achieve the same effect if in computePredInstDiscount we increased the scalarisation cost to match the vector cost, so that the discount disappears?

That sounds like doing more work for the same end result. We already know that there can be no discount: the "vector cost" of an instruction we've already decided to scalarize is the scalarized cost, so there's no point in calculating the same thing again.
 
> Let's suppose we force the vectoriser to choose VF=8, and in one scenario we choose to scalarise and in the other we vectorise. What would the end result look like? If the generated code looks equally bad in both cases then I'd expect the vector cost and scalar cost to be the same. I worry that we're effectively hiding a bug in the cost model by simply bypassing it for tail-folded loops, while the problem may still remain for normal loops with control flow.

If we've already decided to scalarize by the time computePredInstDiscount is called (i.e. getWideningDecision returns CM_Scalarize), then using a vector instruction is impossible, because no vector instruction exists for the operation we're trying to do. In this case we're not using SVE, so we don't have a masked vector store instruction.
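
To make the argument concrete, here is a toy Python model of the reasoning. It is not the LoopVectorize code. `Decision` and `pred_inst_discount` are hypothetical names, and treating the discount as a plain difference between the vector cost and the scalarized cost is a simplifying assumption. The point it illustrates is that once the widening decision is already "scalarize", the cost recorded as the vector cost is the scalarized cost, so the difference is zero and the calculation can be skipped.

```python
from enum import Enum, auto


class Decision(Enum):
    """Hypothetical stand-in for a widening decision; not the LLVM enum."""
    WIDEN = auto()
    SCALARIZE = auto()


def pred_inst_discount(decision, vector_cost, scalarized_cost):
    """Toy discount: how much cheaper scalarizing is than keeping the vector form."""
    if decision is Decision.SCALARIZE:
        # The instruction has no vector form here, so there is nothing to
        # discount against; skip the calculation entirely.
        return 0
    return vector_cost - scalarized_cost


if __name__ == "__main__":
    # 32 is the store cost the test expects at VF 8 once the store is scalarized.
    scalarized_store_cost = 32
    # For an already-scalarized store the "vector cost" is that same number.
    print(pred_inst_discount(Decision.SCALARIZE, scalarized_store_cost, scalarized_store_cost))
    print(pred_inst_discount(Decision.WIDEN, scalarized_store_cost, scalarized_store_cost))
```

Both calls give the same answer in this model, which is why the early return does not change the result and only avoids redundant work.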


https://github.com/llvm/llvm-project/pull/109289