[llvm] [LoopVectorize] Don't discount instructions scalarized due to tail folding (PR #109289)
John Brawn via llvm-commits
llvm-commits at lists.llvm.org
Mon Oct 14 04:58:42 PDT 2024
================
@@ -0,0 +1,281 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --filter "fold tail" --filter "estimated cost" --filter "costs" --filter "Selecting VF" --filter "loop costs" --version 5
+; RUN: opt -passes=loop-vectorize -debug-only=loop-vectorize -disable-output -S < %s 2>&1 | FileCheck %s
+
+; REQUIRES: asserts
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; These tests check that if the only way to vectorize is to tail fold a store by
+; masking then we properly account for the cost of creating a predicated block
+; for each vector element.
+
+define void @store_const_fixed_trip_count(ptr %dst) {
+; CHECK-LABEL: 'store_const_fixed_trip_count'
+; CHECK: LV: can fold tail by masking.
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK: LV: Found an estimated cost of 2 for VF 1 For instruction: store i8 1, ptr %gep, align 1
+; CHECK: LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i64 %iv, 1
+; CHECK: LV: Found an estimated cost of 1 for VF 1 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK: LV: Scalar loop costs: 4.
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK: LV: Found an estimated cost of 2 for VF 1 For instruction: store i8 1, ptr %gep, align 1
+; CHECK: LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i64 %iv, 1
+; CHECK: LV: Found an estimated cost of 1 for VF 1 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK: LV: Scalar loop costs: 4.
+; CHECK: LV: Found an estimated cost of 0 for VF 2 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK: LV: Found an estimated cost of 0 for VF 2 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK: LV: Found an estimated cost of 8 for VF 2 For instruction: store i8 1, ptr %gep, align 1
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %iv.next = add i64 %iv, 1
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK: LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK: LV: Vector loop of width 2 costs: 5.
+; CHECK: LV: Found an estimated cost of 0 for VF 4 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK: LV: Found an estimated cost of 0 for VF 4 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK: LV: Found an estimated cost of 16 for VF 4 For instruction: store i8 1, ptr %gep, align 1
+; CHECK: LV: Found an estimated cost of 2 for VF 4 For instruction: %iv.next = add i64 %iv, 1
+; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK: LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK: LV: Vector loop of width 4 costs: 4.
+; CHECK: LV: Found an estimated cost of 0 for VF 8 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK: LV: Found an estimated cost of 0 for VF 8 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK: LV: Found an estimated cost of 32 for VF 8 For instruction: store i8 1, ptr %gep, align 1
----------------
john-brawn-arm wrote:
> Also, another root cause here is that we shouldn't even be attempting to tail-fold loops with loads and stores in them if the target does not support masked loads and stores. It's a poor upfront decision made by the vectoriser with no analysis of the trade-off between tail-folding and not tail-folding.
It appears that the lack of masked load/store support used to disable tail folding, but https://github.com/llvm/llvm-project/commit/9d1b2acaaa224 changed this to rely on the cost model instead.
> One potential alternative to this patch that I think is worth exploring is before deciding to tail-fold for low trip count loops we can query the target's support for masked loads and stores (see `isLegalMaskedStore` and `isLegalMaskedLoad` in LoopVectorize.cpp). If the target says they are not legal just don't bother tail-folding as it's almost certainly not worth it. Perhaps the end result may be even faster, since we may create an unpredicated vector body for the first X iterations? What do you think? @fhahn any thoughts?
TargetTransformInfo::preferPredicateOverEpilogue already defaults to preferring a scalar epilogue, and the two targets that implement their own version (AArch64 and ARM) also prefer a scalar epilogue when masked loads and stores are unavailable. So the only situation where we even consider tail predication without masked loads/stores is when a scalar epilogue is impossible, in which case the choice is between vectorizing with tail folding or not vectorizing at all.
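As a side note on how the numbers in the test above fit together: the debug output is consistent with summing the per-instruction costs for each VF and dividing by the VF to get a per-lane loop cost. The sketch below is a simplified model of that arithmetic, not the actual InstructionCost implementation in LoopVectorize.cpp; the instruction cost lists are taken directly from the CHECK lines.

```python
# Simplified model (assumption): loop cost for a VF is the sum of the
# per-instruction costs divided by the VF, with integer truncation.
def loop_cost(inst_costs, vf):
    return sum(inst_costs) // vf

# Costs from the CHECK lines: phi, gep, store, add, icmp, br.
scalar = loop_cost([0, 0, 2, 1, 1, 0], 1)    # "Scalar loop costs: 4."
vf2 = loop_cost([0, 0, 8, 1, 1, 0], 2)       # "Vector loop of width 2 costs: 5."
vf4 = loop_cost([0, 0, 16, 2, 1, 0], 4)      # "Vector loop of width 4 costs: 4."
```

With the predicated-store costs properly accounted for (8 at VF 2, 16 at VF 4), VF 2 is more expensive per lane than the scalar loop, which is the behaviour this patch's test is checking.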
https://github.com/llvm/llvm-project/pull/109289