[llvm] [LoopVectorize] Don't discount instructions scalarized due to tail folding (PR #109289)
David Sherwood via llvm-commits
llvm-commits at lists.llvm.org
Thu Oct 3 01:27:04 PDT 2024
================
@@ -0,0 +1,281 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --filter "fold tail" --filter "estimated cost" --filter "costs" --filter "Selecting VF" --filter "loop costs" --version 5
+; RUN: opt -passes=loop-vectorize -debug-only=loop-vectorize -disable-output -S < %s 2>&1 | FileCheck %s
+
+; REQUIRES: asserts
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; These tests check that if the only way to vectorize is to tail fold a store by
+; masking then we properly account for the cost of creating a predicated block
+; for each vector element.
+
+define void @store_const_fixed_trip_count(ptr %dst) {
+; CHECK-LABEL: 'store_const_fixed_trip_count'
+; CHECK: LV: can fold tail by masking.
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK: LV: Found an estimated cost of 2 for VF 1 For instruction: store i8 1, ptr %gep, align 1
+; CHECK: LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i64 %iv, 1
+; CHECK: LV: Found an estimated cost of 1 for VF 1 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK: LV: Scalar loop costs: 4.
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK: LV: Found an estimated cost of 2 for VF 1 For instruction: store i8 1, ptr %gep, align 1
+; CHECK: LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i64 %iv, 1
+; CHECK: LV: Found an estimated cost of 1 for VF 1 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK: LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK: LV: Scalar loop costs: 4.
+; CHECK: LV: Found an estimated cost of 0 for VF 2 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK: LV: Found an estimated cost of 0 for VF 2 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK: LV: Found an estimated cost of 8 for VF 2 For instruction: store i8 1, ptr %gep, align 1
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %iv.next = add i64 %iv, 1
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK: LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK: LV: Vector loop of width 2 costs: 5.
+; CHECK: LV: Found an estimated cost of 0 for VF 4 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK: LV: Found an estimated cost of 0 for VF 4 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK: LV: Found an estimated cost of 16 for VF 4 For instruction: store i8 1, ptr %gep, align 1
+; CHECK: LV: Found an estimated cost of 2 for VF 4 For instruction: %iv.next = add i64 %iv, 1
+; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %ec = icmp eq i64 %iv.next, 7
+; CHECK: LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %ec, label %exit, label %loop
+; CHECK: LV: Vector loop of width 4 costs: 4.
+; CHECK: LV: Found an estimated cost of 0 for VF 8 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+; CHECK: LV: Found an estimated cost of 0 for VF 8 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
+; CHECK: LV: Found an estimated cost of 32 for VF 8 For instruction: store i8 1, ptr %gep, align 1
----------------
david-arm wrote:
Thanks for adding the test! However, I still don't understand how your patch affects the cost model. For example, what were the costs before your change? Knowing that would help explain why this patch now chooses VF=1. I suspect your change increases the effective cost of the store by saying that we shouldn't scalarise it. However, my concern is that we might be fixing this the wrong way. For example, it sounds like we could achieve the same effect if, in computePredInstDiscount, we increased the scalarisation cost to match the vector cost, so that the discount disappears.
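To make the suggestion concrete, here is a minimal arithmetic sketch of the discount idea (the function and parameter names are illustrative, not the actual LoopVectorize variables; the costs 8 and 2 are taken from the store's VF=2 and VF=1 costs in the test above):

```python
# Hypothetical sketch of a scalarisation discount, in the spirit of
# computePredInstDiscount: a positive discount makes scalarising a
# predicated instruction look cheaper than keeping it vectorised.
def pred_inst_discount(vector_cost, scalarization_cost):
    return vector_cost - scalarization_cost

# Today: scalarising the predicated store can look artificially cheap.
print(pred_inst_discount(vector_cost=8, scalarization_cost=2))  # 6

# Suggested alternative: raise the scalarisation cost to match the
# vector cost, so the discount disappears and scalarising is no longer
# preferred -- without bypassing the cost model for tail-folded loops.
print(pred_inst_discount(vector_cost=8, scalarization_cost=8))  # 0
```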
Let's suppose we force the vectoriser to choose VF=8, and in one scenario we scalarise while in the other we vectorise: what would the end result look like? If the generated code looks equally bad in both cases then I'd expect the vector cost and the scalar cost to be the same. I worry that we're effectively hiding a bug in the cost model by simply bypassing it for tail-folded loops, while the problem may still remain for normal loops with control flow.
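For reference, the per-lane comparison the vectoriser is making can be reconstructed from the instruction costs printed in the debug output above (costs copied verbatim from the CHECK lines; the per-lane formula is a simplification of what the cost model actually does):

```python
# Per-iteration instruction costs from the debug log in the test:
# [phi, gep, store, add, icmp, br] for each candidate VF.
insn_costs = {
    1: [0, 0, 2, 1, 1, 0],   # scalar loop
    2: [0, 0, 8, 1, 1, 0],   # store scalarised + predicated per lane
    4: [0, 0, 16, 2, 1, 0],
}

for vf, costs in insn_costs.items():
    total = sum(costs)
    # The vectoriser compares cost per lane, roughly total / VF.
    print(f"VF={vf}: total={total}, per-lane={total / vf:.2f}")
```

With these numbers the scalar loop costs 4 per element, VF=2 costs 5, and VF=4 costs 4.75, which is consistent with the patch now selecting VF=1 once the predicated store is charged its full scalarisation cost.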
https://github.com/llvm/llvm-project/pull/109289