[llvm] [VPlan] Don't use the legacy cost model for loop conditions (PR #156864)
John Brawn via llvm-commits
llvm-commits at lists.llvm.org
Thu Sep 11 05:33:23 PDT 2025
================
@@ -174,16 +173,34 @@ attributes #0 = { "target-cpu"="knl" }
define void @PR40816() #1 {
; CHECK-LABEL: define void @PR40816(
; CHECK-SAME: ) #[[ATTR1:[0-9]+]] {
-; CHECK-NEXT: [[ENTRY:.*]]:
-; CHECK-NEXT: br label %[[FOR_BODY:.*]]
-; CHECK: [[FOR_BODY]]:
-; CHECK-NEXT: [[TMP0:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[INC:%.*]], %[[FOR_BODY]] ]
-; CHECK-NEXT: store i32 [[TMP0]], ptr @b, align 1
-; CHECK-NEXT: [[CMP2:%.*]] = icmp eq i32 [[TMP0]], 2
-; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[TMP0]], 1
-; CHECK-NEXT: br i1 [[CMP2]], label %[[RETURN:.*]], label %[[FOR_BODY]]
-; CHECK: [[RETURN]]:
-; CHECK-NEXT: ret void
+; CHECK-NEXT: [[ENTRY:.*:]]
----------------
john-brawn-arm wrote:
It looks like what's going on here is:
- Currently the load from arrayidx is considered part of computing the loop exit condition, so its cost is calculated in LoopVectorizationPlanner::precomputeCosts. It gets a very high cost due to useEmulatedMaskMemRefHack, so we don't vectorize.
- In the VPlan, a transformation has figured out that the loop has a constant trip count (because the load is from a constant array), so the load has been removed.
- With this patch, that means we no longer count the cost of the load, as it no longer exists, and the resulting cost says that vectorization is profitable.
If I manually transform the function into what it is after the VPlan transformation, it looks like this:
```llvm
define void @PR40816_adj() #1 {
entry:
br label %for.body
for.body:
%0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
store i32 %0, ptr @b, align 1
%inc = add nuw nsw i32 %0, 1
%cmp = icmp uge i32 %inc, 7
br i1 %cmp, label %return, label %for.body
return:
ret void
}
```
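At the C level, the transformed IR above corresponds to something like the following sketch (the function and global names mirror the IR; this is an illustration of the loop shape, not the original test source):

```c
/* Hedged C-level sketch of the IR above: a counted loop that stores the
 * counter to a global each iteration and exits once the incremented
 * counter reaches 7 (i.e. 7 iterations, storing 0..6). */
int b;

void PR40816_adj(void) {
  int i = 0;
  for (;;) {
    b = i;            /* store i32 %0, ptr @b */
    int inc = i + 1;  /* %inc = add nuw nsw i32 %0, 1 */
    if (inc >= 7)     /* %cmp = icmp uge i32 %inc, 7 */
      return;
    i = inc;
  }
}
```

With the load gone, all that remains is a short fixed-trip-count loop of scalar stores, which is exactly the kind of loop the cost model currently treats as profitable to vectorize.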
and this currently gets vectorized. This is in fact very similar to the test low_trip_count_fold_tail_scalarized_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll.
I think this ultimately comes down to this FIXME in LoopVectorizationCostModel::setCostBasedWideningDecision:
```cpp
// Load: Scalar load + broadcast
// Store: Scalar store + isLoopInvariantStoreValue ? 0 : extract
// FIXME: This cost is a significant under-estimate for tail folded
// memory ops.
const InstructionCost ScalarizationCost =
IsLegalToScalarize() ? getUniformMemOpCost(&I, VF)
: InstructionCost::getInvalid();
```
https://github.com/llvm/llvm-project/pull/156864