[llvm] [SLPVectorizer] Use accurate cost for external users of resize shuffles (PR #137419)

Wed Jun 11 12:00:18 PDT 2025

================
@@ -22,13 +21,11 @@ define void @foo(double %i) {
 ; CHECK-NEXT:    [[TMP17:%.*]] = call i1 @llvm.vector.reduce.and.v8i1(<8 x i1> [[TMP16]])
 ; CHECK-NEXT:    br i1 [[TMP17]], label [[BB58:%.*]], label [[BB115:%.*]]
 ; CHECK:       bb115:
-; CHECK-NEXT:    [[TMP18:%.*]] = fmul <2 x double> zeroinitializer, [[TMP4]]
-; CHECK-NEXT:    [[TMP19:%.*]] = extractelement <2 x double> [[TMP18]], i32 0
-; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <2 x double> [[TMP18]], i32 1
+; CHECK-NEXT:    [[TMP19:%.*]] = fmul double 0.000000e+00, [[I103]]
+; CHECK-NEXT:    [[TMP20:%.*]] = fmul double 0.000000e+00, [[I82]]
 ; CHECK-NEXT:    [[I118:%.*]] = fadd double [[TMP19]], [[TMP20]]
 ; CHECK-NEXT:    [[TMP21:%.*]] = fmul <4 x double> zeroinitializer, [[TMP1]]
-; CHECK-NEXT:    [[TMP22:%.*]] = shufflevector <2 x double> [[TMP4]], <2 x double> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
-; CHECK-NEXT:    [[TMP23:%.*]] = shufflevector <4 x double> <double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double poison>, <4 x double> [[TMP22]], <4 x i32> <i32 0, i32 1, i32 2, i32 5>
+; CHECK-NEXT:    [[TMP23:%.*]] = insertelement <4 x double> <double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double poison>, double [[I82]], i32 3
----------------
jrbyrnes wrote:

The effect this PR has had on this test also looks good to me.

The effect is less immediately obvious on this test compared to the last, because we are evaluating vectorizing a list after already vectorizing several others. In the partially vectorized version, we have some insertelements that are not present in the original IR. These insertelements are external users for our tree.

We have a VF of 2 and are inserting the 2nd element into the 4th position of the external insertelement vector (the external user width is 4). The vector we are inserting into is a literal 0.0 vector, and it is the other operand in any external shuffles.

The external use mask for our vectorized tree is <-1, -1, -1, 1>.

Previously, we used a resize mask of <-1, -1> to calculate the resize cost and a mask of <0, 1, 2, 7> to calculate the external shuffle cost. These shuffles both have a cost of 1, so we vectorize. However, this isn't how the SLP codegen works: we don't shuffle during the resize (though that may be more optimal in this case). In actuality, we resize using mask <0, 1, -1, -1>, which is just identity with poison. Then, we shuffle using <0, 1, 2, 5>. In the new version, we are much more accurate: we use <-1, 1, -1, -1> to calculate the resize cost (identity with poison) and <0, 1, 2, 5> to calculate the shuffle cost. These have costs of 0 and 3 respectively, which pushes us over the cost threshold, so we do not vectorize this part.
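To make the mask arithmetic above easier to check, here is a minimal Python sketch that models only the lane-selection semantics of `shufflevector` and `insertelement`. It is not the SLP cost model and says nothing about TTI costs. The names `shufflevector`, `insertelement`, `tree`, `const_vec`, and `POISON` are illustrative helpers for this sketch, and the strings `"I103"` and `"I82"` simply stand in for the two lanes of `[[TMP4]]` as implied by the CHECK lines.

```python
POISON = None  # stand-in for a poison lane


def shufflevector(a, b, mask):
    """Model LLVM lane selection: idx < len(a) reads a, otherwise b; -1 gives poison."""
    assert len(a) == len(b), "both operands must have the same width"
    n = len(a)
    out = []
    for idx in mask:
        if idx < 0:
            out.append(POISON)
        elif idx < n:
            out.append(a[idx])
        else:
            out.append(b[idx - n])
    return out


def insertelement(vec, value, index):
    """Return a copy of vec with one lane replaced."""
    out = list(vec)
    out[index] = value
    return out


tree = ["I103", "I82"]                # the VF=2 vectorized value
const_vec = [0.0, 0.0, 0.0, POISON]   # the literal 0.0 vector being inserted into

# What codegen emits: an identity-with-poison resize, then a two-source shuffle.
resized = shufflevector(tree, [POISON, POISON], [0, 1, -1, -1])
emitted = shufflevector(const_vec, resized, [0, 1, 2, 5])

# The mask that was costed before: lane 7 is lane 3 of the resized vector.
old_costed = shufflevector(const_vec, resized, [0, 1, 2, 7])

# The scalar form in the updated CHECK lines.
scalar_form = insertelement(const_vec, "I82", 3)

print(resized)      # ['I103', 'I82', None, None]
print(emitted)      # [0.0, 0.0, 0.0, 'I82']
print(old_costed)   # [0.0, 0.0, 0.0, None]
print(scalar_form)  # [0.0, 0.0, 0.0, 'I82']
```

In this model, <0, 1, 2, 5> on top of the identity resize reproduces the same lanes as the single `insertelement`, while <0, 1, 2, 7> would read a poison lane unless the resize itself had moved element 1 into lane 3.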

There is a difference in the generated assembly before / after.

https://github.com/llvm/llvm-project/pull/137419