[llvm] [SLP]Fix graph traversal in getSpillCost (PR #124984)

Thu Jan 30 10:18:27 PST 2025

================
@@ -149,37 +149,27 @@ define <4 x float> @exp_4x(ptr %a) {
 ; CHECK-SAME: (ptr [[A:%.*]]) #[[ATTR1]] {
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[TMP0:%.*]] = load <4 x float>, ptr [[A]], align 16
-; CHECK-NEXT:    [[VECEXT:%.*]] = extractelement <4 x float> [[TMP0]], i32 0
-; CHECK-NEXT:    [[TMP1:%.*]] = tail call fast float @expf(float [[VECEXT]])
-; CHECK-NEXT:    [[VECINS:%.*]] = insertelement <4 x float> undef, float [[TMP1]], i32 0
-; CHECK-NEXT:    [[VECEXT_1:%.*]] = extractelement <4 x float> [[TMP0]], i32 1
-; CHECK-NEXT:    [[TMP2:%.*]] = tail call fast float @expf(float [[VECEXT_1]])
-; CHECK-NEXT:    [[VECINS_1:%.*]] = insertelement <4 x float> [[VECINS]], float [[TMP2]], i32 1
-; CHECK-NEXT:    [[VECEXT_2:%.*]] = extractelement <4 x float> [[TMP0]], i32 2
-; CHECK-NEXT:    [[TMP3:%.*]] = tail call fast float @expf(float [[VECEXT_2]])
-; CHECK-NEXT:    [[VECINS_2:%.*]] = insertelement <4 x float> [[VECINS_1]], float [[TMP3]], i32 2
-; CHECK-NEXT:    [[VECEXT_3:%.*]] = extractelement <4 x float> [[TMP0]], i32 3
-; CHECK-NEXT:    [[TMP4:%.*]] = tail call fast float @expf(float [[VECEXT_3]])
-; CHECK-NEXT:    [[VECINS_3:%.*]] = insertelement <4 x float> [[VECINS_2]], float [[TMP4]], i32 3
-; CHECK-NEXT:    ret <4 x float> [[VECINS_3]]
+; CHECK-NEXT:    [[TMP1:%.*]] = shufflevector <4 x float> [[TMP0]], <4 x float> poison, <2 x i32> <i32 0, i32 1>
----------------
preames wrote:

I think I got confused here by the fact that the extracts were interleaved with the scalar calls in the input. Generally, I see that pattern in the output of the vectorizer, where it results in an unprofitable overall result. But in this case, that's also the input.

If I reorganize this test to have all the extracts, then all the calls, then all the inserts, I get the result I was expecting (no change from input) both before and after this change. This means the delta in this patch is specific to the particular IR order here.
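
A sketch of the reordered form described above, reconstructed from the removed CHECK lines rather than copied from the actual test. The function name `@exp_4x_reordered` and the value names are hypothetical, and function attributes are omitted:

```llvm
; Hypothetical reordering of @exp_4x: all extracts, then all calls, then all inserts.
declare float @expf(float)

define <4 x float> @exp_4x_reordered(ptr %a) {
entry:
  %vec = load <4 x float>, ptr %a, align 16
  ; All extracts first.
  %vecext = extractelement <4 x float> %vec, i32 0
  %vecext.1 = extractelement <4 x float> %vec, i32 1
  %vecext.2 = extractelement <4 x float> %vec, i32 2
  %vecext.3 = extractelement <4 x float> %vec, i32 3
  ; Then all scalar calls.
  %call = tail call fast float @expf(float %vecext)
  %call.1 = tail call fast float @expf(float %vecext.1)
  %call.2 = tail call fast float @expf(float %vecext.2)
  %call.3 = tail call fast float @expf(float %vecext.3)
  ; Then all inserts.
  %vecins = insertelement <4 x float> undef, float %call, i32 0
  %vecins.1 = insertelement <4 x float> %vecins, float %call.1, i32 1
  %vecins.2 = insertelement <4 x float> %vecins.1, float %call.2, i32 2
  %vecins.3 = insertelement <4 x float> %vecins.2, float %call.3, i32 3
  ret <4 x float> %vecins.3
}
```

In the original ordering, each extract, call, and insert for one lane is grouped together, so every call after the first sits between the definition and the use of earlier insert results.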

I was initially thinking this had to do with the scalar vs vector typing issue in NoCallIntrinsic, but unfortunately, a quick and dirty patch shows that while that does benefit a few other cases (only in combination with this patch), it doesn't impact this example at all.

I did some digging into the cost for this routine and noticed something interesting. At VF=4, the spill cost is computed as 8, but at VF=2 the spill cost is 0. I don't understand why that is true. Might be something worth digging into here?

One side observation worth noting - as this example shows, sometimes vectorization can *remove* spill cost.  It might be worth enhancing this logic to account for that at some point.  



https://github.com/llvm/llvm-project/pull/124984