[PATCH] D101460: [SLP]Try to vectorize tiny trees with shuffled gathers of extractelements.

Thu Apr 29 01:23:46 PDT 2021

david-arm added inline comments.

================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/accelerate-vector-functions-inseltpoison.ll:43
+; NOACCELERATE-NEXT:    [[TMP7:%.*]] = tail call fast float @llvm.sin.f32(float [[VECEXT_3]])
+; NOACCELERATE-NEXT:    [[VECINS_3:%.*]] = insertelement <4 x float> [[VECINS_2]], float [[TMP7]], i32 3
 ; NOACCELERATE-NEXT:    ret <4 x float> [[VECINS_3]]
----------------
ABataev wrote:
> RKSimon wrote:
> > why do many of these libm vectorizations result in a v2f32 and 2 * f32 scalar calls? I'd expect either 2 x v2f32 or a v4f32.
> Cost model. Cost of 4x calls is too high (`Call cost 18 (58-40) for   %1 = tail call fast float @llvm.sin.f32(float %vecext`) and the cost of 2x calls is high (`Call cost 6 (26-20) for   %1 = tail call fast float @llvm.sin.f32(float %vecext)`), but the cost of the extractelements with indices 1-2 is 5 (they are removed by the vectorizer) + compensate of the costs for inserts.
I find the logic here a bit difficult to follow. I can understand that extracting element 0 is basically free, so keeping the first scalar llvm.sin.f32 makes sense. We then decide to make a vector call for elements 1 + 2, but I can't see where those extractelements are removed by the vectoriser. It still looks like we have 4 extractelements from the original <4 x float> vector.

I did try out the patch, though, and with these changes we end up with 5 more lines of assembly in the generated code for this function, so it doesn't look like a win. Perhaps there is an issue with the AArch64 cost model for the math calls?
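
For reference, the arithmetic in the quoted explanation can be sketched as below. This is only an illustration in Python, not the SLP vectorizer's code: the function names and the `insert_compensation` parameter are hypothetical, the thread gives no number for the insert compensation, and the "negative total means profitable" reading assumes the default SLP cost threshold of 0.

```python
# Figures are the ones quoted in the review thread; all names are illustrative.

def call_cost_delta(vector_cost, scalar_cost):
    """Cost of the vector call minus the scalar calls it replaces.

    A positive value means the vector form is the more expensive one.
    """
    return vector_cost - scalar_cost


four_wide = call_cost_delta(58, 40)  # "Call cost 18 (58-40)"
two_wide = call_cost_delta(26, 20)   # "Call cost 6 (26-20)"

# Quoted cost of the extractelements with indices 1-2.
EXTRACT_SAVING = 5


def tree_cost(call_delta, extract_saving, insert_compensation=0):
    """Hypothetical total: call delta minus the savings claimed in the thread.

    insert_compensation is a placeholder; the thread does not state its value.
    """
    return call_delta - extract_saving - insert_compensation


# With no insert compensation, neither width reaches a negative total.
print(tree_cost(four_wide, EXTRACT_SAVING))
print(tree_cost(two_wide, EXTRACT_SAVING))
```

Under these assumptions the 2-wide tree only goes negative once the insert compensation is at least 2, while the 4-wide tree would need at least 14, which is consistent with the vectorizer picking the 2-wide call.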

================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll:21
 ; CHECK-NEXT:    [[E2:%.*]] = extractelement <4 x i32> [[SUB0]], i32 2
-; CHECK-NEXT:    [[S2:%.*]] = sext i32 [[E2]] to i64
-; CHECK-NEXT:    [[GEP2:%.*]] = getelementptr inbounds i64, i64* [[P]], i64 [[S2]]
+; CHECK-NEXT:    [[TMP0:%.*]] = insertelement <2 x i32> poison, i32 [[E1]], i32 0
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x i32> [[TMP0]], i32 [[E2]], i32 1
----------------
At first glance this looks worse, but I've tried out your patch and the generated code is the same: the entire first sequence of inserts, sext and trunc gets folded away, since the sext + trunc is basically a no-op.
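
The no-op claim can be checked with a small Python model of the two casts. This is a sketch on raw bit patterns, assuming an i32 -> i64 sext followed by a trunc back to i32; the helper names are made up for the illustration and are not LLVM APIs.

```python
def sext_i32_to_i64(x):
    """Sign-extend a 32-bit pattern to a 64-bit pattern."""
    x &= 0xFFFFFFFF
    if x & 0x80000000:           # sign bit set: fill the upper 32 bits with ones
        x |= 0xFFFFFFFF00000000
    return x


def trunc_i64_to_i32(x):
    """Keep only the low 32 bits, as a trunc to i32 would."""
    return x & 0xFFFFFFFF


samples = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF, 0x12345678]

# Extending and then truncating returns the original 32-bit pattern.
round_trip_ok = all(trunc_i64_to_i32(sext_i32_to_i64(v)) == v for v in samples)
print(round_trip_ok)
```

The extension only changes the upper 32 bits, and the truncation discards exactly those bits, so the pair can be folded away.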

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D101460/new/

https://reviews.llvm.org/D101460