[PATCH] D122145: [SLP] Cluster ordering for loads

Alexey Bataev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 5 07:41:18 PDT 2022


ABataev accepted this revision.
ABataev added a comment.
This revision is now accepted and ready to land.

LG



================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll:354-355
 ; CHECK-NEXT:    [[TMP7:%.*]] = load <4 x i16>, <4 x i16>* [[TMP6]], align 2
-; CHECK-NEXT:    [[TMP8:%.*]] = load i16, i16* [[ARRAYIDX20]], align 2
-; CHECK-NEXT:    [[TMP9:%.*]] = load i16, i16* [[ARRAYIDX23]], align 2
-; CHECK-NEXT:    [[TMP10:%.*]] = load i16, i16* [[ARRAYIDX26]], align 2
-; CHECK-NEXT:    [[TMP11:%.*]] = load i16, i16* [[ARRAYIDX29]], align 2
-; CHECK-NEXT:    [[TMP12:%.*]] = shufflevector <4 x i16> [[TMP7]], <4 x i16> poison, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 undef, i32 undef, i32 undef, i32 undef>
-; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <8 x i16> [[TMP12]], i16 [[TMP9]], i64 4
-; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <8 x i16> [[TMP13]], i16 [[TMP8]], i64 5
-; CHECK-NEXT:    [[TMP15:%.*]] = insertelement <8 x i16> [[TMP14]], i16 [[TMP11]], i64 6
-; CHECK-NEXT:    [[TMP16:%.*]] = insertelement <8 x i16> [[TMP15]], i16 [[TMP10]], i64 7
-; CHECK-NEXT:    [[TMP17:%.*]] = shufflevector <4 x i16> [[TMP1]], <4 x i16> poison, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 undef, i32 undef, i32 undef, i32 undef>
-; CHECK-NEXT:    [[TMP18:%.*]] = insertelement <8 x i16> [[TMP17]], i16 [[TMP3]], i64 4
-; CHECK-NEXT:    [[TMP19:%.*]] = insertelement <8 x i16> [[TMP18]], i16 [[TMP2]], i64 5
-; CHECK-NEXT:    [[TMP20:%.*]] = insertelement <8 x i16> [[TMP19]], i16 [[TMP5]], i64 6
-; CHECK-NEXT:    [[TMP21:%.*]] = insertelement <8 x i16> [[TMP20]], i16 [[TMP4]], i64 7
-; CHECK-NEXT:    [[TMP22:%.*]] = mul <8 x i16> [[TMP16]], [[TMP21]]
-; CHECK-NEXT:    [[TMP23:%.*]] = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> [[TMP22]])
-; CHECK-NEXT:    ret i16 [[TMP23]]
+; CHECK-NEXT:    [[TMP8:%.*]] = mul <4 x i16> [[TMP5]], [[TMP1]]
+; CHECK-NEXT:    [[TMP9:%.*]] = mul <4 x i16> [[TMP7]], [[TMP3]]
+; CHECK-NEXT:    [[TMP10:%.*]] = shufflevector <4 x i16> [[TMP8]], <4 x i16> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
----------------
dmgreen wrote:
> ABataev wrote:
> > Looks like a regression here, worth investigation.
> Because of the two v4i16 mul's? It looks like it does OK overall with the nicer order of the loads: https://godbolt.org/z/9f44fPeTW, https://godbolt.org/z/eonoM8Ys7 for x86
> 
> From what I can see, the SLP vectorizer produces a single v8i16 mul. It is instcombine that then splits that up because it thinks that one shuffle is better than 2:
> ```
> *** IR Dump After SLPVectorizerPass on reduce_blockstrided4 ***
> define i16 @reduce_blockstrided4(i16* nocapture noundef readonly %x, i16* nocapture noundef readonly %y, i32 noundef %stride) {
> entry:
>   %idxprom = sext i32 %stride to i64
>   %arrayidx4 = getelementptr inbounds i16, i16* %x, i64 %idxprom
>   %arrayidx20 = getelementptr inbounds i16, i16* %y, i64 %idxprom
>   %0 = bitcast i16* %x to <4 x i16>*
>   %1 = load <4 x i16>, <4 x i16>* %0, align 2
>   %2 = bitcast i16* %arrayidx4 to <4 x i16>*
>   %3 = load <4 x i16>, <4 x i16>* %2, align 2
>   %4 = bitcast i16* %y to <4 x i16>*
>   %5 = load <4 x i16>, <4 x i16>* %4, align 2
>   %6 = bitcast i16* %arrayidx20 to <4 x i16>*
>   %7 = load <4 x i16>, <4 x i16>* %6, align 2
>   %8 = shufflevector <4 x i16> %5, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>   %9 = shufflevector <4 x i16> %7, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>   %10 = shufflevector <8 x i16> %8, <8 x i16> %9, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
>   %11 = shufflevector <4 x i16> %1, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>   %12 = shufflevector <4 x i16> %3, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>   %13 = shufflevector <8 x i16> %11, <8 x i16> %12, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
>   %14 = mul <8 x i16> %10, %13
>   %15 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %14)
>   ret i16 %15
> }
> ```
> 
> I can look into fixing that if you think it's worth doing. I'm not sure how yet (instcombine can't look at the cost model), but I've often worried about the amount of vector shuffles that instcombine transforms. Maybe it could be moved to VectorCombine so it can use better costing.
I think we need to improve the SLP vectorizer here to reduce the number of emitted shuffles; I hope it will be fixed soon.
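
For illustration, the transform under discussion can be sketched in isolation (operand names %p/%q/%r/%s are hypothetical, not from the dump above). The SLP output concatenates the narrow vectors with two shuffles and does one wide multiply; instcombine rewrites this into two narrow multiplies followed by a single shuffle, since the results are lane-for-lane identical:

```llvm
; SLP form: two concatenating shuffles feeding one wide mul
%a = shufflevector <4 x i16> %p, <4 x i16> %q, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%b = shufflevector <4 x i16> %r, <4 x i16> %s, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%m = mul <8 x i16> %a, %b

; instcombine form: two narrow muls, one concatenating shuffle
; (lanes 0-3 = %p*%r, lanes 4-7 = %q*%s in both forms)
%m0 = mul <4 x i16> %p, %r
%m1 = mul <4 x i16> %q, %s
%m2 = shufflevector <4 x i16> %m0, <4 x i16> %m1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
```

Whether the second form is cheaper depends on the target's shuffle and multiply costs, which is why the comments above suggest the fold may belong in VectorCombine rather than instcombine.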


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D122145/new/

https://reviews.llvm.org/D122145


