[PATCH] D122145: [SLP] Cluster ordering for loads

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 5 07:36:25 PDT 2022


dmgreen added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3399-3400
+  for (auto &Base : Bases) {
+    for (auto &T : Base.second)
+      SortedIndices.push_back(std::get<2>(T));
+  }
----------------
ABataev wrote:
> What if we have non-power-of-2 number of elements in each cluster?
There are a couple of tests, reduce_blockstrided3 and store_blockstrided3, with blocks of size 3. The first became a few instructions shorter under X86 (and didn't change on AArch64). The second was different-but-not-worse on AArch64 (and didn't change under X86). Those are the only tests I've seen with non-power-of-2 clusters though, so it's not very exhaustive testing.

(A quick test of a "reduce_blockstrided5" seems to be better too - a lot less shuffling in the version I tried under X86, and more vectorization under AArch64.)

It can depend on the cost model. I think both AArch64 and X86 will have a much lower cost for insert-subvectors that are aligned and a power of 2 in size. It also depends on how bad the initial ordering is - if the new order allows more less-than-full-width vectorization, that might still be a win.

I can make it more conservative if you think that's best; I don't have a strong opinion either way.
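
For reference, the shape of pattern I mean is roughly the following - an illustrative sketch only, not the actual reduce_blockstrided3 test: two bases, each contributing a cluster of three consecutive loads, so neither cluster is a power-of-2 in size, but the clustered ordering still keeps each group of loads contiguous.
```
; Sketch of a non-power-of-2 cluster shape: three consecutive i16 loads
; from %x and three from %x + %stride. The clustered ordering sorts the
; loads as {x+0, x+1, x+2}, {x+s+0, x+s+1, x+s+2} rather than interleaving
; them, even though each cluster only has 3 elements.
define i16 @blockstrided3_sketch(i16* %x, i64 %stride) {
entry:
  %xs = getelementptr inbounds i16, i16* %x, i64 %stride
  %x1 = getelementptr inbounds i16, i16* %x, i64 1
  %x2 = getelementptr inbounds i16, i16* %x, i64 2
  %xs1 = getelementptr inbounds i16, i16* %xs, i64 1
  %xs2 = getelementptr inbounds i16, i16* %xs, i64 2
  %l0 = load i16, i16* %x, align 2
  %l1 = load i16, i16* %x1, align 2
  %l2 = load i16, i16* %x2, align 2
  %l3 = load i16, i16* %xs, align 2
  %l4 = load i16, i16* %xs1, align 2
  %l5 = load i16, i16* %xs2, align 2
  %m0 = mul i16 %l0, %l3
  %m1 = mul i16 %l1, %l4
  %m2 = mul i16 %l2, %l5
  %a0 = add i16 %m0, %m1
  %a1 = add i16 %a0, %m2
  ret i16 %a1
}
```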


================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll:354-355
 ; CHECK-NEXT:    [[TMP7:%.*]] = load <4 x i16>, <4 x i16>* [[TMP6]], align 2
-; CHECK-NEXT:    [[TMP8:%.*]] = load i16, i16* [[ARRAYIDX20]], align 2
-; CHECK-NEXT:    [[TMP9:%.*]] = load i16, i16* [[ARRAYIDX23]], align 2
-; CHECK-NEXT:    [[TMP10:%.*]] = load i16, i16* [[ARRAYIDX26]], align 2
-; CHECK-NEXT:    [[TMP11:%.*]] = load i16, i16* [[ARRAYIDX29]], align 2
-; CHECK-NEXT:    [[TMP12:%.*]] = shufflevector <4 x i16> [[TMP7]], <4 x i16> poison, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 undef, i32 undef, i32 undef, i32 undef>
-; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <8 x i16> [[TMP12]], i16 [[TMP9]], i64 4
-; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <8 x i16> [[TMP13]], i16 [[TMP8]], i64 5
-; CHECK-NEXT:    [[TMP15:%.*]] = insertelement <8 x i16> [[TMP14]], i16 [[TMP11]], i64 6
-; CHECK-NEXT:    [[TMP16:%.*]] = insertelement <8 x i16> [[TMP15]], i16 [[TMP10]], i64 7
-; CHECK-NEXT:    [[TMP17:%.*]] = shufflevector <4 x i16> [[TMP1]], <4 x i16> poison, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 undef, i32 undef, i32 undef, i32 undef>
-; CHECK-NEXT:    [[TMP18:%.*]] = insertelement <8 x i16> [[TMP17]], i16 [[TMP3]], i64 4
-; CHECK-NEXT:    [[TMP19:%.*]] = insertelement <8 x i16> [[TMP18]], i16 [[TMP2]], i64 5
-; CHECK-NEXT:    [[TMP20:%.*]] = insertelement <8 x i16> [[TMP19]], i16 [[TMP5]], i64 6
-; CHECK-NEXT:    [[TMP21:%.*]] = insertelement <8 x i16> [[TMP20]], i16 [[TMP4]], i64 7
-; CHECK-NEXT:    [[TMP22:%.*]] = mul <8 x i16> [[TMP16]], [[TMP21]]
-; CHECK-NEXT:    [[TMP23:%.*]] = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> [[TMP22]])
-; CHECK-NEXT:    ret i16 [[TMP23]]
+; CHECK-NEXT:    [[TMP8:%.*]] = mul <4 x i16> [[TMP5]], [[TMP1]]
+; CHECK-NEXT:    [[TMP9:%.*]] = mul <4 x i16> [[TMP7]], [[TMP3]]
+; CHECK-NEXT:    [[TMP10:%.*]] = shufflevector <4 x i16> [[TMP8]], <4 x i16> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
----------------
ABataev wrote:
> Looks like a regression here, worth investigation.
Because of the two v4i16 muls? It looks like it does OK overall with the nicer order of the loads: https://godbolt.org/z/9f44fPeTW, and https://godbolt.org/z/eonoM8Ys7 for x86.

From what I can see, the SLP vectorizer produces a single v8i16 mul. It is instcombine that then splits it up because it thinks that one shuffle is better than two:
```
*** IR Dump After SLPVectorizerPass on reduce_blockstrided4 ***
define i16 @reduce_blockstrided4(i16* nocapture noundef readonly %x, i16* nocapture noundef readonly %y, i32 noundef %stride) {
entry:
  %idxprom = sext i32 %stride to i64
  %arrayidx4 = getelementptr inbounds i16, i16* %x, i64 %idxprom
  %arrayidx20 = getelementptr inbounds i16, i16* %y, i64 %idxprom
  %0 = bitcast i16* %x to <4 x i16>*
  %1 = load <4 x i16>, <4 x i16>* %0, align 2
  %2 = bitcast i16* %arrayidx4 to <4 x i16>*
  %3 = load <4 x i16>, <4 x i16>* %2, align 2
  %4 = bitcast i16* %y to <4 x i16>*
  %5 = load <4 x i16>, <4 x i16>* %4, align 2
  %6 = bitcast i16* %arrayidx20 to <4 x i16>*
  %7 = load <4 x i16>, <4 x i16>* %6, align 2
  %8 = shufflevector <4 x i16> %5, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %9 = shufflevector <4 x i16> %7, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %10 = shufflevector <8 x i16> %8, <8 x i16> %9, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
  %11 = shufflevector <4 x i16> %1, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %12 = shufflevector <4 x i16> %3, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %13 = shufflevector <8 x i16> %11, <8 x i16> %12, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
  %14 = mul <8 x i16> %10, %13
  %15 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %14)
  ret i16 %15
}
```
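
After instcombine, that becomes roughly the following - a sketch reconstructed from the CHECK lines above, with illustrative value names, the surrounding loads elided, and the final reduction assumed: the single v8i16 mul is split into two v4i16 muls feeding one concatenating shuffle.
```
; Illustrative reconstruction of the post-instcombine form (not a verbatim
; dump): two narrow multiplies plus a single concatenating shuffle replace
; the two widening shuffles that fed one <8 x i16> multiply.
declare i16 @llvm.vector.reduce.add.v8i16(<8 x i16>)

define i16 @reduce_blockstrided4_sketch(<4 x i16> %a, <4 x i16> %b, <4 x i16> %c, <4 x i16> %d) {
entry:
  %m0 = mul <4 x i16> %c, %a
  %m1 = mul <4 x i16> %d, %b
  %cat = shufflevector <4 x i16> %m0, <4 x i16> %m1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %r = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %cat)
  ret i16 %r
}
```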

I can look into fixing that if you think it's worth doing. I'm not sure how yet (instcombine can't look at the cost model), but I've often worried about the number of vector shuffles that instcombine transforms. Maybe that transform can be moved to VectorCombine so it can use better costing.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D122145/new/

https://reviews.llvm.org/D122145
