[PATCH] D98714: [SLP] Add insertelement instructions to vectorizable tree

Anton Afanasyev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed May 5 12:45:22 PDT 2021


anton-afanasyev marked 9 inline comments as done.
anton-afanasyev added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2840-2845
+      int MinIndex = INT_MAX;
+      for (auto *V : VL)
+        MinIndex = std::min(
+            MinIndex,
+            int(cast<ConstantInt>(cast<InsertElementInst>(V)->getOperand(2))
+                    ->getZExtValue()));
----------------
ABataev wrote:
> Outline this code into a separate helper function
Ok, moved to `GetConstantOperand()`


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2846-2853
+      bool IsConsecutive = true;
+      for (int I = 0, E = VL.size(); I < E; I++) {
+        int Index =
+            cast<ConstantInt>(cast<InsertElementInst>(VL[I])->getOperand(2))
+                ->getZExtValue() -
+            MinIndex;
+        IsConsecutive &= (I == Index);
----------------
ABataev wrote:
> Better to turn this into a lambda
Ok, added `GetConstantOperand()`


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3648
+        MinIndex = std::min(MinIndex, Index);
+      }
+
----------------
RKSimon wrote:
> Use getScalarizationOverhead instead of accumulating insertelement costs? That should avoid the need for the subvector costs below as well.
Thanks, done this way.


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3650-3656
+      // TODO: fix getShuffleCost() for SK_InsertSubvector
+      // and uncomment this to accurately tune costs. For typical cases
+      // this should be zero when subvector is whole register.
+      // if (SrcVecTy->getNumElements() != VL.size()) {
+      //   Cost += TTI->getShuffleCost(TargetTransformInfo::SK_InsertSubvector,
+      //                               SrcVecTy, None, MinIndex, VecTy);
+      // }
----------------
ABataev wrote:
> I think this should be uncommented, otherwise, we are too optimistic about the cost.
I've removed this commented-out code after switching to `getScalarizationOverhead()` as @RKSimon suggested.


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3975
+  // Vectorizing inserts of gathered values is redundant.
+  if (isa<InsertElementInst>(VectorizableTree[0]->Scalars[0]) &&
+      VectorizableTree[1]->State == TreeEntry::NeedToGather)
----------------
anton-afanasyev wrote:
> ABataev wrote:
> > I think better to check that all elements are `InsertElementInst` and move this check closer  to the end of the function
> If any of the elements is an `InsertElementInst`, they all are, so checking one is enough. And we can't move the check closer to the end, since the next check can return true.
Moreover, I had to move it even higher, into `isTreeTinyAndNotFullyVectorizable()`, since we need to explicitly exclude the case of vectorizing inserts of gathered values. That makes no sense, and otherwise we can fall into an infinite loop of generating inserts and vectorizing them again (for instance, when `-slp-min-tree-size=2` is set).


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:6525-6527
+      ArrayRef<Value *> Ops = InsertUses.empty()
+                                  ? VL.slice(I, OpsWidth)
+                                  : InsertUses.slice(I, OpsWidth);
----------------
anton-afanasyev wrote:
> ABataev wrote:
> > Why not just pass `InsertUses` as `VL` here and drop the `InsertUses` parameter completely?
> Yes, I've planned to do this, but can't: we make a decision about vectorization based on the _operands_:
> ```
>   InstructionsState S = getSameOpcode(VL);
>   if (!S.getOpcode())
>     return false;
>  ...
>  unsigned Sz = R.getVectorElementSize(I0);
> ```
> but then perform the vectorization starting from the inserts.
Ok, eventually found a way to drop `InsertUses` parameter.


================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/accelerate-vector-functions.ll:229
+; NOACCELERATE-NEXT:    [[TMP6:%.*]] = tail call fast float @expf(float [[VECEXT_3]])
+; NOACCELERATE-NEXT:    [[VECINS_3:%.*]] = insertelement <4 x float> [[VECINS_21]], float [[TMP6]], i32 3
 ; NOACCELERATE-NEXT:    ret <4 x float> [[VECINS_3]]
----------------
RKSimon wrote:
> I'm curious why we ended up with a partial vectorization of 2 x expf + llvm.exp.v2f32 here instead of llvm.exp.v4f32
I've debugged this case. Without acceleration, `@llvm.exp.v4f32` is too expensive (call cost is 18 = 58 - 40, vs -30 = 10 - 40 with acceleration), whereas `@llvm.exp.v2f32` is cheaper (call cost is 6 = 26 - 20).


================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/accelerate-vector-functions.ll:312
+; NOACCELERATE-NEXT:    [[TMP6:%.*]] = tail call fast float @logf(float [[VECEXT_3]])
+; NOACCELERATE-NEXT:    [[VECINS_3:%.*]] = insertelement <4 x float> [[VECINS_21]], float [[TMP6]], i32 3
 ; NOACCELERATE-NEXT:    ret <4 x float> [[VECINS_3]]
----------------
RKSimon wrote:
> Why not llvm.log.v4f32?
This case is the same as above.


================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll:146
 define <8 x i32> @ashr_shl_v8i32_const(<8 x i32> %a) {
-; SSE-LABEL: @ashr_shl_v8i32_const(
-; SSE-NEXT:    [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
-; SSE-NEXT:    [[TMP2:%.*]] = ashr <4 x i32> [[TMP1]], <i32 2, i32 2, i32 2, i32 2>
-; SSE-NEXT:    [[TMP3:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
-; SSE-NEXT:    [[TMP4:%.*]] = shl <4 x i32> [[TMP3]], <i32 3, i32 3, i32 3, i32 3>
-; SSE-NEXT:    [[R7:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> [[TMP4]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
-; SSE-NEXT:    ret <8 x i32> [[R7]]
-;
 ; AVX1-LABEL: @ashr_shl_v8i32_const(
 ; AVX1-NEXT:    [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
----------------
RKSimon wrote:
> Where did the SSE checks go? Add back the SSE check prefixes?
Thanks, fixed


================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/sext-inseltpoison.ll:3
+; RUN: opt < %s -mtriple=x86_64-unknown -basic-aa -slp-vectorizer -S | FileCheck %s --check-prefixes=SSE
+; RUN: opt < %s -mtriple=x86_64-unknown -mcpu=slm -basic-aa -slp-vectorizer -S | FileCheck %s --check-prefixes=SSE
 ; RUN: opt < %s -mtriple=x86_64-unknown -mcpu=corei7-avx -basic-aa -slp-vectorizer -S | FileCheck %s --check-prefixes=AVX
----------------
RKSimon wrote:
> Add these back - you've lost SSE2/SLM test coverage.
Ok, fixed


================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/sext.ll:3
+; RUN: opt < %s -mtriple=x86_64-unknown -basic-aa -slp-vectorizer -S | FileCheck %s --check-prefixes=SSE
+; RUN: opt < %s -mtriple=x86_64-unknown -mcpu=slm -basic-aa -slp-vectorizer -S | FileCheck %s --check-prefixes=SSE
 ; RUN: opt < %s -mtriple=x86_64-unknown -mcpu=corei7-avx -basic-aa -slp-vectorizer -S | FileCheck %s --check-prefixes=AVX
----------------
RKSimon wrote:
> Add these back - you've lost SSE2/SLM test coverage.
Ok, fixed


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98714/new/

https://reviews.llvm.org/D98714


