[PATCH] D98714: [SLP] Add insertelement instructions to vectorizable tree

Fri May 7 03:52:38 PDT 2021

anton-afanasyev marked 2 inline comments as done.
anton-afanasyev added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4670-4671
+        // Create shuffle to resize vector
+        for (unsigned I = 0; I < NumElts; I++)
+          Mask[I] = (I < NumScalars) ? I : UndefMaskElem;
+        V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()), Mask);
----------------
ABataev wrote:
> anton-afanasyev wrote:
> > ABataev wrote:
> > > anton-afanasyev wrote:
> > > > ABataev wrote:
> > > > > anton-afanasyev wrote:
> > > > > > ABataev wrote:
> > > > > > > ```
> > > > > > > SmallVector<int, 16> Mask(NumElts, UndefMaskElem);
> > > > > > > std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0); 
> > > > > > > ```
> > > > > > > 
> > > > > > > Also, I think just `std::iota(Mask.begin(), Mask.end(), 0); ` shall work too
> > > > > > Thanks, changed to `std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);`
> > > > > > 
> > > > > > Yes, `std::iota(Mask.begin(), Mask.end(), 0);` gives the same result, but wouldn't it lead to redundant code lowered?
> > > > > `std::iota(Mask.begin(), Mask.end(), 0);` will produce the same code as `std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);` but only if `NumScalars <= Mask.size() * 2`. Otherwise the compiler may crash in some cases. So, better to keep `std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);`
> > > > `Mask.size()` is just `NumElts` and `NumScalars <= NumElts` (is it worth to add assert for this?), so `NumScalars <= Mask.size() * 2` for all cases.
> > > > 
> > > > What I did mean by "the same result" is that shuffle
> > > > ```
> > > > %a = shufflevector <2 x float> %b, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
> > > > ```
> > > > is equivalent to
> > > > ```
> > > > %a = shufflevector <2 x float> %b, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
> > > > ```
> > > > (here `NumScalars == 2` and `NumElts == 4`).
> > > We may have a situation say NumScalars is 2 and NumElts is 8, if insert the same scalars several times.
> > Do you mean "NumScalars is 8 and NumElts is 2"?
> > This case is excluded, we can get only unique inserts (inserts with unique index) from `findBuildAggregate()` for now. Also, we even assert
> > ```
> > assert(ReuseShuffleIndicies.empty() && "All inserts should be unique");
> > ```
> > at first line of `InsertElement` at the `buildTree_rec()` function.
> Well, maybe :) Anyway, let's keep `std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);` just to avoid fixin something in future when we add support for vectorization of more code patterns
Sure!

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4672-4680
+        V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()), Mask);
+
+        for (unsigned I = 0; I < NumElts; I++)
+          Mask[I] =
+              (I < MinIndex || I >= MaxIndex) ? I : NumElts - MinIndex + I;
+
+        V = Builder.CreateShuffleVector(
----------------
ABataev wrote:
> anton-afanasyev wrote:
> > ABataev wrote:
> > > anton-afanasyev wrote:
> > > > anton-afanasyev wrote:
> > > > > ABataev wrote:
> > > > > > ABataev wrote:
> > > > > > > Can we use `ShuffleBuilder` here?
> > > > > > Did you include these costs in the cost model?
> > > > > They are included in `getScalarizationOverhead()`.
> > > > Yes, I thought about `ShuffleBuilder`, but here we just need to create two shuffles of special kind for vector resizing. It requires `ShuffleInstructionBuilder` expanding, don't think it's worth it.
> > > I rather doubt in it. First, you subtract the scalarization overhead cost from the vector cost, but here you need to add the costs of subvector insert and permutation of 2 vectors
> > >>> Did you include these costs in the cost model?
> > >> They are included in getScalarizationOverhead().
> > > I rather doubt in it. First, you subtract the scalarization overhead cost from the vector cost, but here you need to add the costs of subvector insert and permutation of 2 vectors
> > 
> > These subvector inserts are lowered to nop actually, so they cost nothing. We need this code when processing big vector of scalars part-by-part, but every chunk fits the whole vector register (condition `bits >= MinVecRegSize`), so actually there is no inserting. The result consisting of several vector registers is returned then.
> This might be not true for some targets/code patterns. Plus, the second pattern is a permutation/combining of 2 vectors, its cost is at least 1
> This might be not true for some targets/code patterns. Plus, the second pattern is a permutation/combining of 2 vectors, its cost is at least 1

Both of these shuffles are just the one operation of "inserting", we need the first one to expand source vector since shufflevector needs the same size of operands.

We can add cost of this shuffles, but this prevents several vectorization being performed before, since this cost is redundant (may be `TTI->getShuffleCost()` should be tuned for `TargetTransformInfo::SK_InsertVector`?)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98714/new/

https://reviews.llvm.org/D98714