[PATCH] D98714: [SLP] Add insertelement instructions to vectorizable tree

Tue May 11 07:19:03 PDT 2021

anton-afanasyev marked 6 inline comments as done.
anton-afanasyev added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3666-3669
+      InstructionCost Cost = 0;
+      Cost -= TTI->getScalarizationOverhead(SrcVecTy, DemandedElts,
+                                            /*Insert*/ true, /*Extract*/ false);
+      return Cost;
----------------
ABataev wrote:
> Just:
> ```
> return -TTI->getScalarizationOverhead(SrcVecTy, DemandedElts,
>                                             /*Insert*/ true, /*Extract*/ false);
> ```
I changed this code, used `Cost` variable.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4215-4217
+    // No extract cost for vector "scalar"
+    if (isa<FixedVectorType>(EU.Scalar->getType()))
+      continue;
----------------
ABataev wrote:
> anton-afanasyev wrote:
> > ABataev wrote:
> > > anton-afanasyev wrote:
> > > > ABataev wrote:
> > > > > I don't think this is correct. I think you need to use code from my patch D101555 for better cost estimation here.
> > > > Why? We don't generate any extract instructions for external using of vector, so no need to cost it.
> > > > 
> > > > I think you mixed two different cases up.
> > > > The patch D101555 you referenced is about cost estimation when insert is _user_, but here is the case when insert is _used_. We do not really need to "extract" it, since its user uses vector value rather than scalar one.
> > > > 
> > > > Also I don't think we need to use code from D101555 in this patch, since it does the same by the other way. The main idea of this patch is to unify the way we process inserts (the only vector tree node for now) vs ordinary tree nodes. The inserts are tree node now, and they are sorted by index, so no need to shuffle them. We could use `ReorderIndicies` if needed, but no need, since operands are sorted as well.
> > > I think you're missing shuffle cost here. If the external user is a vector and extracted element is inserted into the different lane - it is at least shuffle and need to add its cost.
> > > I think you're missing shuffle cost here. 
> > The cost of what shuffle? We don't generate any shuffle.
> > 
> > > If the external user is a vector ...
> > Not "the external user is a vector", but its operand is a vector. We do not need to extract any special lane, since _whole_ vector is using and replacing after "vectorization". In this special case (when tree node is inserts) we have vector "scalars" (inserts have vector type) and their "vectorization" is just removing (i.e. replacing by vectorized operands).
> Ok, I see. You correctly excluded the cost of the final (sub)vector reuse.
> But I suggest to improve the cost model for inserts.
> ```
> i1 = insertelement undef, v0
> i2 = insertelement i1, v1
> ....
> ii1 = insertelement undef, v1
> ```
> If <v0,v1> gets vectorized, you need to count the cost of the extract of `v1`. Instead, we can count it as shuffle and build the final shuffle. It can be much more profitable than relying on extract element costs/instructions. But probably this can be addressed in the next patch.
Ok, I see you too. Though I don't see how this case is addressed within D101555, two inserts `i1` and `ii1` cannot be cought to the same tree and they don't occur in one `InsertUses` therefore.

Anyway, I suggest to address this in the separate patch. It's rather rare case in natural life.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4672-4680
+        V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()), Mask);
+
+        for (unsigned I = 0; I < NumElts; I++)
+          Mask[I] =
+              (I < MinIndex || I >= MaxIndex) ? I : NumElts - MinIndex + I;
+
+        V = Builder.CreateShuffleVector(
----------------
ABataev wrote:
> ABataev wrote:
> > anton-afanasyev wrote:
> > > ABataev wrote:
> > > > anton-afanasyev wrote:
> > > > > ABataev wrote:
> > > > > > anton-afanasyev wrote:
> > > > > > > anton-afanasyev wrote:
> > > > > > > > ABataev wrote:
> > > > > > > > > ABataev wrote:
> > > > > > > > > > Can we use `ShuffleBuilder` here?
> > > > > > > > > Did you include these costs in the cost model?
> > > > > > > > They are included in `getScalarizationOverhead()`.
> > > > > > > Yes, I thought about `ShuffleBuilder`, but here we just need to create two shuffles of special kind for vector resizing. It requires `ShuffleInstructionBuilder` expanding, don't think it's worth it.
> > > > > > I rather doubt in it. First, you subtract the scalarization overhead cost from the vector cost, but here you need to add the costs of subvector insert and permutation of 2 vectors
> > > > > >>> Did you include these costs in the cost model?
> > > > > >> They are included in getScalarizationOverhead().
> > > > > > I rather doubt in it. First, you subtract the scalarization overhead cost from the vector cost, but here you need to add the costs of subvector insert and permutation of 2 vectors
> > > > > 
> > > > > These subvector inserts are lowered to nop actually, so they cost nothing. We need this code when processing big vector of scalars part-by-part, but every chunk fits the whole vector register (condition `bits >= MinVecRegSize`), so actually there is no inserting. The result consisting of several vector registers is returned then.
> > > > This might be not true for some targets/code patterns. Plus, the second pattern is a permutation/combining of 2 vectors, its cost is at least 1
> > > > This might be not true for some targets/code patterns. Plus, the second pattern is a permutation/combining of 2 vectors, its cost is at least 1
> > > 
> > > Both of these shuffles are just the one operation of "inserting", we need the first one to expand source vector since shufflevector needs the same size of operands.
> > > 
> > > We can add cost of this shuffles, but this prevents several vectorization being performed before, since this cost is redundant (may be `TTI->getShuffleCost()` should be tuned for `TargetTransformInfo::SK_InsertVector`?)
> > Need to check that the vectorized code is really profitable and if so, need to tune the cost model, yes.
> You can do this check by yourself. We have something similar for extracts, where we check if need to perform shuffle/copying of subvector. But we definitely need it.
> You can do this check by yourself. We have something similar for extracts, where we check if need to perform shuffle/copying of subvector. But we definitely need it.

Ok, I've used something similar for inserts as for extracts. Do you think it's better to make this check within `getShuffleCost()` in `X86TargetTransformInfo.cpp` and `AArch64TargetTransformInfo.cpp`?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98714/new/

https://reviews.llvm.org/D98714