[PATCH] D99719: [SLP] Better estimate cost of no-op extracts on target vectors.

Thu Apr 1 08:49:43 PDT 2021

fhahn added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3495-3498
+        unsigned MaxVecRegSize = getMaxVecRegSize();
+        unsigned EltSize = getVectorElementSize(VL[0]);
+        unsigned EltsPerVector = MaxVecRegSize / EltSize;
+        unsigned Idx = 0;
----------------
ABataev wrote:
> fhahn wrote:
> > ABataev wrote:
> > > fhahn wrote:
> > > > ABataev wrote:
> > > > > I think it is better to use `TLI->getTypeLegalizationCost(DL, cast<ExtractElementInst>(V)->getVectorOperandType());` to get the real machine vector type and the number of splits.
> > > > I think using `TargetLoweringInfo` would indeed be better, but unfortunately I don't think we can access it here, as it is defined in CodeGen? I tried to see if there are any other such uses in `llvm/lib/Transforms` but couldn't. Perhaps there's a way to use it I am missing?
> > > You can use `TTI->getNumberOfParts()` to get the number of registers and then calculate EltsPerVector.
> > > Also, what if there are extracts from 2 different vectors with the different numbers of elements?
> > That's convenient, thanks! I just gave it a try, but I stumbled over a problem. For example, on AArch64, `<2 x i32>` fits and can be used as the lower half of a vector register, so `EltsPerVector` would be 2 (and rightly so). But this has the unfortunate effect that in some cases we would vectorize some operations earlier with `<2 x i32>`, rather than vectorizing a larger expression with `<4 x i32>`. By using the larger vector register, we make sure to only do so to use the largest VF.
> > 
> > Arguably using `getNumberOfParts` is the right thing to use here, but I really want to avoid introducing any regressions and I don't think there's a way at the moment to skip vectorizing eagerly if it would prevent optimizing with a wider VF later on. WDYT?
> > 
> > > Also, what if there are extracts from 2 different vectors with the different numbers of elements?
> > 
> > At the moment all extracts in a block need to have the same vector register, so the types should also be the same. The `extracts_first_2_lanes_different_vectors` test should check for that case.
> 1. Could you give an example, please?
> 2. Then maybe guard these extra checks with something like:
> ```
> if (*ShuffleKind == TargetTransformInfo::SK_PermuteSingleSrc) {
>  ...
> }
> ```
> ?
Just had another look at the failure and it was caused by computing `Cost = TTI->getShuffleCost(ShuffleKind.getValue(), VecTy, Mask);` up-front, but then subtracting the cost of individual extracts.

I think this may reduce the cost too aggressively. 

For example in the code below, the shuffle-cost on AArch64 is 1, but the cost to extract from lane `1` is 3. The new code should just cancel out the cost of the shuffle, but in this case it made the cost more profitable than it should be! I think it should be more correct to subtract the cost of a single shuffle for a vector with `EltsPerVector` elements. That should be more symmetric to the computed `Cost`. 

```
  %v0.0 = extractelement <4 x i32> %v0, i32 0
  %v0.1 = extractelement <4 x i32> %v0, i32 1
```

I think there's a similar problem in the `AllUsersVectorized && !ScalarToTreeEntry.count()` case, but it is not obvious to me how to improve that, given that we potentially need to subtract the cost for only a subset of extracts. Perhaps we should ensure `Cost` is at least 0, to avoid the problem?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99719/new/

https://reviews.llvm.org/D99719