[PATCH] D140789: [SLP] Unify GEP cost modeling for load, store and GEP nodes.

Tue Jan 3 11:19:50 PST 2023

ABataev added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:6893-6898
+      // When GEP used more than once it won't be removed after vectorization
+      // (should we check that all uses are inside vec tree instead?).
+      // GEPs with all constant indices also considered to have zero cost.
+      // TODO: it is target dependent, so need to implement and then use a TTI
+      // interface.
+      if (Ptr && Ptr->hasOneUse() && !Ptr->hasAllConstantIndices())
----------------
vdmitrie wrote:
> ABataev wrote:
> > If pointer has multiple uses, it still will be vectorized + added the cost of the external use. I think currently, we still may add the cost of the external use for such geps. Shall we drop Ptr->hasOneUse() for some nodes, like scattervectorize, but not for vector loads/stores?
> We shall treat GEP is not a regular instruction. For regular instruction we can perform vector operation and than extract a lane to get a value. We cannot do the same for a GEP. When scalar GEPs has just one use it means that their user will be removed as a result of vectorization. But this does not happen for GEPs as there is no vector version of GEP exists. Instead we just cast base pointer to required vector type. If some non-base pointer GEPs have more that one use that means they may still have uses after the tree was vectorized (i.e. GEP will not be removed). That is my understanding about what logic was put behind that hasOneUse check. I agree that it can be not quite satisfactory and I left the comment about "all uses inside vectorizable tree". In this regard test case geps-non-pow-2.ll probably represents an exception from the above as GEPs are arguments of PHIs (and we do not really know what is relationships between the GEPs) and an external use will produce an extract rather than leaves the original GEP instruction.  Note that in this case all nodes in the tree are either "vectorize" or splats (See attached pdf). I.e. we probably may drop hasOneUse when an in-tree user of GEPs is PHI but I believe it would be incorrect to do that if GEP user is a load/store node (regardless of vectorize/scattervectorize kind).
> 
> 
I'm not saying about particular test, just a common question.
 Say, we have something like this:
```
%gep1 = getelementptr
%gep2 = getelementptr
%a = load %gep1
%b = load %gep2
%c = load %gep1
```

If 2 first loads gets vectorized, the third load will get extractelement from vector getelementptr:
```
%gep1 = getelement <x>
%vec_a = load < 2 x> %gep1
%gep = extractelement %gep1, 0
%c load %gep
```
try the next code:

```
define i32 @jumbled-load(ptr noalias nocapture %in, ptr noalias nocapture %inn, ptr noalias nocapture %out) {
  %load.1 = load i32, ptr %in, align 4
  %gep.1 = getelementptr inbounds i32, ptr %in, i64 3
  %load.2 = load i32, ptr %gep.1, align 4
  %gep.2 = getelementptr inbounds i32, ptr %in, i64 6
  %load.3 = load i32, ptr %gep.2, align 4
  %gep.3 = getelementptr inbounds i32, ptr %in, i64 9
  %load.4 = load i32, ptr %gep.3, align 4
  %load.5 = load i32, ptr %inn, align 4
  %gep.4 = getelementptr inbounds i32, ptr %inn, i64 1
  %load.6 = load i32, ptr %gep.4, align 4
  %gep.5 = getelementptr inbounds i32, ptr %inn, i64 2
  %load.7 = load i32, ptr %gep.5, align 4
  %gep.6 = getelementptr inbounds i32, ptr %inn, i64 3
  %load.8 = load i32, ptr %gep.6, align 4
  %mul.1 = mul i32 %load.1, %load.5
  %mul.2 = mul i32 %load.2, %load.6
  %mul.3 = mul i32 %load.3, %load.7
  %mul.4 = mul i32 %load.4, %load.8
  %gep.8 = getelementptr inbounds i32, ptr %out, i64 1
  %gep.9 = getelementptr inbounds i32, ptr %out, i64 2
  %gep.10 = getelementptr inbounds i32, ptr %out, i64 3
  store i32 %mul.1, ptr %gep.9, align 4
  store i32 %mul.2, ptr %out, align 4
  store i32 %mul.3, ptr %gep.10, align 4
  store i32 %mul.4, ptr %gep.8, align 4

  %ll = load i32, ptr %gep.1, align 4

  ret i32 undef
}
```
opt -S -mtriple=x86_64-unknown -mattr=+avx512vl -passes=slp-vectorizer ./repro.ll
```
define i32 @jumbled-load(ptr noalias nocapture %in, ptr noalias nocapture %inn, ptr noalias nocapture %out) #0 {
  %1 = insertelement <4 x ptr> poison, ptr %in, i32 0
  %2 = shufflevector <4 x ptr> %1, <4 x ptr> poison, <4 x i32> zeroinitializer
  %3 = getelementptr i32, <4 x ptr> %2, <4 x i64> <i64 3, i64 9, i64 0, i64 6>
  %4 = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %3, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison)
  %5 = load <4 x i32>, ptr %inn, align 4
  %6 = shufflevector <4 x i32> %5, <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
  %7 = mul <4 x i32> %4, %6
  store <4 x i32> %7, ptr %out, align 4
  %8 = extractelement <4 x ptr> %3, i32 0
  %ll = load i32, ptr %8, align 4
  ret i32 undef
}

```
For vector loads/stores we maybe do not emit extractelement but we can account its cost. Need to exclude this extra cost for the geps with multiple uses

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140789/new/

https://reviews.llvm.org/D140789