[PATCH] D111460: [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`
Pengfei Wang via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 19 01:31:00 PDT 2021
pengfei added a comment.
>> I think the high cost might have practical consideration, e.g., impact some benchmarks etc.
>
> Sure, any change might have practical considerations, in either direction.
I agree, but we should evaluate it carefully to make sure the change does more good than harm, right?
>> IIUC, you are saying we prefer naive interleaving to scalarized gather, right?
>
> Quote please? Define we prefer?
X86TargetTransformInfo.cpp:3853, "interleaved load is better in general in reality"
>> Can we make sure we always generate the interleaving sequence, e.g., when not have a constant stride?
>
> Why? I don't think we should be doing anything like that. It is up to the vectorizer
> to pick the best sequence, given the cost estimates provided by TTI for each one.
Only when the estimate is precise enough. Underestimating the cost of the scalarized gather sequence will fool the vectorizer in practice.
The IA optimization manual says gather/scatter is preferred; the cost of an emulated gather/scatter shouldn't be lower than that of a real gather/scatter.
================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:3847
// ADD instruction.
- if (Ty->isVectorTy() && SE) {
+ if (Ty->isVectorTy() && SE && !ST->hasAVX2()) {
+ // FIXME: this is a hack to artificially favor interleaved loads over gather
----------------
lebedev.ri wrote:
> pengfei wrote:
> > Does this still fall into @dmgreen 's previous comments?
> >
> > > The base getAddressComputationCost never returns anything other than 0 and doesn't distinguish between scalar and vector types.
> I do not understand the question.
> This if-block is a hack, and the patch disables it for AVX2+.
>
>
It is a question whether this is a hack or working as intended. It already checks both `isStridedAccess` and `getConstantStrideStep`. Shouldn't these be true for interleaved loads/stores, so that we return 0 in the end, given they are constant-stride cases?
We still have to handle non-strided or variable-stride accesses, e.g.,
```
(%base, %index, scale)
add %base, %stride
(%base, %index, scale)
add %base, %stride
(%base, %index, scale)
add %base, %stride
```
Returning 10 for a non-strided access might be too high, but returning 1 for a variable stride still makes sense here.
Besides, checking for AVX2+ doesn't make sense either: many targets also have fast insert/extract support on SSE4.1 and above.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D111460/new/
https://reviews.llvm.org/D111460