[PATCH] D111460: [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`

Mon Oct 18 06:37:23 PDT 2021

lebedev.ri added a comment.

In D111460#3070039 <https://reviews.llvm.org/D111460#3070039>, @pengfei wrote:

> Thanks for the explanation! Apologize for my silly questions, I'm still learning in this realm :)

>> It's mostly tangential, the longer this patch drags out, the more tests may need to be updated, which isn't fun.
>
> I don't get the point, why a tangential patch results in test updated?

Sorry, i'm not really following the question.

>> Naive interleaving sequence is: wide load + extracts + inserts,
>> while scalarized gather sequence is: narrow loads + inserts,
>>
>> Right now "gather" sequence is modelled as having artificially-high cost
>> (what this patch changes), effectively prohibiting using it for vectorization. 
>> But afterwards, naturally, the cost of scalarized gather sequence lowers,
>> becoming more attractive for vectorization, and sometimes even more than 
>> for the naive interleaving sequence, as you can see in `interleaved-load-i16-stride-5.ll`.
>
> I think the high cost might have practical consideration, e.g., impact some benchmarks etc.

Sure, any change might have practical considerations, in either direction.

> What's the value to give a cost smaller than naive interleaving sequence?

Basically, we can't have a cake and eat it too.
We don't give "scalarized gather sequence" the cost lower than the
cost for "naive interleaving sequence", we give it it's real cost,
that happens to be lower than for the "naive interleaving sequence".

> IIUC, you are saying we prefer naive interleaving to scalarized gather, right?

Quote please? Define `we prefer`?

>> But in general, we can assume that sequence of shuffles may be better than 
>> a sequence of extracts+inserts, so we may be better off with interleaving
>> rather than gather. But we can't really improve the cost modelling for the
>> naive interleaving sequence (because we can't really properly cost model shuffles),
>> so what we have to do is hardcode the costs for the interestting interleaving tuples
>> in the cost tables, like i have done in those 90+ patches.
>
> Do you mean this patch will enable the use of hardcode the costs?

Define `use`?

> Can we make sure we always generate the interleaving sequence, e.g., when not have a constant stride?

Why? I don't think we should be doing anything like that. It is up to the vectorizer
to pick the best sequence, given the cost estimates provided by TTI for each one.

================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:3847
   // ADD instruction.
-  if (Ty->isVectorTy() && SE) {
+  if (Ty->isVectorTy() && SE && !ST->hasAVX2()) {
+    // FIXME: this is a hack to artificially favor interleaved loads over gather
----------------
pengfei wrote:
> Does this still fall into @dmgreen 's previous comments?
> 
> > The base getAddressComputationCost never returns anything other than 0 and doesn't distinguish between scalar and vector types.
I do not understand the question.
This if-block is a hack, and the patch disables it for AVX2+.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111460/new/

https://reviews.llvm.org/D111460