[PATCH] D111460: [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`

Wed Oct 27 04:01:39 PDT 2021

lebedev.ri added inline comments.

================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:3855
   // ADD instruction.
-  if (Ty->isVectorTy() && SE) {
+  if (Ty->isVectorTy() && SE && !ST->hasAVX2()) {
+    // FIXME: this is a hack to artificially favor interleaved loads over gather
----------------
Does this comment look better to everyone?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7061
   Type *PtrTy = ToVectorTy(Ptr->getType(), VF);
+  // FIXME: PtrTy should not be a vector, it's a hack.

----------------
pengfei wrote:
> RKSimon wrote:
> > dmgreen wrote:
> > > pengfei wrote:
> > > > lebedev.ri wrote:
> > > > > pengfei wrote:
> > > > > > It would be more productive if it includes the reason why we can’t change it right away.
> > > > > > Lack of such explanation led to the original version of the patch that repeats D93129.
> > > > > I'm open to suggestions as to how this should be amended.
> > > > Sorry, I don't have a concrete suggestion. I have concerns because I saw disagreement from Dave and didn't see consensus have been made.
> > > My comments were mostly for ARM/AArch64, which this patch no longer affects. I don't have as much knowledge on X86 matters and no evidence suggesting one way or another whether this is a good idea I'm afraid.
> > > 
> > > I don't think this comment is particularly useful though. Personally I would personally just remove this comment and leave the patch to the X86 backend.
> > @fhahn What happened to D93129 that actually dealt with this?
> Removing the comment sounds good to me.
Hm, i'd prefer to leave at least some mark here. How about this instead?

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll:1450
-; ENABLED_MASKED_STRIDED-NEXT:    [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP6]], <8 x i8> [[TMP7]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> [[INTERLEAVED_VEC]], <16 x i8>* [[TMP10]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
 ; ENABLED_MASKED_STRIDED-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
----------------
pengfei wrote:
> I think masked gather/scatter/load/store etc. is definitely better than branches on AVX512. I think we should consider the cost of branches.
So, this is kind-of expected. Note, `ENABLED_MASKED_STRIDED` is check prefix for the run line
with `-enable-masked-interleaved-mem-accesses`, which is an override
for `TTI::enableMaskedInterleavedAccessVectorization`, where latter returns false for X86.
This is really about vectorizing using masked memory ops when the memory ops are strided,
so further interleaving/deinterleaving is required. So we don't really support this on X86.

The actual problem here is in `X86TTIImpl::getInterleavedMemoryOpCost()`:
```
  if (UseMaskForCond || UseMaskForGaps)
    return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
                                             Alignment, AddressSpace, CostKind,
                                             UseMaskForCond, UseMaskForGaps);
```
So the moment we ask for the interleaving costs when there is masking involved,
X86 cost model immediately fallbacks to the scalarized cost model,
while the only thing it needs to to is to ask `X86TTIImpl::getMaskedMemoryOpCost)`
instead of `X86TTIImpl::getMemoryOpCost()`. 

I think this can be improved, but i'm not sure if this is a blocker or not.
Again, this used to work because without this patch,
the cost of naive scalar gather is much higher than for the naive interleaved load.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111460/new/

https://reviews.llvm.org/D111460