[PATCH] D111460: [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`

Fri Oct 29 01:25:57 PDT 2021

pengfei added a subscriber: dorit.
pengfei added a comment.

In D111460#3093902 <https://reviews.llvm.org/D111460#3093902>, @lebedev.ri wrote:

> In D111460#3093892 <https://reviews.llvm.org/D111460#3093892>, @RKSimon wrote:
>
>> Did you ever manage to extend BaseT::getInterleavedMemoryOpCost to compare scalarization vs shuffle costs (and select the lowest)?
>
> Hm, why would i/we do that?
> LV already does that: https://github.com/llvm/llvm-project/blob/5d9318638e892143c7788191ca1d89445582880b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L7509-L7547

I think the problem is `getMemInstScalarizationCost` is considering the interleaved cases, which doesn't look totally independent to me.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7046-7048
   // Figure out whether the access is strided and get the stride value
   // if it's known in compile time
   const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, PSE, TheLoop);
----------------
I'm wondering why not/if we can improve this function? The comments say we can get the stride information here. If this is true, we don't need the change in X86TargetTransformInfo.cpp, right? Since it already checks `isStridedAccess` and `getConstantStrideStep` there.

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll:1450
-; ENABLED_MASKED_STRIDED-NEXT:    [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP6]], <8 x i8> [[TMP7]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> [[INTERLEAVED_VEC]], <16 x i8>* [[TMP10]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
 ; ENABLED_MASKED_STRIDED-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
----------------
lebedev.ri wrote:
> lebedev.ri wrote:
> > pengfei wrote:
> > > I think masked gather/scatter/load/store etc. is definitely better than branches on AVX512. I think we should consider the cost of branches.
> > So, this is kind-of expected. Note, `ENABLED_MASKED_STRIDED` is check prefix for the run line
> > with `-enable-masked-interleaved-mem-accesses`, which is an override
> > for `TTI::enableMaskedInterleavedAccessVectorization`, where latter returns false for X86.
> > This is really about vectorizing using masked memory ops when the memory ops are strided,
> > so further interleaving/deinterleaving is required. So we don't really support this on X86.
> > 
> > The actual problem here is in `X86TTIImpl::getInterleavedMemoryOpCost()`:
> > ```
> >   if (UseMaskForCond || UseMaskForGaps)
> >     return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
> >                                              Alignment, AddressSpace, CostKind,
> >                                              UseMaskForCond, UseMaskForGaps);
> > ```
> > So the moment we ask for the interleaving costs when there is masking involved,
> > X86 cost model immediately fallbacks to the scalarized cost model,
> > while the only thing it needs to to is to ask `X86TTIImpl::getMaskedMemoryOpCost)`
> > instead of `X86TTIImpl::getMemoryOpCost()`. 
> > 
> > I think this can be improved, but i'm not sure if this is a blocker or not.
> > Again, this used to work because without this patch,
> > the cost of naive scalar gather is much higher than for the naive interleaved load.
> I took a look, and it seems like we'd need to have yet another cost table,
> and record the cost of  the following 35 shuffle sequences: https://godbolt.org/z/Mv8MYGxqP
> 
> Can be done, but unless we consider this to be a prerequisite here
> (note that `TTI::enableMaskedInterleavedAccessVectorization()` says `false` for X86),
> it's a bit on the edge of my interest, at least right now.
> I think this can be improved, but i'm not sure if this is a blocker or not.

I think the ENABLED_MASKED_STRIDED checks are intentional for the masked interleave cases. See comments above the function. I would think this is a blocker. Add @dorit @Ayal who worked with the initial patches D53011 and D53559.

> it seems like we'd need to have yet another cost table

Can we give a coarse-gained calculation first? I think we shouldn't break the dependence of the existing code.

> https://godbolt.org/z/Mv8MYGxqP

Why don't use llvm.masked.load/store directly in the tests?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111460/new/

https://reviews.llvm.org/D111460