[PATCH] D111460: [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`

Sun Oct 31 07:05:25 PDT 2021

lebedev.ri added a comment.

The more i look at it, the more i'm surprised it all currently appears to work somewhat.
Other obvious issue - `X86TTIImpl::getGSVectorCost()` does not account for the legalization cost,
especially if there is a variable mask. (`X86TTIImpl::getMaskedMemoryOpCost()`, OTOH, does it right.)
Same for masked interleaved loads.

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll:1450
-; ENABLED_MASKED_STRIDED-NEXT:    [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP6]], <8 x i8> [[TMP7]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> [[INTERLEAVED_VEC]], <16 x i8>* [[TMP10]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
 ; ENABLED_MASKED_STRIDED-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
----------------
pengfei wrote:
> lebedev.ri wrote:
> > lebedev.ri wrote:
> > > pengfei wrote:
> > > > I think masked gather/scatter/load/store etc. is definitely better than branches on AVX512. I think we should consider the cost of branches.
> > > So, this is kind-of expected. Note, `ENABLED_MASKED_STRIDED` is check prefix for the run line
> > > with `-enable-masked-interleaved-mem-accesses`, which is an override
> > > for `TTI::enableMaskedInterleavedAccessVectorization`, where latter returns false for X86.
> > > This is really about vectorizing using masked memory ops when the memory ops are strided,
> > > so further interleaving/deinterleaving is required. So we don't really support this on X86.
> > > 
> > > The actual problem here is in `X86TTIImpl::getInterleavedMemoryOpCost()`:
> > > ```
> > >   if (UseMaskForCond || UseMaskForGaps)
> > >     return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
> > >                                              Alignment, AddressSpace, CostKind,
> > >                                              UseMaskForCond, UseMaskForGaps);
> > > ```
> > > So the moment we ask for the interleaving costs when there is masking involved,
> > > X86 cost model immediately fallbacks to the scalarized cost model,
> > > while the only thing it needs to to is to ask `X86TTIImpl::getMaskedMemoryOpCost)`
> > > instead of `X86TTIImpl::getMemoryOpCost()`. 
> > > 
> > > I think this can be improved, but i'm not sure if this is a blocker or not.
> > > Again, this used to work because without this patch,
> > > the cost of naive scalar gather is much higher than for the naive interleaved load.
> > I took a look, and it seems like we'd need to have yet another cost table,
> > and record the cost of  the following 35 shuffle sequences: https://godbolt.org/z/Mv8MYGxqP
> > 
> > Can be done, but unless we consider this to be a prerequisite here
> > (note that `TTI::enableMaskedInterleavedAccessVectorization()` says `false` for X86),
> > it's a bit on the edge of my interest, at least right now.
> > I think this can be improved, but i'm not sure if this is a blocker or not.
> 
> I think the ENABLED_MASKED_STRIDED checks are intentional for the masked interleave cases. See comments above the function. I would think this is a blocker. Add @dorit @Ayal who worked with the initial patches D53011 and D53559.
> 
> > it seems like we'd need to have yet another cost table
> 
> Can we give a coarse-gained calculation first? I think we shouldn't break the dependence of the existing code.
> 
> > https://godbolt.org/z/Mv8MYGxqP
> 
> Why don't use llvm.masked.load/store directly in the tests?
Attempted that in D112873, doesn't really help all that much.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111460/new/

https://reviews.llvm.org/D111460