[PATCH] D111460: [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`

Thu Oct 28 08:15:34 PDT 2021

lebedev.ri added inline comments.

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll:1450
-; ENABLED_MASKED_STRIDED-NEXT:    [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP6]], <8 x i8> [[TMP7]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> [[INTERLEAVED_VEC]], <16 x i8>* [[TMP10]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
 ; ENABLED_MASKED_STRIDED-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
----------------
lebedev.ri wrote:
> pengfei wrote:
> > I think masked gather/scatter/load/store etc. is definitely better than branches on AVX512. I think we should consider the cost of branches.
> So, this is kind-of expected. Note, `ENABLED_MASKED_STRIDED` is check prefix for the run line
> with `-enable-masked-interleaved-mem-accesses`, which is an override
> for `TTI::enableMaskedInterleavedAccessVectorization`, where latter returns false for X86.
> This is really about vectorizing using masked memory ops when the memory ops are strided,
> so further interleaving/deinterleaving is required. So we don't really support this on X86.
> 
> The actual problem here is in `X86TTIImpl::getInterleavedMemoryOpCost()`:
> ```
>   if (UseMaskForCond || UseMaskForGaps)
>     return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
>                                              Alignment, AddressSpace, CostKind,
>                                              UseMaskForCond, UseMaskForGaps);
> ```
> So the moment we ask for the interleaving costs when there is masking involved,
> X86 cost model immediately fallbacks to the scalarized cost model,
> while the only thing it needs to to is to ask `X86TTIImpl::getMaskedMemoryOpCost)`
> instead of `X86TTIImpl::getMemoryOpCost()`. 
> 
> I think this can be improved, but i'm not sure if this is a blocker or not.
> Again, this used to work because without this patch,
> the cost of naive scalar gather is much higher than for the naive interleaved load.
I took a look, and it seems like we'd need to have yet another cost table,
and record the cost of  the following 35 shuffle sequences: https://godbolt.org/z/Mv8MYGxqP

Can be done, but unless we consider this to be a prerequisite here
(note that `TTI::enableMaskedInterleavedAccessVectorization()` says `false` for X86),
it's a bit on the edge of my interest, at least right now.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111460/new/

https://reviews.llvm.org/D111460