[PATCH] D111460: [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`

Wed Nov 3 02:37:29 PDT 2021

pengfei added inline comments.

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll:1450
-; ENABLED_MASKED_STRIDED-NEXT:    [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP6]], <8 x i8> [[TMP7]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> [[INTERLEAVED_VEC]], <16 x i8>* [[TMP10]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
 ; ENABLED_MASKED_STRIDED-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
----------------
lebedev.ri wrote:
> lebedev.ri wrote:
> > pengfei wrote:
> > > lebedev.ri wrote:
> > > > lebedev.ri wrote:
> > > > > pengfei wrote:
> > > > > > I think masked gather/scatter/load/store etc. is definitely better than branches on AVX512. I think we should consider the cost of branches.
> > > > > So, this is kind-of expected. Note, `ENABLED_MASKED_STRIDED` is check prefix for the run line
> > > > > with `-enable-masked-interleaved-mem-accesses`, which is an override
> > > > > for `TTI::enableMaskedInterleavedAccessVectorization`, where latter returns false for X86.
> > > > > This is really about vectorizing using masked memory ops when the memory ops are strided,
> > > > > so further interleaving/deinterleaving is required. So we don't really support this on X86.
> > > > > 
> > > > > The actual problem here is in `X86TTIImpl::getInterleavedMemoryOpCost()`:
> > > > > ```
> > > > >   if (UseMaskForCond || UseMaskForGaps)
> > > > >     return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
> > > > >                                              Alignment, AddressSpace, CostKind,
> > > > >                                              UseMaskForCond, UseMaskForGaps);
> > > > > ```
> > > > > So the moment we ask for the interleaving costs when there is masking involved,
> > > > > X86 cost model immediately fallbacks to the scalarized cost model,
> > > > > while the only thing it needs to to is to ask `X86TTIImpl::getMaskedMemoryOpCost)`
> > > > > instead of `X86TTIImpl::getMemoryOpCost()`. 
> > > > > 
> > > > > I think this can be improved, but i'm not sure if this is a blocker or not.
> > > > > Again, this used to work because without this patch,
> > > > > the cost of naive scalar gather is much higher than for the naive interleaved load.
> > > > I took a look, and it seems like we'd need to have yet another cost table,
> > > > and record the cost of  the following 35 shuffle sequences: https://godbolt.org/z/Mv8MYGxqP
> > > > 
> > > > Can be done, but unless we consider this to be a prerequisite here
> > > > (note that `TTI::enableMaskedInterleavedAccessVectorization()` says `false` for X86),
> > > > it's a bit on the edge of my interest, at least right now.
> > > > I think this can be improved, but i'm not sure if this is a blocker or not.
> > > 
> > > I think the ENABLED_MASKED_STRIDED checks are intentional for the masked interleave cases. See comments above the function. I would think this is a blocker. Add @dorit @Ayal who worked with the initial patches D53011 and D53559.
> > > 
> > > > it seems like we'd need to have yet another cost table
> > > 
> > > Can we give a coarse-gained calculation first? I think we shouldn't break the dependence of the existing code.
> > > 
> > > > https://godbolt.org/z/Mv8MYGxqP
> > > 
> > > Why don't use llvm.masked.load/store directly in the tests?
> > Attempted that in D112873, doesn't really help all that much.
> ping @pengfei - does D112873 look satisfactory to you to unblock this? :)
Sorry, D112873 doesn't reduce my concerns given:
1) I'm not fully understand why the change need to be done within `getInterleavedMemoryOpCost` and what's the relationship between it and the change in `getAddressComputationCost` here. I'd expect the calculation of branches happens in `getAddressComputationCost` where it may have the minimal side effect. Besides, the change in D112873 without a full test doesn't look good to me either.
2) Your words "The more i look at it, the more i'm surprised it ..." make me feel like the code is fragile and one more brick will crush the tower :)

I'm OK with leaving these tests as the last patch if you don't have interest with AVX512. You can add FIXME here and leave it to folks who have worked or may be interested it in future.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111460/new/

https://reviews.llvm.org/D111460